Tales From the Help Desk: The NaN Trap Part I

After escaping help desk duty for almost 2 years, thanks to the list manager forgetting to add my name to the roster (darn), I received my first ticket regarding an error with our FMS-IO module. Now, I could write a novel on the problems with our I/O library, and how we are trying (again) to solve them, but the gist is that FMS-IO is:

  1. complicated
  2. error-prone
  3. users only use bits and pieces of the library where absolutely necessary for reasons 1 and 2

Okay, back to the ticket. The user tried run a debug-compiled (Intel v16) simulation with the FV3GFS atmosphere code, meaning that the compiler optimizations were turned off, and flags were set to trap, rather than ignore, weird things like NaNs and underflow. The simulation crashed in the call to the fms_io:save_restart subroutine with the following error:

forrtl: error (65): floating invalid

Since this is a debug run, I had a stack trace to work with that pointed me toward the offending line in fms_io:

fms_io_mod_mp_sav  3279  fms_io.F90

The user noted that the simulation completed if the code was compiled in “prod” or “repro”, workflow presets that turn on optimizations -O3 and -O2, respectively. Prod and repro also exclude the flag -ftrapuv, which is an Intel fortran option that “initializes stack local variables to an unusual value to aid error detection”. Normally, these local variables should be initialized in the application. It also unmasks the floating-point invalid exception .” In other words, prod- and repro-compiled executables will not necessarily stop if results yield a floating-point invalid value (i.e., a NaN), but debug-compiled runs will. As a general rule, our group runs tests with all optimizations to catch bugs. In contrast, most scientists run their simulations in prod only, and thus tend to catch issues later in the game when their runs crash, or they notice bad data in their QC and/or analysis.

So, what in line 3279 is causing the trouble? Well, a call to mpp_write, which resides in the mpp_io library, by save_default_restart:

subroutine save_default_restart(fileObj,restartpath)
type(restart_file_type), intent(inout) :: fileObj
character(len=336) :: restartpath ! The restart file path (dir/file).
                           ...
call mpp_write(unit, cur_var%field, array_domain(cur_var%domain_idx), fileObj%p2dr8(k,j)%p, tlev_r8, &
default_data=real(cur_var%default_data,kind=DOUBLE_KIND))

Okay, this isn’t particularly helpful. The offending variable could be any one of the arguments passed to mpp_write, so it’s time to throw the code in the debugger. We have access to Allinea DDT, which is debugging tool that lets you step through and evaluate debug-compiled code. I grab an interactive session, set up the environment, and launch the executable with ddt. I set a breakpoint in save_default_restart at line 2379, and wait a couple of minutes for the run to pause. Here are the stack trace and local variables save_default _restart

I open up the cur_var structure (rightmost panel), which contains the data for the first variable to write to the restart file, and see that the attribute default_data has a NaN value. I press the ‘play’ button to continue the run, and it crashes with the same error message that showed up in the output log. All right, it seems like I’ve found the trigger: cur_var%default_data.

Now, where, exactly is default_data defined? This is where things get tricky. To start, I  do a ctrl+F ‘default_data’ in fms_io, and stop at the first result, which shows that it is defined in the var_type derived type:

type var_type
   private
   character(len=128)                     :: name = ''
   character(len=128)                     :: longname = ''
   character(len=128)                     :: units = ''
   real, dimension(:,:,:,:), _ALLOCATABLE :: buffer _NULL
   logical                                :: domain_present = .FALSE.
   integer                                :: domain_idx = -1
   logical                                :: is_dimvar = .FALSE.
   logical                                :: read_only = .FALSE.
   logical                                :: owns_data = .FALSE. ! if true, restart owns the data and will deallocate them when freed
   type(fieldtype)                        :: field
   type(axistype)                         :: axis
   integer                                :: position
   integer                                :: ndim
   integer                                :: siz(5) ! X/Y/Z/T/A extent of fields (data domain
                                                         ! size for distributed writes;global size for reads)
   integer                                :: gsiz(4)     ! global X/Y/Z/A extent of fields
   integer                                :: id_axes(4)  ! store index for x/y/z/a axistype.
   logical                                :: initialized ! indicate if the field is read or not in routine save_state.
   logical                                :: mandatory   ! indicate if the field is mandatory to be when restart.
   integer                                :: is, ie, js, je  ! index of the data in compute domain
   real                                   :: default_data
   character(len=8)                       :: compressed_axis !< If on a compressed axis, which axis
   integer, dimension(:), allocatable     :: pelist
   integer                                :: ishift, jshift ! can be used to shift indices when no_domain=T  integer                                :: x_halo, y_halo ! can be used to indicate halo size when no_domain=T
!----------
!ug support
    type(domainUG),pointer            :: domain_ug => null()   !

I need more information, so I head to one of the next occurrences of default_data, one of the routines in the fms_io::write_data interface:

subroutine write_data_i3d_new(filename, fieldname, data, domain, &                 no_domain, position, tile_count, data_default)

  character(len=*), intent(in) :: filename, fieldname
  integer, dimension(:,:,:), intent(in) :: data
  type(domain2d), intent(in), optional :: domain
  logical, intent(in), optional :: no_domain
  integer, intent(in), optional :: position, tile_count, data_default
  real :: default_data

  default_data = TRANSFER(MPP_FILL_INT,default_data)
  if(present(data_default)) default_data = real(data_default)

  call write_data_3d_new(filename, fieldname, real(data), domain,  &
                         no_domain, .false., position, tile_count, data_default=default_data)
end subroutine write_data_i3d_new

data_default is passed as an optional argument (an integer in this particular subroutine) to write_data. By default (remember what I said about the I/O being confusing?), default_data is set to the parameter MPP_FILL_INT via the intrinsic TRANSFER function. A quick search yields the following description:

Interprets the bitwise representation of SOURCE in memory as if it is the representation of a variable or array of the same type and type parameters as MOLD. This is approximately equivalent to the C concept of casting one type to another

RESULT = TRANSFER(SOURCE, MOLD [, SIZE])

Arguments:

SOURCE: Shall be a scalar or an array of any type.

MOLD: Shall be a scalar or an array of any type.

SIZE: (Optional) shall be a scalar of type INTEGER.

Huh. Now THAT’s interesting. Essentially, the user is telling the compiler to apply the same bit pattern defined by the source function type (e.g., 0010101) which is MPP_FILL_INT in our example, to the MOLD type—default data, which is a real. This is weird–not type casting itself–but the way fortran tries to approximate it. Then again, fortran was designed to crunch numbers, not do all of the fancy type magic like C.

So what does TRANSFER do, exactly? Let’s read the result (I’ve put the scary parts in bold):

The result has the same type as MOLD, with the bit level representation of SOURCE. If SIZE is present, the result is a one-dimensional array of length SIZE. If SIZE is absent but MOLD is an array (of any size or shape), the result is a one- dimensional array of the minimum length needed to contain the entirety of the bitwise representation of SOURCE. If SIZE is absent and MOLD is a scalar, the result is a scalar.

If the bitwise representation of the result is longer than that of SOURCE, then the leading bits of the result correspond to those of SOURCE and any trailing bits are filled arbitrarily.

When the resulting bit representation does not correspond to a valid representation of a variable of the same type as MOLD, the results are undefined, and subsequent operations on the result

In theory,  TRANSFER is a land mine for non-reproducible answers because the output is (a) compiler-dependent and (b) yields undefined results for invalid behavior–bad news for the folks who rely on bit-for-bit reproducibility to verify code changes that are not expected to affect calculations.

All right, now that I’ve gotten a bit side-tracked by one of the MANY reasons to refactor the fms_io code, it’s time to re-group. I’ve determined that default_data is initialized in a call to write_data which, I assume, the user’s code does not do. This is a good stopping point, so I’ll follow up with another post (or maybe two) about how fv3gfs-io module uses fms-io, and (hopefully) the resolution to this issue.

2 thoughts on “Tales From the Help Desk: The NaN Trap Part I

  1. Pingback: Tales From the Help Desk: The NaN Trap Part II | Legacy Code Keeper

  2. Pingback: Tales from the Help Desk: The NaN Trap Part III | Legacy Code Keeper

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s