After escaping help desk duty for almost 2 years, thanks to the list manager forgetting to add my name to the roster (darn), I received my first ticket regarding an error with our FMS-IO module. Now, I could write a novel on the problems with our I/O library, and how we are trying (again) to solve them, but the gist is that FMS-IO is:
- users only use bits and pieces of the library where absolutely necessary for reasons 1 and 2
Okay, back to the ticket. The user tried run a debug-compiled (Intel v16) simulation with the FV3GFS atmosphere code, meaning that the compiler optimizations were turned off, and flags were set to trap, rather than ignore, weird things like NaNs and underflow. The simulation crashed in the call to the fms_io:save_restart subroutine with the following error:
forrtl: error (65): floating invalid
Since this is a debug run, I had a stack trace to work with that pointed me toward the offending line in fms_io:
fms_io_mod_mp_sav 3279 fms_io.F90
The user noted that the simulation completed if the code was compiled in “prod” or “repro”, workflow presets that turn on optimizations -O3 and -O2, respectively. Prod and repro also exclude the flag -ftrapuv, which is an Intel fortran option that “initializes stack local variables to an unusual value to aid error detection”. Normally, these local variables should be initialized in the application. It also unmasks the floating-point invalid exception .” In other words, prod- and repro-compiled executables will not necessarily stop if results yield a floating-point invalid value (i.e., a NaN), but debug-compiled runs will. As a general rule, our group runs tests with all optimizations to catch bugs. In contrast, most scientists run their simulations in prod only, and thus tend to catch issues later in the game when their runs crash, or they notice bad data in their QC and/or analysis.
So, what in line 3279 is causing the trouble? Well, a call to mpp_write, which resides in the mpp_io library, by save_default_restart:
subroutine save_default_restart(fileObj,restartpath) type(restart_file_type), intent(inout) :: fileObj character(len=336) :: restartpath ! The restart file path (dir/file). ... call mpp_write(unit, cur_var%field, array_domain(cur_var%domain_idx), fileObj%p2dr8(k,j)%p, tlev_r8, & default_data=real(cur_var%default_data,kind=DOUBLE_KIND))
Okay, this isn’t particularly helpful. The offending variable could be any one of the arguments passed to mpp_write, so it’s time to throw the code in the debugger. We have access to Allinea DDT, which is debugging tool that lets you step through and evaluate debug-compiled code. I grab an interactive session, set up the environment, and launch the executable with ddt. I set a breakpoint in save_default_restart at line 2379, and wait a couple of minutes for the run to pause. Here are the stack trace and local variables save_default _restart
I open up the cur_var structure (rightmost panel), which contains the data for the first variable to write to the restart file, and see that the attribute default_data has a NaN value. I press the ‘play’ button to continue the run, and it crashes with the same error message that showed up in the output log. All right, it seems like I’ve found the trigger: cur_var%default_data.
Now, where, exactly is default_data defined? This is where things get tricky. To start, I do a ctrl+F ‘default_data’ in fms_io, and stop at the first result, which shows that it is defined in the var_type derived type:
type var_type private character(len=128) :: name = '' character(len=128) :: longname = '' character(len=128) :: units = '' real, dimension(:,:,:,:), _ALLOCATABLE :: buffer _NULL logical :: domain_present = .FALSE. integer :: domain_idx = -1 logical :: is_dimvar = .FALSE. logical :: read_only = .FALSE. logical :: owns_data = .FALSE. ! if true, restart owns the data and will deallocate them when freed type(fieldtype) :: field type(axistype) :: axis integer :: position integer :: ndim integer :: siz(5) ! X/Y/Z/T/A extent of fields (data domain ! size for distributed writes;global size for reads) integer :: gsiz(4) ! global X/Y/Z/A extent of fields integer :: id_axes(4) ! store index for x/y/z/a axistype. logical :: initialized ! indicate if the field is read or not in routine save_state. logical :: mandatory ! indicate if the field is mandatory to be when restart. integer :: is, ie, js, je ! index of the data in compute domain real :: default_data character(len=8) :: compressed_axis !< If on a compressed axis, which axis integer, dimension(:), allocatable :: pelist integer :: ishift, jshift ! can be used to shift indices when no_domain=T integer :: x_halo, y_halo ! can be used to indicate halo size when no_domain=T !---------- !ug support type(domainUG),pointer :: domain_ug => null() !
I need more information, so I head to one of the next occurrences of default_data, one of the routines in the fms_io::write_data interface:
subroutine write_data_i3d_new(filename, fieldname, data, domain, & no_domain, position, tile_count, data_default) character(len=*), intent(in) :: filename, fieldname integer, dimension(:,:,:), intent(in) :: data type(domain2d), intent(in), optional :: domain logical, intent(in), optional :: no_domain integer, intent(in), optional :: position, tile_count, data_default real :: default_data default_data = TRANSFER(MPP_FILL_INT,default_data) if(present(data_default)) default_data = real(data_default) call write_data_3d_new(filename, fieldname, real(data), domain, & no_domain, .false., position, tile_count, data_default=default_data) end subroutine write_data_i3d_new
data_default is passed as an optional argument (an integer in this particular subroutine) to write_data. By default (remember what I said about the I/O being confusing?), default_data is set to the parameter MPP_FILL_INT via the intrinsic TRANSFER function. A quick search yields the following description:
Interprets the bitwise representation of SOURCE in memory as if it is the representation of a variable or array of the same type and type parameters as MOLD. This is approximately equivalent to the C concept of casting one type to another
RESULT = TRANSFER(SOURCE, MOLD [, SIZE])
SOURCE: Shall be a scalar or an array of any type.
MOLD: Shall be a scalar or an array of any type.
SIZE: (Optional) shall be a scalar of type INTEGER.
Huh. Now THAT’s interesting. Essentially, the user is telling the compiler to apply the same bit pattern defined by the source function type (e.g., 0010101) which is MPP_FILL_INT in our example, to the MOLD type—default data, which is a real. This is weird–not type casting itself–but the way fortran tries to approximate it. Then again, fortran was designed to crunch numbers, not do all of the fancy type magic like C.
So what does TRANSFER do, exactly? Let’s read the result (I’ve put the scary parts in bold):
The result has the same type as MOLD, with the bit level representation of SOURCE. If SIZE is present, the result is a one-dimensional array of length SIZE. If SIZE is absent but MOLD is an array (of any size or shape), the result is a one- dimensional array of the minimum length needed to contain the entirety of the bitwise representation of SOURCE. If SIZE is absent and MOLD is a scalar, the result is a scalar.
If the bitwise representation of the result is longer than that of SOURCE, then the leading bits of the result correspond to those of SOURCE and any trailing bits are filled arbitrarily.
When the resulting bit representation does not correspond to a valid representation of a variable of the same type as MOLD, the results are undefined, and subsequent operations on the result
In theory, TRANSFER is a land mine for non-reproducible answers because the output is (a) compiler-dependent and (b) yields undefined results for invalid behavior–bad news for the folks who rely on bit-for-bit reproducibility to verify code changes that are not expected to affect calculations.
All right, now that I’ve gotten a bit side-tracked by one of the MANY reasons to refactor the fms_io code, it’s time to re-group. I’ve determined that default_data is initialized in a call to write_data which, I assume, the user’s code does not do. This is a good stopping point, so I’ll follow up with another post (or maybe two) about how fv3gfs-io module uses fms-io, and (hopefully) the resolution to this issue.