Per Boss R’s suggestion, I insert print statements for the variable name and default_data attributes in setup_one_field that will dump the values into separate logs and make the debugging process a bit easier. The logs show that all of the surface restart variables are defined as NaNs by setup_one_field. I’m still not sure where this NaN is coming from , so I pull up the debugger, step through register_restart_field, and catch something I missed before: the FV3GFS::restart_read routine makes the first call to register_restart_field, which passes the surface restart data structure to the fms_io interface routine register_restart_field_r2d8. This subroutine, and it’s 3-D counterpart, are located inside an #ifdefconditional statement that overloads the real kind register_restart_field routine when the OVERLOAD_R8 C pre-processor flag is included at compile time, which is the case for this experiment configuration.
I notice that the *r2d8 and *r3d8 routines are recent additions to fms_io and, after comparing them to the other register_restart_field routines, that is passed to setup_one_field instead of data_default
Furthermore, default_data_r4 is uninitialized, and passed to setup_one_fieldwhether it is set to data_default or not, which is the case here. Thus, setup_one_field assigns a NaN to the restart variable default_data attribute. Mystery solved.
The solution is simple. I add strict typing to the assignment of data_default to data_default_r4, and the following conditional statement in register_restart_field_r2d8 and *_r3d8 routines:
I hypothesized that the user’s code did not use the fms-io::write_data interface because the variable attribute default_data is uninitialized. Combing through the FV3GFS code base confirms my assumption; the FV3GFS uses the following fms-io routines in the FV3GFS-IO module:
default_data should be set in register_restart_field via the call to setup_one_field:
default_data = MPP_FILL_DOUBLE
Here, data_default is an optional argument; if it is not present, default data is set to the parameter MPP_FILL_DOUBLE, which is defined in mpp_parameter.F90 as 9.9692099683868690e+36. I’m still not sure where default_data is getting set to a NaN, so it’s time for another date with ddt.
I set a breakpoint in the first fv3gfs_io::sfc_prop_restart_write call to register_restart_field:
do num = 1,nvar2m
var2_p => sfc_var2(:,:,num)
if (trim(sfc_name2(num)) == 'sncovr') then
id_restart = register_restart_field(Sfc_restart, fn_srf, sfc_name2(num), var2_p, domain=fv_domain, mandatory=.false.)
id_restart = register_restart_field(Sfc_restart, fn_srf, sfc_name2(num), var2_p, domain=fv_domain)
Stepping through setup_one_field shows that default_data is set to, you guessed it, NaN. I’m wondering if I just have to initialize default_data to something before assigning MPP_FILL_DOUBLE, so I set it to 0.0, recompile, and open the new executable in the debugger. No dice.
After some more mucking around, I am unable to determine the exact cause of the NaN, but suspect that it may have to do with the fact that the sfc_restart data type is declared as a module variable in fv3gfs_io, rather than locally in each of the subroutines, which means that the NaN could have originated from another call to register_restart_field that passes sfc_restart as the fileObject argument.
Since I don’t want to mess with the FV3GFS code, I opt to implement a workaround in save_default_restart in case the user just wants a short-term fix so he can continue his experiment. I declare a local variable called save_restart, and add the following lines before the loop that calls mpp_write on the restart variables:
! set default data attribute to MPP_FILL_DOUBLE if it is undefined
if (isnan(cur_var%default_data)) then
default_data = MPP_FILL_DOUBLE
default_data = cur_var%default_data
I then replace default_data=cur_var%default_data with default_data=default_data in the mpp_write calls.
Unfortunately, this solution is unsatisfactory, since it corrects the error after the fact. I consult with Boss R, who suggests that the NaN may indicate memory corruption. He points out where to put some print statements that will write the default_data values for each restart variable in the output logs. I’ll elaborate on the (hopefully) final steps I take to identify the NaN source in the final part of this debugging trilogy.
After escaping help desk duty for almost 2 years, thanks to the list manager forgetting to add my name to the roster (darn), I received my first ticket regarding an error with our FMS-IO module. Now, I could write a novel on the problems with our I/O library, and how we are trying (again) to solve them, but the gist is that FMS-IO is:
users only use bits and pieces of the library where absolutely necessary for reasons 1 and 2
Okay, back to the ticket. The user tried run a debug-compiled (Intel v16) simulation with the FV3GFS atmosphere code, meaning that the compiler optimizations were turned off, and flags were set to trap, rather than ignore, weird things like NaNs and underflow. The simulation crashed in the call to the fms_io:save_restart subroutine with the following error:
forrtl: error (65): floating invalid
Since this is a debug run, I had a stack trace to work with that pointed me toward the offending line in fms_io:
fms_io_mod_mp_sav 3279 fms_io.F90
The user noted that the simulation completed if the code was compiled in “prod” or “repro”, workflow presets that turn on optimizations -O3 and -O2, respectively. Prod and repro also exclude the flag -ftrapuv, which is an Intel fortran option that “initializes stack local variables to an unusual value to aid error detection”. Normally, these local variables should be initialized in the application. It also unmasks the floating-point invalid exception .” In other words, prod- and repro-compiled executables will not necessarily stop if results yield a floating-point invalid value (i.e., a NaN), but debug-compiled runs will. As a general rule, our group runs tests with all optimizations to catch bugs. In contrast, most scientists run their simulations in prod only, and thus tend to catch issues later in the game when their runs crash, or they notice bad data in their QC and/or analysis.
So, what in line 3279 is causing the trouble? Well, a call to mpp_write, which resides in the mpp_io library, by save_default_restart:
Okay, this isn’t particularly helpful. The offending variable could be any one of the arguments passed to mpp_write, so it’s time to throw the code in the debugger. We have access to Allinea DDT, which is debugging tool that lets you step through and evaluate debug-compiled code. I grab an interactive session, set up the environment, and launch the executable with ddt. I set a breakpoint in save_default_restart at line 2379, and wait a couple of minutes for the run to pause. Here are the stack trace and local variables save_default _restart
I open up the cur_var structure (rightmost panel), which contains the data for the first variable to write to the restart file, and see that the attribute default_data has a NaN value. I press the ‘play’ button to continue the run, and it crashes with the same error message that showed up in the output log. All right, it seems like I’ve found the trigger: cur_var%default_data.
Now, where, exactly is default_data defined? This is where things get tricky. To start, I do a ctrl+F ‘default_data’ in fms_io, and stop at the first result, which shows that it is defined in the var_type derived type:
character(len=128) :: name = ''
character(len=128) :: longname = ''
character(len=128) :: units = ''
real, dimension(:,:,:,:), _ALLOCATABLE :: buffer _NULL
logical :: domain_present = .FALSE.
integer :: domain_idx = -1
logical :: is_dimvar = .FALSE.
logical :: read_only = .FALSE.
logical :: owns_data = .FALSE. ! if true, restart owns the data and will deallocate them when freed
type(fieldtype) :: field
type(axistype) :: axis
integer :: position
integer :: ndim
integer :: siz(5) ! X/Y/Z/T/A extent of fields (data domain
! size for distributed writes;global size for reads)
integer :: gsiz(4) ! global X/Y/Z/A extent of fields
integer :: id_axes(4) ! store index for x/y/z/a axistype.
logical :: initialized ! indicate if the field is read or not in routine save_state.
logical :: mandatory ! indicate if the field is mandatory to be when restart.
integer :: is, ie, js, je ! index of the data in compute domain
real :: default_data
character(len=8) :: compressed_axis !< If on a compressed axis, which axis
integer, dimension(:), allocatable :: pelist
integer :: ishift, jshift ! can be used to shift indices when no_domain=T integer :: x_halo, y_halo ! can be used to indicate halo size when no_domain=T
type(domainUG),pointer :: domain_ug => null() !
I need more information, so I head to one of the next occurrences of default_data, one of the routines in the fms_io::write_data interface:
data_default is passed as an optional argument (an integer in this particular subroutine) to write_data. By default (remember what I said about the I/O being confusing?), default_data is set to the parameter MPP_FILL_INT via the intrinsic TRANSFER function. A quick search yields the following description:
Interprets the bitwise representation of SOURCE in memory as if it is the representation of a variable or array of the same type and type parameters as MOLD. This is approximately equivalent to the C concept of casting one type to another
RESULT = TRANSFER(SOURCE, MOLD [, SIZE])
SOURCE: Shall be a scalar or an array of any type.
MOLD: Shall be a scalar or an array of any type.
SIZE: (Optional) shall be a scalar of type INTEGER.
Huh. Now THAT’s interesting. Essentially, the user is telling the compiler to apply the same bit pattern defined by the source function type (e.g., 0010101) which is MPP_FILL_INT in our example, to the MOLD type—default data, which is a real. This is weird–not type casting itself–but the way fortran tries to approximate it. Then again, fortran was designed to crunch numbers, not do all of the fancy type magic like C.
So what does TRANSFER do, exactly? Let’s read the result (I’ve put the scary parts in bold):
The result has the same type as MOLD, with the bit level representation of SOURCE. If SIZE is present, the result is a one-dimensional array of length SIZE. If SIZE is absent but MOLD is an array (of any size or shape), the result is a one- dimensional array of the minimum length needed to contain the entirety of the bitwise representation of SOURCE. If SIZE is absent and MOLD is a scalar, the result is a scalar.
If the bitwise representation of the result is longer than that of SOURCE, then the leading bits of the result correspond to those of SOURCE and any trailing bits are filled arbitrarily.
When the resulting bit representation does not correspond to a valid representation of a variable of the same type as MOLD, the results are undefined, and subsequent operations on the result
In theory, TRANSFER is a land mine for non-reproducible answers because the output is (a) compiler-dependent and (b) yields undefined results for invalid behavior–bad news for the folks who rely on bit-for-bit reproducibility to verify code changes that are not expected to affect calculations.
All right, now that I’ve gotten a bit side-tracked by one of the MANY reasons to refactor the fms_io code, it’s time to re-group. I’ve determined that default_data is initialized in a call to write_data which, I assume, the user’s code does not do. This is a good stopping point, so I’ll follow up with another post (or maybe two) about how fv3gfs-io module uses fms-io, and (hopefully) the resolution to this issue.