Per Boss R’s suggestion, I insert print statements for the variable name and default_data attributes in setup_one_field that will dump the values into separate logs and make the debugging process a bit easier. The logs show that all of the surface restart variables are defined as NaNs by setup_one_field. I’m still not sure where this NaN is coming from , so I pull up the debugger, step through register_restart_field, and catch something I missed before: the FV3GFS::restart_read routine makes the first call to register_restart_field, which passes the surface restart data structure to the fms_io interface routine register_restart_field_r2d8. This subroutine, and it’s 3-D counterpart, are located inside an #ifdefconditional statement that overloads the real kind register_restart_field routine when the OVERLOAD_R8 C pre-processor flag is included at compile time, which is the case for this experiment configuration.
I notice that the *r2d8 and *r3d8 routines are recent additions to fms_io and, after comparing them to the other register_restart_field routines, that is passed to setup_one_field instead of data_default
Furthermore, default_data_r4 is uninitialized, and passed to setup_one_fieldwhether it is set to data_default or not, which is the case here. Thus, setup_one_field assigns a NaN to the restart variable default_data attribute. Mystery solved.
The solution is simple. I add strict typing to the assignment of data_default to data_default_r4, and the following conditional statement in register_restart_field_r2d8 and *_r3d8 routines:
WordPress.com is excited to announce our newest offering: a course just for beginning bloggers where you’ll learn everything you need to know about blogging from the most trusted experts in the industry. We have helped millions of blogs get up and running, we know what works, and we want you to to know everything we know. This course provides all the fundamental skills and inspiration you need to get your blog started, an interactive community forum, and content updated annually.
I hypothesized that the user’s code did not use the fms-io::write_data interface because the variable attribute default_data is uninitialized. Combing through the FV3GFS code base confirms my assumption; the FV3GFS uses the following fms-io routines in the FV3GFS-IO module:
default_data should be set in register_restart_field via the call to setup_one_field:
default_data = MPP_FILL_DOUBLE
Here, data_default is an optional argument; if it is not present, default data is set to the parameter MPP_FILL_DOUBLE, which is defined in mpp_parameter.F90 as 9.9692099683868690e+36. I’m still not sure where default_data is getting set to a NaN, so it’s time for another date with ddt.
I set a breakpoint in the first fv3gfs_io::sfc_prop_restart_write call to register_restart_field:
do num = 1,nvar2m
var2_p => sfc_var2(:,:,num)
if (trim(sfc_name2(num)) == 'sncovr') then
id_restart = register_restart_field(Sfc_restart, fn_srf, sfc_name2(num), var2_p, domain=fv_domain, mandatory=.false.)
id_restart = register_restart_field(Sfc_restart, fn_srf, sfc_name2(num), var2_p, domain=fv_domain)
Stepping through setup_one_field shows that default_data is set to, you guessed it, NaN. I’m wondering if I just have to initialize default_data to something before assigning MPP_FILL_DOUBLE, so I set it to 0.0, recompile, and open the new executable in the debugger. No dice.
After some more mucking around, I am unable to determine the exact cause of the NaN, but suspect that it may have to do with the fact that the sfc_restart data type is declared as a module variable in fv3gfs_io, rather than locally in each of the subroutines, which means that the NaN could have originated from another call to register_restart_field that passes sfc_restart as the fileObject argument.
Since I don’t want to mess with the FV3GFS code, I opt to implement a workaround in save_default_restart in case the user just wants a short-term fix so he can continue his experiment. I declare a local variable called save_restart, and add the following lines before the loop that calls mpp_write on the restart variables:
! set default data attribute to MPP_FILL_DOUBLE if it is undefined
if (isnan(cur_var%default_data)) then
default_data = MPP_FILL_DOUBLE
default_data = cur_var%default_data
I then replace default_data=cur_var%default_data with default_data=default_data in the mpp_write calls.
Unfortunately, this solution is unsatisfactory, since it corrects the error after the fact. I consult with Boss R, who suggests that the NaN may indicate memory corruption. He points out where to put some print statements that will write the default_data values for each restart variable in the output logs. I’ll elaborate on the (hopefully) final steps I take to identify the NaN source in the final part of this debugging trilogy.
After escaping help desk duty for almost 2 years, thanks to the list manager forgetting to add my name to the roster (darn), I received my first ticket regarding an error with our FMS-IO module. Now, I could write a novel on the problems with our I/O library, and how we are trying (again) to solve them, but the gist is that FMS-IO is:
users only use bits and pieces of the library where absolutely necessary for reasons 1 and 2
Okay, back to the ticket. The user tried run a debug-compiled (Intel v16) simulation with the FV3GFS atmosphere code, meaning that the compiler optimizations were turned off, and flags were set to trap, rather than ignore, weird things like NaNs and underflow. The simulation crashed in the call to the fms_io:save_restart subroutine with the following error:
forrtl: error (65): floating invalid
Since this is a debug run, I had a stack trace to work with that pointed me toward the offending line in fms_io:
fms_io_mod_mp_sav 3279 fms_io.F90
The user noted that the simulation completed if the code was compiled in “prod” or “repro”, workflow presets that turn on optimizations -O3 and -O2, respectively. Prod and repro also exclude the flag -ftrapuv, which is an Intel fortran option that “initializes stack local variables to an unusual value to aid error detection”. Normally, these local variables should be initialized in the application. It also unmasks the floating-point invalid exception .” In other words, prod- and repro-compiled executables will not necessarily stop if results yield a floating-point invalid value (i.e., a NaN), but debug-compiled runs will. As a general rule, our group runs tests with all optimizations to catch bugs. In contrast, most scientists run their simulations in prod only, and thus tend to catch issues later in the game when their runs crash, or they notice bad data in their QC and/or analysis.
So, what in line 3279 is causing the trouble? Well, a call to mpp_write, which resides in the mpp_io library, by save_default_restart:
Okay, this isn’t particularly helpful. The offending variable could be any one of the arguments passed to mpp_write, so it’s time to throw the code in the debugger. We have access to Allinea DDT, which is debugging tool that lets you step through and evaluate debug-compiled code. I grab an interactive session, set up the environment, and launch the executable with ddt. I set a breakpoint in save_default_restart at line 2379, and wait a couple of minutes for the run to pause. Here are the stack trace and local variables save_default _restart
I open up the cur_var structure (rightmost panel), which contains the data for the first variable to write to the restart file, and see that the attribute default_data has a NaN value. I press the ‘play’ button to continue the run, and it crashes with the same error message that showed up in the output log. All right, it seems like I’ve found the trigger: cur_var%default_data.
Now, where, exactly is default_data defined? This is where things get tricky. To start, I do a ctrl+F ‘default_data’ in fms_io, and stop at the first result, which shows that it is defined in the var_type derived type:
character(len=128) :: name = ''
character(len=128) :: longname = ''
character(len=128) :: units = ''
real, dimension(:,:,:,:), _ALLOCATABLE :: buffer _NULL
logical :: domain_present = .FALSE.
integer :: domain_idx = -1
logical :: is_dimvar = .FALSE.
logical :: read_only = .FALSE.
logical :: owns_data = .FALSE. ! if true, restart owns the data and will deallocate them when freed
type(fieldtype) :: field
type(axistype) :: axis
integer :: position
integer :: ndim
integer :: siz(5) ! X/Y/Z/T/A extent of fields (data domain
! size for distributed writes;global size for reads)
integer :: gsiz(4) ! global X/Y/Z/A extent of fields
integer :: id_axes(4) ! store index for x/y/z/a axistype.
logical :: initialized ! indicate if the field is read or not in routine save_state.
logical :: mandatory ! indicate if the field is mandatory to be when restart.
integer :: is, ie, js, je ! index of the data in compute domain
real :: default_data
character(len=8) :: compressed_axis !< If on a compressed axis, which axis
integer, dimension(:), allocatable :: pelist
integer :: ishift, jshift ! can be used to shift indices when no_domain=T integer :: x_halo, y_halo ! can be used to indicate halo size when no_domain=T
type(domainUG),pointer :: domain_ug => null() !
I need more information, so I head to one of the next occurrences of default_data, one of the routines in the fms_io::write_data interface:
data_default is passed as an optional argument (an integer in this particular subroutine) to write_data. By default (remember what I said about the I/O being confusing?), default_data is set to the parameter MPP_FILL_INT via the intrinsic TRANSFER function. A quick search yields the following description:
Interprets the bitwise representation of SOURCE in memory as if it is the representation of a variable or array of the same type and type parameters as MOLD. This is approximately equivalent to the C concept of casting one type to another
RESULT = TRANSFER(SOURCE, MOLD [, SIZE])
SOURCE: Shall be a scalar or an array of any type.
MOLD: Shall be a scalar or an array of any type.
SIZE: (Optional) shall be a scalar of type INTEGER.
Huh. Now THAT’s interesting. Essentially, the user is telling the compiler to apply the same bit pattern defined by the source function type (e.g., 0010101) which is MPP_FILL_INT in our example, to the MOLD type—default data, which is a real. This is weird–not type casting itself–but the way fortran tries to approximate it. Then again, fortran was designed to crunch numbers, not do all of the fancy type magic like C.
So what does TRANSFER do, exactly? Let’s read the result (I’ve put the scary parts in bold):
The result has the same type as MOLD, with the bit level representation of SOURCE. If SIZE is present, the result is a one-dimensional array of length SIZE. If SIZE is absent but MOLD is an array (of any size or shape), the result is a one- dimensional array of the minimum length needed to contain the entirety of the bitwise representation of SOURCE. If SIZE is absent and MOLD is a scalar, the result is a scalar.
If the bitwise representation of the result is longer than that of SOURCE, then the leading bits of the result correspond to those of SOURCE and any trailing bits are filled arbitrarily.
When the resulting bit representation does not correspond to a valid representation of a variable of the same type as MOLD, the results are undefined, and subsequent operations on the result
In theory, TRANSFER is a land mine for non-reproducible answers because the output is (a) compiler-dependent and (b) yields undefined results for invalid behavior–bad news for the folks who rely on bit-for-bit reproducibility to verify code changes that are not expected to affect calculations.
All right, now that I’ve gotten a bit side-tracked by one of the MANY reasons to refactor the fms_io code, it’s time to re-group. I’ve determined that default_data is initialized in a call to write_data which, I assume, the user’s code does not do. This is a good stopping point, so I’ll follow up with another post (or maybe two) about how fv3gfs-io module uses fms-io, and (hopefully) the resolution to this issue.
While this may not be the case for younger scientists and students in my field who have grown up in a culture that emphasizes learning to code, climate scientists my age and older are largely self-taught programmers. They, and I (a former scientist) learned Fortran, C, or some other compiled language to write, or run, weather/climate models for research. Most of them use scripted languages, such as Matlab, NCAR’s NCL, R, or Python, for performing statistical analysis on observational data and model output. Unlike computer scientists, who (hopefully) draw up design plans, optimize algorithms, and review code in their development processes, scientist-programmers write code primarily with an end in mind. That is, they write and modify code on the fly depending what they need to accomplish during the course of a specific project. Rarely is scientific code version-controlled, nor is it consistently tested for reproducibility. Comments are often an afterthought, and provide little information for users (usually grad students and postdocs) tasked with running programs written by someone else
I was lucky enough to learned Fortran 77 (yes, Fortran. No, I am not 60 years old) in a classroom setting as an undergraduate meteorology student, which helped instill good coding practices like comments, writing readable code, and, above all, not using GOTO statements. My final project, which I wish I had saved, was some of the best code I have ever written. Unfortunately, I did not save it, but I recall that I wrote a program to read in monthly time series of the NAO and either temperature or precipitation data from ascii files, calculate the correlation coefficients for each month (I think), and write formatted output to a text file. I did all of this via a terminal (c shell) and the Gedit text editor on old Sun systems in the lab, or in my dorm room on my now ancient laptop (running Windows XP), remotely accessing the lab workstations with Portable puTTY.
Some people are planners, and try to map out their programs before coding them up like a proper computer scientist. I, on the other hand, coded like a scientist from the get-go, preferring to write a program, compile it, wait for the crash, and debug-lather, rinse, repeat. There are, advantages to both approaches, with the former being a better option for long-term projects, and the latter working better when you need to solve a problem NOW. Unfortunately, the nature of graduate research (and, research in general) promotes bad coding habits, as scientists are rewarded for output in the form of publications, grants, and attention-grabbing news blurbs that bring in the $$$$. It doesn’t matter (at least in the short term) if your code is sloppy, or only runs on a specific machine with a specific compiler when the moon is waxing, so long as you complete your thesis/paper before your project funding dries up. In the end, though, buggy code will come back to bite you, whether it involves having to redo your analysis (if you’re lucky like me), or having to retract your work (e.g.,  and  ).
My coding sins included copy-pasting code blocks, re-writing the same program create new figures rather than learning to pass in arguments for things like titles, labels, and file names, failing to name programs clearly, and having to open several to determine if they were the ones I needed. These are all fairly benign problems; I mean, I completed my work satisfactorily, and was (usually) able to back up and fix problems when they arose. Grad school is just as much about learning to recover from falls as it is a way to level up your career. Still, I wasted a lot of time redoing work, and engaged in way to much of what I call “reactive analysis”–the rushed, frantic, and error-prone analysis performed to meet a deadline. Reactive analysis generates figures quickly, but does not provide sufficient time to interpret them, meaning that you’ll have to redo them later when your advisor points out that your error bars make no sense, or the units are incorrect. Reactive analysis provides soundbites for your grant application, and plenty of stupid typos you probably would have caught if you’d had time to proofread the document one more time before you submitted it.
Now, less-than-stellar code is not the only cause of reactive analysis, but is one that the programmer can control to a degree, unlike shifting deadlines, personal crises, or system outages. Thankfully, I have managed to quell some of my bad habits at my current job out of necessity. No, the scripts I write aren’t paragons of Python, Fortran, or Bash, but I do take the time to comment and improve the code I use frequently, or plan on sharing with others.