Data Coding Guidelines


Guidelines for Integrating Data Products and Algorithms to ARM Data Libraries

The Atmospheric Radiation Measurement (ARM) user facility manages data produced from 350+ instruments and ~1500 datastreams across the various ARM sites and produces more than 70 value-added products (VAPs) for the research community. As algorithms become more complex and require integrating multiple datastreams, ARM is relying increasingly on algorithms developed by principal investigators (PIs). These coding guidelines were developed to streamline the integration of PI-developed algorithms with ARM processes and libraries. PIs, instrument mentors, and their students should consult them when writing and developing data products and code, to simplify software integration into existing ARM libraries.

Following these guidelines benefits not only ARM but also PIs, mentors, and their associates: it improves coding practices and expedites development and integration, which in turn increases the impact of their algorithms and data products.

The ARM development team at the ARM Data Center has created a tool called the ARM Data Integrator (ADI; Gaustad et al., 2014), which helps ARM developers integrate and transform multiple diverse datastreams for input into VAPs, and provides a mechanism for generating output that meets ARM standards for data file formats and metadata. Implementing these coding guidelines will allow PIs to produce modular code for easy integration into ADI.

The advantages of ADI are:

  • Automatically applies ARM standards, enabling easy search and discovery of products and on-the-fly data integration
  • Documents dependencies, metrics, status, and logs
  • Automates reprocessing
  • Captures provenance.

The following recommendations and coding practices will enable seamless integration into ADI. Limiting programming languages and establishing common coding practices and techniques also facilitates software maintainability and reduces the time to transition to operational products.


Supported Languages

The following languages are highly recommended because they have a native ADI interface; the time to integrate code written in them into ADI is minimal.

  • C (gcc 4.4.7+)
  • Python 2.7+
  • IDL 8.2+.

The ADI libraries are written natively in C, so it is possible to support code in languages that are interoperable with C, and specifically with the gcc compiler. Thus, Fortran and C++ routines can be hooked into ADI, at the cost of additional development effort to write C language “wrapper” routines to interface with ADI and call the routines appropriately.

To simplify interoperability issues with gcc, Fortran code should compile with gfortran, and C++ code should be compiled with g++ (both of which are based on gcc), or additional development effort may be required.

It is possible that other languages may be made to work, but generally at significant development costs and delays. If data merging and transformation were performed in a modular way, integration with ADI would be more straightforward. One additional option may be to install a standalone executable in the ARM Data Center and have the ADI wrapper routine spawn out to run it, but that method involves substantial development to handle the I/O of the standalone executable and to do proper logging and error-checking of its runs.
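As a rough illustration of the standalone-executable option, the sketch below (Python; the wrapper function name, executable path, and timeout are hypothetical placeholders, not part of ADI) shows the kind of spawning, logging, and exit-code checking such a wrapper must handle:

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("adi_wrapper")

def run_standalone(exe_path, args, timeout_s=3600):
    """Spawn a standalone executable, capture its output, and log failures.

    Returns True on success (exit code 0), False otherwise.
    """
    cmd = [exe_path] + list(args)
    logger.info("Running: %s", " ".join(cmd))
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        logger.error("Executable timed out after %d s", timeout_s)
        return False
    if result.returncode != 0:
        logger.error("Executable failed (exit %d): %s",
                     result.returncode, result.stderr.strip())
        return False
    logger.info("Executable succeeded")
    return True
```

A real wrapper would also need to stage input files for the executable and parse its output back into ADI data structures, which is where most of the development cost lies.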

At this time MATLAB is not supported operationally due to the high cost of support and maintenance.

Organization of Code

ADI automates repetitive data preparation and production tasks, thereby decreasing the time and cost to implement and support an algorithm. To integrate easily with ADI, the main elements of the logic flow should be coded as separate functions. The goal is to separate the science from supporting tasks such as reading data, preparing data, performing quality checks, and writing and storing data. Moving existing code to ADI mostly involves stripping out everything that ADI already does and hooking the remaining parts into the ADI framework. It is therefore vital that the code be modular and that routines performing the following functions be isolated from each other as much as possible:

  1. Initialize variables
  2. Read input variables
  3. Perform data quality checks on input variables
  4. Transform inputs to common grid
  5. Implement scientific algorithm/calculations
  6. Perform quality checks on output variables
  7. Create output data sets
  8. Store output.

ADI is designed to do all of the above steps internally except for step 5, the actual science or algorithm. Therefore, it is critical when developing a standalone code that you break out these tasks into modular functions or routines, at which point moving to ADI becomes a matter of lifting the science module and putting it into an ADI framework that performs the other steps. The more the science is integrated with the setup and I/O, the harder it is to extract the parts we need to port to ADI.
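The modular layout described above might look like the following sketch (Python, using synthetic data and a placeholder science step; the function names are illustrative, not part of ADI). Only the science routine would be lifted into ADI; the framework replaces the rest:

```python
import numpy as np

def read_input(n_samples=600):
    """Step 2: read input variables (simulated here with synthetic data)."""
    time = np.arange(n_samples, dtype=float)   # seconds
    temperature = 20.0 + 0.01 * time           # degC
    return time, temperature

def qc_input(temperature, valid_min=-40.0, valid_max=50.0):
    """Step 3: flag input values outside the valid range."""
    return (temperature < valid_min) | (temperature > valid_max)

def science(temperature, bad):
    """Step 5: the science core -- the only routine ported as-is to ADI.
    Placeholder calculation: convert degC to K, skipping flagged samples."""
    return np.where(bad, np.nan, temperature + 273.15)

def qc_output(result):
    """Step 6: flag non-finite results for the output QC variable."""
    return ~np.isfinite(result)

def main():
    """Steps 1-8 in order; in ADI, everything except science() disappears."""
    time, temperature = read_input()
    bad_in = qc_input(temperature)
    result = science(temperature, bad_in)
    bad_out = qc_output(result)
    # Steps 7-8 (create and store output) would write a NetCDF file here.
    return result, bad_out
```

Because science() touches no file I/O and takes plain arrays as arguments, it can be dropped into an ADI process hook with minimal change.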

It is important to read and check input data early and abort execution if serious issues are encountered. For instance, check that the number of valid samples found makes sense and is expected.
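For example, a sample-count sanity check might be sketched as follows (Python; the -9999 missing value and 10% tolerance are illustrative assumptions):

```python
import numpy as np

def check_sample_count(values, expected, tolerance=0.1, missing=-9999.0):
    """Raise early if the count of valid (non-missing) samples is implausible,
    so the run aborts before the science code sees bad input."""
    n_valid = int(np.sum(values != missing))
    if n_valid < expected * (1.0 - tolerance):
        raise ValueError(
            f"only {n_valid} valid samples found; expected ~{expected}")
    return n_valid
```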

Before applying an algorithm, input data may need to be consolidated or "transformed" onto a new coordinate grid. For instance, the input data may be at a time resolution of 10 Hz, but the output data is at 1-minute resolution. If input data requires transformation, explicitly indicate the following:

  • Are data being averaged? If yes, specify the input bin width (e.g., 1-minute, 0.15 km, or 10 um) and bin location (e.g., start, middle, or end of bin). Also specify the output averaging interval and reporting frequency. For example, you may want to produce hourly averages of temperature reported every 15 minutes. The example is in terms of time, but the same applies to any other dimension, such as vertical resolution (e.g., vertically pointing radar range gates) or bin-resolved data (e.g., aerosol size distribution).
  • Are data being interpolated? If yes, specify the type of interpolation (e.g., linear, nearest neighbor, etc.). As above, also specify the output interval in terms of time or any other applicable dimension.
  • Identify the source(s) for all input (e.g., datastream, datastream and variable, or algorithm).
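As a minimal sketch of the averaging case (Python/NumPy; here the bin time marks the *start* of each bin, and the 1-minute width matches the 10 Hz to 1-minute example above):

```python
import numpy as np

def bin_average(time_s, values, width_s=60.0):
    """Average samples into fixed-width time bins.

    time_s: sample times in seconds; values: same length as time_s.
    Returns (bin_start_times, bin_means); empty bins are NaN.
    """
    edges = np.arange(time_s.min(), time_s.max() + width_s, width_s)
    idx = np.digitize(time_s, edges) - 1      # bin index for each sample
    out_time = edges[:-1]                      # bin start times
    out_vals = np.full(out_time.size, np.nan)
    for i in range(out_time.size):
        in_bin = values[idx == i]
        if in_bin.size:
            out_vals[i] = in_bin.mean()
    return out_time, out_vals
```

Documenting these choices explicitly (bin width, bin location, empty-bin handling) is exactly the information ADI needs to reproduce the transformation in its own framework.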

Development of Code

Please adhere to the following guidelines during algorithm development:

  • Keep the functions relatively short. One or two printed pages of code is a good rule.
  • Include comments and documentation throughout the code.
  • Declare, initialize, and provide units for all variables and constants.
  • Avoid the use of single character variable names; provide meaningful variable names that represent the values.
  • Check input for validity and make appropriate logic choices if invalid inputs are detected.
  • Check computed values to be sure results are sensible.
  • Explicitly check for NaN (not a number) and Inf (infinite) results, and, if detected, set to missing (-9999) in final output.
  • If a value can be missing, include a companion quality control (QC) variable to indicate why the value is missing (e.g., no data, value less than valid minimum, value greater than valid maximum).
  • Include references in the code where appropriate.
  • Be sure to test the algorithm for multiple dates, and, where appropriate, at different sites during different seasons.
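For example, the NaN/Inf check and the companion QC variable above can be combined into a single output-finalization step (Python/NumPy; the bit-flag meanings and valid range below are illustrative, not ARM's official QC conventions):

```python
import numpy as np

MISSING = -9999.0

def finalize_output(values, valid_min=-100.0, valid_max=100.0):
    """Replace NaN/Inf with the missing value and build a companion QC
    variable. Bit meanings (illustrative): 1 = NaN/Inf set to missing,
    2 = below valid minimum, 4 = above valid maximum."""
    qc = np.zeros(values.shape, dtype=np.int32)
    bad = ~np.isfinite(values)
    qc[bad] |= 1
    with np.errstate(invalid="ignore"):
        qc[~bad & (values < valid_min)] |= 2
        qc[~bad & (values > valid_max)] |= 4
    out = np.where(bad, MISSING, values)
    return out, qc
```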

The following are suggested coding practices to facilitate a smooth integration with ADI:

  • Modularize code that is used multiple times.
  • Try to limit coupling between functions (i.e., functions should be fairly independent).
  • Use a consistent code style (e.g., indentation, tabs versus spaces, and bracket location).

It is recommended, when possible, that output variable names and units adhere to ARM Data Standards.


Gaustad, KL, TR Shippert, BD Ermold, SJ Beus, A Borsholm, and KM Fox. 2014. "A scientific data processing framework for time series NetCDF data." Environmental Modelling & Software 60, doi:10.1016/j.envsoft.2014.06.005.