From 221247406add9d517431163c712e1b017a3b5cd9 Mon Sep 17 00:00:00 2001
From: Veronica Martinez <39746325+vmartinez-cu@users.noreply.github.com>
Date: Thu, 29 Aug 2024 11:53:48 -0600
Subject: [PATCH 1/2] Add new guide on netCDF file format (#24)

* New guide on netCDF file format. Additional content needed

* Add hyperlinks

* Minor edits to provide some clarity. Still needs an introduction for each section

* Add guide to read the docs. Incorporate feedback from PR
---
 .../data_management/file_formats/index.rst    |   8 ++
 .../data_management/file_formats/netcdf.md    | 112 ++++++++++++++++++
 docs/source/_static/data_management/index.rst |   8 ++
 docs/source/index.rst                         |   1 +
 4 files changed, 129 insertions(+)
 create mode 100644 docs/source/_static/data_management/file_formats/index.rst
 create mode 100644 docs/source/_static/data_management/file_formats/netcdf.md
 create mode 100644 docs/source/_static/data_management/index.rst

diff --git a/docs/source/_static/data_management/file_formats/index.rst b/docs/source/_static/data_management/file_formats/index.rst
new file mode 100644
index 0000000..bde309f
--- /dev/null
+++ b/docs/source/_static/data_management/file_formats/index.rst
@@ -0,0 +1,8 @@
+File Formats
+============
+
+
+.. toctree::
+    :maxdepth: 1
+
+    netcdf.md
\ No newline at end of file
diff --git a/docs/source/_static/data_management/file_formats/netcdf.md b/docs/source/_static/data_management/file_formats/netcdf.md
new file mode 100644
index 0000000..23a6165
--- /dev/null
+++ b/docs/source/_static/data_management/file_formats/netcdf.md
@@ -0,0 +1,112 @@
+# NetCDF
+>**Warning**
+> This guide needs additional information
+
+NetCDF (Network Common Data Form) is a file format that stores scientific data in arrays. Array values may be accessed
+directly, without knowing how the data are stored, and metadata information may be stored with the data.
+
+* Binary file format commonly used for scientific data
+* Self-describing, includes metadata
+* Multi-dimensional array data model
+
+The [netCDF data model](https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html) consists of the following:
+* variable
+  * Multi-dimensional array
+  * Column-oriented: each variable as a separate entity
+* dimension
+  * Usually temporal, spatial, spectral, ...
+  * Can be of unlimited length. At most one unlimited dimension is recommended, for a growing time dimension
+* attribute
+  * Metadata: global and variable level
+* group
+  * Akin to directories
+  * Avoid unless you really need the complex structure
+
+
+## Why use NetCDF
+NetCDF is a file format commonly used at LASP as it is the "highly preferred" format for NASA Earth Observing System
+Data and Information System data products, per their Data Product Development Guide for Data Producers.
+This affects all NASA Earth Science missions.
+
+NetCDF features:
+* Self-describing
+  * structure captures coordinate system (functional relationship)
+  * includes metadata
+* Efficient storage
+  * packing
+  * compression
+* Efficient access
+  * chunking
+  * HTTP byte range
+  * parallel IO
+* Open specification (unlike IDL save files)
+
+## Options available
+There are two netCDF data models:
+* NetCDF-3 classic
+* NetCDF-4 built on HDF5
+  * recommended but prefer classic constructs
+
+## How to use this data format
+
+#### NetCDF Files
+* Binary format with open specification
+* Requires software libraries to read and write: C, Fortran, Java, Python, IDL, ...
+* Internal compression; don't bother compressing NetCDF files externally
+* HTTP byte range requests
+* Parallel IO
+* `.nc` file extension
+* Don't be afraid of big files
+
+#### Coordinate System
+* Dimensions should be used to define a coordinate system
+  * e.g. temporal, spatial, spectral
+  * Avoid using dimensions to group data
+  * Think "functional relationship". Each independent variable should represent a dimension.
+* coordinate variable
+  * 1D variable with dimension of the same name
+  * strictly monotonic (ordered)
+  * no missing values
+  * Independent variable of functional relationship
+  * Every dimension should have one
+* shared dimensions
+  * Each variable should reuse dimensions to indicate that they share the same coordinates (domain set)
+
+#### Time as Coordinate Variable
+* If the data are a function of a single time dimension then there should be a single time variable
+  * avoid breaking time up by date and time of day
+* Prefer numeric time units
+  * time unit since an epoch
+  * e.g. "seconds since 1970-01-01", "microseconds since 1980-01-06"
+
+#### Metadata
+* Optional but useful for making a NetCDF file self-describing
+* attribute
+  * global (dataset level)
+    * title
+    * history (provenance)
+  * variable
+    * long_name
+    * units
+* Conventions
+  * [Climate and Forecast (CF)](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html)
+  * [Attribute Convention for Data Discovery (ACDD)](https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3)
+  * [udunits](https://www.unidata.ucar.edu/software/udunits/): standard units
+
+#### Other useful variable attributes
+* _FillValue
+  * missing_value is considered deprecated and is not recommended by the NetCDF Users Group.
+  * NaN is another option; however, NaNs in files are handled differently in every language, so it may
+    be better to pick a value for official data products that many users will be using
+* valid_range, valid_min, valid_max
+* scale_factor, add_offset (packed values)
+* [cell_methods](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#_data_representative_of_cells): standards for representing data cells (bins)
+  * e.g.
+    daily average, wavelength bins
+
+## Useful Links
+* [NetCDF User's Guide](https://docs.unidata.ucar.edu/nug/current/)
+* [NetCDF ToolsUI](https://docs.unidata.ucar.edu/netcdf-java/current/userguide/toolsui_ref.html)
+* [NetCDF Workshop Materials](https://www.unidata.ucar.edu/software/netcdf/workshops/2011/index.html)
+
+
+Credit: Content taken from a Confluence guide written by Doug Lindholm
\ No newline at end of file
diff --git a/docs/source/_static/data_management/index.rst b/docs/source/_static/data_management/index.rst
new file mode 100644
index 0000000..aa55ee4
--- /dev/null
+++ b/docs/source/_static/data_management/index.rst
@@ -0,0 +1,8 @@
+Data Management
+===============
+
+
+.. toctree::
+    :maxdepth: 1
+
+    file_formats/index
\ No newline at end of file
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 0203a47..6593bea 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -7,3 +7,4 @@ Welcome to the LASP Developer's Guide!
     :maxdepth: 1
 
     licensing
+    data_management/index

From 7986e3630710095fd410a75972916e4e86fcd377 Mon Sep 17 00:00:00 2001
From: Keira Brooks
Date: Thu, 29 Aug 2024 11:57:02 -0600
Subject: [PATCH 2/2] Update README.md (#28)

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 53327ce..026060b 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # developer-guide
 
+This repository provides guidelines for software developers at LASP. You can view the developer guide here: http://lasp-developer-guide.readthedocs.io/
+
 ## Purpose
 
 The purpose of this repository is to provide a collection of documented guidelines for software development workflows