This is the git repository for the DURIN project. The goal of this repo is to organize and streamline the data management in the project and beyond.
The overview of all the datasets is here.
The raw and clean datasets are stored and will be made available on OSF after the end of the project. For now, the data are only available to the project partners.
All R code for cleaning the raw data is available on the Durin GitHub.
The data documentation, including a draft for the data paper, is available here (only available for authors), and the data dictionaries are below in this readme file.
We use snake_case style for names and coding. Snake_case means that we use lower case letters separated by an underscore, for example biomass_g or age_class. The only exceptions are siteID, blockID and plotID (legacy from previous projects).
File and variable names should be meaningful. Do not use my_data.csv or var1. Follow the naming convention for the main variables (see full description of the main variables in the table below).
We use redundancy in variable names to avoid mistakes. PlotID should contain the higher hierarchical levels such as siteID, blockID, treatment, habitat, species etc. For example plotID contains siteID, habitat, species and plot number: LY_O_VV_1, LY_F_VM_3
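As a minimal sketch of this redundancy, a plotID can be assembled from its components in base R. The data frame `plots` and its values are made up for illustration; only the column names follow the naming convention.

```r
# hypothetical plot metadata; column names follow the naming convention
plots <- data.frame(
  siteID    = c("LY", "LY"),
  habitat   = c("O", "F"),
  speciesID = c("VV", "VM"),
  plot_nr   = c(1, 3)
)

# combine the hierarchical levels into one redundant plotID
plots$plotID <- paste(plots$siteID, plots$habitat, plots$speciesID,
                      plots$plot_nr, sep = "_")

plots$plotID
```

Because the higher levels are encoded in the ID, a mismatch between plotID and the siteID or habitat column is easy to spot during cleaning.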
Files or variable | Naming convention | Example |
---|---|---|
File name | Project_Status_Experiment_Response_Year(s).Extension | DURIN_clean_gradient_cflux_2023-2025.csv |
Project | DURIN | |
Experiment | 4 corner = 4 main Durin sites, drought = drought experiment at Lygra and Tjotta, gradient = gradient study, nutrient = nutrient experiment | |
4 corner study | ||
date | Date of data collection | yyyy-mm-dd; do not split year, month and day into several columns |
year | Year of data collection | yyyy; sometimes there is no specific date, then year can be used |
site_ID_name | Unique full site name | Lygra, Sogndal, Senja, Kautokeino and Tjotta |
siteID | Unique siteID, first 2 letters of site_ID_name. | LY, SO, SE, KA and TJ |
habitat | Open versus forested habitat | O, F |
species | Vascular plant taxon names for Norway follow Lid & Lid (Lid J & Lid, 2010). We use full species names. On field sheets the names can be abbreviated (see speciesID), but the clean data should contain the full species name. | Vaccinium myrtillus |
speciesID | 2 letter abbreviation of species | VM, VV, CV, EN, BN |
plot_nr | Unique plot number, numeric value from 1-5. | 1-5 |
plotID | Unique plot ID as a combination of siteID, habitat, speciesID and plot number | LY_O_VV_1, LY_F_VM_1 |
variable | Response variable(s) | cover, biomass, Reco |
value | Value of response variable(s) | numeric value |
unit | Unit for response variable(s) | %, µmol m−2 s−1 |
other variables | Other important variables in the dataset | plant nr, remark, data collector, weather, flag |
DroughtNet experiment | ||
site_ID_name | Site name | Lygra, Tjotta |
siteID | Unique siteID, first 2 letters of site_ID_name. | LY, TJ |
age_class | Vegetation representing post-fire successional stages. | pioneer, building, mature |
drought_treatment | Drought treatment using rain-out shelters that reduce precipitation by 0% (ambient), 60% (moderate) or 90% (extreme) | ambient, moderate, extreme |
plotID | Unique plot ID as a combination of siteID, age_class, drought_treatment and plot number | LY_P_PIO_1, LY_M_EXT_3 |
droughtNet_plot | Unique plotID from the DroughtNet frames in the field. Corresponds to the Landpress naming. | 1.1, 1.2, 1.3 - 9.1, 9.2, 9.3 |
species | Vascular plant taxon names for Norway follow Lid & Lid (Lid J & Lid, 2010). We use full species names. On field sheets the names can be abbreviated (see speciesID), but the clean data should contain the full species name. | Vaccinium myrtillus |
speciesID | 2 letter abbreviation of species | VM, VV, CV, EN |
Nutrient experiment | ||
date | Date of data collection | yyyy-mm-dd; do not split year, month and day into several columns |
site_ID_name | Site name | Lygra |
siteID | Unique siteID, first 2 letters of site_ID_name. | LY |
habitat | Open versus forested habitat | O, F |
age_class | Vegetation representing post-fire successional stages. | pioneer, building, mature |
nitrogen_addition | Added level of nitrogen | 5, 10, 25, 30 |
plot_nr | Unique plot number | 1-5 |
plotID | Combination of siteID, habitat, age_class, nitrogen addition and plot number. | LY_O_BUI_10_3 |
Gradient study | ||
date | ?? | ? |
siteID | ?? | ? |
habitat | Open versus forested habitat | O, F |
plot_nr | Unique plot number | 1-5 |
plotID | Combination of siteID, habitat, and plot number. | ?? |
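The file name convention in the table can be assembled and checked programmatically. The sketch below is only an illustration: the helper `make_file_name()` and the regular expression are hypothetical and not part of the Durin code, and the pattern is a rough check rather than a complete rule.

```r
# hypothetical helper: Project_Status_Experiment_Response_Year(s).Extension
make_file_name <- function(status, experiment, response, years,
                           extension = "csv") {
  stem <- paste("DURIN", status, experiment, response, years, sep = "_")
  paste0(stem, ".", extension)
}

file_name <- make_file_name("clean", "gradient", "cflux", "2023-2025")
file_name

# rough check (assumption): five underscore-separated parts,
# status is raw or clean, and the last part is a year or a year range
pattern <- "^DURIN_(raw|clean)_[A-Za-z0-9]+_[A-Za-z0-9]+_[0-9]{4}(-[0-9]{4})?\\.[a-z]+$"

grepl(pattern, file_name)      # follows the convention
grepl(pattern, "my_data.csv")  # does not follow the convention
```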
Each dataset should contain only one response variable, or several if they are closely related. E.g. biomass and carbon flux should be two separate datasets. But the functional trait dataset can contain several traits, and the carbon flux dataset can contain GPP, NEE and Reco.
Datasets should be in a long format. If you have several response variables (e.g. traits, flux measurements), then use pivot_longer(cols = ..., names_to = "variable", values_to = "value") from the tidyr package to convert the data to a long format. Using the names variable and value for the columns is a standard. If you have very different response variables, it can be useful to have a column called unit.
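Reshaping from wide to long can be sketched with tidyr::pivot_longer(). The data frame `traits_wide`, its trait columns and its values are made up for illustration.

```r
library(tidyr)

# hypothetical wide trait data: one column per trait
traits_wide <- data.frame(
  plotID    = c("LY_O_VV_1", "LY_F_VM_3"),
  leaf_area = c(1.2, 0.8),
  dry_mass  = c(0.015, 0.011)
)

# gather the trait columns into variable/value pairs
traits_long <- pivot_longer(
  traits_wide,
  cols = c(leaf_area, dry_mass),
  names_to = "variable",
  values_to = "value"
)

traits_long
```

Each plot now has one row per trait, so new traits add rows instead of columns and the dataset structure stays the same.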
When dealing with many different datasets, it can be useful to structure them in a similar way. Arrange the dataset so that the important variables come first, preferably in this order:
- Year and date
- siteID
- plotID
- habitat
- treatment
- plantID
- leafID
- variable (diversity, biomass, flux)
- value
- (unit)
- predictor_variables (temperature level, oceanity)
- other_variables (remark, data collector, weather)
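One possible way to apply this column order is with dplyr::select(). This is a sketch under assumptions: the data frame `biomass` and its values are invented, and the vector `preferred_order` is a hypothetical name that only covers the first part of the list above.

```r
library(dplyr)

# hypothetical dataset with columns in arbitrary order
biomass <- data.frame(
  value    = c(12.3, 8.1),
  remark   = c(NA, "windy"),
  plotID   = c("LY_O_VV_1", "LY_F_VM_3"),
  variable = "biomass",
  siteID   = "LY",
  year     = 2023,
  unit     = "g",
  habitat  = c("O", "F")
)

# preferred order; any_of() skips columns that are not in the dataset
preferred_order <- c("year", "date", "siteID", "plotID", "habitat",
                     "treatment", "plantID", "leafID",
                     "variable", "value", "unit")

# important variables first, all remaining columns afterwards
biomass <- select(biomass, any_of(preferred_order), everything())

names(biomass)
```

Using any_of() means the same vector can be reused across datasets that lack some of the columns.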
How to make a data dictionary?
The R package dataDocumentation will help you to make the data dictionary. You can install and load the package as follows:
# if needed install the remotes package
install.packages("remotes")
# then install the dataDocumentation package
remotes::install_github("audhalbritter/dataDocumentation")
# and load it
library(dataDocumentation)
Make data description table
Find the file R/data_dic/data_description.xlsx. Enter all the variables into that table, including variable name, description, unit/treatment level and how measured. If a variable is global for all of Durin (e.g. siteID), leave TableID blank. If a variable is unique to a specific dataset, create a TableID and use it consistently for that dataset. Make sure you have described all variables.
Make data dictionary
Then run the function make_data_dictionary().
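# read the description table (assumption: the readxl package is installed)
description_table <- readxl::read_excel("R/data_dic/data_description.xlsx")

# build the data dictionary for the biomass dataset
data_dic <- make_data_dictionary(data = biomass,
                                 description_table = description_table,
                                 table_ID = "biomass",
                                 keep_table_ID = FALSE)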
Check that the function produces the correct data dictionary.
Add data dictionary to readme file
Finally, add the data dictionary below to be displayed in this readme file. Add a title, and a code chunk using kable() to display the data dictionary.
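The content of such a code chunk could look like the sketch below. The data frame `data_dic` here is a hypothetical stand-in with invented columns, not the actual output of the dataDocumentation package; kable() comes from the knitr package.

```r
library(knitr)

# hypothetical stand-in for a data dictionary (columns are assumptions)
data_dic <- data.frame(
  variable    = c("siteID", "value"),
  description = c("Unique siteID", "Value of response variable"),
  unit        = c(NA, "g")
)

# render the data dictionary as a table in the readme
dic_table <- kable(data_dic)
dic_table
```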
For more details go to the dataDocumentation readme file.