Thank you for helping us help learners!
There are 4 main steps to submit a dataset:
Find a dataset that would be good for TidyTuesday: either one that is already ready for analysis, or one that you can clean so that it meets the criteria. These are the requirements for a dataset:
- Data can be saved as one or more CSV files.
- The whole dataset (all files) is less than 20MB.
- You can describe each variable (either using an existing data dictionary or by creating your own dictionary).
- The data is publicly available and free for reuse, either with or without attribution.
You will also need:
- The source of the dataset
- An article about the dataset or that uses the dataset
- At least one image related to or using the dataset
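The format and size requirements above lend themselves to a quick local check before you start. The following is a minimal sketch in base R and is not part of the TidyTuesday tooling: `check_dataset_files()` is a hypothetical helper, and it assumes that "20MB" means 20,000,000 bytes.

```r
# Hypothetical helper: summarise the CSV files in a folder and check the
# total size against the limit. Assumes 1 MB = 1,000,000 bytes.
check_dataset_files <- function(dir, max_mb = 20) {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  csv_files <- files[grepl("\\.csv$", files, ignore.case = TRUE)]
  total_mb <- sum(file.size(csv_files)) / 1e6
  list(
    n_csv = length(csv_files),   # number of CSV files found
    total_mb = total_mb,         # combined size of those files
    ok = length(csv_files) > 0 && total_mb < max_mb
  )
}

# Example call (the folder name is a placeholder):
# check_dataset_files("path/to/your/csv/files")
```

The check only covers file format and size. Whether the data is free for reuse and whether you can describe each variable still needs to be confirmed by hand.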
You'll need to perform this step the first time you submit a pull request to this repository. A "pull request" is a submission of code to a git repository. If you have never worked with git before, that's fine! We'll help you get set up.
- Set up git, github, and your IDE (such as RStudio). We have step-by-step instructions for setting up things to work with the Data Science Learning Community.
- Fork the tidytuesday repository. In R, you can use `usethis::create_from_github("rfordatascience/tidytuesday")` to create your personal fork on GitHub and copy it to your computer. Note: This requires about 8 GB of space on disk.
We use a fork/branch approach to pull requests, meaning you'll create a version of the repo specifically for your changes, and then ask us to merge those changes into the main tidytuesday repository.
- If you are on anything other than the `master` branch of your local repository, switch back to `master`. In R, you can use `usethis::pr_pause()` (if your previous submission is still pending), or `usethis::pr_finish()` (if we've accepted your submission).
- Pull the latest version of the repository to your computer. In R, use `usethis::pr_merge_main()`.
- Create a new branch, with something similar to the name of the dataset you're submitting. In R, you can create this branch using `usethis::pr_init(BRANCHNAME)`. For instance, if it's a dataset on American baseball, you could use `usethis::pr_init("american-baseball")`.
- Navigate to the `data/curated` folder in your branch of the repository.
- Make a copy of the `template` folder for your dataset, inside the `curated` folder. Name it something descriptive; the same name as your branch would work, so "american-baseball", not "my_dataset".
- Do your work inside the folder you just created.
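Copying and renaming the template folder can also be done from R rather than in a file browser. The following is a minimal sketch using only base R: `copy_template()` is a hypothetical helper, not something provided by the repository, and the example call assumes your working directory is the root of your local copy of the repository.

```r
# Hypothetical helper: copy the contents of the `template` folder into a new,
# descriptively named folder next to it.
copy_template <- function(curated_dir, dataset_name) {
  template_dir <- file.path(curated_dir, "template")
  new_dir <- file.path(curated_dir, dataset_name)
  # Refuse to run if the template is missing or the target already exists.
  stopifnot(dir.exists(template_dir), !dir.exists(new_dir))
  dir.create(new_dir)
  template_files <- list.files(
    template_dir, full.names = TRUE, all.files = TRUE, no.. = TRUE
  )
  file.copy(template_files, new_dir, recursive = TRUE)
  invisible(new_dir)
}

# Example call, using the branch name from the steps above:
# copy_template("data/curated", "american-baseball")
```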
A copy of the following instructions is also available in the folder you've created, as `instructions.md`.
These instructions are for preparing a dataset using the R programming language, but we hope to provide instructions for other programming languages eventually.
- `cleaning.R`: Modify the `cleaning.R` file to get and clean the data.
  - Write the code to download and clean the data in `cleaning.R`.
  - If you're getting the data from a GitHub repo, remember to use the 'raw' version of the URL.
  - This script should result in one or more data.frames, with descriptive variable names (e.g., `players` and `teams`, not `df1` and `df2`).
- `saving.R`: Use `saving.R` to save your datasets. This process creates both the `.csv` file(s) and the data dictionary template file(s) for your datasets. Don't save the CSV files using a separate process because we also need the data dictionaries.
  - Run the first line of `saving.R` to create the functions we'll use to save your dataset.
  - Provide the name of your directory as `dir_name`.
  - Use `ttsave()` for each dataset you created in `cleaning.R`, substituting the name of the dataset for `YOUR_DATASET_DF`.
- `{dataset}.md`: Edit the `{dataset}.md` files to describe your datasets (where `{dataset}` is the name of the dataset). These files are created by `saving.R`. There should be one file for each of your datasets. You most likely only need to edit the "description" column to provide a description of each variable.
- `intro.md`: Edit the `intro.md` file to describe your dataset. You don't need to add a `# Title` at the top; this is just a paragraph or two to introduce the week.
- Find at least one image for your dataset. These often come from the article about your dataset. If you can't find an image, create an example data visualization. Save the images in your folder as `png` files.
- `meta.yaml`: Edit `meta.yaml` to provide information about your dataset and how we can credit you. You can delete lines from the `credit` block that do not apply to you.
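For the `cleaning.R` step, the note about using the 'raw' version of a GitHub URL can be illustrated with a short sketch. `github_raw_url()` is a hypothetical helper, and the repository and file names below are placeholders, not a real data source. It assumes the usual pattern in which a `github.com/.../blob/...` page URL corresponds to a `raw.githubusercontent.com/...` file URL.

```r
# Hypothetical helper: turn a GitHub "blob" page URL into its raw-file URL.
# URLs that do not match the expected pattern are returned unchanged.
github_raw_url <- function(url) {
  pattern <- "^https://github\\.com/([^/]+)/([^/]+)/blob/(.+)$"
  if (!grepl(pattern, url)) {
    return(url)
  }
  sub(pattern, "https://raw.githubusercontent.com/\\1/\\2/\\3", url)
}

# Placeholder URL for illustration only.
page_url <- "https://github.com/example-user/example-repo/blob/main/data/players.csv"
raw_url <- github_raw_url(page_url)
# raw_url is now
# "https://raw.githubusercontent.com/example-user/example-repo/main/data/players.csv"

# In cleaning.R you could then read the file into a descriptively named
# data.frame, for example:
# players <- read.csv(raw_url)
```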
- Commit the changes in this folder to your branch. In RStudio, you can do this on the "Git" tab (the "Commit" button).
- Submit a pull request to https://github.com/rfordatascience/tidytuesday. In R, you can do this with `usethis::pr_push()`, and then follow the instructions in your browser.