These are my rough thoughts. I expect this document to change over time. I also greatly value community feedback on these thoughts. This includes both content and organization. Providing feedback via issues, pull requests, or discussions is welcomed and encouraged.
There is a lot written about the benefits of code review. There is also a lot written about code review processes for software engineers or people who work with code in production and can take advantage of CI/CD pipelines or testing environments. I think there is less written about code review for analysts. If you know of any existing resources along these lines, please share.
Even though this repository is named *Code Review For Analysts*, your job title does not need to be *analyst* to benefit from this document; this document is for teams who write code for the purposes of data analysis and employ code review. This document does not apply to code review of R packages, as that follows a separate process. Other job titles that may find this document useful include data scientist, (bio)statistician, and analytics engineer.
- Code Review in the Lab, or … How do you review code that accompanies a research project or paper?, ROpenSci 2018, recording, blog post
- Opinionated Analysis Development, Hillary Parker 2017, paper, slides, recording
- R Code Review Checklist, Travis Gerke 2021, repository
- Code smells and feels, Jenny Bryan 2018, slides, recording
My team works with clinical trials data and programs in R using tidyverse syntax. The specifications written here are tailored to that context; I am happy to consider making it more generalizable in the future.
My team also works in a specific project template. For now, some of those details may go in here as well.
The aims of this document are to:
- Provide a team with a clear set of expectations and guidelines to follow before they submit code for review.
- Empower a team to learn how to review code.
`.github/pull_request_template.md`
- Make sure you (the person submitting the code) have a clear understanding of the end user's request. A reliable way to do this is to write the specifications as you understand them and send your written description to the requester to confirm you are on the same page.
- Check the associated project repository for existing code you can build upon.
- Check the associated project repository for any open pull requests. If there are open PRs, assess whether they contain related work or work that could cause a downstream merge conflict. If so, discuss a plan for how to handle them.
- Similarly, check the project repository for open issues that may relate to the request and/or be resolved through completion of the request.
Code review is not required if:
- you are an experienced employee and you are re-executing existing work with minimal changes (i.e., updated data).
- you are an experienced employee and you are creating new work, but it has been agreed that there will be an alternative review process such as the clinical operations or data management team reviewing the results.
Code review is required if you are a newer employee (<4 months) and for all other circumstances.
Depending on the scope of the code in progress, it may be beneficial to have some initial conversation and feedback prior to officially requesting a review. This is especially important for newer employees or newer bodies of work, and can be very beneficial to ensure that data context, areas of high code-complexity or risk, and request nuances are appropriately accounted for. Having this initial conversation can help ensure that both the code submitter and reviewer are on the same page and greatly streamline the review process.
After that:
- Confirm that your code follows the guidelines below.
- Restart R and run `renv::status()`. Correct any problems (e.g., updates to `renv.lock` or package version changes via `renv::restore()`).
- Restart R and re-execute your script in a fresh environment. Alternatively, re-render your quarto or rmarkdown document. The results should render warning- and error-free prior to submission.
- In the PR, inform the reviewer when the deliverable is due, or suggest a timeline for review, so that it can be prioritized accordingly.
We started writing code before we had a team with code review. We have a lot of older code that does not conform to our new standards. This means:
- If you use any existing code as a starting point, note that the existing code might not have gone through this code review process. If this is a new body of work, you are still responsible for updating it according to these guidelines.
- The point of this is not to update our entire code base according to these standards. The point is to raise the bar for new work, or for existing high-impact, high-friction work.
- Standard: The default review, following all of the guidelines below.
- Scoped: This might be appropriate for pre-existing, legacy code that requires some updates. This body of code may not satisfy the guidelines below; however, the gains from refactoring the code are low impact and therefore refactoring is not worthwhile.
- Expedited: This is a review for work that requires a quick turnaround. The reviewer may be lenient on the review specifications below; however, note that shortcuts taken in specifications will likely lengthen the review process.
High-level documentation should be added at the top of the script / quarto document describing the aims of this body of work. This documentation should also include links to relevant internal documents when applicable (e.g., protocols, data request forms). If conversations occurred to clarify or confirm any portion of the request, that should be captured in the high-level description as well (e.g., "Confirmed with Susan that the dates are for the 2023 calendar year."). If the scope of the request evolves, the documentation should be updated to align with the new scope.
Comments are also expected throughout the script; a short sketch of these comment types follows the list below.
- New data sets should have a brief comment about expected form and content (e.g., 1 row per subject containing demographic information).
- Joins should have comments explaining the type of join used and why.
- New variables should have comments about what the variable represents and notes about nuances in the derivation.
- Grouping operations should have comments about what level the grouping is at and why.
- If, during the review process, an inquiry is raised and resolved via a comment, please document that as a comment in the script (and not just leave it in the review conversation on GitHub).
- Probably more types of comments expected.
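As a rough illustration, a hypothetical sketch of these comment types (the data, names, and derivations here are invented):

```r
library(tidyverse)

# df_demo: 1 row per subject containing demographic information
df_demo <- tribble(
  ~ID_PATIENT, ~AGE,
  1, 71,
  2, 58
)

# df_visits: 1 row per subject per visit
df_visits <- tribble(
  ~ID_PATIENT, ~VISIT,
  1, "Baseline",
  1, "Week 4",
  2, "Baseline"
)

df_main <- df_visits |>
  # left join: keep every visit row and attach demographics by subject
  left_join(df_demo, by = "ID_PATIENT") |>
  mutate(
    # flag_age_65plus: TRUE when age is 65 or older
    flag_age_65plus = AGE >= 65
  )
```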
We follow the tidyverse style guide.
Objects in R such as data frames should be preceded by the object type, e.g., `lst_medidata_data_raw`, `df_main`, and `vec_subjects`.
Sometimes variable names must be defined according to a prespecified data model (e.g. CDISC/SDTM/ADaM).
When we do have control over naming conventions, employ the concepts from the Column Names as Contracts framework.
The general hierarchy is `type_contents_detail` (where `detail` is optional), for example, `ID_PATIENT` and `DT_PSA_BASELINE`.
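A minimal sketch of renaming toward this convention; the raw column names on the right-hand side are hypothetical:

```r
library(tidyverse)

df_labs_raw <- tribble(
  ~patient_id, ~baseline_psa_date,
  1, as.Date("2023-01-15")
)

# rename toward the type_contents_detail contract
df_labs <- df_labs_raw |>
  rename(
    ID_PATIENT      = patient_id,
    DT_PSA_BASELINE = baseline_psa_date
  )
```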
Binary variables (i.e., coded as `TRUE`, `FALSE`) should have the name and value in alignment. For example, a variable such as `flag_psa_invalid` should have a value of `TRUE` when the record is invalid; alternatively, `confirm_psa_valid` would have a value of `TRUE` when the record is valid.
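For example, a small sketch with invented data (the `VAL_PSA` column is hypothetical):

```r
library(tidyverse)

df_psa <- tribble(
  ~ID_PATIENT, ~VAL_PSA,
  1, 4.2,
  2, -1  # a negative value is invalid in this toy example
)

df_psa <- df_psa |>
  mutate(
    flag_psa_invalid = VAL_PSA < 0  # TRUE exactly when the record is invalid
  )
```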
Ideally, we would like to create the fewest relevant data sets possible while also retaining data sets that allow one to fully check new variable derivations.
- Data sets should be created in terms of unit of observation (1 row per subject, 1 row per imaging scan, etc.) and scope (demographics, endpoints, etc.).
- If the data sets have the same unit of observation (i.e., 1 row per subject), consider whether they should be merged together or kept separate.
- Operations/mutations on data sets should be done in a single data step when possible (e.g., don't create an age category in one data step and then a BMI category in a separate data step).
- Left joins are generally preferred for readability, but make sure rows from the right-hand data frames that you wish to keep aren't silently dropped. The tidylog package can be helpful to provide verbose feedback about the implications of various joins.
- Data steps should generally be organized with all joins, followed by all filters, followed by mutating statements:
```r
df_subjects <- df_1 |>
  left_join(df_2, ...) |>
  left_join(df_3, ...) |>
  filter(...) |>
  mutate(
    ...
  )
```
This is as opposed to patterns like the following, which are harder for a reviewer to follow.
```r
# please don't do this
df_subjects <- df_1 |>
  filter(...) |>
  left_join(df_2, ...) |>
  mutate(...) |>
  filter(...) |>
  left_join(df_3, ...) |>
  mutate(...)
```
- Use `filter` statements sparingly/carefully. Sometimes they are appropriate, for example, to remove screen-fail subjects. More often than not, you would be better served by creating new derived variables indicating subjects of interest, as we are often interested not only in the observations that meet a certain criterion but also in the full context of how many subjects did not meet it. Flags are encouraged for this purpose (see the sketch after this list), and this practice aligns with SDTM standards, e.g., `ENRLFL`, `SAFFL`, and `ITTFL`.
- When grouping or reducing data, you will likely need to retain one data step to inspect derivations and a separate data step for reducing / refining the observations of interest.
- Always confirm that there is 1 row per expected unit of observation (e.g., 1 row per subject, or 1 row per scan per date). This can be done via `janitor::get_dupes()`, as sketched below.
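A minimal sketch of both practices, with hypothetical column names:

```r
library(tidyverse)
library(janitor)

df_subjects <- tribble(
  ~ID_PATIENT, ~DISPOSITION,
  1, "Completed",
  2, "Screen Failure",
  2, "Screen Failure"  # duplicated row, to illustrate get_dupes()
)

# derive a flag rather than filtering, preserving the full denominator
df_subjects <- df_subjects |>
  mutate(flag_screen_fail = DISPOSITION == "Screen Failure")

df_subjects |> count(flag_screen_fail)

# confirm 1 row per subject; any rows returned are unexpected duplicates
df_subjects |> get_dupes(ID_PATIENT)
```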
In general, it is best practice to create new variables with new names and not overwrite existing variables. Inspecting a new variable requires cross-checking the new values against the previous values; when you overwrite a variable, you lose the ability to make that comparison and verify the derivation.
The only exception I make to this is converting a character variable to a factor, provided all original character values are retained.
This rule applies to objects / data frames in memory as well: best not to overwrite `df_main` with a new `df_main`; rather, rename or create intermediate data frames if necessary.
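A minimal sketch of this rule, and of the factor exception above (names are hypothetical):

```r
library(tidyverse)

df_demo <- tribble(
  ~ID_PATIENT, ~AGE, ~SEX,
  1, 71, "F",
  2, 58, "M"
)

# new variable, new name: AGE is retained so the derivation can be checked
df_demo_derived <- df_demo |>
  mutate(cat_age = if_else(AGE >= 65, "65+", "<65"))

# the factor exception: overwriting is acceptable because every original
# character value is retained as a factor level
df_demo_derived <- df_demo_derived |>
  mutate(SEX = factor(SEX, levels = c("F", "M")))
```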
When creating binary variables, there is a preference to code values as logical `TRUE`/`FALSE`. If there is a strong rationale behind character coding (e.g., `"yes"`, `"no"`), then go forth with that; however, be consistent in the coded values (i.e., `"Yes"` vs `"yes"`).
It is also very important to consider whether `NA` is a valid coded option. In general, the difference between coding an observation as `NA` vs `FALSE` is whether or not that observation should count in the denominator when tabulating the variable (values of `NA` generally do not contribute towards the denominator by default; values of `FALSE` do).
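A small base R illustration of that denominator difference:

```r
# toy vectors illustrating the denominator difference
flag_with_na    <- c(TRUE, FALSE, NA)     # NA: excluded from the denominator
flag_with_false <- c(TRUE, FALSE, FALSE)  # FALSE: counted in the denominator

mean(flag_with_na, na.rm = TRUE)  # 0.5   (1 of 2)
mean(flag_with_false)             # 0.333 (1 of 3)
```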
Do not shoehorn complex logic into a `case_when` statement; this makes it very hard to inspect results. Instead, derive simpler variables that can be easily inspected and then built upon for more complex logic.
For example, consider date variables imported from Excel that need to be converted to proper date formats, with dates falling in or before year 1900 set to missing.
```r
library(tidyverse) # lubridate (for year()) is attached as of tidyverse 2.0

df_raw <- tribble(
  ~id, ~dt_excel,
  1, 42467,
  2, NA,
  3, 2
)
```
In this first data step, computational statements are employed on both the left-hand side of the evaluation and the right-hand side of the assignment.
```r
df_1 <- df_raw |>
  mutate(
    dt_clean = case_when(
      year(dt_excel) <= 1900 ~ NA_Date_,
      .default = janitor::convert_to_date(dt_excel)
    )
  )
```
If results are unexpected here, it is hard to determine if it is because `year(dt_excel)` failed, `year(dt_excel) <= 1900` failed, or `janitor::convert_to_date(dt_excel)` failed.
Instead of embedding computational assignments inside of a `case_when` statement, create simpler variables that build up to this logic.
```r
df_2 <- df_raw |>
  mutate(
    dt_proper = janitor::convert_to_date(dt_excel),
    dt_year = year(dt_proper),
    dt_year_valid = dt_year > 1900,
    dt_clean = case_when(
      dt_year_valid ~ dt_proper,
      .default = NA_Date_
    )
  )
```
All new variables should have labels describing their meaning.
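This document does not prescribe a mechanism for labels; as one possibility, the labelled package can attach a label to `dt_clean` in the `df_2` example above:

```r
library(labelled)

df_2 <- df_2 |>
  set_variable_labels(
    dt_clean = "Clean date; Excel serial converted, years <= 1900 set to missing"
  )

# confirm the label was attached
var_label(df_2$dt_clean)
```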
In general, the best way to inspect derived variables is to tabulate the newly derived variable against the original variable, as in `data |> count(variable_new, variable_old)`. Often, these tabulations span more rows than the default print method displays. To see all results, you can use `View()` or `print(n = Inf)`.
```r
# does not work well in Rmarkdown/quarto and can be disruptive when submitting
# entire scripts, but has easy sorting and filtering features to examine results
data |> count(variable_new, variable_old) |> View()

# prints all results to the console; no easy sorting/filtering features
data |> count(variable_new, variable_old) |> print(n = Inf)
```
Each newly created variable must be tabulated against original variables used in the derivation to confirm the new variable is coded as intended.
Missing values are always of special interest. Verify the presence, if any, of missing values in original variables and confirm that those observations are coded correctly in newly derived variables.
In `df_1` above, results can be inspected, but it would be hard to diagnose where the code went wrong if results were unexpected.
```r
df_1 |> count(dt_excel, dt_clean)
```
In this case, the 1900 condition failed to result in a missing value.
In `df_2`, each individually derived variable can be inspected to pinpoint where logic fails.
```r
df_2 |> count(dt_proper, dt_excel)
df_2 |> count(dt_year, dt_proper)
df_2 |> count(dt_year_valid, dt_year)
df_2 |> count(dt_clean, dt_year_valid, dt_proper)
```
Commented-out code in your submission is rarely valuable and is best omitted. It most often reflects older work that is no longer relevant. If you really do want to leave in commented-out code, be sure to add a comment describing why it is important to retain.
If you are removing commented-out code and are concerned about losing work, using regular GitHub releases to mark merged work for specific deliverables or data changes is recommended; this allows you to quickly return to that point in time.
Code should execute without warnings. If you have encountered a warning you are not sure how to resolve, ask about it in a pre-code-review discussion.
Explore what of this can be detected automatically via `lintr`:
- Commented code linter https://lintr.r-lib.org/reference/commented_code_linter.html.
- Join/filter/mutate patterns in data steps??
Explore husky for running lint checks via git hooks: https://typicode.github.io/husky/.
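As a starting point, a sketch of running lintr with its default linters (the script path is hypothetical; the defaults include `commented_code_linter()` at the time of writing):

```r
# "analysis.R" is a hypothetical script path; lint() with no `linters`
# argument applies lintr's default linters
lintr::lint("analysis.R")
```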