
Developing a QA framework for the bdverse

Thiloshon Nagarajah edited this page Mar 5, 2019 · 1 revision

Background

The bdverse is an ensemble of R packages which allow users to conveniently employ R for biodiversity data exploration, quality assessment (QA), data cleaning, and standardization. It comprises six unique packages in a hierarchical structure, representing different functionality levels. The bdverse was designed to support different user needs and programming capabilities. To help enable the longevity of the bdverse, care needs to be taken to ensure that software-engineering best practices are followed.

Related work

Recently, the rOpenSci organization published its gitbook on recommendations for R package development and maintenance. We plan to submit the released bdverse packages to rOpenSci software review as soon as the packages meet the core requirements. Nonetheless, we plan to develop a robust QA infrastructure, tailored to address most of the bdverse's vulnerabilities. The bdverse package system is envisaged as a comprehensive and long-term solution to issues related to biodiversity data. As such, its QA framework must also address the unique aspects of biodiversity data (e.g. the DwC data standard; different taxonomic, spatial, and temporal properties; techniques for biodiversity data cleaning).

Details of your coding project

We identified five key elements that need to be developed and integrated into the bdverse packages:

  • Unit testing: The best way to ensure all functions work as expected is to use a unit testing framework to assist with writing test functions, specifying inputs, and checking outputs. In R, the most widely used testing framework is testthat, which comes with great documentation and is integrated with RStudio and many other downstream testing tools. One helpful approach is test-driven development, whereby the expected inputs and outputs of a given function are determined before any of the function logic is actually written. This forces you to think ahead of time about the interface to the logic and the data that will be needed, as well as ensuring that a unit test is written for the new code. Testing R Shiny apps is a little more difficult than testing regular functions, given they typically run as part of a web app. The shinytest package has been developed to test this functionality and may be beneficial in this instance, given that some key functionality (e.g. user questionnaires) is integrated as part of the R Shiny apps.
  • Continuous integration (CI): CI acts as an alert system, running all unit tests whenever code is committed to a repository and thereby indicating whether the code about to be pushed onto the master branch contains bugs that could break the system. It can also provide services such as estimating how much of the code is actually covered by unit tests, to make sure that all functionality is being tested, and ensuring that code style is maintained by new commits. We plan to implement rOpenSci CI best practices.
  • Defensive programming: Defensive programming is a practice where you add supporting code to detect, isolate, and in some cases recover from anticipated failures in your code. Such failures can occur due to malformed inputs or bugs within the code. Defensive programming is about letting the user know when an error has occurred and providing some context about what caused it. Assertions, which verify that a given statement is true at a given point in the code, are a common tool in this regard. A common example is checking the input parameters to a function. In R this is often implemented using the stopifnot() function, though packages such as assertthat also exist.
  • Reduce dependencies: Heavy dependencies on other packages have been described as “an invitation for other people to break your code”. Although this risk can be minimized by implementing CI with high code coverage, reducing dependencies is especially important for packages that are difficult to install.
  • Documentation and comments: Developing high-quality package documentation and ensuring that code comments are sufficient and clear.
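The testthat workflow described above can be sketched as follows. The data check `check_year_range()` is a hypothetical stand-in, not the actual bdchecks API; only the testing pattern is the point here:

```r
library(testthat)

# Hypothetical data check: flags records whose collection year falls
# inside a plausible range. Name, interface, and defaults are
# illustrative only.
check_year_range <- function(years, earliest = 1700, latest = 2019) {
  !is.na(years) & years >= earliest & years <= latest
}

# Tests written against the expected inputs/outputs, test-driven style:
test_that("check_year_range flags implausible years", {
  expect_equal(check_year_range(c(1850, 2500, NA)), c(TRUE, FALSE, FALSE))
  expect_true(all(check_year_range(1700:2019)))
  expect_error(check_year_range())  # a missing input should fail loudly
})
```

Each data check in bdchecks could get a file of such expectations under `tests/testthat/`, so `devtools::test()` exercises them all at once.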
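A minimal CI setup for an R package, in the spirit of the CI bullet above, might look like the Travis configuration below. This is a sketch, not the project's actual configuration; the covr upload assumes the repository is linked to a coverage service such as Codecov:

```yaml
# .travis.yml — minimal CI for an R package (illustrative sketch)
language: r
cache: packages        # speed up builds by caching installed packages
r_packages:
  - covr               # code-coverage estimation
after_success:
  # report how much of the package is covered by unit tests
  - Rscript -e 'covr::codecov()'
```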
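The defensive-programming bullet above can be illustrated with base R's stopifnot(). The function `mean_latitude()` is a made-up example; the pattern of asserting inputs before doing any work is what matters:

```r
# Defensive input checking at the top of a function (illustrative only).
mean_latitude <- function(lat) {
  stopifnot(is.numeric(lat))                      # fail fast on wrong type
  if (any(abs(lat) > 90, na.rm = TRUE)) {
    stop("latitude values must lie between -90 and 90")
  }
  mean(lat, na.rm = TRUE)
}

mean_latitude(c(12.5, 13.1))   # valid input: returns the mean
# mean_latitude(c(12.5, 95))   # errors with a clear, contextual message
# mean_latitude("12.5")        # stopifnot() reports the failed condition
```

The assertthat package offers the same idea with friendlier error messages, at the cost of one extra dependency.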

Skills Required

R, shiny, QA best practices, testthat, covr, Travis CI, R + AppVeyor, tic, bookdown.

Advantage: experience in working with biodiversity big-data.

Expected impact

Developing the bdverse is, in a sense, like building a ‘house of cards’; our biggest challenge is how to safeguard it. Developing and implementing a robust QA infrastructure, as challenging as it may be, is not optional but a necessity. This framework will ensure that bdverse features are held to a high standard, thus retaining the project's evolutionary potential.

Mentors

Students, please contact the mentors below after completing at least one of the tests listed further down.

  • Tomer Gueta [email protected] is leading the bdverse project. He is a postdoctoral fellow at the Faculty of Civil and Environmental Engineering at the Technion, working with Prof. Yohay Carmel. His research deals with developing tools and methodologies for data-intensive biodiversity research. During the last two years, Tomer served as a GSoC mentor with the R project organization.
  • Thiloshon Nagarajah [email protected] is a key member of the bdverse development team. He is a past GSoC and GCI student with the Fedora Project, the Sahana Foundation, and the R project.
  • Vijay Barve [email protected] is the author and maintainer of bdvis and a key member of the bdverse development team. Vijay is a biodiversity data scientist who has been a GSoC student and mentor with the R project organization since 2012. Vijay has contributed to several packages on CRAN.

Join our Slack Channel for immediate clarifications: bd-r-group.slack.com

Tests

Students, please do one or more of the following tests before contacting the mentors above.

  • Easy: Review the rOpenSci gitbook (chapters 1 and 2) and find at least five instances where bdchecks does not follow the gitbook's standards.
  • Easy: Fork the bdchecks package (latest version) and develop unit tests for at least two data checks.
  • Medium: Review the bdchecks package (latest version) and suggest a unit testing mechanism for all data checks.
  • Medium: Review bdchecks shiny app (latest version) and develop two unit tests using shinytest.
  • Hard: Formulate an R markdown report on QA best practices relevant to the bdverse, try to give suggestions on how to enforce them.
  • Hard: Fork bdchecks (latest version) and implement the suggested mechanism you developed in the second test (hint: the YAML file is in bdchecks/inst/extdata).

Solutions of tests

Students, please post here a github link to your test solutions in the format,

  1. Name - Email - University - Link to solutions