Commit 5f7fad2: Update contributing documentation
guillemcortes committed Nov 4, 2024 (1 parent: 62c08b9)
docs/source/contributing.rst: 113 additions, 21 deletions
Installing mirdata for development purposes
-------------------------------------------

To install ``mirdata`` for development purposes:

- First, run ``git clone https://github.com/mir-dataset-loaders/mirdata.git``
- Then, from the root folder of the cloned repository, install the dependencies:

  - Install core dependencies with ``pip install .``
  - Install testing dependencies with ``pip install ."[tests]"``
  - Install docs dependencies with ``pip install ."[docs]"``
  - Install dataset-specific dependencies with ``pip install ."[dataset]"``, where ``dataset`` can be ``dali | haydn_op20 | cipi ...``


We recommend to install `pyenv <https://github.com/pyenv/pyenv#installation>`_ to manage your Python versions
The steps to add a new dataset loader to ``mirdata`` are:
1. `Create an index <create_index_>`_
2. `Create a module <create_module_>`_
3. `Add tests <add_tests_>`_
4. `Update Mirdata documentation <update_docs_>`_
5. `Upload index to Zenodo <upload_index_>`_
6. `Create a Pull Request on GitHub <create_pr_>`_


Before starting, check if your dataset falls into one of these non-standard cases:
The index is a JSON file with information about the files included in the dataset, their location and checksums.
1. To create an index, first create a script in ``scripts/``, e.g. ``make_dataset_index.py``, which generates an index file.
2. Then run the script on the dataset and save the index in ``mirdata/datasets/indexes/`` as ``dataset_index_<version>.json``, where ``<version>`` indicates which version of the dataset was used (e.g. 1.0).
3. When the dataloader is completed and the PR is accepted, upload the index to our `Zenodo community <https://zenodo.org/communities/audio-data-loaders/>`_. See more details `here <upload_index_>`_.


The script ``make_<datasetname>_index.py`` should automate the generation of an index by computing the MD5 checksums for the files of a dataset located at ``data_path``.
Contributors can adapt this script to create an index for their dataset by adding their file paths and using the ``md5`` function to generate checksums for their files.
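
As an illustration, here is a minimal sketch of such a script. The ``audio/*.wav`` layout and the ``audio`` key are assumptions to adapt to your dataset's structure; ``md5`` is the checksum helper shipped in ``mirdata/validate.py``.

.. code-block:: python

    import argparse
    import glob
    import json
    import os

    from mirdata.validate import md5

    INDEX_VERSION = "1.0"


    def make_dataset_index(data_path):
        """Map each track id to its relative file paths and MD5 checksums."""
        index = {"version": INDEX_VERSION, "tracks": {}}
        for audio_file in sorted(glob.glob(os.path.join(data_path, "audio", "*.wav"))):
            track_id = os.path.splitext(os.path.basename(audio_file))[0]
            relative_path = os.path.relpath(audio_file, data_path)
            index["tracks"][track_id] = {"audio": [relative_path, md5(audio_file)]}
        return index


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Generate a dataset index.")
        parser.add_argument("data_path", help="Path to the dataset root folder.")
        args = parser.parse_args()
        with open("dataset_index_{}.json".format(INDEX_VERSION), "w") as fhandle:
            json.dump(make_dataset_index(args.data_path), fhandle, indent=2)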

.. _index example:

Here is an example of an index to use as a guideline:
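
The following is a minimal sketch of the expected structure; the track id, file paths, and checksums are placeholders, not real values:

.. code-block:: json

    {
        "version": "1.0",
        "tracks": {
            "track1": {
                "audio": ["audio/track1.wav", "00000000000000000000000000000000"],
                "annotation": ["annotations/track1.csv", "00000000000000000000000000000000"]
            }
        }
    }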

More examples of scripts used to create dataset indexes can be found in the `scripts <https://github.com/mir-dataset-loaders/mirdata/tree/master/scripts>`_ folder.

.. note::
    Contributors should be able to create the dataset indexes without the need for additional dependencies that are not included in ``mirdata`` by default. Should you need an additional dependency for a specific reason, please open an issue to discuss the need for it with the mirdata maintainers.

tracks
^^^^^^

You may find these examples useful as references:
For many more examples, see the `datasets folder <https://github.com/mir-dataset-loaders/mirdata/tree/master/mirdata/datasets>`_.


Declare constant variables
^^^^^^^^^^^^^^^^^^^^^^^^^^
Please include the variables ``BIBTEX``, ``INDEXES``, ``REMOTES``, and ``LICENSE_INFO`` at the beginning of your module.
``BIBTEX`` (the bibtex-formatted citation of the dataset), ``INDEXES`` (index urls, checksums, and versions),
and ``LICENSE_INFO`` (the license that protects the dataset in the dataloader) are mandatory; ``REMOTES`` is only defined if the dataset is openly downloadable.
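
For instance, the two simplest constants might look like the following sketch (the citation fields and license are placeholders, assuming a CC BY 4.0 dataset):

.. code-block:: python

    BIBTEX = """@inproceedings{author2024example,
        author    = {Author, Alice and Author, Bob},
        title     = {An Example Dataset for Music Information Retrieval},
        booktitle = {Proceedings of an Example Conference},
        year      = {2024}
    }"""

    LICENSE_INFO = "Creative Commons Attribution 4.0 International (CC BY 4.0)"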

``INDEXES``
    As seen in the example below, we have two ways to define an index:
    providing a URL to download the index file, or providing the filename of the index file, assuming it is available locally (like sample indexes).

* The full indexes for each version of the dataset should be retrieved from our Zenodo community. See more details `here <upload_index_>`_.
* The sample indexes should be locally stored in the ``tests/indexes/`` folder, and directly accessed through filename. See more details `here <add_tests_>`_.

**Important:** We recommend setting the highest version of the dataset as the default version in the ``INDEXES`` variable.
However, if there is a reason for having a different version as the default, you may do so.
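
Assuming the ``core.Index`` helper used by existing loaders, an ``INDEXES`` declaration could look like this sketch (filenames, URL, and checksum are placeholders, not a real Zenodo entry):

.. code-block:: python

    # assumes `from mirdata import core` at the top of the module
    INDEXES = {
        "default": "1.0",
        "test": "sample",
        "1.0": core.Index(
            filename="example_index_1.0.json",
            url="https://zenodo.org/records/0000000/files/example_index_1.0.json?download=1",
            checksum="00000000000000000000000000000000",
        ),
        "sample": core.Index(filename="example_index_sample.json"),
    }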

``REMOTES``
    Should be a dictionary of ``RemoteFileMetadata`` objects, which are used to download the dataset files. See an example below:

.. code-block:: python

    REMOTES = {
        "all": download_utils.RemoteFileMetadata(
            filename="UrbanSound8K.tar.gz",
            url="https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz?download=1",
            checksum="9aa69802bbf37fb986f71ec1483a196e",
            unpack_directories=["UrbanSound8K"],
        ),
    }
Add more ``RemoteFileMetadata`` objects to the ``REMOTES`` dictionary if the dataset is split into multiple files, as in the sketch below.
Please use ``download_utils.RemoteFileMetadata`` to parse the dataset from an online repository, since it takes care of the download process and the checksum validation, and addresses corner cases.
Please do NOT use specific functions like ``download_zip_file`` or ``download_and_extract`` individually in your loader.
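
The following is a hypothetical layout for a dataset distributed as separate audio and annotation archives; the filenames, URLs, and checksums are placeholders:

.. code-block:: python

    # assumes `from mirdata import download_utils` at the top of the module
    REMOTES = {
        "audio": download_utils.RemoteFileMetadata(
            filename="example_audio.zip",
            url="https://zenodo.org/records/0000000/files/example_audio.zip?download=1",
            checksum="00000000000000000000000000000000",
        ),
        "annotations": download_utils.RemoteFileMetadata(
            filename="example_annotations.zip",
            url="https://zenodo.org/records/0000000/files/example_annotations.zip?download=1",
            checksum="11111111111111111111111111111111",
        ),
    }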

.. note::
    The direct url for download and the checksum can be found in the Zenodo entries of the dataset and the index. Bear in mind that the url and checksum for the index will only be available once a maintainer of the Audio Data Loaders Zenodo community has accepted the index upload.
    For other repositories, you may need to generate the checksum yourself.
    You may use the function provided in ``mirdata/validate.py``, as shown below.
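
A minimal sketch, assuming the ``md5`` helper in ``mirdata/validate.py`` and a local copy of the archive:

.. code-block:: python

    from mirdata.validate import md5

    # Compute the MD5 checksum of a local copy of the archive; paste the
    # result into the corresponding REMOTES entry.
    print(md5("UrbanSound8K.tar.gz"))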


Make sure to include, in the docstring of the dataloader, information about the following relevant aspects of the dataset you are integrating:

* The dataset name.
* A general purpose description, the task it is used for.
* Details about the coverage: how many tracks, how many hours of audio, how many classes, the annotations available, etc.
* The license of the dataset (even if you have included the ``LICENSE_INFO`` variable already).
* The authors of the dataset, the organization in which it was created, and the year of creation (even if you have included the ``BIBTEX`` variable already).
* Also reference any relevant link or website that users can check for more information.
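
A skeleton of such a module docstring, where all names and figures are placeholders, might look like:

.. code-block:: python

    """Example Dataset Loader

    Example Dataset is a hypothetical collection of 1,000 tracks (about 10
    hours of audio) annotated with beats and chords, created at Example
    University in 2024 by A. Author et al. It is distributed under a
    CC BY 4.0 license.

    For more information, see https://example.org/dataset.
    """
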

.. note::
    In addition to the module docstring, you should write docstrings for every new class and function you write. See :ref:`the documentation tutorial <documentation_tutorial>` for practical information on best documentation practices.
    The module docstring is important for users to understand the dataset and its purpose.
    Proper documentation also enhances transparency and helps users understand the dataset better.
    Please do not include complicated tables, big pieces of text, or unformatted copy-pasted text.
    It is important that the docstring is clean and the information is very clear to users.
    This will also encourage users to use the dataloader!
    For many more examples, see the `datasets folder <https://github.com/mir-dataset-loaders/mirdata/tree/master/mirdata/datasets>`_.
.. note::
    If the dataset you are trying to integrate stores every track in a separate compressed file, it is not currently supported by ``mirdata``. Feel free to open an issue to discuss a solution (hopefully for the near future!)


.. _add_tests:

3. Add tests
------------

We kindly ask the contributors to **reduce the size of the testing data** if possible (e.g. trimming the audio tracks, keeping only a few rows of the csv files).


4. Update Mirdata documentation
-------------------------------

Before you submit your loader, make sure to:
An example of this for the ``Beatport EDM key`` dataset:
(you can check that this was done correctly by clicking on the readthedocs check when you open a PR). You can find license
badge images and links `here <https://gist.github.com/lukas-h/2a5d00690736b4c3a7ba>`_.

.. _upload_index:

5. Upload the index to Zenodo
-----------------------------

We store all dataset indexes in an online repository on Zenodo.
To use a dataloader, users may retrieve the index by running the ``dataset.download()`` function that is also used to download the dataset.
To download only the index, you may run ``dataset.download(["index"])``. The index will be automatically downloaded and stored in the expected folder in ``mirdata``, as in the usage sketch below.
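
A minimal usage sketch, assuming a hypothetical ``example_dataset`` loader and the default data home:

.. code-block:: python

    import mirdata

    # "example_dataset" is a placeholder; use your dataset's id.
    dataset = mirdata.initialize("example_dataset")
    dataset.download(["index"])  # download only the index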

From a contributor's point of view, you may create the index, store it locally, and develop the dataloader.
All full-index JSON files in ``mirdata/datasets/indexes/`` are included in the ``.gitignore`` file,
therefore there is no need to remove them when pushing to the remote branch during development, since they will be ignored by git.

**Important!** When creating the PR, please `submit your index to our Zenodo community <https://zenodo.org/communities/audio-data-loaders/>`_:

* First, click on ``New upload``.
* Add your index in the ``Upload files`` section.
* Let Zenodo create a DOI for your index: select *No* when asked whether you already have a DOI.
* Resource type is *Other*.
* Title should be *mirdata-<dataset-id>_index_<version>*, e.g. mirdata-baf_index_1.0.
* Add yourself as the Creator of this entry.
* The license of the index should be the `same as mirdata <https://github.com/mir-dataset-loaders/mirdata/blob/main/LICENSE>`_.
* Visibility should be set as *Public*.

.. note::
    *<dataset-id>* is the identifier we use to initialize the dataset using ``mirdata.initialize()``. It's also the filename of your dataset module.


.. _create_pr:

6. Create a Pull Request
------------------------

Please create a Pull Request with all your development. When starting your PR, please use the `new_loader.md template <https://github.com/mir-dataset-loaders/mirdata/blob/master/.github/PULL_REQUEST_TEMPLATE/new_loader.md>`_;
it will simplify the reviewing process and also help you make a complete PR. You can do that by adding
``&template=new_loader.md`` at the end of the url when you are creating the PR:

``...mir-dataset-loaders/mirdata/compare?expand=1`` will become
``...mir-dataset-loaders/mirdata/compare?expand=1&template=new_loader.md``.

.. _update_docs:


Docs
^^^^

