Consider spreading the data into multiple directories #9

cesarsouza · 2017-10-12T19:12:05Z

Right now the entire dataset is contained in a single directory (https://github.com/Jakobovski/free-spoken-digit-dataset/tree/master/recordings). This will not scale once the dataset becomes larger. Depending on the file system, even listing the directory contents with ls can become burdensome after around 10,000 files.

Another reason to do so is that the current layout may prevent the files from being queried through GitHub's developer API in the future. I am building an interface to the dataset that can automatically download, query, and organize it into training and testing sets without first cloning the repository with git. However, the API limits the number of files that can be retrieved from a single directory, and beyond this limit the only option would be to clone the repository and retrieve the files manually.

Regards,
Cesar

Jakobovski · 2017-10-13T08:17:07Z

How would you like to organize the recordings?

cesarsouza · 2017-10-13T18:52:02Z

I would say that the simplest way would be to organize them hierarchically as recordings/<digit>/<speaker>/<digit>_<speaker>_<variation>.wav.
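
To make the proposal concrete, here is a minimal sketch of the mapping from the current flat file names to the nested layout. It assumes file names follow `<digit>_<speaker>_<variation>.wav`; the function name `nested_path` and the default root `recordings` are hypothetical and only for illustration.

```python
from pathlib import PurePosixPath


def nested_path(filename: str, root: str = "recordings") -> PurePosixPath:
    """Return root/<digit>/<speaker>/<filename> for a flat recording name.

    Assumes the name looks like '<digit>_<speaker>_<variation>.wav'.
    """
    stem = PurePosixPath(filename).stem
    # Split the digit off the front and the variation off the back, so a
    # speaker name that itself contains underscores stays in one piece.
    digit, rest = stem.split("_", 1)
    speaker, variation = rest.rsplit("_", 1)
    if not digit.isdigit() or not variation.isdigit():
        raise ValueError(f"unexpected recording name: {filename!r}")
    return PurePosixPath(root) / digit / speaker / filename


print(nested_path("7_jackson_32.wav"))
```

The original file name is kept unchanged inside the new directories, so existing code that parses the name would still work after the move.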

Mistobaan · 2019-08-14T19:20:44Z

+1. Are the files also added using Git LFS?

dansuh17 · 2019-10-02T07:20:22Z

+1, I suggest a more general structure commonly used in many computer vision datasets (like ImageNet), as: recordings/<digit>/<speaker>_<variation>.wav, following the structure <data_root>/<class_label>/<id>.<ext>.
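
As a rough illustration of why this layout is convenient, the sketch below builds a list of (file, label) pairs by treating each subdirectory name as the class label. It assumes the `<data_root>/<class_label>/<id>.wav` structure described above; the function name `index_by_class` is hypothetical.

```python
from pathlib import Path


def index_by_class(data_root):
    """Collect (path, label) pairs from a <data_root>/<class_label>/<id>.wav tree."""
    samples = []
    for class_dir in sorted(Path(data_root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files at the top level
        for wav in sorted(class_dir.glob("*.wav")):
            # The directory name is the label; the file name is only an id.
            samples.append((wav, class_dir.name))
    return samples
```

With this structure the label never has to be parsed out of the file name, which is the main design difference from the nested `<digit>/<speaker>/` proposal.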

Jakobovski · 2019-10-02T07:30:10Z

@dansuh17 Feel free to contribute and I will accept the MR

cesarsouza mentioned this issue Oct 13, 2017

Normalize the recordings to have the same number of channels #10

Closed

cesarsouza mentioned this issue Oct 21, 2017

Add the Free Spoken Digits Dataset to Accord.DataSets accord-net/framework#949

Closed

Jakobovski added the enhancement label Sep 12, 2018
