Consider spreading the data into multiple directories #9
Comments
How would you like to organize the recordings?
I would say that the simplest way would be to organize them hierarchically as
+1, are the files also added using git lfs?
+1, I suggest a more general structure commonly used in many computer vision datasets (like ImageNet), as:
@dansuh17 Feel free to contribute and I will accept the MR
Right now the entire dataset is contained in a single directory (https://github.com/Jakobovski/free-spoken-digit-dataset/tree/master/recordings). This will not scale as the dataset grows. Depending on the file system, even listing the directory contents with `ls` can become burdensome after around 10,000 files.

Another reason to split the data is that the current layout may prevent the files from being queried through GitHub's developer API in the future. I am building an interface to the dataset that can automatically download, query, and organize the dataset into training and testing sets without first having to clone the repository with git. However, there is a limit on the number of files that can be retrieved using this API, and beyond that limit the only option would be to clone the repository and retrieve the files manually.
Regards,
Cesar
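
To make the request concrete, here is a minimal sketch of how a flat recordings directory could be spread into subdirectories. It assumes file names follow a `<digit>_<speaker>_<index>.wav` pattern and uses a `<digit>/<speaker>/` layout. Both are assumptions for illustration, not a layout agreed on in this thread, and the helper names `target_path` and `reorganize` are hypothetical.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def target_path(filename: str) -> Path:
    """Map a flat recording name to a nested relative path.

    Assumes names of the form "<digit>_<speaker>_<index>.wav"
    (an assumption for this sketch, not confirmed by the thread).
    """
    parts = Path(filename).stem.split("_")
    if len(parts) < 3 or not parts[0].isdigit():
        raise ValueError(f"unexpected recording name: {filename!r}")
    digit = parts[0]
    # Everything between the digit and the trailing index is the speaker.
    speaker = "_".join(parts[1:-1])
    return Path(digit) / speaker / filename


def reorganize(flat_dir: Path, nested_dir: Path, dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Plan (and optionally perform) moves from flat_dir into nested_dir."""
    moves = []
    for src in sorted(flat_dir.glob("*.wav")):
        dst = nested_dir / target_path(src.name)
        moves.append((src, dst))
        if not dry_run:
            # Create the per-digit/per-speaker directory on demand.
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
    return moves
```

Calling `reorganize(Path("recordings"), Path("recordings_nested"))` with the default `dry_run=True` only returns the planned moves, so the mapping can be reviewed before any file is touched. Inside a git checkout, `git mv` would be the more natural way to apply the moves so that history follows the files.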