Determine how to handle subdataset level type fields #6

Open
edwardleardi opened this issue Oct 7, 2020 · 3 comments

@edwardleardi
Collaborator

The DAX API needs to know how to locate all the files belonging to a given subdataset in order to load it. Note that this is different from a subdataset's format, which is simply the file format of the subdataset.

Examples of subdataset types:

  • a simple file, identified by its path name (e.g. a txt or csv file)
  • a directory (e.g. a directory of image files, a directory of subdirectories)
  • a list of files (e.g. a txt with paths to all files in the validation set)
  • a regex (e.g. all train files have train_ prepended to the filename)

We need to determine what subdataset types there are and how to include this information. The current proposal is this:

Simple file:

  - file_name: noaa-weather-data-jfk-airport/jfk_weather.csv
    ...
    format: CSV  # rename from type to format
    ...
    type: path_name
       value: noaa-weather-data-jfk-airport/jfk_weather.csv

Regex:

  - file_name: publaynet/train
    ...
    type: regex  
      value: "train/*"

List of files:

  - file_name: tfsc/train_list.txt
    ...
    type: list_of_files  
      value: tfsc/train_list.txt

There's probably a better way of structuring this that avoids the file_name being the same as the value field in some cases, but it's a start.

@ptitzler
Contributor

ptitzler commented Oct 8, 2020

How about using a pattern label that serves as an umbrella? Technically, everything you have listed is a pattern, which may or may not include a wildcard.

  - pattern: /path/to/dir/file.csv
    ...
    format: CSV  
    ...
    type: file
  - pattern: /path/to/dir/*.csv
    ...
    format: CSV  
    ...
    type: regex
  - pattern: /path/to/dir/*
    ...
    format: 
    ...
    type: regex
  - pattern: /path/to/dir/file.txt
    ...
    format: CSV
    ...
    type: listing   # CSV file containing one column only, which has a special meaning
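
To make the proposal above concrete, here is a minimal sketch of how a loader might turn one such entry into a list of files. It is not taken from the DAX code base: the function name `resolve_subdataset`, the `root` argument, and the assumption that patterns and listing entries are relative to `root` are all hypothetical. It also assumes that the `regex` type means shell-style wildcards, since the examples in this thread use `*` the way a glob does.

```python
import glob
import os


def resolve_subdataset(entry, root):
    """Return the file paths that one subdataset entry refers to.

    `entry` is a dict with the `pattern` and `type` keys from the proposal;
    `root` is a hypothetical dataset root that patterns are joined onto.
    """
    pattern = os.path.join(root, entry["pattern"])
    kind = entry["type"]
    if kind == "file":
        # A single file: the pattern is the path itself.
        return [pattern]
    if kind == "regex":
        # Assumption: wildcards like "*.csv" are glob patterns, so glob is
        # used here instead of the re module.
        return sorted(glob.glob(pattern))
    if kind == "listing":
        # Assumption: the listing holds one path per line, relative to root.
        with open(pattern, encoding="utf-8") as f:
            return [os.path.join(root, line.strip()) for line in f if line.strip()]
    raise ValueError(f"unknown subdataset type: {kind!r}")
```

If true regular expressions are wanted for the `regex` type, the glob call would need to be replaced by a directory walk filtered with `re`, which is a different design choice.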

@edwardleardi
Collaborator Author

Makes sense to me; I think that would work. @xuhdev what do you think?

@xuhdev
Collaborator

xuhdev commented Oct 9, 2020

Makes sense. Semantically, it might make more sense to move type under pattern instead of keeping it at the same level (e.g., we may have more fields to describe the pattern itself in the future).
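
As a rough illustration of the nesting suggested here, the sketch below converts a flat entry (with `pattern` and `type` as siblings) into one where `type` sits under `pattern`. The key name `value` for the pattern string and the helper name `nest_type_under_pattern` are assumptions made for this example; the thread does not settle on either.

```python
def nest_type_under_pattern(entry):
    """Convert a flat entry into one where `type` is nested under `pattern`."""
    # Keep every other field (e.g. format) at the top level unchanged.
    nested = {k: v for k, v in entry.items() if k not in ("pattern", "type")}
    # "value" is a hypothetical key name for the pattern string itself.
    nested["pattern"] = {"value": entry["pattern"], "type": entry["type"]}
    return nested


flat = {"pattern": "/path/to/dir/*.csv", "format": "CSV", "type": "regex"}
nested = nest_type_under_pattern(flat)
```

With this layout, any future field that describes the pattern itself could be added inside the `pattern` mapping without touching the top-level keys.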

3 participants