Determine how to handle subdataset level type fields #6

edwardleardi · 2020-10-07T22:21:59Z

Knowing how to locate all the files that belong to a given subdataset is important for the DAX API and for how it loads subdatasets. Note that this is different from a subdataset's format, which is simply the file format of the subdataset.

Examples of subdataset types:

a simple file = its path name (e.g. txt, csv)
a directory (e.g. a directory of image files, a directory of subdirectories)
a list of files (e.g. a txt with paths to all files in the validation set)
a regex (e.g. all train files have train_ prepended to the filename)

We need to determine what subdataset types there are and how to include this information. The current proposal is this:

Simple file:

  - file_name: noaa-weather-data-jfk-airport/jfk_weather.csv
    ...
    format: CSV  # rename from type to format
    ...
    type: path_name
       value: noaa-weather-data-jfk-airport/jfk_weather.csv

Regex:

  - file_name: publaynet/train
    ...
    type: regex  
      value: "train/*"

List of files:

  - file_name: tfsc/train_list.txt
    ...
    type: list_of_files  
      value: tfsc/train_list.txt

There's probably a better way of structuring this that avoids the file_name being the same as the value field in some cases, but it's a start.


ptitzler · 2020-10-08T20:48:39Z

How about using a pattern label that serves as an umbrella? Technically, everything you have listed is a pattern, which might or might not include a wildcard.

  - pattern: /path/to/dir/file.csv
    ...
    format: CSV  
    ...
    type: file
  - pattern: /path/to/dir/*.csv
    ...
    format: CSV  
    ...
    type: regex
  - pattern: /path/to/dir/*
    ...
    format: 
    ...
    type: regex
  - pattern: /path/to/dir/file.txt
    ...
    format: CSV
    ...
    type: listing   # CSV file containing one column only, which has a special meaning
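
To make the proposal above concrete, here is a minimal sketch of how a loader might turn one `pattern`/`type` entry into a list of files. It is not taken from the DAX API: the function name `resolve_files`, the `root` argument, and the choice to treat patterns and listed paths as relative to a dataset root are all assumptions for illustration. Because the example patterns look like shell-style wildcards rather than regular expressions, the sketch handles `type: regex` with Python's `glob` module.

```python
import glob
import os
import tempfile


def resolve_files(entry, root):
    """Return the file paths described by one subdataset entry (hypothetical helper)."""
    pattern = os.path.join(root, entry["pattern"])
    kind = entry["type"]
    if kind == "file":
        # A plain path names exactly one file.
        return [pattern]
    if kind == "regex":
        # Assumption: wildcard patterns such as "dir/*.csv" are shell-style globs.
        return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if kind == "listing":
        # Assumption: one path per line, relative to root; blank lines are skipped.
        with open(pattern) as f:
            return [os.path.join(root, line.strip()) for line in f if line.strip()]
    raise ValueError(f"unknown subdataset type: {kind!r}")


# Usage on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "train"))
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(root, "train", name), "w").close()
    with open(os.path.join(root, "train_list.txt"), "w") as f:
        f.write("train/a.csv\n\ntrain/b.csv\n")

    single = resolve_files({"pattern": "train/a.csv", "type": "file"}, root)
    csv_files = resolve_files({"pattern": "train/*.csv", "type": "regex"}, root)
    listed = resolve_files({"pattern": "train_list.txt", "type": "listing"}, root)

print([os.path.basename(p) for p in csv_files])
```

All three types reduce to the same output, a list of paths, which is the sense in which `pattern` acts as an umbrella.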

edwardleardi · 2020-10-08T21:14:03Z

Makes sense to me, I think that would work. @xuhdev what do you think?

xuhdev · 2020-10-09T08:21:38Z

Makes sense. Semantically it might make more sense if we move type under pattern instead of at the same level (e.g., we may have more fields to describe the pattern itself in the future).
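
One possible reading of this suggestion, sketched in Python as a conversion from the flat layout to a nested one. The thread does not settle the nested key names: `value` is borrowed from the original proposal and, like the helper name `to_nested`, is an assumption here.

```python
def to_nested(entry):
    """Move a top-level `type` under `pattern`, leaving other fields untouched."""
    if isinstance(entry.get("pattern"), dict):
        # Already in the nested layout; return a shallow copy.
        return dict(entry)
    nested = {k: v for k, v in entry.items() if k not in ("pattern", "type")}
    # Hypothetical nested shape: the path string lives under `value`.
    nested["pattern"] = {"value": entry["pattern"], "type": entry["type"]}
    return nested


flat = {"pattern": "/path/to/dir/*.csv", "format": "CSV", "type": "regex"}
nested = to_nested(flat)
print(nested)
```

With this shape, any future fields that describe the pattern itself can sit next to `type` without crowding the subdataset level.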
