-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ingest QOL tweaks #1454
Ingest QOL tweaks #1454
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1454 +/- ##
==========================================
+ Coverage 72.18% 72.20% +0.02%
==========================================
Files 118 119 +1
Lines 6133 6145 +12
Branches 725 726 +1
==========================================
+ Hits 4427 4437 +10
- Misses 1550 1552 +2
Partials 156 156 ☔ View full report in Codecov by Sentry. |
e307d81
to
c8399fd
Compare
c8399fd
to
37314e7
Compare
37314e7
to
2d90da1
Compare
dcpy/lifecycle/ingest/run.py
Outdated
|
||
TMP_DIR = Path("tmp") | ||
INGEST_DIR = BASE_PATH / "ingest" | ||
STAGING_DIR = INGEST_DIR / "staging" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a bit of a nit: naming this var as INGEST_STAGING_DIR
feels more intuitive to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment for OUTPUT_DIR
on line 12 below
ingest_parent_dir: Path = STAGING_DIR, | ||
) -> None: | ||
ingest_dir = ingest_parent_dir / dataset / "staging" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a fix here: we don't need the line 113 sinc you are using the STAGING_DIR path. Also, I would rename ingest_parent_dir
to ingest_staging_dir
in the fn signuture
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should clean this up, but am still going to provide a folder for this to run in that's separated from the "usual" ingest folders, just to keep it separate from anyone running an "official" job on their own comp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the latest commit for renaming - mainly, worked to distinguish any path that's set for ingest
(i.e. a variable or override that's set for these sort of "parent" folders that contain many datasets) vs any folder that's specific to the dataset being run.
It's still a little more complicated than it should be but given how this operates now I think it makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dcpy/lifecycle/ingest/run.py
Outdated
) | ||
staging_dir.mkdir(parents=True) | ||
else: | ||
staging_dir.mkdir(parents=True, exist_ok=True) | ||
|
||
with open(staging_dir / "config.json", "w") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace "config.json" with CONFIG_FILENAME
that you defined above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ty - this looks like a rebase mistake, the same line is right below it but fixed
Curious, what is the need for 4th commit? In which case you wouldn't want to use the template for defining local source? |
Great question. Rationale is that for something that's a local csv (or emailed csv) that might have a different filename each time. I don't think that it makes sense to have to commit changes (or just modify the template) to archive it, it's something that should be valid to override at runtime |
31ade6d
to
4afeef4
Compare
All coming up as part of #1334 and the associated datasets moving over to ingest.
ingest
function (the function that runs the whole ingest process) to init so that it's easily accessible.lifecycle
dir for "staging" the process.lifecycle
dir to create a local file system reminiscent ofedm-recipes/datasets
where the outputs are moved to. Just seemed to be nice for troubleshooting (slash running without a connection to s3)