Ingest QOL tweaks #1454

Merged
merged 7 commits into from
Feb 28, 2025

Conversation

fvankrieken
Contributor

@fvankrieken fvankrieken commented Feb 13, 2025

These all came up as part of #1334 and the associated datasets moving over to ingest.

  1. originally moved too much logic to __init__.py; now this mainly just moves the cli target and the ingest function (the function that runs the whole ingest process) to __init__.py so that they're easily accessible
  2. make use of the .lifecycle dir for "staging" the process
  3. make use of the .lifecycle dir to create a local file system reminiscent of edm-recipes/datasets, where the outputs are moved to. This just seemed nice for troubleshooting (and for running without a connection to s3)
  4. ability to provide the path to a source dataset file when running via cli. This just overrides whatever source is found in the template
  5. just make one "archive_dataset". @alexrichey based on our pairing yesterday though, maybe I'll scrap this - you could argue we want a different connector class for "raw" datasets and our parquet datasets, since we interact with them in fairly different ways
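
For anyone skimming, here is a minimal sketch of what items 2 and 3 could look like. It is not the code in this PR: the helper names, the "ingest/staging" and "ingest/datasets" folder names, and the dataset/version layout are all assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def staging_dir_for(lifecycle_dir: Path, dataset: str) -> Path:
    """Hypothetical helper: per-dataset scratch folder under the lifecycle dir."""
    path = lifecycle_dir / "ingest" / "staging" / dataset
    path.mkdir(parents=True, exist_ok=True)
    return path


def copy_to_local_datasets(
    lifecycle_dir: Path, dataset: str, version: str, staged_file: Path
) -> Path:
    """Hypothetical helper: copy an output into a local datasets/<dataset>/<version>/ tree.

    The <dataset>/<version> layout is an assumption about what a folder
    "reminiscent of edm-recipes/datasets" might look like.
    """
    target_dir = lifecycle_dir / "ingest" / "datasets" / dataset / version
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(staged_file, target_dir / staged_file.name))


# Demo in a throwaway directory so nothing touches a real .lifecycle folder.
with tempfile.TemporaryDirectory() as tmp:
    lifecycle_dir = Path(tmp) / ".lifecycle"
    staged = staging_dir_for(lifecycle_dir, "example_dataset") / "example_dataset.parquet"
    staged.write_bytes(b"placeholder")
    local_copy = copy_to_local_datasets(lifecycle_dir, "example_dataset", "20250213", staged)
    relative = local_copy.relative_to(lifecycle_dir).as_posix()
    copied_bytes = local_copy.read_bytes()
    print(relative)
```

Keeping the local tree under the same parent as staging means a run can be inspected end to end without s3, which is the troubleshooting case described in item 3.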


codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 69.76744% with 13 lines in your changes missing coverage. Please review.

Project coverage is 72.20%. Comparing base (32d4887) to head (4afeef4).
Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dcpy/lifecycle/scripts/validate_ingest.py | 0.00% | 9 Missing ⚠️ |
| dcpy/lifecycle/_cli.py | 0.00% | 2 Missing ⚠️ |
| .../lifecycle/scripts/ingest_with_library_fallback.py | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1454      +/-   ##
==========================================
+ Coverage   72.18%   72.20%   +0.02%     
==========================================
  Files         118      119       +1     
  Lines        6133     6145      +12     
  Branches      725      726       +1     
==========================================
+ Hits         4427     4437      +10     
- Misses       1550     1552       +2     
  Partials      156      156              


@fvankrieken fvankrieken force-pushed the fvk-ingest-qol-tweaks branch 12 times, most recently from e307d81 to c8399fd Compare February 25, 2025 19:41
@fvankrieken fvankrieken marked this pull request as ready for review February 26, 2025 15:25
@fvankrieken fvankrieken force-pushed the fvk-ingest-qol-tweaks branch from c8399fd to 37314e7 Compare February 27, 2025 02:18

TMP_DIR = Path("tmp")
INGEST_DIR = BASE_PATH / "ingest"
STAGING_DIR = INGEST_DIR / "staging"
Contributor

a bit of a nit: naming this var as INGEST_STAGING_DIR feels more intuitive to me

Contributor

@sf-dcp sf-dcp Feb 27, 2025

same comment for OUTPUT_DIR on line 12 below

Comment on lines 111 to 113
ingest_parent_dir: Path = STAGING_DIR,
) -> None:
ingest_dir = ingest_parent_dir / dataset / "staging"
Contributor

Need a fix here: we don't need line 113 since you are using the STAGING_DIR path. Also, I would rename ingest_parent_dir to ingest_staging_dir in the fn signature

Contributor Author

I should clean this up, but I'm still going to provide a folder for this to run in that's separate from the "usual" ingest folders, just to keep it apart from anyone running an "official" job on their own computer

Contributor Author

@fvankrieken fvankrieken Feb 27, 2025

See the latest commit for the renaming - mainly, I worked to distinguish any path that's set for ingest as a whole (i.e. a variable or override for the "parent" folders that contain many datasets) from any folder that's specific to the dataset being run.

4afeef4

It's still a little more complicated than it should be but given how this operates now I think it makes sense
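
To make the distinction concrete, a rough sketch follows. The names are illustrative rather than what 4afeef4 actually uses: INGEST_STAGING_DIR is the name suggested in the review above, while the BASE_PATH value and the dataset_staging_dir helper are assumptions.

```python
from pathlib import Path

# "Parent" folders that contain many datasets. The BASE_PATH value is an assumption.
BASE_PATH = Path(".lifecycle")
INGEST_DIR = BASE_PATH / "ingest"
INGEST_STAGING_DIR = INGEST_DIR / "staging"


def dataset_staging_dir(
    dataset: str, ingest_staging_dir: Path = INGEST_STAGING_DIR
) -> Path:
    """Hypothetical helper: folder specific to the one dataset being run.

    The parent can be overridden so that a script can work in its own
    sandbox, away from the "usual" ingest folders.
    """
    return ingest_staging_dir / dataset


default_dir = dataset_staging_dir("example_dataset")
sandbox_dir = dataset_staging_dir("example_dataset", Path("tmp") / "validate_ingest")
print(default_dir)
print(sandbox_dir)
```

With this split, only the parent path is ever passed around as a setting or override, and the dataset-specific folder is always derived from it in one place.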

Contributor Author

This is how the "validate ingest" script outputs stuff now

image

)
staging_dir.mkdir(parents=True)
else:
staging_dir.mkdir(parents=True, exist_ok=True)

with open(staging_dir / "config.json", "w") as f:
Contributor

@sf-dcp sf-dcp Feb 27, 2025

replace "config.json" with CONFIG_FILENAME that you defined above?

Contributor Author

@fvankrieken fvankrieken Feb 27, 2025

ty - this looks like a rebase mistake, the same line is right below it but fixed
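
For reference, the pattern being suggested looks roughly like the sketch below. The write_config helper is hypothetical, and the value of CONFIG_FILENAME is assumed from the string literal in the diff rather than confirmed from the module.

```python
import json
import tempfile
from pathlib import Path

# Assumed value, taken from the "config.json" literal in the snippet above.
CONFIG_FILENAME = "config.json"


def write_config(staging_dir: Path, config: dict) -> Path:
    """Hypothetical helper: write the run config into the staging dir."""
    # exist_ok=True makes repeat runs against the same folder safe.
    staging_dir.mkdir(parents=True, exist_ok=True)
    config_path = staging_dir / CONFIG_FILENAME
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config_path


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    staging_dir = Path(tmp) / "staging" / "example_dataset"
    config_path = write_config(staging_dir, {"id": "example_dataset"})
    written_name = config_path.name
    loaded = json.loads(config_path.read_text())
    print(written_name, loaded)
```

Using the constant in both the writer and any reader keeps the filename from drifting if it is ever changed.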

@sf-dcp
Contributor

sf-dcp commented Feb 27, 2025

Curious, what is the need for the 4th commit? In which case would you not want to use the template to define the local source?

@fvankrieken
Contributor Author

> Curious, what is the need for the 4th commit? In which case would you not want to use the template to define the local source?

Great question. The rationale is that a local csv (or an emailed csv) might have a different filename each time. I don't think it makes sense to have to commit changes (or even just modify the template) to archive it; it's something that should be valid to override at runtime
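
A small sketch of the override idea, with loud caveats: argparse is used here only to keep the example self-contained, and the --local-file-path flag, the resolve_source helper, and the template shape are all hypothetical, not the actual dcpy CLI or template schema.

```python
import argparse
from pathlib import Path
from typing import Optional

# Hypothetical stand-in for the source section of a dataset template.
TEMPLATE_SOURCE = {"type": "local_file", "path": "data/example.csv"}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI: a dataset name plus an optional source file override."""
    parser = argparse.ArgumentParser(prog="ingest-sketch")
    parser.add_argument("dataset")
    parser.add_argument("--local-file-path", type=Path, default=None)
    return parser


def resolve_source(template_source: dict, local_file_path: Optional[Path]) -> dict:
    """Use the template's source unless a file path was given at runtime."""
    if local_file_path is None:
        return template_source
    # Return a copy so the template itself is never mutated.
    return {**template_source, "path": local_file_path.as_posix()}


args = build_parser().parse_args(
    ["example_dataset", "--local-file-path", "downloads/emailed_2025_02.csv"]
)
source = resolve_source(TEMPLATE_SOURCE, args.local_file_path)
unchanged = resolve_source(TEMPLATE_SOURCE, None)
print(source)
```

The template stays the committed default, and a one-off filename only exists for the run that needs it.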

@fvankrieken fvankrieken force-pushed the fvk-ingest-qol-tweaks branch from 31ade6d to 4afeef4 Compare February 27, 2025 18:44
@fvankrieken fvankrieken requested a review from sf-dcp February 27, 2025 19:31
@fvankrieken fvankrieken merged commit 38b10fd into main Feb 28, 2025
22 of 23 checks passed
@fvankrieken fvankrieken deleted the fvk-ingest-qol-tweaks branch February 28, 2025 14:10
Labels
None yet
Projects
Status: Done
4 participants