Add simple generate summaries and totals functions that group by directory. #103
base: main
Conversation
…ctory name instead of by dandiset.
for more information, see https://pre-commit.ci
```python
str(file_path.absolute())
for file_path in natsort.natsorted(seq=directory.rglob(pattern="*"))
if file_path.is_file()
```
Now that I think about this some more, this will throw off the record keeping of our Drogon setup because of the asynchronous CRON job that consolidates the original S3 logs (which would be recorded) into the `.log` format.
That said, I think it's great to offer a solution for the de facto S3 log structure, which therefore ought to be the default.

So I propose, as with other calls in the current API, adding a keyword `mode` to this method (and exposing the option `--mode` on the CLI) with the following behavior:
```python
def extract_directory(
    self,
    *,
    directory: str | pathlib.Path,
    limit: int | None = None,
    mode: typing.Literal["dandi"] | None = None,
) -> None:
    directory = pathlib.Path(directory)
    pattern = "*.log" if mode == "dandi" else "*-*-*-*-*-*-*"
```
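A minimal sketch of how that `--mode` option could be wired up on the CLI side, assuming a click-based command; the command name, option help text, and the stand-in `extract_directory` function here are illustrative guesses, not the project's confirmed API:

```python
import pathlib
import typing

import click


def extract_directory(*, directory: str | pathlib.Path, mode: typing.Literal["dandi"] | None = None) -> None:
    # Stand-in for the extractor method proposed above; just reports the chosen pattern.
    pattern = "*.log" if mode == "dandi" else "*-*-*-*-*-*-*"
    print(f"Extracting {directory} with pattern {pattern!r}")


@click.command()
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
@click.option(
    "--mode",
    type=click.Choice(["dandi"]),
    default=None,
    help="Use the DANDI-specific '*.log' pattern instead of the raw S3 log naming.",
)
def cli(directory: str, mode: str | None) -> None:
    extract_directory(directory=directory, mode=mode)


if __name__ == "__main__":
    cli()
```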
That way we preserve the oddness of the DANDI-specific stuff as a special case.

Keeping the `is_file()` check is fine.
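For reference, raw S3 server access log objects are named `YYYY-mm-DD-HH-MM-SS-UniqueString` (possibly behind a target prefix), which is what the `*-*-*-*-*-*-*` pattern is meant to capture. A quick sanity check with `fnmatch`, which follows essentially the same matching rules as `pathlib` globbing (the file names below are illustrative only):

```python
import fnmatch

raw_s3_key = "2023-01-01-00-00-00-ABC123DEF456"  # typical raw S3 access log object name
consolidated_log = "consolidated.log"            # hypothetical consolidated .log file

print(fnmatch.fnmatch(raw_s3_key, "*-*-*-*-*-*-*"))        # True: at least six hyphens
print(fnmatch.fnmatch(consolidated_log, "*-*-*-*-*-*-*"))  # False: no hyphens
print(fnmatch.fnmatch(consolidated_log, "*.log"))          # True: matches the 'dandi' mode pattern
```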
Indeed, I struggled with what to call these as well (even the phrase 'archive' is a bit close to our own use cases...). The goal of the tool (though I need to make this clearer in some dev documentation on data structures...) is to create a mirror of the S3 bucket contents, where every object in the bucket is its own directory with the 3

The most general way I can think of to refer to what we call `data/dandisets` is then 'top level', as in 'top level summaries and totals'. We would take that to mean datasets for OpenNeuro (because your top level is datasets), and DANDI of course has a special separate structure that has to be manually assembled (technically we even have other unmentionable things at our top level besides). What do you think?
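A minimal sketch of what grouping by 'top level' could look like, treating object keys as plain strings; the helper name and example keys are hypothetical, not the project's actual API:

```python
from collections import defaultdict


def group_keys_by_top_level(keys: list[str]) -> dict[str, list[str]]:
    """Group S3 object keys by their first path component (the 'top level' prefix)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        top_level = key.split("/", 1)[0]
        groups[top_level].append(key)
    return dict(groups)


# Illustrative keys only: OpenNeuro-style top-level dataset prefixes.
keys = ["ds000001/sub-01/anat.nii.gz", "ds000001/README", "ds000002/CHANGES"]
print(group_keys_by_top_level(keys))
# {'ds000001': ['ds000001/sub-01/anat.nii.gz', 'ds000001/README'], 'ds000002': ['ds000002/CHANGES']}
```

For OpenNeuro this naturally yields one group per dataset, while the DANDI layout would still need its manually assembled special case.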
PR looks great! I will add more tests to enhance coverage in a follow-up, but I'd say this is good to merge once we get the 'mode' of extraction file patterns ironed out in the above comment.
src/s3_log_extraction/summarize/_generate_all_dataset_summaries.py
…s.py Co-authored-by: Cody Baker <[email protected]>
… rng.integers call
…g-extraction into enh/directory_based_totals
LGTM @rwblair ready to merge? I will add tests for all of this in a follow-up after the performance benchmarking.
src/s3_log_extraction/extractors/_dandi_s3_log_access_extractor.py
for more information, see https://pre-commit.ci
I used `dataset` in place of `dandiset`, but that's not honest. It just so happens that all of OpenNeuro's datasets have their own root-level prefix in the bucket. Because of that, I'm not sure they're a good default.

Not worth merging as is, but more of a starting point to discuss what the default behavior of `update summaries` and `update totals` should be.