Add simple generate summaries and totals functions that group by directory. #103
base: main
Conversation
…ctory name instead of by dandiset.
for more information, see https://pre-commit.ci
```python
str(file_path.absolute())
for file_path in natsort.natsorted(seq=directory.rglob(pattern="*"))
if file_path.is_file()
```
Now that I think about this some more, this will throw off the record keeping of our Drogon setup because of the asynchronous CRON job that consolidates the original S3 logs (which would be recorded) into the `.log` format.
That said, I think it's great to offer a solution for the de facto S3 log structure, which therefore ought to be the default.

So I propose, as with other calls in the current API, adding a keyword `mode` to this method (and exposing the option `--mode` on the CLI) with the following behavior:
```python
def extract_directory(
    self,
    *,
    directory: str | pathlib.Path,
    limit: int | None = None,
    mode: typing.Literal["dandi"] | None = None,
) -> None:
    directory = pathlib.Path(directory)
    pattern = "*.log" if mode == "dandi" else "*-*-*-*-*-*-*"
```
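A minimal sketch of how that `--mode` option could be wired up on the CLI side, assuming a click-based command; the command name, option help text, and the stand-in `extract_directory` function here are illustrative guesses, not the project's confirmed API:

```python
import pathlib
import typing

import click


def extract_directory(*, directory: str | pathlib.Path, mode: typing.Literal["dandi"] | None = None) -> None:
    # Stand-in for the extractor method proposed above; just reports the chosen pattern.
    pattern = "*.log" if mode == "dandi" else "*-*-*-*-*-*-*"
    print(f"Extracting {directory} with pattern {pattern!r}")


@click.command()
@click.argument("directory", type=click.Path(exists=True, file_okay=False))
@click.option(
    "--mode",
    type=click.Choice(["dandi"]),
    default=None,
    help="Use the DANDI-specific '*.log' pattern instead of the raw S3 log naming.",
)
def cli(directory: str, mode: str | None) -> None:
    extract_directory(directory=directory, mode=mode)


if __name__ == "__main__":
    cli()
```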
That way we preserve the oddness of the DANDI-specific stuff as a special case.

Keeping the `is_file()` check is fine.
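For reference, raw S3 server access log objects are named `YYYY-mm-DD-HH-MM-SS-UniqueString` (possibly behind a target prefix), which is what the `*-*-*-*-*-*-*` pattern is meant to capture. A quick sanity check with `fnmatch`, which follows essentially the same matching rules as `pathlib` globbing (the file names below are illustrative only):

```python
import fnmatch

raw_s3_key = "2023-01-01-00-00-00-ABC123DEF456"  # typical raw S3 access log object name
consolidated_log = "consolidated.log"            # hypothetical consolidated .log file

print(fnmatch.fnmatch(raw_s3_key, "*-*-*-*-*-*-*"))        # True: at least six hyphens
print(fnmatch.fnmatch(consolidated_log, "*-*-*-*-*-*-*"))  # False: no hyphens
print(fnmatch.fnmatch(consolidated_log, "*.log"))          # True: matches the 'dandi' mode pattern
```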
Indeed, I struggled with what to call these as well (even the phrase 'archive' is a bit close to our own use cases...). The goal of the tool (though I need to make this clearer in some dev documentation on data structures...) is to create a mirror of the S3 bucket contents, where every object in the bucket is its own directory with the 3

The most general way I can think of to refer to what we call `data/dandisets` is then 'top level', as in 'top level summaries and totals'. We would take that to mean datasets for OpenNeuro (because your top level is datasets), and DANDI of course has a special separate structure that has to be manually assembled (technically we even have other unmentionable things at our top level besides). What do you think?
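A minimal sketch of what grouping by 'top level' could look like, treating object keys as plain strings; the helper name and example keys are hypothetical, not the project's actual API:

```python
from collections import defaultdict


def group_keys_by_top_level(keys: list[str]) -> dict[str, list[str]]:
    """Group S3 object keys by their first path component (the 'top level' prefix)."""
    groups: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        top_level = key.split("/", 1)[0]
        groups[top_level].append(key)
    return dict(groups)


# Illustrative keys only: OpenNeuro-style top-level dataset prefixes.
keys = ["ds000001/sub-01/anat.nii.gz", "ds000001/README", "ds000002/CHANGES"]
print(group_keys_by_top_level(keys))
# {'ds000001': ['ds000001/sub-01/anat.nii.gz', 'ds000001/README'], 'ds000002': ['ds000002/CHANGES']}
```

For OpenNeuro this naturally yields one group per dataset, while the DANDI layout would still need its manually assembled special case.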
PR looks great! I will add more tests to enhance coverage in a follow-up, but I'd say this is good to merge once we get the 'mode' of extraction file patterns ironed out in the above comment.
src/s3_log_extraction/summarize/_generate_all_dataset_summaries.py
…s.py Co-authored-by: Cody Baker <[email protected]>
… rng.integers call
…g-extraction into enh/directory_based_totals
LGTM @rwblair ready to merge? I will add tests for all of this in a follow-up after the performance benchmarking.
src/s3_log_extraction/extractors/_dandi_s3_log_access_extractor.py
for more information, see https://pre-commit.ci
I used `dataset` in place of `dandiset`, but that's not honest. It just so happens that all of OpenNeuro's datasets have their own root-level prefix in the bucket. Because of that, I'm not sure they're a good default.

Not worth merging as is, but more of a starting point to discuss what the default behavior of `update summaries` and `update totals` should be.