Skip to content

Add simple generate summaries and totals functions that group by directory. #103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

rwblair
Copy link

@rwblair rwblair commented Jun 10, 2025

I used dataset in place of dandiset but that's not honest. It just so happens that all of Openneuro's datasets have their own root level prefix in the bucket. Because of that I'm not sure they're a good default.

Not worth merging as is, but more of a starting point to discuss what default behavior of update summaries and update totals should be.

Comment on lines 131 to 133
str(file_path.absolute())
for file_path in natsort.natsorted(seq=directory.rglob(pattern="*"))
if file_path.is_file()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about this some more, this will throw off the record keeping of our Drogon setup because of the asynchronous CRON job that consolidates original S3 logs (which would be recorded) into the .log format

That said, I think it's great to offer a solution for the de facto S3 log structure, which therefore ought to be the default

So I propose, as with other calls in the current API, to add a keyword mode to this method (and expose the option --mode to the CLI) that has the following behavior

def extract_directory(self, *, directory: str | pathlib.Path, limit: int | None = None, mode: typing.Literal["dandi"] | None = None) -> None:
    directory = pathlib.Path(directory)

    pattern = "*.log" if mode == "dandi" else "*-*-*-*-*-*-*"

That way we preserve the oddness of DANDI specific stuff as a special case

Keeping the is_file() is fine

@CodyCBakerPhD
Copy link
Collaborator

I used dataset in place of dandiset but that's not honest. It just so happens that all of Openneuro's datasets have their own root level prefix in the bucket. Because of that I'm not sure they're a good default.

Indeed I struggled with what to call these as well (even the phrase 'archive' is a bit close to our own use cases...)

The goal of the tool (though I need to make this clearer in some dev documentation on data structures...) is to create a mirror of the S3 bucket contents, where every object in the bucket is it's own directory with the 3 .txt files contained therein

The most general way I can think of referring to what we call data/dandisets is then, 'top level', as in 'top level summaries and totals' and we take that to mean datasets for OpenNeuro (because your top level is datasets) and DANDI of course has a special separate structure that has to be manually assembled (technically we even have other unmentionable things at our top-level besides blobs and zarr) to match the same concept

What do you think?

Not worth merging as is, but more of a starting point to discuss what default behavior of update summaries and update totals should be.

PR looks great! I will add more tests to enhance coverage in a follow-up, but I'd say this is good to merge once we get the 'mode' of extraction file patterns ironed out in the above comment

@CodyCBakerPhD
Copy link
Collaborator

LGTM @rwblair ready to merge?

I will add tests for all of this in a followup after the performance benchmarking

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants