API for incremental model re-hashing #160
Comments
A simple first iteration could be to support just a list of files to recompute a hash for. The serializer would list the files on disk and re-serialize those that need to be recomputed. For the others, it would take the hash from the existing signature file.
I think we always need two lists: one for files that got deleted and one for files that need to be rehashed. For a deleted file, we just remove it from the manifest, and error if it was not present. For files that need to be rehashed, we error if they're not present and recompute the hash otherwise. Moving/renaming files and just copying the hash would speed up the update process, since we wouldn't need to re-read the file's content, but it risks a misuse of the API where a file that's both moved and modified ends up with the wrong hash.
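To make the two-list semantics above concrete, here is a minimal sketch. The function names (`hash_file`, `update_manifest`), the manifest shape (a dict mapping relative path to SHA-256 hex digest), and the reading of "not present" as "not on disk" for rehashed files are all assumptions for illustration, not the library's API.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_manifest(model_dir, manifest, deleted, rehash):
    """Hypothetical helper: apply the two lists and return a new manifest.

    `manifest` maps relative paths to hex digests (assumed format).
    """
    root = Path(model_dir)
    updated = dict(manifest)  # leave the caller's manifest untouched

    for rel in deleted:
        # A deleted file must have been in the old manifest.
        if rel not in updated:
            raise KeyError(f"deleted file not in manifest: {rel}")
        del updated[rel]

    for rel in rehash:
        # A file to rehash must exist on disk (assumed meaning of "present").
        path = root / rel
        if not path.is_file():
            raise FileNotFoundError(f"file to rehash not on disk: {rel}")
        updated[rel] = hash_file(path)

    return updated
```

Entries in neither list keep their old digest without being read again, which is where the speedup comes from. It is also where the risk comes from: a file modified but not listed keeps a stale hash.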
Depends what semantics we define for the list. If we allow files in the list to not be present on disk, the caller could pass a list of changed files (deleted, renamed or added) without needing to know the details of what was changed. Our lib would figure out what to recompute by listing existing files on disk and using the list to determine which ones to recompute (deleted files would simply be ignored since they won't be on disk). I'm thinking that's as simple as it gets for callers. They would be able to use an existing git command to list the changed files, without parsing what sort of changes were made. To be honest, I have not played with git yet to see what command is needed to get the diff between a commit and the latest push :) A slightly more advanced solution is for callers to pass the type of changes, either via two lists as you suggest or via more structured parameters (as in the original description).
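A rough sketch of the single-list semantics described above, under stated assumptions: `rehash_changed` and `hash_file` are hypothetical names, the manifest is assumed to be a dict of relative POSIX path to SHA-256 hex digest, and a file found on disk but missing from the old manifest is treated as newly added and hashed.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rehash_changed(model_dir, manifest, changed):
    """Hypothetical helper: rebuild the manifest from what is on disk.

    `changed` may name deleted, renamed, added or modified files; the
    caller does not need to say which.
    """
    root = Path(model_dir)
    changed = set(changed)
    on_disk = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )

    updated = {}
    for rel in on_disk:
        if rel in changed or rel not in manifest:
            # Listed as changed, or unknown to the old manifest: re-read it.
            updated[rel] = hash_file(root / rel)
        else:
            # Unchanged: reuse the existing digest without reading the file.
            updated[rel] = manifest[rel]
    # Anything in `changed` that is not on disk (e.g. deleted) is ignored.
    return updated
```

With this shape, the output of something like `git diff --name-only` could be passed straight in as `changed`, since entries for files that no longer exist are simply skipped.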
Oh, you're thinking from the point of view of CLI usage. I was thinking from the point of view of an ML training pipeline that needs to orchestrate between runs. Both are valid; we should support both scenarios.
Can you provide more details about the requirements for this? What are the constraints? How are models stored between runs? Where does #160 (comment) fail for this use case?
@McPatate what would you imagine an integration looking like? Would the HF API invoke the git command or use an existing Python git API? Or do you imagine a dedicated CLI for it, separate from the HF Python framework API?
We want the ML training pipeline code to call a library directly. If the pipeline knows that it is going to write files X, Y, Z, it should be able to call the signing library saying that only those files changed. Using a CLI is a solution, but it would require a separate process that some MLOps engineers might not follow. Having this in the pipeline code itself would uplift it everywhere.
I was not thinking in terms of CLI vs API (both will be available), but in terms of whether the API interface works for the use case or not. In #160 (comment), would the "simple"
Oh, I misunderstood, given the With an explicit list of files that are changed and deleted we don't need to scan the model tree and compare with a list, but we can miss some modifications if the library users don't pass the file in the proper list. And we'll have to define the error scenarios. With a |
Yeah, I think it's simple and the slowdown is likely negligible: even with hundreds of files on disk, listing and comparing them is fairly fast. It's also easier for the caller to avoid mistakes.
There exist repos with ungodly file structures, but let's not consider them; they are pretty rare 😄 Even https://huggingface.co/datasets/HuggingFaceFW/fineweb "only" has a couple thousand files.
I'm not exactly sure how |
Provide an API that allows re-computing the hash of a subset of files in a model. This is useful in cases where only a (small) set of files has changed, e.g. if a user has updated a README in a Hugging Face repo with several hundred GB of files.
The API may look something like the following:
with
Example:
Something along these lines. The API would use the existing manifest file and update it by re-computing only the hashes for files that need to be re-hashed.
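As a rough illustration of updating an existing manifest file in place, here is a minimal sketch. Everything in it is an assumption: the manifest is taken to be a standalone JSON file mapping relative paths to SHA-256 hex digests, and `update_manifest_file` is a hypothetical name, not the API proposed in this issue.

```python
import hashlib
import json
from pathlib import Path


def update_manifest_file(manifest_path, model_dir, files_to_rehash):
    """Hypothetical helper: re-hash only the given files and rewrite the manifest.

    Assumes a JSON manifest of the form {"relative/path": "<sha256 hex>"}.
    """
    manifest_path = Path(manifest_path)
    root = Path(model_dir)
    manifest = json.loads(manifest_path.read_text())

    for rel in files_to_rehash:
        digest = hashlib.sha256()
        with (root / rel).open("rb") as f:
            # Read in 1 MiB chunks so large files are not loaded at once.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        manifest[rel] = digest.hexdigest()

    # All other entries are written back unchanged, without touching the files.
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

For the README example above, only `README.md` would be read from disk; the entries for the large weight files are carried over from the existing manifest.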