change default shard size to 1GB #357

spencerschrock · 2025-02-26T18:48:55Z

Summary

The value is based on the benchmarks in #356.

The exact speed improvement depends on the size of the models being
serialized and of the individual files in the model. Speed improvements
ranged from 5% to 87%.
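
For readers unfamiliar with sharded hashing, here is a minimal sketch of the idea, not the library's actual code. The function name `hash_file_in_shards`, the use of SHA-256, and the `chunk_size` parameter are assumptions made for illustration only. Each shard is hashed independently, and data is read in small chunks so a 1GB shard is never held in memory at once.

```python
import hashlib
import os


def hash_file_in_shards(path, shard_size=1_000_000_000, chunk_size=1_048_576):
    """Return one (start, end, hex_digest) tuple per shard of the file.

    Hypothetical helper for illustration; SHA-256 is an assumed choice.
    """
    file_size = os.path.getsize(path)
    results = []
    with open(path, "rb") as f:
        # Shards are contiguous, so the file is read sequentially.
        for start in range(0, file_size, shard_size):
            end = min(start + shard_size, file_size)
            digest = hashlib.sha256()
            remaining = end - start
            while remaining > 0:
                chunk = f.read(min(chunk_size, remaining))
                if not chunk:
                    break  # file shrank while reading; stop early
                digest.update(chunk)
                remaining -= len(chunk)
            results.append((start, end, digest.hexdigest()))
    return results
```

A larger `shard_size` means fewer digests to create and record per file, which is one plausible reason the benchmarks show less overhead as shard size grows.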

Manifest size is also influenced by shard size (and the model /
individual file size). Increasing shard size from ~1MB to 1GB decreases
manifest size by 3 orders of magnitude (99.9% reduction).
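
The arithmetic behind the manifest size claim can be sketched as follows. This assumes one manifest entry per shard and at least one entry per file; `shard_count`, `manifest_entries`, and the file sizes are hypothetical and not taken from the benchmarks.

```python
def shard_count(file_size: int, shard_size: int) -> int:
    """Shards needed for one file (assumption: at least one entry per file)."""
    return max(1, -(-file_size // shard_size))  # ceiling division


def manifest_entries(file_sizes, shard_size: int) -> int:
    """Total entries, assuming one manifest entry per shard."""
    return sum(shard_count(size, shard_size) for size in file_sizes)


# Hypothetical model: two large weight files and one small config file.
sizes = [10_000_000_000, 4_000_000_000, 2_000]

small = manifest_entries(sizes, 1_000_000)      # 10_000 + 4_000 + 1 = 14_001
large = manifest_entries(sizes, 1_000_000_000)  # 10 + 4 + 1 = 15
reduction = 1 - large / small                   # a bit under 99.9% here
```

Files smaller than the shard size still contribute one entry each, which is why the exact reduction depends on the individual file sizes in the model.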

Release Note

Fine-tuned the shard size used when hashing files, resulting in a modest speed improvement. The exact improvement depends on model size, but we observed improvements of 5-87%. Manifest size was also reduced by 99.9%.

Documentation

NONE

spencerschrock

Lots of diminishing returns on speed after 50 MiB shard size, but manifest size still decreases until we reach the approximate size of the normal file-based manifest (but at that point having shards at all is pointless).

Given HuggingFace has a default shard size of 5GB (5_000_000_000), I'm assuming something in the 1-2G(i?)B range makes sense here. Happy to do either.

benchmarks/serialize.py

The value is based on the benchmarks in b2ddcc0. The exact speed improvement depends on the size of the models being serialized and of the individual files in the model. Speed improvements ranged from 5% to 87%. Manifest size is also influenced by shard size (and the model / individual file size). Increasing shard size from ~1MB to 1GB decreases manifest size by 3 orders of magnitude (99.9% reduction).

Signed-off-by: Spencer Schrock <[email protected]>


spencerschrock commented Feb 26, 2025


spencerschrock marked this pull request as ready for review February 26, 2025 19:26

spencerschrock requested review from a team as code owners February 26, 2025 19:26

mihaimaruseac previously approved these changes Feb 26, 2025


spencerschrock added 2 commits February 26, 2025 13:08

update shard test goldens

541cae9

Signed-off-by: Spencer Schrock <[email protected]>

spencerschrock dismissed mihaimaruseac’s stale review via 541cae9 February 26, 2025 21:28

spencerschrock force-pushed the shard-opt branch from 39441d9 to 541cae9 Compare February 26, 2025 21:28

use gigabyte instead of gibibyte

05a63b3

They're roughly equal in performance; any differences vary either from model to model or from run to run. However, multiples of 1000 are slightly easier for humans to visualize in things like shard names.

Signed-off-by: Spencer Schrock <[email protected]>
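
To illustrate the readability point, here is a small sketch comparing shard boundaries for the two units. The `path:start:end` name format and the `shard_names` helper are assumptions for illustration and may not match how the library actually names shards.

```python
GB = 1000**3   # 1_000_000_000 bytes (gigabyte)
GiB = 1024**3  # 1_073_741_824 bytes (gibibyte)


def shard_names(path: str, file_size: int, shard_size: int) -> list[str]:
    """Build hypothetical shard names of the form 'path:start:end'."""
    names = []
    for start in range(0, file_size, shard_size):
        end = min(start + shard_size, file_size)
        names.append(f"{path}:{start}:{end}")
    return names


decimal_names = shard_names("model.bin", 2_500_000_000, GB)
# ['model.bin:0:1000000000',
#  'model.bin:1000000000:2000000000',
#  'model.bin:2000000000:2500000000']

binary_names = shard_names("model.bin", 2_500_000_000, GiB)
# ['model.bin:0:1073741824',
#  'model.bin:1073741824:2147483648',
#  'model.bin:2147483648:2500000000']
```

Both choices yield three shards for this hypothetical 2.5GB file, but the decimal offsets are easier to scan at a glance.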

spencerschrock force-pushed the shard-opt branch from 97e798a to 05a63b3 Compare February 26, 2025 21:36

mihaimaruseac approved these changes Feb 27, 2025


mihaimaruseac merged commit 4a610f7 into sigstore:main Feb 27, 2025
33 checks passed

spencerschrock deleted the shard-opt branch February 27, 2025 15:01
