Release v2.1.0: 🤗 .generate(), faster loading, responsive inference, and more · bigscience-workshop/petals

Highlights

🔌 Compatibility with 🤗 Transformers generation utils. Petals models now directly use 🤗 Transformers .generate() implementation instead of custom generation code. This means that you can use a variety of generation methods and constraints implemented in 🤗 Transformers (e.g., repetition_penalty, beam search, etc.) and expect an exact match between Petals and a model running locally.

Most common methods are compatible with reusing inference sessions, so that you can run .generate() multiple times without reprocessing the dialogue history from scratch:

with model.inference_session(max_length=100):
    outputs1 = model.generate(user_prompt1, repetition_penalty=1.2)
    outputs2 = model.generate(user_prompt2, repetition_penalty=1.2)
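As a minimal sketch of the pattern above, the helper below runs several turns inside one session. The name generate_turns and its parameters are hypothetical and not part of Petals. It relies only on the inference_session(max_length=...) and generate(...) calls shown above, and assumes each prompt has already been tokenized.

```python
def generate_turns(model, prompts, max_length=100, **generate_kwargs):
    """Run one .generate() call per prompt inside a single inference session.

    `model` is assumed to expose the inference_session() and generate()
    methods shown in the release notes; `prompts` are tokenized inputs.
    This helper is illustrative and not part of the Petals API.
    """
    outputs = []
    # Open the session once so that later turns reuse it instead of
    # starting from scratch.
    with model.inference_session(max_length=max_length):
        for prompt in prompts:
            outputs.append(model.generate(prompt, **generate_kwargs))
    return outputs
```

A call such as generate_turns(model, [user_prompt1, user_prompt2], repetition_penalty=1.2) would then mirror the two-line snippet above.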

⚡ Faster loading of Stable Beluga 2. We repacked Stable Beluga 2, the most popular model at the moment, to increase its loading speed and minimize RAM and disk space requirements. The repacked version can be loaded from the petals-team/StableBeluga2 repository and is fully compatible with clients and servers using the standard repository (stabilityai/StableBeluga2).

Now, clients need to download only 1.05 GB of data to run Stable Beluga 2 (instead of ~20 GB needed before) and require only 4 GB of RAM (instead of ~20 GB required before). Servers need to download and store half as much data and load the model from disk significantly faster. If you're switching from the old repository, don't forget to remove the old cache in the ~/.cache/petals/models--stabilityai--StableBeluga2 directory to save disk space.
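The cleanup step can be scripted. The sketch below assumes the cache lives in the default location named above; if you configured a custom cache directory, pass that path instead. The function name remove_old_cache is hypothetical.

```python
import shutil
from pathlib import Path

# Default location of the old repository's cache, as named in the release notes.
OLD_CACHE = Path.home() / ".cache" / "petals" / "models--stabilityai--StableBeluga2"

def remove_old_cache(path=OLD_CACHE):
    """Delete `path` recursively if it is a directory.

    Returns True if a directory was removed and False if there was
    nothing to remove.
    """
    path = Path(path)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True
```

Calling remove_old_cache() with no arguments targets the default path. The deletion is permanent, so check the path before running it.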

⏱️ More responsive inference. In older versions, servers could become unresponsive for a few seconds while processing large prefixes (thousands of tokens) during inference. This release lets servers handle small inference requests (a few tokens) in the middle of processing a large request, which avoids freezes during token-by-token inference caused by someone else processing a large prefix.
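To illustrate the general idea (not the actual Petals server code), here is a toy scheduler. It assumes a large request is processed in fixed-size chunks and that short requests get priority between chunks. The function name simulate and the chunk_size and short_limit values are hypothetical.

```python
import heapq
from itertools import count

def simulate(requests, chunk_size=512, short_limit=8):
    """Toy model: return the order of (name, n_tokens) work units.

    `requests` is a list of (arrival_tick, name, n_tokens). Each work unit
    takes one tick. Requests with at most `short_limit` tokens run before
    the next chunk of any larger request.
    """
    pending = sorted(requests)   # ordered by arrival tick
    heap, order = [], []
    tie = count()                # keeps arrival order within a priority level
    tick, i = 0, 0
    while i < len(pending) or heap:
        # Admit every request that has arrived by the current tick.
        while i < len(pending) and pending[i][0] <= tick:
            _, name, n_tokens = pending[i]
            priority = 0 if n_tokens <= short_limit else 1
            heapq.heappush(heap, (priority, next(tie), name, n_tokens))
            i += 1
        if not heap:
            tick = pending[i][0]  # idle until the next arrival
            continue
        priority, _, name, remaining = heapq.heappop(heap)
        step = min(remaining, chunk_size)
        order.append((name, step))
        if remaining > step:
            # Re-queue the rest so short requests can run in between chunks.
            heapq.heappush(heap, (priority, next(tie), name, remaining - step))
        tick += 1
    return order
```

With a 1500-token prefix arriving at tick 0 and a 1-token request arriving at tick 1, the single token is served right after the first 512-token chunk rather than after the whole prefix.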

🔒 Minor improvements. This release adds support for loading weights in the safetensors format on servers and adds the blocked_servers client option to avoid a given set of servers:

from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # Example: the repacked repository mentioned above
blocked_servers = ["12D3KooWA6g...", "12D3KooWGyD..."]  # Full peer IDs from https://health.petals.dev
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, blocked_servers=blocked_servers)

🐞 Bug fixes. This release also includes a variety of bug fixes that speed up the chatbot app and fine-tuning, better bypass recently disconnected servers, improve the rebalancing algorithm and the usability of benchmarks, and fix throughput measurements and installation on ARM CPUs.

We also fixed Petals compatibility with the latest releases of 🤗 Transformers, Accelerate, and PEFT libraries.

Breaking changes

📖 Default inference sessions. If you run .generate() or forward passes inside an .inference_session() context, they now use the opened session by default. These snippets are now equivalent:

# Using default session
with model.inference_session(max_length=100):
    output_ids = model.generate(input_ids, max_new_tokens=3)

# Explicitly specifying a session
with model.inference_session(max_length=100) as sess:
    output_ids = model.generate(input_ids, max_new_tokens=3, session=sess)

Previously, the first snippet created a new session, which confused most people and led to bugs.

➡️ Renaming. We renamed SequenceManagerConfig to petals.ClientConfig and petals.dht_utils to petals.utils.dht. The old names now lead to DeprecationWarnings and will be removed in Petals 2.2.0+.
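To find leftover uses of the old names before they are removed, one option is to record deprecation warnings while running your own code. The sketch below uses only Python's standard warnings module and assumes the old names emit standard DeprecationWarnings, as stated above. The helper name deprecation_messages is hypothetical.

```python
import warnings

def deprecation_messages(fn):
    """Call `fn()` and return the messages of any DeprecationWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let default filters hide warnings
        fn()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]
```

Alternatively, running a script with python -W error::DeprecationWarning turns such warnings into exceptions, which makes the offending line show up in the traceback.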

What's Changed

Fix stale link by @bot66 in #418
Add Discord badge and more Discord links to readme by @borzunov in #422
Add connect_timeout by @borzunov in #423
Add Stable Beluga 2 to readme by @borzunov in #424
Penalize servers that use relays during rebalancing by @borzunov in #428
Fix petals.utils.ping for servers with client-mode DHT by @borzunov in #430
Fix typo and make blocks message more informative by @vadi2 in #437
Update Discord links from channels to forums by @borzunov in #440
Remove distracting links from readme by @borzunov in #441
Remove deprecated comment in fine-tuning notebook by @borzunov in #443
Use bitsandbytes 0.41.1 by @borzunov in #442
[Refactor] extract block forward, backward and inference into a separate file by @justheuristic in #435
Override float32 in config to bfloat16 by @borzunov in #431
Prefer longer servers for fine-tuning, exclude unreachable by @borzunov in #448
Force using --new_swarm instead of empty --initial_peers by @borzunov in #451
Test Llama, rebalancing, throughput eval, and all CLI scripts by @borzunov in #452
benchmarks: Aggregate speed among workers, set default dtype torch32 by @borzunov in #454
Use torch.cuda.synchronize for compute throughput by @justheuristic in #456
Prioritize short inference, unmerge pools for long inference by @borzunov in #458
Bump version to 2.0.1.post2 by @borzunov in #459
Add blocked_servers argument by @borzunov in #462
Add customizable input tensors by @artek0chumak in #445
Move SequenceManagerConfig -> ClientConfig, petals.dht_utils -> petals.utils.dht by @borzunov in #463
Make client compatible with transformers' GenerationMixin by @borzunov in #464
Temporarily require peft<0.5.0, transformers<4.32.0 by @justheuristic in #470
Support transformers 4.32.x by @justheuristic in #471
Change transformers version assert by @justheuristic in #472
Support loading weights from Safetensors on server by @borzunov in #473
Update peft to 0.5.0 version by @artek0chumak in #475
Hide excess key message by @borzunov in #476
Bump version to 2.1.0 by @borzunov in #474
Don't install cpufeature on non-x86_64 machines by @borzunov in #478

New Contributors

@bot66 made their first contribution in #418

Full Changelog: v2.0.1...v2.1.0
