v2.1.0: 🤗 .generate(), faster loading, responsive inference, and more
Highlights
🔌 Compatibility with 🤗 Transformers generation utils. Petals models now directly use the 🤗 Transformers .generate() implementation instead of custom generation code. This means that you can use a variety of generation methods and constraints implemented in 🤗 Transformers (e.g., repetition_penalty, beam search, etc.) and expect an exact match between Petals and a model running locally.
Most common methods are compatible with reusing inference sessions, so you can run .generate() multiple times without reprocessing the dialogue history from scratch:
with model.inference_session(max_length=100):
    outputs1 = model.generate(user_prompt1, repetition_penalty=1.2)
    outputs2 = model.generate(user_prompt2, repetition_penalty=1.2)
⚡ Faster loading of Stable Beluga 2. We repacked Stable Beluga 2, the most popular model at the moment, to increase its loading speed and minimize RAM and disk space requirements. The repacked version can be loaded from the petals-team/StableBeluga2 repository and is fully compatible with clients and servers using the standard repository (stabilityai/StableBeluga2).
Now, clients need to download only 1.05 GB of data to run Stable Beluga 2 (instead of the ~20 GB needed before) and require only 4 GB of RAM (instead of the ~20 GB required before). Servers need to download and store half as much data and load the model from disk significantly faster. If you're switching from the old repository, don't forget to remove the old cache in the ~/.cache/petals/models--stabilityai--StableBeluga2 directory to save disk space.
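The cleanup step can be scripted. The following is a minimal sketch that assumes the default cache location named above (a custom cache directory would need a different path); remove_cache_dir and OLD_CACHE are hypothetical names used only for illustration, not part of Petals.

```python
import shutil
from pathlib import Path

# Default location of the old Stable Beluga 2 cache, as given in the notes above.
# Assumption: the Petals cache directory has not been moved elsewhere.
OLD_CACHE = Path.home() / ".cache" / "petals" / "models--stabilityai--StableBeluga2"


def remove_cache_dir(path):
    """Recursively delete `path`; return True if a directory was removed."""
    path = Path(path)
    if not path.is_dir():
        return False  # nothing to clean up
    shutil.rmtree(path)
    return True


# To actually free the disk space, call:
#     remove_cache_dir(OLD_CACHE)
```

The call is left commented out so that nothing is deleted until you have checked the path.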
⏱️ More responsive inference. In older versions, servers could become unresponsive for a few seconds while processing large prefixes (thousands of tokens) during inference. This release allows servers to handle small inference requests (a few tokens) in the middle of processing a large request, which avoids freezes in token-by-token inference caused by someone else processing a large prefix.
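To illustrate the idea, here is a toy simulation, not the Petals scheduler itself. It assumes a large request is processed in fixed-size chunks and that, between chunks, the request with the fewest remaining tokens goes first. The function simulate, the chunk size, and the request names are all hypothetical.

```python
import heapq


def simulate(arrivals, chunk_size=512):
    """Simulate chunked processing with short requests served first.

    arrivals: list of (arrival_step, name, n_tokens).
    Returns a list of (name, tokens_processed), one entry per processing step.
    """
    pending = sorted(arrivals)  # ordered by arrival step
    heap = []                   # entries: (remaining_tokens, sequence_no, name)
    timeline = []
    step = seq = i = 0
    while i < len(pending) or heap:
        # Admit every request that has arrived by the current step.
        while i < len(pending) and pending[i][0] <= step:
            _, name, n_tokens = pending[i]
            heapq.heappush(heap, (n_tokens, seq, name))
            seq += 1
            i += 1
        if not heap:
            step = pending[i][0]  # idle until the next arrival
            continue
        # Process one chunk of the request with the fewest remaining tokens.
        remaining, _, name = heapq.heappop(heap)
        done = min(chunk_size, remaining)
        timeline.append((name, done))
        if remaining > done:
            heapq.heappush(heap, (remaining - done, seq, name))
            seq += 1
        step += 1
    return timeline


# A 2048-token prefix arrives first; a 1-token chat request arrives at step 2.
timeline = simulate([(0, "prefix", 2048), (2, "chat", 1)])
```

In this run the chat request is served right after the second chunk of the prefix instead of waiting for all four. A real scheduler would also need to guard against starving long requests, which this sketch ignores.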
🔒 Minor improvements. This release adds support for loading weights in the safetensors format on servers and adds the blocked_servers client option to avoid a given set of servers:
from petals import AutoDistributedModelForCausalLM
blocked_servers = ["12D3KooWA6g...", "12D3KooWGyD..."] # Full peer IDs from https://health.petals.dev
model = AutoDistributedModelForCausalLM.from_pretrained(model_name, blocked_servers=blocked_servers)
🐞 Bug fixes. This release also includes a variety of bug fixes that speed up the chatbot app and fine-tuning, better bypass recently disconnected servers, improve the rebalancing algorithm and the usability of benchmarks, and fix throughput measurements and installation on ARM CPUs.
We also fixed Petals compatibility with the latest releases of 🤗 Transformers, Accelerate, and PEFT libraries.
Breaking changes
📖 Default inference sessions. If you run .generate() or forward passes inside an .inference_session() context, they now use the opened session by default. These snippets are now equivalent:
# Using default session
with model.inference_session(max_length=100):
    output_ids = model.generate(input_ids, max_new_tokens=3)
# Explicitly specifying a session
with model.inference_session(max_length=100) as sess:
    output_ids = model.generate(input_ids, max_new_tokens=3, session=sess)
Earlier, the first snippet created a new session, which confused most people and led to bugs.
➡️ Renaming. We renamed SequenceManagerConfig to petals.ClientConfig and petals.dht_utils to petals.utils.dht. The old names now emit DeprecationWarnings and will be removed in Petals 2.2.0+.
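For readers unfamiliar with how a renamed class can keep working while warning callers, here is a generic sketch of the pattern. It is an assumption-laden illustration, not the Petals source: the ClientConfig below is an empty stand-in and _deprecated_alias is a hypothetical helper.

```python
import warnings


class ClientConfig:
    """Empty stand-in for the renamed class (not the real petals.ClientConfig)."""


def _deprecated_alias(new_cls, old_name, removed_in):
    """Build a subclass of `new_cls` that warns whenever it is instantiated."""

    class _Alias(new_cls):
        def __init__(self, *args, **kwargs):
            warnings.warn(
                f"{old_name} is deprecated, use {new_cls.__name__} instead; "
                f"it will be removed in {removed_in}",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            super().__init__(*args, **kwargs)

    _Alias.__name__ = old_name
    return _Alias


SequenceManagerConfig = _deprecated_alias(
    ClientConfig, "SequenceManagerConfig", "Petals 2.2.0+"
)

# Old code keeps working but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    config = SequenceManagerConfig()
```

To locate remaining uses of old names in your own code, running Python with -W error::DeprecationWarning turns such warnings into exceptions.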
What's Changed
- Fix stale link by @bot66 in #418
- Add Discord badge and more Discord links to readme by @borzunov in #422
- Add connect_timeout by @borzunov in #423
- Add Stable Beluga 2 to readme by @borzunov in #424
- Penalize servers that use relays during rebalancing by @borzunov in #428
- Fix petals.utils.ping for servers with client-mode DHT by @borzunov in #430
- Fix typo and make blocks message more informative by @vadi2 in #437
- Update Discord links from channels to forums by @borzunov in #440
- Remove distracting links from readme by @borzunov in #441
- Remove deprecated comment in fine-tuning notebook by @borzunov in #443
- Use bitsandbytes 0.41.1 by @borzunov in #442
- [Refactor] extract block forward, backward and inference into a separate file by @justheuristic in #435
- Override float32 in config to bfloat16 by @borzunov in #431
- Prefer longer servers for fine-tuning, exclude unreachable by @borzunov in #448
- Force using --new_swarm instead of empty --initial_peers by @borzunov in #451
- Test Llama, rebalancing, throughput eval, and all CLI scripts by @borzunov in #452
- benchmarks: Aggregate speed among workers, set default dtype torch32 by @borzunov in #454
- Use torch.cuda.synchronize for compute throughput by @justheuristic in #456
- Prioritize short inference, unmerge pools for long inference by @borzunov in #458
- Bump version to 2.0.1.post2 by @borzunov in #459
- Add blocked_servers argument by @borzunov in #462
- Add customizable input tensors by @artek0chumak in #445
- Move SequenceManagerConfig -> ClientConfig, petals.dht_utils -> petals.utils.dht by @borzunov in #463
- Make client compatible with transformers' GenerationMixin by @borzunov in #464
- Temporarily require peft<0.5.0, transformers<4.32.0 by @justheuristic in #470
- Support transformers 4.32.x by @justheuristic in #471
- Change transformers version assert by @justheuristic in #472
- Support loading weights from Safetensors on server by @borzunov in #473
- Update peft to 0.5.0 version by @artek0chumak in #475
- Hide excess key message by @borzunov in #476
- Bump version to 2.1.0 by @borzunov in #474
- Don't install cpufeature on non-x86_64 machines by @borzunov in #478
New Contributors
Full Changelog: v2.0.1...v2.1.0