[Docs] small readme fix #124

Merged
merged 2 commits into from
Mar 7, 2025
12 changes: 6 additions & 6 deletions README.md
@@ -44,7 +44,7 @@ replica groups and then a per replica group manager and fault tolerance library
that can be used in a standard PyTorch training loop.

This allows for membership changes at the training step granularity which can
-greatly improve efficiency by avoiding stop the world training on errors.
+greatly improve efficiency by avoiding stopping the world training on errors.

![](./media/torchft-overview.png)

@@ -57,7 +57,7 @@ Before proceeding, ensure you have the following installed:

Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website as shown in the below command:
```sh
-$ curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
+curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
```

To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:
@@ -75,7 +75,7 @@ sudo dnf install protobuf-compiler protobuf-devel
## Installation

```sh
-$ pip install .
+pip install .
```

This uses pyo3+maturin to build the package, you'll need maturin installed.
@@ -85,7 +85,7 @@ If the installation command fails to invoke `cargo update` due to an inability t
To install in editable mode w/ the Rust extensions you can use the normal pip install command:

```sh
-$ pip install -e .
+pip install -e .
```

## Usage
@@ -98,7 +98,7 @@ when using synchronous training.
You can start a lighthouse server by running:

```sh
-$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
+RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```

### Example Training Loop (DDP)
@@ -108,7 +108,7 @@ See [train_ddp.py](./train_ddp.py) for the full example.
Invoke with:

```sh
-$ TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
+TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py
```

train.py: