From 639fb728d623c849fe708f8161985b4653770c97 Mon Sep 17 00:00:00 2001
From: Jackmin801
Date: Thu, 6 Mar 2025 20:55:04 +0000
Subject: [PATCH 1/2] remove $ prefix which made it hard to copy paste

---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 2d0a25c..7891425 100644
--- a/README.md
+++ b/README.md
@@ -57,7 +57,7 @@ Before proceeding, ensure you have the following installed:
 
 Note that the Rust versions available in many conda environments may be outdated. To install the latest version of Rust, we recommend downloading it directly from the official website as shown in the below command:
 ```sh
-$ curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
+curl --proto '=https' --tlsv1.2 https://sh.rustup.rs -sSf | sh
 ```
 
 To install the required packages on a Debian-based system (such as Ubuntu) using apt, run:
@@ -75,7 +75,7 @@ sudo dnf install protobuf-compiler protobuf-devel
 ## Installation
 
 ```sh
-$ pip install .
+pip install .
 ```
 
 This uses pyo3+maturin to build the package, you'll need maturin installed.
@@ -85,7 +85,7 @@ If the installation command fails to invoke `cargo update` due to an inability t
 To install in editable mode w/ the Rust extensions you can use the normal pip install command:
 
 ```sh
-$ pip install -e .
+pip install -e .
 ```
 
 ## Usage
@@ -98,7 +98,7 @@ when using synchronous training.
 You can start a lighthouse server by running:
 
 ```sh
-$ RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
+RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
 ```
 
 ### Example Training Loop (DDP)
@@ -108,7 +108,7 @@ See [train_ddp.py](./train_ddp.py) for the full example.
 Invoke with:
 
 ```sh
-$ TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
+TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
 ```
 
 train.py:

From 8b8e1f9a88f3f9727ccb20f3bcc022370b2d9763 Mon Sep 17 00:00:00 2001
From: Jackmin801
Date: Thu, 6 Mar 2025 20:56:23 +0000
Subject: [PATCH 2/2] typo fixes

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 7891425..2215436 100644
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@ replica groups and then a per replica group manager and fault tolerance library
 that can be used in a standard PyTorch training loop.
 
 This allows for membership changes at the training step granularity which can
-greatly improve efficiency by avoiding stop the world training on errors.
+greatly improve efficiency by avoiding stopping the world training on errors.
 
 ![](./media/torchft-overview.png)
 
@@ -108,7 +108,7 @@ See [train_ddp.py](./train_ddp.py) for the full example.
 Invoke with:
 
 ```sh
-TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train.py
+TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port 29501 --nnodes 1 --nproc_per_node 1 train_ddp.py
 ```
 
 train.py: