You will receive an email, most likely from Barry Chen, asking for your mailing address so that an RSA token can be shipped to you.
Meet with Barry after the RSA token arrives in the mail.
First, ssh into an LLNL compute cluster such as Dane:
dhcp-10-118-155-112:~ mayavenkatraman$ ssh [email protected]
You will be prompted for your 8-digit PIN appended to the 6-digit number shown on the RSA token you received in the mail.
From Dane, ssh into Tuolumne:
[venkatraman2@dane6:~]$ ssh tuolumne.llnl.gov
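If you would rather not log in twice each time, most OpenSSH clients can jump through Dane in a single command. This is just a convenience suggestion, not part of the official instructions:
ssh -J {username}@dane.llnl.gov {username}@tuolumne.llnl.gov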
Miniforge is an edition of Conda that only uses free and openly-licensed packages from the conda-forge project, and it is already available on LLNL systems. Conda is an open-source package and environment manager that helps users install, update, and remove software. It lets you install software WITHOUT needing root access (sudo), which is critical on systems like LLNL's.
Mamba is a faster, mostly drop-in replacement for Conda. Install it into the base environment:
conda install -n base -c conda-forge mamba
Check that mamba is installed correctly by running
mamba --version
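Once mamba works, you can create and activate a project environment. The environment name and Python version below are placeholders, not a required setup:
mamba create -n glm python=3.11
mamba activate glm   # or: conda activate glm, if mamba is not initialized in your shell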
Run mamba install oh-my-bash or some variant.
Run mamba install tmux or some variant.
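tmux is handy for long-running work over SSH because a session keeps running even if your connection drops. Basic usage (the session name is arbitrary):
tmux new -s work      # start a named session
# detach with Ctrl-b, then d
tmux attach -t work   # reattach later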
If you are on a Mac, you might observe that whenever you try to delete characters in Vim, the terminal inserts ^? instead. By default, the macOS Terminal and some remote SSH sessions send ^? for Delete, which Vim does not interpret as a delete command. Run stty erase ^? to circumvent this. To make the change permanent, run echo 'stty erase ^?' >> ~/.bashrc.
Run echo "set mouse=a" >> ~/.vimrc
.
Add your git username and personal access token to ~/.git-credentials like so:
echo "https://your-username:[email protected]" >> ~/.git-credentials
This way, you won't have to log in every time you pull or push.
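If git still prompts you after this, you likely also need to enable the file-backed credential helper, since ~/.git-credentials is only consulted by the store helper:
git config --global credential.helper store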
To find a model, check /p/vast1/OpenFoldCollab/genome_lm/training_output/; it is probably there.
Like on Manitou, you can use sbatch to run a script.
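If you end up writing your own submission script, a minimal skeleton typically looks something like the following. The directives and values here are placeholders, not the contents of the real train_model.sh:
#!/bin/bash
#SBATCH -N 16              # number of nodes
#SBATCH -t 12:00:00        # walltime
#SBATCH -o train_%j.out    # output file (%j expands to the job id)
srun python /p/vast1/OpenFoldCollab/genome_lm/glm/glm/train/training.py \
    --config-yaml="/path/to/your_config.yaml"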
An example training script, train_model.sh, can be found at /p/vast1/OpenFoldCollab/genome_lm/experiments/emb_exp/test_fsdp/train_model.sh.
Run it like so:
[venkatraman2@tuolumne1003:scripts]$ sbatch train_model.sh
faKoR2uSfgB
The string printed back (faKoR2uSfgB here) is the job ID. Check on your jobs with:
[venkatraman2@tuolumne1004:scripts]$ squeue -u {username}
NOTE: If your job has an "S" under "ST" (Status), this means "Suspended." It is possible that all jobs on Tuolumne are being suspended due to maintenance. You can check whether this is happening cluster-wide by running squeue without specifying a username:
[venkatraman2@tuolumne1004:scripts]$ squeue
You can also look up a specific job through Flux:
flux jobs | grep {job_id}
To cancel a job, run scancel {job_id}.
If your job fails silently (many of mine did at first), try running the training command directly with srun so you can see its output:
srun python /p/vast1/OpenFoldCollab/genome_lm/glm/glm/train/training.py \
--config-yaml="/g/g14/venkatraman2/scripts/mvenkat_glm_12l_20k.yaml" \
--limit-val-batches 50 --inference-mode-off --skip-last-val \
--compile-off
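If srun on its own complains about not having an allocation, you can grab an interactive one first and run the command inside it. This assumes the standard Slurm front-end commands behave normally on this cluster; node count and time limit are placeholders:
salloc -N 1 -t 60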
ModelSeqParallelStrategy seems much less reliable than DDP, so if your pl_strategy class is set to ModelSeqParallelStrategy, try changing it.
Normally, when NCCL detects an error (e.g., a timeout, GPU failure, or network issue), it does not immediately report the error. Instead, it waits for all ranks to reach the same failure point, which can cause deadlocks where some GPUs hang indefinitely.
With NCCL_ASYNC_ERROR_HANDLING=1:
- NCCL immediately detects errors and reports them asynchronously.
- If a GPU fails or hangs, the job is aborted automatically instead of waiting for other ranks.
- This prevents silent hangs in distributed training jobs.
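In practice, this just means exporting the variable before the training launch, e.g. near the top of your submission script:
export NCCL_ASYNC_ERROR_HANDLING=1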
Directory: /p/vast1/OpenFoldCollab/genome_lm/experiments/SL-GLM_exp
Experiment 1: /SL-GLM_exp/02.10.2025_experiment_1
Submission/training scripts: /02.10.2025_experiment_1/submit_SL/
Check /02.10.2025_experiment_1/configs_SL/esm3s_12l_varlen20k_spanmask01_student_teacher_token.yaml for the token selection config.
- Note the student_teacher: key.
Currently, SL can only handle standard RoPE, so set data params like return_contig_indices: false correctly in both the student and teacher configs. The ['student_teacher']['selection_scheme'] key can take only two values: token or batch. You can see the batch selection config in configs_SL/esm3s_12l_varlen20k_spanmask01_student_teacher_batch.yaml.
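As a rough sketch of how that key looks in the config (only student_teacher and selection_scheme come from the text above; any surrounding structure is an assumption, so check the actual YAML files):
student_teacher:
  selection_scheme: token   # or: batch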
To test FSDP, you need to change pl_strategy in the model config. You must also specify args that multiply to the world size.
pl_strategy:
  class: ModelSeqParallelStrategy
  args:
    data_parallel_size: 64
    sequence_parallel_size: 1
    tensor_parallel_size: 1
This is a valid setting for args if trainer is configured like so:
trainer:
  log_every_n_steps: 400 # Log every n steps
  max_steps: 210005 # Maximum steps
  precision: bf16-mixed # Precision
  gradient_clip_val: # Gradient clip value
  devices: 4 # Devices
  num_nodes: 16 # Number of nodes
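Sanity check: devices * num_nodes = 4 * 16 = 64 GPUs (the world size), and data_parallel_size * sequence_parallel_size * tensor_parallel_size = 64 * 1 * 1 = 64, so the parallel sizes multiply to the world size as required.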