This project is aimed at exploring zero-shot hyper-parameter transfer with $\mu\text{P}$ on the in-context linear regression task of Garg et al., 2022.
I created this project mostly because I wanted to understand
$\mu\text{P}$ better and to have an example implementation for my work. I hope you find it helpful too. If you have questions or feedback, please let me know 🧑‍💻
In Garg et al., 2022, the task is to perform next-token prediction on a sequence of interleaved regression inputs and targets $(x_1, y_1, x_2, y_2, \ldots)$ with $y_i = a x_i + b$, so that the model learns to solve a linear regression problem in context.
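To make the data format concrete, here is a minimal sketch of how such training sequences could be generated. The function and variable names are illustrative only, not the ones used in this repository; they mirror the `batch_size`, `block_size`, `param_range`, and `sample_range` parameters documented below.

```python
import torch

def sample_batch(batch_size: int, block_size: int,
                 param_range: float = 1.0, sample_range: float = 1.0):
    """Illustrative sketch (not this repository's actual code):
    sample linear-regression tasks y = a * x + b and interleave
    (x, y) pairs into sequences for next-token prediction."""
    # One (a, b) pair per sequence, shared by all shots in that sequence.
    a = torch.empty(batch_size, 1).uniform_(-param_range, param_range)
    b = torch.empty(batch_size, 1).uniform_(-param_range, param_range)
    x = torch.empty(batch_size, block_size).uniform_(-sample_range, sample_range)
    y = a * x + b
    # Interleave to (x_1, y_1, x_2, y_2, ...); the model is trained to
    # predict each y_i from the tokens that precede it.
    return torch.stack([x, y], dim=-1).reshape(batch_size, 2 * block_size)

batch = sample_batch(batch_size=64, block_size=32)
print(batch.shape)  # torch.Size([64, 64])
```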
I recommend setting up a conda environment first, e.g.,

```
conda create -n icl-linear-regression-mup python=3.10
conda activate icl-linear-regression-mup
```

and then running

```
pip install -e .
```

to install `icl-linear-regression-mup` with all dependencies.
This code uses hydra to configure experiments; see https://hydra.cc for more details, and see below for documentation of all command-line arguments.
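If you are unfamiliar with hydra, the pattern looks roughly like the following generic sketch (the `config_path` and `config_name` values here are placeholders, not necessarily those of this repository): command-line arguments such as `learning_rate=3e-05` override fields of a config object passed to the entry point, and `-m` launches one run per value of a comma-separated sweep.

```python
import hydra
from omegaconf import DictConfig, OmegaConf

# Generic hydra entry point: config fields such as learning_rate,
# embed_dim, or num_steps are overridden from the command line as
# key=value pairs.
@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # inspect the resolved configuration
    # train(cfg) ...

if __name__ == "__main__":
    main()
```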
To reproduce the results, first tune the learning rate of a small model with width 256 by running

```
python icl_linear_regression_mup.py -m use_mup=True learning_rate=1e-05,3e-05,9e-05,0.00027,0.00081,0.00243,0.00729,0.02187,0.06561
```
By default, some parameters and metrics will be written to a file named `results.csv`. You can change the file by setting `results_file` on the command line (see more below). I found it useful to simply write the results into a `csv` file. The format is also compatible with the `notebooks/scaling.ipynb` notebook used later.
Then, run the same experiments at scale, e.g., with more training steps, more layers, and a larger width:

```
python icl_linear_regression_mup.py -m use_mup=True embed_dim=1024 num_layers=8 num_steps=50000 learning_rate=1e-05,3e-05,9e-05,0.00027,0.00081,0.00243,0.00729,0.02187,0.06561
```
You can now move to the `notebooks/scaling.ipynb` notebook to analyze your results; `results/demo.csv` contains example results for comparison.
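If you prefer not to open the notebook, something like the following gives a quick look at the csv. This is only a sketch: the metric column name `loss` is an assumption, so check `results/demo.csv` (or your own `results_file`) for the exact schema. With $\mu\text{P}$, the loss-vs-learning-rate curves for different widths should reach their minimum at roughly the same learning rate.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot final loss against learning rate, one curve per width, to check
# whether the optimal learning rate transfers across widths.
df = pd.read_csv("results.csv")
for embed_dim, group in df.groupby("embed_dim"):
    group = group.sort_values("learning_rate")
    plt.plot(group["learning_rate"], group["loss"],
             marker="o", label=f"width={embed_dim}")
plt.xscale("log")
plt.xlabel("learning rate")
plt.ylabel("final loss")
plt.legend()
plt.show()
```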
| Parameter | Description |
|---|---|
| `learning_rate` | Maximum learning rate during training (might be different when using `use_mup=True`; see below). |
| `use_mup` | Whether to use $\mu\text{P}$. |
| `num_steps` | Total number of gradient steps during training. |
| `batch_size` | Batch size per gradient step. |
| `weight_decay` | Weight decay parameter passed to `AdamW`. |
| `dropout` | Dropout used during training. |
| `gradient_norm` | Norm for gradient clipping. |
| `base_embed_dim` | Embedding dimension of the base model for $\mu\text{P}$. |
| `base_num_heads` | Number of heads for the base model in $\mu\text{P}$. |
| `delta_embed_dim` | Embedding dimension of the delta model for $\mu\text{P}$; should differ from `base_embed_dim`. |
| `delta_num_heads` | Number of heads for the delta model in $\mu\text{P}$; should differ from `base_num_heads` if you wish to scale the number of heads. |
| `num_layers` | Number of transformer layers. |
| `embed_dim` | Embedding dimension of the transformer. |
| `num_heads` | Number of attention heads of the transformer. |
| `seed` | Random seed to use for the run. |
| `log_every` | Log the current training metrics to the console every `log_every` gradient steps. |
| `block_size` | Number of shots for the in-context linear regression problem. |
| `param_range` | Value range for the `a` and `b` linear regression coefficients. |
| `sample_range` | Value range for the inputs `x`. |
| `results_file` | Path to the csv file to which the run results are logged. |
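For context on the `base_*`/`delta_*` parameters above: hyper-parameter transfer with $\mu\text{P}$ typically works by instantiating a small base model and a slightly different delta model so that the parametrization can infer which dimensions scale with width. The following is a minimal sketch of that pattern, assuming the standard `mup` package API (`set_base_shapes`, `MuReadout`, `MuAdamW`); it is not a verbatim excerpt from this repository, and the toy model is a placeholder for the actual transformer.

```python
import torch.nn as nn
from mup import MuAdamW, MuReadout, set_base_shapes

def build_model(embed_dim: int) -> nn.Module:
    # Stand-in for the actual transformer constructor; for muP the final
    # readout layer is a MuReadout so its initialization and learning
    # rate scale correctly with width.
    return nn.Sequential(nn.Linear(16, embed_dim), nn.ReLU(),
                         MuReadout(embed_dim, 1))

# Target model at the width you actually want to train.
model = build_model(embed_dim=1024)
# Base and delta models only provide shape information (their weights are
# never used); they play the role of base_embed_dim / delta_embed_dim above.
base = build_model(embed_dim=256)
delta = build_model(embed_dim=512)
set_base_shapes(model, base, delta=delta)

# MuAdamW applies the muP per-parameter learning-rate scaling.
optimizer = MuAdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
```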