fine_tuned_base_model.out
Train size: 6399
Test size: 1600
Map: 100%|██████████| 6399/6399 [00:11<00:00, 534.10 examples/s]
Map: 100%|██████████| 1600/1600 [00:02<00:00, 535.45 examples/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00, 8.66s/it]
Trainable: 20971520 | total: 7262703616 | Percentage: 0.2888%
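The percentage on the line above can be checked directly from the two counts it reports. The following is a minimal sketch that recomputes it; the variable names are illustrative, and the log does not show how the training script obtained the counts.

```python
# Values copied from the "Trainable: ... | total: ..." log line.
trainable_params = 20_971_520
total_params = 7_262_703_616

# Share of parameters that receive gradient updates, as a percentage.
percentage = 100 * trainable_params / total_params

report = (
    f"Trainable: {trainable_params} | total: {total_params} "
    f"| Percentage: {percentage:.4f}%"
)
print(report)
```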
/home/matthewn/.conda/envs/kuda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:245: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
warnings.warn(
Map: 100%|██████████| 6399/6399 [00:06<00:00, 962.15 examples/s]
Map: 100%|██████████| 1600/1600 [00:01<00:00, 1008.16 examples/s]
/home/matthewn/.conda/envs/kuda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
warnings.warn(
/home/matthewn/.conda/envs/kuda/lib/python3.11/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
0%| | 0/100 [00:00<?, ?it/s]/home/matthewn/.conda/envs/kuda/lib/python3.11/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
1%| | 1/100 [00:06<10:34, 6.40s/it] {'loss': 2.4351, 'grad_norm': 3.538444995880127, 'learning_rate': 0.00019805941782534764, 'epoch': 0.0}
1%| | 1/100 [00:06<10:34, 6.40s/it] 2%|▏ | 2/100 [00:12<09:46, 5.99s/it] {'loss': 2.3249, 'grad_norm': 4.944902420043945, 'learning_rate': 0.00019605881764529358, 'epoch': 0.0}
2%|▏ | 2/100 [00:12<09:46, 5.99s/it] 3%|▎ | 3/100 [00:18<09:40, 5.99s/it] {'loss': 2.5113, 'grad_norm': 3.7843663692474365, 'learning_rate': 0.00019405821746523957, 'epoch': 0.0}
3%|▎ | 3/100 [00:18<09:40, 5.99s/it] 4%|▍ | 4/100 [00:23<09:31, 5.95s/it] {'loss': 1.5747, 'grad_norm': 3.0683703422546387, 'learning_rate': 0.00019205761728518557, 'epoch': 0.0}
4%|▍ | 4/100 [00:23<09:31, 5.95s/it] 5%|▌ | 5/100 [00:29<08:58, 5.67s/it] {'loss': 2.0008, 'grad_norm': 4.697077751159668, 'learning_rate': 0.00019005701710513156, 'epoch': 0.0}
5%|▌ | 5/100 [00:29<08:58, 5.67s/it] 6%|▌ | 6/100 [00:36<09:41, 6.19s/it] {'loss': 1.7482, 'grad_norm': 2.807176113128662, 'learning_rate': 0.00018805641692507753, 'epoch': 0.0}
6%|▌ | 6/100 [00:36<09:41, 6.19s/it] 7%|▋ | 7/100 [00:42<09:21, 6.03s/it] {'loss': 1.8565, 'grad_norm': 9.83282470703125, 'learning_rate': 0.00018605581674502352, 'epoch': 0.0}
7%|▋ | 7/100 [00:42<09:21, 6.03s/it] 8%|▊ | 8/100 [00:49<09:55, 6.47s/it] {'loss': 1.6586, 'grad_norm': 2.558476686477661, 'learning_rate': 0.00018405521656496952, 'epoch': 0.01}
8%|▊ | 8/100 [00:49<09:55, 6.47s/it] 9%|▉ | 9/100 [00:54<09:07, 6.02s/it] {'loss': 1.6344, 'grad_norm': 6.1568074226379395, 'learning_rate': 0.00018205461638491548, 'epoch': 0.01}
9%|▉ | 9/100 [00:54<09:07, 6.02s/it] 10%|█ | 10/100 [00:59<08:26, 5.62s/it] {'loss': 1.5794, 'grad_norm': 3.443655014038086, 'learning_rate': 0.00018005401620486148, 'epoch': 0.01}
10%|█ | 10/100 [00:59<08:26, 5.62s/it] 11%|█ | 11/100 [01:09<10:26, 7.04s/it] {'loss': 1.9405, 'grad_norm': 1.871228814125061, 'learning_rate': 0.00017805341602480744, 'epoch': 0.01}
11%|█ | 11/100 [01:09<10:26, 7.04s/it] 12%|█▏ | 12/100 [01:14<09:29, 6.47s/it] {'loss': 1.8111, 'grad_norm': 3.363508939743042, 'learning_rate': 0.00017605281584475344, 'epoch': 0.01}
12%|█▏ | 12/100 [01:14<09:29, 6.47s/it] 13%|█▎ | 13/100 [01:19<08:38, 5.96s/it] {'loss': 2.0277, 'grad_norm': 3.621725082397461, 'learning_rate': 0.00017405221566469943, 'epoch': 0.01}
13%|█▎ | 13/100 [01:19<08:38, 5.96s/it] 14%|█▍ | 14/100 [01:25<08:41, 6.07s/it]{'loss': 1.5773, 'grad_norm': 2.713484764099121, 'learning_rate': 0.00017205161548464542, 'epoch': 0.01}
14%|█▍ | 14/100 [01:25<08:41, 6.07s/it] 15%|█▌ | 15/100 [01:32<08:59, 6.34s/it] {'loss': 1.4289, 'grad_norm': 1.932316541671753, 'learning_rate': 0.00017005101530459136, 'epoch': 0.01}
15%|█▌ | 15/100 [01:32<08:59, 6.34s/it] 16%|█▌ | 16/100 [01:37<08:20, 5.96s/it] {'loss': 1.8814, 'grad_norm': 2.5634851455688477, 'learning_rate': 0.00016805041512453736, 'epoch': 0.01}
16%|█▌ | 16/100 [01:37<08:20, 5.96s/it] 17%|█▋ | 17/100 [01:43<08:05, 5.84s/it] {'loss': 1.7249, 'grad_norm': 2.450714349746704, 'learning_rate': 0.00016604981494448335, 'epoch': 0.01}
17%|█▋ | 17/100 [01:43<08:05, 5.84s/it] 18%|█▊ | 18/100 [01:49<08:13, 6.02s/it] {'loss': 1.2833, 'grad_norm': 2.1919944286346436, 'learning_rate': 0.00016404921476442935, 'epoch': 0.01}
18%|█▊ | 18/100 [01:49<08:13, 6.02s/it] 19%|█▉ | 19/100 [01:55<07:51, 5.82s/it] {'loss': 1.7619, 'grad_norm': 3.202223300933838, 'learning_rate': 0.0001620486145843753, 'epoch': 0.01}
19%|█▉ | 19/100 [01:55<07:51, 5.82s/it] 20%|██ | 20/100 [02:03<08:45, 6.57s/it] {'loss': 1.5633, 'grad_norm': 2.8867547512054443, 'learning_rate': 0.0001600480144043213, 'epoch': 0.01}
20%|██ | 20/100 [02:03<08:45, 6.57s/it] 21%|██ | 21/100 [02:12<09:34, 7.28s/it] {'loss': 1.9644, 'grad_norm': 1.845523715019226, 'learning_rate': 0.0001580474142242673, 'epoch': 0.01}
21%|██ | 21/100 [02:12<09:34, 7.28s/it] 22%|██▏ | 22/100 [02:17<08:34, 6.60s/it] {'loss': 1.7121, 'grad_norm': 4.0029215812683105, 'learning_rate': 0.00015604681404421327, 'epoch': 0.01}
22%|██▏ | 22/100 [02:17<08:34, 6.60s/it] 23%|██▎ | 23/100 [02:24<08:38, 6.74s/it] {'loss': 1.5748, 'grad_norm': 2.1224937438964844, 'learning_rate': 0.00015404621386415926, 'epoch': 0.01}
23%|██▎ | 23/100 [02:24<08:38, 6.74s/it] 24%|██▍ | 24/100 [02:31<08:45, 6.91s/it] {'loss': 1.7229, 'grad_norm': 2.557976007461548, 'learning_rate': 0.00015204561368410523, 'epoch': 0.02}
24%|██▍ | 24/100 [02:31<08:45, 6.91s/it] 25%|██▌ | 25/100 [02:39<08:49, 7.06s/it] {'loss': 1.9085, 'grad_norm': 2.127469062805176, 'learning_rate': 0.00015004501350405122, 'epoch': 0.02}
25%|██▌ | 25/100 [02:39<08:49, 7.06s/it] 26%|██▌ | 26/100 [02:45<08:36, 6.98s/it] {'loss': 2.1835, 'grad_norm': 2.2739717960357666, 'learning_rate': 0.00014804441332399721, 'epoch': 0.02}
26%|██▌ | 26/100 [02:45<08:36, 6.98s/it] 27%|██▋ | 27/100 [02:54<08:55, 7.34s/it] {'loss': 1.2317, 'grad_norm': 2.188575506210327, 'learning_rate': 0.0001460438131439432, 'epoch': 0.02}
27%|██▋ | 27/100 [02:54<08:55, 7.34s/it] 28%|██▊ | 28/100 [03:02<09:07, 7.61s/it] {'loss': 1.8217, 'grad_norm': 2.111154317855835, 'learning_rate': 0.00014404321296388918, 'epoch': 0.02}
28%|██▊ | 28/100 [03:02<09:07, 7.61s/it] 29%|██▉ | 29/100 [03:09<08:47, 7.43s/it] {'loss': 1.6625, 'grad_norm': 3.9960992336273193, 'learning_rate': 0.00014204261278383514, 'epoch': 0.02}
29%|██▉ | 29/100 [03:09<08:47, 7.43s/it] 30%|███ | 30/100 [03:14<07:57, 6.83s/it] {'loss': 2.095, 'grad_norm': 3.2567949295043945, 'learning_rate': 0.00014004201260378114, 'epoch': 0.02}
30%|███ | 30/100 [03:14<07:57, 6.83s/it] 31%|███ | 31/100 [03:24<08:46, 7.64s/it] {'loss': 1.9226, 'grad_norm': 1.679755687713623, 'learning_rate': 0.00013804141242372713, 'epoch': 0.02}
31%|███ | 31/100 [03:24<08:46, 7.64s/it] 32%|███▏ | 32/100 [03:33<09:17, 8.20s/it] {'loss': 1.9101, 'grad_norm': 1.734764575958252, 'learning_rate': 0.00013604081224367312, 'epoch': 0.02}
32%|███▏ | 32/100 [03:33<09:17, 8.20s/it] 33%|███▎ | 33/100 [03:39<08:14, 7.38s/it] {'loss': 1.3596, 'grad_norm': 2.311835765838623, 'learning_rate': 0.0001340402120636191, 'epoch': 0.02}
33%|███▎ | 33/100 [03:39<08:14, 7.38s/it] 34%|███▍ | 34/100 [03:46<08:05, 7.36s/it] {'loss': 1.3964, 'grad_norm': 1.8594880104064941, 'learning_rate': 0.00013203961188356508, 'epoch': 0.02}
34%|███▍ | 34/100 [03:46<08:05, 7.36s/it] 35%|███▌ | 35/100 [03:55<08:32, 7.88s/it] {'loss': 1.7572, 'grad_norm': 1.7741910219192505, 'learning_rate': 0.00013003901170351108, 'epoch': 0.02}
35%|███▌ | 35/100 [03:55<08:32, 7.88s/it] 36%|███▌ | 36/100 [04:02<07:59, 7.49s/it]{'loss': 1.5212, 'grad_norm': 1.9263994693756104, 'learning_rate': 0.00012803841152345704, 'epoch': 0.02}
36%|███▌ | 36/100 [04:02<07:59, 7.49s/it] 37%|███▋ | 37/100 [04:08<07:19, 6.98s/it] {'loss': 1.4748, 'grad_norm': 2.182133913040161, 'learning_rate': 0.000126037811343403, 'epoch': 0.02}
37%|███▋ | 37/100 [04:08<07:19, 6.98s/it] 38%|███▊ | 38/100 [04:15<07:16, 7.04s/it] {'loss': 1.8599, 'grad_norm': 2.1574134826660156, 'learning_rate': 0.000124037211163349, 'epoch': 0.02}
38%|███▊ | 38/100 [04:15<07:16, 7.04s/it] 39%|███▉ | 39/100 [04:21<07:01, 6.92s/it] {'loss': 1.7871, 'grad_norm': 2.128430128097534, 'learning_rate': 0.000122036610983295, 'epoch': 0.02}
39%|███▉ | 39/100 [04:21<07:01, 6.92s/it] 40%|████ | 40/100 [04:28<06:43, 6.73s/it] {'loss': 1.8182, 'grad_norm': 2.207282543182373, 'learning_rate': 0.00012003601080324098, 'epoch': 0.03}
40%|████ | 40/100 [04:28<06:43, 6.73s/it] 41%|████ | 41/100 [04:35<06:38, 6.75s/it] {'loss': 1.7375, 'grad_norm': 1.9659651517868042, 'learning_rate': 0.00011803541062318697, 'epoch': 0.03}
41%|████ | 41/100 [04:35<06:38, 6.75s/it] 42%|████▏ | 42/100 [04:41<06:20, 6.56s/it] {'loss': 1.6187, 'grad_norm': 2.445244550704956, 'learning_rate': 0.00011603481044313295, 'epoch': 0.03}
42%|████▏ | 42/100 [04:41<06:20, 6.56s/it] 43%|████▎ | 43/100 [04:47<06:18, 6.64s/it] {'loss': 1.4641, 'grad_norm': 1.8128530979156494, 'learning_rate': 0.00011403421026307892, 'epoch': 0.03}
43%|████▎ | 43/100 [04:47<06:18, 6.64s/it] 44%|████▍ | 44/100 [04:55<06:27, 6.92s/it] {'loss': 1.6727, 'grad_norm': 1.966376781463623, 'learning_rate': 0.00011203361008302491, 'epoch': 0.03}
44%|████▍ | 44/100 [04:55<06:27, 6.92s/it] 45%|████▌ | 45/100 [05:02<06:14, 6.81s/it] {'loss': 1.6154, 'grad_norm': 3.1520309448242188, 'learning_rate': 0.0001100330099029709, 'epoch': 0.03}
45%|████▌ | 45/100 [05:02<06:14, 6.81s/it] 46%|████▌ | 46/100 [05:09<06:16, 6.98s/it] {'loss': 2.1585, 'grad_norm': 1.929316520690918, 'learning_rate': 0.00010803240972291689, 'epoch': 0.03}
46%|████▌ | 46/100 [05:09<06:16, 6.98s/it] 47%|████▋ | 47/100 [05:18<06:38, 7.51s/it] {'loss': 2.3206, 'grad_norm': 1.9876534938812256, 'learning_rate': 0.00010603180954286287, 'epoch': 0.03}
47%|████▋ | 47/100 [05:18<06:38, 7.51s/it] 48%|████▊ | 48/100 [05:25<06:22, 7.37s/it] {'loss': 1.7564, 'grad_norm': 2.1722543239593506, 'learning_rate': 0.00010403120936280886, 'epoch': 0.03}
48%|████▊ | 48/100 [05:25<06:22, 7.37s/it] 49%|████▉ | 49/100 [05:30<05:46, 6.79s/it] {'loss': 1.9552, 'grad_norm': 2.638056755065918, 'learning_rate': 0.00010203060918275482, 'epoch': 0.03}
49%|████▉ | 49/100 [05:30<05:46, 6.79s/it] 50%|█████ | 50/100 [05:36<05:21, 6.42s/it] {'loss': 1.779, 'grad_norm': 2.5376086235046387, 'learning_rate': 0.00010003000900270081, 'epoch': 0.03}
50%|█████ | 50/100 [05:36<05:21, 6.42s/it] 51%|█████ | 51/100 [05:42<05:09, 6.31s/it] {'loss': 2.0671, 'grad_norm': 2.3074777126312256, 'learning_rate': 9.802940882264679e-05, 'epoch': 0.03}
51%|█████ | 51/100 [05:42<05:09, 6.31s/it] 52%|█████▏ | 52/100 [05:51<05:49, 7.27s/it] {'loss': 1.6051, 'grad_norm': 1.6245694160461426, 'learning_rate': 9.602880864259278e-05, 'epoch': 0.03}
52%|█████▏ | 52/100 [05:51<05:49, 7.27s/it] 53%|█████▎ | 53/100 [05:58<05:31, 7.06s/it] {'loss': 2.0837, 'grad_norm': 2.74434757232666, 'learning_rate': 9.402820846253876e-05, 'epoch': 0.03}
53%|█████▎ | 53/100 [05:58<05:31, 7.06s/it] 54%|█████▍ | 54/100 [06:03<05:04, 6.61s/it] {'loss': 1.4009, 'grad_norm': 2.3917911052703857, 'learning_rate': 9.202760828248476e-05, 'epoch': 0.03}
54%|█████▍ | 54/100 [06:03<05:04, 6.61s/it] 55%|█████▌ | 55/100 [06:10<04:56, 6.60s/it] {'loss': 1.4145, 'grad_norm': 2.0445759296417236, 'learning_rate': 9.002700810243074e-05, 'epoch': 0.03}
55%|█████▌ | 55/100 [06:10<04:56, 6.60s/it] 56%|█████▌ | 56/100 [06:16<04:38, 6.33s/it] {'loss': 1.9497, 'grad_norm': 2.4165585041046143, 'learning_rate': 8.802640792237672e-05, 'epoch': 0.04}
56%|█████▌ | 56/100 [06:16<04:38, 6.33s/it] 57%|█████▋ | 57/100 [06:22<04:27, 6.22s/it] {'loss': 2.178, 'grad_norm': 4.071282386779785, 'learning_rate': 8.602580774232271e-05, 'epoch': 0.04}
57%|█████▋ | 57/100 [06:22<04:27, 6.22s/it] 58%|█████▊ | 58/100 [06:29<04:31, 6.47s/it] {'loss': 2.1401, 'grad_norm': 2.345099925994873, 'learning_rate': 8.402520756226868e-05, 'epoch': 0.04}
58%|█████▊ | 58/100 [06:29<04:31, 6.47s/it] 59%|█████▉ | 59/100 [06:35<04:27, 6.52s/it] {'loss': 1.5554, 'grad_norm': 2.0914130210876465, 'learning_rate': 8.202460738221467e-05, 'epoch': 0.04}
59%|█████▉ | 59/100 [06:35<04:27, 6.52s/it] 60%|██████ | 60/100 [06:42<04:28, 6.70s/it] {'loss': 1.7456, 'grad_norm': 2.224461555480957, 'learning_rate': 8.002400720216065e-05, 'epoch': 0.04}
60%|██████ | 60/100 [06:42<04:28, 6.70s/it] 61%|██████ | 61/100 [06:51<04:41, 7.22s/it] {'loss': 2.402, 'grad_norm': 2.2246463298797607, 'learning_rate': 7.802340702210663e-05, 'epoch': 0.04}
61%|██████ | 61/100 [06:51<04:41, 7.22s/it] 62%|██████▏ | 62/100 [06:57<04:23, 6.93s/it] {'loss': 2.2079, 'grad_norm': 2.5683395862579346, 'learning_rate': 7.602280684205261e-05, 'epoch': 0.04}
62%|██████▏ | 62/100 [06:57<04:23, 6.93s/it] 63%|██████▎ | 63/100 [07:07<04:46, 7.74s/it] {'loss': 1.5462, 'grad_norm': 1.8507758378982544, 'learning_rate': 7.402220666199861e-05, 'epoch': 0.04}
63%|██████▎ | 63/100 [07:07<04:46, 7.74s/it] 64%|██████▍ | 64/100 [07:14<04:36, 7.69s/it] {'loss': 1.6232, 'grad_norm': 2.7969861030578613, 'learning_rate': 7.202160648194459e-05, 'epoch': 0.04}
64%|██████▍ | 64/100 [07:14<04:36, 7.69s/it] 65%|██████▌ | 65/100 [07:20<04:10, 7.16s/it] {'loss': 1.6601, 'grad_norm': 2.431699514389038, 'learning_rate': 7.002100630189057e-05, 'epoch': 0.04}
65%|██████▌ | 65/100 [07:20<04:10, 7.16s/it] 66%|██████▌ | 66/100 [07:25<03:40, 6.50s/it] {'loss': 1.3301, 'grad_norm': 2.624474048614502, 'learning_rate': 6.802040612183656e-05, 'epoch': 0.04}
66%|██████▌ | 66/100 [07:25<03:40, 6.50s/it] 67%|██████▋ | 67/100 [07:32<03:37, 6.60s/it] {'loss': 1.8318, 'grad_norm': 2.211852550506592, 'learning_rate': 6.601980594178254e-05, 'epoch': 0.04}
67%|██████▋ | 67/100 [07:32<03:37, 6.60s/it] 68%|██████▊ | 68/100 [07:38<03:26, 6.45s/it] {'loss': 1.8032, 'grad_norm': 2.268580675125122, 'learning_rate': 6.401920576172852e-05, 'epoch': 0.04}
68%|██████▊ | 68/100 [07:38<03:26, 6.45s/it] 69%|██████▉ | 69/100 [07:43<03:08, 6.08s/it] {'loss': 1.4598, 'grad_norm': 2.5581161975860596, 'learning_rate': 6.20186055816745e-05, 'epoch': 0.04}
69%|██████▉ | 69/100 [07:43<03:08, 6.08s/it] 70%|███████ | 70/100 [07:51<03:20, 6.67s/it] {'loss': 1.4538, 'grad_norm': 1.7709766626358032, 'learning_rate': 6.001800540162049e-05, 'epoch': 0.04}
70%|███████ | 70/100 [07:51<03:20, 6.67s/it] 71%|███████ | 71/100 [07:59<03:23, 7.03s/it] {'loss': 1.9253, 'grad_norm': 1.7000513076782227, 'learning_rate': 5.801740522156648e-05, 'epoch': 0.04}
71%|███████ | 71/100 [07:59<03:23, 7.03s/it] 72%|███████▏ | 72/100 [08:07<03:25, 7.35s/it] {'loss': 1.9427, 'grad_norm': 1.5653332471847534, 'learning_rate': 5.601680504151246e-05, 'epoch': 0.05}
72%|███████▏ | 72/100 [08:07<03:25, 7.35s/it] 73%|███████▎ | 73/100 [08:16<03:28, 7.72s/it] {'loss': 1.7858, 'grad_norm': 1.6286309957504272, 'learning_rate': 5.4016204861458444e-05, 'epoch': 0.05}
73%|███████▎ | 73/100 [08:16<03:28, 7.72s/it] 74%|███████▍ | 74/100 [08:22<03:08, 7.26s/it] {'loss': 1.7306, 'grad_norm': 1.8845025300979614, 'learning_rate': 5.201560468140443e-05, 'epoch': 0.05}
74%|███████▍ | 74/100 [08:22<03:08, 7.26s/it] 75%|███████▌ | 75/100 [08:29<02:57, 7.09s/it] {'loss': 1.4665, 'grad_norm': 2.0926172733306885, 'learning_rate': 5.0015004501350405e-05, 'epoch': 0.05}
75%|███████▌ | 75/100 [08:29<02:57, 7.09s/it] 76%|███████▌ | 76/100 [08:36<02:50, 7.11s/it] {'loss': 1.6459, 'grad_norm': 1.7403364181518555, 'learning_rate': 4.801440432129639e-05, 'epoch': 0.05}
76%|███████▌ | 76/100 [08:36<02:50, 7.11s/it] 77%|███████▋ | 77/100 [08:43<02:45, 7.18s/it] {'loss': 1.7293, 'grad_norm': 1.8789687156677246, 'learning_rate': 4.601380414124238e-05, 'epoch': 0.05}
77%|███████▋ | 77/100 [08:43<02:45, 7.18s/it] 78%|███████▊ | 78/100 [08:51<02:38, 7.23s/it] {'loss': 1.7452, 'grad_norm': 1.7817176580429077, 'learning_rate': 4.401320396118836e-05, 'epoch': 0.05}
78%|███████▊ | 78/100 [08:51<02:38, 7.23s/it] 79%|███████▉ | 79/100 [08:56<02:22, 6.76s/it] {'loss': 1.5719, 'grad_norm': 1.994706392288208, 'learning_rate': 4.201260378113434e-05, 'epoch': 0.05}
79%|███████▉ | 79/100 [08:56<02:22, 6.76s/it] 80%|████████ | 80/100 [09:04<02:22, 7.14s/it] {'loss': 1.4432, 'grad_norm': 1.6284862756729126, 'learning_rate': 4.0012003601080326e-05, 'epoch': 0.05}
80%|████████ | 80/100 [09:04<02:22, 7.14s/it] 81%|████████ | 81/100 [09:12<02:19, 7.34s/it] {'loss': 1.4348, 'grad_norm': 1.7171870470046997, 'learning_rate': 3.801140342102631e-05, 'epoch': 0.05}
81%|████████ | 81/100 [09:12<02:19, 7.34s/it] 82%|████████▏ | 82/100 [09:18<02:03, 6.88s/it] {'loss': 1.6557, 'grad_norm': 2.716853141784668, 'learning_rate': 3.6010803240972294e-05, 'epoch': 0.05}
82%|████████▏ | 82/100 [09:18<02:03, 6.88s/it] 83%|████████▎ | 83/100 [09:26<02:01, 7.14s/it] {'loss': 1.741, 'grad_norm': 1.697729229927063, 'learning_rate': 3.401020306091828e-05, 'epoch': 0.05}
83%|████████▎ | 83/100 [09:26<02:01, 7.14s/it] 84%|████████▍ | 84/100 [09:34<01:59, 7.48s/it] {'loss': 1.592, 'grad_norm': 1.5702276229858398, 'learning_rate': 3.200960288086426e-05, 'epoch': 0.05}
84%|████████▍ | 84/100 [09:34<01:59, 7.48s/it] 85%|████████▌ | 85/100 [09:40<01:44, 7.00s/it] {'loss': 1.487, 'grad_norm': 2.2214341163635254, 'learning_rate': 3.0009002700810245e-05, 'epoch': 0.05}
85%|████████▌ | 85/100 [09:40<01:44, 7.00s/it] 86%|████████▌ | 86/100 [09:48<01:40, 7.20s/it] {'loss': 1.3567, 'grad_norm': 1.6150580644607544, 'learning_rate': 2.800840252075623e-05, 'epoch': 0.05}
86%|████████▌ | 86/100 [09:48<01:40, 7.20s/it] 87%|████████▋ | 87/100 [09:55<01:33, 7.17s/it] {'loss': 1.5584, 'grad_norm': 1.9342750310897827, 'learning_rate': 2.6007802340702216e-05, 'epoch': 0.05}
87%|████████▋ | 87/100 [09:55<01:33, 7.17s/it] 88%|████████▊ | 88/100 [10:03<01:31, 7.64s/it] {'loss': 1.3003, 'grad_norm': 1.474592685699463, 'learning_rate': 2.4007202160648196e-05, 'epoch': 0.06}
88%|████████▊ | 88/100 [10:03<01:31, 7.64s/it] 89%|████████▉ | 89/100 [10:09<01:16, 6.98s/it] {'loss': 1.365, 'grad_norm': 2.2492763996124268, 'learning_rate': 2.200660198059418e-05, 'epoch': 0.06}
89%|████████▉ | 89/100 [10:09<01:16, 6.98s/it] 90%|█████████ | 90/100 [10:15<01:08, 6.81s/it] {'loss': 1.8737, 'grad_norm': 1.9495824575424194, 'learning_rate': 2.0006001800540163e-05, 'epoch': 0.06}
90%|█████████ | 90/100 [10:15<01:08, 6.81s/it] 91%|█████████ | 91/100 [10:24<01:07, 7.46s/it] {'loss': 2.258, 'grad_norm': 1.4941810369491577, 'learning_rate': 1.8005401620486147e-05, 'epoch': 0.06}
91%|█████████ | 91/100 [10:24<01:07, 7.46s/it] 92%|█████████▏| 92/100 [10:31<00:57, 7.13s/it] {'loss': 1.8555, 'grad_norm': 2.0222268104553223, 'learning_rate': 1.600480144043213e-05, 'epoch': 0.06}
92%|█████████▏| 92/100 [10:31<00:57, 7.13s/it] 93%|█████████▎| 93/100 [10:37<00:48, 6.96s/it] {'loss': 1.7721, 'grad_norm': 2.419062614440918, 'learning_rate': 1.4004201260378114e-05, 'epoch': 0.06}
93%|█████████▎| 93/100 [10:37<00:48, 6.96s/it] 94%|█████████▍| 94/100 [10:45<00:42, 7.09s/it] {'loss': 1.3962, 'grad_norm': 1.6081466674804688, 'learning_rate': 1.2003601080324098e-05, 'epoch': 0.06}
94%|█████████▍| 94/100 [10:45<00:42, 7.09s/it] 95%|█████████▌| 95/100 [10:51<00:34, 6.85s/it] {'loss': 1.6754, 'grad_norm': 2.2233669757843018, 'learning_rate': 1.0003000900270082e-05, 'epoch': 0.06}
95%|█████████▌| 95/100 [10:51<00:34, 6.85s/it] 96%|█████████▌| 96/100 [10:58<00:28, 7.07s/it] {'loss': 1.8344, 'grad_norm': 1.9766446352005005, 'learning_rate': 8.002400720216065e-06, 'epoch': 0.06}
96%|█████████▌| 96/100 [10:58<00:28, 7.07s/it] 97%|█████████▋| 97/100 [11:05<00:20, 6.87s/it] {'loss': 1.881, 'grad_norm': 1.9126064777374268, 'learning_rate': 6.001800540162049e-06, 'epoch': 0.06}
97%|█████████▋| 97/100 [11:05<00:20, 6.87s/it] 98%|█████████▊| 98/100 [11:13<00:14, 7.40s/it] {'loss': 1.8616, 'grad_norm': 1.7095757722854614, 'learning_rate': 4.001200360108033e-06, 'epoch': 0.06}
98%|█████████▊| 98/100 [11:13<00:14, 7.40s/it] 99%|█████████▉| 99/100 [11:21<00:07, 7.46s/it] {'loss': 2.0069, 'grad_norm': 2.0039148330688477, 'learning_rate': 2.0006001800540163e-06, 'epoch': 0.06}
99%|█████████▉| 99/100 [11:21<00:07, 7.46s/it]100%|██████████| 100/100 [11:27<00:00, 7.08s/it] {'loss': 1.5449, 'grad_norm': 1.942215085029602, 'learning_rate': 0.0, 'epoch': 0.06}
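The `learning_rate` values logged above drop by the same amount at every step and reach 0.0 at step 100, which is consistent with a linear decay. The sketch below reproduces a few logged values from the step-99 value alone. The helper `logged_lr` is hypothetical, and the scheduler and base learning rate actually configured are not shown in this log.

```python
TOTAL_STEPS = 100
# Copied from the step-99 log line; also the per-step decrement if decay is linear.
LR_AT_STEP_99 = 2.0006001800540163e-06


def logged_lr(step: int) -> float:
    """Hypothetical helper: linear decay that reaches 0.0 at TOTAL_STEPS."""
    return (TOTAL_STEPS - step) * LR_AT_STEP_99


# A few (step, learning_rate) pairs copied from the log lines above.
logged = {
    1: 0.00019805941782534764,
    2: 0.00019605881764529358,
    50: 0.00010003000900270081,
    100: 0.0,
}

for step, expected in logged.items():
    matches = abs(logged_lr(step) - expected) < 1e-12
    print(step, logged_lr(step), matches)
```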
100%|██████████| 100/100 [11:27<00:00, 7.08s/it] {'train_runtime': 689.2624, 'train_samples_per_second': 0.58, 'train_steps_per_second': 0.145, 'train_loss': 1.7498149812221526, 'epoch': 0.06}
100%|██████████| 100/100 [11:29<00:00, 7.08s/it]100%|██████████| 100/100 [11:29<00:00, 6.89s/it]
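The summary dictionary above can be cross-checked against the step count and the train size reported earlier in the log. This is a rough sketch: the samples-per-step figure is inferred from the rounded `train_samples_per_second` value, not read from the training configuration, so treat it as an estimate.

```python
# Values copied from the final training summary and the "Train size" line.
summary = {
    "train_runtime": 689.2624,
    "train_samples_per_second": 0.58,
    "train_steps_per_second": 0.145,
    "epoch": 0.06,
}
total_steps = 100
train_size = 6399

steps_per_second = total_steps / summary["train_runtime"]
seconds_per_step = summary["train_runtime"] / total_steps

# Inferred, since the logged samples/s value is rounded to two decimals.
samples_seen = summary["train_samples_per_second"] * summary["train_runtime"]
samples_per_step = round(samples_seen / total_steps)

# Fraction of one pass over the training set covered by this run.
epoch_fraction = samples_per_step * total_steps / train_size

print(f"steps/s: {steps_per_second:.3f}, s/step: {seconds_per_step:.2f}")
print(f"samples per step (inferred): {samples_per_step}")
print(f"epoch fraction: {epoch_fraction:.2f}")
```

Under that inference, the run processed roughly 400 of the 6399 training examples, which agrees with the logged `epoch` value of 0.06.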
Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 6.05s/it]