Replies: 4 comments · 5 replies
-
Isn't
-
In the second trial, increasing regularization, i.e., increasing the weight decay and applying early stopping (especially the latter), seems to have helped, IMHO.
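For reference, a minimal sketch of how that could be wired up with the transformers Trainer, assuming the training script (e.g. run_clip.py) is edited to register the callback. The patience, step counts, output_dir, and weight decay values below are illustrative only, and `model`, `train_dataset`, and `eval_dataset` are assumed to come from the surrounding script:

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

# Illustrative values; only the fields relevant to early stopping are shown.
training_args = TrainingArguments(
    output_dir="clip-early-stop",        # hypothetical output dir
    evaluation_strategy="steps",
    eval_steps=1000,                     # evaluate often enough for the callback to react
    save_strategy="steps",
    save_steps=1000,                     # must line up with eval_steps when loading the best model
    load_best_model_at_end=True,         # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    weight_decay=0.2,                    # the increased weight decay tried in the second trial
)

trainer = Trainer(
    model=model,                         # assumed to be defined by the surrounding script
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Stop training if eval_loss has not improved for 3 consecutive evaluations.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```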
-
Yes, but excessive weight decay or overly aggressive early stopping is not a good approach either.
-
I am training a CLIP model, but the eval_loss is diverging after 25K steps. Here are a few scenarios I have tried.

This was trained with the following config:
```bash
CUDA_VISIBLE_DEVICES="0, 1, 2, 3" accelerate launch --mixed_precision="fp16" run_clip.py \
    --max_grad_norm 0.9 \
    --num_train_epochs 1500 \
    --output_dir ~/clip-sa-base4 \
    --model_name_or_path /home/user/clip-sa2 \
    --tokenizer_name /home/user/clip-sanskrit-data/checkpoint-37920000 \
    --train_file /home/user/clip-sanskrit-data/annotations/train2017.json \
    --validation_file /home/user/data/annotations/captions_val2017.json \
    --image_column image_path \
    --caption_column captions \
    --remove_unused_columns=False \
    --torch_compile=True \
    --load_best_model_at_end=True \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --dataloader_drop_last True \
    --save_total_limit 10 \
    --no_cuda False \
    --do_train --do_eval \
    --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
    --overwrite_output_dir
```
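For readers less familiar with these flags, they map roughly onto transformers.TrainingArguments as sketched below. This is an illustrative mapping only, under the assumption that run_clip.py parses them with HfArgumentParser; data- and model-specific flags such as --image_column are handled by the script itself and omitted here:

```python
from transformers import TrainingArguments

# Rough equivalent of the launch flags above; values copied from the command line.
training_args = TrainingArguments(
    output_dir="~/clip-sa-base4",
    num_train_epochs=1500,
    learning_rate=5e-5,
    weight_decay=0.1,
    max_grad_norm=0.9,
    warmup_steps=0,
    fp16=True,                    # from accelerate's --mixed_precision="fp16"
    evaluation_strategy="steps",
    save_strategy="steps",
    save_total_limit=10,
    load_best_model_at_end=True,
    dataloader_drop_last=True,
    remove_unused_columns=False,
    torch_compile=True,
    no_cuda=False,
    do_train=True,
    do_eval=True,
    overwrite_output_dir=True,
)
```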
For the second trial, I added a max_grad_norm of 0.9 and a weight decay of 0.2.

3rd run: changed the learning rate to 5e-7, the weight decay to 0.1, and max_grad_norm to 1.

Can somebody help me with how to reduce or eliminate this overfitting? TIA