Optionally Reduce LR on plateau #25

Draft · wants to merge 64 commits into base: master

Conversation

johnlockejrr

Learning Rate Warmup and Optimization Implementation

Overview

Added learning rate warmup functionality to improve training stability, especially when using pretrained weights. The implementation uses TensorFlow's native learning rate scheduling for better performance.

Changes Made

1. Configuration Updates (runs/train_no_patches_448x448.json)

Added new configuration parameters for warmup:

```json
{
    "warmup_enabled": true,
    "warmup_epochs": 5,
    "warmup_start_lr": 1e-6
}
```
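
For context, a minimal sketch of how these keys could be read in `train.py`; the repository's actual config loading may differ, and the fallback values here are only illustrative:

```python
import json

# Illustrative loader: read the warmup parameters alongside the existing ones,
# falling back to "warmup disabled" when the keys are absent from older configs.
with open("runs/train_no_patches_448x448.json") as f:
    config = json.load(f)

warmup_enabled = config.get("warmup_enabled", False)
warmup_epochs = config.get("warmup_epochs", 5)
warmup_start_lr = config.get("warmup_start_lr", 1e-6)
learning_rate = config.get("learning_rate", 5e-5)  # target LR after warmup
```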

2. Training Script Updates (train.py)

A. Optimizer and Learning Rate Schedule

  • Replaced fixed learning rate with dynamic scheduling
  • Implemented warmup using tf.keras.optimizers.schedules.PolynomialDecay
  • Maintained compatibility with existing ReduceLROnPlateau and EarlyStopping
```python
if warmup_enabled:
    lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=warmup_start_lr,
        decay_steps=warmup_epochs * steps_per_epoch,
        end_learning_rate=learning_rate,
        power=1.0  # Linear decay
    )
    optimizer = Adam(learning_rate=lr_schedule)
else:
    optimizer = Adam(learning_rate=learning_rate)
```
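
The snippet above assumes `steps_per_epoch` is already available. It is simply the number of optimizer updates per epoch; the values below are hypothetical, chosen only so the arithmetic matches the 1194 steps/epoch visible in the training log further down:

```python
import math

# Hypothetical example values; in train.py they come from the dataset and the config.
n_train_samples = 9552
batch_size = 8

# One step = one optimizer update, so an epoch has ceil(samples / batch_size) steps.
steps_per_epoch = math.ceil(n_train_samples / batch_size)  # -> 1194
```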

B. Learning Rate Behavior

  • Initial learning rate: 1e-6 (configurable via warmup_start_lr)
  • Target learning rate: 5e-5 (configurable via learning_rate)
  • Linear increase over 5 epochs (configurable via warmup_epochs)
  • After warmup, learning rate remains at target value until ReduceLROnPlateau triggers
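
A quick sanity check of that behaviour, using the default values listed above (the `steps_per_epoch` value is hypothetical, as before):

```python
import tensorflow as tf

warmup_start_lr, learning_rate = 1e-6, 5e-5
warmup_epochs, steps_per_epoch = 5, 1194

lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=warmup_start_lr,
    decay_steps=warmup_epochs * steps_per_epoch,
    end_learning_rate=learning_rate,
    power=1.0,  # linear ramp
)

# The schedule is indexed by global step, not by epoch.
for epoch in (0, 1, 3, 5, 10):
    step = epoch * steps_per_epoch
    print(f"epoch {epoch:2d}: lr = {lr_schedule(step).numpy():.2e}")
# Rises linearly from 1e-6 to 5e-5 over the first 5 epochs, then holds at 5e-5
# (PolynomialDecay clamps the step at decay_steps unless cycle=True).
```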

Benefits

  1. Improved training stability during initial epochs
  2. Better handling of pretrained weights
  3. Efficient implementation using TensorFlow's native scheduling
  4. Configurable through JSON configuration file
  5. Maintains compatibility with existing callbacks (ReduceLROnPlateau, EarlyStopping)

Usage

To enable warmup:

  1. Set warmup_enabled: true in the configuration file
  2. Adjust warmup_epochs and warmup_start_lr as needed
  3. The warmup will automatically integrate with existing learning rate reduction and early stopping

To disable warmup:

  • Set warmup_enabled: false or remove the warmup parameters from the configuration file

avoid ensembling if no model weights met the threshold f1 score in the case of classification
…n an output directory with the same file name
vahidrezanezhad and others added 29 commits June 12, 2024 17:40
Changed unsafe basename extraction:
`file_name = i.split('.')[0]` to `file_name = os.path.splitext(i)[0]`
and
`filename = n[i].split('.')[0]` to `filename = os.path.splitext(n[i])[0]`
because `Vat.sam.2_206.jpg` -> `Vat` instead of `Vat.sam.2_206`.
This safely keeps the full basename without the extension.
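
A minimal illustration of why the change matters:

```python
import os

name = "Vat.sam.2_206.jpg"

print(name.split('.')[0])         # 'Vat'            (truncates at the first dot)
print(os.path.splitext(name)[0])  # 'Vat.sam.2_206'  (strips only the final extension)
```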
# Training Script Improvements

## Learning Rate Management Fixes

### 1. ReduceLROnPlateau Implementation
- Fixed the learning rate reduction mechanism by replacing the manual epoch loop with a single `model.fit()` call
- This ensures proper tracking of validation metrics across epochs
- Configured with:
  ```python
  reduce_lr = ReduceLROnPlateau(
      monitor='val_loss',
      factor=0.2,        # More aggressive reduction
      patience=3,        # Quick response to plateaus
      min_lr=1e-6,       # Minimum learning rate
      min_delta=1e-5,    # Minimum change to be considered improvement
      verbose=1
  )
  ```

### 2. Warmup Implementation
- Added learning rate warmup using TensorFlow's native scheduling
- Gradually increases learning rate from 1e-6 to target (2e-5) over 5 epochs
- Helps stabilize initial training phase
- Implemented using `PolynomialDecay` schedule:
  ```python
  lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
      initial_learning_rate=warmup_start_lr,
      decay_steps=warmup_epochs * steps_per_epoch,
      end_learning_rate=learning_rate,
      power=1.0  # Linear decay
  )
  ```

### 3. Early Stopping
- Added early stopping to prevent overfitting
- Configured with:
  ```python
  early_stopping = EarlyStopping(
      monitor='val_loss',
      patience=6,
      restore_best_weights=True,
      verbose=1
  )
  ```
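
As noted above, all of this is wired into a single `model.fit()` call instead of a manual epoch loop. A sketch of that wiring (the dataset and callback variable names are illustrative; the per-epoch checkpoint callback is described in the next section):

```python
# Sketch only: the actual call in train.py may pass additional arguments.
callbacks = [reduce_lr, early_stopping]  # plus the per-epoch checkpoint callback

history = model.fit(
    train_dataset,                 # training data (name assumed)
    validation_data=val_dataset,   # validation data (name assumed)
    epochs=n_epochs,
    callbacks=callbacks,
)
```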

## Model Saving Improvements

### 1. Epoch-based Model Saving
- Implemented custom `ModelCheckpointWithConfig` to save both model and config
- Saves after each epoch with corresponding config.json
- Maintains compatibility with original script's saving behavior
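
The callback itself is not shown in this description, so the following is only a plausible shape for `ModelCheckpointWithConfig` given the behaviour described above (save the model plus the run's config.json after every epoch); the real implementation in `train.py` may differ:

```python
import json
import os
import tensorflow as tf

class ModelCheckpointWithConfig(tf.keras.callbacks.Callback):
    """Sketch: save the model and the run's config after every epoch."""

    def __init__(self, output_dir, config):
        super().__init__()
        self.output_dir = output_dir  # e.g. the run directory
        self.config = config          # dict loaded from the JSON config

    def on_epoch_end(self, epoch, logs=None):
        # SavedModel directory per epoch, matching the "model_N" paths in the log below.
        model_dir = os.path.join(self.output_dir, f"model_{epoch + 1}")
        self.model.save(model_dir)
        with open(os.path.join(model_dir, "config.json"), "w") as f:
            json.dump(self.config, f, indent=4)
```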

### 2. Best Model Saving
- Saves the best model at training end
- If early stopping triggers: saves the best model from training
- If no early stopping: saves the final model
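
A sketch of that end-of-training save (the output path is illustrative). With `restore_best_weights=True`, EarlyStopping has already restored the best weights by the time training returns, so a plain save captures the best model; without early stopping it simply saves the final weights:

```python
# Sketch: by this point EarlyStopping (restore_best_weights=True) has already
# put the best weights back into `model` if it triggered.
model.save(os.path.join(output_dir, "model_best"))  # output_dir is illustrative
```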

## Configuration
All parameters are configurable through the JSON config file:
```json
{
    "reduce_lr_enabled": true,
    "reduce_lr_monitor": "val_loss",
    "reduce_lr_factor": 0.2,
    "reduce_lr_patience": 3,
    "reduce_lr_min_lr": 1e-6,
    "reduce_lr_min_delta": 1e-5,
    "early_stopping_enabled": true,
    "early_stopping_monitor": "val_loss",
    "early_stopping_patience": 6,
    "early_stopping_restore_best_weights": true,
    "warmup_enabled": true,
    "warmup_epochs": 5,
    "warmup_start_lr": 1e-6
}
```
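
A sketch of how these keys might map onto the callbacks (key names as above; the `config` dict and its loading code are assumed, not shown in this PR):

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# `config` is assumed to be the dict loaded from the JSON file above.
callbacks = []

if config.get("reduce_lr_enabled", False):
    callbacks.append(ReduceLROnPlateau(
        monitor=config.get("reduce_lr_monitor", "val_loss"),
        factor=config.get("reduce_lr_factor", 0.2),
        patience=config.get("reduce_lr_patience", 3),
        min_lr=config.get("reduce_lr_min_lr", 1e-6),
        min_delta=config.get("reduce_lr_min_delta", 1e-5),
        verbose=1,
    ))

if config.get("early_stopping_enabled", False):
    callbacks.append(EarlyStopping(
        monitor=config.get("early_stopping_monitor", "val_loss"),
        patience=config.get("early_stopping_patience", 6),
        restore_best_weights=config.get("early_stopping_restore_best_weights", True),
        verbose=1,
    ))
```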

## Benefits
1. More stable training with proper learning rate management
2. Better handling of training plateaus
3. Automatic saving of best model
4. Maintained compatibility with existing config saving
5. Improved training monitoring and control
@johnlockejrr (Author)

With warmup, ReduceLROnPlateau, and early stopping enabled, the training process looks smoother and much more stable than with the default settings:

```
==================================================================================================
Total params: 38,211,247
Trainable params: 38,154,089
Non-trainable params: 57,158
__________________________________________________________________________________________________
Epoch 1/50
1194/1194 [==============================] - ETA: 0s - loss: 0.7081 - accuracy: 0.7516
Epoch 1: saving model to runs/sam_41_mss_npt_448x448/model_1
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_1/assets
1194/1194 [==============================] - 396s 302ms/step - loss: 0.7081 - accuracy: 0.7516 - val_loss: 0.5615 - val_accuracy: 0.8679 - lr: 4.7968e-06
Epoch 2/50
1194/1194 [==============================] - ETA: 0s - loss: 0.5898 - accuracy: 0.8990
Epoch 2: saving model to runs/sam_41_mss_npt_448x448/model_2
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_2/assets
1194/1194 [==============================] - 357s 299ms/step - loss: 0.5898 - accuracy: 0.8990 - val_loss: 0.5170 - val_accuracy: 0.8843 - lr: 8.5968e-06
Epoch 3/50
1194/1194 [==============================] - ETA: 0s - loss: 0.5065 - accuracy: 0.9154
Epoch 3: saving model to runs/sam_41_mss_npt_448x448/model_3
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_3/assets
1194/1194 [==============================] - 356s 298ms/step - loss: 0.5065 - accuracy: 0.9154 - val_loss: 0.4268 - val_accuracy: 0.8995 - lr: 1.2397e-05
Epoch 4/50
1194/1194 [==============================] - ETA: 0s - loss: 0.4244 - accuracy: 0.9236
Epoch 4: saving model to runs/sam_41_mss_npt_448x448/model_4
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_4/assets
1194/1194 [==============================] - 357s 299ms/step - loss: 0.4244 - accuracy: 0.9236 - val_loss: 0.3657 - val_accuracy: 0.9017 - lr: 1.6197e-05
Epoch 5/50
1194/1194 [==============================] - ETA: 0s - loss: 0.3661 - accuracy: 0.9287
Epoch 5: saving model to runs/sam_41_mss_npt_448x448/model_5
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_5/assets
1194/1194 [==============================] - 358s 299ms/step - loss: 0.3661 - accuracy: 0.9287 - val_loss: 0.3451 - val_accuracy: 0.9161 - lr: 1.9997e-05
Epoch 6/50
1194/1194 [==============================] - ETA: 0s - loss: 0.3299 - accuracy: 0.9324
Epoch 6: saving model to runs/sam_41_mss_npt_448x448/model_6
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_6/assets
1194/1194 [==============================] - 356s 298ms/step - loss: 0.3299 - accuracy: 0.9324 - val_loss: 0.3256 - val_accuracy: 0.9052 - lr: 2.0000e-05
Epoch 7/50
1194/1194 [==============================] - ETA: 0s - loss: 0.3069 - accuracy: 0.9355
Epoch 7: saving model to runs/sam_41_mss_npt_448x448/model_7
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_7/assets
1194/1194 [==============================] - 357s 299ms/step - loss: 0.3069 - accuracy: 0.9355 - val_loss: 0.2609 - val_accuracy: 0.9106 - lr: 2.0000e-05
Epoch 8/50
1194/1194 [==============================] - ETA: 0s - loss: 0.2906 - accuracy: 0.9380
Epoch 8: saving model to runs/sam_41_mss_npt_448x448/model_8
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_8/assets
1194/1194 [==============================] - 356s 298ms/step - loss: 0.2906 - accuracy: 0.9380 - val_loss: 0.2852 - val_accuracy: 0.9165 - lr: 2.0000e-05
Epoch 9/50
1194/1194 [==============================] - ETA: 0s - loss: 0.2774 - accuracy: 0.9403
Epoch 9: saving model to runs/sam_41_mss_npt_448x448/model_9
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_9/assets
1194/1194 [==============================] - 358s 300ms/step - loss: 0.2774 - accuracy: 0.9403 - val_loss: 0.2608 - val_accuracy: 0.9229 - lr: 2.0000e-05
Epoch 10/50
1194/1194 [==============================] - ETA: 0s - loss: 0.2656 - accuracy: 0.9422
Epoch 10: saving model to runs/sam_41_mss_npt_448x448/model_10
INFO - tensorflow - Assets written to: runs/sam_41_mss_npt_448x448/model_10/assets
1194/1194 [==============================] - 357s 299ms/step - loss: 0.2656 - accuracy: 0.9422 - val_loss: 0.2218 - val_accuracy: 0.9341 - lr: 2.0000e-05
Epoch 11/50
 254/1194 [=====>........................] - ETA: 4:26 - loss: 0.2586 - accuracy: 0.9439
```

We'll see at training's end.
