
Add pt hf trainer #203


Open · wants to merge 141 commits into main from add_pt_hf_trainer

Conversation

@gkumbhat (Collaborator) commented Sep 21, 2023

closes: #175

Changes

  • Refactor the prompt tuning module to use HF Trainer instead of a custom training loop (a rough sketch of the Trainer-based flow is included after this list). This includes:
    • changing the data loader to the one we use for fine-tuning
    • changing the logging file logic
    • changing gradient accumulation
  • Refactor the Hugging Face trainer logic into a common utility function that is re-used between prompt tuning and fine-tuning
  • Remove unused code for custom tuning logic in the peft prompt tuning module
  • Add random_seed to the prompt tuning training API
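
For context, a minimal sketch of what the Trainer-based prompt tuning flow looks like. This is illustrative only: it uses plain transformers/peft classes rather than the caikit-nlp resource wrappers, and the dataset generator and hyperparameter values are made up for the example.

```python
from datasets import Dataset
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    set_seed,
)

set_seed(42)  # corresponds to the new random_seed argument on the training API

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Wrap the base model with a PEFT prompt tuning adapter
peft_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Recognize textual entailment:",
    num_virtual_tokens=8,
    tokenizer_name_or_path=model_name,
)
model = get_peft_model(base_model, peft_config)

def tokenized_examples():
    # Hypothetical stand-in for the shared fine-tuning data loader
    for source, target in [("A man is sleeping. entailment or not?", "not entailment")]:
        features = tokenizer(source, truncation=True, max_length=256)
        features["labels"] = tokenizer(target, truncation=True, max_length=512)["input_ids"]
        yield features

train_dataset = Dataset.from_generator(tokenized_examples)

training_args = TrainingArguments(
    output_dir="prompt_prefixes/test",
    num_train_epochs=1,
    learning_rate=0.3,
    per_device_train_batch_size=16,
)

# The common utility wraps roughly this: build a Trainer and run train()
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```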

Notes

  • HF datasets' dataset_type.from_generator doesn't accept an empty dataset and raises an error (see the small example after this list). For this reason, I had to modify train_stream in the tests to make them non-empty
  • Currently we will not be supporting evaluation
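
For reference, a tiny illustration of the from_generator behavior mentioned above (my own minimal example, not the actual test code):

```python
from datasets import Dataset

def non_empty_stream():
    # the test streams need at least one record so the schema can be inferred
    yield {"input": "Recognize textual entailment:", "output": "entailment"}

ds = Dataset.from_generator(non_empty_stream)
print(ds.num_rows)  # 1

def empty_stream():
    return
    yield  # a generator that yields nothing

# Dataset.from_generator(empty_stream)  # raises: no rows to infer features from
```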

Evaluation

  • Prompt Tuning:

    • Parameters:
    • Dataset:
    • Accuracy:
    • Matthews Correlation:
  • Fine tuning

    • Parameters:
    • Dataset:

TODO

  • Training is quite slow (~50x) compared to the manual training loop via HF Accelerate. This needs to be fixed
  • Saving the model after training doesn't work for PEFT. This needs to be resolved before merging this PR
  • We need to change the infer_steps function, since it currently doesn't compute steps the same way the Trainer would, i.e. it doesn't take gradient checkpointing into account 🤔 (a rough sketch of the step accounting follows this list)
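
On the infer_steps point, a rough sketch of how the Trainer derives the number of optimizer updates (my own approximation of the accounting, with gradient accumulation folded in; the function name and example numbers are hypothetical):

```python
import math

def approx_update_steps(
    num_examples: int,
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    num_epochs: int,
    num_devices: int = 1,
) -> int:
    """Approximate the number of optimizer updates HF Trainer would perform."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_epochs

# e.g. ~2490 RTE train examples, batch size 16, accumulation 16, 1 epoch, 1 GPU
print(approx_update_steps(2490, 16, 16, 1))  # -> 9
```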

@gkumbhat gkumbhat force-pushed the add_pt_hf_trainer branch 2 times, most recently from 9cdbace to 3bfd53c Compare September 27, 2023 23:49
@gkumbhat gkumbhat marked this pull request as ready for review September 28, 2023 15:57
@gkumbhat (Collaborator, Author) commented Oct 9, 2023

Thoughts on TODOs on the PR:

  1. Training is quite slow (~50x) compared to the manual training loop via HF Accelerate.
    The following are potential suspects, in order of probability (rough sketches of the quick test and the PEFT saving path are included at the end of this list):
    1. Misconfiguration of epochs / steps resulting in extra work in the gradient accumulation step. One way to quickly test whether this is related to a misconfiguration of steps / gradient accumulation would be to disable gradient accumulation by commenting out gradient_accumulation_steps and gradient_checkpointing and commenting out max_steps over here. If training becomes orders of magnitude faster, then this is the issue and we need to figure out how to turn gradient accumulation on efficiently. Note that we do need gradient accumulation to fit larger models in memory.
    2. Some pre-processing happens via the map function at the dataloader level; this could be causing data to move back and forth between CPU and GPU, which would explain the slowness.
    3. Side effects of how the Trainer parameters are configured, or misconfiguration of certain parameters, resulting in the slowness.
    4. It could also simply be a bug somewhere in the caikit-nlp code.
  2. Model saving issue.
    We are saving the trained model using the trainer's function towards the end, as can be seen here, which doesn't quite work for PEFT models. We need to figure out a common method to save the models, keeping in mind that the trainer.save_state function actually works across multi-GPU, so if there is a way to make this work for PEFT models, that would be ideal.
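
A rough sketch of both ideas above; the argument values are hypothetical and this is not the actual caikit-nlp code:

```python
from transformers import TrainingArguments

# 1. Quick slowness check: disable gradient accumulation / checkpointing and let
#    the Trainer derive the step count from num_train_epochs. If training becomes
#    orders of magnitude faster, the step/accumulation configuration is the culprit.
fast_check_args = TrainingArguments(
    output_dir="prompt_prefixes/test",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    # gradient_accumulation_steps=16,  # commented out for the experiment
    # gradient_checkpointing=True,     # commented out for the experiment
    # max_steps intentionally not set, so epochs drive the number of steps
)

# 2. Saving: PEFT models expose save_pretrained(), which writes only the adapter
#    (prompt) weights. Whether this plays well with multi-GPU saving the way
#    trainer.save_state does is exactly what still needs to be worked out.
def save_peft_adapter(trainer, output_dir: str) -> None:
    trainer.model.save_pretrained(output_dir)
```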

cc: @olson-ibm @ibm-peach-fish

@gkumbhat gkumbhat force-pushed the add_pt_hf_trainer branch 2 times, most recently from 88020a7 to 4106ba7 Compare October 19, 2023 19:55
@gkumbhat (Collaborator, Author) commented Nov 13, 2023

Some preliminary tests to be done before merging

  • Flan-t5-xl
    • Single GPU prompt tuning to check for:
      • compatibility
      • performance (training time)
        - Had to remove support for gradient accumulation steps. The accumulation-step logic in HF Trainer seems to make training much slower than the current main branch. Removing accumulation steps means we won't be able to process larger effective batches, which is particularly relevant when tuning larger models.
      • quality
    • Multi-GPU prompt tuning to check for:
      • compatibility
      • performance (training time)
      • quality
  • Llama 13B
    • Single GPU prompt tuning to check for:
      • compatibility
      • performance (training time)
      • quality
    • Multi-GPU prompt tuning to check for:
      • compatibility
      • performance (training time)
      • quality

Notes

  • With the auto_find_batch_size setting in Trainer, the trainer automatically tries to find a batch size that fits in memory. So we can provide even higher batch sizes, and if they don't fit in memory the trainer will take care of scaling them down automatically (a small illustration is included below).
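
A small illustration (values are made up; auto_find_batch_size requires accelerate to be installed):

```python
from transformers import TrainingArguments

# If the starting per-device batch size hits an out-of-memory error, the Trainer
# retries with a smaller batch size instead of failing outright.
args = TrainingArguments(
    output_dir="prompt_prefixes/test",
    per_device_train_batch_size=64,  # deliberately optimistic starting point
    auto_find_batch_size=True,
    num_train_epochs=1,
)
```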

@gkumbhat (Collaborator, Author) commented Nov 13, 2023

Notes from testing

Test parameters

  • Model Name: [google/flan-t5-xl]
    |- Inferred Model Resource Type: [<class 'caikit_nlp.resources.pretrained_model.hf_auto_seq2seq_lm.HFAutoSeq2SeqLM'>]
  • Tuning Type: [MULTITASK_PROMPT_TUNING]
  • Prompt Tuning Initialization Type [TEXT]
  • Number of Virtual Tokens: [8]
    |- Prompt Tuning initialization Text: [Recognize textual entailment:]
  • Dataset: [glue/rte]
  • Verbalizer: [rte { 0 : entailment, 1 : not entailment } {{input}}]
  • Number of Epochs: [1]
  • Learning Rate: [0.3]
  • Batch Size: [16]
  • Output Directory: [prompt_prefixes/test]
  • Exporting prompt only: [False]
  • Number of shots: [None]
  • Maximum source sequence length: [256]
  • Maximum target sequence length: [512]
  • Gradient accumulation steps: [16]

HF Trainer branch

  • Time: 2023-11-13T20:20:21.948010 to 2023-11-13T20:47:05.818565 (~27 minutes)
  • Notes:
    • 1 epoch of prompt tuning flan-t5-xl on a single GPU on the HF Trainer branch with the parameters above
    • With accumulation steps
    • No gradient checkpointing

Main branch (no HF Trainer)

  • Time: 2023-11-13T20:50:06.924492 to 2023-11-13T20:54:31.013823 (~4 minutes)
  • Notes:
    • Same parameters as above with the non-HF-Trainer approach (on current main)
    • With gradient checkpointing

HF Trainer branch

  • Time: 2023-11-13T20:56:41.132890 to 2023-11-13T20:58:52.220874 (~2 minutes)
  • Notes:
    • Same parameters as above with HF Trainer and accumulation steps set to 1
    • No gradient checkpointing

Main branch (no HF Trainer)

  • Time: 2023-11-13T21:00:26.299356 to 2023-11-13T21:05:14.446629 (~5 minutes)
  • Notes:
    • Same parameters as above with the non-HF-Trainer approach and accumulation steps set to 1
    • With gradient checkpointing

HF Trainer branch

  • Time: 2023-11-13T21:15:24.106248 to 2023-11-13T21:20:36.585468 (~5 minutes)
  • Notes:
    • Same parameters as above with HF Trainer and accumulation steps set to 1 (i.e. no accumulation)
    • With gradient checkpointing

Multi-GPU testing (2× A100-80G)

HF Trainer branch

  • Time: 5 minutes
  • Notes:
    • With gradient checkpointing

dtrifiro and others added 17 commits November 13, 2023 17:30
this avoids printing a deprecation warning

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
dtrifiro and others added 27 commits November 13, 2023 18:28
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
…text_func

"OTHER" is an invalid value for caikit.interfaces.nlp.data_model.text_generation.FinishReason,
resulting in failed serialization of responses when querying the text
generation endpoint.
For `generate_text_func`, it is reasonable to assume that if the finish
reason is not `EOS_TOKEN` or `STOP_SEQUENCE`, it must be `MAX_TOKENS`.

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
… configuration

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
fixes caikit#245

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Evaline Ju <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Evaline Ju <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
- add install subsection
- add model serving subsection
- cleanup docker section
- add configuration subsection

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Successfully merging this pull request may close these issues.

Replace training loop in peft_prompt_tuning with HF Trainer