
Add pt hf trainer #203


Open · wants to merge 141 commits into main from add_pt_hf_trainer

Conversation

@gkumbhat (Collaborator) commented Sep 21, 2023

closes: #175

Changes

  • Refactor the prompt tuning module to use HF Trainer instead of a custom training loop (a rough sketch of the Trainer-based flow is included after this list). This includes:
    • changing the data loader to the one we use for fine-tuning
    • changing the logging file logic
    • changing gradient accumulation
  • Refactor the Hugging Face trainer logic into a common utility function that is re-used between prompt tuning and fine-tuning
  • Remove unused code for custom tuning logic in the peft prompt tuning module
  • Add random_seed to the prompt tuning training API
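
For context, a minimal sketch of what the Trainer-based prompt tuning flow looks like. This is illustrative only: it uses plain transformers/peft classes rather than the caikit-nlp resource wrappers, and the dataset generator and hyperparameter values are made up for the example.

```python
from datasets import Dataset
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    set_seed,
)

set_seed(42)  # corresponds to the new random_seed argument on the training API

model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Wrap the base model with a PEFT prompt tuning adapter
peft_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Recognize textual entailment:",
    num_virtual_tokens=8,
    tokenizer_name_or_path=model_name,
)
model = get_peft_model(base_model, peft_config)

def tokenized_examples():
    # Hypothetical stand-in for the shared fine-tuning data loader
    for source, target in [("A man is sleeping. entailment or not?", "not entailment")]:
        features = tokenizer(source, truncation=True, max_length=256)
        features["labels"] = tokenizer(target, truncation=True, max_length=512)["input_ids"]
        yield features

train_dataset = Dataset.from_generator(tokenized_examples)

training_args = TrainingArguments(
    output_dir="prompt_prefixes/test",
    num_train_epochs=1,
    learning_rate=0.3,
    per_device_train_batch_size=16,
)

# The common utility wraps roughly this: build a Trainer and run train()
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```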

Notes

  • HF datasets' dataset_type.from_generator doesn't accept an empty dataset and raises an error (see the small example after this list). For this reason, I had to modify train_stream in the tests to make them non-empty
  • Currently we will not be supporting evaluation
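
For reference, a tiny illustration of the from_generator behavior mentioned above (my own minimal example, not the actual test code):

```python
from datasets import Dataset

def non_empty_stream():
    # the test streams need at least one record so the schema can be inferred
    yield {"input": "Recognize textual entailment:", "output": "entailment"}

ds = Dataset.from_generator(non_empty_stream)
print(ds.num_rows)  # 1

def empty_stream():
    return
    yield  # a generator that yields nothing

# Dataset.from_generator(empty_stream)  # raises: no rows to infer features from
```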

Evaluation

  • Prompt Tuning:

    • Parameters:
    • Dataset:
    • Accuracy:
    • Matthews Correlation:
  • Fine tuning

    • Parameters:
    • Dataset:

TODO

  • Training is quite slow (~50x) compared to the manual training loop via HF Accelerate. This needs to be fixed
  • Saving the model after training doesn't work for PEFT. This needs to be resolved before merging this PR
  • We need to change the infer_steps function, since it currently doesn't compute steps the same way the Trainer would, i.e. it doesn't take gradient checkpointing into account 🤔 (a rough sketch of the step accounting follows this list)
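
On the infer_steps point, a rough sketch of how the Trainer derives the number of optimizer updates (my own approximation of the accounting, with gradient accumulation folded in; the function name and example numbers are hypothetical):

```python
import math

def approx_update_steps(
    num_examples: int,
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    num_epochs: int,
    num_devices: int = 1,
) -> int:
    """Approximate the number of optimizer updates HF Trainer would perform."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_epochs

# e.g. ~2490 RTE train examples, batch size 16, accumulation 16, 1 epoch, 1 GPU
print(approx_update_steps(2490, 16, 16, 1))  # -> 9
```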

@gkumbhat gkumbhat force-pushed the add_pt_hf_trainer branch 2 times, most recently from 9cdbace to 3bfd53c Compare September 27, 2023 23:49
@gkumbhat gkumbhat marked this pull request as ready for review September 28, 2023 15:57
@gkumbhat (Collaborator, Author) commented Oct 9, 2023

Thoughts on TODOs on the PR:

  1. Training is quite slow (~50x) compared to the manual training loop via HF Accelerate.
    The following are potential suspects, in order of probability (rough sketches of the quick test and the PEFT saving path are included at the end of this list):
    1. Misconfiguration of epochs / steps resulting in extra work in the gradient accumulation step. One way to quickly test whether this is related to a misconfiguration of steps / gradient accumulation would be to disable gradient accumulation by commenting out gradient_accumulation_steps and gradient_checkpointing and commenting out max_steps over here. If training becomes orders of magnitude faster, then this is the issue and we need to figure out how to turn gradient accumulation on efficiently. Note that we do need gradient accumulation to fit larger models in memory.
    2. Some pre-processing happens via the map function at the dataloader level; this could be causing data to move back and forth between CPU and GPU, which would explain the slowness.
    3. Side effects of how the Trainer parameters are configured, or misconfiguration of certain parameters, resulting in the slowness.
    4. It could also simply be a bug somewhere in the caikit-nlp code.
  2. Model saving issue.
    We are saving the trained model using the trainer's function towards the end, as can be seen here, which doesn't quite work for PEFT models. We need to figure out a common method to save the models, keeping in mind that the trainer.save_state function actually works across multi-GPU, so if there is a way to make this work for PEFT models, that would be ideal.
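
A rough sketch of both ideas above; the argument values are hypothetical and this is not the actual caikit-nlp code:

```python
from transformers import TrainingArguments

# 1. Quick slowness check: disable gradient accumulation / checkpointing and let
#    the Trainer derive the step count from num_train_epochs. If training becomes
#    orders of magnitude faster, the step/accumulation configuration is the culprit.
fast_check_args = TrainingArguments(
    output_dir="prompt_prefixes/test",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    # gradient_accumulation_steps=16,  # commented out for the experiment
    # gradient_checkpointing=True,     # commented out for the experiment
    # max_steps intentionally not set, so epochs drive the number of steps
)

# 2. Saving: PEFT models expose save_pretrained(), which writes only the adapter
#    (prompt) weights. Whether this plays well with multi-GPU saving the way
#    trainer.save_state does is exactly what still needs to be worked out.
def save_peft_adapter(trainer, output_dir: str) -> None:
    trainer.model.save_pretrained(output_dir)
```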

cc: @olson-ibm @ibm-peach-fish

@gkumbhat gkumbhat force-pushed the add_pt_hf_trainer branch 2 times, most recently from 88020a7 to 4106ba7 Compare October 19, 2023 19:55
@gkumbhat (Collaborator, Author) commented Nov 13, 2023

Some preliminary tests to be done before merging

  • Flan-t5-xl
    • Single GPU prompt tuning to check for:
      • compatibility
      • performance (training time)
        - Had to remove support for gradient accumulation steps. The accumulation-step logic in HF Trainer seems to make training much slower than the current main branch. Removing accumulation steps means we won't be able to process larger effective batches, which is particularly relevant when tuning larger models.
      • quality
    • Multi-GPU prompt tuning to check for:
      • compatibility
      • performance (training time)
      • quality
  • Llama 13B
    • Single GPU prompt tuning to check for:
      • compatibility
      • performance (training time)
      • quality
    • Multi-GPU prompt tuning to check for:
      • compatibility
      • performance (training time)
      • quality

Notes

  • With the auto_find_batch_size setting in Trainer, the trainer automatically tries to find a batch size that fits in memory. So we can provide even higher batch sizes, and if they don't fit in memory the trainer will take care of scaling them down automatically (a small illustration is included below).
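
A small illustration (values are made up; auto_find_batch_size requires accelerate to be installed):

```python
from transformers import TrainingArguments

# If the starting per-device batch size hits an out-of-memory error, the Trainer
# retries with a smaller batch size instead of failing outright.
args = TrainingArguments(
    output_dir="prompt_prefixes/test",
    per_device_train_batch_size=64,  # deliberately optimistic starting point
    auto_find_batch_size=True,
    num_train_epochs=1,
)
```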

@gkumbhat (Collaborator, Author) commented Nov 13, 2023

Notes from testing

Test parameters

  • Model Name: [google/flan-t5-xl]
    |- Inferred Model Resource Type: [<class 'caikit_nlp.resources.pretrained_model.hf_auto_seq2seq_lm.HFAutoSeq2SeqLM'>]
  • Tuning Type: [MULTITASK_PROMPT_TUNING]
  • Prompt Tuning Initialization Type [TEXT]
  • Number of Virtual Tokens: [8]
    |- Prompt Tuning initialization Text: [Recognize textual entailment:]
  • Dataset: [glue/rte]
  • Verbalizer: [rte { 0 : entailment, 1 : not entailment } {{input}}]
  • Number of Epochs: [1]
  • Learning Rate: [0.3]
  • Batch Size: [16]
  • Output Directory: [prompt_prefixes/test]
  • Exporting prompt only: [False]
  • Number of shots: [None]
  • Maximum source sequence length: [256]
  • Maximum target sequence length: [512]
  • Gradient accumulation steps: [16]

HF Trainer branch

  • Time: 2023-11-13T20:20:21.948010 to 2023-11-13T20:47:05.818565 (~27 minutes)
  • Notes:
    • 1 epoch of prompt tuning flan-t5-xl on a single GPU on the HF Trainer branch with the parameters above
    • With accumulation steps
    • No gradient checkpointing

Main branch (no HF Trainer)

  • Time: 2023-11-13T20:50:06.924492 to 2023-11-13T20:54:31.013823 (~4 minutes)
  • Notes:
    • Same parameters as above with the non-HF-Trainer approach (on current main)
    • With gradient checkpointing

HF Trainer branch

  • Time: 2023-11-13T20:56:41.132890 to 2023-11-13T20:58:52.220874 (~2 minutes)
  • Notes:
    • Same parameters as above with HF Trainer and accumulation steps set to 1
    • No gradient checkpointing

Main branch (no HF Trainer)

  • Time: 2023-11-13T21:00:26.299356 to 2023-11-13T21:05:14.446629 (~5 minutes)
  • Notes:
    • Same parameters as above with the non-HF-Trainer approach and accumulation steps set to 1
    • With gradient checkpointing

HF Trainer branch

  • Time: 2023-11-13T21:15:24.106248 to 2023-11-13T21:20:36.585468 (~5 minutes)
  • Notes:
    • Same parameters as above with HF Trainer and accumulation steps set to 1 (i.e. no accumulation)
    • With gradient checkpointing

Multi-GPU testing (2× A100-80G)

HF Trainer branch

  • Time: 5 minutes
  • Notes:
    • With gradient checkpointing

dtrifiro and others added 17 commits November 13, 2023 17:30
this avoids printing a deprecation warning

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
dtrifiro and others added 27 commits November 13, 2023 18:28
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
…text_func

"OTHER" is an invalid value for caikit.interfaces.nlp.data_model.text_generation.FinishReason,
resulting in failed serialization of responses when querying the text
generation endpoint.
For `generate_text_func`, it is reasonable to assume that if the finish
reason is not `EOS_TOKEN` or `STOP_SEQUENCE`, it must be `MAX_TOKENS`.

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
… configuration

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
fixes caikit#245

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Evaline Ju <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Evaline Ju <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
- add install subsection
- add model serving subsection
- cleanup docker section
- add configuration subsection

Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Signed-off-by: Daniele Trifirò <[email protected]>
Signed-off-by: gkumbhat <[email protected]>
Successfully merging this pull request may close these issues.

Replace training loop in peft_prompt_tuning with HF Trainer