[WIP] Further memory optimization of SPHINX series models #118

linziyi96 · 2023-11-29T08:41:13Z

This PR currently introduces 3 changes to fit SPHINX-13B FP16 on 4*16GB GPUs:

1. Support resharding the checkpoints to a higher degree of tensor parallelism so that the model can be split across 4 GPUs (our checkpoints are released with a tensor parallel size of 2).
2. Move the visual backbone creation to CPU. Because the visual backbones have to be created in FP32 and come with some unused language parameters, creating them directly on the GPUs, as is currently implemented, causes a memory spike and a consequent OOM on 16GB GPUs.
3. In the multi_turn_mm_box demo, add an option to disable SAM. This is a workaround that saves a few GB of memory on GPU 0, since SAM cannot be sharded easily for now.
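As a rough illustration of the first change, resharding from a tensor parallel size of 2 to 4 amounts to splitting each existing shard further along the dimension it was originally sharded on. The sketch below is not the code in this PR: `split_rows`, `split_cols` and `reshard` are hypothetical helpers, matrices are plain Python lists so that nothing beyond the standard library is needed, and it assumes the new degree is a multiple of the old one. Which dimension a given parameter is split along depends on how its layer is parallelized, so `split_dim` is passed in explicitly here.

```python
# Hypothetical sketch: reshard tensor-parallel weight shards to a higher degree.
# Matrices are lists of rows; a real checkpoint would hold tensors instead.

def split_rows(matrix, parts):
    """Split a matrix into `parts` equal chunks along dim 0 (rows)."""
    rows = len(matrix)
    assert rows % parts == 0, "row count must be divisible by the split factor"
    step = rows // parts
    return [matrix[i * step:(i + 1) * step] for i in range(parts)]


def split_cols(matrix, parts):
    """Split a matrix into `parts` equal chunks along dim 1 (columns)."""
    cols = len(matrix[0])
    assert cols % parts == 0, "column count must be divisible by the split factor"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in matrix] for i in range(parts)]


def reshard(old_shards, new_tp, split_dim):
    """Turn len(old_shards) shards into new_tp shards.

    Each old shard is cut into new_tp // old_tp pieces along split_dim, so new
    rank r receives piece (r % factor) of old shard (r // factor).
    """
    old_tp = len(old_shards)
    assert new_tp % old_tp == 0, "new degree must be a multiple of the old one"
    factor = new_tp // old_tp
    splitter = split_rows if split_dim == 0 else split_cols
    new_shards = []
    for shard in old_shards:
        new_shards.extend(splitter(shard, factor))
    return new_shards


# Toy 4x4 weight, first sharded 2 ways along rows, then resharded to 4 ways.
full = [[r * 4 + c for c in range(4)] for r in range(4)]
tp2_shards = split_rows(full, 2)
tp4_shards = reshard(tp2_shards, new_tp=4, split_dim=0)
```

Because each old shard is only cut further and never merged, every rank can reshard its own source file independently; parameters that are replicated rather than sharded would simply be copied to each new rank.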

nvidia-smi with the model running on 4*V100-16GB after this PR:


memory opt experimental commit

0cadcfb

linziyi96 mentioned this pull request Nov 29, 2023

Tracking issue for SPHINX quantization & other memory issues #114

Open

linziyi96 added 2 commits November 29, 2023 19:17

tensor parallel load_state_dict add synchronization warning

46ed145

move each vision encoder to cuda instantly after creation to reduce peak cpu mem usage

c2a4ccd
