In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities, limiting LMMs' performance on challenging downstream tasks. This weakness has recently been mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides the rich, multi-granularity, and diverse representations required by different downstream tasks. The performance of a multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce optimal routing weights for every test sample. To bridge this gap, we propose a novel and efficient method, "Re-Routing in Test-Time (R2-T2)", that locally optimizes the vector of routing weights at test time by moving it toward the vectors of correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and significantly improves state-of-the-art LMMs' performance on challenging multimodal benchmarks covering diverse tasks, without training any parameters in the base model.
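To make the idea concrete, below is a minimal, self-contained PyTorch sketch of test-time re-routing for a single sample. It is illustrative only: the function name r2t2_reroute, the cosine-similarity neighbor search, the kernel-weighted squared-distance objective, and the linear step-size decay are assumptions made for exposition, not the exact objectives or neighbor-search spaces of the three R2-T2 strategies.

import torch
import torch.nn.functional as F


def r2t2_reroute(test_emb, ref_embs, ref_weights, init_weights,
                 num_neighbors=5, num_steps=10,
                 initial_lr=0.01, final_lr=1e-5):
    """Nudge a test sample's routing-weight vector toward the routing
    weights of its nearest correctly-predicted reference samples.

    Shapes: test_emb (d,), ref_embs (N, d), ref_weights (N, E),
    init_weights (E,), where E is the number of experts.
    """
    # 1) Find the k most similar reference samples (cosine similarity).
    sims = F.cosine_similarity(test_emb.unsqueeze(0), ref_embs, dim=-1)
    top_sim, top_idx = sims.topk(num_neighbors)
    neighbor_weights = ref_weights[top_idx]          # (k, E)
    kernel = torch.softmax(top_sim, dim=0)           # similarity kernel over neighbors

    # 2) Gradient descent in logit space so the weights stay a valid
    #    distribution over experts; step size decays from initial_lr
    #    to final_lr (linear decay is an assumption).
    logits = torch.log(init_weights.clamp_min(1e-8)).detach().clone().requires_grad_(True)
    for step in range(num_steps):
        t = step / max(num_steps - 1, 1)
        lr = initial_lr + t * (final_lr - initial_lr)
        weights = torch.softmax(logits, dim=-1)
        # Surrogate objective: kernel-weighted squared distance to the
        # neighbors' routing weights (the paper's objectives differ).
        loss = (kernel * ((weights.unsqueeze(0) - neighbor_weights) ** 2).sum(dim=-1)).sum()
        grad, = torch.autograd.grad(loss, logits)
        with torch.no_grad():
            logits -= lr * grad
    return torch.softmax(logits.detach(), dim=-1)


# Toy usage: 4 experts, 100 reference samples, 256-dim embeddings.
new_weights = r2t2_reroute(
    torch.randn(256), torch.randn(100, 256),
    torch.softmax(torch.randn(100, 4), dim=-1),
    torch.softmax(torch.randn(4), dim=-1))

Optimizing in logit space keeps the re-routed weights on the simplex throughout the update, so they can be plugged back into the MoE router without renormalization.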
First, set up the environment by running the following commands.
conda create -n R2T2 python=3.9
conda activate R2T2
conda clean -a && pip cache purge
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r assets/requirements/requirements.txt
pip install -r assets/requirements/requirements_custom.txt
pip install flash-attn --no-build-isolation
Create a 'checkpoints' directory under moai/sgg and place the Scene Graph Generation checkpoint there after downloading it; the checkpoint file should be named 'psgtr_r50_epoch_60.pth'.
Download the checkpoint labeled 'PSGTR' from Panoptic SGG.
In the init_detector function in mmdet/apis/inference.py, lines 95-110 should be commented out for compatibility:
# if palette != 'none':
#     model.dataset_meta['palette'] = palette
# else:
#     test_dataset_cfg = copy.deepcopy(config.test_dataloader.dataset)
#     # lazy init. We only need the metainfo.
#     test_dataset_cfg['lazy_init'] = True
#     metainfo = DATASETS.build(test_dataset_cfg).metainfo
#     cfg_palette = metainfo.get('palette', None)
#     if cfg_palette is not None:
#         model.dataset_meta['palette'] = cfg_palette
#     else:
#         if 'palette' not in model.dataset_meta:
#             warnings.warn(
#                 'palette does not exist, random is used by default. '
#                 'You can also set the palette to customize.')
#             model.dataset_meta['palette'] = 'random'
In the inference_detector function in mmdet/apis/inference.py, lines 179 onward should be replaced with the following lines:
# build the data pipeline
data_ = test_pipeline(data_)
data_['inputs'] = data_['inputs'].unsqueeze(0)
data_['data_samples'] = [data_['data_samples']]
# forward the model
with torch.no_grad():
    results = model.test_step(data_)[0]
In mmcv/transforms/processing.py, line 388 should be commented out for compatibility:
# results['img_shape'] = padded_img.shape[:2]
Then, download the benchmark and the reference dataset:
./download.sh
Finally, run evaluate.py:
python evaluate.py --reference reference.json --eval CV-Bench --num_neighbors 5 --num_steps 10 --initial_lr 0.01 --final_lr 1e-5
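Here --reference points to the reference set downloaded above and --eval selects the benchmark; --num_neighbors and --num_steps set the neighborhood size and the number of test-time re-routing steps, with the step size presumably decayed from --initial_lr to --final_lr over those steps (see evaluate.py for the exact schedule).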
Acknowledgement: This code is based on and developed from MoAI.