Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme

📖 Paper | 🤗 Datasets

📚 Overview

This project presents MAYE, a transparent and reproducible framework and a comprehensive evaluation scheme for applying reinforcement learning (RL) to vision-language models (VLMs). The codebase is built entirely from scratch without relying on existing RL toolkits.

Key contributions include:

  • 🧱 From-scratch RL framework: A minimal training pipeline built on standard libraries (Transformers, FSDP2, vLLM), enabling full visibility and extensibility in VLM RL training.
  • 📊 Standardized evaluation scheme: A unified protocol for measuring training dynamics and reflective behavior, filling a critical gap in RL evaluation for VLMs.
  • 🔍 Empirical insights: Experiments across datasets and models uncover trends in response length and reflection patterns, and demonstrate that RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data.

Together, the framework and evaluation scheme aim to establish a reproducible baseline and encourage broader adoption of RL in multimodal reasoning research.

For more details on the training framework, evaluation scheme, and experimental analysis, please refer to our paper.

🧠 Design Philosophy

This project is not intended to compete with existing high-performance RL libraries such as TRL, OpenRLHF, verl, or AReaL.
Instead, it aims to offer a transparent, lightweight, and educational framework that exposes the core logic of RL training for VLMs, without heavy abstraction or engineering overhead.
In spirit, it is similar to OpenAI Spinning Up: not focused on performance or feature completeness, but on being a clear and reproducible entry point for understanding and building VLM-RL systems.

The code is validated on 8 GPUs using the following setup:

  • Ranks 0–6: distributed training via FSDP2.
  • Rank 7: dedicated to high-throughput inference using vLLM.

This separation ensures smooth integration of training and generation within a unified pipeline, and allows researchers to easily trace, debug, and extend every component.
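
As a rough sketch (not the project's actual entry point), the split can be expressed by creating a separate process group for the training ranks so that FSDP2 collectives never touch the inference rank; the constants and function below are illustrative.

# Sketch of the rank layout, assuming 8 processes launched with torchrun;
# TRAIN_RANKS / INFERENCE_RANK are illustrative names, not MAYE's actual code.
import os
import torch
import torch.distributed as dist

TRAIN_RANKS = list(range(7))   # ranks 0-6: FSDP2 training
INFERENCE_RANK = 7             # rank 7: vLLM generation

def main() -> None:
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", rank)))

    # Sub-group containing only the training ranks; new_group must be called
    # collectively, so every rank (including rank 7) executes this line.
    train_group = dist.new_group(ranks=TRAIN_RANKS)

    if rank == INFERENCE_RANK:
        # Here the inference rank would build a vLLM engine, receive updated
        # policy weights from the trainers, and serve rollout generation.
        pass
    else:
        # Here the training ranks would wrap the policy with FSDP2 and run
        # the RL update loop, using `train_group` for their collectives.
        pass

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()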

💡 For debugging under distributed execution, we recommend torch.distributed.breakpoint(), which pauses all ranks and opens an interactive pdb shell on the chosen rank.
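
For example (the surrounding function is hypothetical):

import torch.distributed as dist

def rl_update_step(batch):
    # All ranks pause here; rank 0 gets the interactive pdb prompt while the
    # other ranks wait on a barrier until the session is closed.
    dist.breakpoint(rank=0)
    ...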

πŸ™ Acknowledgements:
The training logic is heavily inspired by torchtune, a well-designed and native PyTorch post-training library with clean abstractions.
The use of vLLM for response generationβ€”separated onto a dedicated inference rankβ€”follows the early design in TRL, which adopted this pattern before supporting tensor parallel inference. (As of now, TRL's latest version has integrated TP-compatible vLLM inference.)

πŸ• Preliminary

While GRPO (Group Relative Policy Optimization) has become the most widely used RL algorithm in recent multimodal training pipelines, this project explores an alternative: Reinforce++. The goal is not to replace GRPO, but to investigate how different policy-gradient methods perform on vision-language reasoning tasks; a sketch of the objective appears below.
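
The snippet below is a minimal sketch of a Reinforce++-style objective, assuming a critic-free advantage obtained by normalizing scalar response rewards over the whole batch and a PPO-style clipped surrogate; the function and argument names are illustrative, and details such as KL handling are omitted.

import torch

def reinforce_pp_loss(logprobs, old_logprobs, rewards, response_mask, clip_eps=0.2):
    """Illustrative Reinforce++-style loss, not the exact MAYE implementation.

    logprobs, old_logprobs: (B, T) token log-probs under the current / rollout policy
    rewards:                (B,)   scalar reward per sampled response
    response_mask:          (B, T) 1 for generated tokens, 0 for prompt / padding
    """
    # Critic-free advantage: normalize rewards across the global batch.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1)                                   # broadcast over tokens

    # PPO-style clipped surrogate on the token-level importance ratio.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token_loss = -torch.min(unclipped, clipped)

    # Average over generated tokens only.
    return (per_token_loss * response_mask).sum() / response_mask.sum()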

We evaluate Reinforce++ on two datasets, mm_math5k and geometry3k (see Download Dataset below).

Each sample follows the format:

{
  "question": "<image>In the figure, $\\overline{A D}$ is perpendicular to $\\overline{B C}$ and $\\overline{A B}$ is perpendicular to $\\overline{A C}$. What is $B C ?$",
  "answer": "20",
  "solution": "\\boxed{20}",
  "id": "geometry3k_2948",
  "image": "geometry3k_images/geometry3k_2948.png"
}

📌 Field descriptions:

question: mathematical question (the <image> token marks where the figure is inserted).

answer: ground-truth numeric answer.

solution: final boxed output, with or without step-by-step reasoning.

id: unique sample identifier.

image: path to the image file, relative to the .jsonl location.
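
A minimal loader sketch (the helper name and paths are examples) that reads a .jsonl file and resolves each image path relative to the file's location:

import json
from pathlib import Path

def load_samples(jsonl_path):
    """Yield dataset samples with the image path resolved to an absolute path."""
    root = Path(jsonl_path).resolve().parent
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            sample["image"] = root / sample["image"]
            yield sample

# Example usage (path is illustrative):
# for s in load_samples("datasets/geometry3k/geometry3k_rl_v0.jsonl"):
#     print(s["id"], s["answer"], s["image"])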

🚀 Quick Start

🛠 Installation

Follow the steps below to set up the environment:

# Conda environment setup
conda create -n maye python=3.11
source activate maye

# PyTorch installation (CUDA 12.1)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# FlashAttention installation (for efficient attention)
pip uninstall -y ninja && pip install ninja
pip install flash-attn --no-build-isolation

# Install other Python dependencies
pip install -r requirements.txt

# Install the project as a package
poetry install

# Set up Weights & Biases for experiment logging
wandb login
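
A quick, optional sanity check (illustrative) that the key packages import and the GPUs are visible:

# check_env.py -- illustrative post-install sanity check
import torch
import transformers
import flash_attn   # noqa: F401  -- verifies the FlashAttention build
import vllm         # noqa: F401  -- verifies the inference engine imports cleanly

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("visible GPUs:", torch.cuda.device_count())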

📥 Download Dataset

The paper experiments are conducted on two datasets: mm_math5k and geometry3k.
We release them on 🤗 ManTle/MAYE.
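
One way to fetch them is via the huggingface_hub client (the target directory below is an example):

from huggingface_hub import snapshot_download

# Downloads the ManTle/MAYE dataset repository into ./datasets (example path).
snapshot_download(repo_id="ManTle/MAYE", repo_type="dataset", local_dir="datasets")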

After downloading, place them under the datasets directory with the following structure:

.
├── geometry3k
│   ├── geometry3k_images
│   ├── geometry3k_rl_v0.jsonl
│   ├── geometry3k_test_v0.jsonl
│   └── geometry3k_validation_v0.jsonl
└── mm_math
    ├── images
    ├── mathverse_images
    ├── math_verse_test_v0.jsonl
    ├── mm_math_rl_v0.jsonl
    └── mm_math_validation_v0.jsonl

πŸ™ Acknowledgements:

📦 Download Model

The experiments use the instruction-tuned variants of Qwen2/2.5-VL (≤7B).
Please download the corresponding checkpoints (e.g., Qwen/Qwen2-VL-7B-Instruct and Qwen/Qwen2.5-VL-7B-Instruct) from Hugging Face and place them in your local checkpoint directory:
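
For example, with huggingface_hub (the /tmp/ckpts paths mirror the defaults in the training script below; adjust to your setup):

from huggingface_hub import snapshot_download

for repo_id in ("Qwen/Qwen2-VL-7B-Instruct", "Qwen/Qwen2.5-VL-7B-Instruct"):
    # Saves each checkpoint under /tmp/ckpts/<model name>, matching MODEL_PATH in the script.
    snapshot_download(repo_id=repo_id, local_dir=f"/tmp/ckpts/{repo_id.split('/')[-1]}")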

🚀 Launch Training

To launch distributed training, edit the script scripts/train_ppo_vllm_distributed.sh with the appropriate dataset and model settings. Below is a simplified example:

# === Basic Configuration ===
DATA=mmmath          # Options: mmmath, geo3k
MODEL=2.5it          # Options: 2.5it (Qwen2.5-VL-Instruct), 2it (Qwen2-VL-Instruct)

# Dataset selection based on DATA
if [[ "$DATA" == "geo3k" ]]; then
    DATADIR=geometry3k
    TRAIN_DATASET_NAME=geometry3k_rl_v0
    VALIDATION_DATASET_NAME=geometry3k_validation_v0
    TEST_DATASET_NAME=geometry3k_test_v0

elif [[ "$DATA" == "mmmath" ]]; then
    DATADIR=mm_math
    TRAIN_DATASET_NAME=mm_math_rl_v0
    VALIDATION_DATASET_NAME=mm_math_validation_v0
    TEST_DATASET_NAME=math_verse_test_v0

else
    echo "Error: Invalid data value '$DATA'. Use 'geo3k', 'mmmath'."
    exit 1
fi

# Model selection based on MODEL
if [[ "$MODEL" == "2.5it" ]]; then
    MODEL_NAME="Qwen2.5-VL-7B-Instruct"
    PROCESSOR_NAME="Qwen2.5-VL-7B-Instruct"
    MODEL_CLASS_NAME="Qwen2_5_VLForConditionalGeneration"
elif [[ "$MODEL" == "2it" ]]; then
    MODEL_NAME="Qwen2-VL-7B-Instruct"
    PROCESSOR_NAME="Qwen2-VL-7B-Instruct"
    MODEL_CLASS_NAME="Qwen2VLForConditionalGeneration"
else
    echo "Error: Invalid tag value '$MODEL'. Use '2.5it' or '2it'."
    exit 1
fi


# === Paths ===
MODEL_PATH=/tmp/ckpts/$MODEL_NAME
PROCESSOR_PATH=/tmp/ckpts/$PROCESSOR_NAME
OUTPUT_DIR=/tmp/ckpts/$MODEL_NAME-$DATADIR
TRAIN_DATASET_PATH=/tmp/MAYE/datasets/$DATADIR/${TRAIN_DATASET_NAME}.jsonl
VALIDATION_DATASET_PATH=/tmp/MAYE/datasets/$DATADIR/${VALIDATION_DATASET_NAME}.jsonl
TEST_DATASET_PATH=/tmp/MAYE/datasets/$DATADIR/${TEST_DATASET_NAME}.jsonl

The script automatically selects the proper dataset and model classes based on your DATA and MODEL choices. You may then launch training by running:

bash scripts/train_ppo_vllm_distributed.sh

📌 Note: Ensure all paths and checkpoint names are consistent with your local setup.

🧷 Citation

If you find our work helpful, please consider citing:

@article{ma2025maye,
  title={Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme},
  author={Ma, Yan and Chern, Steffi and Shen, Xuyang and Zhong, Yiran and Liu, Pengfei},
  journal={arXiv preprint arXiv:2504.02587},
  year={2025},
}
