link: CSU-JPG V-MAGE Repo
🌐 Project Page | 📃 Paper | 🤗 Playground
V-MAGE is a game-based benchmark designed to evaluate visual-centric capabilities through flexible gameplay and carefully designed levels. Its defining features are as follows:
- Visual-Centric: Models receive only visual input, requiring pixel-level scene understanding, object tracking, and spatial-temporal reasoning.
- Flexible Gameplay: Unlike grid-based benchmarks, V-MAGE features continuous-space environments, allowing models to explore an almost infinite state space with no single correct solution.
- Granular Skill Assessment: Each game is designed with multiple difficulty levels, each targeting different skill dimensions.
- Extensible Evaluation Framework: V-MAGE extends beyond model evaluation to assess agentic skills that are currently out of scope for SOTA MLLMs.
- Adaptive ELO-based Ranking: V-MAGE uses a dynamic Elo system for performance comparison, avoiding manual score normalization and performance ceilings.
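The rating update behind such a ranking can be sketched with the standard pairwise Elo formula (a minimal illustration only; V-MAGE's actual K-factor, initial ratings, and matchmaking details may differ):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one head-to-head comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start at 1500; the first wins a round.
a, b = elo_update(1500.0, 1500.0, 1.0)
print(round(a, 1), round(b, 1))  # 1516.0 1484.0
```

Because ratings only shift relative to an opponent's expected score, no manual normalization across games is needed and there is no fixed performance ceiling.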
model | avg_elo | race | supermario | pong | flappybird | tempestrun |
---|---|---|---|---|---|---|
gpt4o | 1550.83 | 1605.57 | 1536.58 | 1506.19 | 1590.23 | 1515.60 |
qwen2_5vl_72b | 1543.85 | 1546.67 | 1608.86 | 1496.32 | 1535.53 | 1531.89 |
gemini-2.0-flash-exp | 1522.82 | 1519.88 | 1536.84 | 1516.51 | 1524.91 | 1515.96 |
internvl2_5_78b | 1512.60 | 1468.74 | 1573.56 | 1512.72 | 1497.01 | 1510.96 |
qwen2vl_72b | 1506.49 | 1509.68 | 1535.26 | 1498.65 | 1455.87 | 1532.99 |
internvl2_5_8b | 1466.12 | 1464.91 | 1393.19 | 1505.23 | 1489.18 | 1478.11 |
random | 1450.53 | 1445.54 | 1436.00 | 1487.91 | 1457.36 | 1425.84 |
qwen2vl_7b | 1446.75 | 1439.01 | 1379.71 | 1476.48 | 1449.90 | 1488.66 |
Submit your own agent results.
To evaluate a model with V-MAGE, follow these steps:
Dependencies can be installed via pip:
```bash
cd V-MAGE
conda create -n v-mage python=3.10 -y
conda activate v-mage
pip install -r requirements.txt
```
If you are using an existing API service, you can skip this step.
Otherwise, we recommend using vLLM or SWIFT to deploy an OpenAI-compatible service for your local model.
Taking vLLM and Qwen2.5-VL-7B-Instruct as an example, you can start the service by running the following commands:
```bash
# Download the model.
# Remember to replace <path-to-model> with the path where you want to save the model.
pip install -U huggingface_hub
huggingface-cli download --resume-download Qwen/Qwen2.5-VL-7B-Instruct --local-dir <path-to-model>

# Start the service. You can change the parameters according to your needs.
pip install vllm
vllm serve <path-to-model> --trust-remote-code --max-model-len 15000 --limit-mm-per-prompt image=6 --port 8000 --gpu-memory-utilization 0.90 --tensor-parallel-size 2
```
You can also use `nohup` to run the service in the background.
Prepare the config file for the model service. For example, if you are using vLLM, you can simply change `model_path` and `openai_api_base` in `config/model_config/openai_service_config.ini`:
```ini
[lmm]
model_name = OpenAI
model_path = <path-to-model>
openai_api_key = EMPTY
openai_api_base = http://localhost:8000/v1 # or your own service address
```
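If you parse this file yourself, note that Python's stdlib `configparser` only strips the trailing `# ...` comment on `openai_api_base` when `inline_comment_prefixes` is set (a small sketch with the same config shape inlined; the benchmark's own config loader may handle this differently):

```python
import configparser
from io import StringIO

# Same shape as config/model_config/openai_service_config.ini.
SAMPLE = """\
[lmm]
model_name = OpenAI
model_path = <path-to-model>
openai_api_key = EMPTY
openai_api_base = http://localhost:8000/v1 # or your own service address
"""

# Without inline_comment_prefixes, the trailing "# ..." comment would
# become part of the openai_api_base value.
parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_file(StringIO(SAMPLE))

print(parser["lmm"]["openai_api_base"])  # http://localhost:8000/v1
```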
To evaluate a single game, run `runner.py`:

```bash
python runner.py \
    --llmProviderConfig=./config/model_config/openai_service_config.ini \
    --gameEnvConfig=./config/env_config/env_config_race_reasoning_0steps.json \
    --levelConfig=./config/level_config/racegame/level1_no_history.json \
    --output_dir=runs/Qwen2_5VL_7B \
    --test_rounds=10
```
To run multiple evaluations in one batch, use `multi_runner.py` with a multi-runner config:

```bash
python multi_runner.py \
    --config_file=./config/multi_runner_config/Race_3steps.json \
    --llmProviderConfig=./config/model_config/openai_service_config.ini \
    --output_dir=runs/Qwen2_5VL_7B \
    --test_rounds=10
```
If you don't want to watch the game screen, you can set the environment variable `SDL_VIDEODRIVER` to `dummy` before running the script:

```bash
export SDL_VIDEODRIVER=dummy
```
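Equivalently, the variable can be set from Python for headless runs (a minimal sketch; the key point is that SDL reads it at initialization, so it must be set before pygame is first imported):

```python
import os

# Force SDL's dummy video driver so pygame renders off-screen.
# This must happen before pygame/SDL is imported or initialized.
os.environ["SDL_VIDEODRIVER"] = "dummy"

# import pygame  # safe to import now that the driver is forced to "dummy"
print(os.environ["SDL_VIDEODRIVER"])  # dummy
```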
will be added soon
Thanks to the open-source community, we are able to leverage existing game codebases to build our benchmark. Here are the games we used:
Game | Codebase |
---|---|
RaceGame | tdostilio/Race_Game |
FlappyBird | agneay/pygame-projects/Flappy Bird |
Pong | pyGuru123/Python-Games/Pong |
SuperMario | mx0c/super-mario-python |
Tempest Run | daipenger/pygame-summer-team-jam |
will be added soon