This is the official implementation of our paper accepted by CVPR 2025 (All strong accept)
Authors: Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, Stan Birchfield
Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization — a hallmark of foundation models in other computer vision tasks — remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation.
TLDR: Our method takes as input a pair of stereo images and outputs a dense disparity map, which can be converted to a metric-scale depth map or 3D point cloud.
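For reference, the disparity-to-depth conversion only needs the focal length and the stereo baseline. The snippet below is a minimal sketch with placeholder values (not values or code from this repo):

```python
import numpy as np

fx, baseline = 700.0, 0.12  # placeholder focal length (px) and baseline (m)
disparity = np.full((480, 640), 32.0, dtype=np.float32)  # dummy HxW prediction

# depth (m) = fx (px) * baseline (m) / disparity (px)
depth = fx * baseline / np.clip(disparity, 1e-6, None)
```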
We obtained 1st place on the world-wide Middlebury leaderboard and the ETH3D leaderboard.
Our method outperforms existing approaches in zero-shot stereo matching tasks across different scenes.
conda env create -f environment.yml
conda activate foundation_stereo
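Optionally, a quick generic PyTorch check (not a script from this repo) to confirm the environment sees your GPU:

```python
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```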
- Download the foundation model for zero-shot inference on your data from here. Put the entire folder (e.g. `23-51-11`) under `./pretrained_models/`.
python scripts/run_demo.py --left_file ./assets/left.png --right_file ./assets/right.png --ckpt_dir ./pretrained_models/model_best_bp2.pth --out_dir ./test_outputs/
You can then view the output point cloud.
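To inspect the result programmatically, a generic Open3D snippet like the one below works. It assumes the demo wrote a PLY file into `--out_dir` (the exact file name may differ, so check the directory contents first), and that the `open3d` package is installed:

```python
import glob
import open3d as o3d

# Pick up whatever PLY file the demo produced (file location/name is an assumption).
ply_files = sorted(glob.glob("./test_outputs/*.ply"))
pcd = o3d.io.read_point_cloud(ply_files[0])
o3d.visualization.draw_geometries([pcd])
```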
Tips:
- The input left and right images should be rectified and undistorted, i.e. there should be no fisheye-style lens distortion and the epipolar lines should be horizontal between the left and right images. If you obtain images from a stereo camera such as a ZED, this has usually been handled for you.
- Do not swap the left and right images. The left image must come from the left-side camera (objects appear shifted further to the right in it than in the right image).
- We recommend using PNG files without lossy compression.
- Our method works best on RGB stereo images. However, we have also tested it on monochrome/IR stereo images (e.g. from the RealSense D4XX series) and it works well too.
- For all options and instructions, run:
python scripts/run_demo.py --help
- To get a point cloud for your own data, you need to specify the camera intrinsics. In the intrinsic file passed via the args, the first line is the flattened 1x9 intrinsic matrix (row-major 3x3 K), and the second line is the baseline (distance between the left and right cameras) in meters. A sketch of how these values are used to back-project disparity into a point cloud is shown after these tips.
- For high-resolution images (>1000 px), you can run with `--hiera 1` to enable hierarchical inference for better performance.
- For faster inference, you can reduce the input image resolution, e.g. `--scale 0.5`, and reduce the number of refinement iterations, e.g. `--valid_iters 16`.
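As mentioned in the intrinsics tip above, here is a minimal sketch of parsing such an intrinsic file and back-projecting a disparity map into a metric point cloud. The file names (`K.txt`, `disparity.npy`) are placeholders, not files produced by this repo:

```python
import numpy as np

# Parse the intrinsic file: line 1 = flattened 3x3 K, line 2 = baseline in meters.
with open("K.txt") as f:                       # hypothetical file name
    lines = f.read().splitlines()
K = np.array(list(map(float, lines[0].split()))).reshape(3, 3)
baseline = float(lines[1])

disparity = np.load("disparity.npy")           # hypothetical HxW prediction
H, W = disparity.shape

# Disparity -> metric depth, then back-project every pixel through K.
depth = K[0, 0] * baseline / np.clip(disparity, 1e-6, None)
u, v = np.meshgrid(np.arange(W), np.arange(H))
rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
xyz = (rays * depth.reshape(1, -1)).T          # Nx3 point cloud in meters
```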
To create ONNX models:
- Make this change to replace flash-attention.
- Make the ONNX model:
export XFORMERS_DISABLED=1
python scripts/make_onnx.py --save_path ./output/foundation_stereo.onnx --ckpt_dir ./pretrained_models/23-51-11/model_best_bp2.pth --height 480 --width 640 --valid_iters 22
- Convert ONNX to TensorRT:
trtexec --onnx=./output/foundation_stereo.onnx --saveEngine=./output/foundation_stereo.engine --fp16 --verbose
We have observed a 6X speed-up with TensorRT FP16 on the same GPU (RTX 3090). How much it speeds up depends on various factors, but we recommend trying it out if you care about faster inference. Also remember to adjust the args to your needs.
This feature is experimental as of now and contributions are welcome!
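If you want to sanity-check the exported ONNX model before or after building the TensorRT engine, a generic ONNX Runtime run like the following works. It is not part of this repo, requires the `onnxruntime` (or `onnxruntime-gpu`) package, and discovers input names from the graph rather than assuming them:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "./output/foundation_stereo.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Feed random tensors of the exported shape to every input, just to check the graph runs.
feed = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # resolve dynamic dims
    feed[inp.name] = np.random.rand(*shape).astype(np.float32)

outputs = sess.run(None, feed)
print([o.shape for o in outputs])
```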
You can download the whole dataset here (>1TB). We also provide a small sample (3GB) for a quick preview. The whole dataset contains ~1M samples, each consisting of:
- Left and right images
- Ground-truth disparity
You can check how to read the data using our example with the sample data:
python scripts/vis_dataset.py --dataset_path ./DATA/sample/manipulation_v5_realistic_kitchen_2500_1/dataset/data/
It will produce a visualization of the sample's left/right images and ground-truth disparity.
- Q: Conda install does not work for me.
  A: Check this.
- Q: My GPU doesn't support flash attention.
  A: See this.
- Q: RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
  A: This may indicate an out-of-memory (OOM) issue. Try reducing your image resolution or use a GPU with more memory.
- Q: How to run with RealSense?
  A: See this. A minimal capture sketch (not from this repo) is also included after this FAQ.
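As referenced above, here is a minimal sketch for grabbing a rectified IR stereo pair from a RealSense D4XX with `pyrealsense2`. It is an illustration of generic pyrealsense2 usage (stream indices, resolution, and the emitter option are assumptions, not code from this repo); the IR emitter is disabled so the projected dot pattern does not interfere with passive stereo matching:

```python
import numpy as np
import pyrealsense2 as rs
import imageio.v3 as iio

pipe = rs.pipeline()
cfg = rs.config()
# Left and right IR imagers of a D4XX are infrared streams 1 and 2.
cfg.enable_stream(rs.stream.infrared, 1, 640, 480, rs.format.y8, 30)
cfg.enable_stream(rs.stream.infrared, 2, 640, 480, rs.format.y8, 30)
profile = pipe.start(cfg)

# Disable the IR pattern projector so the dots don't disturb stereo matching.
depth_sensor = profile.get_device().first_depth_sensor()
if depth_sensor.supports(rs.option.emitter_enabled):
    depth_sensor.set_option(rs.option.emitter_enabled, 0)

frames = pipe.wait_for_frames()
left = np.asanyarray(frames.get_infrared_frame(1).get_data())
right = np.asanyarray(frames.get_infrared_frame(2).get_data())
pipe.stop()

iio.imwrite("left.png", left)    # feed these to scripts/run_demo.py
iio.imwrite("right.png", right)
```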
@article{wen2025stereo,
title={FoundationStereo: Zero-Shot Stereo Matching},
author={Bowen Wen and Matthew Trepte and Joseph Aribido and Jan Kautz and Orazio Gallo and Stan Birchfield},
journal={CVPR},
year={2025}
}
We would like to thank Gordon Grigor, Jack Zhang, Karsten Patzwaldt, Hammad Mazhar and other NVIDIA Isaac team members for their tremendous engineering support and valuable discussions. Thanks to the authors of DINOv2, DepthAnything V2, Selective-IGEV and RAFT-Stereo for their code release. Finally, thanks to CVPR reviewers and AC for their appreciation of this work and constructive feedback.
For questions, please reach out to Bowen Wen ([email protected]).