CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

Teaching robots in the real world to respond to natural language queries with zero human labels — using pretrained large language models (LLMs), visual language models (VLMs), and neural fields.

Authors: Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, Arthur Szlam.

warm_up_my_lunch.mp4

Tl;dr CLIP-Field is a novel weakly supervised approach for learning a semantic robot memory that can respond to natural language queries solely from raw RGB-D and odometry data with no extra human labelling. It combines the image and language understanding capabilites of novel vision-language models (VLMs) like CLIP, large language models like sentence BERT, and open-label object detection models like Detic, and with spatial understanding capabilites of neural radiance field (NeRF) style architectures to build a spatial database that holds semantic information in it.

Installation

To properly install this repo and all the dependencies, follow these instructions.

# Clone this repo.
git clone --recursive https://github.com/notmahi/clip-fields
cd clip-fields

# Create conda environment and install the dependencies.
conda create -n cf python=3.8
conda activate cf
conda install -y pytorch torchvision torchaudio cudatoolkit=11.8 -c pytorch-lts -c nvidia
pip install -r requirements.txt

# Install the hashgrid encoder with the relevant cuda module.
cd gridencoder
# For this part, it may be necessary to find out what your nvcc path is and use that, 
# For me $which nvcc gives public/apps/cuda/11.8/bin/nvcc, so I used the following part
# export CUDA_HOME=/public/apps/cuda/11.8
python setup.py install
cd ..

Interactive Tutorial and Evaluation

We have an interactive tutorial and evaluation notebook that you can use to explore the model and evaluate it on your own data. You can find them in the demo/ directory, that you can run after installing the dependencies.

Training a CLIP-Field directly

Once you have the dependencies installed, you can run the training script train.py with any .r3d files that you have! If you just want to try out a sample, download the sample data nyu.r3d and run the following command.

python train.py dataset_path=nyu.r3d

If you want to use LSeg as an additional source of open-label annotations, you should download the LSeg demo model and place it in the path_to_LSeg/checkpoints/demo_e200.ckpt. Then, you can run the following command.

python train.py dataset_path=nyu.r3d use_lseg=true

You can check out the config/train.yaml for a list of possible configuration options. In particular, if you want to train with any particular set of labels, you can specify them in the custom_labels field in config/train.yaml.

Acknowledgements

We would like to thank the following projects for making their code and models available, which we relied upon heavily in this work.

CLIP with MIT License
Detic with Apache License 2.0
Torch NGP with MIT License
LSeg with MIT License
Sentence BERT with Apache License 2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

readme.md

readme.md

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

Teaching robots in the real world to respond to natural language queries with zero human labels — using pretrained large language models (LLMs), visual language models (VLMs), and neural fields.

Installation

Interactive Tutorial and Evaluation

Training a CLIP-Field directly

Acknowledgements

Files

readme.md

Latest commit

History

readme.md

File metadata and controls

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory

Teaching robots in the real world to respond to natural language queries with zero human labels — using pretrained large language models (LLMs), visual language models (VLMs), and neural fields.

Installation

Interactive Tutorial and Evaluation

Training a CLIP-Field directly

Acknowledgements