We develop DeepNote, an adaptive RAG framework that achieves in-depth and robust exploration of knowledge sources through note-centric adaptive retrieval. DeepNote employs notes as carriers for refining and accumulating knowledge. During in-depth exploration, it uses these notes to determine retrieval timing, formulate retrieval queries, and iteratively assess knowledge growth, ultimately leveraging the best note for answer generation.
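At a high level, the note-centric loop alternates between retrieving, refining the note, and checking whether the note actually improved. The following is a minimal Python sketch of that loop, not the repository's actual code: every helper (retrieve, init_note, update_note, new_queries, is_better, generate_answer) is a hypothetical placeholder, while max_step, max_fail_step, and top_k mirror the command-line options shown later.

# Minimal sketch of the note-centric adaptive retrieval loop (illustrative only).
# All helper functions are hypothetical placeholders, not DeepNote's real API.
def deepnote(question, max_step=3, max_fail_step=2, top_k=5):
    docs = retrieve(question, top_k)               # initial retrieval
    best_note = init_note(question, docs)          # initial note from retrieved docs
    fails = 0
    for _ in range(max_step):
        queries = new_queries(question, best_note)   # the note decides what to retrieve next
        docs = retrieve(queries, top_k)
        note = update_note(best_note, docs)          # refine and accumulate knowledge
        if is_better(note, best_note, question):     # assess knowledge growth
            best_note, fails = note, 0
        else:
            fails += 1
            if fails >= max_fail_step:               # stop once the note stops improving
                break
    return generate_answer(question, best_note)      # answer from the best note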
All corpus and evaluation files should be placed in the /data directory. You can download the experimental data here. We use Wikipedia as the corpus for ASQA and StrategyQA; due to its large size, please download it separately here and place it in /data/corpus/wiki/.
We use different retrieval methods for different datasets (a minimal dispatch sketch follows this list):
For 2WikiMQA, MusiQue, and HotpotQA:
- BM25 retrieval based on Elasticsearch
- Dense retrieval with a FAISS index using embeddings from the BGE model
For ASQA and StrategyQA:
- Dense retrieval with a FAISS index using embeddings from the GTR model
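As a summary of the list above, the dataset-to-retriever mapping can be written down as a simple table; the identifiers below are illustrative labels for this sketch, not configuration values read by the code.

# Illustrative dataset -> retriever mapping; labels are not actual config values.
RETRIEVERS = {
    "2wikimqa":   ["bm25-elasticsearch", "faiss+bge-base-en-v1.5"],
    "musique":    ["bm25-elasticsearch", "faiss+bge-base-en-v1.5"],
    "hotpotqa":   ["bm25-elasticsearch", "faiss+bge-base-en-v1.5"],
    "asqa":       ["faiss+gtr-t5-xxl"],
    "strategyqa": ["faiss+gtr-t5-xxl"],
}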
Install Elasticsearch:
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch # Start the server
pkill -f elasticsearch # To stop the server
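Before building the BM25 indexes, it can help to confirm the server is actually up. A minimal check (any HTTP client works; this sketch assumes the requests package is installed and Elasticsearch is on its default port 9200):

# Quick health check against a locally running Elasticsearch instance.
import requests
print(requests.get("http://localhost:9200/_cluster/health").json())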
cd src/build_index/es
# 2WikiMQA
python index_2wiki.py
# MusiQue
python index_musique.py
# HotpotQA
python index_hotpotqa.py
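Each script builds a BM25 index over its corpus. Conceptually the indexing looks roughly like the sketch below, written with the official elasticsearch Python client; the index name, corpus path, and field names are assumptions for illustration, not the values used by the scripts.

# Simplified BM25 indexing sketch (elasticsearch-py 7.x); names and paths are assumed.
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="hotpotqa", ignore=400)   # ignore "index already exists"

def actions():
    with open("data/corpus/hotpotqa/corpus.jsonl") as f:   # hypothetical corpus path
        for i, line in enumerate(f):
            doc = json.loads(line)
            yield {"_index": "hotpotqa", "_id": i,
                   "_source": {"title": doc.get("title", ""), "text": doc["text"]}}

bulk(es, actions())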
cd src/build_index/emb
python index.py --dataset hotpotqa --model bge-base-en-v1.5 # e.g., for HotpotQA dataset
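Under the hood this corresponds to encoding the corpus with the BGE model and writing a FAISS index. A simplified sketch (the corpus path, output path, and the choice of a flat inner-product index are assumptions):

# Sketch: encode a corpus with BGE and build a flat FAISS index; paths are assumed.
import json
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
passages = [json.loads(l)["text"] for l in open("data/corpus/hotpotqa/corpus.jsonl")]
emb = model.encode(passages, batch_size=256, normalize_embeddings=True,
                   show_progress_bar=True).astype("float32")
index = faiss.IndexFlatIP(emb.shape[1])   # inner product = cosine after normalization
index.add(emb)
faiss.write_index(index, "data/corpus/hotpotqa/bge-base-en-v1.5.index")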
Since generating GTR embeddings for the Wikipedia corpus is time-consuming, you can download the pre-computed GTR embeddings and place them in data/corpus/wiki/:
wget https://huggingface.co/datasets/princeton-nlp/gtr-t5-xxl-wikipedia-psgs_w100-index/resolve/main/gtr_wikipedia_index.pkl
Then build the FAISS index:
cd src/build_index/emb
python index.py --dataset asqa --model gtr-t5-xxl
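Once the index is built, dense retrieval with GTR boils down to encoding the query and searching the index. The sketch below assumes the downloaded pickle contains a matrix of passage embeddings aligned with the Wikipedia passage file; the actual pickle contents and the scripts' interfaces may differ.

# Sketch: load pre-computed GTR passage embeddings, build a FAISS index, run a query.
# Pickle layout, paths, and the query-encoding details are assumptions.
import pickle
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

with open("data/corpus/wiki/gtr_wikipedia_index.pkl", "rb") as f:
    passage_emb = np.asarray(pickle.load(f), dtype="float32")

index = faiss.IndexFlatIP(passage_emb.shape[1])
index.add(passage_emb)

encoder = SentenceTransformer("sentence-transformers/gtr-t5-xxl")
query_emb = encoder.encode(["when was the eiffel tower built"],
                           normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_emb, 5)   # top-5 passage indices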
You can configure your API key, URL, and other settings in the ./config/config.yaml file.
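For reference, reading these settings from Python is straightforward with PyYAML; the key names api_key and base_url below are illustrative, so check config.yaml for the actual field names.

# Load API settings from the config file; the key names are assumptions.
import yaml
with open("./config/config.yaml") as f:
    cfg = yaml.safe_load(f)
api_key = cfg.get("api_key")
base_url = cfg.get("base_url")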
The training process consists of three main steps:
1. Generate the initial training data using the LLaMA model:
python gen_dpo_data.py \
--model llama-3.1-8b-instruct \
--batch_size 9 \
--output_path ../data/dpo_data \
--device 0,1,2,3
2. Filter and process the generated data:
python select_dpo_data.py \
--output_path ../data/dpo/processed/train.jsonl \
--init_num 1900 \
--refine_num 1900 \
--query_num 1900
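The filtered file feeds the DPO training step. Each line is expected to be a preference pair; the field names below follow the common DPO convention (prompt/chosen/rejected) and are an assumption rather than the repository's documented schema.

# Illustrative record shape for ../data/dpo/processed/train.jsonl (fields assumed).
example = {
    "prompt":   "Question together with the current note and retrieved passages ...",
    "chosen":   "Preferred continuation (e.g., the higher-quality refined note) ...",
    "rejected": "Dispreferred continuation ...",
}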
3. Launch the training process:
bash train.sh
python main.py --method deepnote --retrieve_top_k 5 --dataset hotpotqa --max_step 3 --max_fail_step 2 --MaxClients 5 --model gpt-4o-mini-2024-07-18 --device cuda:0
The predicted results and evaluation metrics will be automatically saved in the output/{dataset}/ directory; the evaluation results can be found at the end of the output file.
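To inspect a run's metrics programmatically, something like the sketch below works, assuming the run writes a JSON-lines file under output/{dataset}/ whose final record holds the evaluation metrics; the exact file name and layout may differ.

# Sketch: read the final record of the newest output file to get the metrics.
# The file-name pattern and record layout are assumptions.
import glob
import json
path = sorted(glob.glob("output/hotpotqa/*.json*"))[-1]
with open(path) as f:
    records = [json.loads(line) for line in f if line.strip()]
print(records[-1])   # evaluation results are appended at the end of the file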