This package provides various dense retrieval functionality for PyTerrier.
This repository can be installed using pip:
pip install pyterrier-dr
If you want the latest version of pyterrier_dr, you can install it directly from the GitHub repo:
pip install --upgrade git+https://github.com/terrierteam/pyterrier_dr.git
If you want to use the BGE-M3 encoder with pyterrier_dr, you can install the package with the bgem3 dependency:
pip install pyterrier-dr[bgem3]
You'll also need to install FAISS.
On Colab:
!pip install faiss-cpu
On Anaconda:
# CPU-only version
conda install -c pytorch faiss-cpu
# GPU(+CPU) version
conda install -c pytorch faiss-gpu
You can then import the package and PyTerrier in Python:
import pyterrier as pt
import pyterrier_dr
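Note that older versions of PyTerrier require the framework to be started before use; as a minimal sketch, if you're on one of those versions you may also need:
pt.init()  # only needed on older PyTerrier versions; recent versions initialise automatically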
Model | .query_encoder() | .doc_encoder() | .scorer() |
---|---|---|---|
TctColBert | ✅ | ✅ | ✅ |
TasB | ✅ | ✅ | ✅ |
Ance | ✅ | ✅ | ✅ |
Query2Query | ✅ | | |
BGE-M3 | ✅ | ✅ | ✅ |
Bi-encoder models are represented as PyTerrier transformers. For instance, to load a TCT-ColBERT model:
model = pyterrier_dr.TctColBert()
# Loads castorini/tct_colbert-msmarco by default.
# You can load up other versions by specifying the huggingface model ID, e.g.,
model = pyterrier_dr.TctColBert('castorini/tct_colbert-v2-hnp-msmarco')
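As the table above shows, each model exposes dedicated sub-transformers. As a minimal sketch (the input-dependent behavior shown below usually makes this unnecessary), you can pin a model to a single role explicitly:
query_encoder = model.query_encoder()  # always encodes queries, regardless of other input columns
doc_encoder = model.doc_encoder()      # always encodes documents
scorer = model.scorer()                # always scores query-document pairs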
Once you have a bi-encoder transformer, you can use it to encode queries, encode documents, or perform on-the-fly scoring, depending on the input.
# Compute query vectors
model([
{'qid': '0', 'query': 'Hello Terrier'},
{'qid': '1', 'query': 'find me some documents'},
])
# qid query query_vec
# 0 Hello Terrier [-0.044920705, 0.08312888, 0.26291823, -0.0690...
# 1 find me some documents [0.09036196, 0.19262837, 0.13174239, 0.0649483...
# Compute document vectors
model([
{'docno': '0', 'text': 'The Five Find-Outers and Dog, also known as The Five Find-Outers, is a series of children\'s mystery books written by Enid Blyton.'},
{'docno': '1', 'text': 'City is a 1952 science fiction fix-up novel by American writer Clifford D. Simak.'},
])
# docno text doc_vec
# 0 The Five Find-Outers and Dog, also known as Th... [-0.13535342, 0.16328977, 0.16885889, -0.08592...
# 1 City is a 1952 science fiction fix-up novel by... [-0.06430543, 0.1267311, 0.13813286, 0.0954021...
# Compute on-the-fly scores
model([
{'qid': '0', 'query': 'Hello Terrier', 'docno': '0', 'text': 'The Five Find-Outers and Dog, also known as The Five Find-Outers, is a series of children\'s mystery books written by Enid Blyton.'},
{'qid': '0', 'query': 'Hello Terrier', 'docno': '1', 'text': 'City is a 1952 science fiction fix-up novel by American writer Clifford D. Simak.'},
])
# qid query docno text score rank
# 0 Hello Terrier 0 The Five Find-Outers and Dog, also known as Th... 66.522240 0
# 0 Hello Terrier 1 City is a 1952 science fiction fix-up novel by... 64.964241 1
Of course, you can also use the model within a larger pipeline. For instance, you can re-rank BM25 results using the model, or split long documents into passages for later indexing:
# Retrieval pipeline
bm25 = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='BM25')
retr_pipeline = bm25 >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text') >> model
retr_pipeline.search('Hello Terrier')
# qid docid docno score query text rank
# 1 1899117 1899117 68.693260 Hello Terrier The key word is Terrier! Do your homework, I'd... 0
# 1 5679466 5679466 68.605782 Hello Terrier Introduction. The Biewer Terrier, also known a... 1
# 1 3971237 3971237 68.582764 Hello Terrier Norwich Terrier. The spirited Norwich is one o... 2
# ...
# Indexing pipeline: split long documents into passages of length 50 (stride 25)
idx_pipeline = pt.text.sliding('text', prepend_title=False, length=50, stride=25) >> model
idx_pipeline([
{'docno': '0', 'text': "The Five Find-Outers and Dog, also known as The Five Find-Outers, is a series of children's mystery books written by Enid Blyton. The first was published in 1943 and the last in 1961. Set in the fictitious village of Peterswood based on Bourne End, close to Marlow, Buckinghamshire, the children Fatty (Frederick Trotteville), who is the leader of the team, Larry (Laurence Daykin), Pip (Philip Hilton), Daisy (Margaret Daykin), Bets (Elizabeth Hilton) and Buster, Fatty's dog, encounter a mystery almost every school holiday, always solving the puzzle before Mr Goon, the unpleasant village policeman, much to his annoyance."},
])
# docno text doc_vec
# 0%p0 The Five Find-Outers and Dog, also known as Th... [-0.2607395, 0.21450453, 0.25845605, -0.190567...
# 0%p1 published in 1943 and the last in 1961. Set in... [-0.4286567, 0.2093819, 0.37688383, -0.2590821...
A FLexible EXecution (FLEX) Index is a dense index format that allows for a variety of retrieval implementations (NumPy, FAISS, etc.) and algorithms (exhaustive, HNSW, etc.) to be tested. In many cases, the same vector storage can be used across implementations and algorithms, saving considerably on disk space.
You can use it as part of an indexing pipeline that includes a model to encode documents:
index = pyterrier_dr.FlexIndex('myindex.flex')
idx_pipeline = model >> index
idx_pipeline.index([
{'docno': '0', 'text': 'The Five Find-Outers and Dog, also known as The Five Find-Outers, is a series of children\'s mystery books written by Enid Blyton.'},
{'docno': '1', 'text': 'City is a 1952 science fiction fix-up novel by American writer Clifford D. Simak.'},
])
# Creates an index in myindex.flex:
# $ ls myindex.flex/
# docnos.npids pt_meta.json vecs.f4
Normally you'll run this over a standard corpus, such as one provided by ir_datasets:
index = pyterrier_dr.FlexIndex('antique.flex')
idx_pipeline = model >> index
idx_pipeline.index(pt.get_dataset('irds:antique').get_corpus_iter())
Once built, you can use an index object in a retrieval pipeline too. Be sure to include the model in your pipeline to encode the query text first!
retr_pipeline = model >> index.np_retriever()
retr_pipeline.search('Hello Terrier')
# qid query docno score rank
# 1 Hello Terrier 3771188_6 68.791359 0
# 1 Hello Terrier 723025_2 68.791359 1
# 1 Hello Terrier 1969155_1 68.357742 2
# ...
The above performs an exhaustive (exact) search using NumPy. You can also use other retrievers from a FlexIndex:
retr_pipeline = model >> index.torch_retriever()
retr_pipeline.search('Hello Terrier')
# qid query docno score rank
# 1 Hello Terrier 723025_2 68.774750 0
# 1 Hello Terrier 3771188_6 68.774750 1
# 1 Hello Terrier 1969155_1 68.340683 2
retr_pipeline = model >> index.faiss_hnsw_retriever()
# ...
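Other retriever factories follow the same pattern. For instance, a sketch assuming the FAISS flat retriever (exact search via FAISS) is available in your version of pyterrier_dr:
retr_pipeline = model >> index.faiss_flat_retriever()
retr_pipeline.search('Hello Terrier')
# ...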
pyterrier_dr also supports using BGE-M3 for indexing and retrieval with the following encoders:
- query_encoder(): Encodes queries into single-vector representations only.
- doc_encoder(): Encodes documents into single-vector representations only.
- query_multi_encoder(): Encodes queries into dense, sparse, and/or multi-vector representations.
- doc_multi_encoder(): Encodes documents into dense, sparse, and/or multi-vector representations.

The encodings returned by query_multi_encoder() and doc_multi_encoder() are controlled by the return_dense, return_sparse, and return_colbert_vecs parameters. By default, all three are set to True.
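As a minimal sketch (assuming, per the description above, that these parameters are accepted by the encoder methods), an encoder that returns only sparse query representations:
factory = pyterrier_dr.BGEM3(batch_size=32)
sparse_query_encoder = factory.query_multi_encoder(return_dense=False, return_sparse=True, return_colbert_vecs=False)
sparse_query_encoder([{'qid': '0', 'query': 'Hello Terrier'}])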
The BGE-M3 encoder requires the FlagEmbedding library. You can install it as part of the bgem3 dependency of pyterrier_dr (see the Installation section).
For example, to index the French mMARCO passages with the single-vector document encoder:
factory = pyterrier_dr.BGEM3(batch_size=32, max_length=1024, verbose=True)
encoder = factory.doc_encoder()
index = pyterrier_dr.FlexIndex('mmarco/v2/fr_bgem3', verbose=True)
indexing_pipeline = encoder >> index
indexing_pipeline.index(pt.get_dataset('irds:mmarco/v2/fr').get_corpus_iter())
Then, to retrieve over the index with the single-vector query encoder:
factory = pyterrier_dr.BGEM3(batch_size=32, max_length=1024)
encoder = factory.query_encoder()
index = pyterrier_dr.FlexIndex('mmarco/v2/fr_bgem3', verbose=True)
pipeline = encoder >> index.np_retriever()
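You can then search as with any other retrieval pipeline (the query below is just an illustrative placeholder):
pipeline.search('bonjour')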
- PyTerrier: PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval (Macdonald et al., CIKM 2021)
- FAISS: Billion-Scale Similarity Search with GPUs (Johnson et al., 2017)
- TCT-ColBERT: In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval (Lin et al., RepL4NLP 2021)
Contributors to this repository:
- Sean MacAvaney, University of Glasgow
- Xiao Wang, University of Glasgow