Releases · deepset-ai/haystack

16 Jul 12:30

tholor

0.3.0

a6ec430

0.3.0

🔍 Dense Passage Retrieval

Glad to introduce the new Dense Passage Retriever (aka DPR).
Dense text embeddings are a powerful alternative for scoring the similarity of texts. This retriever uses two BERT models: one to embed your query, one to embed your passage. This dual-encoder architecture copes much better with the different nature of queries and passages (length, syntax ...). It was published by Karpukhin et al. and shows impressive performance, especially when there is no direct token overlap between your queries and your texts.

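from haystack.retriever.dense import DensePassageRetriever

# document_store is assumed to be an already initialized DocumentStore
retriever = DensePassageRetriever(document_store=document_store,
                                  embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)
retriever.retrieve(query="What is cosine similarity?")
# returns: [Document, Document]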

See Tutorial 6 for more details
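
To make the dual-encoder idea more concrete, here is a minimal toy sketch in plain Python. It does not use Haystack or BERT: the class `ToyDualEncoderIndex` and the hand-written 3-dimensional vectors are hypothetical stand-ins for two trained encoders. The only point is the structure: passages are embedded once with the passage encoder, the query is embedded separately with the query encoder, and candidates are ranked by dot product.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


class ToyDualEncoderIndex:
    """Hypothetical index: one encoder for passages, a separate one for queries."""

    def __init__(self, query_encoder: Callable[[str], Vector],
                 passage_encoder: Callable[[str], Vector]) -> None:
        self.query_encoder = query_encoder
        self.passage_encoder = passage_encoder
        self.passages: List[Tuple[str, Vector]] = []

    def add(self, texts) -> None:
        # Passages are embedded once, at indexing time.
        for text in texts:
            self.passages.append((text, self.passage_encoder(text)))

    def retrieve(self, query: str, top_k: int = 2) -> List[str]:
        # Only the query is embedded at search time.
        q = self.query_encoder(query)
        scored = [(dot(q, emb), text) for text, emb in self.passages]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:top_k]]


# Hand-written vectors stand in for the output of two trained encoders.
PASSAGE_VECTORS = {
    "Cosine similarity measures the angle between two vectors.": [0.9, 0.1, 0.0],
    "Berlin is the capital of Germany.": [0.0, 0.2, 0.9],
    "The dot product sums element-wise products.": [0.7, 0.3, 0.1],
}
QUERY_VECTORS = {"What is cosine similarity?": [1.0, 0.0, 0.1]}

index = ToyDualEncoderIndex(QUERY_VECTORS.__getitem__, PASSAGE_VECTORS.__getitem__)
index.add(PASSAGE_VECTORS)
top = index.retrieve("What is cosine similarity?", top_k=2)
```

Note that the query shares no informative token with the second-ranked passage; the ranking comes purely from the vectors, which is the property the release note highlights.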

📊 Evaluation

We introduce the option to evaluate your reader, retriever, and the combination of both. While there's usually a good understanding of the reader's performance, the interplay with the retriever is what really matters in practice. You want to answer: Is my retriever a bottleneck? Is it worth increasing top_k for the retriever? How do different retrievers compare in performance? What is the effect on speed?
The new eval() is a first step towards answering those questions and gives a comprehensive picture of your pipeline. Stay tuned for more enhancements here.

document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")
...
retriever.eval(top_k=10)
reader.eval(document_store=document_store, device=device)
finder.eval(top_k_retriever=10, top_k_reader=10)

See Tutorial 5 for more details
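
As a rough illustration of the "is it worth increasing top_k?" question, the sketch below computes a simple retriever recall at different `top_k` values. It is plain Python and independent of Haystack; the function `retriever_recall` and the toy document ids are assumptions for illustration, and the metrics reported by `eval()` may be defined differently.

```python
def retriever_recall(retrieved_ids_per_query, gold_ids_per_query, top_k):
    """Fraction of queries with at least one gold document in the top_k results."""
    if not gold_ids_per_query:
        return 0.0
    hits = 0
    for retrieved, gold in zip(retrieved_ids_per_query, gold_ids_per_query):
        if set(retrieved[:top_k]) & set(gold):
            hits += 1
    return hits / len(gold_ids_per_query)


# Toy data: ranked document ids per query, and the ids that contain the answer.
retrieved = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold = [["d1"], ["d4"], ["d0"]]

recall_at_1 = retriever_recall(retrieved, gold, top_k=1)
recall_at_3 = retriever_recall(retrieved, gold, top_k=3)
```

If recall keeps rising as `top_k` grows, the retriever is likely the bottleneck at small `top_k`; if it plateaus, a larger `top_k` mostly costs reader time.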

📄 Basic Support for PDF and Docx Files

You can now index PDF and docx files more easily into your DocumentStore. We introduce a new BaseConverter class that offers basic cleaning functions (e.g. removing footers or tables). Its file-format-specific child classes (e.g. PDFToTextConverter) handle the actual extraction of the text.

# PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page

# DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
#  => list of str, one per paragraph (as docx has no direct notion of pages)
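
To give a feel for what a cleaning step such as header/footer removal might involve, here is a small self-contained sketch. It is not Haystack's implementation: the function `strip_repeated_edges` and its `min_share` parameter are hypothetical, and the heuristic (drop a first or last line if it repeats on most pages) is only one possible approach.

```python
from collections import Counter


def strip_repeated_edges(pages, min_share=0.6):
    """Drop first/last lines that repeat on at least `min_share` of the pages."""
    split = [page.splitlines() for page in pages]
    first = Counter(lines[0] for lines in split if lines)
    last = Counter(lines[-1] for lines in split if lines)
    threshold = min_share * len(pages)

    cleaned = []
    for lines in split:
        if lines and first[lines[0]] >= threshold:
            lines = lines[1:]   # repeated header
        if lines and last[lines[-1]] >= threshold:
            lines = lines[:-1]  # repeated footer
        cleaned.append("\n".join(lines))
    return cleaned


pages = [
    "ACME Annual Report\nRevenue grew in Q1.\nConfidential",
    "ACME Annual Report\nCosts were stable.\nConfidential",
    "ACME Annual Report\nOutlook remains positive.\nConfidential",
]
cleaned = strip_repeated_edges(pages)
```

Exact matching like this misses footers that change per page (for example page numbers), so a real converter would need a fuzzier comparison.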

And there's much more that happened ...

Preprocessing

Added Support for Docx Files #225
Add PDF parser for indexing #109
Adjust PDF conversion subprocess for Python v3.6 #194
Fix boundary condition in detection of header/footer in file converters #165

Retriever

Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever #239
Add dummy retriever for benchmarking / reader-only settings #235
Fix id for documents returned by the TfidfRetriever #232
Tutorial for Dense Passage Retriever #186
Fix device arg for sentence transformers #124
Fix embeddings from sentence-transformers (type cast & gpu flags) #121
Adding metadata to be returned from tfidf retriever #122

Reader

Add ONNXRuntime support #157
Fix multi gpu training via Dataparallel #234
Fix document id missing in farm inference output #174
Add document meta for Transformer Reader #114
Fix naming of offset in answers of TransformersReader (for consistency with FARMReader) #204
Adjust to farm handling of no answer #170

DocumentStores

Move document_name attribute to meta #217
Remove meta field when writing documents in Elasticsearch #240
Harmonize meta data handling across doc stores #214
Add filtering by tags for InMemoryDocumentStore #108
Make FAQ question field customizable #146
Increase timeout for Elasticsearch bulk indexing #119
Add embedding query for InMemoryDocumentStore #112
Increase timeout for bulk indexing in ES #130
Add custom port to ElasticsearchDocumentStore #129
Remove hard-coded embedding field #107

REST API

Move out REST API from PyPI package #160
Fix format of /export-doc-qa-feedback to comply with SQuAD #241
Create file upload directory in the REST API #166
Add API endpoint to upload files #154
Missing PORT and SCHEME for elasticsearch to run the API #134
Add EMBEDDING_MODEL_FORMAT in API config #152
Add success response for successful file upload API #195
Add response time in logs #201
Fix rest api in Docker image after refactoring #178

Other

Upgrade to new FARM / Transformers / PyTorch versions #212
Fix Evaluation Dataset #233
Remove mutation of documents in write_documents() #231
Remove mutation of results dict in print_answers() #230
Make doc name optional #100
Fix Dockerfile to build successfully without models directory #210
Docker available for TransformersReader Class #180
Fix embedding method in FAQ-QA Tutorial #220
Add more tests #213
Update docstring for embedding_field and embedding_dim #208
Make "meta" field generic for Document Schema #102
Update tutorials #200
Upgrade FARM version #172
Fix for installing PyTorch on Windows OS #159
Remove Literal type hint #156
Remove PyMuPDF dependency #148
Add missing type hints #138
Add a GitHub Action to start Elasticsearch instance for Build workflow #142
Correct field in evaluation tutorial #139
Update Haystack version in tutorials #136
Fix evaluation #132
Add stalebot #131
Add Reader/Retriever validations in Finder #113
Add document metadata for FAQ style QA #106
Add basic tutorial for FAQ-based QA & batch comp. of embeddings #98
Make saving more explicit in tutorial #95

Thanks to all contributors for working on this and shaping Haystack together: @skirdey @guillim @antoniolanza1996 @F4r1n @arthurbarros @elyase @anirbansaha96 @Timoeller @bogdankostic @tanaysoni @brandenchan

05 May 10:57

tanaysoni

0.2.1

b4842f2

0.2.1

In our release notes, we will always highlight a few important changes first and list the detailed PRs below.

🎉 First release

Happy to announce our first proper release, including many of the features that we found absolutely crucial for a QA system. While we still have countless exciting features on our roadmap, we are confident that this version already accelerates the development of QA systems significantly. It has also been tested successfully in the first production deployments.
From now on, we will switch to a more regular release cycle.

📜 ElasticsearchDocumentStore

We recommend the new ElasticsearchDocumentStore for all production deployments. While we will keep more light-weight options (SQL, In-Memory) for easy prototyping, new features will be implemented first for Elasticsearch.

🚀 New Retrievers

Besides plain TF-IDF (in memory), we introduced the ElasticsearchRetriever, which supports Elasticsearch's native scoring (BM25) or custom queries (e.g. using boosting).

As a further option, we also added the EmbeddingRetriever, which encodes texts into embeddings (e.g. via Sentence-BERT) and retrieves via cosine similarity. The latter in particular is very promising, and you will likely see more features in this direction.
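
The retrieval step can be sketched in a few lines of plain Python. This is an illustration only, not the EmbeddingRetriever code: the helper names and the 2-dimensional toy vectors are assumptions, and in practice the embeddings would come from a model such as Sentence-BERT.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query vector."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]


# Toy embeddings (assumption): real ones would have hundreds of dimensions.
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.6, 0.8], "doc_c": [0.0, 1.0]}
ranking = rank_by_cosine([2.0, 0.1], doc_vecs, top_k=2)
```

Because cosine similarity normalizes by vector length, the longer query vector `[2.0, 0.1]` is scored by its direction only.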

⁉️ FAQ-style QA

Besides extractive QA, you can now also index existing question-answer pairs (e.g. from FAQs) and find answers by matching the incoming user question against the indexed questions and returning the answer from the best-matching pair. This can be an interesting alternative or addition to extractive QA if you already have huge collections of FAQs and/or need a solution that works with low computational resources.
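
The matching idea can be sketched as follows. This is a toy under stated assumptions: the function `answer_from_faq` is hypothetical, and simple token overlap (Jaccard similarity) is swapped in as the similarity measure to keep the example dependency-free; a real setup would more likely compare question embeddings.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def answer_from_faq(user_question, faq_pairs):
    """Return (answer, score) for the indexed question closest to the user question."""
    query_tokens = tokens(user_question)
    best_question, best_answer = max(
        faq_pairs, key=lambda pair: jaccard(query_tokens, tokens(pair[0]))
    )
    return best_answer, jaccard(query_tokens, tokens(best_question))


faq_pairs = [
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("What payment methods do you accept?", "We accept credit cards and bank transfers."),
]
answer, score = answer_from_faq("How can I reset my password", faq_pairs)
```

No reader model runs here: the answer is looked up from the matched pair, which is why this style of QA works with low computational resources.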

🔁 Modular API based on FastAPI

We changed the basic REST API from Flask to FastAPI and modularized it.
You can now:

search answers in texts (extractive QA)
search answers by comparing user question to existing questions (FAQ-style QA)
collect & export user feedback on answers to gain domain-specific training data (feedback)
do basic monitoring of requests (currently via APM in Kibana)

Detailed changes:

Document Stores

Add Elasticsearch Datastore #13
Refactor database layer #10
Add test for Elasticsearch document store #88
Make filters optional for Elasticsearch query #80
Inmemory store #76
Fix get_all_documents() in ElasticsearchDocumentStore #77
Fix get_all_documents query for Elasticsearch #21

Retrievers

Add FAQ-style QA #44
added option for custom elasticsearch queries and filters #52
More flexible es config & support for filters #29
Add more ES connection params #35
Simplify Retriever query #73
Refactor ElasticsearchRetriever into separate class #72
Add params to create_embeddings in retriever #45
fix scaling of pseudo probs for es scores. fix filtering of embedding retrieval #46
Fixing doc_name for TFIDF Retriever #33

Readers

Refactor pipeline for better generalizability & Add TransformersReader #1
Add method to train a reader on custom data #5
Add no answer handling #26
Add no_answer option to results #24
Fix offsets in reader #4
FARMReader.train() now takes default values from FARMReader #47
Update inferencer args (num_processes, chunksize) to latest FARM version #54
update readme & rename arg in TransformersReader for consistency #86
Fixing typo in transformer. use_gpu provides ordinal of the gpu, not … #83
Add document_id with Transformers Reader #60
Make eval during reader.train() more verbose #28
Removed "document_name" from farm.py #31
Add a document_name field in answers #30

REST API / Deployment

Move API from flask to fastAPI #3
Modularize API components #55
Return more meta data & restructure response format #66
Log API responses in APM #70
Make Elastic-APM optional #65
Update Python version in Dockerfile-GPU #71
Update Dockerfiles to use Gunicorn for deployment #69
Add limit on concurrent requests for doc-qa #64
Add Docker Images for running Haystack #85
Fix cyclic import of Elasticsearch client #59
Add Feedback export API #56
Add gpu dockerfile, improve logging, fix minor bug with filtering #36
Improve deployment of REST API (Configs, logging, minor bugs) #40

Others

Standardize Finder, Readers, and Retriever interfaces #62
pin haystack version in tutorials until release #87
Update tutorials to use Elasticsearch, new Retrievers #79
Adding coverage reports and a few more tests #78
Added Jupyter notebooks of Tutorials #43
Add minimal tutorial for ES #19
Update tutorials #12

Thanks to all contributors for your great work 👏
@tanaysoni, @Timoeller , @brandenchan, @bogdankostic , @skirdey , @stedomedo , @karthik19967829 , @aadil-srivastava01 , @tholor

28 Nov 09:55

tanaysoni

0.1.0

d2c77f3

Initial Release

0.1.0
