Releases: deepset-ai/haystack

0.3.0

16 Jul 12:30

🔍 Dense Passage Retrieval

We're glad to introduce the new Dense Passage Retriever (aka DPR).
Dense text embeddings are a powerful alternative for scoring the similarity of texts. This retriever uses two BERT models: one to embed your query, one to embed your passage. This dual-encoder architecture deals much better with the different nature of queries and passages (length, syntax ...). It was published by Karpukhin et al. and shows impressive performance, especially if there's no direct overlap between tokens in your queries and your texts.

retriever = DensePassageRetriever(document_store=document_store,
                                  embedding_model="dpr-bert-base-nq",
                                  do_lower_case=True, use_gpu=True)
retriever.retrieve(query="What is cosine similarity?")
# returns: [Document, Document]

See Tutorial 6 for more details
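
To make the dual-encoder idea concrete, here is a minimal sketch of the ranking step only. It is not Haystack's implementation: the vectors are made-up stand-ins for what the query encoder and the passage encoder would produce, the names (`QUERY_VECTOR`, `PASSAGE_VECTORS`, `rank_passages`) are hypothetical, and it assumes inner-product scoring as described in the DPR paper.

```python
from typing import Dict, List, Tuple

# Made-up vectors standing in for encoder outputs (assumption for illustration).
# In DPR, the query vector comes from one BERT model and the passage vectors
# from a second one; here they are just hand-written numbers.
QUERY_VECTOR: List[float] = [0.9, 0.1, 0.3]
PASSAGE_VECTORS: Dict[str, List[float]] = {
    "cosine_doc": [0.8, 0.2, 0.4],
    "cooking_doc": [0.1, 0.9, 0.0],
    "geometry_doc": [0.5, 0.1, 0.6],
}


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec: List[float],
                  passage_vecs: Dict[str, List[float]],
                  top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top_k (passage id, score) pairs, highest score first."""
    scored = [(pid, dot(query_vec, vec)) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


top = rank_passages(QUERY_VECTOR, PASSAGE_VECTORS)
print(top)
```

Because the score depends only on the two vectors, a passage can rank highly without sharing a single token with the query.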

📊 Evaluation

We introduce the option to evaluate your reader, retriever, and the combination of both. While there's usually a good understanding of the reader's performance, the interplay with the retriever is what really matters in practice. You want to answer: Is my retriever a bottleneck? Is it worth increasing top_k for the retriever? How do different retrievers compare in performance? What is the effect on speed?
The new eval() is a first step towards answering those questions and gives a comprehensive picture of your pipeline. Stay tuned for more enhancements here.

document_store.add_eval_data("../data/nq/nq_dev_subset_v2.json")
...
retriever.eval(top_k=10)
reader.eval(document_store=document_store, device=device)
finder.eval(top_k_retriever=10, top_k_reader=10)

See Tutorial 5 for more details
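
One way to think about the "is it worth increasing top_k?" question is retriever recall: the share of queries for which at least one relevant document appears among the top_k results. The sketch below is a hand-rolled illustration with invented data, not the logic behind retriever.eval(); the function name and the input dictionaries are assumptions.

```python
from typing import Dict, List, Set


def retriever_recall(retrieved: Dict[str, List[str]],
                     relevant: Dict[str, Set[str]],
                     top_k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top_k results."""
    if not relevant:
        return 0.0
    hits = 0
    for query, gold_ids in relevant.items():
        candidates = retrieved.get(query, [])[:top_k]
        if any(doc_id in gold_ids for doc_id in candidates):
            hits += 1
    return hits / len(relevant)


# Invented ranked results per query and invented gold labels.
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d4", "d5", "d2"],
    "q3": ["d9", "d8", "d6"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}

for k in (1, 2, 3):
    print(k, retriever_recall(retrieved, relevant, top_k=k))
```

If recall keeps rising as top_k grows, the retriever is likely the bottleneck at small top_k; if it plateaus, passing more documents to the reader mostly costs speed.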

📄 Basic Support for PDF and Docx Files

You can now index PDF and docx files into your DocumentStore more easily. We introduce a new BaseConverter class that offers basic cleaning functions (e.g. removing footers or tables). Its file-format-specific child classes (e.g. PDFToTextConverter) handle the actual extraction of the text.

# PDF
from haystack.indexing.file_converters.pdf import PDFToTextConverter
converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)
# => list of str, one per page

# DOCX
from haystack.indexing.file_converters.docx import DocxToTextConverter
converter = DocxToTextConverter()
paragraphs = converter.extract_pages(file_path=file)
#  => list of str, one per paragraph (as docx has no direct notion of pages)
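
As an illustration of what a cleaning step such as header/footer removal can look like, here is a deliberately simplified sketch. It is not the BaseConverter logic: the function name `strip_repeated_edges` is hypothetical, and it only handles a first or last line that is identical on every page (so footers with changing page numbers would not be caught).

```python
from typing import List, Optional


def strip_repeated_edges(pages: List[str]) -> List[str]:
    """Drop a first and/or last line that is identical on every non-empty page."""
    split_pages = [page.splitlines() for page in pages]
    non_empty = [lines for lines in split_pages if lines]
    if len(non_empty) < 2:
        # With fewer than two pages there is nothing to compare against.
        return pages

    first_lines = {lines[0] for lines in non_empty}
    last_lines = {lines[-1] for lines in non_empty}
    header: Optional[str] = first_lines.pop() if len(first_lines) == 1 else None
    footer: Optional[str] = last_lines.pop() if len(last_lines) == 1 else None

    cleaned = []
    for lines in split_pages:
        if header is not None and lines and lines[0] == header:
            lines = lines[1:]
        if footer is not None and lines and lines[-1] == footer:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


pages = [
    "ACME Report\nIntro text\nConfidential",
    "ACME Report\nMore text\nConfidential",
]
print(strip_repeated_edges(pages))
```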

And there's much more that happened ...

Preprocessing

  • Added Support for Docx Files #225
  • Add PDF parser for indexing #109
  • Adjust PDF conversion subprocess for Python v3.6 #194
  • Fix boundary condition in detection of header/footer in file converters #165

Retriever

  • Refactor DPR for latest transformers version & change init arg gpu -> use_gpu for DPR and EmbeddingRetriever #239
  • Add dummy retriever for benchmarking / reader-only settings #235
  • Fix id for documents returned by the TfidfRetriever #232
  • Tutorial for Dense Passage Retriever #186
  • Fix device arg for sentence transformers #124
  • Fix embeddings from sentence-transformers (type cast & gpu flags) #121
  • Adding metadata to be returned from tfidf retriever #122

Reader

  • Add ONNXRuntime support #157
  • Fix multi gpu training via Dataparallel #234
  • Fix document id missing in farm inference output #174
  • Add document meta for Transformer Reader #114
  • Fix naming of offset in answers of TransformersReader (for consistency with FARMReader) #204
  • Adjust to farm handling of no answer #170

DocumentStores

  • Move document_name attribute to meta #217
  • Remove meta field when writing documents in Elasticsearch #240
  • Harmonize meta data handling across doc stores #214
  • Add filtering by tags for InMemoryDocumentStore #108
  • Make FAQ question field customizable #146
  • Increase timeout for Elasticsearch bulk indexing #119
  • Add embedding query for InMemoryDocumentStore #112
  • Increase timeout for bulk indexing in ES #130
  • Add custom port to ElasticsearchDocumentStore #129
  • Remove hard-coded embedding field #107

REST API

  • Move out REST API from PyPI package #160
  • Fix format of /export-doc-qa-feedback to comply with SQuAD #241
  • Create file upload directory in the REST API #166
  • Add API endpoint to upload files #154
  • Missing PORT and SCHEME for elasticsearch to run the API #134
  • Add EMBEDDING_MODEL_FORMAT in API config #152
  • Add success response for successful file upload API #195
  • Add response time in logs #201
  • Fix rest api in Docker image after refactoring #178

Other

  • Upgrade to new FARM / Transformers / PyTorch versions #212
  • Fix Evaluation Dataset #233
  • Remove mutation of documents in write_documents() #231
  • Remove mutation of results dict in print_answers() #230
  • Make doc name optional #100
  • Fix Dockerfile to build successfully without models directory #210
  • Docker available for TransformersReader Class #180
  • Fix embedding method in FAQ-QA Tutorial #220
  • Add more tests #213
  • Update docstring for embedding_field and embedding_dim #208
  • Make "meta" field generic for Document Schema #102
  • Update tutorials #200
  • Upgrade FARM version #172
  • Fix for installing PyTorch on Windows OS #159
  • Remove Literal type hint #156
  • Remove PyMuPDF dependency #148
  • Add missing type hints #138
  • Add a GitHub Action to start Elasticsearch instance for Build workflow #142
  • Correct field in evaluation tutorial #139
  • Update Haystack version in tutorials #136
  • Fix evaluation #132
  • Add stalebot #131
  • Add Reader/Retriever validations in Finder #113
  • Add document metadata for FAQ style QA #106
  • Add basic tutorial for FAQ-based QA & batch comp. of embeddings #98
  • Make saving more explicit in tutorial #95

Thanks to all contributors for working on this and shaping Haystack together: @skirdey @guillim @antoniolanza1996 @F4r1n @arthurbarros @elyase @anirbansaha96 @Timoeller @bogdankostic @tanaysoni @brandenchan

0.2.1

05 May 10:57

In our release notes, we will always highlight a few important changes first and list the detailed PRs below.

🎉 First release

We're happy to announce our first proper release, including many of the features that we found absolutely crucial for a QA system. While we still have countless exciting features on our roadmap, we are confident that this version already accelerates the development of QA systems significantly, and it has also been tested successfully in the first production deployments.
From now on, we will switch to a more regular release cycle.

📜 ElasticsearchDocumentStore

We recommend the new ElasticsearchDocumentStore for all production deployments. While we will keep more light-weight options (SQL, In-Memory) for easy prototyping, new features will be implemented first for Elasticsearch.

🚀 New Retrievers

Besides plain TF-IDF (in memory), we introduced the ElasticsearchRetriever that supports Elasticsearch's native scoring (BM25) or custom queries (e.g. using boosting).
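
For readers unfamiliar with BM25, the following is a small textbook-style sketch of the scoring formula, using k1=1.2 and b=0.75 (which are also Elasticsearch's documented defaults). It is an illustration only: the function `bm25_scores` is hypothetical, the documents are invented, and Elasticsearch's actual scoring differs in implementation details.

```python
import math
from typing import List


def bm25_scores(query_terms: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with textbook BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    scores = [0.0] * n_docs
    for term in query_terms:
        doc_freq = sum(1 for doc in docs if term in doc)
        if doc_freq == 0:
            continue  # term appears nowhere, so it contributes nothing
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        for i, doc in enumerate(docs):
            tf = doc.count(term)
            # Longer-than-average documents are penalized via b.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            scores[i] += idf * tf * (k1 + 1) / (tf + length_norm)
    return scores


docs = [
    ["haystack", "is", "a", "qa", "framework"],
    ["elasticsearch", "scores", "with", "bm25"],
    ["bm25", "ranks", "documents", "bm25", "style"],
]
scores = bm25_scores(["bm25"], docs)
print(scores)
```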

As a further option, we also added the EmbeddingRetriever that encodes texts into embeddings (e.g. via Sentence-BERT) and retrieves via cosine similarity. The latter is especially promising, and you will likely see more features in this direction.

⁉️ FAQ-style QA

Besides extractive QA, you can now also index existing question-answer pairs (e.g. from FAQs) and find answers by matching the incoming user question with the indexed questions and returning the related answer from that pair. This can be an interesting alternative or addition to extractive QA if you already have huge collections of FAQs and/or need a solution that works with low computational resources.
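
The matching step can be sketched as follows, assuming question embeddings are compared by cosine similarity. Everything here is illustrative rather than Haystack code: the `FAQ` entries, their vectors and the `answer_faq` helper are invented, and in practice the vectors would come from an embedding model such as Sentence-BERT.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# (indexed question, stored answer, made-up question embedding)
FAQ: List[Tuple[str, str, List[float]]] = [
    ("How do I reset my password?", "Use the 'Forgot password' link.", [0.9, 0.1, 0.0]),
    ("What are your opening hours?", "We are open 9 to 5.", [0.0, 0.2, 0.9]),
]


def answer_faq(query_vec: List[float],
               faq: List[Tuple[str, str, List[float]]]) -> Tuple[str, float]:
    """Return the answer of the most similar indexed question and its score."""
    best = max(faq, key=lambda entry: cosine(query_vec, entry[2]))
    return best[1], cosine(query_vec, best[2])


# Made-up embedding of an incoming question like "I forgot my password".
answer, score = answer_faq([0.8, 0.3, 0.1], FAQ)
print(answer, round(score, 3))
```

No reader model runs at query time, which is why this approach works with low computational resources.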

🔁 Modular API based on FastAPI

We changed the basic REST API from Flask to FastAPI and modularized it.
You can now:

  • search answers in texts (extractive QA)
  • search answers by comparing user question to existing questions (FAQ-style QA)
  • collect & export user feedback on answers to gain domain-specific training data (feedback)
  • do basic monitoring of requests (currently via APM in Kibana)

Detailed changes:

Document Stores

  • Add Elasticsearch Datastore #13
  • Refactor database layer #10
  • Add test for Elasticsearch document store #88
  • Make filters optional for Elasticsearch query #80
  • Inmemory store #76
  • Fix get_all_documents() in ElasticsearchDocumentStore #77
  • Fix get_all_documents query for Elasticsearch #21

Retrievers

  • Add FAQ-style QA #44
  • added option for custom elasticsearch queries and filters #52
  • More flexible es config & support for filters #29
  • Add more ES connection params #35
  • Simplify Retriever query #73
  • Refactor ElasticsearchRetriever into separate class #72
  • Add params to create_embeddings in retriever #45
  • fix scaling of pseudo probs for es scores. fix filtering of embedding retrieval #46
  • Fixing doc_name for TFIDF Retriever #33

Readers

  • Refactor pipeline for better generalizability & Add TransformersReader #1
  • Add method to train a reader on custom data #5
  • Add no answer handling #26
  • Add no_answer option to results #24
  • Fix offsets in reader #4
  • FARMReader.train() now takes default values from FARMReader #47
  • Update inferencer args (num_processes, chunksize) to latest FARM version #54
  • update readme & rename arg in TransformersReader for consistency #86
  • Fixing typo in transformer. use_gpu provides ordinal of the gpu, not … #83
  • Add document_id with Transformers Reader #60
  • Make eval during reader.train() more verbose #28
  • Removed "document_name" from farm.py #31
  • Add a document_name field in answers #30

REST API / Deployment

  • Move API from flask to fastAPI #3
  • Modularize API components #55
  • Return more meta data & restructure response format #66
  • Log API responses in APM #70
  • Make Elastic-APM optional #65
  • Update Python version in Dockerfile-GPU #71
  • Update Dockerfiles to use Gunicorn for deployment #69
  • Add limit on concurrent requests for doc-qa #64
  • Add Docker Images for running Haystack #85
  • Fix cyclic import of Elasticsearch client #59
  • Add Feedback export API #56
  • Add gpu dockerfile, improve logging, fix minor bug with filtering #36
  • Improve deployment of REST API (Configs, logging, minor bugs) #40

Others

  • Standardize Finder, Readers, and Retriever interfaces #62
  • pin haystack version in tutorials until release #87
  • Update tutorials to use Elasticsearch, new Retrievers #79
  • Adding coverage reports and a few more tests #78
  • Added Jupyter notebooks of Tutorials #43
  • Add minimal tutorial for ES #19
  • Update tutorials #12

Thanks to all contributors for your great work 👏
@tanaysoni, @Timoeller, @brandenchan, @bogdankostic, @skirdey, @stedomedo, @karthik19967829, @aadil-srivastava01, @tholor

Initial Release

28 Nov 09:55
0.1.0