Github: https://github.com/avnlp/biothink
- Developed BioThink, a framework for self-reflective reasoning on biomedical question answering, where the model explicitly structures its reasoning within XML tags and performs self-evaluation using specialized tokens (Relevance, Grounding, Utility) to critically assess the quality and alignment of its own outputs.
- Trained the BioThink model on Qwen3-1.7B using QLoRA for efficient parameter adaptation and the GRPO algorithm for alignment, on a corpus of question-answer pairs drawn from the MedInstruct, Mol-Instructions, PubMed abstracts, PubMed Central full texts, MedQA, and Clinical Guidelines datasets.
- Integrated five reward functions with GRPO to enforce generation of the self-evaluation tokens (Relevance, Grounding, Utility) and strict adherence to the required XML reasoning structure: tag presence, correct order, and syntactic validity (a minimal sketch of such a format reward follows this list).
- Implemented a robust answer correctness reward function using DeepEval's GEval metric, configured with a custom LLM-as-a-Judge instruction tailored for biomedical question answering.
- Systematically assessed model performance across seven metrics: XML structure integrity (presence and order of all reasoning, answer, and self-evaluation tags), per-token accuracy for the Utility, Relevance, and Grounding tokens, Answer Correctness (custom GEval metric), Faithfulness (adherence to source context), and Answer Relevancy (question alignment), the latter two using DeepEval's LLM-as-a-Judge metrics.
- Work to incorporate additional retrieval mechanisms into BioThink, based on Adaptive RAG, Corrective RAG, and RQ-RAG, is ongoing.
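As a rough illustration of the format rewards mentioned above, here is a minimal sketch of an XML-structure reward. The tag names and per-tag weighting are assumptions for illustration, not the repo's actual implementation:

```python
import re

# Hypothetical tag layout; the actual BioThink tags may differ.
TAG_ORDER = ["reasoning", "answer", "relevance", "grounding", "utility"]

def xml_structure_reward(completion: str) -> float:
    """Score tag presence, ordering, and syntactic validity (0.0 to 1.0)."""
    score = 0.0
    last_end = -1
    per_tag = 1.0 / len(TAG_ORDER)
    for tag in TAG_ORDER:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", completion, re.DOTALL)
        if match and match.start() > last_end:
            score += per_tag  # tag is present, well-formed, and in order
            last_end = match.end()
    return score
```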
Github: https://github.com/avnlp/rag-model-training
- Trained a T5 model as a query complexity classifier for Adaptive RAG on a combination of the Musique, NQ, TriviaQA, and HotpotQA datasets. The model classifies a query as Simple, Moderate, or Complex based on whether it can be answered without retrieval, with a single retrieval, or with multiple retrievals (see the inference sketch after this list).
- Trained a T5 model for retrieval evaluation on the Corrective RAG dataset to classify documents as 'Correct', 'Ambiguous', or 'Incorrect'.
- Fine-tuned a Llama-3.2 model for the RQ-RAG technique's three-stage query refinement (rewriting, decomposition, disambiguation). The model is trained to generate '[Rewritten_Query]', '[Decomposed_Query]', and '[Disambiguated_Query]' tokens.
- Trained a T5 model for Self-RAG on the Earnings-Call data. Created training data with the reflection tokens (Retrieval/Relevance/Grounding/Utility). Trained the model in two phases: a critic model that evaluates retrieval, and a generator model that uses the critic's feedback to generate responses.
- Fine-tuned Llama-3-8B with GRPO on TriviaQA for Agentic RAG, enabling autonomous missing-info detection, query rewriting, and validated tool-call generation using LLM-as-a-Judge rewards.
- Fine-tuned Llama-3-8B with GRPO on TriviaQA for ReZero Agentic RAG, which retries searches and refines queries; a composite LLM-as-a-Judge reward function enforced the response structure.
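As a rough illustration of how the Adaptive RAG complexity classifier is used at inference time, here is a hedged sketch; the checkpoint name and prompt prefix are placeholders, not the repo's actual values:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Placeholder checkpoint; the trained classifier is released in the repo.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def classify_query(query: str) -> str:
    """Return 'Simple', 'Moderate', or 'Complex' to route Adaptive RAG."""
    inputs = tokenizer(f"classify query complexity: {query}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```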
Github: https://github.com/avnlp/grpo
- Compared four implementations of GRPO from scratch, each demonstrating different approaches to the core algorithm while sharing common principles.
- Refactored the implementations to highlight differences in the core algorithm, reward functions, training frameworks, and reference model handling.
- The implementations share the Group Sampling, Reward Calculation, Advantage Normalization, and Policy Update steps (a sketch of the shared advantage step follows this list).
- Each implementation has different reward functions tailored to the task: Format Reward functions for enforcing XML-style reasoning and Correctness Reward functions for validating solution accuracy. The implementations use different training frameworks (e.g., DeepSpeed, pure PyTorch), and their approaches to generation (vLLM, HuggingFace Transformers) and batching also vary.
- Some implementations use a fixed reference model (via a separate server or a frozen copy) while others update the reference model periodically.
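A minimal sketch of the shared group-normalized advantage step, written against a generic (num_prompts, group_size) reward tensor rather than any one of the four implementations:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: (num_prompts, group_size) tensor, one scalar reward per sampled
    completion. Each completion's advantage is its reward normalized against
    the mean and std of its own group, so no learned value model is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)
```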
Github: avnlp/llm-finetuning
- Finetuned Llama3.2-3B using QLoRA and GRPO on the HotpotQA, FreshQA, and Musique datasets, implementing four correctness reward functions (DeepEval's GEval with custom LLM-as-a-Judge instructions for RAG, Summarization, and Answer Relevancy, plus Evidently AI's CorrectnessLLMEval) and four format reward functions to enforce the required XML tags and multiline response compliance (a hedged GEval sketch follows this project's bullets).
- Finetuned Phi-4, Mistral-7B, Llama3.2-3B, Llama3.1-8B, and Gemma3-1B on GSM8K using QLoRA and GRPO to generate step-by-step solutions, applying one correctness reward function and four format reward functions enforcing the required XML tags and multiline structure.
- Finetuned three models on UltraFeedback dataset for preference alignment using QLoRA: Zephyr-7B using DPO, Qwen2.5-1.5B via KTO, and Llama-3-8B via DPO, ORPO, and PPO (using LLM-Blender PairRM as reward model).
- Finetuned Llama-3-8B using QLoRA on the WebGPT comparisons dataset with DPO and PPO (using LLM-Blender PairRM as reward model) for question-answering preference alignment.
- Finetuned models using QLoRA, LoRA, DoRA, P-Tuning, and Prefix-Tuning on the ARC, Earnings Call, FactScore, PopQA, and TriviaQA datasets, and compared the performance of these adapter-based supervised fine-tuning techniques.
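A hedged sketch of wiring DeepEval's GEval into a correctness reward, following DeepEval's documented GEval/LLMTestCase API; the criteria text is illustrative and the judge-model configuration is omitted:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Criteria text is illustrative; the repo defines its own judge instructions.
correctness = GEval(
    name="RAG Correctness",
    criteria=(
        "Judge whether the actual output answers the question correctly "
        "and is consistent with the retrieved context."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

def correctness_reward(question: str, completion: str) -> float:
    """Use the GEval judge score (0 to 1) directly as the GRPO reward."""
    case = LLMTestCase(input=question, actual_output=completion)
    correctness.measure(case)
    return correctness.score
```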
Github: avnlp/rankers
Paper: LLM Rankers
- Implemented Pairwise, Setwise, and Listwise ranking techniques and released modular ranker components for the Haystack LLM framework. All rankers use Structured Generation with Pydantic validation for robust zero-shot LLM ranking (a minimal sketch follows this list).
- The Pairwise and Setwise rankers utilize efficient sorting methods (Heapsort and Bubblesort) to speed up inference.
- The Listwise ranker integrates with the RankLLM framework and supports LLMs specifically trained for ranking (such as RankGPT, RankLlama, and RankZephyr).
- Evaluated the performance of the ranking techniques on the FIQA, SciFact, NFCorpus, TREC-19, and TREC-20 datasets using the Mistral, Phi-3, and Llama-3 models.
- All rankers performed closely across all datasets. RankLlama and RankZephyr (with the Listwise ranker) achieved slightly better results than the other rankers. Among the base models, the Llama-3 model with the Setwise and Pairwise ranker performed the best.
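A minimal sketch of the structured-generation pattern behind the pairwise ranker, using Pydantic to validate the LLM's JSON verdict; the `llm` callable and prompt wording are placeholders:

```python
from pydantic import BaseModel, Field

class PairwisePreference(BaseModel):
    """Schema that the LLM's JSON output is validated against."""
    preferred: int = Field(description="0 if passage A is more relevant, 1 if B")

def compare(llm, query: str, doc_a: str, doc_b: str) -> int:
    """Ask the LLM which passage is more relevant; validate the reply."""
    prompt = (
        f"Query: {query}\nPassage A: {doc_a}\nPassage B: {doc_b}\n"
        'Answer as JSON: {"preferred": 0 or 1}'
    )
    raw = llm(prompt)  # placeholder: any structured-generation backend
    return PairwisePreference.model_validate_json(raw).preferred
```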
Github: https://github.com/avnlp/prp
- Implemented the Pairwise Ranking Prompting (PRP) technique from the paper with three sorting strategies: All-pair, Heapsort, and Sliding Window.
- 'PRP-allpair' enumerates all document pairs and aggregates the pairwise preferences into a global score for each document, effectively ranking documents by win ratio (a sketch follows this list). It is highly insensitive to input ordering.
- 'PRP-heapsort' uses the LLM's pairwise preferences as the comparator for Heapsort. It has lower computational complexity than PRP-allpair while remaining largely insensitive to input order.
- 'PRP-sliding_k' uses a sliding window that starts at the bottom of the initial ranking, compares adjacent document pairs, and swaps them with a stride of 1. It has favorable time complexity but depends heavily on the input order.
- Evaluated the performance of the ranker on the FIQA, SciFact, NFCorpus, TREC-19, and TREC-20 datasets using the Mistral, Phi-3, and Llama-3 models.
- 'PRP-allpair' with the Llama-3 model performed the best across all datasets. 'PRP-sliding_k' and 'PRP-heapsort' performed similarly across all datasets.
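A minimal sketch of the PRP-allpair win-ratio aggregation; `llm_prefers` stands in for a pairwise LLM call, and querying each pair in both orders is what makes the score insensitive to input ordering:

```python
from itertools import combinations

def prp_allpair(llm_prefers, query, docs):
    """Rank docs by win count over all ordered pairs (PRP-allpair).

    llm_prefers(query, a, b) -> True if the LLM prefers passage a over b.
    """
    wins = {i: 0.0 for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        # Query both orders to average out the LLM's position bias.
        for a, b in ((i, j), (j, i)):
            winner = a if llm_prefers(query, docs[a], docs[b]) else b
            wins[winner] += 1.0
    return sorted(range(len(docs)), key=lambda k: wins[k], reverse=True)
```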
Github: avnlp/rrf
Paper: Performance Evaluation of Rankers and RRF Techniques for Retrieval Pipelines
- Evaluated the performance of ranking models in conjunction with Reciprocal Rank Fusion (RRF) for fusing results from Hybrid Retrieval pipelines.
- Implemented a Diversity Ranker that maximizes the overall diversity of the selected documents using embedding-based similarity, with a Sentence-Transformers model providing the embeddings.
- Implemented a Lost In The Middle Ranker that places the most relevant documents at the beginning and end of the prompt and the least relevant in the middle, mitigating the lost-in-the-middle problem of LLMs (a reordering sketch follows this list).
- Evaluated the performance of the rankers on the FIQA dataset using Instructor-XL and all-mpnet-base-v2 as the embedding models.
- Evaluated different combinations of Dense Retrieval and Hybrid Retrieval in conjunction with the Diversity, Lost In The Middle, and Similarity Rankers.
- The best performance was achieved with the Instructor-XL embedding model using a Similarity Ranker, followed by a Diversity Ranker and a Lost In The Middle Ranker.
- We found that newer instruction-tuned embedding models like Instructor-XL outperform many ranker combinations due to their ability to use data-specific instructions.
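A minimal sketch of the Lost In The Middle reordering, assuming the input is already sorted most-to-least relevant:

```python
def lost_in_the_middle_order(docs):
    """Reorder docs so the most relevant sit at the ends of the prompt.

    docs is sorted most-to-least relevant; alternating items between the
    front and back pushes the least relevant documents toward the middle.
    """
    front, back = [], []
    for rank, doc in enumerate(docs):
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Example: [d1, d2, d3, d4, d5] (d1 best) -> [d1, d3, d5, d4, d2]
```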
Github: avnlp/llm-blender
Paper: LLM Ensembling: Haystack Pipelines with LLM-Blender
- LLM-Blender is an ensembling framework designed to achieve consistently superior performance by combining the outputs of multiple language models (LLMs). This work integrates LLM-Blender with RAG pipelines to significantly improve the quality of generated text.
- Implemented a custom Haystack component, LLMBlenderRanker, to integrate LLM-Blender with Haystack pipelines. The component uses the PairRanker module from the LLM-Blender framework, which compares each candidate output against the input in a pairwise manner (a usage sketch follows this list).
- Different LLMs can generate subtly different texts, since they are trained on different datasets and tasks. By comparing candidates pairwise, the component produces a ranking and ensemble that is robust to these subtle differences.
- Ranking techniques like MLM-Scoring, SimCLS, and SummaReranker score each candidate individually against the input but do not compare candidates pairwise, which can cause them to miss subtle differences between LLM outputs.
- Evaluated pipelines ensembling Mistral-7B, Llama-3-8B, and Phi-3-mini with LLM-Blender on the MixInstruct and BillSum datasets using the BERTScore, BARTScore, and BLEURT metrics. The MixInstruct benchmark dataset was curated by the LLM-Blender authors to benchmark LLM ensembles on instruction-following tasks.
- The newer models (Llama-3-8B, Phi-3-mini, and Mistral-7B) significantly outperformed all the models used by the LLM-Blender authors on all three metrics on the MixInstruct dataset.
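A hedged usage sketch of the underlying PairRM ranker, following the `llm_blender` package's documented Blender API; the input and candidate strings are placeholders:

```python
import llm_blender

# Load the PairRM pairwise ranker released by the LLM-Blender authors.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

inputs = ["Summarize the bill in one paragraph."]
candidates = [[
    "candidate answer from Mistral-7B",
    "candidate answer from Llama-3-8B",
    "candidate answer from Phi-3-mini",
]]  # one candidate list per input

# rank() compares candidates pairwise against each input and returns a
# rank matrix; a lower rank means a better candidate.
ranks = blender.rank(inputs, candidates)
best = candidates[0][ranks[0].argmin()]
```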
Omega RAG provides a framework that combines several advanced RAG techniques into a high-performing RAG pipeline: query rewriting, HyDE, adaptive retrieval (no retrieval, single-step, or iterative retrieval), correction via retrieval evaluation and confidence scoring, unified active retrieval, reranking, citation generation, user feedback, a hybrid structured router, a scattered-knowledge structurizer, and a structured-knowledge utilizer.
Under active development.
Github: avnlp/vectordb
Designed and implemented pipelines demonstrating the use of various vector databases for Semantic Search, Metadata Filtering, Hybrid Search, Reranking, and Retrieval-Augmented Generation (RAG). The pipelines were built using the Pinecone, Weaviate, Chroma, Milvus, and Qdrant vector databases.
We compare and contrast the functionality of each vector database using the different pipelines for each technique.
Vector database pipelines were created using LangChain and Haystack to highlight Hybrid Search, Metadata Filtering, and Reranking.
Pipelines were developed for datasets such as TriviaQA, ARC, PopQA, FactScore, Earnings Calls, and SEC Filings.
Hybrid Search Pipelines: Created pipelines combining sparse and dense indexes by upserting data into each index separately. Hybrid Search enables a unified query approach that merges semantic (dense) and keyword (sparse) search for improved relevance in results (a hedged query sketch follows these pipeline descriptions).
Metadata Filtering Pipelines: Developed pipelines leveraging metadata fields associated with vectors to filter search results during query time. Metadata enhances vectors with contextual information, enabling more meaningful and precise filtering.
Reranker Pipelines: Implemented pipelines to rerank semantic search results, ensuring the most relevant results are prioritized and returned.
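A hedged sketch of a hybrid query against one of the databases (Pinecone), using the v3 Python client's documented query call; the index name, API key, and vectors are placeholders:

```python
from pinecone import Pinecone

# Placeholder credentials and index name; real pipelines embed the query
# with dense and sparse (e.g., BM25) encoders first.
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("hybrid-earnings-calls")

dense_query_vector = [0.1] * 1024  # placeholder dense embedding

results = index.query(
    vector=dense_query_vector,                                   # semantic signal
    sparse_vector={"indices": [10, 42], "values": [0.8, 0.3]},   # keyword signal
    top_k=5,
    include_metadata=True,
)
```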
| Feature | Milvus | Weaviate | Qdrant | Pinecone | Chromadb |
|---|---|---|---|---|---|
| Indexes | Supports both sparse and dense vectors, using IVF for dense indexing and BM25 for sparse retrieval | Supports both sparse and dense vectors, using HNSW for dense and BM25 for sparse indexing | Supports both sparse and dense vectors, using HNSW for dense and hybrid search mechanisms for sparse | Supports only dense vectors, optimized for approximate nearest neighbor (ANN) search; sparse vectors not supported | Supports only dense vectors with flat embeddings, optimized for in-memory search |
| Hybrid Search | BM25 + vector search using hybrid query modes | BM25 + vector search with alpha parameter for balance | BM25 + ANN search with structured filtering | Single sparse-dense index; requires both sparse and dense query vectors | Not supported |
| Partition | Uses partitions to separate data; queries limited to a partition | Uses tenants for isolation; queries limited to a tenant | Uses named collections for data separation; queries filtered within collections | Uses namespaces to partition records; queries limited to one namespace | Uses collections as namespaces; queries directed to a collection |
| Semantic Search | Uses IVF, HNSW, and ANNOY for efficient vector retrieval | Vector-based retrieval; results based on embedding similarity | Real-time vector similarity search with contextual relevance | Finds similar content using vector proximity; supports metadata filtering | Stores and retrieves vector embeddings for similarity search |
| Metadata Filtering | SQL-like filtering with structured metadata fields | GraphQL-based filtering with hierarchical queries | Payload-based filtering with structured metadata | Dictionary-based metadata filtering attached to vectors | Key-value filtering using Pythonic expressions |
Github: avnlp/hyperparameter-tuning
Paper: Optimizer Inclusions
- The choice of optimization algorithm for training Large Language Models (LLMs) significantly impacts both training speed and final predictive performance.
- To illustrate the sensitivity of optimizer comparisons to hyperparameter tuning protocols, we conducted extensive hyperparameter tuning across three NLP tasks: Sentiment Analysis, Question Answering, and Text Summarization.
- For Sentiment Analysis, we used Financial Phrasebank, StockTwits, and FinGPT-Sentiment datasets. Question Answering experiments were conducted on SQuAD, CoQA, and FIQA datasets, while Summarization tasks employed Multi-News and BillSum datasets.
- We fine-tuned DistilBERT, BERT, and FinBERT for Sentiment Analysis on the StockTwits and Financial PhraseBank datasets; DistilBERT, BERT, and RoBERTa for Question Answering on the CoQA and SQuAD datasets; and BART, DistilBART, and T5 for Text Summarization on the BillSum and Multi-News datasets.
- Using these fine-tuned models, we demonstrated the inclusion relationships for a range of optimizers, including Adam, RMSProp, Nesterov Accelerated Gradient (NAG), SGD with momentum, and vanilla SGD: a more general optimizer can emulate a more specific one at particular hyperparameter settings, so with sufficient tuning it should not underperform its special cases.
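A small illustration of such inclusions in PyTorch, instantiating a more general optimizer so it approximately reduces to a more specific one; the equivalences hold only up to details like bias correction and epsilon placement:

```python
import torch

params = [torch.nn.Parameter(torch.randn(10))]

# Adam with beta1 = 0 drops the first-moment (momentum) term, leaving an
# RMSProp-like update (up to bias correction and epsilon placement).
rmsprop_like = torch.optim.Adam(params, lr=1e-3, betas=(0.0, 0.99))

# SGD with momentum = 0 is exactly vanilla SGD.
vanilla_sgd = torch.optim.SGD(params, lr=1e-3, momentum=0.0)
```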