author Aditya <bluenerd@protonmail.com> 2025-02-12 19:58:00 +0530
committer Aditya <bluenerd@protonmail.com> 2025-02-12 19:58:00 +0530
commit fb06af59643004afcbe7bd10fa4a7cedbbbaf44f (patch)
tree 1a1610e53754f8425e09882b7d3e6dd360aea219
parent 20a30bf01fa908bf40f059595d57852ace676efa (diff)
add stuff
-rw-r--r--  sources.md  2393
1 file changed, 2298 insertions(+), 95 deletions(-)
diff --git a/sources.md b/sources.md
index 30f32f6..6406a2b 100644
--- a/sources.md
+++ b/sources.md
@@ -5,13 +5,76 @@
**DOI**: [https://doi.org/10.1561/1500000019](https://doi.org/10.1561/1500000019)
+**Published**: Foundations and Trends in Information Retrieval, Volume 3, Issue 4 (01 April 2009)
+
+**Authors**
+- [Stephen E. Robertson](https://dl.acm.org/profile/81100303767), _University College London_
+- [Hugo Zaragoza](https://dl.acm.org/profile/81100334077), _Yahoo Research Barcelona_
+
## Summary
The paper explores the theoretical underpinnings, development, and extensions of the Probabilistic Relevance Framework (PRF) used in information retrieval systems. Central to this framework is the idea of estimating the probability of relevance between a query and a document, which serves as the foundation for ranking algorithms like BM25.
+## Issues Targeted
+- **Probabilistic Relevance Framework (PRF)**
+ - Exploration of the theoretical foundations of the PRF.
+ - Discussion on the development and evolution of retrieval models, particularly BM25 and BM25F.
+
+- **Model Derivation and Comparison**
+ - Examination of various derived models from the basic PRF, including the Binary Independence Model and BM25.
+ - Comparison of PRF with other information retrieval models, such as language models and divergence from randomness models.
+
+- **Parameter Optimization**
+ - Addressing the need for optimizing parameters within the models to improve retrieval effectiveness.
+ - Exploration of different optimization techniques, including greedy optimization and gradient optimization.
+
+- **Incorporation of Non-Textual Features**
+ - Investigation of how non-textual features (e.g., document age, type) can be integrated into the PRF and BM25 scoring frameworks.
+
+## Contribution/Novelty
+- **Development of BM25 and BM25F**
+ - Introduction and thorough explanation of BM25, one of the most successful text-retrieval algorithms, and its extension, BM25F, which incorporates document metadata and structure.
+
+- **Integration of Relevance Feedback Mechanisms**
+ - Novel insights into how relevance feedback can be effectively integrated into retrieval models, enhancing the understanding of user interactions with search systems.
+
+- **Parameter Optimization Techniques**
+ - The paper discusses various parameter optimization strategies, providing a framework for improving the performance of retrieval models through empirical tuning.
+
+- **Inclusion of Non-Textual Features**
+ - It explores the integration of non-textual relevance features into the PRF, expanding the scope of traditional text-based retrieval models to include additional contextual information.
+
+## Approach
+- **Model Derivation**
+ - The authors derive various models from the basic PRF, including the Binary Independence Model and BM25. This involves mathematical formulations and transformations to express the probability of relevance in terms of document and query features.
+
+- **Empirical Analysis**
+ - The paper discusses empirical results from various studies that validate the effectiveness of the proposed models. It references experimental findings that support the theoretical claims made about the models.
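+
+Not from the monograph itself, but as a concrete illustration of the ranking function it develops, here is a minimal BM25 scoring sketch in Python (the parameter values k1 = 1.2 and b = 0.75 and the toy corpus are assumptions for illustration):
+
+```python
+import math
+from collections import Counter
+
+def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
+    """Score one document against a query with the classic BM25 formula."""
+    tf = Counter(doc_terms)
+    doc_len = len(doc_terms)
+    score = 0.0
+    for term in query_terms:
+        df = doc_freqs.get(term, 0)
+        if df == 0:
+            continue
+        # Robertson/Sparck Jones style IDF component
+        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
+        # Term-frequency saturation with document-length normalization
+        denom = tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len)
+        score += idf * tf[term] * (k1 + 1) / denom
+    return score
+
+# Toy usage on a two-document "collection"
+docs = [["the", "probabilistic", "relevance", "framework"],
+        ["bm25", "ranks", "documents", "by", "relevance"]]
+doc_freqs = Counter(t for d in docs for t in set(d))
+avg_len = sum(len(d) for d in docs) / len(docs)
+print(bm25_score(["relevance", "ranking"], docs[1], doc_freqs, len(docs), avg_len))
+```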
+
+## Dataset/Testing
+- **General Testing Methodology**
+ - **Relevance Judgments**: The models are evaluated against relevance judgments that indicate which documents are relevant to specific queries.
+ - **Performance Metrics**: Common information retrieval metrics such as Precision, Recall, Average Precision, Mean Reciprocal Rank, and Discounted Cumulative Gain (DCG) are used to assess the performance of the models.
+ - **Comparative Experiments**: The paper discusses how the models perform in comparison to other established models, drawing on results from various retrieval experiments conducted in the literature.
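+
+For quick reference, a minimal sketch of two of the metrics named above, Average Precision and DCG, computed over a ranked list of binary relevance judgments (illustrative only; the function names and toy data are assumptions, not from the paper):
+
+```python
+import math
+
+def average_precision(relevances):
+    """relevances: 0/1 judgments in ranked order; averages precision at each relevant rank."""
+    hits, precisions = 0, []
+    for rank, rel in enumerate(relevances, start=1):
+        if rel:
+            hits += 1
+            precisions.append(hits / rank)
+    return sum(precisions) / max(hits, 1)
+
+def dcg(relevances):
+    """Discounted Cumulative Gain with the usual log2 rank discount."""
+    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))
+
+ranked = [1, 0, 1, 1, 0]
+print(average_precision(ranked), dcg(ranked))
+```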
+
+## Results
+- **Validation of the Probabilistic Relevance Framework (PRF)**
+ - The paper demonstrates that the PRF provides a solid theoretical foundation for understanding document retrieval and relevance. The models derived from the PRF, particularly BM25, are shown to be effective in ranking documents based on their relevance to user queries.
+
+- **Effectiveness of BM25 and BM25F**
+ - BM25 is highlighted as one of the most successful text-retrieval algorithms, with empirical evidence supporting its effectiveness in various information retrieval tasks. BM25F, which incorporates document metadata and structure, is also shown to enhance retrieval performance, particularly in web search contexts.
+
+## Findings
+- **Effectiveness of PRF Models**: The paper finds that the Probabilistic Relevance Framework (PRF) and its derived models, particularly BM25 and BM25F, are highly effective for document retrieval tasks. These models provide a robust theoretical foundation for understanding relevance and ranking in information retrieval.
+- **Parameter Robustness**: The models are shown to be relatively robust to variations in parameter settings, indicating that small changes do not significantly impact retrieval performance. However, the paper emphasizes that careful parameter optimization can lead to improved results.
+- **Integration of Non-Textual Features**: The findings suggest that incorporating non-textual features (such as document age, type, and link information) into the PRF can enhance retrieval effectiveness, providing a more comprehensive assessment of document relevance.
+
## Limitations
-- **Assumptions of Relevance:** The paper assumes relevance is a binary property, which may not capture the nuances of user needs where relevance can be graded or context-dependent.
-- **Independence Assumptions:** The model relies on conditional independence between terms, which is often not true in practice. This can lead to oversimplifications and inaccuracies in relevance estimation.
-- **Lack of Explicit Probability Estimates:** While the model focuses on ranking documents, it does not provide a mechanism for estimating the actual probability of relevance for each document, which can be crucial in certain retrieval scenarios.
+- **Lack of Specific Dataset Testing**: The paper does not utilize a specific dataset for testing the models, relying instead on empirical results from various studies. This may limit the ability to generalize findings to specific contexts or datasets.
+- **Assumptions of Independence**: The models make certain assumptions about the independence of terms and features, which may not hold true in all cases. This could affect the accuracy of relevance estimations in practice.
+- **Positional Information**: The paper acknowledges the challenges of incorporating positional information into the models, which may limit their effectiveness in scenarios where the position of terms significantly impacts relevance.
+
+## Scope
+- **Future Research Directions**: The paper identifies several areas for future research, including the exploration of more sophisticated models that can better account for term dependencies, the integration of additional non-textual features, and the development of methods for effective parameter optimization.
+
# Dense Passage Retrieval for Open-Domain Question Answering
**Domain**: RAG
@@ -20,15 +83,96 @@ explores the theoretical underpinnings, development, and extensions of the Proba
**DOI**: [https://doi.org/10.18653/v1/2020.emnlp-main.550](https://doi.org/10.18653/v1/2020.emnlp-main.550)
+**Published**: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), (2020)
+
+**Authors**
+- [Vladimir Karpukhin](https://www.webofscience.com/wos/author/record/10012394), [Barlas Oguz](https://www.webofscience.com/wos/author/record/26266761), [Patrick Lewis](https://www.webofscience.com/wos/author/record/42545040), [Ledell Wu](https://www.webofscience.com/wos/author/record/37173104), [Sergey Edunov](https://www.webofscience.com/wos/author/record/7562279), [Wen-tau Yih](https://www.webofscience.com/wos/author/record/18031680), _Facebook AI_
+- [Sewon Min](https://www.webofscience.com/wos/author/record/25540339), _University of Washington_
+- [Danqi Chen](https://www.webofscience.com/wos/author/record/37174488), _Princeton University_
+
## Summary
-The paper presents an innovative approach to passage retrieval for answering open-domain questions. It addresses limitations in traditional sparse vector models like TF-IDF and BM25 by introducing dense representations trained using a dual-encoder framework. This framework uses embeddings learned from question-passage pairs to improve retrieval accuracy. Dense Passage Retrieval (DPR) is shown to significantly outperform BM25, achieving superior performance on top-20 and top-100 passage retrieval accuracy across multiple benchmarks. The study's key contributions include the effective use of a dual-encoder architecture optimized for inner product similarity between questions and passages, without requiring extensive pretraining. DPR’s robustness is demonstrated through strong empirical results on datasets like Natural Questions and TriviaQA, where it achieves state-of-the-art results in passage retrieval and end-to-end question answering. Additionally, the research highlights that dense retrieval methods benefit from careful training setups, such as in-batch negatives, which improve retrieval precision.
+The paper presents an innovative approach to passage retrieval for answering open-domain questions. It addresses limitations in traditional sparse vector models like TF-IDF and BM25 by introducing dense representations trained using a dual-encoder framework. This framework uses embeddings learned from question-passage pairs to improve retrieval accuracy. Dense Passage Retrieval (DPR) is shown to significantly outperform BM25, achieving superior performance on top-20 and top-100 passage retrieval accuracy across multiple benchmarks. The study's key contributions include the effective use of a dual-encoder architecture optimized for inner product similarity between questions and passages, without requiring extensive pretraining. DPR's robustness is demonstrated through strong empirical results on datasets like Natural Questions and TriviaQA, where it achieves state-of-the-art results in passage retrieval and end-to-end question answering. Additionally, the research highlights that dense retrieval methods benefit from careful training setups, such as in-batch negatives, which improve retrieval precision.
+
+## Issues Targeted
+The paper aims to address these issues by proposing a dense passage retrieval model (DPR) that can effectively learn from a smaller number of question-passage pairs without the need for extensive pretraining, thereby improving retrieval accuracy and overall QA performance.
+
+- **Inefficiency of Traditional Retrieval Methods**
+ - Traditional sparse vector space models (e.g., TF-IDF, BM25) are the standard for passage retrieval in open-domain question answering (QA).
+ - These methods often struggle with semantic matching, particularly with synonyms or paraphrases.
+
+- **Performance Degradation in QA Systems**
+ - There is a significant performance drop when transitioning from retrieval to reading comprehension in QA systems.
+ - For example, the exact match score on SQuAD v1.1 drops from above 80% to less than 40% when relying solely on traditional retrieval methods.
+
+- **Limitations of Dense Retrieval Approaches**
+ - Previous dense retrieval methods, such as ORQA, required complex pretraining and did not fully optimize the context encoder using question-answer pairs.
+ - Dense retrieval methods had not been shown to outperform traditional methods in open-domain QA prior to this work.
+
+## Contribution/Novelty
+The paper contributes to the field by providing a robust and efficient dense retrieval method that simplifies the training process while achieving superior performance in open-domain question answering tasks.
+
+- **Introduction of Dense Passage Retrieval (DPR)**
+ - The paper presents a novel dense passage retrieval model that utilizes dense representations for efficient passage retrieval in open-domain question answering.
+
+- **Effective Use of Dual-Encoder Framework**
+ - The DPR employs a simple dual-encoder architecture that learns embeddings from a limited number of question-passage pairs, demonstrating that effective retrieval can be achieved without extensive pretraining.
+
+## Approach
+- **Dual-Encoder Architecture**
+ - **Question Encoder (E_Q)**: Maps input questions to dense vector representations.
+ - **Passage Encoder (E_P)**: Maps text passages to dense vector representations.
+ - Both encoders are based on the BERT model, specifically using the representation from the [CLS] token.
+
+- **Similarity Measurement**
+ - The similarity between a question and a passage is computed using the dot product of their respective vector representations:
+    $\text{sim}(q, p) = E_Q(q)^\top E_P(p)$
+ - This approach allows for efficient retrieval of passages that are semantically relevant to the input question.
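+
+A schematic of the dot-product similarity above combined with the in-batch negative objective mentioned in the summary; random vectors stand in for the BERT [CLS] encoders, so this sketches the loss rather than the paper's implementation:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+batch, dim = 4, 8                       # four gold question-passage pairs in the batch
+q_vecs = rng.normal(size=(batch, dim))  # stand-ins for E_Q(q) [CLS] embeddings
+p_vecs = rng.normal(size=(batch, dim))  # stand-ins for E_P(p) [CLS] embeddings
+
+# sim(q_i, p_j) = E_Q(q_i)^T E_P(p_j); diagonal entries correspond to the gold pairs
+sim = q_vecs @ p_vecs.T
+
+# In-batch negatives: each question treats the other passages in the batch as negatives,
+# and the loss is softmax cross-entropy towards the gold (diagonal) passage.
+sim = sim - sim.max(axis=1, keepdims=True)                       # numerical stability
+log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
+loss = -float(np.mean(np.diag(log_probs)))
+print("in-batch negative loss:", loss)
+```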
+
+## Results
+- **Passage Retrieval Performance---Top-20 and Top-100 Retrieval Accuracy**: The DPR significantly outperforms the traditional BM25 retrieval method across multiple datasets:
+ - **Natural Questions (NQ)**:
+ - DPR: 78.4% (Top-20), 85.4% (Top-100)
+ - BM25: 59.1% (Top-20), 73.7% (Top-100)
+ - **TriviaQA**:
+ - DPR: 79.4% (Top-20), 85.0% (Top-100)
+ - BM25: 66.9% (Top-20), 76.7% (Top-100)
+ - **WebQuestions (WQ)**:
+ - DPR: 73.2% (Top-20), 81.4% (Top-100)
+ - BM25: 55.0% (Top-20), 71.1% (Top-100)
+ - **CuratedTREC (TREC)**:
+ - DPR: 79.8% (Top-20), 89.1% (Top-100)
+ - BM25: 70.9% (Top-20), 84.1% (Top-100)
+ - ***SQuAD***:
+ - DPR: 63.2% (Top-20), 77.2% (Top-100)
+ - BM25: 68.8% (Top-20), 80.0% (Top-100)
+
+- **End-to-End Question Answering Performance---Exact Match (EM) Accuracy**
+ - **Natural Questions (NQ)**: 41.5% EM
+ - **TriviaQA**: 56.8% EM
+ - **WebQuestions (WQ)**: 34.6% EM
+ - **CuratedTREC (TREC)**: 25.9% EM
+ - **SQuAD**: 29.8% EM
+
+## Findings
+- **Effectiveness of Dense Retrieval**: The Dense Passage Retriever (DPR) significantly outperforms traditional sparse retrieval methods (e.g., BM25) across multiple datasets, demonstrating that dense representations can effectively capture semantic relationships and improve retrieval accuracy.
+
+- **Training Efficiency**: The paper shows that a relatively small number of question-passage pairs can be sufficient for training a high-quality dense retriever, challenging the notion that extensive pretraining is necessary.
+
+- **Direct Correlation Between Retrieval and QA Performance**: Higher retrieval precision directly translates to improved end-to-end question answering accuracy, indicating that enhancing the retrieval component is crucial for overall system performance.
+
+- **Generalization Across Datasets**: The DPR model exhibits good generalization capabilities, performing well on smaller datasets when trained on larger ones, suggesting its robustness in various open-domain QA contexts.
## Limitations
-- **Dependence on Pre-trained Models**: The Dense Passage Retriever (DPR) relies on the BERT pre-trained model, which may limit its performance on domains or tasks that differ significantly from the data used for pre-training.
-- **Training Data Requirements**: Although the paper claims that a small number of question-passage pairs can yield good results, the need for labeled pairs still poses a challenge, especially in domains where such data is scarce.
-- **Computational Intensity**: The training process, particularly for the dense representations, is computationally intensive. While the retrieval process is efficient, the initial indexing of passages requires significant resources and time.
-- **Generalization Issues**: The model's performance may degrade when applied to datasets that differ from those used during training, indicating potential overfitting to the training data.
-- **Evaluation Metrics**: The reliance on specific evaluation metrics (e.g., exact match) may not fully capture the model's performance in real-world applications, where nuanced understanding and flexibility are required.
+- **Sensitivity to Salient Phrases**: While DPR excels at capturing semantic relationships, it may struggle with highly specific or salient phrases that are critical for certain questions, potentially leading to missed answers.
+
+- **Dependence on Quality of Training Data**: The performance of the DPR is contingent on the quality and relevance of the training data. If the question-passage pairs are not representative of the target domain, retrieval accuracy may suffer.
+
+- **Computational Resources for Indexing**: Although retrieval is efficient during inference, the initial indexing of dense vectors can be resource-intensive and time-consuming, particularly for large datasets.
+
+- **Limited Exploration of Alternative Architectures**: The paper primarily focuses on the dual-encoder architecture based on BERT, leaving room for exploration of other architectures or enhancements that could further improve performance.
+
+## Scope
+- **Integration with Other Models**: The DPR can be combined with other models, such as generative models or more complex reader architectures, to create hybrid systems that leverage the strengths of both retrieval and generation.
# Learning Transferable Visual Models From Natural Language Supervision
**Domain**: OCR
@@ -37,16 +181,77 @@ The paper presents an innovative approach to passage retrieval for answering ope
**DOI**: [http://dx.doi.org/10.48550/arXiv.2103.00020](http://dx.doi.org/10.48550/arXiv.2103.00020)
+**Published**: International Conference on Machine Learning (ICML), Vol 139, (2021)
+
+**Authors**:
+- [Alec Radford](https://www.webofscience.com/wos/author/record/27544330), [Jong Wook Kim](https://www.webofscience.com/wos/author/record/34863713), [Chris Hallacy](https://www.webofscience.com/wos/author/record/20688300), [Aditya Ramesh](https://www.webofscience.com/wos/author/record/14032406), [Gabriel Goh](https://www.webofscience.com/wos/author/record/8052350), [Sandhini Agarwal](https://www.webofscience.com/wos/author/record/18634916), [Girish Sastry](https://www.webofscience.com/wos/author/record/13790459), [Amanda Askell](https://www.webofscience.com/wos/author/record/19694444), [Pamela Mishkin](https://www.webofscience.com/wos/author/record/12046580), [Jack Clark](https://www.webofscience.com/wos/author/record/63528259), [Gretchen Krueger](https://www.webofscience.com/wos/author/record/24882257), [Ilya Sutskever](https://www.webofscience.com/wos/author/record/16383034), _OpenAI_
+
## Summary
The paper introduces CLIP (Contrastive Language-Image Pretraining), a scalable framework for training visual models directly from raw text-image pairs. Unlike traditional methods that rely on pre-defined object categories, CLIP uses a dataset of 400 million image-text pairs to train both an image and a text encoder jointly, predicting the alignment of an image with its corresponding text. This contrastive approach allows the model to generalize without task-specific training, enabling zero-shot transfer to various computer vision tasks. The study demonstrates CLIP’s efficacy by testing it across more than 30 datasets, including tasks like image classification, OCR, and action recognition. CLIP achieves competitive results against state-of-the-art supervised models, matching ImageNet accuracy of ResNet50 in a zero-shot setting. Furthermore, CLIP exhibits robustness to distribution shifts, outperforming standard models under such conditions. Despite its strengths, the authors highlight limitations, such as computational demands and challenges in handling complex or abstract tasks.
+## Issues Targeted
+- **Limited Generality and Usability of Current Computer Vision Systems**: Current computer vision systems are trained to predict a fixed set of predetermined object categories, which limits their generality and usability.
+- **Need for Large-Scale Labeled Data**: Current computer vision systems require large-scale labeled data to train, which can be time-consuming and expensive to obtain.
+- **Limited Robustness to Distribution Shift**: Current computer vision systems can be brittle and prone to errors when faced with distribution shift, where the test data has a different distribution than the training data.
+
+## Contribution/Novelty
+- **Contrastive Language-Image Pre-training (CLIP)**: The paper introduces a new pre-training method called CLIP, which uses contrastive learning to align visual and language representations.
+- **Large-Scale Pre-Training on WebImageText Dataset**: The paper pre-trains CLIP on a large-scale dataset of 400 million image-text pairs, which is a significant contribution to the field.
+- **Zero-Shot Transfer Learning**: The paper demonstrates the ability of CLIP to perform zero-shot transfer learning on a wide range of computer vision tasks, including image classification, object detection, and segmentation.
+
+## Approach
+- **Model Architecture**: The approach utilizes two main components: an image encoder and a text encoder.
+ - **Image Encoder**: The image encoder can be based on architectures like ResNet or Vision Transformer (ViT).
+ - **Text Encoder**: The text encoder is based on a Transformer architecture, which processes the text input and generates embeddings.
+
+- **Training Methodology**:
+ - The model is trained from scratch using a contrastive objective, which is more efficient than traditional predictive objectives.
+ - The training involves constructing batches of (image, text) pairs and optimizing a symmetric cross-entropy loss over the similarity scores of the embeddings.
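+
+A minimal numpy sketch of the symmetric contrastive objective described above; random features stand in for the image and text encoders, and the fixed temperature of 0.07 is an assumption for illustration (CLIP learns the temperature during training):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+n, dim, temperature = 8, 32, 0.07
+img = rng.normal(size=(n, dim))   # stand-in for image-encoder outputs
+txt = rng.normal(size=(n, dim))   # stand-in for text-encoder outputs
+
+# L2-normalize and compute the n x n cosine-similarity logits
+img = img / np.linalg.norm(img, axis=1, keepdims=True)
+txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
+logits = img @ txt.T / temperature
+
+def cross_entropy(logits, axis):
+    logits = logits - logits.max(axis=axis, keepdims=True)
+    log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
+    return -np.mean(np.diag(log_probs))
+
+# Symmetric loss: classify the matching text for each image and vice versa
+loss = 0.5 * (cross_entropy(logits, axis=1) + cross_entropy(logits, axis=0))
+print("symmetric contrastive loss:", loss)
+```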
+
+## Dataset/Testing
+- **WebImageText (WIT) Dataset**:
+ - **Size**: The dataset consists of 400 million (image, text) pairs.
+ - **Source**: The pairs are collected from various publicly available sources on the internet.
+ - **Coverage**: To ensure a broad representation of visual concepts, the dataset includes up to 20,000 pairs per query from a set of 500,000 queries.
+ - **Quality**: The dataset aims to leverage the rich supervision available in natural language descriptions associated with images, which is more extensive than traditional crowd-labeled datasets.
+
+- **Zero-Shot Transfer Evaluation**:
+ - The CLIP model is evaluated on over 30 different computer vision datasets without any additional training on those datasets.
+ - The model uses natural language to specify the classes or tasks for evaluation, allowing it to perform zero-shot learning.
+ - During testing, the names or descriptions of the target dataset’s classes are embedded using the text encoder, which synthesizes a linear classifier for the task.
+
+- **Evaluation on Standard Datasets**:
+ - The performance of the CLIP model is compared against existing benchmarks on standard datasets such as ImageNet, CIFAR10, UCF101, and others.
+ - The evaluation includes various tasks such as object classification, action recognition, optical character recognition (OCR), and more.
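+
+To make the zero-shot evaluation protocol above concrete, a sketch of how class-name prompts can be turned into a linear classifier; the prompt template and encoder stubs are hypothetical placeholders, not CLIP's code:
+
+```python
+import numpy as np
+
+dim = 64
+
+def encode_text(prompt):
+    """Placeholder for CLIP's text encoder: a deterministic random projection."""
+    seed = sum(ord(ch) for ch in prompt)              # hypothetical stand-in, not CLIP
+    v = np.random.default_rng(seed).normal(size=dim)
+    return v / np.linalg.norm(v)
+
+def encode_image(image):
+    """Placeholder for CLIP's image encoder."""
+    v = np.asarray(image, dtype=float)
+    return v / np.linalg.norm(v)
+
+class_names = ["cat", "dog", "truck"]
+# Each class name is embedded through a natural-language prompt; the stacked text
+# embeddings act as the weight matrix of a synthesized zero-shot linear classifier.
+classifier = np.stack([encode_text(f"a photo of a {name}") for name in class_names])
+
+image_embedding = encode_image(np.random.default_rng(1).normal(size=dim))
+scores = classifier @ image_embedding                 # cosine similarities (unit-norm vectors)
+print("predicted class:", class_names[int(np.argmax(scores))])
+```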
+
+## Results
+- **Zero-Shot Performance**:
+ - CLIP achieves a top-1 accuracy of 76.2% on ImageNet in a zero-shot setting, matching the performance of the original ResNet50 model, which was trained on 1.28 million labeled examples.
+ - The model significantly improves upon previous zero-shot transfer methods, such as Visual N-Grams, which had a top-1 accuracy of only 11.5% on ImageNet.
+
+- **Robustness to Distribution Shift**:
+ - CLIP demonstrates improved robustness to natural distribution shifts compared to standard ImageNet models. It reduces the gap between in-distribution accuracy and out-of-distribution accuracy by up to 75%.
+
+## Findings
+- **Effective Zero-Shot Learning**:
+ - CLIP demonstrates strong zero-shot learning capabilities, achieving competitive performance on various computer vision tasks without requiring task-specific training data.
+
+- **Robustness to Distribution Shift**:
+ - The model shows improved robustness to natural distribution shifts compared to traditional models, indicating that it can generalize better to unseen data distributions.
+
+- **Wide Range of Task Learning**:
+ - CLIP learns to perform a diverse set of tasks during pre-training, including object classification, action recognition, and optical character recognition (OCR), showcasing its versatility.
+
## Limitations
-- **Performance Comparison**:
- - Zero-shot CLIP's performance is often only competitive with a linear classifier on ResNet-50 features, which is below the overall state-of-the-art (SOTA).
- - Significant improvements are needed for CLIP to reach SOTA performance across evaluation suites, estimated to require around a 1000x increase in compute.
-- **Evaluation Methodology**:
- - The reliance on validation sets during development may not reflect true zero-shot scenarios, as it introduces a form of bias.
- - The selection of evaluation datasets may be co-adapted with CLIP's capabilities, potentially skewing results.
+- **Evaluation Bias**:
+ - The evaluation datasets used may be co-adapted with the capabilities of CLIP, raising concerns about the generalizability of the results to entirely new tasks or datasets.
+
+- **Limited Few-Shot Learning Performance**:
+ - While CLIP excels in zero-shot settings, its performance in few-shot learning scenarios may not be as strong, as indicated by counter-intuitive drops in performance when transitioning from zero-shot to few-shot settings.
+
+## Scope
+- **Benchmarking and Evaluation**:
+ - The need for standardized benchmarks to evaluate zero-shot transfer capabilities and broader task learning in computer vision is emphasized, which could help in assessing the true performance of models like CLIP.
# C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models
**Domain**: RAG
@@ -55,12 +260,91 @@ The paper introduces CLIP (Contrastive Language-Image Pretraining), a scalable f
**DOI**: [https://doi.org/10.48550/arXiv.2402.03181](https://doi.org/10.48550/arXiv.2402.03181)
+**Published**: ICML'24: Proceedings of the 41st International Conference on Machine Learning, 2024
+
+**Authors**:
+- [Mintong Kang](https://dl.acm.org/profile/99661151219), _University of Illinois at Urbana-Champaign_
+- [Nezihe Merve Gurel](https://dl.acm.org/profile/99659461543), _Delft University of Technology, Netherlands_
+- [Ning Yu](https://dl.acm.org/profile/99661466339), _Netflix Eyeline Studios_
+- [Dawn Song](https://dl.acm.org/profile/99661230842), _University of California, Berkeley_
+- [Bo Li](https://dl.acm.org/profile/88158680557), _University of Illinois at Urbana-Champaign and University of Chicago_
+
## Summary
The paper introduces C-RAG, a framework designed to certify and provide theoretical guarantees for the generation risks associated with retrieval-augmented language models (RAG). The authors address critical issues like hallucinations and reliability in large language models (LLMs), focusing on whether RAG models can effectively minimize generation risks compared to standard LLMs.
+## Issues Targeted
+- **Trustworthiness of Large Language Models (LLMs)**
+ - LLMs exhibit hallucinations and misalignments, leading to unreliable and untrustworthy outputs.
+
+- **Generation Risks in Retrieval-Augmented Language Models (RAG)**
+ - The theoretical understanding of generation risks in RAG models remains unexplored.
+ - The paper investigates whether RAG can effectively reduce generation risks compared to vanilla LLMs.
+
+- **Need for Certifiable Risk Control**
+ - There is a need for provable guarantees on the generation risks of both RAG and vanilla LLMs.
+ - The paper aims to establish sufficient conditions that enable RAG models to reduce generation risks.
+
+## Contribution/Novelty
+- **Introduction of C-RAG Framework**
+ - The paper proposes a novel framework called C-RAG (Certified Generation Risks for Retrieval-Augmented Language Models) to certify generation risks in RAG models.
+
+- **Conformal Risk Analysis**
+ - It provides a conformal risk analysis for RAG models, establishing a method to certify an upper confidence bound on generation risks, referred to as conformal generation risk.
+
+- **Theoretical Guarantees**
+ - The paper offers theoretical guarantees on conformal generation risks for general bounded risk functions, particularly under test distribution shifts, which is a significant advancement in the field.
+
+## Approach
+- **Constrained Generation Protocol**: A constrained generation protocol is proposed for RAG models, which involves:
+ - Configuring parameters such as the number of retrieved examples (N_rag), the size of the generation set (λ_g), and a similarity threshold for generation diversity (λ_s).
+ - This protocol allows for controlled generation outputs based on specific configurations.
+
+- **Conformal Risk Analysis**: The authors employ conformal risk analysis to:
+  - Certify an upper confidence bound on generation risks, termed the conformal generation risk.
+ - Use test statistics from in-distribution calibration samples to control generation risks.
+
+- **Empirical Evaluation**: The approach includes extensive empirical evaluations across four widely-used NLP datasets and various state-of-the-art retrieval models to:
+ - Validate the soundness and tightness of the proposed conformal generation risk guarantees.
+ - Demonstrate that RAG consistently achieves lower conformal generation risks compared to vanilla LLMs.
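+
+The paper's certification rests on conformal risk analysis; as a rough, generic illustration of bounding risk from calibration samples, here is a Hoeffding-style upper confidence bound (an assumed stand-in for exposition, not C-RAG's exact statistic):
+
+```python
+import math
+import numpy as np
+
+def risk_upper_bound(calibration_risks, delta=0.05):
+    """Upper confidence bound on the mean of a [0, 1]-bounded risk,
+    holding with probability at least 1 - delta (Hoeffding's inequality)."""
+    n = len(calibration_risks)
+    empirical = float(np.mean(calibration_risks))
+    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
+    return min(1.0, empirical + slack)
+
+rng = np.random.default_rng(0)
+calib = rng.uniform(0.0, 0.4, size=500)   # toy per-sample generation risks in [0, 1]
+print("certified risk bound:", risk_upper_bound(calib))
+```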
+
+## Dataset/Testing
+**Datasets**
+- **AESLC (Annotated Enron Subject Line Corpus)**: Contains email messages from employees of the Enron Corporation, focusing on generating email subject lines from email bodies.
+- **CommonGen**: A dataset for commonsense reasoning that includes descriptions generated from a set of concepts, constructed through crowdsourcing and existing caption corpora.
+- **DART (Data Record to Text)**: A large-scale dataset designed for generating text from structured data records, consisting of RDF triples and annotated sentence descriptions.
+- **E2E (End-to-End Generation)**: Contains examples in the restaurant domain, where the task is to generate natural language descriptions from meaning representations (MRs) that describe various aspects of a restaurant.
+
+**Testing**
+- **External Knowledge Base Construction**
+ - The external knowledge base used for retrieval augmentation is constructed from a collection of 30 public datasets across 9 distinct categories, totaling over 6 million documents.
+
+- **Empirical Evaluation**: The C-RAG framework is empirically evaluated by:
+ - Conducting experiments on the aforementioned datasets using different retrieval models.
+ - Measuring the performance of the RAG models in terms of conformal generation risks and comparing them against vanilla LLMs.
+
+## Results
+- **Comparison of RAG and Vanilla LLMs**
+ - RAG models achieved significantly lower conformal generation risks compared to vanilla LLMs across all datasets.
+ - The results confirm the theoretical findings that RAG can effectively reduce generation risks when the quality of the retrieval model and transformer is non-trivial.
+- **Impact of Retrieval Model Quality**
+ - Among the evaluated retrieval models, the text-embedding-ada-002 and supervised fine-tuned embedding models outperformed other baselines in achieving low conformal generation risks.
+ - The results indicate that the quality of the retrieval model directly influences the effectiveness of the RAG framework in mitigating generation risks.
+- **Robustness Under Distribution Shifts**
+ - The conformal generation risk guarantees remain sound and tight even under distribution shifts, demonstrating the robustness of the C-RAG framework.
+ - The empirical risks increase linearly with the Hellinger distance, indicating that the framework can effectively handle variations in input distributions.
+
+## Findings
+- **Lower Generation Risks with RAG**: RAG models consistently achieve lower conformal generation risks compared to vanilla LLMs, confirming the theoretical assertion that retrieval can enhance generation reliability.
+- **Multi-dimensional Configuration Benefits**: Adjusting multiple RAG parameters (e.g., number of retrieved examples, generation set size, and diversity thresholds) can further enhance the effectiveness of the model in controlling generation risks.
+
## Limitations
-- **Probability of Guarantee**: The C-RAG framework provides high-confidence risk bounds, but there is still a possibility of generations with excessive risks. More calibration samples may be needed to achieve a higher confidence level and mitigate outlier occurrences.
-- **Trade-off with External Knowledge Base Size**: While a larger external knowledge base can reduce conformal generation risk, it may also increase the time complexity of KNN searching and the space complexity for storing examples, leading to a trade-off between generalization/utility and inference efficiency.
+- **Calibration Data Collection Challenges**: Collecting in-distribution calibration samples can be resource-intensive and may introduce latency, especially in real-time applications.
+- **High Confidence Level Requirements**: The C-RAG framework provides high-confidence risk bounds, but this may lead to the existence of generations with excessive risks if calibration samples are insufficient.
+- **Trade-offs with Large Knowledge Bases**: While a larger external knowledge base can improve retrieval quality, it may also increase the time complexity of KNN searching and the space complexity of storing examples, leading to potential efficiency trade-offs.
+
+## Scope
+- **Application in Safety-Critical Domains**: The findings suggest that the C-RAG framework can be particularly beneficial in safety-critical applications where trustworthiness and reliability of language model outputs are paramount.
+- **Future Work on Time-Series Data**: The paper highlights the potential for future research to extend conformal risk analysis to time-series data, which remains an unexplored area but is crucial for practical deployments.
# Atlas: Few-shot Learning with Retrieval Augmented Language Models
**Domain**: RAG
@@ -69,14 +353,133 @@ The paper introduces C-RAG, a framework designed to certify and provide theoreti
**DOI**: [https://doi.org/10.48550/arXiv.2208.03299](https://doi.org/10.48550/arXiv.2208.03299)
+**Published**: The Journal of Machine Learning Research, Volume 24, Issue 1, (2023)
+
+**Authors**:
+- [Gautier Izacard](https://www.webofscience.com/wos/author/record/9388385), [Patrick Lewis](https://www.webofscience.com/wos/author/record/42545040), [Maria Lomeli](https://www.webofscience.com/wos/author/record/11123547), [Lucas Hosseini](https://www.webofscience.com/wos/author/record/22646054), [Fabio Petroni](https://www.webofscience.com/wos/author/record/26540253), [Timo Schick](https://www.webofscience.com/wos/author/record/28712492), [Jane Dwivedi-Yu](https://www.webofscience.com/wos/author/record/39002585), [Armand Joulin](https://www.webofscience.com/wos/author/record/9336621), [Sebastian Riedel](https://www.webofscience.com/wos/author/record/16130105), [Edouard Grave](https://www.webofscience.com/wos/author/record/21956629), _Meta AI_
+
## Summary
The paper presents Atlas, a retrieval-augmented language model designed to excel in few-shot learning tasks, particularly those requiring extensive knowledge, such as question answering and fact-checking. Unlike traditional large language models that rely heavily on vast parameter counts to store knowledge, Atlas utilizes a dual-encoder architecture for document retrieval, allowing it to achieve impressive performance with significantly fewer parameters (11B) compared to models like PaLM (540B). The authors demonstrate that Atlas can achieve over 42% accuracy on the Natural Questions dataset using only 64 training examples, outperforming larger models by 3% while being 50 times smaller. The study emphasizes the importance of joint pre-training of the retriever and language model components, exploring various training objectives and pretext tasks to enhance few-shot performance. Through extensive experiments across multiple benchmarks, including MMLU, KILT, and TriviaQA, Atlas establishes new state-of-the-art results in several tasks, showcasing its adaptability, interpretability, and efficiency. The findings suggest that retrieval-augmented models like Atlas can effectively decouple memorization from generalization, making them a promising approach for knowledge-intensive natural language processing tasks.
+## Issues Targeted
+- **Few-shot Learning Limitations**:
+ - It is unclear whether effective few-shot learning necessitates vast knowledge stored in model parameters, leading to questions about the relationship between memorization and generalization.
+- **Retrieval-Augmented Models**:
+ - While retrieval-augmented models excel in knowledge-intensive tasks, their few-shot learning capabilities have not been adequately demonstrated.
+- **Parameter Efficiency**:
+ - The challenge of achieving strong performance in few-shot settings with models that have significantly fewer parameters compared to state-of-the-art models.
+
+## Contribution/Novelty
+- **Introduction of Atlas**:
+ - The paper presents Atlas, a retrieval-augmented language model specifically designed for few-shot learning, demonstrating strong performance with significantly fewer parameters compared to existing models.
+- **Joint Pre-training Approach**:
+ - A thorough study on the design and training of retrieval-augmented language models, emphasizing the importance of jointly pre-training the retriever and language model for improved few-shot performance.
+- **Effective Training Techniques**:
+ - Exploration of various pre-training tasks and training objectives that enhance the few-shot capabilities of the model, including the use of Likelihood Distillation for training the retriever.
+
+## Results
+- Atlas achieves state-of-the-art results on several benchmarks, including:
+ - **Natural Questions**: +2.8% improvement in few-shot settings.
+ - **TriviaQA**: +3.3% improvement.
+ - **FEVER**: +5.1% improvement.
+ - Competitive performance on MMLU, matching models with 15 times more parameters.
+
+## Approach
+- **Architecture**: The model consists of two main components:
+ - **Retriever**: Utilizes a dual-encoder architecture based on the Contriever model to retrieve relevant documents from a large corpus based on the input query.
+ - **Language Model**: Employs a T5 sequence-to-sequence architecture, specifically using the Fusion-in-Decoder method to process the retrieved documents along with the query.
+
+- **Text-to-Text Framework**:
+ - All tasks are framed in a text-to-text format, where the input is a text query and the output is a generated text response. This allows for a unified approach to various NLP tasks.
+
+- **Joint Pre-training**:
+ - The retriever and language model are jointly pre-trained using unsupervised data, which helps the model learn to effectively utilize retrieved documents during downstream tasks.
+
+- **Training Objectives**: Several loss functions are explored to train the retriever in conjunction with the language model, including:
+ - **Attention Distillation (ADist)**: Uses attention scores from the language model to guide the retriever.
+ - **End-to-End Training of Multi-Document Reader and Retriever (EMDR²)**: Treats retrieved documents as latent variables to optimize the retriever.
+ - **Likelihood Distillation (LDist)**: Trains the retriever to predict the relevance of documents based on their contribution to the language model's output.
+ - **Leave-One-Out Likelihood Distillation (LOOL)**: Evaluates the impact of removing each document from the retrieved set on the language model's predictions.
+
+- **Pretext Tasks**: The model is pre-trained using various tasks, such as:
+ - **Masked Language Modeling**: To enhance the model's understanding of language structure.
+ - **Prefix Language Modeling**: To improve retrieval capabilities.
+ - **Title-to-Section Generation**: To learn relationships between document titles and their content.
+
+- **Fine-tuning Strategies**: The model employs different fine-tuning strategies based on the amount of training data available, including:
+ - **Query-Side Fine-tuning**: Fixes the document encoder and only trains the query encoder for few-shot settings.
+ - **Standard Fine-tuning**: Fully updates both the retriever and language model for larger datasets.
+
+- **Evaluation**:
+ - The approach is evaluated across multiple benchmarks, including Natural Questions, TriviaQA, FEVER, and KILT, demonstrating its effectiveness in both few-shot and full data settings.
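+
+A schematic of the Likelihood Distillation objective listed above: the retriever's distribution over retrieved documents is pulled (via a KL term) towards a target distribution derived from how much each document helps the language model score the output. The numbers are synthetic; this is not the paper's implementation:
+
+```python
+import numpy as np
+
+def softmax(x):
+    x = x - x.max()
+    return np.exp(x) / np.exp(x).sum()
+
+# Synthetic scores for K = 4 retrieved documents
+retriever_scores = np.array([2.0, 1.0, 0.5, 0.1])        # query-document similarities
+lm_log_likelihood = np.array([-3.2, -1.1, -4.0, -2.5])   # log p_LM(output | query, doc_k)
+
+p_retriever = softmax(retriever_scores)   # retriever's distribution over documents
+p_target = softmax(lm_log_likelihood)     # "teacher" distribution induced by the LM
+
+# KL(p_target || p_retriever): the distillation loss the retriever would minimize
+kl = float(np.sum(p_target * (np.log(p_target) - np.log(p_retriever))))
+print("likelihood-distillation loss:", kl)
+```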
+
+## Dataset/Testing
+**Dataset**
+
+- **KILT Evaluation Suite**: A collection of 11 datasets corresponding to various tasks, including:
+ - **Natural Questions**: For question answering.
+ - **TriviaQA**: Another question answering dataset.
+ - **HotpotQA**: A multi-hop question answering dataset.
+ - **FEVER**: A fact-checking dataset.
+ - **Zero Shot RE**: For slot filling.
+ - **T-REx**: For entity linking.
+ - **Wizard of Wikipedia**: For dialogue generation.
+
+- **Massively-Multitask Language Understanding (MMLU)**:
+ - Contains 57 multi-choice question answering datasets sourced from real examinations designed for humans, covering a wide range of topics.
+
+- **Common Crawl**:
+ - Additional documents from the 2020-10 Common Crawl dump were used to enhance the retrieval index.
+
+- **Wikipedia Dumps**:
+ - The model utilized various versions of Wikipedia dumps, including the December 2021 dump, for pre-training and as a retrieval index.
+
+- **TempLAMA**:
+ - A dataset constructed from time-sensitive cloze questions to assess the model's temporal sensitivity and updateability.
+
+**Testing**
+
+- **Few-Shot Learning**: The model was fine-tuned on a limited number of examples (e.g., 64 examples) from the datasets to evaluate its performance in few-shot settings.
+- **Full Data Set Fine-tuning**: The model was also tested in a full data set setting, where it was fine-tuned on the entire training set available for each task.
+- **Evaluation Metrics**: The performance was measured using standard metrics such as exact match (EM) for question answering tasks and accuracy for classification tasks.
+
+## Results
+- **Performance on Natural Questions**:
+ - **64-shot Setting**: Atlas achieved an accuracy of 42.4%, outperforming the closed-book T5 model and a 540B parameter model (PaLM) by nearly 3 percentage points.
+ - **Full Data Set Setting**: Atlas reached an accuracy of 60.4%, establishing a new state-of-the-art result.
+
+- **Performance on TriviaQA**:
+ - **64-shot Setting**: Atlas achieved an accuracy of 74.5%.
+ - **Full Data Set Setting**: The model reached an accuracy of 79.8%, demonstrating significant improvements over previous models.
+
+- **Performance on FEVER**:
+ - **64-shot Setting**: Atlas achieved an accuracy of 64.3%.
+ - **Full Data Set Setting**: The model reached an accuracy of 78%, which is within 1.5% of the state-of-the-art model (ProoFVer).
+
+- **Performance on KILT Tasks**:
+ - Atlas demonstrated strong performance across various KILT tasks, achieving competitive results in both few-shot and full fine-tuning settings.
+
+- **Massively-Multitask Language Understanding (MMLU)**:
+ - Atlas outperformed the closed-book T5 models across all sizes (770M, 3B, and 11B parameters) in both few-shot and full data settings.
+ - In the 5-shot setting, Atlas achieved:
+ - **770M**: 38.9%
+ - **3B**: 42.3%
+ - **11B**: 43.4%
+ - In the full/transfer setting, Atlas achieved:
+ - **770M**: 56.3%
+ - **3B**: 59.9%
+ - **11B**: 65.8%
+
+## Findings
+- **Importance of Joint Pre-training**: The study found that jointly pre-training the retriever and language model is crucial for enhancing few-shot performance, leading to better integration of retrieved information.
+- **Efficiency in Memory Usage**: The model can maintain performance while using compressed indices, demonstrating the potential for efficient memory management in retrieval-augmented systems.
+
## Limitations
-- **Complexity of Fine-tuning**
- - The fine-tuning process may require careful tuning of hyperparameters, which can be resource-intensive and may not be straightforward for all users.
- - The need for joint training of the retriever and language model adds complexity to the training process.
-- **Scalability Issues**: As the size of the document index increases, the computational resources required for retrieval and processing may become a bottleneck, limiting scalability in real-world applications.
+- **Few-Shot Learning Constraints**: While Atlas performs well in few-shot settings, the model may still struggle with tasks that require extensive domain-specific knowledge or highly specialized information not present in the training data.
+- **Computational Overhead**: The need to refresh the document index during training can introduce computational overhead, particularly in large-scale settings, although strategies to mitigate this were explored.
+
+## Scope
+- **Exploration of Temporal Knowledge**: The ability to update the model's knowledge base in real-time opens up opportunities for research into temporal knowledge representation and reasoning, which could be crucial for applications requiring up-to-date information.
# REST: Retrieval-Based Speculative Decoding
**Domain**: RAG
@@ -85,16 +488,84 @@ The paper presents Atlas, a retrieval-augmented language model designed to excel
**DOI**: [https://doi.org/10.18653/v1/2024.naacl-long.88](https://doi.org/10.18653/v1/2024.naacl-long.88)
+**Published**: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (2024)
+
+**Authors**: Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He
+
## Summary
The paper introduces Retrieval-Based Speculative Decoding (REST), a novel algorithm aimed at enhancing the efficiency of language model generation. Unlike traditional speculative decoding methods that rely on a smaller draft language model, REST utilizes a retrieval mechanism to generate draft tokens from a datastore of existing knowledge. This approach allows for significant speed improvements in text and code generation, achieving speedups of 1.62x to 2.36x on 7B and 13B language models without requiring additional training. The method constructs a Trie from retrieved candidates and employs a tree attention mechanism for verification, ensuring that the generated sequences maintain high quality while minimizing computational overhead.
+## Issues Targeted
+- **Inefficiency in Language Model Inference**
+ - High inference costs associated with autoregressive token generation.
+ - Frequent reloading of large language models (LLMs) from High-Bandwidth Memory (HBM) to on-chip cache, leading to time-consuming processes.
+- **Limitations of Existing Approaches**
+ - Previous speculative decoding methods rely on smaller language models, which may not be efficient or effective.
+ - The need for additional training steps and GPU memory for these smaller models complicates the implementation.
+
+## Contribution/Novelty
+- **Introduction of REST (Retrieval-Based Speculative Decoding)**: Proposes a novel algorithm that utilizes retrieval mechanisms to generate draft tokens, replacing the need for a smaller language model in speculative decoding.
+- **Tree Attention Mechanism**: Introduces a carefully designed attention mask (tree attention) to optimize the verification process of draft tokens, ensuring that shared prefixes are computed only once, thus improving computational efficiency.
+
+## Approach
+- **Datastore Construction**
+  - A datastore $D = \{(c_i, t_i)\}$ is constructed, where $c_i$ represents a context and $t_i$ represents the corresponding continuation of that context. This datastore is built from either pretraining data or instruction-tuning data.
+
+- **Token Retrieval Process**
+  - During inference, the current context $s = (x_1, \ldots, x_t)$ is used as a query to retrieve context-continuation pairs from the datastore.
+  - An exact-match method is employed to find contexts in $D$ that match the longest suffix of $s$, ensuring efficient retrieval with minimal overhead.
+
+- **Draft Token Construction**
+  - The retrieved results $S$ include possible continuations of the context $s$. A Trie data structure is constructed from these candidates to select high-frequency prefixes as draft tokens.
+ - The Trie allows for efficient prioritization of tokens based on their frequency, ensuring that the most relevant draft tokens are selected for verification.
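+
+An illustrative sketch of the retrieval and Trie steps described above: match the longest suffix of the current context against a toy datastore, then keep the most frequent branch of the candidate Trie as draft tokens (the toy data and greedy single-branch selection are simplifying assumptions, not the paper's implementation):
+
+```python
+from collections import defaultdict
+
+datastore = {                      # context suffix -> previously observed continuations
+    ("def", "add"): [["(", "a", ","], ["(", "a", ")"], ["(", "x", ","]],
+    ("add",):       [["(", "a", ","]],
+}
+
+def retrieve(context):
+    """Exact-match the longest suffix of the context that appears in the datastore."""
+    for start in range(len(context)):
+        suffix = tuple(context[start:])
+        if suffix in datastore:
+            return datastore[suffix]
+    return []
+
+def draft_tokens(candidates, max_len=3):
+    """Walk a frequency-weighted Trie of candidates, greedily keeping the most common branch."""
+    draft, depth = [], 0
+    while len(draft) < max_len:
+        counts = defaultdict(int)
+        for cand in candidates:
+            if len(cand) > depth and cand[:depth] == draft:
+                counts[cand[depth]] += 1
+        if not counts:
+            break
+        draft.append(max(counts, key=counts.get))
+        depth += 1
+    return draft
+
+context = ["x", "=", "def", "add"]
+print(draft_tokens(retrieve(context)))   # e.g. ['(', 'a', ',']
+```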
+
+## Dataset/Testing
+**Dataset**
+
+- **HumanEval**
+ - A dataset consisting of 164 human-written Python programming problems.
+ - The goal is to generate code solutions using provided docstrings as prompts.
+- **MT-Bench**
+ - A dataset containing 80 multi-turn questions designed to emulate real-world multi-turn dialogues.
+ - This dataset is used to evaluate the performance of the language models in conversational contexts.
+
+**Testing**
+- **Models Tested**
+ - The experiments were conducted using two language models: CodeLlama and Vicuna, specifically testing both 7B and 13B configurations.
+
+- **Experimental Setup**
+ - The performance of REST was compared against standard autoregressive decoding and speculative decoding methods.
+ - Different sampling strategies were employed, including greedy sampling and nucleus sampling, to assess the generation speed and quality of outputs.
+
+- **Metrics for Evaluation**
+ - **Mean Token Time**: The average generation time for one token, used to measure the speed of the models.
+ - **Mean Generated Length**: The ratio of the length of generated tokens to the number of forward steps taken by the original LLM, indicating the efficiency of the generation process.
+
+## Results
+- **Speed Improvements**: REST demonstrated significant speed enhancements compared to standard autoregressive decoding and speculative decoding methods:
+ - For CodeLlama on the HumanEval benchmark:
+ - REST achieved a speedup of 2.12× to 2.36× with greedy sampling.
+ - Mean Token Time for CodeLlama 7B with REST was 11.82 ms/token.
+ - Mean Token Time for CodeLlama 13B with REST was 19.53 ms/token.
+ - For Vicuna on the MT-Bench benchmark:
+ - REST achieved a speedup of 1.62× to 1.77× with greedy sampling.
+ - Mean Token Time for Vicuna 7B with REST was 15.12 ms/token.
+ - Mean Token Time for Vicuna 13B with REST was 25.08 ms/token.
+
+- **Comparison with Baselines**: The results showed that REST outperformed both standard autoregressive and speculative decoding methods in terms of speed:
+ - For example, the baseline autoregressive method for CodeLlama 7B had a Mean Token Time of 27.89 ms/token, while REST reduced it to 11.82 ms/token.
+ - Similarly, for Vicuna 7B, the baseline autoregressive method had a Mean Token Time of 25.48 ms/token, which REST improved to 15.12 ms/token.
+
+## Findings
+- **Efficiency of Retrieval Mechanism**: The use of a retrieval datastore allowed for efficient draft token generation without the need for a smaller language model, simplifying the integration process.
+- **Impact of Datastore Size**: Larger datastores contributed to improved retrieval accuracy and generation speed, indicating that the size and quality of the datastore are crucial for the effectiveness of REST.
+
## Limitations
-- **Dependence on Datastore Quality**
- - The performance of REST is directly influenced by the accuracy and completeness of the datastore.
- - A higher quality datastore may be required for better alignment with the LLM, potentially necessitating the use of content generated by the LLM itself.
+- **Lack of In-Context Abilities**: REST may struggle with tasks that require understanding of personalized or context-specific variables, particularly in code generation scenarios.
+- **Randomness in Sampling**: The performance of REST with nucleus sampling was not as strong as with greedy sampling, suggesting that the inherent randomness can affect the quality of generated outputs.
-- **Lack of In-Context Abilities**: REST may struggle with tasks that require understanding of context, such as retrieving personalized variable names in code generation.
-This limitation raises questions about how retrieval methodologies can effectively handle complex contextual requirements.
+## Scope
+- **Exploration of Datastore Construction**: Future work could focus on constructing datastores from content generated by the LLM itself to improve alignment and relevance.
# Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
**Domain**: RAG
@@ -103,15 +574,107 @@ This limitation raises questions about how retrieval methodologies can effective
**DOI**: [http://dx.doi.org/10.48550/arXiv.2005.11401](http://dx.doi.org/10.48550/arXiv.2005.11401)
+**Published**: NIPS'20: Proceedings of the 34th International Conference on Neural Information Processing Systems (2020)
+
+**Authors**:
+- [Patrick Lewis](https://www.webofscience.com/wos/author/record/42545040), [Tim Rocktäschel](https://www.webofscience.com/wos/author/record/27109647), [Sebastian Riedel](https://www.webofscience.com/wos/author/record/16130105), _University College London_
+- [Ethan Perez](https://www.webofscience.com/wos/author/record/26678214), _New York University_
+- [Aleksandra Piktus](https://www.webofscience.com/wos/author/record/12993763), [Fabio Petroni](https://www.webofscience.com/wos/author/record/26540253), [Vladimir Karpukhin](https://www.webofscience.com/wos/author/record/10012394), [Naman Goyal](https://www.webofscience.com/wos/author/record/42324794), [Heinrich Küttler](https://www.webofscience.com/wos/author/record/39265683), [Mike Lewis](https://www.webofscience.com/wos/author/record/25130691), [Wen-tau Yih](https://www.webofscience.com/wos/author/record/18031680), [Douwe Kiela](https://www.webofscience.com/wos/author/record/24159714), _Facebook AI Research_
+
## Summary
The paper presents Retrieval-Augmented Generation (RAG), a novel approach that combines pre-trained parametric memory (a sequence-to-sequence model) with non-parametric memory (a dense vector index of Wikipedia) to enhance knowledge-intensive natural language processing (NLP) tasks. RAG models utilize a retriever to access relevant documents based on input queries and a generator to produce outputs conditioned on both the input and the retrieved documents. The authors explore two formulations of RAG: RAG-Sequence, which uses the same retrieved document for the entire output sequence, and RAG-Token, which allows different documents for each token generated.
-## Limitations
-- **Performance on Specific Tasks**: Although RAG models set state-of-the-art results on certain open-domain QA tasks, their performance may not generalize across all knowledge-intensive tasks.
+## Issues Targeted
+- **Limited Knowledge Access and Manipulation**
+ - Pre-trained language models struggle to access and manipulate factual knowledge effectively.
+ - Performance on knowledge-intensive tasks is inferior compared to task-specific architectures.
+
+- **Provenance and Knowledge Updating**
+ - Difficulty in providing provenance for decisions made by language models.
+ - Challenges in updating the model's world knowledge as new information becomes available.
+
+- **Hallucination in Generated Text**
+ - Language models may produce "hallucinations," or factually incorrect outputs, due to reliance on implicit knowledge.
+
+## Contribution/Novelty
+- **General-Purpose Fine-Tuning Recipe**
+ - A general-purpose fine-tuning approach is introduced, allowing RAG models to be fine-tuned on a wide range of knowledge-intensive NLP tasks, enhancing their versatility.
+
+- **Two RAG Formulations**: The paper explores two formulations of RAG:
+ - **RAG-Sequence**: Uses the same retrieved document for generating the entire output sequence.
+ - **RAG-Token**: Allows different documents to be used for generating each token, providing greater flexibility and specificity in generation.
+
+## Approach
+- **Retrieval Mechanism**
+ - The retriever component (DPR) is responsible for retrieving relevant documents based on the input query. It uses a bi-encoder architecture to compute dense representations for both documents and queries.
+ - The top-K documents are retrieved using Maximum Inner Product Search (MIPS) based on the query representation.
+
+- **Generation Mechanism**
+ - The generator (BART) produces output sequences conditioned on both the input and the retrieved documents.
+ - Two models are proposed for generating text:
+ - **RAG-Sequence**: Uses the same retrieved document for generating the entire output sequence.
+ - **RAG-Token**: Allows different documents to be used for generating each token, enabling more nuanced and contextually relevant responses.
+
+- **Decoding Strategies**: Different decoding strategies are employed for RAG-Sequence and RAG-Token:
+  - **RAG-Token**: Standard autoregressive beam search, using the per-token marginal over the retrieved documents as the transition probability.
+ - **RAG-Sequence**: Requires a more complex decoding process, involving scoring hypotheses across multiple documents.
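+
+As a toy illustration of the two formulations just listed, the snippet below (a minimal sketch, not the authors' implementation) assumes the retriever has already produced probabilities p(z | x) for the top-K documents and that the generator's per-token probabilities for a fixed target sequence are given as arrays; it only contrasts where the marginalization over documents happens.
+
+```python
+import numpy as np
+
+# Toy setup: K = 2 retrieved documents, a target output of T = 3 tokens.
+# p_doc[k]    ~ retriever probability p(z_k | x) of the k-th retrieved document
+# p_tok[k, t] ~ generator probability of the t-th target token given x, z_k and y_<t
+p_doc = np.array([0.7, 0.3])
+p_tok = np.array([
+    [0.9, 0.8, 0.6],  # token probabilities when conditioning on document z_1
+    [0.5, 0.4, 0.7],  # token probabilities when conditioning on document z_2
+])
+
+# RAG-Sequence: marginalize over documents once, at the sequence level:
+#   p(y | x) = sum_k p(z_k | x) * prod_t p(y_t | x, z_k, y_<t)
+p_rag_sequence = np.sum(p_doc * np.prod(p_tok, axis=1))
+
+# RAG-Token: marginalize over documents at every token:
+#   p(y | x) = prod_t sum_k p(z_k | x) * p(y_t | x, z_k, y_<t)
+p_rag_token = np.prod(p_doc @ p_tok)
+
+print(f"RAG-Sequence p(y|x) = {p_rag_sequence:.4f}")
+print(f"RAG-Token    p(y|x) = {p_rag_token:.4f}")
+```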
+
+## Dataset/Testing
+**Dataset**
+
+- **Wikipedia Dump**
+ - A single Wikipedia dump (December 2018) is used as the non-parametric knowledge source. Each article is split into disjoint 100-word chunks, resulting in a total of 21 million documents (see the chunking sketch after this list).
-- **Scalability Concerns**: The approach may face scalability issues when dealing with larger datasets or more complex tasks, particularly in terms of retrieval efficiency.
+- **Open-Domain Question Answering Datasets**: The RAG models are evaluated on several popular open-domain question answering datasets:
+ - Natural Questions (NQ)
+ - TriviaQA (TQA)
+ - WebQuestions (WQ)
+ - CuratedTrec (CT)
-- **Potential for Misuse**: The ability to generate factual content raises concerns about the potential misuse of the technology for generating misleading or harmful information.
+- **Abstractive Question Answering Dataset**
+ - **MS-MARCO**: This dataset is used for evaluating the natural language generation capabilities of the RAG models in an open-domain abstractive QA setting.
+
+- **Jeopardy Question Generation Dataset**
+ - **SearchQA**: This dataset is utilized for the task of generating Jeopardy questions, which involves generating questions based on provided answer entities.
+
+- **Fact Verification Dataset**
+ - **FEVER**: This dataset is used to assess the models' ability to classify claims as supported, refuted, or unverifiable based on retrieved evidence from Wikipedia.
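+
+As mentioned for the Wikipedia dump above, articles are split into disjoint 100-word chunks; the sketch below is a simplified stand-in for that preprocessing (plain whitespace splitting, no cleaning).
+
+```python
+def chunk_words(text: str, size: int = 100) -> list[str]:
+    """Split an article into disjoint chunks of at most `size` words."""
+    words = text.split()
+    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
+
+article = "word " * 230                    # toy article of 230 words
+chunks = chunk_words(article)
+print([len(c.split()) for c in chunks])    # [100, 100, 30]
+```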
+
+**Testing**
+
+- **Fine-Tuning and Evaluation**
+ - The RAG models are fine-tuned on the respective datasets using input-output pairs (e.g., questions and answers for QA tasks).
+ - The models are evaluated based on metrics such as Exact Match (EM) scores for QA tasks, BLEU and ROUGE scores for generation tasks, and label accuracy for classification tasks.
+
+- **Retrieval and Generation**
+ - During testing, the models retrieve the top K documents for each query and generate responses based on the retrieved documents, allowing for the assessment of their performance in real-world scenarios.
+
+## Result
+- **Open-Domain Question Answering**: RAG models achieved state-of-the-art results on several open-domain QA tasks:
+ - **Natural Questions (NQ)**: RAG-Token scored 44.1 and RAG-Sequence scored 44.5 in Exact Match (EM).
+ - **TriviaQA (TQA)**: RAG-Sequence achieved 56.8/68.0 on the standard and Wiki test sets, respectively.
+ - **WebQuestions (WQ)**: RAG-Token scored 45.5 and RAG-Sequence scored 45.2.
+ - **CuratedTrec (CT)**: RAG-Sequence scored 52.2.
+
+- **Abstractive Question Answering**
+ - On the MS-MARCO NLG task, RAG-Sequence outperformed BART by 2.6 BLEU points and 2.6 ROUGE-L points, demonstrating improved natural language generation capabilities.
+
+- **Jeopardy Question Generation**
+ - RAG-Token outperformed RAG-Sequence and BART on the Q-BLEU-1 metric, indicating better performance in generating Jeopardy questions.
+ - Human evaluations showed that RAG was more factual in 42.7% of cases compared to BART, and RAG was found to be more specific by a significant margin.
+
+- **Fact Verification (FEVER)**
+ - RAG models achieved results within 4.3% of state-of-the-art models, which are complex pipeline systems with substantial engineering and retrieval supervision.
+ - RAG demonstrated strong performance without requiring supervision on retrieved evidence.
+
+## Findings
+- **Effective Retrieval Mechanism**: The integration of a differentiable retrieval mechanism significantly enhances the model's ability to access relevant knowledge, leading to better performance on knowledge-intensive tasks.
+
+## Limitations
+- **Complexity of Decoding**: The decoding process, especially for RAG-Sequence, can be computationally intensive, requiring multiple forward passes for hypothesis scoring, which may limit efficiency in real-time applications.
+
+## Scope
+- **Broader Applications**: The RAG framework can be extended to various NLP tasks beyond those evaluated in the paper, such as dialogue systems, summarization, and other forms of question answering.
# REALM: retrieval-augmented language model pre-training
**Domain**: Foundation model + RAG
@@ -120,18 +683,84 @@ The paper presents Retrieval-Augmented Generation (RAG), a novel approach that c
**DOI**: [https://doi.org/10.48550/arXiv.2002.08909](https://doi.org/10.48550/arXiv.2002.08909)
+**Published**: ICML'20: Proceedings of the 37th International Conference on Machine Learning (2020)
+
+**Authors**: [Kelvin Guu](https://www.webofscience.com/wos/author/record/21663196), [Kenton Lee](https://www.webofscience.com/wos/author/record/10555389), [Zora Tung](https://www.webofscience.com/wos/author/record/30511512), [Panupong Pasupat](https://www.webofscience.com/wos/author/record/13084225), [Ming-Wei Chang](https://www.webofscience.com/wos/author/record/53082543), _Google Research_
+
## Summary
The paper presents REALM (Retrieval-Augmented Language Model), a novel framework that enhances language model pre-training by integrating a learned knowledge retriever. Unlike traditional language models that store knowledge implicitly within their parameters, REALM allows the model to explicitly retrieve and utilize information from a large corpus, such as Wikipedia, during both pre-training and inference. This approach not only improves the model's ability to access relevant knowledge but also enhances interpretability and modularity. The authors demonstrate the effectiveness of REALM by fine-tuning it on the challenging task of Open-domain Question Answering (Open-QA), achieving state-of-the-art results across multiple benchmarks and outperforming existing models by a significant margin.
The paper details the architecture of REALM, which consists of a neural knowledge retriever and a knowledge-augmented encoder, and describes the training process that involves backpropagating through the retrieval step. The authors also address computational challenges associated with large-scale document retrieval and propose strategies to optimize performance. Through extensive experiments, REALM shows substantial improvements in accuracy and retrieval effectiveness, highlighting its potential for advancing natural language processing tasks that require extensive world knowledge.
+## Issues Targeted
+- **Implicit Knowledge Storage**: Traditional language models store knowledge implicitly in their parameters, making it difficult to interpret and understand what knowledge is captured.
+- **Scalability of Knowledge**: As the need for more world knowledge increases, larger neural networks are required, which can be slow and expensive to train.
+- **Modularity and Interpretability**: There is a need for a more modular and interpretable way to capture knowledge, rather than relying solely on the parameters of a neural network.
+- **Retrieval Efficiency**: The challenge of efficiently retrieving relevant documents from a large corpus (e.g., millions of documents) during the pre-training and inference stages.
+
+## Contribution/Novelty
+- **Introduction of REALM Framework**: The paper presents the Retrieval-Augmented Language Model (REALM), which integrates a neural knowledge retriever into the language model pre-training process, allowing for explicit retrieval of knowledge from a large corpus.
+- **Unsupervised Training of Knowledge Retriever**: For the first time, the paper demonstrates how to pre-train a knowledge retriever in an unsupervised manner using masked language modeling as the learning signal, enabling the model to learn effective retrieval strategies without labeled data.
+- **End-to-End Learning**: REALM allows for end-to-end learning by backpropagating through the retrieval step, which considers millions of documents, thus optimizing both the retriever and the language model jointly.
+
+## Approach
+- **Retrieve-Then-Predict Framework**: REALM employs a two-step process:
+ - **Retrieve**: The model first retrieves relevant documents from a large knowledge corpus (e.g., Wikipedia) based on the input query.
+ - **Predict**: It then uses the retrieved documents to inform its predictions for the output.
+
+- **Generative Process**:
+  - The approach is formalized as a generative process where the model learns a distribution p(y | x) over possible outputs y given an input x. This is achieved by:
+    - Retrieving documents z from the knowledge corpus Z based on the input x.
+    - Conditioning the prediction on both the retrieved documents z and the input x to generate the output y.
+
+- **Masked Language Modeling (MLM)**:
+ - During pre-training, the model uses a masked language modeling objective, where it predicts missing tokens in a sentence. This helps the model learn to encode both syntactic and semantic information as well as world knowledge.
+
+- **Neural Knowledge Retriever**:
+ - The knowledge retriever is defined using a dense inner product model that computes relevance scores between the input and documents. It uses embedding functions to map both the input and documents into a shared vector space.
+
+- **Knowledge-Augmented Encoder**:
+ - After retrieving documents, the model combines the input and the retrieved document into a single sequence, which is then processed by a Transformer architecture to perform cross-attention before predicting the output.
+- **End-to-End Training**:
+ - The entire model, including both the knowledge retriever and the knowledge-augmented encoder, is trained end-to-end by maximizing the log-likelihood of the correct output. This allows gradients to flow through both components, optimizing their performance jointly.
+- **Maximum Inner Product Search (MIPS)**:
+ - To efficiently retrieve the top documents, the model employs MIPS algorithms, which allow for sub-linear time complexity in searching through the document embeddings.
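+
+The retrieve-then-predict marginalization above can be summarized in a short numerical sketch; this is an illustration only, with random toy vectors standing in for the learned BERT-style embedding functions and hypothetical values for the encoder probabilities p(y | x, z).
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Toy corpus of 1,000 documents with 8-dimensional embeddings, plus one query.
+doc_embed = rng.normal(size=(1000, 8))
+query_embed = rng.normal(size=8)
+
+# Relevance is a dense inner product f(x, z); p(z | x) is a softmax over it,
+# restricted to the top-k documents (the step MIPS approximates at scale).
+scores = doc_embed @ query_embed
+k = 5
+topk = np.argsort(-scores)[:k]
+p_z_given_x = np.exp(scores[topk] - scores[topk].max())
+p_z_given_x /= p_z_given_x.sum()
+
+# Hypothetical probabilities p(y | x, z) that the knowledge-augmented encoder
+# would assign to the correct answer y under each retrieved document.
+p_y_given_xz = rng.uniform(0.1, 0.9, size=k)
+
+# Marginalize over the retrieved documents: p(y | x) = sum_z p(y | x, z) p(z | x)
+p_y_given_x = float(p_y_given_xz @ p_z_given_x)
+print(f"top-{k} doc ids: {topk.tolist()}, marginal p(y|x) = {p_y_given_x:.3f}")
+```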
+
+## Dataset/Testing
+- **Natural Questions (NQ)**: A dataset consisting of naturally occurring Google queries and their corresponding answers. The authors focus on questions categorized as "short answer type" with a maximum of five tokens.
+- **WebQuestions**: This dataset is collected from the Google Suggest API, expanding a seed question into related questions. It is designed to evaluate the model's ability to answer open-domain questions.
+- **CuratedTrec**: A collection of question-answer pairs drawn from real user queries issued on search engines like MSNSearch and AskJeeves. The answers are defined as regular expressions to account for multiple correct answers or variations.
+
+## Result
+- **Performance on Open-QA Benchmarks**: REALM achieved state-of-the-art results on three popular Open-domain Question Answering (Open-QA) benchmarks:
+ - **Natural Questions (NQ)**:
+ - REALM (X = Wikipedia, Z = Wikipedia): 39.2% accuracy
+ - REALM (X = CC-News, Z = Wikipedia): 40.4% accuracy
+ - **WebQuestions (WQ)**:
+ - REALM (X = Wikipedia, Z = Wikipedia): 40.2% accuracy
+ - REALM (X = CC-News, Z = Wikipedia): 40.7% accuracy
+ - **CuratedTrec (CT)**:
+ - REALM (X = Wikipedia, Z = Wikipedia): 46.8% accuracy
+ - REALM (X = CC-News, Z = Wikipedia): 42.9% accuracy
+
+- **Comparison with Other Models**:
+ - REALM outperformed all previous systems by a significant margin (4-16% absolute accuracy) across all three benchmarks.
+ - For example, the largest T5 model (T5-11B) achieved 34.5% on NQ, while REALM surpassed it with 39.2% accuracy.
+
+## Findings
+- **Enhanced Performance**: REALM significantly improves the accuracy of Open-domain Question Answering systems, outperforming state-of-the-art models by 4-16% on various benchmarks.
+- **Importance of Inductive Biases**: Strategies such as salient span masking and the use of a null document were found to be crucial in guiding the model towards meaningful retrievals, improving overall performance.
+
## Limitations
-- **Dependence on Quality of Knowledge Corpus**:
- - The effectiveness of REALM is heavily reliant on the quality and comprehensiveness of the knowledge corpus (e.g., Wikipedia).
- - If the corpus lacks relevant information, the model's performance may degrade.
-- **Limited Generalization to Other Domains**:
- - The experiments primarily focus on Open-domain Question Answering (Open-QA), which may not generalize well to other NLP tasks or domains.
- - The model's performance in specialized domains or with domain-specific knowledge is not thoroughly evaluated.
+- **Computational Complexity**: The need to consider millions of documents during retrieval poses significant computational challenges, which may limit the scalability of the approach in resource-constrained environments.
+- **Staleness of MIPS Index**: The asynchronous refresh of the MIPS index can lead to a stale index, which may negatively impact the retrieval effectiveness if not managed properly.
+
+## Scope
+- The paper suggests several avenues for future work, including:
+ - Extending REALM to structured knowledge bases to improve reasoning capabilities.
+ - Exploring multilingual settings where knowledge retrieval in high-resource languages can aid low-resource languages.
+ - Investigating multi-modal approaches that incorporate images or videos as additional sources of knowledge.
+
# Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
**Domain**: RAG
@@ -140,16 +769,98 @@ The paper details the architecture of REALM, which consists of a neural knowledg
**DOI**: [https://doi.org/10.18653/v1/2021.eacl-main.74](https://doi.org/10.18653/v1/2021.eacl-main.74)
+**Published**: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2021)
+
+**Authors**: [Gautier Izacard](https://www.webofscience.com/wos/author/record/9388385), [Edouard Grave](https://www.webofscience.com/wos/author/record/21956629), _Facebook AI Research_
+
## Summary
The paper investigates the integration of generative models with passage retrieval techniques to enhance open domain question answering (QA) systems. The authors highlight that while generative models have shown competitive performance without external knowledge, they often require extensive resources due to their large parameter sizes. By incorporating passage retrieval, particularly from sources like Wikipedia, the authors demonstrate that their approach, termed Fusion-in-Decoder, significantly improves performance on benchmarks such as Natural Questions and TriviaQA. The method retrieves multiple passages and utilizes a sequence-to-sequence model to generate answers, effectively aggregating evidence from these passages.
+## Issues Targeted
+- **High Resource Requirements**
+ - Generative models for open domain question answering require models with billions of parameters.
+ - These models are expensive to train and query.
+- **Lack of External Knowledge Utilization**
+ - Previous generative models do not leverage external sources of knowledge, limiting their effectiveness.
+- **Inefficiency in Evidence Aggregation**
+ - Extractive models struggle with aggregating and combining evidence from multiple passages.
+ - There is a need for a method that can efficiently combine information from various sources.
+
+## Contribution/Novelty
+- **Fusion-in-Decoder Approach**
+ - Introduces a novel method called Fusion-in-Decoder, which processes retrieved passages independently in the encoder while aggregating evidence in the decoder.
+ - This architecture allows for efficient handling of a large number of passages, scaling linearly with the number of passages rather than quadratically.
+
+- **Combining Generative and Retrieval Models**
+ - Explores the synergy between generative models and retrieval-based approaches, leveraging external knowledge sources (e.g., Wikipedia) to enhance answer generation.
+ - Provides evidence that generative models can outperform traditional extractive models in open domain question answering tasks.
+
+## Approach
+- **Two-Step Process**: The approach consists of two main steps:
+ - Retrieval of Support Passages
+ - Answer Generation Using a Generative Model
+
+- **Passage Retrieval**: Utilizes two methods for retrieving support passages:
+ - **BM25**: A traditional information retrieval method that ranks passages based on term frequency and inverse document frequency.
+ - **Dense Passage Retrieval (DPR)**: A more advanced method that uses dense vector representations computed with BERT to rank passages based on the dot product between query and passage representations.
+
+- **Generative Model**
+ - Employs a sequence-to-sequence (seq2seq) model, specifically pretrained models like T5 or BART, to generate answers.
+ - The model takes as input the question along with the retrieved passages.
+
+- **Fusion-in-Decoder Architecture**
+ - Each retrieved passage is concatenated with the question and processed independently in the encoder.
+ - The decoder performs attention over the concatenated representations of all retrieved passages, allowing for effective evidence fusion.
+  - This architecture enables the model to aggregate information from multiple passages while maintaining computational efficiency (a shape-level sketch follows this list).
+
+- **Training and Fine-Tuning**
+ - The model is fine-tuned on specific datasets (Natural Questions, TriviaQA, SQuAD) using a constant learning rate and dropout.
+ - The training process involves sampling target answers and optimizing the model to improve its performance on the validation set.
+
+- **Evaluation Metrics**
+ - The performance of the model is evaluated using the Exact Match (EM) metric, which assesses the accuracy of generated answers against a list of acceptable answers.
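+
+A shape-level sketch of the Fusion-in-Decoder encoding scheme is given below; it uses random arrays as a stand-in for the real T5/BART encoder and only illustrates that passages are encoded independently while the decoder attends over the concatenation of their representations.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(1)
+
+def toy_encoder(token_ids: np.ndarray, d_model: int = 16) -> np.ndarray:
+    """Stand-in for the seq2seq encoder: maps a token sequence to hidden states."""
+    return rng.normal(size=(len(token_ids), d_model))
+
+# One question and N retrieved passages (token ids are placeholders).
+question = np.arange(12)                          # 12-token question
+passages = [np.arange(100) for _ in range(4)]     # 4 passages of 100 tokens each
+
+# Each (question + passage) pair is encoded independently, so encoder cost
+# grows linearly with the number of passages...
+encoded = [toy_encoder(np.concatenate([question, p])) for p in passages]
+
+# ...and the encoder outputs are concatenated so that cross-attention in the
+# decoder can fuse evidence from all passages when generating the answer.
+fused_memory = np.concatenate(encoded, axis=0)
+
+print("per-passage encoding shape:", encoded[0].shape)    # (112, 16)
+print("fused decoder memory shape:", fused_memory.shape)  # (448, 16)
+```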
+
+## Dataset/Testing
+**Dataset**
+
+- **Natural Questions (NQ)**
+ - Contains questions corresponding to Google search queries.
+ - The open-domain version is created by discarding answers with more than 5 tokens.
+- **TriviaQA**
+ - Comprises questions gathered from trivia and quiz-league websites.
+ - The unfiltered version is used for open-domain question answering.
+
+- **SQuAD v1.1**
+ - A reading comprehension dataset where annotators write questions based on paragraphs extracted from Wikipedia.
+ - The validation set is used as the test set, with 10% of the training set kept for validation.
+
+**Testing**
+
+- **Evaluation Metrics**
+  - The model's performance is evaluated using the Exact Match (EM) metric, which determines the accuracy of generated answers by checking whether they match any acceptable answer after normalization (a minimal EM sketch follows this list).
+
+- **Training and Testing Setup**
+ - The models are trained and evaluated using the same preprocessing techniques, leading to passages of 100 words that do not overlap.
+ - During training, the model is fine-tuned on each dataset independently, and the best model is selected based on performance on the validation set.
+
+- **Retrieval Process**
+ - For testing, the model retrieves 100 passages (unless stated otherwise) and truncates them to 250-word pieces.
+ - Different retrieval methods (BM25 for SQuAD and DPR for NQ and TriviaQA) are employed to gather relevant passages for answering the questions.
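+
+A minimal sketch of the Exact Match check referred to above is shown below; the normalization rules (lowercasing, stripping punctuation and articles) follow the common SQuAD-style convention and are an assumption rather than the paper's exact code.
+
+```python
+import re
+import string
+
+def normalize(text: str) -> str:
+    """Lowercase, drop punctuation and articles, and collapse whitespace."""
+    text = text.lower()
+    text = "".join(ch for ch in text if ch not in string.punctuation)
+    text = re.sub(r"\b(a|an|the)\b", " ", text)
+    return " ".join(text.split())
+
+def exact_match(prediction: str, acceptable_answers: list[str]) -> bool:
+    """True if the normalized prediction equals any normalized acceptable answer."""
+    return any(normalize(prediction) == normalize(ans) for ans in acceptable_answers)
+
+print(exact_match("The Eiffel Tower", ["Eiffel Tower", "eiffel tower"]))  # True
+print(exact_match("Louvre", ["Eiffel Tower"]))                            # False
+```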
+
+## Result
+The performance of the Fusion-in-Decoder model improves significantly with an increase in the number of retrieved passages:
+- For TriviaQA, increasing from 10 to 100 passages leads to a 6% improvement in performance.
+- For Natural Questions, a 3.5% improvement is observed when increasing the number of passages.
+
+## Findings
+- **Effectiveness of Fusion-in-Decoder**: The Fusion-in-Decoder approach demonstrates significant improvements in performance for open domain question answering by effectively combining evidence from multiple retrieved passages.
+- **Scalability**: The method scales well with the number of retrieved passages, showing that performance continues to improve as more passages are included, unlike many extractive models that plateau after a certain number of passages.
+
## Limitations
-- **Limited Integration of Retrieval and Generation**:
- - The proposed method processes passages independently in the encoder, which may limit the model's ability to leverage inter-passage relationships effectively.
- - Future work could explore more integrated approaches that combine retrieval and generation more seamlessly.
-- **Scalability Concerns**:
- - Although the method scales well with the number of retrieved passages, there may be practical limits to how many passages can be effectively processed, especially in real-time applications.
- - The computational cost may increase significantly with larger numbers of passages, potentially leading to latency issues.
+- **Generalization to Other Domains**: The findings are primarily based on specific datasets (Natural Questions, TriviaQA, SQuAD), and the generalizability of the approach to other domains or types of questions remains to be fully explored.
+
+## Scope
+- **Combining with Other Modalities**: Future work could explore the integration of multimodal data (e.g., images, videos) alongside text to enhance the model's capabilities in answering questions that require diverse forms of evidence.
# Retrieval Augmented Code Generation and Summarization
**Domain**: RAG
@@ -158,12 +869,126 @@ The paper titled investigates the integration of generative models with passage
**DOI**: [https://doi.org/10.18653/v1/2021.findings-emnlp.232](https://doi.org/10.18653/v1/2021.findings-emnlp.232)
+**Published**: Findings of the Association for Computational Linguistics: EMNLP 2021 (2021)
+
+**Authors**:
+- [Md Rizwan Parvez](https://www.webofscience.com/wos/author/record/26939024), [Wasi Uddin Ahmad](https://www.webofscience.com/wos/author/record/2156449), [Kai-Wei Chang](https://www.webofscience.com/wos/author/record/2239945), _University of California, Los Angeles_
+- [Saikat Chakraborty](https://www.webofscience.com/wos/author/record/49321037), [Baishakhi Ray](https://www.webofscience.com/wos/author/record/27610985), _Columbia University_
+
## Summary
The paper presents REDCODER, a retrieval-augmented framework designed to enhance code generation and summarization tasks for software developers. By mimicking the behavior of developers who often recall and adapt previously written code or summaries, REDCODER retrieves relevant code snippets or summaries from a database and incorporates them into the generation process. The framework employs a two-step approach: first, a retriever module identifies relevant code or summaries, and then a generator module uses this augmented input to produce the desired output. The authors conducted extensive experiments on benchmark datasets in Java and Python, demonstrating that REDCODER significantly improves the quality of generated code and summaries compared to existing models, achieving notable increases in Exact Match and BLEU scores. The uniqueness of REDCODER lies in its ability to work with both unimodal and bimodal retrieval databases, allowing it to leverage high-quality source code and natural language descriptions effectively. The results indicate that the integration of retrieved information enhances the performance of code generation and summarization tasks, validating the framework's approach to automating software development processes. The paper concludes with a discussion on the potential for extending REDCODER to other code automation tasks, highlighting its contributions to the field of software engineering and natural language processing.
+## Issues Targeted
+- **Poor Code Quality**: Many generated code snippets suffer from low quality, which affects their usability and correctness.
+- **Complexity of Code Generation and Summarization**: Generating source code and summarizing it are complex tasks that require understanding various programming language constructs at lexical, syntax, and semantic levels.
+- **Limited Utilization of High-Quality Data**: Existing approaches do not effectively leverage high-quality source code and their descriptions available in open-source repositories during the generation process.
+- **Challenges with Long Source Code**: The performance of existing models tends to degrade when dealing with longer source code, indicating a need for better handling of such cases.
+
+## Contribution/Novelty
+- **Introduction of REDCODER Framework**: The paper presents REDCODER, a novel retrieval-augmented framework for code generation and summarization that enhances the generation process by incorporating relevant code and summaries retrieved from a database.
+
+- **Support for Unimodal and Bimodal Retrieval**: The framework can work with retrieval databases that include both unimodal (only code or natural language descriptions) and bimodal instances (code-description pairs), allowing for greater flexibility in data usage.
+
+- **Two-Step Process**: REDCODER employs a two-step process where a retriever first retrieves relevant information, which is then processed by a generator to produce the final output. This modular design allows for configurability and adaptability.
+
+## Approach
+- **Two-Step Process**:
+ - **Retriever Module**:
+ - The first step involves a retriever that retrieves relevant source code or summaries from a database based on the input (natural language text for code generation or code snippet for summarization).
+ - The retriever is designed using a dense retrieval technique, specifically a modified version of the Dense Passage Retriever (DPR), which utilizes two different encoders for encoding queries and documents.
+ - It can handle both unimodal and bimodal retrieval, allowing it to work with various types of data.
+
+ - **Generator Module**:
+ - The second step involves a generator that processes the retrieved information along with the original input to generate the target output (either code or summary).
+ - The generator used in REDCODER is PLBART, a sequence-to-sequence Transformer model pre-trained on a large collection of source code and natural language descriptions.
+
+- **Augmentation of Input**:
+ - The retrieved code or summaries are concatenated with the original input to form an augmented input sequence, which is then fed into the generator.
+ - This augmentation allows the generator to leverage relevant context from the retrieved data, enhancing its output quality.
+
+- **Training and Fine-Tuning**:
+ - The retriever module (SCODE-R) is fine-tuned using parallel examples of code and summaries, optimizing its ability to distinguish relevant documents from irrelevant ones without relying on "hard" negatives.
+ - The generator (SCODE-G) is trained to generate outputs based on the augmented input sequences.
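+
+A toy sketch of the retrieve-and-augment step described above follows; `toy_embed` stands in for the trained SCODE-R bi-encoder, the candidate snippets are made up, and the "<sep>" separator is a hypothetical placeholder for whatever special token the generator expects.
+
+```python
+import zlib
+import numpy as np
+
+def toy_embed(text: str, dim: int = 8) -> np.ndarray:
+    """Deterministic pseudo-random vector per text; a real retriever would use
+    trained transformer encoders for queries and candidates instead."""
+    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
+    return rng.normal(size=dim)
+
+# Retrieval database of candidate code snippets (unimodal here; a bimodal
+# database would pair each snippet with its natural-language summary).
+database = [
+    "def add(a, b): return a + b",
+    "def mean(xs): return sum(xs) / len(xs)",
+    "def read_file(path): return open(path).read()",
+]
+
+query = "compute the average of a list of numbers"
+scores = np.array([toy_embed(query) @ toy_embed(code) for code in database])
+top = np.argsort(-scores)[:2]   # top-2 candidates by inner product
+
+# Augmented input: the original query concatenated with the retrieved candidates;
+# this combined sequence is what the generator (PLBART in the paper) consumes.
+augmented_input = " <sep> ".join([query] + [database[i] for i in top])
+print(augmented_input)
+```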
+
+## Dataset/Testing
+**Dataset**
+
+- **CodeXGLUE**:
+ - Language: Java and Python
+ - Tasks: Code generation and summarization
+ - Statistics:
+ - Training: 164,923 (Java), 251,820 (Python)
+ - Validation: 5,183 (Java), 13,914 (Python)
+ - Test: 10,955 (Java), 14,918 (Python)
+ - The dataset is curated from CodeSearchNet and filtered for noisy examples.
+
+- **Concode**:
+ - Language: Java
+ - Task: Code generation
+ - Statistics:
+ - Training: 100,000
+ - Validation: 2,000
+ - Test: 2,000
+ - This dataset includes additional context such as environment variables and methods in the input summaries.
+
+**Testing**
+
+- The REDCODER framework was tested using the aforementioned datasets by evaluating its performance on both code generation and summarization tasks.
+
+- The evaluation metrics used include:
+ - **BLEU**: For measuring the quality of generated text against reference texts.
+ - **CodeBLEU**: A metric specifically designed to evaluate the correctness of generated code.
+ - **Exact Match (EM)**: The percentage of generated outputs that exactly match the reference outputs.
+
+- The empirical results were compared against various baseline models, including retrieval-based and generative models, to demonstrate the effectiveness of the REDCODER framework.
+
+## Result
+- **Code Generation Results**:
+ - **CodeXGLUE Dataset**:
+ - **REDCODER** achieved:
+ - Exact Match (EM): 8.95
+ - BLEU: 26.92
+ - CodeBLEU: 31.15
+ - **REDCODER-EXT** achieved:
+ - EM: 10.21
+ - BLEU: 28.98
+ - CodeBLEU: 33.18
+ - **Concode Dataset**:
+ - **REDCODER** achieved:
+ - EM: 23.4
+ - BLEU: 41.6
+ - CodeBLEU: 43.4
+ - **REDCODER-EXT** achieved:
+ - EM: 23.3
+ - BLEU: 42.5
+ - CodeBLEU: 43.4
+
+- **Code Summarization Results**:
+ - **CodeXGLUE Dataset**:
+ - **REDCODER achieved**:
+ - BLEU: 50.4
+ - ROUGE-L: 58.8
+ - **REDCODER-EXT** achieved:
+ - BLEU: 50.4
+ - ROUGE-L: 58.7
+ - **SCODE-R** (retrieval-only model) achieved:
+ - BLEU: 48.0
+ - ROUGE-L: 55.7
+
+## Findings
+- **Effectiveness of Retrieval-Augmented Generation**:
+ - The REDCODER framework significantly improves code generation and summarization tasks by effectively leveraging retrieved code and summaries.
+ - Empirical results show substantial gains in performance metrics (e.g., BLEU, CodeBLEU, Exact Match) compared to baseline models.
+
+- **Modular Design Benefits**:
+ - The two-step process (retrieval followed by generation) allows for a flexible and configurable framework that can adapt to various retrieval and generation models.
+
## Limitations
-- **Limited to Specific Programming Languages**: The experiments were primarily conducted on Java and Python, which may limit the generalizability of the findings to other programming languages.
-- **Performance on Long Code**: The performance of the generator (PLBART) decreases with increasing code length, indicating challenges in handling longer code snippets effectively.
+- **Handling of Long Code Snippets**: While REDCODER improves performance, the results indicate that the generator's performance may still degrade with longer code snippets, suggesting a need for further optimization in handling such cases.
+
+## Scope
+- **Future Enhancements**: The authors express interest in extending REDCODER to support additional code automation tasks, such as code translation and other forms of code generation.
+- **Broader Application**: The framework could be adapted for use in various software engineering applications beyond code generation and summarization, such as code review, debugging, and documentation generation.
# DocPrompting: Generating Code by Retrieving the Docs
**Domain**: RAG
@@ -172,19 +997,89 @@ The paper presents REDCODER, a retrieval-augmented framework designed to enhance
**DOI**: [https://doi.org/10.48550/arXiv.2207.05987](https://doi.org/10.48550/arXiv.2207.05987)
+**Published**: The Eleventh International Conference on Learning Representations (2023)
+
+**Authors**: Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig
+
## Summary
The paper introduces a novel approach called DocPrompting, which enhances natural language to code generation (NL → code) by leveraging code documentation. Traditional code generation models struggle to generalize to unseen functions and libraries due to their reliance on training data that may not include all available APIs. In contrast, human programmers often consult documentation when encountering new functions. DocPrompting addresses this gap by first retrieving relevant documentation based on a natural language intent and then generating code using both the intent and the retrieved documentation. The authors demonstrate that this method significantly improves the performance of existing models, such as CodeT5 and GPT-Neo, across various benchmarks, including Python and Bash, achieving notable gains in execution-based evaluations. The paper also details the implementation of DocPrompting, which consists of a retriever that selects relevant documents and a generator that produces code snippets based on these documents. The experiments conducted show that DocPrompting consistently outperforms baseline models that do not utilize documentation, highlighting its effectiveness in enabling models to generate code for previously unseen functions. The authors provide new benchmarks for retrieval-based code generation and emphasize the potential for further improvements through better retriever and generator designs.
+## Issues Targeted
+- **Inability to Generalize to Unseen Functions and Libraries**
+ - Existing code generation models cannot handle new functions and libraries that were not present in their training data.
+ - Even familiar functions may have unseen arguments, leading to limitations in code generation.
+
+- **Dependence on Training Data**
+ - Current models rely heavily on input-output pairs from training data, which restricts their ability to adapt to new or updated APIs.
+
+- **Lack of Reference to Documentation**
+ - Human programmers often refer to documentation when encountering new functions, but existing models do not utilize this resource effectively.
+
+## Contribution/Novelty
+- **Introduction of DocPrompting**: The paper presents DocPrompting, a novel approach for generating code by explicitly retrieving and utilizing code documentation based on natural language intents.
+
+- **Retrieve-Then-Generate Paradigm**: DocPrompting employs a retrieve-then-generate framework, where relevant documentation is first retrieved and then used to inform the code generation process, enhancing the model's ability to handle unseen functions and libraries.
+
+- **General Applicability**: The approach is general and can be applied to any programming language and is agnostic to the underlying neural model, making it versatile across different coding environments.
+
+## Approach
+- **Retrieve-Then-Generate Paradigm**: The process follows a retrieve-then-generate paradigm:
+ - **Retrieval of Documentation**:
+ - Given a natural language (NL) intent, a document retriever identifies and retrieves relevant pieces of code documentation from a documentation pool.
+ - The retriever computes similarity scores between the NL intent and each document in the pool to select the top-k relevant documents.
+ - **Code Generation**:
+ - The retrieved documentation is then used as context for a code generator, which generates the corresponding code snippet based on the NL intent and the retrieved documents.
+
+- **Use of External Documentation Pool**:
+ - The documentation pool serves as an external resource that can be updated frequently with new content, allowing the model to leverage the latest documentation without needing to retrain.
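+
+A small end-to-end sketch of this retrieve-then-generate loop is shown below, using a bag-of-words cosine similarity as a stand-in retriever (the paper evaluates both sparse and dense retrievers); the documentation entries and the final prompt format are illustrative assumptions.
+
+```python
+import math
+from collections import Counter
+
+# A tiny documentation pool; in DocPrompting this holds per-function / per-flag
+# documentation entries and can be refreshed without retraining the generator.
+doc_pool = {
+    "tar -x": "tar: -x, --extract   extract files from an archive",
+    "tar -z": "tar: -z, --gzip      filter the archive through gzip",
+    "grep -r": "grep: -r, --recursive   read all files under each directory",
+}
+
+def bow(text: str) -> Counter:
+    return Counter(text.lower().split())
+
+def cosine(a: Counter, b: Counter) -> float:
+    """Bag-of-words cosine similarity; a BM25 or dense retriever would replace this."""
+    num = sum(a[t] * b[t] for t in set(a) & set(b))
+    denom = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
+    return num / denom if denom else 0.0
+
+intent = "extract a gzip compressed tar archive"
+top_k = sorted(doc_pool, key=lambda d: cosine(bow(intent), bow(doc_pool[d])), reverse=True)[:2]
+
+# Retrieve-then-generate: the retrieved docs are prepended to the intent, and the
+# combined prompt is handed to the code generator (e.g. CodeT5 or GPT-Neo).
+prompt = "\n".join(doc_pool[d] for d in top_k) + f"\n# intent: {intent}\n# code:"
+print(prompt)
+```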
+
+## Dataset/Testing
+**Dataset**
+
+- **CoNaLa (Code/Natural Language Challenge)**
+ - A popular benchmark for natural language to Python code generation.
+ - The dataset consists of pairs of StackOverflow questions (natural language intents) and their corresponding Python code snippets (answers).
+ - The authors re-split the dataset to ensure that every example in the development and test sets includes at least one Python function that was not seen in the training data, enhancing the generalization testing.
+
+- **tldr (Too Long; Didn't Read)**
+ - A newly curated dataset for shell scripting that includes natural language descriptions and corresponding Bash command lines.
+ - The dataset contains 1,879 unique Bash commands and 9,187 natural language to Bash pairs.
+ - The training, development, and test sets are constructed with completely disjoint commands to test the generalizability of the code generation models.
+
+**Testing**
+
+- **Execution-Based Evaluation**
+ - The models were evaluated using execution-based metrics, where the generated code snippets were tested against predefined unit tests to assess their functional correctness.
+ - For CoNaLa, the authors used manually written unit tests for 100 examples from the test set to measure the pass rate (pass@k) of the generated code.
+
+- **Performance Metrics**
+ - Various metrics were employed to evaluate the performance of the models, including:
+ - BLEU score for measuring the quality of generated Python code.
+ - Command name accuracy (CMD Acc), exact match (EM), token-level F1, and character-level BLEU (charBLEU) for the tldr dataset.
+ - Function recall and unseen function recall to assess the model's ability to generate correct function calls, especially for those not seen during training.
+
+## Result
+- **Performance on CoNaLa Dataset**
+ - **BLEU Score Improvement**: DocPrompting improved the BLEU score of CodeT5 by 1.65 points over the state-of-the-art baseline.
+ - **Function Recall**: The recall of generated function names was significantly higher with DocPrompting, achieving 18.30 for unseen functions compared to 9.03 for the base CodeT5 model.
+ - **Execution-Based Evaluation**: The pass@k metrics showed consistent improvements with DocPrompting, with a 2.85% improvement on pass@1 and a 4.45% improvement on pass@5.
+
+- **Performance on tldr Dataset**
+  - **Command Name Accuracy (CMD Acc)**: The accuracy for command name prediction improved significantly with DocPrompting across various models. For example, CodeT5 with DocPrompting achieved a CMD Acc of 30.72% compared to 14.60% without it.
+ - **Exact Match (EM)**: The exact match rate also saw substantial gains, with CodeT5+DocPrompting achieving 9.15% EM compared to 2.18% without it.
+ - **Character-Level BLEU (charBLEU)**: The charBLEU score improved from 21.50% to 33.83% with the addition of DocPrompting.
+
+## Findings
+- **Effectiveness of Documentation Retrieval**: The study found that leveraging documentation significantly improves the accuracy and generalization of code generation models, allowing them to handle unseen functions and libraries effectively.
+- **Generalization Capability**: Models using DocPrompting demonstrated a marked improvement in their ability to generalize to unseen functions, as evidenced by higher recall rates for unseen function calls in the CoNaLa dataset.
+
+
## Limitations
-- **Dependence on Documentation Quality**:
- - The effectiveness of DocPrompting heavily relies on the quality and comprehensiveness of the retrieved documentation.
- - If the documentation is outdated, incomplete, or poorly written, it may lead to inaccurate code generation.
-- **Generalization to New Libraries**:
- - While DocPrompting aims to generalize to unseen functions and libraries, it may still struggle with entirely new libraries that lack sufficient documentation.
- - The approach assumes that relevant documentation is available for all potential new functions, which may not always be the case.
-- **Retrieval Performance Variability**:
- - The performance of the retrieval component can vary significantly based on the chosen retriever (sparse vs. dense).
- - The paper indicates that BM25 performs well for the tldr dataset but not as effectively for CoNaLa, suggesting that the choice of retriever is critical and context-dependent.
+- **Retrieval Latency**: Although the retrieval process is efficient, it introduces additional computation time during inference, which may be a concern in real-time applications.
+- **Limited Scope of Benchmarks**: The benchmarks used (CoNaLa and tldr) may not cover all possible use cases or programming scenarios, potentially limiting the generalizability of the findings to other domains or languages.
+
+## Scope
+- **Future Research Directions**: The paper suggests that future work could focus on enhancing the retrieval mechanisms, such as developing stronger retrievers or integrating joint training for the retriever and generator to minimize cascading errors.
# Retrieval-Augmented Generation for Large Language Models: A Survey
**Domain**: RAG
@@ -201,13 +1096,112 @@ The paper provides a comprehensive survey of Retrieval-Augmented Generation (RAG
**DOI**: [http://dx.doi.org/10.1145/3130348.3130375](http://dx.doi.org/10.1145/3130348.3130375)
+**Published**: SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (2001)
+
+**Authors**: [John D Lafferty](https://dl.acm.org/profile/81100055408), [ChengXiang Zhai](https://dl.acm.org/profile/81100540023), _Carnegie Mellon University, Pittsburgh_
+
## Summary
The paper presents a novel framework for information retrieval that integrates document and query models through a probabilistic ranking function grounded in Bayesian decision theory. This approach enhances the traditional language modeling methods by estimating both document and query language models, allowing for a more nuanced understanding of user preferences, context, and word semantics. A key innovation introduced is the use of Markov chains to estimate query models, which improves upon previous translation models by addressing issues related to data sparsity and context independence. The authors evaluate their methods using TREC collections, demonstrating significant performance improvements over standard query expansion techniques, particularly for short queries in web data. The framework emphasizes risk minimization in the retrieval process, allowing for a flexible and general approach to ranking documents based on their relevance to user queries. By leveraging the strengths of language modeling and incorporating user-specific knowledge, the proposed methods show promise in enhancing the effectiveness of information retrieval systems. The experiments conducted validate the efficacy of the Markov chain method for query expansion, highlighting its potential to outperform traditional models like TF-IDF and Rocchio in various retrieval scenarios.
+## Issues Targeted
+- **Integration of Document and Query Models**: The need for a framework that effectively combines document models and query models in information retrieval.
+
+- **Risk Minimization in Retrieval**: The challenge of casting the retrieval problem in terms of risk minimization using Bayesian decision theory.
+
+- **User Preferences and Context Modeling**: The necessity to model user preferences, query context, synonymy, and word senses in query language models.
+
+## Contribution/Novelty
+- **New Framework for Information Retrieval**: Introduction of a probabilistic framework that integrates document and query models using Bayesian decision theory, focusing on risk minimization.
+
+- **Extension of Language Modeling Approach**: Development of an operational retrieval model that extends the language modeling approach by estimating both document and query language models.
+
+- **Markov Chain Method for Query Expansion**: Proposal of a novel method using Markov chains to estimate query models, allowing for the incorporation of user preferences and context in a more dynamic manner.
+
+- **Semantic Smoothing Techniques**: Introduction of a new approach to semantic smoothing that addresses the limitations of traditional translation models, enabling better handling of synonyms and word senses.
+
+## Approach
+- **Probabilistic Framework**
+ - The paper employs a probabilistic framework for information retrieval that integrates document and query models based on Bayesian decision theory.
+
+- **Risk Minimization**
+ - The retrieval problem is framed in terms of risk minimization, where the goal is to minimize the expected loss associated with presenting documents to users.
+
+- **Language Modeling**
+ - A language modeling approach is utilized, where:
+ - A multinomial model is estimated for each document.
+ - A query likelihood is computed based on the document language model.
+ - The Kullback-Leibler divergence is used to compare document and query models.
+
+- **Estimation of Document and Query Models**
+ - The approach involves estimating both document language models and query language models, allowing for a more comprehensive understanding of the retrieval process.
+
+- **Markov Chain Method**
+ - A novel Markov chain method is introduced for expanding query models:
+ - The method simulates a random walk through documents and words, allowing for the estimation of translation probabilities between words and query terms.
+ - This method captures user browsing behavior and incorporates document relevance.
+
+- **Smoothing Techniques**
+ - The paper discusses various smoothing techniques to ensure non-zero probabilities for query terms not present in documents, including:
+ - Linear interpolation with background models.
+ - Semantic smoothing to incorporate synonyms and word sense information.
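+
+The KL-divergence ranking with linear-interpolation smoothing described above can be sketched as follows; the expanded query model is hand-written here to stand in for the Markov-chain estimate, and the unigram "language models" are simple whitespace counts.
+
+```python
+import math
+from collections import Counter
+
+def lm(text: str) -> Counter:
+    return Counter(text.lower().split())
+
+def score_kl(query_model: dict, doc: str, collection: Counter, lam: float = 0.5) -> float:
+    """Rank-equivalent KL score: sum_w p(w | theta_q) * log p(w | theta_d), where the
+    document model is smoothed by linear interpolation (Jelinek-Mercer) with the
+    collection model so unseen query words still get non-zero probability."""
+    d = lm(doc)
+    d_len, c_len = sum(d.values()), sum(collection.values())
+    total = 0.0
+    for w, p_wq in query_model.items():
+        p_wd = lam * (d.get(w, 0) / d_len) + (1 - lam) * (collection.get(w, 0) / c_len)
+        if p_wd > 0:
+            total += p_wq * math.log(p_wd)
+    return total
+
+docs = [
+    "star wars is a space opera film",
+    "the stock market fell sharply today",
+    "nasa launched a new space probe",
+]
+collection = lm(" ".join(docs))
+
+# Expanded query model: original terms plus related terms with smaller weights
+# (a hand-made stand-in for the Markov-chain query expansion in the paper).
+query_model = {"space": 0.4, "probe": 0.3, "nasa": 0.2, "launched": 0.1}
+
+for doc in sorted(docs, key=lambda d: score_kl(query_model, d, collection), reverse=True):
+    print(f"{score_kl(query_model, doc, collection):8.3f}  {doc}")
+```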
+
+## Dataset/Testing
+**Dataset**
+
+- **TREC Collections**: The paper evaluates the proposed methods using three different TREC (Text REtrieval Conference) testing collections:
+ - **AP Collection**: Specifically, the AP89 collection on disk 1, which includes topics 1–50.
+ - **TREC8 Ad Hoc Task Collection**: This collection includes topics 401–450 on disks 4 & 5 (CR).
+ - **TREC8 Web Track Collection**: This collection also includes topics 401–450, focusing on web data.
+
+**Testing**
+- **Query Selection**
+  - The evaluation setup approximates real-world retrieval scenarios by using the titles of each TREC topic description as queries. These titles are short, averaging about 2.5 words and ranging from one to four words each.
+
+- **Pre-Indexing**
+ - All documents and queries were pre-indexed using the indexing approach described in the paper, which includes calculating necessary statistics at index time.
+
+- **Tokenization**
+  - The documents and queries were stemmed with the Porter stemmer; no stopword list was used, in order to test the robustness of the modeling techniques.
+
+- **Performance Metrics**
+ - The performance of the proposed methods was assessed using various metrics, including:
+ - Non-interpolated average precision.
+ - Recall at 1,000 documents.
+ - Initial precision (interpolated precision at 0% recall).
+
+- **Comparative Analysis**
+ - The results were compared against standard query expansion methods and strong baseline TF-IDF systems to demonstrate the effectiveness of the proposed approach.
+
+## Result
+- **Performance Improvement with Query Models**: The introduction of the query translation model significantly improved retrieval performance compared to the basic language model:
+ - **AP89 Collection**:
+ - Average Precision (AvgPr) improved from 0.188 (simple LM) to 0.201 (query model), a 7% increase.
+ - Further improvement to 0.232 (query model with pseudo-feedback), a 23% increase.
+ - **TREC8 Collection**:
+ - AvgPr improved from 0.241 (simple LM) to 0.266 (query model), a 10% increase.
+ - Further improvement to 0.294 (query model with pseudo-feedback), a 22% increase.
+ - **Web Collection**:
+ - AvgPr improved from 0.244 (simple LM) to 0.275 (query model), a 13% increase.
+ - Further improvement to 0.304 (query model with pseudo-feedback), a 25% increase.
+
+- **Comparison with TF-IDF and Rocchio**: The query translation model was compared with traditional TF-IDF models and the Rocchio method:
+ - On the **AP89 Collection**, the query model performed worse than TF-IDF with Rocchio (AvgPr of 0.201 vs. 0.230), but the query model with pseudo-feedback showed a slight improvement (0.232).
+ - On the **TREC8 Collection**, the query model (AvgPr of 0.266) performed similarly to TF-IDF (0.256), but the query model with pseudo-feedback (0.294) showed a 15% improvement.
+ - On the **Web Collection**, the query model (AvgPr of 0.275) outperformed TF-IDF (0.226) by 22%, and the query model with pseudo-feedback (0.304) achieved a 35% improvement over Rocchio.
+
+## Findings
+- **Effective Integration of Models**: The proposed framework successfully integrates document and query models, demonstrating that a probabilistic approach based on risk minimization can enhance information retrieval.
+- **Significant Performance Improvements**: The use of query translation models and the Markov chain method for query expansion resulted in substantial improvements in retrieval performance across various datasets, particularly for short queries.
+
## Limitations
-- **Potential Overfitting**: The models may be prone to overfitting, especially when using a limited number of documents for feedback in the Markov chain method.
+- **Computational Complexity**: The Markov chain method, while effective, may introduce computational complexity, particularly in terms of the time required to compute probabilities and perform the random walk.
+- **Short Query Focus**: The methods were primarily evaluated on short queries, which may not generalize as effectively to longer, more complex queries.
-- **Context-Independence of Translation Models**: The translation probabilities used in the models are context-independent, which limits their ability to handle word-sense ambiguity and contextual nuances.
+## Scope
+- **Future Research Directions**: The paper suggests several avenues for future research, including:
+ - Exploring more sophisticated methods for estimating translation probabilities that can better handle word sense ambiguity and context.
+ - Investigating the integration of additional user context and feedback mechanisms to further enhance retrieval relevance.
+ - Expanding the framework to accommodate more complex query types and longer queries.
# A Neural Corpus Indexer for Document Retrieval
**Domain**: RAG
@@ -216,13 +1210,106 @@ The paper presents a novel framework for information retrieval that integrates d
**DOI**: [https://doi.org/10.48550/ARXIV.2206.02743](https://doi.org/10.48550/ARXIV.2206.02743)
+**Published**: Advances in Neural Information Processing Systems (2022)
+
+**Authors**: [Yujing Wang](https://www.webofscience.com/wos/author/record/55724650), [Yingyan Hou](https://www.webofscience.com/wos/author/record/42287047), [Haonan Wang](https://www.webofscience.com/wos/author/record/68019245), [Ziming Miao](https://www.webofscience.com/wos/author/record/11504231), [Shibin Wu](https://www.webofscience.com/wos/author/record/46951651), [Hao Sun](https://www.webofscience.com/wos/author/record/45223724), [Qi Chen](https://www.webofscience.com/wos/author/record/32038774), [Yuqing Xia](https://www.webofscience.com/wos/author/record/15124357), [Chengmin Chi](https://www.webofscience.com/wos/author/record/44005151), [Guoshuai Zhao](https://www.webofscience.com/wos/author/record/2458591), [Zheng Liu](https://www.webofscience.com/wos/author/record/61070361), [Xing Xie](https://www.webofscience.com/wos/author/record/45717534), [Hao Allen Sun](https://www.webofscience.com/wos/author/record/45277652), [Weiwei Deng](https://www.webofscience.com/wos/author/record/62347619), [Qi Zhang](https://www.webofscience.com/wos/author/record/47965805), [Mao Yang](https://www.webofscience.com/wos/author/record/51623825), _Microsoft_
+
## Summary
The paper presents the Neural Corpus Indexer (NCI), an innovative end-to-end deep neural network designed to enhance document retrieval performance by directly generating relevant document identifiers for specific queries. Traditional document retrieval methods often rely on separate indexing and retrieval stages, which can limit optimization for final retrieval targets. NCI addresses this limitation by employing a sequence-to-sequence architecture that integrates training and indexing, utilizing techniques such as a prefix-aware weight-adaptive decoder, query generation, and semantic document identifiers. Empirical results demonstrate that NCI significantly outperforms existing methods, achieving notable improvements in recall metrics on benchmark datasets like NQ320k and TriviaQA. The authors highlight the advantages of NCI, including its ability to capture deep interactions between queries and documents, and its potential to serve as a comprehensive solution for next-generation information retrieval systems. By optimizing the entire retrieval process within a unified framework, NCI reduces the dependency on traditional indexing methods and enhances the efficiency of document retrieval, making it a promising approach for future research and applications in the field.
+## Issues Targeted
+- **Inefficiency of Traditional Index-Retrieve Paradigms**: Traditional document retrieval methods often follow a rigid index-retrieve paradigm that is not optimized for final retrieval targets.
+
+- **Limited Recall Performance**: Existing methods struggle with recall performance, which is crucial for the effectiveness of web search engines.
+
+- **Inability to Capture Document Semantics**: Term-based retrieval approaches fail to capture the semantics of documents, leading to poor retrieval of similar documents with different wordings.
+
+- **Deep Query-Document Interaction Challenges**: Current models do not effectively incorporate deep interactions between queries and documents, limiting their performance.
+
+- **Need for End-to-End Models**: There is a need for end-to-end models that can directly retrieve relevant candidates without relying on explicit indexing.
+
+## Contribution/Novelty
+- **Introduction of Neural Corpus Indexer (NCI)**: The paper proposes a novel end-to-end differentiable document retrieval model called Neural Corpus Indexer (NCI), which directly retrieves relevant document identifiers for a given query.
+
+- **Unified Training and Indexing**: NCI unifies the training and indexing stages into a single deep neural network, allowing for end-to-end optimization using realistic query-document pairs.
+
+- **Prefix-Aware Weight-Adaptive (PAWA) Decoder**: The introduction of a novel decoder architecture, the prefix-aware weight-adaptive (PAWA) decoder, which customizes token predictions based on different prefixes, enhancing the model's ability to generate relevant document identifiers.
+
+- **Utilization of Query Generation Techniques**: The paper employs a query generation network to create augmented query-document pairs, which helps the model better understand document semantics and improves training effectiveness.
+
+- **Semantic Document Identifiers**: NCI leverages hierarchical k-means clustering to generate semantic identifiers for documents, ensuring that similar documents have closely related identifiers, which aids in the retrieval process.
+
+## Approach
+- **Components of NCI**
+  - **Query Generation**: A query generation network is employed to create diverse query-document pairs. This includes:
+ - **DocT5Query**: A sequence-to-sequence transformer model that generates queries based on document content.
+ - **Document as Query**: Using the first 64 terms of each document and additional random segments as queries to enhance semantic awareness.
+ - **Encoder**: The encoder follows a standard transformer architecture, processing the input query to produce a query embedding.
+ - **Prefix-Aware Weight-Adaptive (PAWA) Decoder**: A novel decoder architecture that adapts the weights for token predictions based on the prefix of the identifier being generated. This allows the model to differentiate between the same tokens appearing in different contexts.
+
+- **Hierarchical k-Means for Semantic Identifiers**: Documents are represented by semantic identifiers generated through a hierarchical k-means algorithm, which organizes documents into a tree structure. This ensures that similar documents have closely related identifiers (a toy sketch follows this list).
+
+- **Consistency-Based Regularization**: A consistency-based regularization loss is applied during training to reduce overfitting by ensuring that the model's predictions are consistent across different forward passes.
+
+- **Inference via Beam Search**: During inference, the model retrieves the top N relevant documents using beam search constrained by the hierarchical structure of semantic identifiers, ensuring that only valid identifiers are generated.
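+
+A toy sketch of the hierarchical k-means assignment of semantic identifiers is given below; it uses a branching factor of 2 and tiny random embeddings purely for illustration, whereas the paper works with much larger branching factors and BERT document embeddings.
+
+```python
+import numpy as np
+from sklearn.cluster import KMeans
+
+def semantic_ids(embeddings: np.ndarray, k: int = 2, leaf_size: int = 2, prefix=()):
+    """Recursive hierarchical k-means: each document gets the sequence of cluster
+    indices along its path in the tree, so similar documents share identifier prefixes."""
+    ids = {}
+    if len(embeddings) <= leaf_size:
+        for i in range(len(embeddings)):
+            ids[i] = prefix + (i,)
+        return ids
+    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
+    for c in range(k):
+        members = np.where(labels == c)[0]
+        for local_idx, ident in semantic_ids(embeddings[members], k, leaf_size, prefix + (c,)).items():
+            ids[int(members[local_idx])] = ident
+    return ids
+
+rng = np.random.default_rng(3)
+# Toy document embeddings with two obvious groups (BERT embeddings in the paper).
+docs = np.vstack([rng.normal(0.0, 0.1, size=(4, 8)), rng.normal(3.0, 0.1, size=(4, 8))])
+
+for doc_idx, ident in sorted(semantic_ids(docs).items()):
+    print(f"doc {doc_idx}: identifier {ident}")
+# The decoder then emits such identifiers token by token, with beam search
+# constrained to valid prefixes of this tree.
+```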
+
+## Dataset/Testing
+**Dataset**
+- **Natural Questions (NQ) Dataset**
+ - Specifically, the version referred to as NQ320k.
+ - Contains 320,000 query-document pairs.
+ - Documents are sourced from Wikipedia pages, and the queries are natural language questions.
+ - The dataset includes a predetermined training and validation split for evaluation.
+
+- **TriviaQA Dataset**
+ - A reading comprehension dataset that includes 78,000 query-document pairs.
+ - Queries may have multiple answers, and the documents are also gathered from the Wikipedia domain.
+
+**Testing**
+- The performance of the Neural Corpus Indexer (NCI) model is assessed using widely accepted metrics for information retrieval, including:
+ - **Recall@N**: Measures how often the desired document is found in the top N retrieved candidates.
+ - **Mean Reciprocal Rank (MRR)**: Calculates the reciprocal of the rank at which the first relevant document is retrieved.
+ - **R-Precision**: Precision after R documents have been retrieved, where R is the number of relevant documents for the query.
+
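+For reference, the sketch below shows how Recall@N and MRR can be computed from ranked result lists, assuming a single gold document per query as in NQ320k; it is an illustration, not the paper's evaluation script.
+
+```python
+# Illustrative Recall@N and MRR over ranked retrieval results.
+def recall_at_n(ranked, gold, n):
+    hits = sum(1 for q, docs in ranked.items() if gold[q] in docs[:n])
+    return hits / len(ranked)
+
+def mrr(ranked, gold, cutoff=100):
+    total = 0.0
+    for q, docs in ranked.items():
+        if gold[q] in docs[:cutoff]:
+            total += 1.0 / (docs.index(gold[q]) + 1)
+    return total / len(ranked)
+
+runs = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d2"]}
+gold = {"q1": "d1", "q2": "d5"}
+print(recall_at_n(runs, gold, 1), mrr(runs, gold))  # 0.0 0.25
+```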
+## Result
+- **Performance on Natural Questions (NQ320k) Dataset**
+ - The Neural Corpus Indexer (NCI) model achieved the following results:
+ - **Recall@1**: 65.86% (Base), 66.23% (Large), 70.46% (Ensemble)
+ - **Recall@10**: 85.20% (Base), 85.27% (Large), 89.35% (Ensemble)
+ - **Recall@100**: 92.42% (Base), 92.49% (Large), 94.75% (Ensemble)
+ - **Mean Reciprocal Rank (MRR@100)**: 73.12% (Base), 73.37% (Large), 77.82% (Ensemble)
+ - NCI outperformed all baseline methods, including:
+ - DSI (Base): 27.40% Recall@1
+ - SEAL (Base): 56.98% Recall@1
+ - BM25: 15.11% Recall@1
+ - Notably, NCI with fine-tuned query generation (w/ qg-ft) achieved 72.78% Recall@1, outperforming SEAL by 21.4%.
+
+- **Performance on TriviaQA Dataset**
+ - The NCI model achieved the following results:
+ - **Recall@5**: 90.49% (Base), 91.73% (Large), 94.60% (Ensemble)
+ - **Recall@20**: 94.45% (Base), 95.17% (Large), 96.89% (Ensemble)
+ - **Recall@100**: 96.94% (Base), 97.44% (Large), 98.20% (Ensemble)
+ - **R-Precision**: 73.90% (Base), 74.94% (Large), 80.84% (Ensemble)
+ - Again, NCI outperformed baseline methods, including:
+ - SEAL (Base): 86.3% Recall@5
+ - BM25: 56.91% Recall@5
+
+## Findings
+- **Significant Performance Improvement**: The Neural Corpus Indexer (NCI) model outperformed existing state-of-the-art methods in document retrieval, achieving substantial gains in recall and ranking metrics on both the NQ320k and TriviaQA datasets.
+
+- **Effectiveness of Components**: The study demonstrated that the novel components of NCI, such as the Prefix-Aware Weight-Adaptive (PAWA) decoder, query generation techniques, and consistency-based regularization, significantly contributed to its superior performance.
+
## Limitations
-- **Model Capacity Requirements**: The current implementation of the Neural Corpus Indexer (NCI) requires a larger model capacity to effectively scale to web-scale applications.
-- **Dependency on Augmented Queries**: The performance of NCI heavily relies on the quality and diversity of augmented queries generated during training.
-- **Limited Generalization**: The model may struggle to generalize well to unseen queries or documents that differ significantly from the training data.
+- **Model Capacity for Large-Scale Deployment**: The current implementation of NCI may require a larger model capacity to effectively scale to web-scale document retrieval systems, which could pose challenges in terms of computational resources.
+
+- **Inference Speed**: While the model demonstrates competitive performance, the inference speed needs improvement to handle online queries in real-time applications effectively.
+
+- **Updating Model-Based Index**: The paper notes difficulties in updating the model-based index when new documents are added to the system, which could limit the model's adaptability to dynamic datasets.
+
+## Scope
+- **Enhancing Model Capacity**: Future research could explore architectures like sparsely-gated Mixture of Experts (MoE) to increase model capacity without a proportional increase in computational cost.
+
+- **Semantic Clustering for Efficient Retrieval**: Investigating methods to group documents into semantic clusters could allow NCI to retrieve relevant cluster identifiers, improving efficiency in document retrieval.
# TIARA: Multi-grained Retrieval for Robust Question Answering over Large Knowledge Base
**Domain**: RAG
@@ -231,14 +1318,142 @@ The paper presents the Neural Corpus Indexer (NCI), an innovative end-to-end dee
**DOI**: [https://doi.org/10.18653/v1/2022.emnlp-main.555](https://doi.org/10.18653/v1/2022.emnlp-main.555)
+**Published**: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (2022)
+
+**Authors**:
+- Yiheng Shu, Yuzhong Qu, _State Key Laboratory of Novel Software Technology_
+- Zhiwei Yu, Börje F. Karlsson, Chin-Yew Lin, _Microsoft Research_
+- Yuhan Li, _Nankai University_
+- Tingting Ma, _Harbin Institute of Technology_
+
## Summary
-The experimental results demonstrate that TIARA significantly outperforms previous state-of-the-art models on benchmark datasets like GrailQA and WebQuestionsSP, achieving improvements of at least 4.1 and 1.1 F1 points, respectively. Notably, TIARA excels in zero-shot generalization scenarios, showcasing its robustness in handling unseen queries. The paper highlights the importance of contextual retrieval and constrained decoding in enhancing the capabilities of PLMs for KBQA, ultimately contributing to a more effective and reliable system for querying large-scale knowledge bases.
+The experimental results demonstrate that TIARA significantly outperforms previous state-of-the-art models on benchmark datasets like GrailQA and WebQuestionsSP, achieving improvements of at least 4.1 and 1.1 F1 points, respectively. Notably, TIARA excels in zero-shot generalization scenarios, showcasing its robustness in handling unseen queries. The paper highlights the importance of contextual retrieval and constrained decoding in enhancing the capabilities of PLMs for KBQA, ultimately contributing to a more effective and reliable system for querying large-scale knowledge bases.
+
+## Issues Targeted
+- **Understanding Semantics**
+ - Difficulty in comprehending the semantics of both questions and relevant knowledge from the knowledge base (KB).
+ - Challenges in linking user queries to the appropriate KB items due to the large size and complexity of KBs.
+
+- **Logical Form Generation**
+ - Ensuring that the generated logical forms are both semantically and syntactically correct.
+ - The need for executable logical forms that conform to the specifications of the KB.
+
+- **Generalization Challenges**
+ - Addressing the limitations of existing models that assume a strong correspondence between training and test distributions (i.i.d. assumption).
+ - The requirement for compositional generalization to handle novel combinations of schema items.
+ - The necessity for zero-shot generalization to manage previously unseen items or domains.
+
+## Contribution/Novelty
+- **Introduction of TIARA Model**
+ - The paper presents a novel KBQA model named TIARA, which utilizes multi-grained retrieval to enhance the performance of pre-trained language models (PLMs) in question answering over large knowledge bases.
+
+- **Multi-grained Retrieval Approach**: TIARA employs a multi-grained retrieval strategy that focuses on three key aspects:
+ - **Entity Retrieval**: Enhances mention detection to improve entity linking, especially in zero-shot scenarios.
+ - **Exemplary Logical Form Retrieval**: Retrieves logical forms that provide semantic and structural context, aiding in KB grounding and valid logical form generation.
+ - **Schema Retrieval**: Independently retrieves relevant schema items, which serve as a semantic supplement to logical forms, allowing for better handling of complex queries.
+
+- **Constrained Decoding Mechanism**
+ - The introduction of constrained decoding to control the output space during logical form generation, reducing the likelihood of generating invalid logical forms or schema items. This mechanism ensures that the generated outputs conform to the KB specifications.
+
+## Approach
+- **Multi-grained Retrieval**: The TIARA model employs a multi-grained retrieval strategy that focuses on three main components:
+ - **Entity Retrieval**:
+ - Utilizes a standard pipeline consisting of mention detection, candidate generation, and entity disambiguation.
+ - Enhances mention detection by treating it as a span classification task, improving performance in zero-shot scenarios.
+ - **Exemplary Logical Form Retrieval**:
+ - Enumerates and ranks logical forms starting from potential entities and their neighborhoods (up to two hops).
+ - Uses a ranking mechanism based on a cross-encoder to score pairs of questions and candidate logical forms.
+ - **Schema Retrieval**:
+ - Independently retrieves relevant schema items (classes and relations) using a dense schema retrieval method.
+ - Employs a cross-encoder to learn the interaction between the question and schema items, ensuring that the retrieved schema items are semantically relevant.
+
+- **Target Logical Form Generation**
+ - The model uses a transformer-based sequence-to-sequence model, specifically T5, for generating target logical forms.
+ - The input to the T5 model consists of the concatenated question, retrieved entities, exemplary logical forms, and schema items.
+ - The output is the target logical form, generated through fine-tuning with a cross-entropy objective.
+
+- **Constrained Decoding**
+ - Implements constrained decoding during the generation process to reduce errors in logical form generation.
+ - Validates the generated tokens against a set of allowed operators and schema items stored in a trie (prefix tree), ensuring that only valid options are considered during decoding.
+
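+A minimal sketch of the trie-based constraint described above; the token strings and logical-form fragments are invented for illustration, and a real system would operate on T5 tokenizer ids rather than whole words.
+
+```python
+# Sketch of trie-constrained decoding: at each step, only tokens that extend a valid
+# operator/schema sequence stored in the trie are allowed.
+class Trie:
+    def __init__(self):
+        self.children = {}
+        self.is_end = False
+
+    def insert(self, tokens):
+        node = self
+        for t in tokens:
+            node = node.children.setdefault(t, Trie())
+        node.is_end = True
+
+    def allowed_next(self, prefix):
+        node = self
+        for t in prefix:
+            node = node.children.get(t)
+            if node is None:
+                return set()
+        return set(node.children)
+
+trie = Trie()
+trie.insert(["(", "AND", "people.person"])
+trie.insert(["(", "JOIN", "people.person.spouse_s"])
+
+# During beam search, the decoder's vocabulary would be masked to this set:
+print(trie.allowed_next(["("]))         # {'AND', 'JOIN'}
+print(trie.allowed_next(["(", "AND"]))  # {'people.person'}
+```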
+## Dataset/Testing
+**Dataset**
+
+- **GrailQA**
+ - A large-scale knowledge base question answering (KBQA) dataset based on Freebase.
+ - Contains 64,331 questions annotated with logical forms.
+ - Specifically designed to evaluate three levels of generalization:
+ - **i.i.d. (Independent and Identically Distributed)**
+ - **Compositional Generalization**
+ - **Zero-shot Generalization**
+ - Questions in GrailQA can involve up to 4-hop relations and may include functions for counting, superlatives, and comparatives.
+
+- **WebQuestionsSP (WebQSP)**
+ - A widely used semantic parsing dataset that includes 4,937 questions sourced from the Google Suggest API.
+ - The dataset is utilized for evaluating the model's performance in generating logical forms and answering questions.
+
+**Testing**
+
+- **Empirical Evaluation**: The TIARA model was tested through extensive experiments on the aforementioned datasets. The evaluation metrics used include:
+ - **Exact Match (EM)**: Measures the percentage of predictions that match the ground truth exactly.
+ - **F1 Score**: A harmonic mean of precision and recall, providing a balance between the two metrics.
+
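+A small illustration of the two metrics, assuming the usual KBQA convention of comparing predicted and gold answer sets; this is not the authors' exact evaluation code.
+
+```python
+# Illustrative Exact Match and answer-set F1.
+def exact_match(pred_answers, gold_answers):
+    return float(set(pred_answers) == set(gold_answers))
+
+def answer_f1(pred_answers, gold_answers):
+    pred, gold = set(pred_answers), set(gold_answers)
+    overlap = len(pred & gold)
+    if overlap == 0:
+        return float(pred == gold)  # two empty sets count as a perfect match
+    precision, recall = overlap / len(pred), overlap / len(gold)
+    return 2 * precision * recall / (precision + recall)
+
+print(exact_match(["a", "b"], ["a", "c"]), answer_f1(["a", "b"], ["a", "c"]))  # 0.0 0.5
+```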
+## Result
+- **Performance on GrailQA**
+ - **Overall Results:**
+ - TIARA achieved an F1 score of **78.5%** and an exact match (EM) of **73.0%** on the hidden test set of GrailQA.
+ - This performance outperformed previous state-of-the-art (SOTA) methods by at least **4.1 F1 points**.
+ - **Generalization Settings**:
+ - **i.i.d.**: TIARA scored **87.8% F1**.
+ - **Compositional**: TIARA scored **69.2% F1**.
+ - **Zero-shot**: TIARA scored **68.0% F1**.
+ - Notably, TIARA improved by **4.7 F1 points** in zero-shot generalization compared to previous models.
+
+- **Performance on WebQuestionsSP (WebQSP)**
+ - **Overall Results**:
+ - TIARA achieved an F1 score of **76.7%** and a hits@1 score of **73.9%** on the WebQSP test set.
+ - This performance also surpassed previous SOTA methods by **1.1 F1 points**.
+ - **Comparison with Oracle Annotations**:
+ - When using oracle entity linking annotations, TIARA's F1 score increased to **78.9%**.
+
+- **Ablation Studies**: The ablation studies demonstrated the importance of various components:
+ - Removing exemplary logical form retrieval (w/o ELF) resulted in a **26.5 F1 point** drop in zero-shot settings.
+ - Removing schema retrieval (w/o Schema) led to a **2.7 F1 point** decrease overall.
+  - Removing constrained decoding (w/o CD) showed that it contributes a **0.4 F1 point** improvement overall.
+
+- **Entity and Schema Retrieval Performance**
+ - **Entity Retrieval**: TIARA's entity retrieval achieved a precision of **87.2%**, recall of **88.6%**, and F1 score of **85.4%** on the GrailQA validation set.
+ - **Schema Retrieval**: TIARA outperformed the baseline in schema retrieval, achieving a recall of **95.8%** for classes and **92.0%** for relations.
+
+- **Error Analysis**: The paper also included an error analysis, identifying the main sources of errors:
+ - **Entity Retriever Errors**: 46% due to mention detection failures or high ambiguity.
+ - **Syntactic Errors**: 26% related to rare operators or complex logical forms.
+ - **Semantic Errors**: 12% from incorrect schema item selection, particularly in zero-shot instances.
+
+## Findings
+- **Effectiveness of Multi-grained Retrieval**:
+ - The multi-grained retrieval approach significantly improves the robustness of the TIARA model in understanding and generating logical forms for KBQA.
+ - The integration of entity retrieval, exemplary logical form retrieval, and schema retrieval provides comprehensive contextual support for the PLM.
+
+- **Impact of Constrained Decoding**:
+ - Constrained decoding effectively reduces generation errors by ensuring that only valid operators and schema items are considered during logical form generation.
+ - This mechanism enhances the syntactic and semantic correctness of the generated outputs.
+
+- **Generalization Capabilities**:
+ - TIARA demonstrates strong performance across various generalization settings, including i.i.d., compositional, and zero-shot scenarios, outperforming previous state-of-the-art methods.
+ - The model's ability to handle previously unseen items and complex queries is notably improved.
## Limitations
-- **Retrieval Efficiency**:
- - The retrieval efficiency of the proposed method needs further optimization.
- - Logical form enumeration takes more than 7 seconds per question without caching, which may not meet practical requirements.
+- **Dependence on Logical Form Annotations**: The model relies on annotated logical forms for training, which requires extensive and specialized data collection efforts, making it less scalable for other domains or knowledge bases.
+
+- **Retrieval Efficiency**: The retrieval process, particularly logical form enumeration, can be time-consuming, taking more than 7 seconds per question without caching. This may limit practical applications in real-time scenarios.
+
+- **Zero-shot Performance**: While TIARA improves zero-shot performance, it still faces challenges with high ambiguity in entity retrieval and incorrect schema item selection, particularly in unseen instances.
+## Scope
+- **Future Research Directions**:
+ - The paper suggests exploring methods to bridge the gap between unstructured natural language and structured knowledge bases during the pre-training phase of PLMs.
+ - Investigating more efficient retrieval mechanisms to enhance the speed and scalability of the model for practical applications.
# Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
**Domain**: RAG
@@ -247,20 +1462,103 @@ The experimental results demonstrate that TIARA significantly outperforms previo
**DOI**: [https://doi.org/10.48550/arXiv.2310.11511](https://doi.org/10.48550/arXiv.2310.11511)
+**Published**: The Twelfth International Conference on Learning Representations (2024)
+
+**Authors**:
+- [Akari Asai](https://openreview.net/profile?id=~Akari_Asai2), [Zeqiu Wu](https://openreview.net/profile?id=~Zeqiu_Wu1), [Yizhong Wang](https://openreview.net/profile?id=~Yizhong_Wang2), [Hannaneh Hajishirzi](https://openreview.net/profile?id=~Hannaneh_Hajishirzi1), _University of Washington_
+- [Avirup Sil](https://openreview.net/profile?id=~Avirup_Sil1), _IBM Research AI_
+
## Summary
The paper introduces a novel framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG), designed to enhance the quality and factual accuracy of large language models (LLMs) through on-demand retrieval and self-reflection. Traditional Retrieval-Augmented Generation (RAG) methods often retrieve fixed passages indiscriminately, which can lead to irrelevant or low-quality outputs. In contrast, Self-RAG employs a mechanism where the model generates special reflection tokens to determine the necessity of retrieval, evaluate the relevance of retrieved passages, and critique its own outputs. This adaptive approach allows the model to tailor its responses based on the specific requirements of the task, significantly improving factual accuracy and citation precision across various tasks, including open-domain question answering and long-form generation. Experimental results demonstrate that Self-RAG outperforms state-of-the-art LLMs and retrieval-augmented models, including ChatGPT and Llama2-chat, across multiple benchmarks. The framework not only enhances the model's ability to generate accurate and verifiable information but also allows for customizable behavior during inference, enabling users to adjust the model's focus on factual accuracy versus creativity based on the task at hand. Overall, Self-RAG represents a significant advancement in the field of LLMs, addressing the persistent issue of factual inaccuracies in generated content.
+## Issues Targeted
+- **Factual Inaccuracies in LLMs**
+ - Large language models (LLMs) often produce responses with factual errors due to reliance on parametric knowledge.
+
+- **Limitations of Conventional RAG Approaches**
+ - Indiscriminate retrieval of passages can lead to low-quality responses.
+ - Fixed retrieval methods do not adapt to the necessity of retrieval for specific tasks.
+
+- **Need for Improved Factuality and Citation Accuracy**
+ - There is a demand for better factual accuracy and citation precision in long-form text generation.
+
+- **Control Over Generation Process**
+ - Existing models lack mechanisms for self-reflection and critique during the generation process, which can enhance output quality.
+
+## Contribution/Novelty
+- **Introduction of Self-RAG Framework**: The paper presents a novel framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances the quality and factuality of LLMs through adaptive retrieval and self-reflection.
+
+- **On-Demand Retrieval Mechanism**: Self-RAG allows LLMs to retrieve relevant passages on-demand, rather than relying on a fixed number of retrieved documents, improving the relevance and utility of the information used in generation.
+
+- **Reflection Tokens for Self-Critique**: The framework introduces special tokens, known as reflection tokens, which enable the model to assess the necessity of retrieval, evaluate the relevance of retrieved passages, and critique its own generated outputs.
+
+## Approach
+- **Self-Reflective Retrieval-Augmented Generation (Self-RAG) Framework**:
+ - The core approach is the Self-RAG framework, which integrates retrieval, generation, and self-reflection to improve the performance of large language models (LLMs).
+
+- **End-to-End Training**:
+ - Self-RAG trains a single arbitrary LLM in an end-to-end manner, allowing it to learn to retrieve relevant passages, generate text, and reflect on its own outputs.
+
+- **Reflection Tokens**: Special tokens, called reflection tokens, are introduced to facilitate self-assessment. These tokens are categorized into:
+ - **Retrieval Tokens**: Indicate whether retrieval is necessary.
+ - **Critique Tokens**: Evaluate the quality of the generated output and its support from retrieved passages.
+
+- **Parallel Processing of Retrieved Passages**
+ - Self-RAG processes multiple retrieved passages in parallel, evaluating their relevance and generating corresponding task outputs concurrently.
+
+- **Critique and Selection Process**
+ - After generating outputs, the model generates critique tokens to assess its own output quality and selects the best response based on factuality and overall quality.
+
+- **Customizable Decoding Algorithm**
+ - The framework includes a customizable decoding algorithm that allows for flexible adjustments in retrieval frequency and model behavior based on user-defined constraints and preferences.
+
+- **Training with Reflection Tokens**
+ - The model is trained on a diverse dataset interleaved with reflection tokens and retrieved passages, enabling it to learn to generate and utilize these tokens effectively during inference.
+
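+The control flow described above can be summarized in a short, hedged sketch; `generate`, `retrieve`, and `critique_scores` are placeholders for the trained LM, the retriever, and the reflection-token probabilities, and the additive score is an illustrative simplification of the paper's weighted decoding.
+
+```python
+# Sketch of Self-RAG-style inference: decide whether to retrieve, generate one
+# candidate per retrieved passage, then pick the best using critique scores.
+def self_rag_answer(query, generate, retrieve, critique_scores, k=5):
+    # Step 1: a retrieval token indicates whether evidence is needed at all.
+    if not generate(query, decide_retrieval=True):
+        return generate(query)
+    # Step 2: process retrieved passages in parallel, one candidate each.
+    candidates = []
+    for passage in retrieve(query, k=k):
+        answer = generate(query, passage=passage)
+        # Step 3: critique tokens score relevance, support, and overall usefulness.
+        rel, sup, use = critique_scores(query, passage, answer)
+        candidates.append((rel + sup + use, answer))
+    # Step 4: return the candidate with the best combined critique score.
+    return max(candidates)[1]
+
+# Toy stand-ins, just to show the control flow:
+ans = self_rag_answer(
+    "Who wrote Hamlet?",
+    generate=lambda q, passage=None, decide_retrieval=False:
+        True if decide_retrieval else f"Answer from {passage}",
+    retrieve=lambda q, k: ["p1", "p2"],
+    critique_scores=lambda q, p, a: (1.0, 1.0, 0.5 if p == "p1" else 0.2),
+)
+print(ans)  # Answer from p1
+```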
+## Dataset/Testing
+**Dataset**
+
+- **Diverse Instruction-Following Datasets**: The training data consists of a variety of instruction-following input-output pairs sampled from multiple sources, including:
+ - **Open-Instruct Processed Data**: A collection of instruction-following datasets.
+ - **Knowledge-Intensive Datasets**: Such as Natural Questions, Wizard of Wikipedia, and FEVER.
+
+- **Specific Datasets for Evaluation**: The paper evaluates the Self-RAG framework on several specific datasets across different tasks:
+ - **Closed-set Tasks**:
+ - **PubHealth**: A fact verification dataset related to public health.
+ - **ARC-Challenge**: A multiple-choice reasoning dataset created from scientific exams.
+ - **Short-form Generation Tasks**:
+ - **PopQA**: An open-domain question answering dataset.
+ - **TriviaQA-Unfiltered**: Another open-domain QA dataset.
+ - **Long-form Generation Tasks**:
+ - **Biography Generation Task**: Evaluating the generation of biographical content.
+ - **ALCE-ASQA**: A long-form question answering task that requires citation.
+
+## Result
+**Task-Specific Results**
+ - **Closed-set Tasks**:
+ - On the **PubHealth** dataset, Self-RAG achieved higher accuracy compared to baseline models, indicating better performance in fact verification.
+ - In the **ARC-Challenge**, Self-RAG also outperformed other models, showcasing its effectiveness in multiple-choice reasoning tasks.
+ - **Short-form Generation Tasks**:
+ - In **PopQA**, Self-RAG outperformed models like ChatGPT and Llama2-chat, particularly excelling in answering rare entity queries.
+ - For **TriviaQA-Unfiltered**, Self-RAG showed significant gains in generating accurate answers compared to baseline models.
+ - **Long-form Generation Tasks**:
+ - In the biography generation task, Self-RAG achieved higher FactScore, indicating improved factual accuracy in long-form text generation.
+ - For the **ALCE-ASQA** task, Self-RAG exhibited higher citation precision and recall compared to other models, demonstrating its capability in generating long-form answers with proper citations.
+ - **Citation Accuracy**
+ - Self-RAG showed significant improvements in citation accuracy, outperforming all models except ChatGPT in citation precision, which measures whether the model-generated claims are fully supported by cited evidence.
+
+## Findings
+- **Enhanced Performance**: Self-RAG significantly outperformed existing state-of-the-art LLMs and retrieval-augmented models across various tasks, demonstrating improvements in factual accuracy, citation precision, and overall generation quality.
+
+- **Effective Use of Reflection Tokens**: The introduction of reflection tokens allowed the model to self-assess its outputs, leading to better decision-making regarding retrieval and generation quality.
+
+- **Adaptive Retrieval**: The on-demand retrieval mechanism improved the relevance of the information used in generation, reducing the likelihood of generating irrelevant or off-topic content.
+
## Limitations
-- **Dependence on Retrieved Passages**:
- - The effectiveness of Self-RAG heavily relies on the quality and relevance of the retrieved passages.
- - If the retrieval model fails to provide relevant information, the output quality may degrade significantly.
-- **Potential for Factual Inaccuracies**:
- - Despite improvements in factual accuracy, Self-RAG can still generate outputs that are not fully supported by the citations.
- - The model may produce plausible-sounding but incorrect information if the retrieved passages are misleading or incorrect.
-- **Complexity of Implementation**:
- - The framework introduces additional complexity in terms of training and inference due to the integration of reflection tokens and the need for a retriever model.
- - This complexity may hinder practical deployment in real-world applications where simplicity and efficiency are crucial.
-- **Generalization to New Tasks**: The ability of Self-RAG to generalize to new, unseen tasks or domains remains uncertain, particularly if those tasks differ significantly from the training data.
+- **Complexity of Implementation**: The integration of multiple components (retriever, generator, and critic) adds complexity to the model, which may pose challenges in deployment and maintenance.
+
+## Scope
+- **Integration with Other Techniques**: There is scope for integrating Self-RAG with other advancements in natural language processing, such as reinforcement learning from human feedback (RLHF) or multi-modal models, to further enhance its capabilities.
# Precise Zero-Shot Dense Retrieval without Relevance Labels
**Domain**: RAG
@@ -269,12 +1567,132 @@ The paper introduces a novel framework called Self-Reflective Retrieval-Augmente
**DOI**: [https://doi.org/10.18653/v1/2023.acl-long.99](https://doi.org/10.18653/v1/2023.acl-long.99)
+**Published**: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2023)
+
+**Authors**:
+- [Luyu Gao](https://www.webofscience.com/wos/author/record/21713311), [Jamie Callan](https://www.webofscience.com/wos/author/record/19869862), _Language Technologies Institute, Carnegie Mellon University_
+- [Xueguang Ma](https://www.webofscience.com/wos/author/record/32617627), [Jimmy Lin](https://www.webofscience.com/wos/author/record/34920494), _David R. Cheriton School of Computer Science, University of Waterloo_
+
## Summary
The paper presents a novel approach called HyDE (Hypothetical Document Embeddings) aimed at improving zero-shot dense retrieval systems, which traditionally struggle without relevance labels. The authors propose a two-step process where an instruction-following language model, such as InstructGPT, generates a hypothetical document based on a given query. This document, although potentially containing inaccuracies or hallucinations, captures relevance patterns. Subsequently, an unsupervised contrastively learned encoder, like Contriever, encodes this hypothetical document into an embedding vector, which is then used to retrieve similar real documents from a corpus based on vector similarity. The experimental results demonstrate that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and achieves performance comparable to fine-tuned models across various tasks, including web search, question answering, and fact verification, as well as in multiple languages. The authors emphasize that their method requires no supervision and can be implemented using existing models without any modifications, making it a practical solution for emerging search tasks that lack relevance data.
+## Issues Targeted
+- **Challenges of Zero-Shot Dense Retrieval**:
+ - Difficulty in creating effective zero-shot dense retrieval systems without relevance labels.
+ - The inherent complexity of zero-shot learning and encoding relevance.
+
+- **Dependence on Relevance Labels**:
+ - Traditional methods often rely on large, manually labeled datasets (e.g., MS MARCO) which may not be available in all scenarios.
+ - The limitations of existing datasets that restrict commercial use and applicability in real-world search tasks.
+
+- **Inefficiency of Current Approaches**:
+ - Existing zero-shot retrieval methods often underperform compared to supervised models.
+ - The need for a solution that works out-of-the-box and generalizes across various tasks without requiring extensive training or supervision.
+
+- **Handling Hallucinations in Generated Content**:
+ - The challenge of filtering out inaccuracies or "hallucinations" in documents generated by language models.
+ - Ensuring that the generated hypothetical documents still capture relevant patterns despite potential factual errors.
+
+- **Generalization Across Languages and Tasks**:
+ - The necessity for a retrieval system that performs well not only in English but also in non-English languages.
+ - Addressing the diverse nature of emerging search tasks and queries.
+
+- **Unsupervised Learning Approaches**:
+ - Exploring self-supervised representation learning methods to improve retrieval effectiveness without supervision.
+ - The need for a robust method that leverages unsupervised contrastive learning techniques effectively.
+
+## Contribution/Novelty
+- **Introduction of HyDE (Hypothetical Document Embeddings)**:
+ - The paper proposes a novel framework called HyDE that enables effective zero-shot dense retrieval without the need for relevance labels.
+ - HyDE leverages the capabilities of instruction-following language models to generate hypothetical documents that capture relevance patterns.
+
+- **Decomposition of Dense Retrieval Tasks**:
+ - The approach decomposes the dense retrieval process into two distinct tasks:
+ - A generative task performed by an instruction-following language model (e.g., InstructGPT).
+ - A document-document similarity task executed by a contrastive encoder (e.g., Contriever).
+ - This separation allows for more effective handling of relevance modeling.
+
+- **Utilization of Unsupervised Contrastive Learning**:
+ - HyDE employs an unsupervised contrastive encoder to filter out hallucinations from the generated documents, ensuring that the embeddings focus on relevant content.
+ - This method allows for the retrieval of real documents based on the generated hypothetical document embeddings.
+
+- **No Need for Supervision or Fine-Tuning**:
+ - HyDE operates without requiring any relevance supervision or fine-tuning of the underlying models, making it a practical solution for real-world applications.
+ - The method can be implemented using existing models "out of the box," which enhances its accessibility and usability.
+
+## Approach
+- **Step 1: Generating Hypothetical Documents**
+ - **Instruction-Following Language Model**:
+ - A generative model, such as InstructGPT, is prompted with a query to generate a hypothetical document that answers the question.
+ - The generated document is not real and may contain inaccuracies (hallucinations), but it is designed to capture relevance patterns similar to actual documents.
+
+- **Step 2: Encoding the Generated Document**
+ - **Unsupervised Contrastive Encoder**:
+ - An unsupervised contrastive encoder (e.g., Contriever) is used to encode the generated hypothetical document into an embedding vector.
+ - This encoder serves as a lossy compressor, filtering out extraneous details and focusing on relevant content.
+
+- **Retrieval Process**:
+ - The embedding vector from the generated document is used to identify a neighborhood in the corpus embedding space.
+ - Real documents are retrieved based on vector similarity to the generated document's embedding, leveraging the document-document similarity captured during the contrastive pre-training of the encoder.
+
+- **No Relevance Labels Required**:
+ - The entire process is conducted without any relevance supervision, making it fully zero-shot.
+ - The approach does not require fine-tuning or training new models, allowing for immediate application of existing models.
+
+- **Expectation-Based Document Vector Calculation**:
+ - The approach involves sampling multiple hypothetical documents and averaging their embeddings to create a robust query vector.
+ - This expectation-based method enhances the reliability of the generated document representation.
+
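+A hedged sketch of the two-step pipeline above; `llm_generate`, `encoder`, and `index` stand in for an instruction-following LM (e.g. InstructGPT), an unsupervised contrastive encoder (e.g. Contriever), and a vector index, and the prompt text and sample count are illustrative assumptions.
+
+```python
+# Sketch of HyDE: generate hypothetical answer documents, embed them, average the
+# vectors, and search the corpus index by vector similarity.
+import numpy as np
+
+def hyde_search(query, llm_generate, encoder, index, n_samples=4, top_k=10):
+    prompt = f"Write a passage that answers the question: {query}"
+    # Sampling several hypothetical documents and averaging makes the query vector
+    # robust to individual hallucinations.
+    hypothetical_docs = [llm_generate(prompt) for _ in range(n_samples)]
+    vectors = np.stack([encoder(d) for d in hypothetical_docs])
+    query_vector = vectors.mean(axis=0)
+    # Real documents are retrieved purely by document-document similarity.
+    return index.search(query_vector, top_k)
+```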
+## Result
+- **Performance Comparison**:
+ - HyDE significantly outperformed the state-of-the-art unsupervised dense retriever, Contriever, across multiple evaluation metrics and datasets.
+ - The results indicate that HyDE's performance is comparable to fine-tuned retrieval models, demonstrating its effectiveness as a zero-shot retrieval system.
+
+- **Web Search Tasks**: In experiments on TREC DL19 and DL20 datasets:
+ - **HyDE Results**:
+ - mAP: 41.8 (DL19), 38.2 (DL20)
+ - nDCG@10: 61.3 (DL19), 57.9 (DL20)
+ - Recall@1k: 88.0 (DL19), 84.4 (DL20)
+ - **Contriever Results**:
+ - mAP: 24.0 (both DL19 and DL20)
+ - nDCG@10: 44.5 (DL19), 42.1 (DL20)
+ - **BM25 Results**:
+ - mAP: 30.1 (DL19), 28.6 (DL20)
+
+- **Low-Resource Retrieval Tasks**: In a selection of low-resource tasks from the BEIR benchmark:
+ - **HyDE Results**:
+ - nDCG@10: 69.1 (SciFact), 46.6 (Arguana), 59.3 (TREC-COVID), 27.3 (FiQA), 36.8 (DBPedia), 44.0 (TREC-NEWS), 22.3 (Climate-Fever)
+ - **Contriever Results**:
+ - nDCG@10: 64.9 (SciFact), 37.9 (Arguana), 27.3 (TREC-COVID), 24.5 (FiQA), 29.2 (DBPedia), 34.8 (TREC-NEWS), 15.5 (Climate-Fever)
+
+- **Multilingual Retrieval**: In experiments on the Mr.TyDi dataset:
+ - **HyDE Results**:
+ - MRR@100: 41.7 (Swahili), 30.6 (Korean), 30.7 (Japanese), 41.3 (Bengali)
+ - **mContriever Results**:
+ - MRR@100: 38.3 (Swahili), 22.3 (Korean), 19.5 (Japanese), 35.3 (Bengali)
+
+## Findings
+- **Effectiveness of HyDE**:
+ - The HyDE framework significantly improves zero-shot dense retrieval performance compared to existing unsupervised models like Contriever.
+ - It achieves results comparable to fine-tuned models across various tasks, demonstrating its capability to generalize effectively.
+
+- **Generative Model Utility**:
+ - The use of an instruction-following language model (InstructGPT) to generate hypothetical documents captures relevance patterns, which enhances the retrieval process.
+ - The approach successfully filters out hallucinations through the unsupervised contrastive encoder, ensuring that the embeddings focus on relevant content.
+
+- **Multilingual Performance**:
+ - HyDE performs well in multiple languages, indicating its versatility and applicability in diverse linguistic contexts.
+
+- **No Need for Supervision**:
+ - The framework operates without requiring relevance labels or fine-tuning, making it a practical solution for real-world applications where such resources may be limited.
+
## Limitations
-- **Dependence on Language Models**: The HyDE method relies heavily on real-time generation from large language models (LLMs), which may not be suitable for tasks requiring high throughput or low latency.
-- **Potential Bias**: The generated documents may reflect biases present in the LLMs, potentially skewing the search results.
+- **Dependence on Language Model Quality**: The performance of HyDE is contingent on the quality of the instruction-following language model used. If the model generates poor or irrelevant hypothetical documents, it may negatively impact retrieval effectiveness.
+
+- **Real-Time Generation Constraints**: The reliance on real-time generation from large language models may not be suitable for applications requiring high throughput or low latency, potentially limiting its deployment in time-sensitive scenarios.
+
+## Scope
+- **Future Research Directions**: The paper suggests potential extensions of the HyDE approach to more complex tasks, such as multi-hop retrieval and conversational search, indicating a pathway for future research and development.
+
# Corrective Retrieval Augmented Generation
**Domain**: Corrective Retrieval Augmented Generation
@@ -283,6 +1701,8 @@ The paper presents a novel approach called HyDE (Hypothetical Document Embedding
**DOI**: [https://doi.org/10.48550/arXiv.2401.15884](https://doi.org/10.48550/arXiv.2401.15884)
+**Published**: ICLR 2025 Conference Submission (**withdrawn**)
+
## Summary
The paper introduces Corrective Retrieval Augmented Generation (CRAG), a novel approach designed to enhance the robustness of large language models (LLMs) by addressing the issue of hallucinations and inaccuracies that arise from reliance on retrieved documents. CRAG incorporates a lightweight retrieval evaluator that assesses the quality of retrieved documents and triggers corrective actions based on their relevance, categorized as Correct, Incorrect, or Ambiguous. When the retrieved documents are deemed correct, they undergo a knowledge refinement process to extract essential information. Conversely, if they are incorrect, CRAG resorts to large-scale web searches for supplementary knowledge. The method is designed to be plug-and-play, allowing it to be integrated seamlessly with existing retrieval-augmented generation frameworks. Experimental results across four diverse datasets demonstrate that CRAG significantly improves the performance of standard retrieval-augmented generation (RAG) and state-of-the-art approaches like Self-RAG. The findings highlight CRAG's adaptability and generalizability in both short- and long-form generation tasks, showcasing its effectiveness in mitigating the challenges posed by inaccurate retrievals. The paper concludes by emphasizing the importance of self-correction mechanisms in enhancing the reliability of generative models while acknowledging the need for further advancements in retrieval evaluation capabilities.
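+Read as a control loop, the corrective step described in the summary might look like the following sketch; `evaluate`, `refine`, `web_search`, and the two thresholds are placeholders rather than the paper's exact components.
+
+```python
+# Sketch of CRAG's corrective action: a lightweight evaluator scores retrieval
+# quality as Correct / Incorrect / Ambiguous and triggers the matching action.
+def corrective_retrieve(query, docs, evaluate, refine, web_search,
+                        upper=0.6, lower=0.2):
+    score = max(evaluate(query, d) for d in docs)
+    if score > upper:    # Correct: keep internal knowledge, but strip noise.
+        return refine(query, docs)
+    if score < lower:    # Incorrect: discard and fall back to web search.
+        return web_search(query)
+    # Ambiguous: combine refined internal knowledge with external results.
+    return refine(query, docs) + web_search(query)
+```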
@@ -301,16 +1721,142 @@ The paper introduces Corrective Retrieval Augmented Generation (CRAG), a novel a
**DOI**: [https://doi.org/10.18653/v1/2022.naacl-main.194](https://doi.org/10.18653/v1/2022.naacl-main.194)
+**Published**: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2022)
+
+**Authors**:
+- [Michael Glass](https://www.webofscience.com/wos/author/record/659899), [Gaetano Rossiello](https://www.webofscience.com/wos/author/record/29366662), [Md Faisal Mahbub Chowdhury](https://www.webofscience.com/wos/author/record/2268597), [Ankita Rajaram Naik](https://www.webofscience.com/wos/author/record/45043390), [Pengshan Cai](https://www.webofscience.com/wos/author/record/19717931), [Alfio Gliozzo](https://www.webofscience.com/wos/author/record/2118999), _IBM Research AI_
+
## Summary
The paper presents a novel approach called Re2G (Retrieve, Rerank, Generate), which enhances the performance of generative language models by integrating retrieval and reranking mechanisms into a BART-based sequence-to-sequence generation framework. The authors argue that while large transformer models like GPT-3 and T5 have shown impressive capabilities, they can be further improved by leveraging non-parametric memory through retrieval from a corpus of passages. Re2G combines neural initial retrieval with a reranking process that allows for the merging of results from different retrieval methods, such as BM25 and neural approaches, thereby improving the quality of the generated outputs. The system is trained end-to-end using a novel variation of knowledge distillation, which utilizes only the ground truth of the target sequence output. The experimental results demonstrate significant improvements across four diverse tasks—zero-shot slot filling, question answering, fact checking, and dialog—achieving relative gains of 9% to 34% over previous state-of-the-art models on the KILT leaderboard. The paper highlights the effectiveness of the reranking mechanism and the benefits of ensembling retrieval methods, ultimately establishing Re2G as a leading approach in knowledge-intensive natural language processing tasks. The authors have made their code available as open source to facilitate further research and development in this area.
+## Issues Targeted
+- **Knowledge Limitations of Large Transformers**: Large transformers like GPT-3 and T5 require extensive parameter spaces to store knowledge, which can be computationally expensive.
+
+- **Inefficiency in Knowledge Retrieval**: Traditional models struggle with efficiently retrieving relevant knowledge from large corpora, leading to suboptimal performance in knowledge-intensive tasks.
+
+- **Integration of Retrieval and Generation**: Existing models often do not effectively combine retrieval mechanisms with generative capabilities, limiting their performance on tasks requiring external knowledge.
+
+- **Reranking Challenges**: Difficulty in merging retrieval results from different sources with incomparable scoring systems, which can hinder the effectiveness of the retrieval process.
+
+## Contribution/Novelty
+- **Introduction of Re²G Framework**: The paper presents a novel framework called Re²G (Retrieve, Rerank, Generate) that integrates neural initial retrieval and reranking into a BART-based sequence-to-sequence generation model.
+
+- **Enhanced Reranking Mechanism**: The proposed reranking approach allows for the merging of retrieval results from different sources (e.g., BM25 and neural retrieval) with incomparable scores, improving the overall retrieval quality.
+
+- **End-to-End Training Methodology**: A unique variation of knowledge distillation is introduced to train the initial retrieval, reranker, and generation components using only the ground truth of the target sequence output, enabling a more efficient training process.
+
+## Approach
+- **Neural Initial Retrieval**
+ - The framework employs a neural retrieval mechanism to initially retrieve relevant passages from a large corpus based on the input query.
+
+- **Reranking Mechanism**
+ - A reranking component is utilized to improve the quality of the retrieved passages. This component merges results from different retrieval methods (e.g., BM25 and neural retrieval) and ranks them based on their relevance to the query.
+
+- **BART-based Sequence-to-Sequence Generation**
+ - The generation component is based on the BART (Bidirectional and Auto-Regressive Transformer) model, which generates the final output sequence by conditioning on the reranked passages.
+
+- **Training Phases**: The training process consists of several phases:
+ - **DPR Training**: Initial training of the Dense Passage Retrieval (DPR) model using provenance ground truth.
+ - **Reranking Training**: Training the reranker using the results from the initial retrieval.
+ - **Generation Training**: Training the BART model on the target output sequence.
+ - **Full End-to-End Training**: Combining all components and training them together to optimize performance.
+
+- **Knowledge Distillation**
+ - The approach incorporates online knowledge distillation, where the reranker serves as a teacher model to provide labels to the DPR student model, enhancing the retrieval performance.
+
+- **Inference Process**
+ - During inference, the query is encoded, and the top passages from both the DPR and BM25 retrieval methods are obtained. These passages are then reranked, and the top passages are used to generate the final output through the BART model.
+
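+The inference path described above can be sketched as follows; `dpr_search`, `bm25_search`, `rerank_score`, and `bart_generate` are placeholders, and the candidate counts are illustrative assumptions.
+
+```python
+# Sketch of Re2G inference: merge DPR and BM25 candidates, rerank them with a
+# cross-encoder so their scores become comparable, then condition BART on the top passages.
+def re2g_infer(query, dpr_search, bm25_search, rerank_score, bart_generate,
+               n_initial=12, n_context=5):
+    # Initial retrieval from two sources whose raw scores are not comparable.
+    candidates = dpr_search(query, n_initial) + bm25_search(query, n_initial)
+    # The reranker puts all candidates on a single scale.
+    reranked = sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)
+    # Generate conditioned on the top reranked passages.
+    return bart_generate(query, reranked[:n_context])
+```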
+## Dataset/Testing
+**Dataset**
+
+- **T-REx**
+ - Task: Slot Filling
+ - Description: Provides input as a head entity and relation, expecting the output to be the entity or term that fills the slot.
+
+- **Natural Questions**
+ - Task: Question Answering
+ - Description: An open version of the dataset where relevant Wikipedia pages must be found through a retrieval step.
+
+- **TriviaQA**
+ - Task: Question Answering
+ - Description: Similar to Natural Questions, it requires retrieving relevant information from Wikipedia.
+
+- **FEVER**
+ - Task: Fact Checking
+ - Description: A classification task that is framed as a generation task, where the model generates either "SUPPORTS" or "REFUTES" based on the evidence.
+
+- **Wizard of Wikipedia**
+ - Task: Dialog
+ - Description: Involves generating conversational responses based on a short dialog history and relevant Wikipedia content.
+
+**Testing**
+- The Re²G framework was tested by evaluating its performance on the aforementioned datasets from the KILT benchmark. The authors measured various performance metrics, including:
+
+ - R-Precision
+ - Recall@5
+ - Accuracy
+ - F1 Score
+ - KILT-specific metrics (KILT-AC, KILT-F1, KILT-Rouge-L for dialog tasks)
+- The results were compared against previous state-of-the-art models on the KILT leaderboard, demonstrating the effectiveness of the Re²G approach across different knowledge-intensive tasks.
+
+## Result
+- **T-REx (Slot Filling)**
+ - **R-Precision**: 81.24
+ - **Recall@5**: 88.58
+ - **Accuracy**: 86.60
+ - **F1 Score**: 89.20
+ - **KILT-AC**: 75.66
+ - **KILT-F1**: 77.08
+ - **Relative Gain**: 9% over previous state-of-the-art.
+
+- **Natural Questions (Question Answering)**
+ - **R-Precision**: 70.92
+ - **Recall@5**: 74.79
+ - **Accuracy**: 46.70
+ - **F1 Score**: 62.44
+ - **KILT-AC**: 39.23
+ - **KILT-F1**: 50.90
+ - **Relative Gain**: 31% over previous state-of-the-art.
+
+- **TriviaQA (Question Answering)**
+ - **R-Precision**: 72.01
+ - **Recall@5**: 73.16
+ - **Accuracy**: 74.01
+ - **F1 Score**: 80.86
+ - **KILT-AC**: 56.04
+ - **KILT-F1**: 60.91
+ - **Relative Gain**: 34% over previous state-of-the-art.
+
+- **FEVER (Fact Checking)**
+ - **R-Precision**: 90.06
+ - **Recall@5**: 92.91
+ - **Accuracy**: 91.05
+ - **KILT-AC**: 80.56
+ - **Relative Gain**: 22% over previous state-of-the-art.
+
+- **Wizard of Wikipedia (Dialog)**
+ - **R-Precision**: 56.48
+ - **Recall@5**: 74.00
+ - **Rouge-L**: 17.29
+ - **F1 Score**: 19.35
+ - **KILT-RL**: 11.37
+ - **KILT-F1**: 12.75
+ - **Relative Gain**: 10% over previous state-of-the-art.
+
+## Findings
+- **Significant Performance Improvements**: The Re²G framework achieved substantial relative gains (9% to 34%) over previous state-of-the-art models across various tasks in the KILT benchmark, demonstrating its effectiveness in knowledge-intensive NLP tasks.
+
+- **Effectiveness of Reranking**: The incorporation of a reranking mechanism allowed for better merging of retrieval results from different sources, leading to improved retrieval quality and overall performance.
+
## Limitations
-- **Dependence on Ground Truth Completeness**:
- - The model's performance is significantly affected by the completeness of the ground truth data.
- - Instances of ambiguity in head entities and multiple possible fillers for relations can lead to errors in output.
-- **Challenges in End-to-End Training**:
- - The end-to-end training process presents challenges, particularly in ensuring that the query encoder's gradients are effectively utilized.
- - The proposed solutions to address this issue (combining scores, freezing the query encoder, and online knowledge distillation) may not universally apply across all datasets.
+- **Challenges in TriviaQA**: The model experienced a decline in retrieval metrics for the TriviaQA dataset, suggesting potential issues with the completeness of the provenance ground truth for this specific task.
+
+- **Limited Improvement in Some Tasks**: While the model showed significant gains in most tasks, the improvements were less pronounced in certain areas, such as the Wizard of Wikipedia dialog task, where it ranked second to a newer system.
+
+- **Complexity of Training**: The end-to-end training process, while beneficial, adds complexity to the training pipeline, which may require careful tuning and optimization.
+
+## Scope
+- **Future Research Directions**: The paper suggests that further experiments could be conducted on domain adaptation of the Re²G framework for specific tasks like question answering or dialog, which may provide insights into its application in real-world scenarios.
# Active Retrieval Augmented Generation
**Domain**: RAG
@@ -319,17 +1865,93 @@ The paper presents a novel approach called Re2G (Retrieve, Rerank, Generate), wh
**DOI**: [https://doi.org/10.18653/v1/2023.emnlp-main.495](https://doi.org/10.18653/v1/2023.emnlp-main.495)
+**Published**: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)
+
+**Authors**:
+- [Zhengbao Jiang](https://aclanthology.org/people/z/zhengbao-jiang/), [Frank Xu](https://aclanthology.org/people/f/frank-f-xu/), [Luyu Gao](https://aclanthology.org/people/l/luyu-gao/), [Zhiqing Sun](https://aclanthology.org/people/z/zhiqing-sun/), [Yiming Yang](https://aclanthology.org/people/y/yiming-yang/), [Jamie Callan](https://aclanthology.org/people/j/jamie-callan/), [Graham Neubig](https://aclanthology.org/people/g/graham-neubig/), _Language Technologies Institute, Carnegie Mellon University_
+- [Qian Liu](https://aclanthology.org/people/q/qian-liu/), _Sea AI Lab_
+- [Jane Dwivedi-Yu](https://aclanthology.org/people/j/jane-dwivedi-yu/), _FAIR, Meta_
+
## Summary
The paper presents a novel approach called Forward-Looking Active Retrieval Augmented Generation (FLARE), which enhances the capabilities of large language models (LMs) by integrating an active retrieval mechanism during the text generation process. Traditional retrieval-augmented LMs typically retrieve information only once based on the initial input, which can be limiting for long-form generation tasks that require ongoing access to relevant information. FLARE addresses this limitation by allowing the model to actively decide when and what to retrieve based on the confidence of the generated content. By predicting the upcoming sentence and using it as a query for retrieval, FLARE can gather additional information dynamically, thereby improving the accuracy and relevance of the generated text. The authors conducted comprehensive experiments across four knowledge-intensive long-form generation tasks, demonstrating that FLARE outperforms existing retrieval methods, including both single-time and multi-time retrieval baselines. The results indicate that FLARE's active retrieval strategy significantly enhances the model's performance, particularly in tasks requiring complex reasoning and information synthesis. The paper concludes by highlighting the effectiveness and generalizability of FLARE, suggesting future directions for improving active retrieval strategies and developing efficient architectures for integrating information retrieval with language generation.
+## Issues Targeted
+- **Hallucination in Language Models**: Large language models (LMs) often generate factually inaccurate content, known as hallucination.
+
+- **Limitations of Existing Retrieval-Augmented Models**: Current models typically use a single retrieval step based on the input, which is insufficient for long-form generation tasks that require ongoing information gathering.
+
+- **Need for Active Retrieval**: There is a necessity for models that can actively decide when and what to retrieve during the generation process, rather than relying on passive retrieval methods.
+
+## Contribution/Novelty
+- **Introduction of Active Retrieval Augmented Generation Framework**: The paper proposes a novel framework for active retrieval augmented generation, which allows models to dynamically decide when and what information to retrieve during the text generation process.
+
+- **Forward-Looking Active Retrieval Method (FLARE)**: The introduction of FLARE, a method that anticipates future content by generating a temporary next sentence and using it as a query for retrieval, represents a significant advancement in retrieval strategies.
+
+- **Iterative Retrieval Process**: FLARE enables an iterative process where the model can continuously gather relevant information throughout the generation, addressing the limitations of single-time retrieval methods.
+
+- **Confidence-Based Retrieval Triggering**: The framework incorporates a mechanism that triggers retrieval based on the confidence of generated tokens, ensuring that information is only retrieved when the model is uncertain, thus optimizing the retrieval process.
+
+## Approach
+- **Forward-Looking Active Retrieval (FLARE)**: FLARE is the core method proposed, which involves the following steps:
+ - **Temporary Sentence Generation**: The model generates a temporary next sentence based on the user input and previously generated content.
+ - **Confidence Assessment**: The model checks the generated sentence for low-confidence tokens (tokens with low probability).
+ - **Dynamic Retrieval**: If low-confidence tokens are detected, the temporary sentence is used as a query to retrieve relevant documents from an external knowledge corpus.
+ - **Regeneration**: The model regenerates the next sentence using the retrieved documents, ensuring that the output is informed by accurate and relevant information.
+
+- **Iterative Process**: The approach is iterative, allowing the model to continuously generate sentences, assess confidence, retrieve information, and regenerate until the end of the generation task is reached.
+
+- **Query Formulation Strategies**: Two methods for query formulation are employed:
+ - **FLARE with Retrieval Instructions**: The model generates explicit search queries when additional information is needed, guided by retrieval-related instructions.
+ - **Direct FLARE**: The model directly uses the generated temporary sentence as a query for retrieval, enhancing the relevance of the information retrieved.
+
+- **Confidence-Based Active Retrieval**
+ - The approach incorporates a confidence threshold to determine when to trigger retrieval, ensuring that unnecessary retrievals are avoided and that the model only seeks additional information when it lacks knowledge.
+
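+A compact sketch of the Direct FLARE loop described above; the confidence threshold value and the `lm_next_sentence` / `retrieve` interfaces are assumptions made for illustration.
+
+```python
+# Sketch of Direct FLARE: draft the next sentence, and if any token is low-confidence,
+# use the draft as a retrieval query and regenerate the sentence with the evidence.
+def flare_generate(question, lm_next_sentence, retrieve, max_sentences=10, theta=0.6):
+    output = ""
+    for _ in range(max_sentences):
+        draft, token_probs = lm_next_sentence(question, output, evidence=None)
+        if not draft:
+            break  # generation finished
+        if min(token_probs) < theta:
+            # Low confidence: retrieve with the forward-looking draft, then regenerate.
+            docs = retrieve(draft)
+            draft, _ = lm_next_sentence(question, output, evidence=docs)
+        output += draft
+    return output
+```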
+## Dataset/Testing
+**Dataset**
+- **2WikiMultihopQA**
+ - **Task**: Multihop Question Answering
+ - **Description**: This dataset contains complex questions that require reasoning across multiple Wikipedia articles to arrive at the final answer.
+
+- **StrategyQA**
+ - **Task**: Commonsense Reasoning
+ - **Description**: A collection of yes/no questions that require commonsense knowledge to generate accurate answers.
+
+- **ASQA**
+ - **Task**: Long-Form Question Answering
+ - **Description**: This dataset consists of ambiguous questions that can have multiple interpretations, requiring comprehensive answers that cover all possible aspects.
+
+- **WikiAsp**
+ - **Task**: Open-Domain Summarization
+ - **Description**: A dataset designed for generating aspect-based summaries about entities from Wikipedia, focusing on gathering information from multiple sources.
+
+**Testing**
+- **Few-Shot In-Context Learning**: The authors employed few-shot in-context learning techniques, where they used a limited number of examples from each dataset to guide the model's performance during evaluation.
+
+- **Evaluation Metrics**: The performance of FLARE was assessed using various metrics specific to each task, including Exact Match (EM), F1 score, ROUGE, and Disambiguation F1 (D-F1), among others.
+
+## Result
+- **Task-Specific Results**:
+ - **2WikiMultihopQA**:
+ - FLARE achieved a significant improvement in Exact Match (EM) scores compared to all baseline methods, indicating its strong capability in handling complex multihop reasoning questions.
+ - **StrategyQA**:
+ - FLARE showed competitive performance, surpassing single-time retrieval models and demonstrating its effectiveness in commonsense reasoning tasks.
+ - **ASQA**:
+ - In the long-form question answering task, FLARE provided comprehensive answers that addressed multiple interpretations of ambiguous questions, outperforming baseline models.
+ - **WikiAsp**:
+ - FLARE excelled in generating aspect-based summaries, achieving higher scores in metrics such as ROUGE and named entity-based F1 compared to the baselines.
+
+## Findings
+- **Confidence-Based Retrieval**: The approach of triggering retrieval based on the confidence of generated tokens was effective, as it minimized unnecessary retrievals and focused on enhancing the accuracy of the generated content.
+
+- **Importance of Forward-Looking Queries**: The findings highlighted that using future-oriented queries (i.e., the next sentence) for retrieval was more beneficial than relying on past context, leading to improved generation quality.
+
## Limitations
-- **Increased Overheads**:
- - Interleaving generation and retrieval can increase computational overhead and costs.
- - Each retrieval requires activating the language model multiple times, which can be inefficient.
-- **Performance in Specific Datasets**:
- - FLARE did not provide significant gains on certain datasets like Wizard of Wikipedia and ELI5.
- - The Wizard of Wikipedia dataset involves relatively short outputs, making multiple retrievals unnecessary.
- - ELI5 requires in-depth answers to open-ended questions, which presents challenges in grounding generation in retrieval.
+- **Performance on Certain Datasets**: The paper noted that FLARE did not provide significant gains on datasets like Wizard of Wikipedia and ELI5, where the output is relatively short or where grounding generation in retrieval is challenging.
+
+- **Increased Computational Overhead**: The interleaving of generation and retrieval processes can increase computational overhead, as the model needs to be activated multiple times for each retrieval, which may impact efficiency.
+
+## Scope
+- **Future Research Directions**: The paper suggests exploring better strategies for active retrieval, such as refining the mechanisms for determining when to retrieve and improving the efficiency of the integration of retrieval and generation.
# Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
**Domain**: PEFT + RAG
@@ -338,16 +1960,100 @@ The paper presents a novel approach called Forward-Looking Active Retrieval Augm
**DOI**: [https://doi.org/10.1145/3673791.3698415](https://doi.org/10.1145/3673791.3698415)
+**Published**: SIGIR-AP (2024)
+
+**Authors**:
+- [Heydar Soudani](https://www.semanticscholar.org/author/Heydar-Soudani/2165569122), [Faegheh Hasibi](https://www.semanticscholar.org/author/Faegheh-Hasibi/1951737), _Radboud University, Nijmegen_
+- [E. Kanoulas](https://www.semanticscholar.org/author/E.-Kanoulas/1713134), _University of Amsterdam_
+
## Summary
The paper investigates the effectiveness of two prominent approaches—Retrieval Augmented Generation (RAG) and Fine-Tuning (FT)—in enhancing the performance of language models (LMs) when dealing with less popular or low-frequency knowledge. The authors conducted extensive experiments on twelve different LMs, exploring various fine-tuning methods, data augmentation techniques, and retrieval models. The findings reveal that while fine-tuning improves performance across various entities, RAG significantly outperforms FT, especially for the least popular factual knowledge. Additionally, the success of both approaches is enhanced by optimizing retrieval and data augmentation techniques. The study highlights that fine-tuning, although beneficial for smaller models, requires substantial resources, leading to the proposal of a new method called Stimulus RAG (SRAG), which effectively eliminates the need for costly fine-tuning and data augmentation.
The research emphasizes the importance of customizing LMs for less-resourced domains, particularly in applications like question answering systems that require accurate responses about specialized knowledge. The results indicate that while fine-tuning can enhance the accuracy of LMs, RAG provides a more effective solution for integrating less popular knowledge. The paper concludes that the SRAG approach not only surpasses the performance of fine-tuned models but also offers a cost-effective alternative for enriching LMs with factual knowledge, thereby addressing the challenges associated with data scarcity in specialized domains.
+## Issues Targeted
+- **Performance Diminishment on Low-Frequency Concepts**: Language Models (LMs) struggle with less popular or low-frequency concepts, particularly in domain-specific applications.
+
+- **Comparison of Knowledge Injection Methods**: The paper investigates the effectiveness of two prominent approaches: Retrieval Augmented Generation (RAG) and Fine-Tuning (FT) for enhancing LMs' performance on less popular knowledge.
+
+- **Temporal Degradation**: LMs may experience degradation in performance over time, particularly with less frequently encountered knowledge.
+
+## Contribution/Novelty
+- **Comprehensive Comparison of RAG and FT**: The paper provides an extensive evaluation of Retrieval Augmented Generation (RAG) and Fine-Tuning (FT) methods for question answering over less popular factual knowledge, highlighting their effectiveness across various setups.
+
+- **Introduction of Stimulus RAG (SRAG)**: The paper proposes a novel RAG approach called Stimulus RAG, which enhances the performance of LMs by guiding them to generate correct responses based on hints extracted from retrieved documents. This approach aims to eliminate the need for costly fine-tuning processes.
+
+## Approach
+- **Knowledge Injection with Fine-Tuning (FT)**
+ - **Data Augmentation**: Utilizes synthetic question-answer (QA) pairs generated from relevant documents to fine-tune LMs.
+ - **QA Generation Methods**:
+ - **End-to-End (E2E) QA Generation**: A fine-tuned sequence-to-sequence model generates QA pairs directly from documents.
+ - **Prompt-Based QA Generation**: An instruction-tuned LM generates QA pairs using a structured prompt.
+
+- **Knowledge Injection with Retrieval Augmented Generation (RAG)**
+ - **Retriever Component**: Identifies and ranks relevant documents from a document corpus based on the input query using both sparse (BM25) and dense (DPR, Contriever) retrieval models.
+ - **Generator Component**: A generative LM synthesizes answers based on the retrieved documents and the input query.
+
+- **Stimulus RAG (SRAG) Approach**
+ - A novel RAG method that enhances the generation of responses by providing hints extracted from the top-ranked retrieved documents.
+ - **Hint Extraction**: The most relevant sentence from the retrieved documents is identified and used as a hint to guide the LM in generating accurate responses.
+
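+A rough sketch of the hint-extraction step described above; the lexical-overlap scorer and the prompt layout are assumptions for illustration and may differ from the paper's actual SRAG pipeline.
+
+```python
+# Illustrative hint extraction for a Stimulus-RAG-style prompt.
+# The overlap-based scorer is an assumption, not the paper's exact method.
+
+import re
+
+def split_sentences(text):
+    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
+
+def extract_hint(query, retrieved_docs):
+    # Pick the retrieved sentence with the largest term overlap with the query.
+    query_terms = set(query.lower().split())
+    best_sentence, best_score = "", -1
+    for doc in retrieved_docs:
+        for sentence in split_sentences(doc):
+            score = len(query_terms & set(sentence.lower().split()))
+            if score > best_score:
+                best_sentence, best_score = sentence, score
+    return best_sentence
+
+def build_prompt(query, retrieved_docs):
+    hint = extract_hint(query, retrieved_docs)
+    context = "\n".join(retrieved_docs)
+    return f"Context:\n{context}\n\nHint: {hint}\n\nQuestion: {query}\nAnswer:"
+
+docs = ["George Rankin was an Australian politician. He served in the Senate."]
+print(build_prompt("What is the occupation of George Rankin?", docs))
+```
+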
+## Dataset/Testing
+**Dataset**
+- **PopQA**: An open-domain question-answering dataset focused on long-tail entities, constructed from diverse relationship types in Wikidata.
+
+- **WitQA**: Another open-domain QA dataset that emphasizes entity-centric questions, utilizing Wikipedia pageviews as a proxy for entity popularity.
+
+- **EntityQuestion (EQ)**: A QA dataset that covers long-tail entities, sampling knowledge triples from Wikidata based on frequency distributions.
+
+## Result
+- **Comparison of Fine-Tuning (FT) and Retrieval Augmented Generation (RAG)**
+ - RAG significantly outperformed FT across all tested models, particularly for less popular factual knowledge.
+ - Fine-tuned models combined with RAG either outperformed or performed on par with vanilla LMs using RAG in most cases.
+
+- **Effectiveness of Fine-Tuning Methods**
+ - Full Fine-Tuning (FT) was found to be more effective than Parameter Efficient Fine-Tuning (PEFT) for LMs with fewer than 2 billion parameters.
+ - PEFT preserved the reasoning ability of LMs and outperformed full FT when combined with RAG.
+
+- **Data Augmentation Method Impact**
+ - Prompt-based QA generation methods produced higher quality synthetic data compared to end-to-end (E2E) methods, leading to better performance in downstream tasks.
+
+- **LM Type and Size Analysis**
+ - Decoder-only models consistently outperformed encoder-decoder models of similar sizes.
+ - Smaller fine-tuned LMs with RAG could perform on par with, or better than, larger LMs, demonstrating that a small fine-tuned LM (e.g., StableLM2) could outperform larger models (e.g., Llama3).
+
+- **Retrieval Model Performance**
+ - The performance of the RAG system improved with higher-performing retrieval models, indicating a direct correlation between retrieval effectiveness and overall QA accuracy.
+ - As the popularity of factual knowledge increased, the performance of the retriever decreased, highlighting the challenges in retrieving information for more popular entities.
+
+- **Stimulus RAG (SRAG) Performance**
+ - The proposed SRAG approach outperformed all other combinations of fine-tuning and RAG setups, demonstrating that guiding LMs with hints extracted from retrieved documents can enhance accuracy without the need for extensive fine-tuning.
+
+- **Overall Accuracy Trends**
+ - The results showed that RAG significantly increased accuracy for the least popular entities, while FT improved accuracy across all popularity levels.
+ - The accuracy of models decreased from less popular to more popular buckets, but increased again in the most popular bucket, suggesting that LMs can rely on their internal knowledge for popular entities.
+
+## Findings
+- **RAG vs. FT Performance**: Retrieval Augmented Generation (RAG) consistently outperformed Fine-Tuning (FT) in enhancing the performance of language models (LMs) on less popular factual knowledge.
+
+- **Impact of Fine-Tuning Methods**: Full Fine-Tuning (FT) was more effective than Parameter Efficient Fine-Tuning (PEFT) for smaller LMs, but PEFT maintained reasoning abilities when combined with RAG.
+
+- **Model Size and Type Effects**: Smaller fine-tuned LMs with RAG could outperform larger models, indicating that model size does not always correlate with performance, especially in the context of less popular knowledge.
+
+- **Stimulus RAG (SRAG) Effectiveness**: The proposed SRAG approach outperformed all other fine-tuning and RAG configurations, demonstrating that providing hints from retrieved documents can significantly enhance accuracy.
+
## Limitations
- **Resource Intensity of Fine-Tuning**: Fine-tuning methods require significant computational resources and extensive training data, which may not be feasible for all applications, particularly in less-resourced domains.
- **Complexity of Implementation**: The proposed Stimulus RAG (SRAG) method, while effective, may introduce additional complexity in implementation compared to traditional fine-tuning or RAG methods.
+- **Dependence on Retrieval Quality**: The effectiveness of the RAG system hinged on the strength of the retrieval model, and retriever performance varied with the popularity of the factual knowledge, so gains were not uniform across entities.
+
+- **Limited Exploration of Retrieval Models**: Although several retrieval models were evaluated, the analysis focused on a few specific ones, which may limit how broadly the conclusions about retrieval effectiveness generalize.
+
+## Scope
+- **Exploration of Other Domains**: Testing the proposed methods on a wider range of datasets and domains to assess generalizability and effectiveness in different contexts.
+
# Evaluating Retrieval Quality in Retrieval-Augmented Generation
**Domain**: RAG
@@ -356,11 +2062,74 @@ The research emphasizes the importance of customizing LMs for less-resourced dom
**DOI**: [https://doi.org/10.1145/3626772.3657957](https://doi.org/10.1145/3626772.3657957)
+**Published**: SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024)
+
+**Authors**: [Alireza Salemi](https://www.webofscience.com/wos/author/record/47179810), [Hamed Zamani](https://www.webofscience.com/wos/author/record/18413635), _University of Massachusetts Amherst_
+
## Summary
The paper introduces a novel evaluation approach called eRAG for assessing retrieval models within Retrieval-Augmented Generation (RAG) systems. Traditional end-to-end evaluation methods are computationally expensive and often fail to correlate well with the downstream performance of RAG systems. eRAG addresses these issues by utilizing a large language model (LLM) to generate document-level relevance labels based on the output produced for each document in the retrieval list. This method not only enhances the correlation with downstream performance—showing improvements in Kendall’s tau correlation ranging from 0.168 to 0.494—but also significantly reduces computational costs, consuming up to 50 times less GPU memory compared to end-to-end evaluations.
The authors conducted extensive experiments across various datasets, demonstrating that eRAG consistently outperforms baseline methods in terms of correlation with the LLM's performance. The findings suggest that eRAG is more efficient in both inference time and memory utilization, making it a promising approach for evaluating retrieval models in RAG systems. The implementation of eRAG is made publicly available to facilitate further research in this domain.
+## Issues Targeted
+- **Challenges in Evaluating RAG Systems**
+ - Traditional end-to-end evaluation methods are computationally expensive.
+ - Lack of transparency in determining which retrieved documents contributed to the generated output.
+ - Resource-intensive processes requiring significant time and GPU memory, especially with large retrieval results.
+
+- **Correlation with Downstream Performance**
+ - Evaluation of retrieval model performance based on query-document relevance labels shows a small correlation with the downstream performance of RAG systems.
+ - Human annotations for relevance can be costly and impractical, leading to a lack of meaningful relationship between evaluated metrics and downstream performance.
+
+## Contribution/Novelty
+- **Introduction of eRAG Evaluation Method**: The paper proposes a novel evaluation approach called eRAG, which evaluates retrieval models in retrieval-augmented generation (RAG) systems by utilizing the large language model (LLM) to generate document-level annotations based on the output for each document.
+
+- **Document-Level Annotations**: eRAG generates relevance labels for each document in the retrieval list by evaluating the LLM's output against the ground truth labels of the downstream task, allowing for a more precise assessment of each document's contribution.
+
+- **Computational Efficiency**: eRAG offers substantial computational advantages, improving runtime and consuming up to 50 times less GPU memory than traditional end-to-end evaluation methods, making it a more efficient alternative.
+
+## Approach
+- **Utilization of Large Language Model (LLM)**
+ - Each document in the retrieval list is individually processed by the LLM, which generates an output based on the query and the document.
+ - The output generated for each document is then evaluated against the expected downstream task output (ground truth) to create relevance labels.
+
+- **Document-Level Annotation Generation**
+ - The relevance label for each document is derived from the LLM's performance on the downstream task, expressed as $G_q[d] = E_M(M(q, \{d\}), y) \quad \forall d \in R_k$, where $G_q[d]$ is the relevance score for document $d$, $M$ is the LLM, $E_M$ is the downstream evaluation metric, $q$ is the query, and $y$ is the expected output (see the sketch after this list).
+
+- **Evaluation Metrics**
+ - Various downstream task metrics (e.g., accuracy, exact match, ROUGE) are employed to obtain document-level annotations.
+ - These annotations are then aggregated using set-based or ranking metrics to produce a single evaluation score for the retrieval result list.
+
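+A compact sketch of the per-document labeling and aggregation described above; the stub LLM call, the exact-match metric, and the precision@k aggregation are placeholders rather than the released eRAG implementation.
+
+```python
+# Illustrative eRAG-style evaluation: label each retrieved document by the
+# downstream metric of the LLM's output when that document alone is provided,
+# then aggregate the labels into a retrieval score. All components are stand-ins.
+
+def llm_answer(query, document):
+    # Stand-in for M(q, {d}).
+    return "paris" if "capital" in document.lower() else "unknown"
+
+def exact_match(prediction, gold):
+    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0
+
+def erag_labels(query, gold, retrieved_docs):
+    # G_q[d] = E_M(M(q, {d}), y) for every d in the ranked list.
+    return [exact_match(llm_answer(query, d), gold) for d in retrieved_docs]
+
+def precision_at_k(labels, k):
+    return sum(labels[:k]) / k
+
+docs = ["Paris is the capital of France.", "France is in Europe."]
+labels = erag_labels("What is the capital of France?", "Paris", docs)
+print(labels, precision_at_k(labels, k=2))
+```
+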
+## Dataset/Testing
+**Datasets**
+
+- Natural Questions (NQ)
+- TriviaQA
+- HotpotQA
+- FEVER
+- Wizard of Wikipedia (WoW)
+
+**Testing**
+
+- **Validation Set Usage**: Due to the unavailability of ground truth labels for the test set, the authors utilize the publicly accessible validation set from each of the mentioned datasets for their experiments.
+
+- **Retrieval Corpus**: The retrieval corpus employed is the Wikipedia dump associated with the KILT benchmark, with documents segmented into passages (maximum length of 100 words) for evaluation.
+
+- **Document-Level Relevance Labels**: The KILT benchmark provides document-level relevance labels (called Provenance) for its datasets, which are used to evaluate retrieval performance.
+
+## Result
+- **Correlation with Downstream Performance**
+ - The eRAG approach achieves significantly higher correlation with the downstream performance of RAG systems compared to baseline methods.
+ - Improvements in Kendall’s tau correlation range from **0.168** to **0.494** across the evaluated datasets.
+
+- **Computational Efficiency**: eRAG demonstrates significant computational advantages:
+ - **Runtime**: eRAG is, on average, 2.468 times faster than end-to-end evaluation.
+ - **Memory Consumption**: eRAG consumes up to 50 times less GPU memory compared to traditional end-to-end evaluation methods.
+
+## Findings
+- **Higher Correlation with Downstream Performance**: The eRAG method shows a significantly higher correlation with the downstream performance of RAG systems compared to traditional evaluation methods, indicating its effectiveness in assessing retrieval models.
+
+- **Efficiency Gains**: eRAG is computationally more efficient, being on average **2.468 times faster** and consuming up to **50 times less GPU memory** than end-to-end evaluation methods.
## Limitations
- **Dependency on LLM’s Internal Mechanisms**: eRAG evaluates retrieval quality based on the downstream task performance of the LLM. This creates a dependency on the LLM’s internal mechanisms, making it difficult to generalize results across different models. If an LLM processes retrieved documents differently, the evaluation may not accurately reflect retrieval effectiveness.
@@ -369,6 +2138,9 @@ The authors conducted extensive experiments across various datasets, demonstrati
- **Potential Sensitivity to LLM Size and Architecture**: The correlation analysis shows variability depending on the LLM size (T5-small vs. T5-base) and retrieval augmentation strategy (Fusion-in-Decoder vs. In-Prompt Augmentation). The lack of significant performance differences suggests that eRAG’s reliability across different architectures is not fully established.
+## Scope
+- **Future Research Directions**: The paper opens avenues for further research in improving the evaluation of retrieval models, particularly in exploring different LLM architectures and their impact on retrieval performance.
+
# Benchmarking Large Language Models in Retrieval-Augmented Generation
**Domain**: RAG
@@ -377,18 +2149,125 @@ The authors conducted extensive experiments across various datasets, demonstrati
**DOI**: https://doi.org/10.1609/aaai.v38i16.29728
+**Published**: AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence (2024)
+
+**Authors**: [Jiawei Chen](https://dl.acm.org/profile/99661467991), [Hongyu Lin](https://dl.acm.org/profile/99660092537), [Xianpei Han](https://dl.acm.org/profile/81447596396), [Le Sun](https://dl.acm.org/profile/81310489120), _Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China_
+
## Summary
The paper investigates the effectiveness of Retrieval-Augmented Generation (RAG) in enhancing the performance of large language models (LLMs) while addressing challenges such as factual hallucination and outdated knowledge. The authors establish a new benchmark, the Retrieval-Augmented Generation Benchmark (RGB), which evaluates LLMs on four fundamental abilities: noise robustness, negative rejection, information integration, and counterfactual robustness. The benchmark consists of instances generated from the latest news articles and external documents retrieved via search engines, allowing for a comprehensive assessment of LLMs' capabilities in utilizing retrieved information.
The evaluation of six state-of-the-art LLMs reveals that while RAG can improve response accuracy, significant limitations remain in the models' ability to handle noise, reject irrelevant information, integrate data from multiple sources, and identify factual errors in retrieved documents. The findings indicate that LLMs often struggle with noise confusion, fail to reject inappropriate answers, and lack the ability to effectively summarize information from various documents. The authors emphasize the need for further advancements in RAG methodologies to ensure reliable and accurate responses from LLMs, highlighting the importance of careful design and evaluation in the application of RAG techniques.
+## Issues Targeted
+- **Factual Hallucination**: LLMs often generate responses that are factually incorrect or irrelevant.
+
+- **Knowledge Outdating**: The internal knowledge of LLMs can become outdated, leading to inaccuracies in responses.
+
+- **Lack of Domain-Specific Expertise**: LLMs may not possess the necessary expertise in specific domains, affecting their performance in specialized tasks.
+
+- **Noise Robustness**: The ability of LLMs to extract useful information from documents that contain irrelevant or noisy information.
+
+## Contribution/Novelty
+- **Development of the Retrieval-Augmented Generation Benchmark (RGB)**: The paper introduces RGB, a new benchmark designed specifically to evaluate the RAG capabilities of LLMs. This benchmark is unique as it assesses four fundamental abilities: noise robustness, negative rejection, information integration, and counterfactual robustness.
+
+- **Systematic Evaluation of LLMs**: The paper conducts a comprehensive evaluation of six state-of-the-art LLMs using the RGB benchmark. This evaluation highlights the limitations of current models in effectively utilizing retrieved information and handling various challenges associated with RAG.
+
+- **Identification of Key Challenges**: Through the evaluation, the paper identifies specific challenges that LLMs face when applying RAG, such as confusion from similar information, failure to reject irrelevant answers, and difficulties in integrating information from multiple sources.
+
+## Approach
+- **Creation of the Retrieval-Augmented Generation Benchmark (RGB)**
+ - **Corpus Development**: The RGB benchmark is constructed using the latest news articles to ensure that the instances reflect current knowledge and minimize bias from the internal knowledge of LLMs.
+ - **Query Generation**: Events, questions, and answers are generated from the news articles by prompting ChatGPT, which helps filter out irrelevant content.
+
+- **Data Retrieval Process**
+ - **External Document Retrieval**: For each query, relevant documents are fetched using a search engine (Google's API). The top snippets are extracted and converted into text chunks for evaluation.
+ - **Document Classification**: The retrieved documents are categorized into positive (containing the answer) and negative (not containing the answer) based on their relevance to the query.
+
+- **Testbed Construction**: The RGB benchmark is divided into four testbeds, each targeting a specific ability required for RAG:
+ - **Noise Robustness**: Evaluates the model's ability to extract useful information from documents that contain noise.
+ - **Negative Rejection**: Assesses whether the model can decline to answer when no relevant information is available.
+ - **Information Integration**: Tests the model's capability to integrate information from multiple documents to answer complex questions.
+ - **Counterfactual Robustness**: Evaluates the model's ability to identify and handle factual errors in the retrieved documents.
+
+- **Evaluation Metrics**: The paper employs various metrics to assess the performance of LLMs:
+ - **Accuracy**: Measures the correctness of answers for noise robustness and information integration.
+ - **Rejection Rate**: Evaluates the model's ability to reject questions when only noisy documents are provided.
+ - **Error Detection Rate**: Assesses whether the model can identify factual errors in the documents.
+ - **Error Correction Rate**: Measures the model's ability to provide correct answers after identifying errors.
+
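+A small sketch of how accuracy, rejection-rate, and error-detection-rate style metrics could be computed over model outputs; matching on fixed phrases is a simplification, not the benchmark's exact evaluation code.
+
+```python
+# Illustrative RGB-style metrics; the matching phrases are assumptions.
+
+def accuracy(predictions, answers):
+    hits = sum(ans.lower() in pred.lower() for pred, ans in zip(predictions, answers))
+    return hits / len(answers)
+
+def rejection_rate(predictions, phrase="I can not answer"):
+    # Fraction of responses that decline to answer when only noisy documents are given.
+    return sum(phrase.lower() in p.lower() for p in predictions) / len(predictions)
+
+def error_detection_rate(predictions, phrase="factual errors"):
+    # Fraction of responses that flag errors when the documents are counterfactual.
+    return sum(phrase.lower() in p.lower() for p in predictions) / len(predictions)
+
+preds = ["The answer is 2022.", "There are factual errors in the provided documents."]
+print(accuracy(preds, ["2022", "2021"]), rejection_rate(preds), error_detection_rate(preds))
+```
+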
+## Dataset/Testing
+The paper does not utilize a pre-existing dataset; instead, it creates a new dataset specifically for the evaluation of Retrieval-Augmented Generation (RAG) capabilities in large language models (LLMs). The dataset is referred to as the **Retrieval-Augmented Generation Benchmark (RGB)**. Here are the key aspects of how the dataset was constructed and tested:
+
+- **Data Collection**: The RGB benchmark is constructed using the latest news articles to ensure that the instances reflect current knowledge and minimize bias from the internal knowledge of LLMs.
+
+- **Question-Answer Generation**: The authors prompt ChatGPT to generate events, questions, and answers based on the collected news articles. This process helps filter out irrelevant content and ensures that the questions are relevant to the current context.
+
+- **External Document Retrieval**: For each generated query, relevant documents are fetched using a search engine (Google's API). The top snippets from these documents are extracted and converted into text chunks for evaluation.
+
+- **Document Classification**: The retrieved documents are categorized into:
+ - **Positive Documents**: Containing the correct answer to the query.
+ - **Negative Documents**: Not containing the answer but relevant to the query.
+
+- **Testbed Construction**: The RGB benchmark is divided into four testbeds, each targeting a specific ability required for RAG:
+ - Noise Robustness
+ - Negative Rejection
+ - Information Integration
+ - Counterfactual Robustness
+
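+The noise-robustness testbed described above can be pictured as mixing positive and negative documents at a target noise ratio; the helper below is purely illustrative, not the benchmark's generation code.
+
+```python
+# Build a retrieval context with a given fraction of noisy (negative) documents.
+
+import random
+
+def build_context(positive_docs, negative_docs, total=5, noise_ratio=0.4, seed=0):
+    rng = random.Random(seed)
+    n_noise = round(total * noise_ratio)
+    docs = rng.sample(negative_docs, n_noise) + rng.sample(positive_docs, total - n_noise)
+    rng.shuffle(docs)
+    return docs
+
+pos = [f"positive-{i}" for i in range(5)]
+neg = [f"negative-{i}" for i in range(5)]
+print(build_context(pos, neg, total=5, noise_ratio=0.4))
+```
+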
+## Result
+- **Noise Robustness**
+ - **Accuracy Performance:**
+ - LLMs demonstrated varying levels of accuracy under different noise ratios (0 to 0.8). For example:
+ - ChatGPT: Accuracy decreased from 96.33% (0 noise) to 76.00% (0.8 noise).
+ - ChatGLM2-6B: Accuracy decreased from 91.33% (0 noise) to 57.33% (0.8 noise).
+ - **Challenges Identified**:
+ - LLMs struggled with long-distance information, evidence uncertainty, and concept confusion, leading to incorrect answers when noise was present.
+
+- **Negative Rejection**
+ - **Rejection Rates**:
+ - The highest rejection rates for LLMs when only noisy documents were provided were:
+ - ChatGPT: 45% (English), 43.33% (Chinese).
+ - Other models showed lower rejection rates, indicating difficulty in declining to answer when relevant information was absent.
+ - **Instruction Adherence**:
+ - LLMs often failed to strictly follow instructions for rejection, leading to unpredictable responses.
+
+- **Information Integration**
+ - **Accuracy Performance**:
+ - The highest accuracy for LLMs in integrating information from multiple documents was only around 60% without noise, dropping to 43% with noise.
+ - **Complex Questions**:
+ - LLMs struggled significantly with complex questions that required integrating information from multiple sources, especially when noise was present.
+
+- **Counterfactual Robustness**
+ - **Error Detection and Correction:**
+ - LLMs had difficulty identifying and correcting factual errors in retrieved documents. For instance:
+ - ChatGPT: 33.33% error detection rate and a low error correction rate.
+ - **Dependence on Retrieved Information:**
+ - Even when LLMs had the correct internal knowledge, they often prioritized incorrect information from retrieved documents over their own knowledge.
+
+## Findings
+- **Performance Variability**: The evaluation of six state-of-the-art LLMs revealed significant variability in performance across different tasks related to Retrieval-Augmented Generation (RAG), particularly in noise robustness, negative rejection, information integration, and counterfactual robustness.
+
+- **Noise Impact**: Increasing noise ratios in external documents led to a marked decrease in accuracy for LLMs. For instance, ChatGPT's accuracy dropped from 96.33% to 76.00% as noise increased.
+
+- **Challenges in Negative Rejection**: LLMs struggled to reject questions when no relevant information was available, with rejection rates being relatively low (e.g., 45% for ChatGPT). This indicates a tendency to generate answers even when the information is insufficient.
+
+- **Integration Difficulties**: LLMs exhibited weak performance in integrating information from multiple documents, with accuracy only reaching around 60% for simple questions and dropping significantly under noisy conditions.
+
+- **Counterfactual Robustness Issues**: LLMs had difficulty detecting and correcting factual errors in retrieved documents, often prioritizing incorrect information over their internal knowledge.
+
## Limitations
- **Noise Confusion**: LLMs exhibit difficulty in distinguishing relevant information from noisy documents, leading to inaccurate answers when similar but incorrect information is present.
+- **Instruction Adherence**: LLMs often failed to follow instructions for rejecting questions or identifying errors, leading to unpredictable and unreliable outputs.
+
- **Negative Rejection Challenges**: The models often fail to reject questions when no relevant information is available in the retrieved documents, resulting in misleading or incorrect responses.
- **Limited Understanding of Complex Queries**: The models show a lack of capability in comprehending and addressing complex questions, which can lead to merging errors, ignoring parts of the question, or misalignment in responses.
+## Scope
+- **Benchmarking Framework**: The introduction of the RGB benchmark provides a structured framework for evaluating RAG capabilities in LLMs, which can be utilized in future studies to assess new models or improvements in existing ones.
+
# How Much Knowledge Can You Pack Into the Parameters of a Language Model?
**Domain**: Foundation
@@ -396,21 +2275,83 @@ The evaluation of six state-of-the-art LLMs reveals that while RAG can improve r
**DOI**: [https://doi.org/10.18653/v1/2020.emnlp-main.437](https://doi.org/10.18653/v1/2020.emnlp-main.437)
+**Published**: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020)
+
+**Authors**: [Adam Roberts](https://www.webofscience.com/wos/author/record/16227422), [Colin Raffel](https://www.webofscience.com/wos/author/record/13305514), [Noam Shazeer](https://www.webofscience.com/wos/author/record/14537225), _Google_
+
## Summary
The paper explores the extent to which pre-trained language models can store and retrieve knowledge without relying on external sources. The authors fine-tune pre-trained models, specifically variants of the Text-to-Text Transfer Transformer (T5), to perform closed-book question answering—answering factual questions without accessing external knowledge bases. Their experiments demonstrate that model performance improves with increasing model size, with the largest model (T5-11B) performing competitively with open-domain systems that explicitly retrieve information. They also investigate whether additional pre-training using techniques such as salient span masking (SSM) enhances knowledge retention.
The study highlights the trade-offs of storing knowledge within model parameters, noting that while closed-book models can achieve high accuracy, they lack transparency and control over what knowledge is stored. The authors identify challenges, such as the inability to update knowledge post-training and the tendency for models to generate hallucinated answers when uncertain. They also perform human evaluations to assess how well automated metrics capture correctness, revealing that many answers marked incorrect were actually valid. The findings suggest that large-scale language models can serve as implicit knowledge repositories but raise questions about their reliability, interpretability, and efficiency compared to retrieval-based approaches.
-## Limitations
-- **Lack of Knowledge Updating Mechanisms**: One of the most critical limitations is that once a model is trained, its internalized knowledge cannot be easily updated. Unlike retrieval-based systems, where knowledge is dynamically fetched from external sources, a closed-book model requires costly retraining to incorporate new information, making it impractical for domains requiring frequent updates, such as current events or scientific discoveries.
+## Issues Targeted
+- **Closed-Book Question Answering**: The paper explores the task of open-domain question answering without access to any external knowledge or context. This is in contrast to traditional open-book question answering systems that explicitly retrieve and use information from an external knowledge source.
+
+- **Knowledge Retrieval from Language Model Parameters**: The paper investigates the ability of large language models to store and retrieve knowledge from their parameters, rather than relying on external knowledge sources.
+
+- **Scaling Performance with Model Size**: The paper examines how the performance of closed-book question answering systems scales with the size of the language model.
+
+## Contribution/Novelty
+- **Salient Span Masking (SSM) Pre-Training**: Investigates the impact of using a salient span masking pre-training objective, which enhances the model's performance on open-domain question answering tasks, thus contributing a new technique to improve model training.
+
+- **Human Evaluation of Model Predictions**: Conducts a human evaluation to analyze the model's predictions, identifying the prevalence of false negatives in traditional evaluation metrics and suggesting that the performance of closed-book systems may be underestimated.
+
+## Approach
+- **Model Selection**: Utilizes the T5 (Text-to-Text Transfer Transformer) model, which is a transformer-based architecture that treats every NLP task as a text-to-text problem.
+
+- **Pre-Training**: The T5 model is pre-trained on a large, diverse dataset (C4) using a multi-task mixture that includes an unsupervised "span corruption" task, as well as supervised tasks like translation, summarization, and reading comprehension.
+
+- **Fine-Tuning for Closed-Book QA**: The model is fine-tuned specifically for the task of closed-book question answering. During this process, the model is provided only with the input question and must generate the answer based solely on the knowledge it has internalized during pre-training.
+
+- **Evaluation on Open-Domain Datasets**: The approach is evaluated on several open-domain question answering datasets, including Natural Questions, WebQuestions, and TriviaQA. The model's performance is measured without any external context or knowledge.
+
+- **Salient Span Masking (SSM)**: The paper experiments with an additional pre-training step using salient span masking, which focuses on reconstructing masked-out spans (such as named entities and dates) from sentences. This technique is shown to improve performance on question answering tasks.
+
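+A toy illustration of the salient span masking objective described above; real SSM relies on a trained named-entity and date tagger, so the regex heuristic and T5-style sentinel tokens here are for demonstration only.
+
+```python
+# Toy salient span masking: replace entity-like spans (capitalized names, years)
+# with sentinel tokens, yielding (input, target) pairs for span reconstruction.
+
+import re
+
+SALIENT = re.compile(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|\d{4})\b")
+
+def salient_span_mask(sentence):
+    targets, idx = [], 0
+
+    def repl(match):
+        nonlocal idx
+        targets.append(f"<extra_id_{idx}> {match.group(0)}")
+        sentinel = f"<extra_id_{idx}>"
+        idx += 1
+        return sentinel
+
+    masked = SALIENT.sub(repl, sentence)
+    return masked, " ".join(targets)
+
+inp, tgt = salient_span_mask("Barack Obama was born in 1961.")
+print(inp)   # <extra_id_0> was born in <extra_id_1>.
+print(tgt)   # <extra_id_0> Barack Obama <extra_id_1> 1961
+```
+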
+## Dataset/Testing
+**Dataset**
+
+- **Natural Questions**: A dataset of questions derived from web queries, each accompanied by a Wikipedia article containing the answer. The evaluation includes both the standard open-domain version and a multi-answer variant.
+- **WebQuestions**: This dataset consists of questions from web queries matched to corresponding entries in FreeBase. The model is evaluated based on its ability to answer these questions without any context.
+
+- **TriviaQA**: A collection of questions sourced from quiz league websites, where each question is accompanied by pages from web and Wikipedia searches that may contain the answer. The model is tested on its ability to answer these questions without access to the provided documents.
+
+**Testing**
+
+- **Closed-Book Question Answering**: The model is tested in a closed-book setting, meaning it is only given the input question without any additional context or external knowledge. This allows the evaluation of how well the model can retrieve and generate answers based solely on the knowledge it has internalized during pre-training.
+
+- **Evaluation Procedures**: The evaluation follows the standard procedure for each dataset: predicted answers are compared to ground-truth answers after normalization (lowercasing, stripping articles and punctuation, and collapsing duplicate whitespace); a sketch of this normalization appears after this list.
+
+- **Human Evaluation**: In addition to automated evaluation metrics, the authors conduct a human evaluation of a subset of predictions to identify false negatives and assess the quality of the model's answers more qualitatively.
+
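+The normalization mentioned in the evaluation-procedure bullet above is commonly implemented along these lines (SQuAD-style); the exact function used in the paper may differ in detail.
+
+```python
+# SQuAD-style answer normalization used for EM scoring: lowercase, drop
+# punctuation and articles, collapse whitespace.
+
+import re
+import string
+
+def normalize_answer(text):
+    text = text.lower()
+    text = "".join(ch for ch in text if ch not in string.punctuation)
+    text = re.sub(r"\b(a|an|the)\b", " ", text)
+    return " ".join(text.split())
+
+def exact_match(prediction, gold):
+    return normalize_answer(prediction) == normalize_answer(gold)
+
+print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
+```
+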
+## Result
+- **Performance Improvement with Model Size**: The results indicate that as the size of the T5 model increases, its performance on the question answering tasks also improves. The largest model (T5-11B) consistently outperforms smaller variants across all datasets.
+
+- **Specific Performance Metrics**:
+ - The paper provides specific scores achieved by different model sizes on the Natural Questions (NQ), WebQuestions (WQ), and TriviaQA (TQA) tasks. For example:
+ - T5-11B + SSM achieved a score of 34.8 on WebQuestions and 51.0 on TriviaQA.
+ - T5-11B achieved a score of 32.6 on Natural Questions.
+
+- **Recall on Multi-Answer Variant**: For the multi-answer variant of Natural Questions, the T5-11B + SSM model achieved a recall of 36.2, which, while lower than the state-of-the-art score of 51.9, still outperformed the best baseline published alongside the dataset.
+
+## Findings
+- **Knowledge Internalization**: The study finds that large language models, specifically T5, can internalize a significant amount of knowledge during pre-training, allowing them to answer questions without external context.
+
+- **Model Size Impact**: The results indicate that larger models (up to 11 billion parameters) perform better on open-domain question answering tasks, suggesting a positive correlation between model size and knowledge retrieval capabilities.
+
+- **Effectiveness of Salient Span Masking**: The introduction of salient span masking (SSM) during pre-training significantly enhances the model's performance on question answering tasks, demonstrating the value of task-specific pre-training objectives.
+
+## Limitations
- **Interpretability and Explainability Issues**: The study does not address how knowledge is stored or retrieved within model parameters. This opacity limits the ability to verify correctness, trace the source of errors, or understand the reasoning behind an answer. In contrast, retrieval-based systems provide explicit sources that can be inspected and validated.
- **Hallucination of Incorrect but Plausible Answers**: The paper acknowledges that models sometimes generate answers that sound plausible but are incorrect, particularly when they lack the necessary knowledge. This poses risks in high-stakes applications like medical or legal domains, where misinformation could have severe consequences.
- **Overestimation of Performance Due to Dataset Bias**: The evaluation datasets (e.g., Natural Questions, TriviaQA) focus largely on factoid-style questions, which may not represent the complexity of real-world information needs. The study does not explore how well the models handle multi-step reasoning, nuanced interpretation, or ambiguous queries, which are common in practical applications.
+## Scope
+- **Efficiency Improvements**: Future research could focus on developing more efficient language models that maintain high performance while reducing computational requirements.
+
# Retrieval-Enhanced Machine Learning
**Domain**: Retrieval
@@ -418,15 +2359,103 @@ The study highlights the trade-offs of storing knowledge within model parameters
**DOI**: [https://doi.org/10.1145/3477495.3531722](https://doi.org/10.1145/3477495.3531722)
+**Published**: SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2022)
+
+**Authors**:
+- [Hamed Zamani](https://www.webofscience.com/wos/author/record/18413635), _University of Massachusetts Amherst_
+- [Fernando Diaz](https://www.webofscience.com/wos/author/record/53288098), [Mostafa Dehghani](https://www.webofscience.com/wos/author/record/29618766), [Donald Metzler](https://www.webofscience.com/wos/author/record/5679318), [Michael Bendersky](https://www.webofscience.com/wos/author/record/5499731), _Google Research_
+
## Summary
The paper introduces the concept of Retrieval-Enhanced Machine Learning (REML), which aims to improve machine learning models by integrating information retrieval (IR) techniques. Traditional machine learning systems often rely on large parameter sizes to encode knowledge, which can be costly and unsustainable. REML proposes a framework where machine learning models can access external information repositories, allowing them to decouple reasoning from memory. This approach enhances model generalization, scalability, robustness, and interpretability by leveraging efficient retrieval methods to access relevant information dynamically during the prediction process.
The authors outline the core principles of REML, including querying, retrieval, and response utilization, and categorize models based on their capabilities, such as storing information and providing feedback to the retrieval system. They discuss the potential applications of REML in various domains, including generalization, scalability, and interpretability, while also addressing challenges in optimizing the interaction between prediction and retrieval models. The paper concludes by emphasizing the need for further research to fully realize the potential of REML in advancing machine learning and artificial intelligence.
+## Issues Targeted
+- **Scalability of Machine Learning Models**
+ - Increasing model capacity by adding parameters is not sustainable.
+ - High capacity models often memorize training data, leading to inefficiencies.
+
+- **Generalization Challenges**
+ - Many existing ML models struggle with generalization, especially in domain adaptation, zero-shot, and few-shot learning tasks.
+
+- **Temporal Aspect of Data**
+ - Current ML models are brittle in nonstationary domains where new information constantly emerges.
+ - Periodic retraining is impractical for quickly-changing domains.
+
+- **On-Device Machine Learning Limitations**
+ - State-of-the-art ML models require significant computational power and memory, which are often unavailable on devices like smartphones.
+ - There is a need for efficient models that can operate with limited resources.
+
+- **Information Access and Retrieval Limitations**
+ - Existing retrieval models may not be fully optimized for machine learning applications.
+ - There is a need for better integration of retrieval systems with ML models to enhance performance.
+
+## Contribution/Novelty
+- **Introduction of Retrieval-Enhanced Machine Learning (REML) Framework**: The paper proposes a novel framework that integrates information retrieval techniques with machine learning models, allowing for improved model generalization, scalability, robustness, and interpretability.
+
+- **Decoupling Reasoning from Memory**: REML allows machine learning models to access external information repositories, reducing the need for large model parameters and enabling more efficient memory management.
+
+- **Flexible Model Architecture**: The framework is designed to be generic and flexible, accommodating various existing models and paving the way for future developments in machine learning.
+
+- **Optimization Strategies for REML**: The paper outlines three categories of optimization approaches: independent optimization, conditional optimization, and joint end-to-end optimization, providing a structured way to enhance model performance.
+
+- **Addressing Key Challenges in ML**: The framework specifically targets issues such as generalization, scalability, interpretability, and the temporal aspect of data, which are often overlooked in traditional ML approaches.
+
+- **Integration of Feedback Mechanisms**: The paper discusses the importance of feedback between prediction models and retrieval systems, proposing mechanisms for improving retrieval performance based on model feedback.
+
+## Approach
+- **Decoupling Knowledge and Reasoning**
+ - The approach emphasizes the separation of knowledge storage from reasoning processes. By utilizing retrieval systems, the framework allows models to offload memorization to external storage, reducing the need for large model parameters.
+
+- **Model Architecture**: REML consists of two main components:
+ - **Prediction Model**: This model generates predictions based on input data and retrieved information.
+ - **Information Access Models**: These models mediate access to a repository of information, allowing the prediction model to query and retrieve relevant data.
+
+- **Querying and Retrieval Mechanisms**: The approach defines necessary requirements for REML models, including:
+ - **Querying**: The prediction model can submit input-dependent queries to the information access models.
+ - **Retrieval**: Information access models efficiently process queries and retrieve relevant information items.
+ - **Response Utilization**: The prediction model utilizes the retrieved information to make predictions (see the sketch below).
+
+- **Optimization Strategies**: The paper outlines three categories of optimization approaches for REML:
+ - **Independent Optimization**: Training the prediction model independently of the retrieval model.
+ - **Conditional Optimization**: Updating the prediction model based on the performance of the retrieval model and vice versa.
+ - **Joint End-to-End Optimization**: Training both models simultaneously to optimize a single objective function.
+
+- **Feedback Mechanisms**
+ - The approach incorporates feedback loops where the prediction model can provide feedback to the information access models, allowing for iterative improvements in retrieval performance.
+
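+A minimal interface sketch of the querying, retrieval, and response-utilization loop outlined above; the class names and the toy lexical scorer are assumptions for illustration, not part of the paper.
+
+```python
+# Illustrative REML-style decomposition: a prediction model issues an
+# input-dependent query to an information access model and conditions its
+# prediction on the retrieved items. All components are stand-ins.
+
+class InformationAccessModel:
+    def __init__(self, repository):
+        self.repository = repository  # external storage, decoupled from model parameters
+
+    def retrieve(self, query, k=2):
+        # Toy lexical scoring in place of a trained retriever.
+        scored = sorted(self.repository,
+                        key=lambda item: len(set(query.split()) & set(item.split())),
+                        reverse=True)
+        return scored[:k]
+
+class PredictionModel:
+    def __init__(self, access_model):
+        self.access_model = access_model
+
+    def predict(self, x):
+        query = x                                    # querying
+        items = self.access_model.retrieve(query)    # retrieval
+        return f"prediction for '{x}' using {items}"  # response utilization
+
+repo = ["solar panels convert sunlight", "transformers use attention"]
+model = PredictionModel(InformationAccessModel(repo))
+print(model.predict("how do transformers work"))
+```
+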
+## Dataset/Testing
+- **No Specific Dataset Mentioned**
+ - The paper does not specify a single dataset used for testing the Retrieval-Enhanced Machine Learning (REML) framework. Instead, it discusses the framework in a general context, applicable to various machine learning tasks and domains.
+
+- **Case Studies and Related Work**
+ - The authors review several existing models and approaches as special cases of REML, which may have utilized different datasets in their respective studies. This includes models related to knowledge grounding, memory-augmented learning, and retrieval-enhanced optimization.
+
+- **Evaluation Methodologies**: The paper emphasizes the importance of evaluation methodologies for both prediction models and information access models. It suggests:
+ - **Extrinsic Evaluation**: Measuring the impact of information access quality on the performance of prediction models for downstream tasks.
+ - **Intrinsic Evaluation**: Evaluating the retrieval model independently by defining relevance based on expected documents for a prediction model.
+
+## Result
+- **Improved Generalization**: The authors highlight that retrieval augmentation can significantly enhance the generalization capabilities of machine learning models. For instance, existing models like KNN-LM showed substantial improvements in language model perplexity on both in-distribution and out-of-distribution test sets.
+
+- **Scalability Benefits**: The REML framework allows for the explicit storage of information, reducing the need for high-capacity models. This leads to increased throughput and efficiency in accessing information, which is particularly beneficial for large-scale applications.
+
+- **Temporal Adaptability**: The framework addresses the challenges of nonstationary domains by allowing models to maintain and update knowledge independently of model parameters. This adaptability is crucial for applications in rapidly changing environments, such as news and real-time data.
+
+## Findings
+- **Integration of Retrieval and Machine Learning**: The paper establishes that integrating information retrieval techniques with machine learning models can significantly enhance model performance, particularly in terms of generalization and scalability.
+
+- **Decoupling Knowledge from Reasoning**: The REML framework allows for the separation of knowledge storage from reasoning processes, enabling models to access external information repositories rather than relying solely on large model parameters.
+
+- **Improved Generalization**: Retrieval-augmented approaches can lead to better generalization across various tasks, as evidenced by existing models that have shown substantial performance improvements when incorporating retrieval techniques.
+
## Limitations
-- **Feedback Mechanism Limitations**: The paper discusses the potential for feedback from prediction models to improve retrieval systems. However, the effectiveness of this feedback loop may vary, and establishing a reliable feedback mechanism can be difficult.
+- **Lack of Empirical Testing**: The paper does not provide specific empirical results or quantitative evaluations of the REML framework, relying instead on theoretical discussions and references to existing models.
+
+- **Complexity of Implementation**: Implementing the REML framework may introduce additional complexity in model design and optimization, particularly in managing the interactions between prediction and retrieval components.
-- **Limited Exploration of Querying Strategies**: The paper identifies querying as a core research question but does not delve deeply into the various strategies for effective querying, which could limit the practical application of REML.
+## Scope
+- **Research Agenda for Future Work**: The authors outline a comprehensive research agenda that includes exploring optimization strategies, feedback mechanisms, and evaluation methodologies, paving the way for further advancements in retrieval-enhanced machine learning.
# Can Knowledge Graphs Reduce Hallucinations in LLMs
**Domain**: Knowledge Graph
@@ -435,13 +2464,94 @@ The authors outline the core principles of REML, including querying, retrieval,
**DOI**: [https://doi.org/10.18653/v1/2024.naacl-long.219](https://doi.org/10.18653/v1/2024.naacl-long.219)
+**Published**: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (2024)
+
+**Authors**: [Garima Agrawal](https://aclanthology.org/people/g/garima-agrawal/), [Tharindu Kumarage](https://aclanthology.org/people/t/tharindu-kumarage/), [Zeyad Alghamdi](https://aclanthology.org/people/z/zeyad-alghamdi/), [Huan Liu](https://aclanthology.org/people/h/huan-liu/), _Arizona State University_
+
## Summary
The paper explores the integration of knowledge graphs (KGs) into large language models (LLMs) to mitigate the issue of hallucinations—outputs that sound plausible but are often incorrect or irrelevant. The authors categorize various knowledge-graph-based augmentation techniques into three main groups: Knowledge-Aware Inference, Knowledge-Aware Learning, and Knowledge-Aware Validation. Each category encompasses methods that enhance the reasoning capabilities of LLMs by improving their inference processes, optimizing learning mechanisms, and validating generated outputs against structured knowledge.
The survey highlights the effectiveness of these techniques in enhancing the reliability and performance of LLMs across different applications, while also discussing current trends, challenges, and future research directions in the field. The authors emphasize the importance of providing precise and contextually relevant external knowledge to improve LLMs' understanding and reasoning, ultimately aiming to create more trustworthy AI systems.
+## Issues Targeted
+- **Hallucinations in Large Language Models (LLMs)**
+ - LLMs often produce outputs that sound plausible but are irrelevant or incorrect, a phenomenon known as hallucinations.
+ - Hallucinations stem from knowledge gaps within the models, leading to unreliable outputs.
+
+- **Knowledge Gaps**
+ - LLMs may struggle to accurately interpret vague phrases or terms due to insufficient contextual knowledge.
+ - The presence of misinformation, biases, or inaccuracies in training data can amplify these knowledge gaps.
+
+- **Probabilistic Nature of LLMs**
+ - The stochastic decoding processes of LLMs can yield varied outputs for the same input, complicating the generation of consistent and accurate responses.
+
+## Contribution/Novelty
+- **Comprehensive Survey of Knowledge Graph Integration**
+ - The paper presents a thorough review of knowledge graph (KG)-based augmentation techniques specifically aimed at reducing hallucinations in large language models (LLMs).
+ - It categorizes these methods into three main groups: Knowledge-Aware Inference, Knowledge-Aware Learning, and Knowledge-Aware Validation, providing a structured framework for understanding the various approaches.
+
+- **Focus on Hallucination Mitigation**
+ - Unlike previous surveys, this paper exclusively focuses on the integration of structured knowledge from KGs to address hallucinations in LLMs, filling a gap in the existing literature.
+
+## Approach
+- **Literature Review and Survey Methodology**
+ - The authors conducted a comprehensive literature review to gather existing knowledge graph (KG)-based augmentation techniques for large language models (LLMs).
+ - They systematically categorized these techniques into three main groups based on their functionality and application:
+ - **Knowledge-Aware Inference**
+ - **Knowledge-Aware Learning**
+ - **Knowledge-Aware Validation**
+
+- **Categorization of Techniques**: Each category encompasses various methods that utilize KGs to enhance LLMs:
+ - **Knowledge-Aware Inference**
+ - Focuses on improving the inference process of LLMs by integrating KGs at the input level to enhance contextual understanding.
+ - Subcategories include:
+ - KG-Augmented Retrieval
+ - KG-Augmented Reasoning
+ - Knowledge-Controlled Generation
+ - **Knowledge-Aware Learning**
+ - Involves optimizing the learning mechanisms of LLMs through knowledge integration during pre-training and fine-tuning stages.
+ - Subcategories include:
+ - Knowledge-Aware Pre-Training
+ - Knowledge-Aware Fine-Tuning
+ - **Knowledge-Aware Validation**
+ - Utilizes structured data from KGs as a fact-checking mechanism to validate the outputs of LLMs and ensure consistency and reliability.
+
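+A toy sketch of the knowledge-aware validation idea described above, checking triples claimed in an LLM's output against a knowledge graph; the in-memory triple set and exact string matching are simplifications of any real system surveyed in the paper.
+
+```python
+# Toy knowledge-aware validation: verify (subject, relation, object) claims
+# against a small in-memory knowledge graph. A real system would extract the
+# triples from free text and query a full KG such as Wikidata.
+
+KG = {
+    ("marie curie", "award", "nobel prize in physics"),
+    ("marie curie", "field", "radioactivity"),
+}
+
+def validate_claims(claims, kg=KG):
+    """Return (claim, is_supported) pairs for each extracted triple."""
+    return [(c, tuple(part.lower() for part in c) in kg) for c in claims]
+
+claims = [
+    ("Marie Curie", "award", "Nobel Prize in Physics"),
+    ("Marie Curie", "award", "Turing Award"),
+]
+for claim, supported in validate_claims(claims):
+    print(claim, "->", "supported" if supported else "not found in KG")
+```
+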
+## Dataset/Testing
+The paper does not specify a single dataset that is universally used across all the discussed methods. Instead, it references various datasets employed in different studies related to knowledge graph (KG) augmentation techniques for large language models (LLMs).
+
+The effectiveness of the various KG-augmented methods is assessed through empirical evaluations that involve:
+
+- **Performance Metrics**: The paper discusses various evaluation metrics such as accuracy, Mean Reciprocal Rank (MRR), Hits@1, Exact Match (EM), and human evaluation to assess the quality of outputs generated by the models.
+- **Comparative Analysis**: The authors compare the performance of LLMs with and without KG augmentation across different tasks, using the aforementioned datasets to demonstrate improvements in accuracy and reduction of hallucinations.
+
+## Result
+- **Improvement in Accuracy**: KG augmentation significantly enhances the accuracy of LLMs in various tasks, particularly in question-answering scenarios. For instance:
+ - Smaller models showed over an 80% improvement in answer correctness when augmented with facts from knowledge graphs, compared to relying solely on internal knowledge.
+ - Larger models, such as ChatGPT, demonstrated increased accuracy in reasoning tasks, with specific methods like IRCoT reporting improvements from 66.8% to 85.7%.
+
+- **Enhanced Reasoning Capabilities**: Techniques that incorporate step-wise reasoning, such as Chain of Thought (CoT) and its variations, have proven effective in improving the reasoning abilities of LLMs. For example:
+ - The MindMap method achieved an accuracy of 88.2% in medical diagnosis tasks by utilizing clinical reasoning graphs.
+
+## Findings
+- **Effectiveness of Knowledge Graphs (KGs)**
+ - KGs significantly enhance the performance of large language models (LLMs) by reducing hallucinations and improving reasoning accuracy.
+ - Various KG-augmented methods demonstrate substantial improvements in accuracy across different tasks, particularly in question-answering and reasoning scenarios.
+
+- **Categorization of Techniques**
+ - The paper categorizes KG-augmented methods into three main groups: Knowledge-Aware Inference, Knowledge-Aware Learning, and Knowledge-Aware Validation, providing a structured understanding of how KGs can be integrated into LLMs.
+
## Limitations
-- **Open Research Questions**: The paper highlights ongoing challenges, such as the extent to which updated knowledge can be integrated into models and the fundamental question of whether neural networks genuinely engage in reasoning, indicating areas that require further investigation.
+- **Resource Intensity**: Pre-training and fine-tuning LLMs with KGs are resource-intensive processes that require significant computational power, which may limit accessibility for some researchers and practitioners.
+
+- **Dependence on Quality of KGs**: The effectiveness of KG augmentation is highly dependent on the quality and comprehensiveness of the knowledge graphs used. Poorly constructed or biased KGs can lead to inaccurate model outputs.
+
+## Scope
+- **Future Research Directions**: The paper outlines several potential avenues for future research, including:
+ - Improving the quality and context-awareness of KGs to enhance LLM performance.
+ - Addressing biases in KGs to prevent the perpetuation of misinformation.
+ - Exploring multi-modal data integration to enrich the knowledge base available to LLMs.
+ - Investigating the synergistic relationship between LLMs and KGs for mutual enhancement.
# Retrieval Augmentation Reduces Hallucination in Conversation
**Domain**: RAG
@@ -450,12 +2560,105 @@ The survey highlights the effectiveness of these techniques in enhancing the rel
**DOI**: [https://doi.org/10.18653/v1/2021.findings-emnlp.320](https://doi.org/10.18653/v1/2021.findings-emnlp.320)
+**Published**: Findings of the Association for Computational Linguistics: EMNLP 2021 (2021)
+
+**Authors**: [Kurt Shuster](https://aclanthology.org/people/k/kurt-shuster/), [Spencer Poff](https://aclanthology.org/people/s/spencer-poff/), [Moya Chen](https://aclanthology.org/people/m/moya-chen/), [Douwe Kiela](https://aclanthology.org/people/d/douwe-kiela/), [Jason Weston](https://aclanthology.org/people/j/jason-weston/), _Facebook AI Research_
+
## Summary
The paper explores the challenges faced by state-of-the-art dialogue models, particularly the issues of factual inaccuracy and knowledge hallucination. The authors propose the use of neural-retrieval-in-the-loop architectures, specifically retrieval-augmented generation (RAG), to enhance knowledge-grounded dialogue systems. By integrating retrievers, rankers, and encoder-decoder models, the study demonstrates that these architectures can significantly improve the factual accuracy of conversational agents while maintaining their conversational fluency. The results show that the best-performing models achieve state-of-the-art performance on knowledge-grounded conversational tasks, effectively reducing hallucinated responses by over 60% and improving generalization to unseen topics.
The paper also emphasizes the importance of using appropriate evaluation metrics, such as Knowledge F1, to assess the models' performance in terms of knowledge utilization and hallucination reduction. Through extensive experiments on datasets like Wizard of Wikipedia and CMU Document Grounded Conversations, the authors highlight that retrieval-augmented models not only outperform traditional models but also exhibit better consistency and engagement in conversations. The findings suggest that retrieval-augmented approaches are a promising solution to the hallucination problem in dialogue systems, paving the way for future research in this area.
-## Limitations
-- **Complexity of Multi-Turn Dialogue**: The paper acknowledges that knowledge-grounded dialogue is inherently more complex than single-turn question answering. The models may struggle with maintaining coherence and relevance across multiple turns of conversation, especially when the dialogue context is lengthy.
+## Issues Targeted
+- **Factual Incorrectness and Hallucination**
+ - State-of-the-art dialogue models often generate plausible but factually incorrect statements.
+ - Hallucination refers to the generation of information that is not grounded in the training data or external knowledge.
+
+- **Complexity of Knowledge-Grounded Dialogue**
+  - Knowledge-grounded dialogue requires models to formulate retrieval queries from complex multi-turn dialogue contexts.
+ - Generating coherent responses while maintaining factual accuracy is challenging.
+
+- **Limitations of Existing Approaches**
+ - Traditional models may not effectively utilize retrieval mechanisms for knowledge grounding.
+ - Existing methods often ignore the dialogue context, leading to less relevant or incorrect responses.
+
+## Contribution/Novelty
+- **Introduction of Retrieval-Augmented Generation (RAG) for Dialogue**
+ - The paper extends the RAG framework, which has been effective in open-domain question answering, to the more complex task of knowledge-grounded dialogue.
+
+- **Development of Advanced Neural Architectures**
+ - Proposes various architectures that incorporate retrievers, rankers, and encoder-decoders to enhance knowledge utilization in dialogue systems.
+ - Introduces methods like RAG-Turn and Fusion-in-Decoder (FiD) to improve the interaction between dialogue context and retrieved knowledge.
+
+- **Reduction of Hallucination in Responses**
+ - Demonstrates that the proposed models significantly reduce the problem of knowledge hallucination, achieving over 60% reduction in hallucinated responses compared to standard models.
+
+- **Introduction of Knowledge F1 Metric**
+ - Proposes a new evaluation metric, Knowledge F1 (KF1), to better assess the knowledge grounding of generated responses, addressing the limitations of traditional metrics.
+
+## Approach
+- **Architecture Components** (a minimal end-to-end sketch follows this list)
+ - **Retrievers**: Identify relevant documents or passages from a large unstructured knowledge base (e.g., Wikipedia) based on the dialogue context.
+ - **Rankers**: Score and rank the retrieved documents to determine their relevance to the current dialogue turn.
+ - **Encoder-Decoders**: Generate responses by conditioning on both the dialogue context and the retrieved knowledge.
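+
+As a rough illustration of how these components fit together, the sketch below wires a retriever, ranker, and encoder-decoder into a single response function; `retrieve`, `rank`, and `generate` are placeholder callables standing in for the actual models, not the paper's API.
+
+```python
+# Schematic knowledge-grounded response generation: retrieve, rank, generate.
+
+def respond(dialogue_context, retrieve, rank, generate, n_docs=5):
+    # 1. Retriever: fetch candidate passages for the dialogue context.
+    candidates = retrieve(dialogue_context)
+    # 2. Ranker: score each candidate and keep the top-n most relevant.
+    top_docs = sorted(candidates, key=lambda d: rank(dialogue_context, d),
+                      reverse=True)[:n_docs]
+    # 3. Encoder-decoder: generate conditioned on context plus retrieved knowledge.
+    return generate(dialogue_context, top_docs)
+```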
+
+- **Specific Techniques Implemented**
+  - **RAG-Token and RAG-Sequence**: Two variations of the RAG model that differ in how they marginalize over retrieved documents: RAG-Token lets the generator draw on a different document for each generated token, while RAG-Sequence conditions each candidate response on a single document and then marginalizes over documents (see the formulation sketched after this list).
+ - **Fusion-in-Decoder (FiD)**: A method that concatenates the outputs of the encoder for all retrieved documents before passing them to the decoder, allowing for simultaneous attention to multiple documents.
+ - **RAG-Turn**: A novel approach that retrieves documents for each turn of dialogue separately, improving the relevance of the retrieved knowledge to the ongoing conversation.
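+
+For reference, the marginalizations that RAG-Sequence and RAG-Token build on can be written as follows, following the original RAG formulation (Lewis et al., 2020); here the retriever assigns a score p_η(z | x) to each document z given input x, and p_θ is the generator.
+
+```latex
+% RAG-Sequence: condition each full response on a single retrieved document,
+% then marginalize over the top-k documents.
+p_{\text{RAG-Seq}}(y \mid x) \approx \sum_{z \in \text{top-}k} p_\eta(z \mid x)
+    \prod_{i} p_\theta(y_i \mid x, z, y_{1:i-1})
+
+% RAG-Token: marginalize over the top-k documents at every generated token.
+p_{\text{RAG-Tok}}(y \mid x) \approx \prod_{i} \sum_{z \in \text{top-}k}
+    p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
+```
+
+In this setup, end-to-end training (see End-to-End Training below) minimizes the negative log of the marginal likelihood, which is what lets the retriever and generator be optimized jointly; FiD instead concatenates the encoder outputs of all retrieved documents and lets the decoder attend over them jointly, without an explicit marginalization.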
+
+- **End-to-End Training**
+ - The entire system is trained end-to-end, allowing the retriever and generator to learn from each other, optimizing the retrieval process based on the generation task.
+
+- **Knowledge F1 Metric**
+ - Introduces a new evaluation metric, Knowledge F1 (KF1), to measure the overlap between generated responses and the relevant knowledge, providing a better assessment of knowledge utilization.
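+
+Since KF1 measures the overlap between the generated response and the relevant knowledge, a minimal sketch of such a unigram-overlap F1 is shown below; the whitespace tokenization and lowercasing are assumptions, not the paper's exact implementation.
+
+```python
+from collections import Counter
+
+# Minimal sketch of a unigram-overlap F1 in the spirit of Knowledge F1 (KF1):
+# the usual F1 computation, but scored against the gold knowledge sentence
+# rather than the gold response.
+
+def overlap_f1(generated: str, reference: str) -> float:
+    gen_tokens = generated.lower().split()
+    ref_tokens = reference.lower().split()
+    common = Counter(gen_tokens) & Counter(ref_tokens)
+    overlap = sum(common.values())
+    if overlap == 0:
+        return 0.0
+    precision = overlap / len(gen_tokens)
+    recall = overlap / len(ref_tokens)
+    return 2 * precision * recall / (precision + recall)
+
+# KF1: compare the model's response with the knowledge it should be grounded in.
+print(overlap_f1("the eiffel tower is 330 metres tall",
+                 "the eiffel tower is 330 metres (1,083 ft) tall"))
+```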
-- **Hallucination with Increased Documents**: While the models significantly reduce hallucination, the paper notes that increasing the number of retrieved documents can lead to higher levels of hallucination in some cases. This suggests a trade-off between knowledge utilization and the risk of generating incorrect information.
\ No newline at end of file
+## Dataset/Testing
+**Dataset**
+
+- **Wizard of Wikipedia (WoW)**
+ - **Description**: A dataset consisting of knowledge-grounded dialogues collected through human-human crowdworker chats. The conversations cover a wide range of topics, with one participant having access to external knowledge from Wikipedia.
+ - **Purpose**: Used to evaluate the performance of the proposed models in generating knowledgeable and coherent responses based on the dialogue context.
+
+- **CMU Document Grounded Conversations (CMU_DoG)**
+ - **Description**: A dataset focused on movie discussions, also collected through human-human interactions. Similar to WoW, it involves dialogues where one participant has access to external knowledge.
+ - **Purpose**: Provides a different domain for testing the models, allowing for evaluation of their performance in a more specific context.
+
+**Testing**
+
+- **Validation and Test Splits**:
+ - Both datasets are split into "seen" and "unseen" validation and test sets. The "unseen" splits contain topics or movies not discussed in the training data, allowing for assessment of the models' generalization capabilities.
+
+- **Evaluation Metrics**:
+ - The models are evaluated using standard automatic metrics such as perplexity (PPL), F1 score, BLEU-4, ROUGE-L, and the newly introduced Knowledge F1 (KF1) metric.
+ - Human evaluations are also conducted to assess the quality of the generated responses across various dimensions, including consistency, engagement, knowledgeability, and hallucination.
+
+## Result
+- **Performance on Knowledge-Grounded Dialogue Tasks**
+ - The proposed models achieved state-of-the-art performance on both datasets (Wizard of Wikipedia and CMU Document Grounded Conversations).
+ - Significant improvements were observed in various evaluation metrics compared to baseline models.
+
+- **Reduction in Hallucination**
+ - The best models substantially reduced the problem of knowledge hallucination, with a reduction of over 60% in hallucinated responses compared to standard (non-retrieval augmented) large language models.
+ - Human evaluations indicated lower hallucination rates for retrieval-augmented models.
+
+- **Knowledge F1 Scores**
+  - Retrieval-augmented models achieved substantially higher Knowledge F1 (KF1) scores than baseline models without retrieval, indicating that their responses were better grounded in the relevant factual knowledge.
+
+## Findings
+- **Effective Reduction of Hallucination**: The proposed retrieval-augmented models significantly reduce the occurrence of hallucinated responses in dialogue generation, achieving over 60% reduction compared to standard models.
+- **Improved Knowledge Utilization**: The introduction of the Knowledge F1 (KF1) metric demonstrates that retrieval-augmented models effectively utilize relevant knowledge, leading to higher KF1 scores.
+
+- **Generalization to Unseen Topics**: Models that incorporate retrieval mechanisms show better generalization capabilities to unseen topics, maintaining performance metrics even when evaluated on data not present in the training set.
+
+## Limitations
+- **Complexity of Implementation**: The proposed architectures involve multiple components (retrievers, rankers, encoder-decoders), which may complicate implementation and increase computational overhead.
+- **Potential for Increased Hallucination with More Documents**: While retrieving more documents can improve some metrics, it may also lead to higher levels of hallucination, indicating a trade-off between broader knowledge utilization and the risk of generating incorrect information.
+
+## Scope
+- **Future Research Directions**: The paper suggests several avenues for future research, including:
+  - Exploring improved retrieval mechanisms to enhance the relevance and accuracy of retrieved knowledge.
+  - Investigating the interplay between retrieved knowledge and knowledge stored in the model's parameters.
+  - Developing methods to further reduce hallucination while maintaining conversational ability.
+  - Expanding the evaluation framework to include more diverse metrics that capture the quality of dialogue beyond factual accuracy.