author    Aditya <bluenerd@protonmail.com>  2025-02-03 19:56:53 +0530
committer Aditya <bluenerd@protonmail.com>  2025-02-03 19:56:53 +0530
commit    8008f4741106928aa7d5925becf02a2259e2757e (patch)
tree      046ce09015e96a3292741014df1a0ce601832386 /sources.md
parent    573bebf0b709507b09ae6d21616b675dcac08d69 (diff)
add more sources
Diffstat (limited to 'sources.md')
-rw-r--r--  sources.md | 234
1 file changed, 234 insertions(+), 0 deletions(-)
diff --git a/sources.md b/sources.md
index be635b5..bab2d2f 100644
--- a/sources.md
+++ b/sources.md
@@ -3,6 +3,8 @@
**Relevance Score**: 4
+**DOI**: [https://doi.org/10.1561/1500000019](https://doi.org/10.1561/1500000019)
+
## Abstract
The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970–1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters.
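
For orientation, the BM25 ranking function that grew out of the PRF is commonly written as follows (one standard formulation, not quoted from the survey; f(t, D) is the frequency of term t in document D, |D| the document length, avgdl the average document length, and k_1, b the free parameters whose optimisation the survey discusses):

```latex
\mathrm{score}(D, Q) \;=\; \sum_{t \in Q} \mathrm{IDF}(t)\,
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1\!\left(1 - b + b\,\dfrac{|D|}{\mathrm{avgdl}}\right)}
```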
@@ -19,6 +21,8 @@ explores the theoretical underpinnings, development, and extensions of the Proba
**Relevance Score**: 6
+**DOI**: [https://doi.org/10.18653/v1/2020.emnlp-main.550](https://doi.org/10.18653/v1/2020.emnlp-main.550)
+
## Abstract
Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.
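
To make the dual-encoder setup concrete, here is a minimal sketch of dense retrieval by inner-product search. The `encode` function is a toy bag-of-words stand-in for DPR's BERT-based question and passage encoders, not the paper's model; only the indexing/scoring flow is illustrated.

```python
import numpy as np

def encode(texts, dim=64):
    """Toy bag-of-words embedding; a stand-in for DPR's BERT question/passage encoders."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            # Deterministic pseudo-random vector per token (within one process).
            vecs[i] += np.random.default_rng(abs(hash(tok)) % (2**32)).standard_normal(dim)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-9
    return vecs / norms  # unit-normalise so the dot product behaves like cosine similarity

# Offline: embed and index every passage once.
passages = [
    "BM25 is a sparse lexical ranking function based on term statistics.",
    "Dense passage retrieval learns question and passage embeddings with a dual encoder.",
    "CLIP aligns images and captions in a shared embedding space.",
]
passage_index = encode(passages)

# Online: embed the question and rank passages by inner product (top-k retrieval).
question_vec = encode(["how does dense retrieval with a dual encoder work?"])
scores = (question_vec @ passage_index.T)[0]
top_k = np.argsort(-scores)[:2]
print([(round(float(scores[i]), 3), passages[i]) for i in top_k])
```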
@@ -37,6 +41,8 @@ The paper presents an innovative approach to passage retrieval for answering ope
**Relevance Score**: 4
+**DOI**: [http://dx.doi.org/10.48550/arXiv.2103.00020](http://dx.doi.org/10.48550/arXiv.2103.00020)
+
## Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
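
The zero-shot transfer described above can be illustrated with a small sketch: class names are turned into caption-style prompts and the image is assigned to the most similar text embedding. The `embed_text` and `embed_image` functions are toy stand-ins for CLIP's encoders, and the example is deliberately rigged so the image lands near the "dog" prompt.

```python
import numpy as np

def embed_text(text, dim=32):
    """Toy text encoder; a stand-in for CLIP's contrastively trained text tower."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec += np.random.default_rng(abs(hash(tok)) % (2**32)).standard_normal(dim)
    return vec / (np.linalg.norm(vec) + 1e-9)

def embed_image(image_path, dim=32):
    """Toy image encoder; here it simply pretends the image looks like a dog photo."""
    return embed_text("a photo of a dog", dim)

labels = ["dog", "cat", "car"]
prompts = [f"a photo of a {label}" for label in labels]   # caption-style prompts for zero-shot transfer
text_vecs = np.stack([embed_text(p) for p in prompts])
image_vec = embed_image("example.jpg")
scores = text_vecs @ image_vec                            # cosine similarity (all vectors unit-norm)
print(labels[int(np.argmax(scores))])                     # class chosen without task-specific training
```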
@@ -56,6 +62,8 @@ The paper introduces CLIP (Contrastive Language-Image Pretraining), a scalable f
**Relevance Score**: 6.5
+**DOI**: [https://doi.org/10.48550/arXiv.2402.03181](https://doi.org/10.48550/arXiv.2402.03181)
+
## Abstract
Despite the impressive capabilities of large language models (LLMs) across diverse applications, they still suffer from trustworthiness issues, such as hallucinations and misalignments. Retrieval-augmented language models (RAG) have been proposed to enhance the credibility of generations by grounding external knowledge, but the theoretical understanding of their generation risks remains unexplored. In this paper, we answer: 1) whether RAG can indeed lead to low generation risks, 2) how to provide provable guarantees on the generation risks of RAG and vanilla LLMs, and 3) what sufficient conditions enable RAG models to reduce generation risks. We propose C-RAG, the first framework to certify generation risks for RAG models. Specifically, we provide conformal risk analysis for RAG models and certify an upper confidence bound of generation risks, which we refer to as conformal generation risk. We also provide theoretical guarantees on conformal generation risks for general bounded risk functions under test distribution shifts. We prove that RAG achieves a lower conformal generation risk than that of a single LLM when the quality of the retrieval model and transformer is non-trivial. Our intensive empirical results demonstrate the soundness and tightness of our conformal generation risk guarantees across four widely-used NLP datasets on four state-of-the-art retrieval models.
@@ -71,6 +79,8 @@ The paper introduces C-RAG, a framework designed to certify and provide theoreti
**Relevance Score**: 6.5
+**DOI**: [https://doi.org/10.48550/arXiv.2208.03299](https://doi.org/10.48550/arXiv.2208.03299)
+
## Abstract
Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval-augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and Natural Questions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameter model by 3% despite having 50x fewer parameters.
@@ -88,6 +98,8 @@ The paper presents Atlas, a retrieval-augmented language model designed to excel
**Relevance Score**: 7
+**DOI**: [https://doi.org/10.18653/v1/2024.naacl-long.88](https://doi.org/10.18653/v1/2024.naacl-long.88)
+
## Abstract
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration and acceleration of any language models, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62X to 2.36X on code or text generation. The code of REST is available at https://github.com/FasterDecoding/REST.
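
A toy illustration of the retrieval-based drafting idea (not the released implementation): the longest suffix of the current context is matched against a token datastore and the tokens that followed it are proposed as a draft; the parallel verification step of speculative decoding is omitted.

```python
# The datastore would normally be built from a large code/text corpus; here it is one snippet.
datastore = "def add ( a , b ) : return a + b".split()

def retrieve_draft(context_tokens, max_suffix=4, draft_len=3):
    """Propose draft tokens by matching the longest context suffix found in the datastore."""
    for n in range(min(max_suffix, len(context_tokens)), 0, -1):
        suffix = context_tokens[-n:]
        for i in range(len(datastore) - n + 1):
            if datastore[i:i + n] == suffix:
                return datastore[i + n:i + n + draft_len]  # tokens to verify in parallel with the LM
    return []  # no match: fall back to ordinary decoding

print(retrieve_draft("def add ( a , b )".split()))  # -> [':', 'return', 'a']
```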
@@ -107,6 +119,8 @@ This limitation raises questions about how retrieval methodologies can effective
**Relevance Score**: 7
+**DOI**: [http://dx.doi.org/10.48550/arXiv.2005.11401](http://dx.doi.org/10.48550/arXiv.2005.11401)
+
## Abstract
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
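
For reference, the two formulations contrasted in the abstract are usually presented as marginalizations over the top-k retrieved passages z, with retriever p_eta and generator p_theta (notation paraphrased from the paper's standard presentation, not quoted):

```latex
% RAG-Sequence: the same retrieved passages condition the whole output sequence
p_{\text{RAG-Seq}}(y \mid x) \;\approx\; \sum_{z \in \text{top-}k\,p_\eta(\cdot \mid x)}
  p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: a different passage can be marginalised over at every generated token
p_{\text{RAG-Tok}}(y \mid x) \;\approx\; \prod_{i=1}^{N} \sum_{z \in \text{top-}k\,p_\eta(\cdot \mid x)}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
```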
@@ -125,6 +139,8 @@ The paper presents Retrieval-Augmented Generation (RAG), a novel approach that c
**Relevance Score**: 7
+**DOI**: [https://doi.org/10.48550/arXiv.2002.08909](https://doi.org/10.48550/arXiv.2002.08909)
+
## Abstract
Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks such as question answering. However, this knowledge is stored implicitly in the parameters of a neural network, requiring ever-larger networks to cover more facts. To capture knowledge in a more modular and interpretable way, we augment language model pretraining with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents. We demonstrate the effectiveness of Retrieval-Augmented Language Model pretraining (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity.
@@ -146,6 +162,8 @@ The paper details the architecture of REALM, which consists of a neural knowledg
**Relevance Score**: 7
+**DOI**: [https://doi.org/10.18653/v1/2021.eacl-main.74](https://doi.org/10.18653/v1/2021.eacl-main.74)
+
## Abstract
Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires using models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that sequence-to-sequence models offer a flexible framework to efficiently aggregate and combine evidence from multiple passages.
@@ -165,6 +183,8 @@ The paper titled investigates the integration of generative models with passage
**Relevance Score**: 8
+**DOI**: [https://doi.org/10.18653/v1/2021.findings-emnlp.232](https://doi.org/10.18653/v1/2021.findings-emnlp.232)
+
## Abstract
Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers' code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER has a couple of unique aspects. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.
@@ -180,6 +200,8 @@ The paper presents REDCODER, a retrieval-augmented framework designed to enhance
**Relevance Score**: 8
+**DOI**: [https://doi.org/10.48550/arXiv.2207.05987](https://doi.org/10.48550/arXiv.2207.05987)
+
## Abstract
Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus, existing models inherently cannot generalize to using unseen functions and libraries, because these would never appear in the training data. In contrast, when human programmers use functions and libraries for the first time, they frequently refer to textual resources such as code manuals and documentation, to explore and understand the available functionality. Inspired by this observation, we introduce DocPrompting: a natural-language-to-code generation approach that explicitly leverages documentation by (1) retrieving the relevant documentation pieces given an NL intent, and (2) generating code based on the NL intent and the retrieved documentation. DocPrompting is general: it can be applied to any programming language and is agnostic to the underlying neural model. We demonstrate that DocPrompting consistently improves NL-to-code models: DocPrompting improves strong base models such as CodeT5 by 2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based evaluation on the popular Python CoNaLa benchmark; on a new Bash dataset tldr, DocPrompting improves CodeT5 and GPT-Neo1.3B by up to absolute 6.9% exact match.
@@ -200,6 +222,8 @@ The paper introduces a novel approach called DocPrompting, which enhances natura
# Retrieval-Augmented Generation for Large Language Models: A Survey
**Domain**: RAG
+**DOI**: [https://doi.org/10.48550/arXiv.2312.10997](https://doi.org/10.48550/arXiv.2312.10997)
+
## Abstract
Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the generation, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG synergistically merges LLMs' intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the tripartite foundation of RAG frameworks, which includes the retrieval, the generation and the augmentation techniques. The paper highlights the state-of-the-art technologies embedded in each of these critical components, providing a profound understanding of the advancements in RAG systems. Furthermore, this paper introduces an up-to-date evaluation framework and benchmark. At the end, this article delineates the challenges currently faced and points out prospective avenues for research and development.
@@ -211,6 +235,8 @@ The paper provides a comprehensive survey of Retrieval-Augmented Generation (RAG
**Relevance Score**: 7
+**DOI**: [http://dx.doi.org/10.1145/3130348.3130375](http://dx.doi.org/10.1145/3130348.3130375)
+
## Abstract
We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and vector space models together with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.
@@ -227,6 +253,8 @@ The paper presents a novel framework for information retrieval that integrates d
**Relevance Score**: 7
+**DOI**: [https://doi.org/10.48550/ARXIV.2206.02743](https://doi.org/10.48550/ARXIV.2206.02743)
+
## Abstract
Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to be directly optimized for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture, and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrated the superiority of NCI on two commonly used academic benchmarks, achieving +21.4% and +16.8% relative enhancement for Recall@1 on NQ320k dataset and R-Precision on TriviaQA dataset, respectively, compared to the best baseline method.
@@ -243,6 +271,8 @@ The paper presents the Neural Corpus Indexer (NCI), an innovative end-to-end dee
**Relevance Score**: 6
+**DOI**: [https://doi.org/10.18653/v1/2022.emnlp-main.555](https://doi.org/10.18653/v1/2022.emnlp-main.555)
+
## Abstract
Pre-trained language models (PLMs) have shown their effectiveness in multiple scenarios. However, KBQA remains challenging, especially regarding coverage and generalization settings. This is due to two main factors: i) understanding the semantics of both questions and relevant knowledge from the KB; ii) generating executable logical forms with both semantic and syntactic correctness. In this paper, we present a new KBQA model, TIARA, which addresses those issues by applying multi-grained retrieval to help the PLM focus on the most relevant KB contexts, viz., entities, exemplary logical forms, and schema items. Moreover, constrained decoding is used to control the output space and reduce generation errors. Experiments over important benchmarks demonstrate the effectiveness of our approach. TIARA outperforms previous SOTA, including those using PLMs or oracle entity annotations, by at least 4.1 and 1.1 F1 points on GrailQA and WebQuestionsSP, respectively. Specifically on GrailQA, TIARA outperforms previous models in all categories, with an improvement of 4.7 F1 points in zero-shot generalization.
@@ -260,6 +290,8 @@ The experimental results demonstrate that TIARA significantly outperforms previo
**Relevance Score**: 9
+**DOI**: [https://doi.org/10.48550/arXiv.2310.11511](https://doi.org/10.48550/arXiv.2310.11511)
+
## Abstract
Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
@@ -283,6 +315,8 @@ The paper introduces a novel framework called Self-Reflective Retrieval-Augmente
**Relevance Score**: 9
+**DOI**: [https://doi.org/10.18653/v1/2023.acl-long.99](https://doi.org/10.18653/v1/2023.acl-long.99)
+
## Abstract
While dense retrieval has been shown effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
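
A minimal sketch of the HyDE pipeline under stated assumptions: `generate_hypothetical_document` stands in for the instruction-following LLM and `embed` for the unsupervised contrastive encoder (the paper uses InstructGPT and Contriever); only the generate-embed-search flow is illustrated.

```python
import numpy as np

def generate_hypothetical_document(query: str) -> str:
    """Stand-in for the instruction-following LLM: writes a plausible (possibly wrong) answer passage."""
    return f"A passage answering the question: {query}. It may contain invented details."

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding; stand-in for the unsupervised contrastive encoder."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec += np.random.default_rng(abs(hash(tok)) % (2**32)).standard_normal(dim)
    return vec / (np.linalg.norm(vec) + 1e-9)

def hyde_retrieve(query, corpus, k=2):
    hypothetical = generate_hypothetical_document(query)   # step 1: write a fake answer document
    query_vec = embed(hypothetical)                         # step 2: embed the fake document, not the query
    scored = sorted(((float(query_vec @ embed(doc)), doc) for doc in corpus), reverse=True)
    return scored[:k]                                       # step 3: nearest *real* documents

corpus = ["BM25 term weighting explained.", "Contrastive encoders for dense retrieval.", "Cooking pasta at home."]
print(hyde_retrieve("how do dense retrievers work?", corpus))
```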
@@ -298,6 +332,8 @@ The paper presents a novel approach called HyDE (Hypothetical Document Embedding
**Relevance Score**: 8.5
+**DOI**: [https://doi.org/10.48550/arXiv.2401.15884](https://doi.org/10.48550/arXiv.2401.15884)
+
## Abstract
Large language models (LLMs) inevitably exhibit hallucinations since the accuracy of generated texts cannot be secured solely by the parametric knowledge they encapsulate. Although retrieval-augmented generation (RAG) is a practicable complement to LLMs, it relies heavily on the relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong. To this end, we propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation. Specifically, a lightweight retrieval evaluator is designed to assess the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. Since retrieval from static and limited corpora can only return sub-optimal documents, large-scale web searches are utilized as an extension for augmenting the retrieval results. Besides, a decompose-then-recompose algorithm is designed for retrieved documents to selectively focus on key information and filter out irrelevant information in them. CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches. Experiments on four datasets covering short- and long-form generation tasks show that CRAG can significantly improve the performance of RAG-based approaches.
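
A minimal sketch of the confidence-triggered behaviour described above; the thresholds and action names are illustrative assumptions rather than the paper's exact configuration.

```python
def crag_action(retrieval_confidence: float, upper: float = 0.7, lower: float = 0.3) -> str:
    """Map the retrieval evaluator's confidence to a knowledge-retrieval action (illustrative thresholds)."""
    if retrieval_confidence >= upper:
        return "CORRECT"     # keep the retrieved documents; refine them (decompose-then-recompose)
    if retrieval_confidence <= lower:
        return "INCORRECT"   # discard them; fall back to large-scale web search
    return "AMBIGUOUS"       # combine refined documents with web-search results

print([crag_action(c) for c in (0.9, 0.5, 0.1)])  # -> ['CORRECT', 'AMBIGUOUS', 'INCORRECT']
```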
@@ -317,6 +353,8 @@ The paper introduces Corrective Retrieval Augmented Generation (CRAG), a novel a
**Relevance Score**: 7
+**DOI**: [https://doi.org/10.18653/v1/2022.naacl-main.194](https://doi.org/10.18653/v1/2022.naacl-main.194)
+
## Abstract
As demonstrated by GPT-3 and T5, transformers grow in capability as parameter spaces become larger and larger. However, for tasks that require a large amount of knowledge, non-parametric memory allows models to grow dramatically with a sub-linear increase in computational cost and GPU memory requirements. Recent models such as RAG and REALM have introduced retrieval into conditional generation. These models incorporate neural initial retrieval from a corpus of passages. We build on this line of research, proposing Re2G, which combines both neural initial retrieval and reranking into a BART-based sequence-to-sequence generation. Our reranking approach also permits merging retrieval results from sources with incomparable scores, enabling an ensemble of BM25 and neural initial retrieval. To train our system end-to-end, we introduce a novel variation of knowledge distillation to train the initial retrieval, reranker, and generation using only ground truth on the target sequence output. We find large gains in four diverse tasks: zero-shot slot filling, question answering, fact-checking, and dialog, with relative gains of 9% to 34% over the previous state-of-the-art on the KILT leaderboard. We make our code available as open source at https://github.com/IBM/kgi-slot-filling/tree/re2g.
@@ -336,6 +374,8 @@ The paper presents a novel approach called Re2G (Retrieve, Rerank, Generate), wh
**Relevance Score**: 9
+**DOI**: [https://doi.org/10.18653/v1/2023.emnlp-main.495](https://doi.org/10.18653/v1/2023.emnlp-main.495)
+
## Abstract
Despite the remarkable ability of large language models (LMs) to comprehend and generate language, they have a tendency to hallucinate and create factually inaccurate output. Augmenting LMs by retrieving information from external knowledge resources is one promising solution. Most existing retrieval augmented LMs employ a retrieve-and-generate setup that only retrieves information once based on the input. This is limiting, however, in more general scenarios involving generation of long texts, where continually gathering information throughout generation is essential. In this work, we provide a generalized view of active retrieval augmented generation, methods that actively decide when and what to retrieve across the course of the generation. We propose Forward-Looking Active REtrieval augmented generation (FLARE), a generic method which iteratively uses a prediction of the upcoming sentence to anticipate future content, which is then utilized as a query to retrieve relevant documents to regenerate the sentence if it contains low-confidence tokens. We test FLARE along with baselines comprehensively over 4 long-form knowledge-intensive generation tasks/datasets. FLARE achieves superior or competitive performance on all tasks, demonstrating the effectiveness of our method. Code and datasets are available at https://github.com/jzbjyb/FLARE.
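
A minimal sketch of the forward-looking retrieval loop, with `generate_sentence` and `retrieve` as hypothetical stand-ins for the LM and the retriever: a draft of the next sentence is produced, and if any token falls below a confidence threshold the draft itself is used as a query to retrieve documents and the sentence is regenerated.

```python
def generate_sentence(prompt, context=None):
    """Stand-in for the LM: returns (next sentence, per-token confidences)."""
    return "A draft sentence about the topic.", [0.9, 0.4, 0.8, 0.7, 0.6]

def retrieve(query, k=3):
    """Stand-in for the retriever: returns k documents for the query."""
    return [f"doc-{i} for '{query}'" for i in range(k)]

def flare_generate(question, max_sentences=3, threshold=0.5):
    answer = []
    for _ in range(max_sentences):
        draft, confidences = generate_sentence(question + " " + " ".join(answer))
        if min(confidences) < threshold:
            # Low-confidence tokens: use the draft as a forward-looking query,
            # retrieve supporting documents, and regenerate the sentence.
            documents = retrieve(draft)
            draft, _ = generate_sentence(question + " " + " ".join(answer), context=documents)
        answer.append(draft)
    return " ".join(answer)

print(flare_generate("Who founded the university?"))
```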
@@ -356,6 +396,8 @@ The paper presents a novel approach called Forward-Looking Active Retrieval Augm
**Relevance Score**:
+**DOI**: [https://doi.org/10.1145/3673791.3698415](https://doi.org/10.1145/3673791.3698415)
+
## Abstract
Language Models (LMs) memorize a vast amount of factual knowledge, exhibiting strong performance across diverse tasks and domains. However, it has been observed that the performance diminishes when dealing with less-popular or low-frequency concepts and entities, for example in domain specific applications. The two prominent approaches to enhance the performance of LMs on low-frequent topics are: Retrieval Augmented Generation (RAG) and fine-tuning (FT) over synthetic data. This paper explores and evaluates the impact of RAG and FT on customizing LMs in handling low-frequency entities on question answering tasks. We conduct extensive experiments on twelve LMs of varying size and type and different fine tuning, data augmentation, and retrieval models. Our findings indicate that while FT boosts the performance across entities of varying popularity, RAG surpasses FT by a large margin particularly for least popular factual knowledge. Additionally, the success of both RAG and FT approaches is amplified by improving retrieval and data augmentation techniques. Fine tuning, while beneficial for small LMs, requires extensive resources. To address this issue, we propose the new Stimulus RAG approach that surpasses the effectiveness of fine tuning based approaches, thereby eliminating the need for the costly data augmentation and fine tuning step for enriching LMs with less popular factual knowledge. The code is available at https://github.com/informagi/RAGvsFT
@@ -368,3 +410,195 @@ The research emphasizes the importance of customizing LMs for less-resourced dom
- **Resource Intensity of Fine-Tuning**: Fine-tuning methods require significant computational resources and extensive training data, which may not be feasible for all applications, particularly in less-resourced domains.
- **Complexity of Implementation**: The proposed Stimulus RAG (SRAG) method, while effective, may introduce additional complexity in implementation compared to traditional fine-tuning or RAG methods.
+
+# Evaluating Retrieval Quality in Retrieval-Augmented Generation
+
+**Domain**: RAG
+
+**Relevance Score**:
+
+**DOI**: [https://doi.org/10.1145/3626772.3657957](https://doi.org/10.1145/3626772.3657957)
+
+## Abstract
+Evaluating retrieval-augmented generation (RAG) presents challenges, particularly for retrieval models within these systems. Traditional end-to-end evaluation methods are computationally expensive. Furthermore, evaluation of the retrieval model's performance based on query-document relevance labels shows a small correlation with the RAG system's downstream performance. We propose a novel evaluation approach, eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system. The output generated for each document is then evaluated based on the downstream task ground truth labels. In this manner, the downstream performance for each document serves as its relevance label. We employ various downstream task metrics to obtain document-level annotations and aggregate them using set-based or ranking metrics. Extensive experiments on a wide range of datasets demonstrate that eRAG achieves a higher correlation with downstream RAG performance compared to baseline methods, with improvements in Kendall's τ correlation ranging from 0.168 to 0.494. Additionally, eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
+
+## Summary
+The paper introduces a novel evaluation approach called eRAG for assessing retrieval models within Retrieval-Augmented Generation (RAG) systems. Traditional end-to-end evaluation methods are computationally expensive and often fail to correlate well with the downstream performance of RAG systems. eRAG addresses these issues by utilizing a large language model (LLM) to generate document-level relevance labels based on the output produced for each document in the retrieval list. This method not only enhances the correlation with downstream performance—showing improvements in Kendall’s tau correlation ranging from 0.168 to 0.494—but also significantly reduces computational costs, consuming up to 50 times less GPU memory compared to end-to-end evaluations.
+
+The authors conducted extensive experiments across various datasets, demonstrating that eRAG consistently outperforms baseline methods in terms of correlation with the LLM's performance. The findings suggest that eRAG is more efficient in both inference time and memory utilization, making it a promising approach for evaluating retrieval models in RAG systems. The implementation of eRAG is made publicly available to facilitate further research in this domain.
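
A minimal sketch of the eRAG scoring idea under stated assumptions: `llm_answer` and `downstream_metric` are hypothetical stand-ins for the RAG system's LLM and the downstream task metric; each retrieved document is used on its own, and the resulting task score becomes that document's relevance label, which is then aggregated with a set-based or ranking metric.

```python
def llm_answer(query: str, document: str) -> str:
    """Stand-in for the RAG system's LLM, run with a single retrieved document in context."""
    return "Shakespeare" if "Hamlet" in document else "unknown"

def downstream_metric(prediction: str, gold: str) -> float:
    """Stand-in for the downstream task metric (here: exact match)."""
    return float(prediction.strip().lower() == gold.strip().lower())

def erag_labels(query, retrieved_docs, gold_answer):
    # One LLM call per retrieved document; the task score becomes that document's relevance label.
    return [downstream_metric(llm_answer(query, doc), gold_answer) for doc in retrieved_docs]

def precision_at_k(labels, k):
    return sum(labels[:k]) / k  # aggregate document-level labels with a set/ranking metric

docs = ["An article about Hamlet and its author.", "A page about BM25.", "Hamlet study notes."]
labels = erag_labels("Who wrote Hamlet?", docs, "Shakespeare")
print(labels, precision_at_k(labels, 2))  # -> [1.0, 0.0, 1.0] 0.5
```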
+
+
+## Limitations
+- **Dependency on LLM’s Internal Mechanisms**: eRAG evaluates retrieval quality based on the downstream task performance of the LLM. This creates a dependency on the LLM’s internal mechanisms, making it difficult to generalize results across different models. If an LLM processes retrieved documents differently, the evaluation may not accurately reflect retrieval effectiveness.
+
+- **Computational Trade-offs**: Although eRAG improves efficiency over end-to-end evaluation, it still requires multiple passes through the LLM—one per retrieved document. While this reduces GPU memory usage, it increases the number of LLM inferences, which may be a computational burden in large-scale applications.
+
+- **Potential Sensitivity to LLM Size and Architecture**: The correlation analysis shows variability depending on the LLM size (T5-small vs. T5-base) and retrieval augmentation strategy (Fusion-in-Decoder vs. In-Prompt Augmentation). The lack of significant performance differences suggests that eRAG’s reliability across different architectures is not fully established.
+
+# Benchmarking Large Language Models in Retrieval-Augmented Generation
+
+**Domain**: RAG
+
+**Relevance Score**:
+
+**DOI**: [https://doi.org/10.1609/aaai.v38i16.29728](https://doi.org/10.1609/aaai.v38i16.29728)
+
+## Abstract
+Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
+
+## Summary
+The paper investigates the effectiveness of Retrieval-Augmented Generation (RAG) in enhancing the performance of large language models (LLMs) while addressing challenges such as factual hallucination and outdated knowledge. The authors establish a new benchmark, the Retrieval-Augmented Generation Benchmark (RGB), which evaluates LLMs on four fundamental abilities: noise robustness, negative rejection, information integration, and counterfactual robustness. The benchmark consists of instances generated from the latest news articles and external documents retrieved via search engines, allowing for a comprehensive assessment of LLMs' capabilities in utilizing retrieved information.
+
+The evaluation of six state-of-the-art LLMs reveals that while RAG can improve response accuracy, significant limitations remain in the models' ability to handle noise, reject irrelevant information, integrate data from multiple sources, and identify factual errors in retrieved documents. The findings indicate that LLMs often struggle with noise confusion, fail to reject inappropriate answers, and lack the ability to effectively summarize information from various documents. The authors emphasize the need for further advancements in RAG methodologies to ensure reliable and accurate responses from LLMs, highlighting the importance of careful design and evaluation in the application of RAG techniques.
+
+## Limitations
+- **Noise Confusion**: LLMs exhibit difficulty in distinguishing relevant information from noisy documents, leading to inaccurate answers when similar but incorrect information is present.
+
+- **Negative Rejection Challenges**: The models often fail to reject questions when no relevant information is available in the retrieved documents, resulting in misleading or incorrect responses.
+
+- **Limited Understanding of Complex Queries**: The models show a lack of capability in comprehending and addressing complex questions, which can lead to merging errors, ignoring parts of the question, or misalignment in responses.
+
+# How Much Knowledge Can You Pack Into the Parameters of a Language Model?
+**Domain**: Foundation
+
+**Relevance Score**:
+
+**DOI**: [https://doi.org/10.18653/v1/2020.emnlp-main.437](https://doi.org/10.18653/v1/2020.emnlp-main.437)
+
+## Abstract
+It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the practical utility of this approach by fine-tuning pre-trained models to answer questions without access to any external context or knowledge. We show that this approach scales with model size and performs competitively with open-domain systems that explicitly retrieve answers from an external knowledge source when answering questions. To facilitate reproducibility and future work, we release our code and trained models.
+
+## Summary
+The paper explores the extent to which pre-trained language models can store and retrieve knowledge without relying on external sources. The authors fine-tune pre-trained models, specifically variants of the Text-to-Text Transfer Transformer (T5), to perform closed-book question answering—answering factual questions without accessing external knowledge bases. Their experiments demonstrate that model performance improves with increasing model size, with the largest model (T5-11B) performing competitively with open-domain systems that explicitly retrieve information. They also investigate whether additional pre-training using techniques such as salient span masking (SSM) enhances knowledge retention.
+
+The study highlights the trade-offs of storing knowledge within model parameters, noting that while closed-book models can achieve high accuracy, they lack transparency and control over what knowledge is stored. The authors identify challenges, such as the inability to update knowledge post-training and the tendency for models to generate hallucinated answers when uncertain. They also perform human evaluations to assess how well automated metrics capture correctness, revealing that many answers marked incorrect were actually valid. The findings suggest that large-scale language models can serve as implicit knowledge repositories but raise questions about their reliability, interpretability, and efficiency compared to retrieval-based approaches.
+
+
+## Limitations
+- **Lack of Knowledge Updating Mechanisms**: One of the most critical limitations is that once a model is trained, its internalized knowledge cannot be easily updated. Unlike retrieval-based systems, where knowledge is dynamically fetched from external sources, a closed-book model requires costly retraining to incorporate new information, making it impractical for domains requiring frequent updates, such as current events or scientific discoveries.
+
+- **Interpretability and Explainability Issues**: The study does not address how knowledge is stored or retrieved within model parameters. This opacity limits the ability to verify correctness, trace the source of errors, or understand the reasoning behind an answer. In contrast, retrieval-based systems provide explicit sources that can be inspected and validated.
+
+- **Hallucination of Incorrect but Plausible Answers**: The paper acknowledges that models sometimes generate answers that sound plausible but are incorrect, particularly when they lack the necessary knowledge. This poses risks in high-stakes applications like medical or legal domains, where misinformation could have severe consequences.
+
+- **Overestimation of Performance Due to Dataset Bias**: The evaluation datasets (e.g., Natural Questions, TriviaQA) focus largely on factoid-style questions, which may not represent the complexity of real-world information needs. The study does not explore how well the models handle multi-step reasoning, nuanced interpretation, or ambiguous queries, which are common in practical applications.
+
+# Retrieval-Enhanced Machine Learning
+**Domain**: Retrieval
+
+**Relevance Score**:
+
+**DOI**: [https://doi.org/10.1145/3477495.3531722](https://doi.org/10.1145/3477495.3531722)
+
+## Abstract
+Although information access systems have long supported people in accomplishing a wide range of tasks, we propose broadening the scope of users of information access systems to include task-driven machines, such as machine learning models. In this way, the core principles of indexing, representation, retrieval, and ranking can be applied and extended to substantially improve model generalization, scalability, robustness, and interpretability. We describe a generic retrieval-enhanced machine learning (REML) framework, which includes a number of existing models as special cases. REML challenges information retrieval conventions, presenting opportunities for novel advances in core areas, including optimization. The REML research agenda lays a foundation for a new style of information access research and paves a path towards advancing machine learning and artificial intelligence.
+
+## Summary
+The paper introduces the concept of Retrieval-Enhanced Machine Learning (REML), which aims to improve machine learning models by integrating information retrieval (IR) techniques. Traditional machine learning systems often rely on large parameter sizes to encode knowledge, which can be costly and unsustainable. REML proposes a framework where machine learning models can access external information repositories, allowing them to decouple reasoning from memory. This approach enhances model generalization, scalability, robustness, and interpretability by leveraging efficient retrieval methods to access relevant information dynamically during the prediction process.
+
+The authors outline the core principles of REML, including querying, retrieval, and response utilization, and categorize models based on their capabilities, such as storing information and providing feedback to the retrieval system. They discuss the potential applications of REML in various domains, including generalization, scalability, and interpretability, while also addressing challenges in optimizing the interaction between prediction and retrieval models. The paper concludes by emphasizing the need for further research to fully realize the potential of REML in advancing machine learning and artificial intelligence.
+
+## Limitations
+- **Feedback Mechanism Limitations**: The paper discusses the potential for feedback from prediction models to improve retrieval systems. However, the effectiveness of this feedback loop may vary, and establishing a reliable feedback mechanism can be difficult.
+
+- **Limited Exploration of Querying Strategies**: The paper identifies querying as a core research question but does not delve deeply into the various strategies for effective querying, which could limit the practical application of REML.
+
+# Can Knowledge Graphs Reduce Hallucinations in LLMs?
+**Domain**: Knowledge Graph
+
+**Relevance Score**:
+
+**DOI**: [https://doi.org/10.18653/v1/2024.naacl-long.219](https://doi.org/10.18653/v1/2024.naacl-long.219)
+
+## Abstract
+The contemporary LLMs are prone to producing hallucinations, stemming mainly from the knowledge gaps within the models. To address this critical limitation, researchers employ diverse strategies to augment the LLMs by incorporating external knowledge, aiming to reduce hallucinations and enhance reasoning accuracy. Among these strategies, leveraging knowledge graphs as a source of external information has demonstrated promising results. In this survey, we comprehensively review these knowledge-graph-based augmentation techniques in LLMs, focusing on their efficacy in mitigating hallucinations. We systematically categorize these methods into three overarching groups, offering methodological comparisons and performance evaluations. Lastly, this survey explores the current trends and challenges associated with these techniques and outlines potential avenues for future research in this emerging field.
+
+## Summary
+The paper explores the integration of knowledge graphs (KGs) into large language models (LLMs) to mitigate the issue of hallucinations—outputs that sound plausible but are often incorrect or irrelevant. The authors categorize various knowledge-graph-based augmentation techniques into three main groups: Knowledge-Aware Inference, Knowledge-Aware Learning, and Knowledge-Aware Validation. Each category encompasses methods that enhance the reasoning capabilities of LLMs by improving their inference processes, optimizing learning mechanisms, and validating generated outputs against structured knowledge.
+
+The survey highlights the effectiveness of these techniques in enhancing the reliability and performance of LLMs across different applications, while also discussing current trends, challenges, and future research directions in the field. The authors emphasize the importance of providing precise and contextually relevant external knowledge to improve LLMs' understanding and reasoning, ultimately aiming to create more trustworthy AI systems.
+
+## Limitations
+- **Open Research Questions**: The paper highlights ongoing challenges, such as the extent to which updated knowledge can be integrated into models and the fundamental question of whether neural networks genuinely engage in reasoning, indicating areas that require further investigation.
+
+# Retrieval Augmentation Reduces Hallucination in Conversation
+**Domain**: RAG
+
+**Relevance Score**:
+
+**DOI**: [https://doi.org/10.18653/v1/2021.findings-emnlp.320](https://doi.org/10.18653/v1/2021.findings-emnlp.320)
+
+## Abstract
+Despite showing increasingly human-like conversational abilities, state-of-the-art dialogue models often suffer from factual incorrectness and hallucination of knowledge (Roller et al., 2021). In this work we explore the use of neural-retrieval-in-the-loop architectures - recently shown to be effective in open-domain QA (Lewis et al., 2020b; Izacard and Grave, 2021b) - for knowledge-grounded dialogue, a task that is arguably more challenging as it requires querying based on complex multi-turn dialogue context and generating conversationally coherent responses. We study various types of architectures with multiple components – retrievers, rankers, and encoder-decoders – with the goal of maximizing knowledgeability while retaining conversational ability. We demonstrate that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks. The models exhibit open-domain conversational capabilities, generalize effectively to scenarios not within the training data, and, as verified by human evaluations, substantially reduce the well-known problem of knowledge hallucination in state-of-the-art chatbots.
+
+## Summary
+The paper explores the challenges faced by state-of-the-art dialogue models, particularly the issues of factual inaccuracy and knowledge hallucination. The authors propose the use of neural-retrieval-in-the-loop architectures, specifically retrieval-augmented generation (RAG), to enhance knowledge-grounded dialogue systems. By integrating retrievers, rankers, and encoder-decoder models, the study demonstrates that these architectures can significantly improve the factual accuracy of conversational agents while maintaining their conversational fluency. The results show that the best-performing models achieve state-of-the-art performance on knowledge-grounded conversational tasks, effectively reducing hallucinated responses by over 60% and improving generalization to unseen topics.
+
+The paper also emphasizes the importance of using appropriate evaluation metrics, such as Knowledge F1, to assess the models' performance in terms of knowledge utilization and hallucination reduction. Through extensive experiments on datasets like Wizard of Wikipedia and CMU Document Grounded Conversations, the authors highlight that retrieval-augmented models not only outperform traditional models but also exhibit better consistency and engagement in conversations. The findings suggest that retrieval-augmented approaches are a promising solution to the hallucination problem in dialogue systems, paving the way for future research in this area.
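
As a rough illustration of the Knowledge F1 metric mentioned above, the sketch below computes token-overlap F1 between a generated response and the gold knowledge passage; this is an assumed, simplified reading of such overlap metrics rather than the paper's exact implementation.

```python
from collections import Counter

def knowledge_f1(response: str, knowledge: str) -> float:
    """Unigram-overlap F1 between a generated response and the gold knowledge text."""
    pred, gold = response.lower().split(), knowledge.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(round(knowledge_f1("The Eiffel Tower is 330 metres tall.",
                         "The Eiffel Tower is a wrought-iron tower about 330 metres high."), 3))
```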
+
+## Limitations
+- **Complexity of Multi-Turn Dialogue**: The paper acknowledges that knowledge-grounded dialogue is inherently more complex than single-turn question answering. The models may struggle with maintaining coherence and relevance across multiple turns of conversation, especially when the dialogue context is lengthy.
+
+- **Hallucination with Increased Documents**: While the models significantly reduce hallucination, the paper notes that increasing the number of retrieved documents can lead to higher levels of hallucination in some cases. This suggests a trade-off between knowledge utilization and the risk of generating incorrect information.
\ No newline at end of file