# A Neural Corpus Indexer for Document Retrieval
The Neural Corpus Indexer (NCI) is a novel end-to-end deep neural network framework designed to enhance recall performance by unifying the training and indexing stages, utilizing a sequence-to-sequence architecture to generate relevant document identifiers directly from user queries (Wang et al., 2022). Key innovations include the Prefix-Aware Weight-Adaptive (PAWA) Decoder, which provides distinct embeddings for tokens based on their positions to capture semantic nuances; a query generation network that creates diverse query-document pairs for improved semantic understanding; and the use of hierarchical k-means to encode documents into semantic identifiers that reflect their content. Empirical evaluations show that NCI significantly outperforms traditional methods, achieving a +21.4% relative enhancement for Recall@1 on the NQ320k dataset and a +16.8% improvement for R-Precision on the TriviaQA dataset, highlighting its effectiveness in optimizing retrieval performance through realistic query-document pairs and tailored architectural components (Wang et al., 2022).
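As a rough illustration of the semantic-identifier construction, the sketch below recursively applies k-means to document embeddings so that the path of cluster indices becomes each document's identifier; the branching factor, leaf size, and helper names are illustrative choices rather than NCI's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_ids(embeddings, k=8, max_leaf=8, prefix=()):
    """Recursively cluster document embeddings with k-means; the path of
    cluster indices from root to leaf becomes a document's semantic ID."""
    n = len(embeddings)
    if n <= max_leaf:
        # Leaf level: distinguish the remaining documents by their position.
        return {i: prefix + (i,) for i in range(n)}
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    ids = {}
    for c in range(k):
        members = np.where(labels == c)[0]
        sub = hierarchical_ids(embeddings[members], k, max_leaf, prefix + (c,))
        for local_idx, sem_id in sub.items():
            ids[members[local_idx]] = sem_id
    return ids

# Toy usage: 100 documents with 32-dimensional embeddings.
doc_ids = hierarchical_ids(np.random.randn(100, 32))
```

Because semantically similar documents share identifier prefixes under this scheme, a sequence-to-sequence decoder can generate the identifiers token by token.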
# Active Retrieval Augmented Generation
Active Retrieval Augmented Generation (ARAG) is an innovative framework that enhances the capabilities of language models (LMs) by integrating dynamic retrieval mechanisms, allowing for continuous information gathering during text generation. A notable implementation, Forward-Looking Active Retrieval Augmented Generation (FLARE), proposed by Jiang et al. (2023), iteratively generates predictions for upcoming sentences and retrieves relevant documents when low-confidence tokens are detected, significantly improving performance in knowledge-intensive tasks. Unlike traditional single-time retrieval methods, which are effective for short-form tasks but inadequate for long-form generation, ARAG frameworks enable more dynamic and context-aware information retrieval. Experimental results demonstrate that ARAG consistently outperforms traditional methods across various tasks, including multihop question answering and open-domain summarization. Future research may focus on refining retrieval strategies and exploring broader applications of ARAG in natural language processing.
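A minimal sketch of the forward-looking retrieval loop described above, assuming hypothetical `lm.generate_sentence` (returning a draft sentence plus per-token probabilities) and `retriever.search` helpers; the confidence threshold is illustrative.

```python
def flare_generate(question, lm, retriever, max_sentences=10, conf_threshold=0.4):
    """Draft the next sentence; if any token falls below the confidence threshold,
    retrieve with the draft as the query and regenerate the sentence."""
    answer = ""
    docs = retriever.search(question)            # initial retrieval on the user query
    for _ in range(max_sentences):
        draft, token_probs = lm.generate_sentence(question, docs, answer)
        if not draft:                            # the model chose to stop generating
            break
        if min(token_probs) < conf_threshold:
            docs = retriever.search(draft)       # forward-looking retrieval on the draft
            draft, _ = lm.generate_sentence(question, docs, answer)
        answer += draft
    return answer
```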
# Atlas few-shot learning with retrieval augmented language models
Atlas is a retrieval-augmented language model specifically designed for few-shot learning, employing a dual-encoder architecture for document retrieval and a sequence-to-sequence model for output generation. It achieves competitive performance with significantly fewer parameters than other state-of-the-art models, demonstrating strong few-shot learning capabilities by attaining over 42% accuracy on the Natural Questions dataset with only 64 training examples, surpassing larger models like PaLM (540B parameters) by 3%. The architecture of Atlas facilitates easy updates to the document index, enhancing its adaptability to new information, and it has been evaluated across various benchmarks, including MMLU, KILT, and Natural Questions, confirming its effectiveness in both few-shot and resource-rich settings. The training process involves joint pre-training of the retriever and language model, which is essential for its few-shot performance, and various loss functions and pretext tasks are explored to optimize the model's capabilities, with extensive experiments, including ablation studies, underscoring the significance of retrieval in improving few-shot learning.
# Benchmarking Large Language Models in Retrieval-Augmented Generation
Benchmarking is essential for evaluating the performance of large language models (LLMs) in Retrieval-Augmented Generation (RAG) settings, as it identifies their strengths and weaknesses in effectively utilizing retrieved information, addressing a significant gap in existing research due to the lack of rigorous evaluation frameworks (Chen et al., 2024). This study evaluates six state-of-the-art LLMs—ChatGPT (OpenAI, 2022), ChatGLM-6B (THUDM, 2023a), ChatGLM2-6B (THUDM, 2023b), Vicuna-7B (Chiang et al., 2023), Qwen-7B-Chat (Bai et al., 2023), and BELLE-7B (BELLEGroup, 2023)—using the Retrieval-Augmented Generation Benchmark (RGB) and employs various metrics, including accuracy, rejection rate, error detection rate, and error correction rate, to assess their capabilities. The results reveal a notable decline in accuracy as noise ratios increase, with ChatGPT's accuracy dropping from 96.33% to 76.00% under varying noise conditions, while LLMs exhibited challenges with long-distance information, evidence uncertainty, and concept confusion. Additionally, the rejection rates were low, with a maximum of 45% for English and 43.33% for Chinese, indicating that LLMs often failed to reject irrelevant information and did not consistently adhere to rejection instructions. Furthermore, the models demonstrated weak performance in integrating information from multiple documents, achieving only 60% accuracy in English and 67% in Chinese without noise, which dropped to 43% and 55% with noise, respectively. Lastly, LLMs struggled to detect and correct factual errors in retrieved documents, often relying on misleading information, highlighting the need for further improvements in their capabilities within RAG contexts (Chen et al., 2024; Guu et al., 2020; Lewis et al., 2020).
# C-RAG certified generation risks for retrieval-augmented language models
C-RAG (Certified Generation Risks for Retrieval-Augmented Language Models) is an innovative framework that certifies generation risks in retrieval-augmented generation (RAG) models by employing conformal risk analysis to establish a high-probability upper bound on generation risks, known as "conformal generation risk." This framework not only certifies risks associated with specific RAG configurations but also identifies valid configurations that maintain generation risks below a desired threshold. The theoretical foundation of C-RAG is based on conformal prediction methods, which ensure coverage for prediction sets (Vovk et al., 1999; 2005), and it extends these methods to handle bounded risk functions under test-time distribution shifts, thereby filling a significant gap in the literature. Empirical validation of C-RAG has been conducted across four widely-used NLP datasets—AESLC, CommonGen, DART, and E2E—demonstrating its soundness and tightness through extensive evaluations with various retrieval models, including BM25, BAAI/bge, and OpenAI/ada. The results consistently show that C-RAG achieves lower conformal generation risks compared to LLMs without retrieval, thereby reinforcing its theoretical contributions (Kang et al., 2024; Lewis et al., 2020; Bates et al., 2021).
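As a rough illustration of the conformal machinery involved (the generic conformal risk control recipe rather than C-RAG's exact constrained-generation bound), suppose a generation risk $\ell_\lambda \in [0, B]$ is monotone in a RAG configuration parameter $\lambda$ and $\hat{R}_n(\lambda)$ is its empirical value on $n$ calibration examples; one then selects

$$
\hat{\lambda} \;=\; \inf\Big\{\lambda :\; \tfrac{n}{n+1}\,\hat{R}_n(\lambda) + \tfrac{B}{n+1} \;\le\; \alpha\Big\},
$$

which keeps the expected generation risk at $\hat{\lambda}$ below $\alpha$ on exchangeable test data; C-RAG extends guarantees of this flavor to retrieval-augmented generation under test-time distribution shifts.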
# Can Knowledge Graphs Reduce Hallucinations in LLMs A Survey
The paper by Agrawal et al. (2024) investigates the integration of Knowledge Graphs (KGs) into Large Language Models (LLMs) to address the issue of hallucinations—outputs that appear plausible but are often incorrect or irrelevant. The authors categorize augmentation methods into three groups: Knowledge-Aware Inference, which enhances the inference process by incorporating KGs; Knowledge-Aware Learning, which improves training through pre-training and fine-tuning with KGs; and Knowledge-Aware Validation, which employs KGs for fact-checking outputs. Research indicates that smaller LLMs can significantly improve performance by augmenting their knowledge with KGs, achieving over 80% enhancement in answer correctness for question-answering tasks. Larger models benefit from Chain-of-Thought methodologies, with methods like IRCoT increasing accuracy from 66.8% to 85.7% in reasoning tasks. Knowledge-controlled generation methods have also shown superior performance in accuracy and contextual relevance, although they may produce incorrect outputs, necessitating further refinement. While pre-training and fine-tuning with KGs enhance domain-specific performance, they are resource-intensive and may limit transferability across tasks. Additionally, fact-checking mechanisms using KGs effectively reduce hallucinations but can increase computational load, indicating a need for ongoing research to optimize these techniques.
# Dense Passage Retrieval for Open-Domain Question Answering
Dense Passage Retrieval (DPR) is an advanced method that leverages dense vector representations to enhance the efficiency and accuracy of passage retrieval in open-domain question answering (QA) systems, overcoming the limitations of traditional sparse methods like TF-IDF and BM25, which struggle with semantic matching (Karpukhin et al., 2020). By employing a dual-encoder framework, DPR encodes both questions and passages into dense vectors, resulting in significant improvements in retrieval accuracy; for instance, it achieves a top-20 accuracy of 78.4% on the Natural Questions dataset, compared to BM25's 59.1% (Karpukhin et al., 2020). Furthermore, DPR demonstrates high performance even with limited training data, outperforming BM25 with as few as 1,000 examples, and benefits from in-batch negative training to enhance its discriminative capabilities (Karpukhin et al., 2020). The model also exhibits robust generalization across various datasets, maintaining strong performance without extensive fine-tuning, and effectively captures semantic relationships, retrieving passages with synonyms or paraphrases (Karpukhin et al., 2020). However, there are instances where BM25 outperforms DPR, particularly when salient phrases are crucial, indicating a need for further refinement in DPR's ability to prioritize significant keywords (Karpukhin et al., 2020).
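A minimal PyTorch sketch of the in-batch negative objective described above; the question and passage vectors stand in for the outputs of DPR's two BERT encoders.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_vecs, p_vecs):
    """q_vecs, p_vecs: [batch, dim] dense vectors from the question and passage
    encoders; row i of p_vecs is the positive passage for question i, and every
    other passage in the batch serves as a negative."""
    scores = q_vecs @ p_vecs.T                                # [batch, batch] dot-product similarities
    targets = torch.arange(q_vecs.size(0), device=q_vecs.device)
    return F.cross_entropy(scores, targets)                   # softmax over the batch's passages

# Toy usage: a batch of 4 question-passage pairs with 768-dimensional vectors.
loss = in_batch_negative_loss(torch.randn(4, 768), torch.randn(4, 768))
```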
# DocPrompting Generating Code by Retrieving the Docs
DocPrompting augments natural-language-to-code generation by retrieving relevant documentation and conditioning the generator on it. In contrast to previous approaches that primarily retrieve NL-code pairs, documentation is readily available even for newly released libraries, leading to superior generalization and accuracy compared to traditional methods. Zhou et al. (2023) demonstrate the effectiveness of DocPrompting through extensive experiments on benchmarks such as the CoNaLa dataset for Python and a newly curated Bash dataset, revealing significant performance improvements in models like CodeT5 and GPT-Neo, with a 52% relative gain in pass@1 and a 30% relative gain in pass@10 on CoNaLa. These results underscore the potential of documentation retrieval to enhance the accuracy and generalization of code generation models.
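A minimal sketch of this retrieve-then-generate pattern, with hypothetical `doc_retriever` and `code_lm` interfaces standing in for DocPrompting's retriever and generator:

```python
def docprompting_generate(nl_intent, doc_retriever, code_lm, k=5):
    """Retrieve top-k documentation snippets for the natural-language intent and
    prepend them to the prompt so the generator can ground its output in
    signatures and usage examples it may never have seen during training."""
    docs = doc_retriever.search(nl_intent, k)
    prompt = "\n".join(docs) + f"\n# Task: {nl_intent}\n"
    return code_lm.generate(prompt)
```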
# Document Language Models, Query Models, and Risk Minimization for Information Retrieval
This literature review examines the advancements in Information Retrieval (IR) through the lens of Document Language Models (DLMs), Query Models (QMs), and risk minimization frameworks based on Bayesian decision theory. Pioneering work by Ponte and Croft (1998) introduced unigram language models for document representation, while Berger and Lafferty (1999) enhanced DLMs by incorporating statistical machine translation techniques to address synonymy. The integration of QMs with DLMs, particularly through Markov chains as proposed by Lafferty and Zhai (2001), has significantly improved retrieval performance, especially for short queries. The risk minimization framework formalizes the retrieval process as a decision-making problem aimed at minimizing expected loss, leading to better retrieval outcomes by focusing on the probability of relevance. Empirical evaluations, particularly on TREC collections, have demonstrated the effectiveness of these language modeling approaches compared to traditional vector space models, indicating substantial improvements in retrieval performance. Overall, the synthesis of DLMs, QMs, and risk minimization strategies marks a significant advancement in the field of IR, with future research poised to refine these models and explore their applications in diverse contexts (Berger & Lafferty, 1999; Lafferty & Zhai, 2001; Ponte & Croft, 1998).
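A minimal sketch of unigram query-likelihood scoring in the spirit of these document language models, using Jelinek-Mercer smoothing against the collection model; the smoothing weight is illustrative.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_counts, collection_len, lam=0.7):
    """Score log P(Q | D) under a unigram document language model, linearly
    interpolated with the collection model (Jelinek-Mercer smoothing) so that
    unseen query terms do not zero out the score; collection_counts maps a term
    to its total frequency in the corpus."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        p_coll = collection_counts.get(t, 0) / collection_len
        score += math.log(lam * p_doc + (1 - lam) * p_coll + 1e-12)
    return score
```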
# Evaluating Retrieval Quality in Retrieval-Augmented Generation
The paper introduces eRAG, a novel evaluation method that utilizes the large language model (LLM) within Retrieval-Augmented Generation (RAG) systems to generate document-level relevance labels based on downstream task performance, demonstrating a marked improvement in correlating retrieval quality with downstream performance, as evidenced by enhancements in Kendall’s tau correlation ranging from 0.168 to 0.494 (Salemi & Zamani, 2024). eRAG significantly outperforms traditional evaluation methods, such as human judgment and KILT Provenance, which often yield low correlation with actual RAG performance and are limited by cost and practicality (Zamani & Bendersky, 2022; Petroni et al., 2021). Furthermore, eRAG exhibits remarkable computational efficiency, consuming up to 50 times less memory and providing an average speedup of 2.468 times compared to end-to-end evaluation methods, thereby facilitating quicker iterations in model development and evaluation (Lewis et al., 2020). This study highlights the limitations of conventional evaluation approaches, which often lack transparency and fail to provide a comprehensive understanding of retrieval quality, complicating the optimization of retrieval models (Agrawal et al., 2023; Shuster et al., 2021).
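A minimal sketch of the per-document evaluation idea: each retrieved document is passed to the downstream LLM on its own, and the downstream metric on that single-document run becomes the document's relevance label; `llm_answer` and `task_metric` are hypothetical helpers, and the actual eRAG aggregates these labels with standard ranking metrics.

```python
def erag_labels(query, retrieved_docs, gold_answer, llm_answer, task_metric):
    """Label each retrieved document by how well the downstream LLM performs when
    conditioned on that document alone (higher metric means more relevant)."""
    labels = []
    for doc in retrieved_docs:
        prediction = llm_answer(query, [doc])    # run the generator with a single document
        labels.append(task_metric(prediction, gold_answer))
    return labels
```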
# Fine Tuning vs Retrieval Augmented Generation for Less Popular Knowledge
Fine Tuning (FT) involves adjusting model weights to enhance the recall of specific information relevant to a domain, making it particularly useful when domain-specific data is scarce. While FT has been shown to improve the performance of language models (LMs), especially smaller ones, it requires significant computational resources and training data. Techniques like Parameter Efficient Fine-Tuning (PEFT), such as QLoRA, help maintain reasoning capabilities while integrating new knowledge, with the quality of synthetic training data playing a crucial role in effectiveness. In contrast, Retrieval Augmented Generation (RAG) combines retrieval mechanisms with generative models, allowing LMs to dynamically access external knowledge bases, and has consistently outperformed FT, particularly for less popular knowledge. The success of RAG is dependent on the effectiveness of the retrieval models used, with advanced techniques enhancing performance. Comparative studies indicate that RAG achieves higher accuracy for low-frequency entities, while FT can still provide improvements in certain contexts; however, combining FT and RAG yields the best results for smaller models, whereas larger models may suffer from performance degradation due to potential reasoning capability loss during fine-tuning.
# How Much Knowledge Can You Pack Into the Parameters of a Language Model
Roberts et al. (2020) utilized a fine-tuning methodology on three open-domain question answering datasets—Natural Questions, WebQuestions, and TriviaQA—contrasting closed-book question answering, which relies solely on internalized knowledge, with traditional open-book systems that access external knowledge sources. They introduced salient span masking (SSM) as a pre-training objective, positing that it would improve the model's information retrieval capabilities. The experimental results demonstrated that larger models consistently outperformed smaller ones across all datasets, with the integration of SSM during pre-training leading to significant performance enhancements, highlighting the critical role of task-specific pre-training objectives in optimizing knowledge retrieval.
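A minimal sketch of salient span masking, using spaCy's named-entity recognizer as a stand-in for the tagger used in the original work (which relies on a BERT-based NER model plus date patterns):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline with an NER component

def mask_salient_spans(text, mask_token="<mask>"):
    """Replace named-entity spans (people, places, dates, ...) with a mask token,
    yielding a pre-training example that forces the model to recall the entity."""
    doc = nlp(text)
    masked, cursor = [], 0
    for ent in doc.ents:
        masked.append(text[cursor:ent.start_char])
        masked.append(mask_token)
        cursor = ent.end_char
    masked.append(text[cursor:])
    return "".join(masked)

print(mask_salient_spans("Marie Curie was born in Warsaw in 1867."))
# e.g. "<mask> was born in <mask> in <mask>."
```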
# Learning Transferable Visual Models From Natural Language Supervision
The paper by Radford et al. (2021) introduces CLIP (Contrastive Language-Image Pre-training), a model designed to associate images with their corresponding textual descriptions, trained on a dataset of 400 million (image, text) pairs, enabling it to learn a diverse range of visual concepts. Utilizing a contrastive learning framework, CLIP maximizes the cosine similarity between embeddings of paired images and texts while minimizing it for non-paired examples, employing a joint architecture of an image encoder (ResNet or Vision Transformer) and a text encoder (Transformer). The results demonstrate that CLIP achieves zero-shot performance that is competitive with fully supervised models across various benchmarks, including ImageNet, and shows robustness to natural distribution shifts, indicating its potential for real-world applications. The authors provide a thorough analysis of CLIP's performance, emphasizing its strengths and identifying areas for further improvement (Radford et al., 2021; Brown et al., 2020; Deng et al., 2009).
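A minimal PyTorch sketch of the symmetric contrastive objective, adapted from the pseudocode in the paper; the temperature value here is illustrative, and the embeddings stand in for the outputs of the image and text encoders.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: [batch, dim] embeddings of paired images and captions.
    Matched pairs lie on the diagonal of the similarity matrix; the loss pushes
    their cosine similarity up and the off-diagonal similarities down, in both
    the image-to-text and text-to-image directions."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)       # image -> text
    loss_t = F.cross_entropy(logits.T, targets)     # text -> image
    return (loss_i + loss_t) / 2
```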
# Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
Generative models like T5 and BART have shown competitive performance in Open Domain Question Answering (ODQA) by generating answers from input questions and retrieved passages, with Roberts et al. (2020) introducing a model that operates without external knowledge, paving the way for further research. Passage retrieval is essential in ODQA, involving the extraction of relevant text passages from knowledge bases such as Wikipedia, utilizing traditional sparse representations like TF-IDF and more recent dense representations through Dense Passage Retrieval (DPR), which enhance retrieval accuracy. The Fusion-in-Decoder approach by Izacard and Grave (2021) effectively combines generative models with passage retrieval by independently processing multiple passages in the encoder and aggregating evidence in the decoder, thus improving the model's answer generation capabilities. The method has achieved state-of-the-art results on benchmarks like Natural Questions and TriviaQA, with performance metrics such as Exact Match (EM) scores demonstrating significant improvements as the number of retrieved passages increases, providing a solid framework for evaluating model accuracy.
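A minimal sketch of the Fusion-in-Decoder pattern: each (question, passage) pair is encoded independently, the encoder states are concatenated, and a single decoder attends over all of them; `encode_fn` and `decode_fn` are hypothetical stand-ins for the encoder and decoder halves of a T5-style model.

```python
import torch

def fusion_in_decoder(question, passages, encode_fn, decode_fn):
    """encode_fn(text) -> [1, seq_len, dim] encoder states for one passage;
    decode_fn(states) -> generated answer, using cross-attention over the fused
    states, so evidence from all passages is aggregated only in the decoder."""
    encoded = [encode_fn(f"question: {question} context: {p}") for p in passages]
    fused = torch.cat(encoded, dim=1)    # [1, total length across passages, dim]
    return decode_fn(fused)
```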
# Precise Zero-Shot Dense Retrieval without Relevance Labels
Gao et al. (2023) introduce HyDE, a two-step retrieval process that employs instruction-following language models like InstructGPT to generate hypothetical documents based on user queries, which encapsulate relevance patterns for retrieving actual documents from a corpus. The methodology consists of generating these hypothetical documents in the first step and encoding them using unsupervised contrastive learning methods, such as Contriever, in the second step to filter out irrelevant content and facilitate the retrieval of real documents. This innovative approach enables effective retrieval without the need for relevance labels, making it suitable for various tasks, including web search, question answering, and fact verification. Experimental results demonstrate that HyDE significantly outperforms existing unsupervised dense retrieval models and shows competitive performance against fine-tuned models across multiple tasks and languages, underscoring its potential as a robust solution for zero-shot retrieval scenarios.
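A minimal sketch of the two-step HyDE procedure, with hypothetical `instruct_lm` and `contriever_encode` helpers and a plain inner-product scan standing in for the corpus index.

```python
import numpy as np

def hyde_search(query, instruct_lm, contriever_encode, doc_vecs, k=10, n_hypo=4):
    """Generate hypothetical answer passages for the query, encode them together
    with the query, average the vectors, and retrieve the nearest real documents
    by inner product; doc_vecs is a [num_docs, dim] matrix of corpus embeddings."""
    hypo_docs = [instruct_lm(f"Write a passage that answers: {query}") for _ in range(n_hypo)]
    vecs = [contriever_encode(d) for d in hypo_docs] + [contriever_encode(query)]
    hyde_vec = np.mean(vecs, axis=0)
    scores = doc_vecs @ hyde_vec
    return np.argsort(-scores)[:k]       # indices of the top-k real documents
```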
# Re2G Retrieve, Rerank, Generate
Recent advancements in retrieval-augmented models, such as RAG (Retrieval-Augmented Generation) and REALM (Retrieval-Augmented Language Model), have highlighted the effectiveness of integrating retrieval mechanisms into generative frameworks, significantly enhancing the knowledge accessible to these models through the use of indexed corpora (Lewis et al., 2020; Guu et al., 2020). Building on this foundation, Re2G introduces key innovations, including a reranking mechanism that integrates retrieval results from various sources, such as BM25 and neural retrieval methods, thereby improving the selection of relevant passages for generation. Additionally, Re2G employs a novel variation of knowledge distillation for end-to-end training of its initial retrieval, reranker, and generation components, utilizing only the ground truth of the target sequence output, which facilitates enhanced performance across diverse tasks.
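A minimal sketch of the retrieve-then-rerank step: candidates from BM25 and a dense retriever (whose scores are not directly comparable) are pooled and rescored by a single reranker; `bm25_search`, `dense_search`, and `rerank_score` are hypothetical interfaces rather than Re2G's exact components.

```python
def retrieve_rerank(query, bm25_search, dense_search, rerank_score, k_initial=25, k_final=5):
    """Pool sparse and dense candidates, then let one reranker impose a single
    ordering; the top passages are handed to the generator."""
    candidates = set(bm25_search(query, k_initial)) | set(dense_search(query, k_initial))
    ranked = sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)
    return ranked[:k_final]
```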
# REALM Retrieval-Augmented Language Model Pre-Training
REALM employs a two-step methodology consisting of retrieval and prediction, where it first retrieves relevant documents from a knowledge corpus based on the input query and then uses these documents to inform its predictions. The retriever is trained without relevance supervision: the masked language modeling loss is backpropagated through the retrieval step, rewarding documents that improve the model's predictions. Experimental evaluations on Open-QA benchmarks, such as Natural Questions and Web Questions, demonstrate that REALM significantly outperforms existing state-of-the-art models, achieving 4-16% improvements in absolute accuracy while also offering qualitative benefits like enhanced interpretability and modularity (Guu et al., 2020). Compared to other retrieval-based and generation-based systems, REALM shows superior performance, even surpassing larger models like T5, highlighting the importance of its retrieval mechanism in accurately answering questions by providing relevant context (Devlin et al., 2018; Raffel et al., 2019).
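The retrieve-then-predict decomposition can be written as a marginal likelihood over latent retrieved documents $z$, which is what allows the masked-language-modeling loss to flow back into the retriever (the notation below is a paraphrase of the paper's formulation):

$$
p(y \mid x) \;=\; \sum_{z \in \mathcal{Z}} p(y \mid x, z)\, p(z \mid x), \qquad p(z \mid x) \;\propto\; \exp\!\big(f(x)^{\top} g(z)\big),
$$

where $f$ and $g$ denote the BERT-based query and document embedding functions, and the sum is approximated over the top-k documents returned by maximum inner product search.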
# REST Retrieval-Based Speculative Decoding
Speculative decoding traditionally relies on a smaller language model to generate draft tokens, which are then verified by a larger model; however, obtaining a high-quality draft model that balances size and predictive power often requires custom training (Miao et al., 2023; Chen et al., 2023). The Retrieval-Based Speculative Decoding (REST) framework addresses these challenges by utilizing a non-parametric retrieval datastore to construct draft tokens, allowing for seamless integration with various large language models (LLMs) without additional training (He et al., 2024). Unlike LLMA, which retrieves from limited contexts, REST draws from a comprehensive datastore, enabling a broader range of information during generation. Extensive experiments show that REST achieves significant speedups in token generation, with improvements ranging from 1.62x to 2.36x compared to standard autoregressive and speculative decoding methods, demonstrating its effectiveness across diverse datasets such as HumanEval and MT-Bench (He et al., 2024).
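A minimal sketch of retrieval-based drafting: the longest suffix of the current context that appears in a token datastore determines the proposed draft tokens, which the target LLM then verifies. REST itself indexes the datastore with a suffix array and builds a draft tree; the linear scan below is only for clarity.

```python
def rest_draft(context_tokens, datastore, max_suffix=16, draft_len=8):
    """Return the tokens that followed the longest matching suffix of the context
    in the datastore (a long list of token ids), to be verified by the target LLM."""
    for n in range(min(max_suffix, len(context_tokens)), 0, -1):
        suffix = context_tokens[-n:]
        for i in range(len(datastore) - n):
            if datastore[i:i + n] == suffix:
                return datastore[i + n : i + n + draft_len]
    return []    # no match: fall back to ordinary autoregressive decoding
```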
# Retrieval Augmentation Reduces Hallucination in Conversation
State-of-the-art dialogue models often generate responses that lack factual accuracy, resulting in hallucination, a problem exacerbated by their reliance on internal knowledge that may not cover all relevant information (Roller et al., 2021; Maynez et al., 2020). Retrieval-Augmented Generation (RAG) addresses this issue by integrating neural retrieval mechanisms with generative models, allowing for the retrieval of relevant documents from a large corpus to enhance the factual accuracy of responses (Lewis et al., 2020b). Studies have shown that models employing retrieval mechanisms achieve state-of-the-art performance on knowledge-grounded conversational tasks, significantly reducing hallucination rates (Shuster et al., 2021). Human evaluations further reveal that retrieval-augmented models demonstrate higher knowledgeability and lower hallucination rates compared to standard models, while also exhibiting improved generalization to unseen topics, thereby outperforming models that rely solely on internal knowledge (Dinan et al., 2019b; Zhou et al., 2021).
# Retrieval Augmented Code Generation and Summarization
The REDCODER framework, introduced by Parvez et al. (2021), exemplifies a retrieval augmented approach that enhances code generation and summarization by retrieving relevant code or summaries from a database through a two-step process involving a retriever and a generator, leading to significant improvements in performance metrics such as BLEU and Exact Match scores. Karpukhin et al. (2020) developed the Dense Passage Retriever (DPR), which efficiently encodes queries and passages for document retrieval, serving as a foundational model for various retrieval-based approaches in this domain. Additionally, Feng et al. (2020) created CodeBERT, a pre-trained model for understanding programming and natural languages, while Guo et al. (2021) introduced GraphCodeBERT, which incorporates data flow information for enhanced retrieval and generation tasks. Ahmad et al. (2021) presented PLBART, a sequence-to-sequence model pre-trained on extensive code and natural language data, which plays a crucial role in the REDCODER framework. The performance of these models is typically evaluated using metrics such as BLEU Score, Exact Match (EM), and CodeBLEU, which assess the overlap, accuracy, and correctness of the generated outputs.
# Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Empirical evaluations show that Retrieval-Augmented Generation (RAG) models achieve state-of-the-art performance on various knowledge-intensive tasks, such as open-domain question answering, abstractive question answering, and fact verification, outperforming traditional extractive models by generating answers that are not necessarily verbatim from retrieved documents (Lewis et al., 2020; Karpukhin et al., 2020). This capability highlights RAG's strength in synthesizing and contextualizing information effectively. Additionally, when compared to closed-book models that rely solely on internal knowledge, RAG models demonstrate a balanced approach that combines the flexibility of generation with the accuracy of factual content, making them a more robust solution for knowledge-intensive tasks (Guu et al., 2020; Petroni et al., 2019).
# Retrieval-Enhanced Machine Learning
Retrieval-Enhanced Machine Learning (REML) consists of two main components: the prediction model, which generates queries, and the information access models, which retrieve relevant information from a knowledge repository. This framework supports various optimization strategies, including independent, conditional, and joint end-to-end optimization of the prediction and retrieval models (Zamani et al., 2022). REML has several significant applications, such as enhancing model generalization in domain adaptation and few-shot learning, improving scalability by offloading memorization to retrieval systems, facilitating dynamic updates to knowledge bases in non-stationary environments, and increasing interpretability by grounding predictions in retrieved information.
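A minimal sketch of this decomposition, with illustrative interfaces (not Zamani et al.'s API) separating the prediction model, which formulates queries, from the information access component, which answers them from the knowledge repository:

```python
from typing import List, Protocol

class InformationAccess(Protocol):
    def retrieve(self, query: str, k: int) -> List[str]: ...

class PredictionModel(Protocol):
    def make_query(self, x: str) -> str: ...
    def predict(self, x: str, evidence: List[str]) -> str: ...

def reml_predict(x: str, model: PredictionModel, ia: InformationAccess, k: int = 5) -> str:
    """Retrieval-enhanced prediction: the model issues a query, the information
    access component returns evidence, and the model conditions its prediction on it."""
    evidence = ia.retrieve(model.make_query(x), k)
    return model.predict(x, evidence)
```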
# Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG introduces several innovative components that set it apart from conventional Retrieval-Augmented Generation (RAG) approaches, including an adaptive retrieval mechanism that allows the model to determine the necessity of retrieving passages based on the input and preceding generations, thereby enhancing efficiency and relevance. The framework utilizes reflection tokens, which facilitate self-reflection by enabling the model to critique its own outputs and assess the relevance of retrieved information, categorized into retrieval and critique tokens to guide decision-making. Additionally, Self-RAG incorporates a structured critique mechanism that evaluates the factuality and overall quality of generated outputs in relation to retrieved information, significantly improving the reliability of responses. Empirical evaluations by Asai et al. (2024) demonstrate that Self-RAG outperforms state-of-the-art LLMs and retrieval-augmented models across various tasks, including open-domain question answering, reasoning, and fact verification, with notable enhancements in factuality and citation accuracy, particularly in long-form generation tasks, thus addressing critical limitations of existing models.
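A minimal sketch of the adaptive-retrieval inference loop, with hypothetical helpers for the reflection-token decisions and critique scoring; the token name and scoring rule are placeholders rather than Self-RAG's exact special-token vocabulary.

```python
def self_rag_segment(prompt, history, lm, retriever, k=3):
    """Generate one segment: decide whether retrieval is needed; if so, draft one
    candidate continuation per retrieved passage, critique each for relevance,
    support, and usefulness, and keep the highest-scoring candidate."""
    if lm.predict_retrieve_token(prompt, history) == "[Retrieve]":
        candidates = []
        for passage in retriever.search(prompt + history, k):
            segment = lm.generate_segment(prompt, history, passage)
            score = lm.critique(segment, passage)     # combined critique-token score
            candidates.append((score, segment))
        return max(candidates)[1]
    return lm.generate_segment(prompt, history, passage=None)
```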
# The Probabilistic Relevance Framework - BM25 and Beyond
The Probabilistic Relevance Framework (PRF), initially formulated by Robertson and Sparck Jones (1977), established a probabilistic model that highlighted the significance of relevance weighting in search term selection, paving the way for further advancements such as the BM25 function introduced by Robertson et al. (1994), which integrates term frequency and inverse document frequency into its scoring mechanism. BM25 has become a widely recognized instantiation of the PRF, known for its effectiveness in various retrieval tasks by estimating the probability of relevance through a combination of term frequency and document length normalization, with its robustness validated by numerous empirical studies (Robertson & Zaragoza, 2009). The PRF has found applications in diverse information retrieval contexts, including ad-hoc retrieval, query expansion, and information filtering, with Sparck Jones et al. (2000) emphasizing its adaptability in relevance feedback scenarios, thereby demonstrating the framework's flexibility and potential for integrating additional features to enhance its applicability in real-world search systems.
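A minimal sketch of the BM25 scoring function referenced above, with the common defaults k1 = 1.5 and b = 0.75; the +1 inside the IDF logarithm keeps weights non-negative, as in many practical implementations.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.5, b=0.75):
    """Sum, over query terms present in the document, of an IDF weight times a
    saturating, length-normalized term-frequency component; doc_freq maps each
    term to the number of documents containing it."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0:
            continue
        df = doc_freq.get(t, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```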
# TIARA Multi-grained Retrieval for Robust Question Answering over Large Knowledge Base
Knowledge Base Question Answering (KBQA) systems face significant challenges, primarily in KB grounding, which involves linking questions to relevant knowledge within complex and large knowledge bases, and in logical form generation, where ensuring the semantic and syntactic correctness of generated forms is difficult (Gu et al., 2021; Ye et al., 2021). To address these issues, TIARA employs a multi-grained retrieval approach that enhances contextual understanding by retrieving relevant contexts from the knowledge base, including entities, exemplary logical forms, and schema items (Chen et al., 2021). The TIARA framework incorporates key components such as entity retrieval, exemplary logical form retrieval, schema retrieval, and constrained decoding, which collectively improve the accuracy and reliability of logical form generation. Empirical evaluations indicate that TIARA outperforms previous state-of-the-art methods on benchmarks like GrailQA and WebQuestionsSP, demonstrating significant improvements in compositional and zero-shot generalization, thereby showcasing its robustness in managing complex queries (Raffel et al., 2020; Devlin et al., 2019).