# A Neural Corpus Indexer for Document Retrieval
The Neural Corpus Indexer (NCI) is a novel end-to-end deep neural network framework designed to enhance recall performance by unifying the training and indexing stages, utilizing a sequence-to-sequence architecture to generate relevant document identifiers directly from user queries (Wang et al., 2022). Key innovations include the Prefix-Aware Weight-Adaptive (PAWA) Decoder, which provides distinct embeddings for tokens based on their positions to capture semantic nuances; a query generation network that creates diverse query-document pairs for improved semantic understanding; and the use of hierarchical k-means to encode documents into semantic identifiers that reflect their content. Empirical evaluations show that NCI significantly outperforms traditional methods, achieving a +21.4% relative improvement in Recall@1 on the NQ320k dataset and a +16.8% improvement in R-Precision on the TriviaQA dataset, highlighting its effectiveness in optimizing retrieval performance through realistic query-document pairs and tailored architectural components (Wang et al., 2022).
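To make the identifier scheme concrete, the following is a minimal sketch of hierarchical k-means assignment of semantic document identifiers in the spirit of NCI; the cluster count `k`, leaf size, and random embeddings are illustrative assumptions rather than the paper's settings.

```python
# Illustrative hierarchical k-means assignment of semantic document identifiers
# (in the spirit of NCI); k, leaf_size, and the random embeddings are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def semantic_ids(doc_embeddings, k=10, leaf_size=100, prefix=()):
    """Recursively cluster embeddings; a document's identifier is the sequence
    of cluster indices along its path through the tree."""
    n = doc_embeddings.shape[0]
    if n <= leaf_size:
        # At a leaf, a per-document position disambiguates documents in the cluster.
        return {i: prefix + (i,) for i in range(n)}
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(doc_embeddings)
    ids = {}
    for c in range(k):
        members = np.where(labels == c)[0]
        sub = semantic_ids(doc_embeddings[members], k, leaf_size, prefix + (c,))
        for local_idx, path in sub.items():
            ids[int(members[local_idx])] = path
    return ids

# Example: 1,000 documents with 64-dimensional embeddings.
ids = semantic_ids(np.random.rand(1000, 64))
print(ids[0])  # e.g. (3, 42): the identifier a seq2seq decoder would learn to generate
```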
# Active Retrieval Augmented Generation
Active Retrieval Augmented Generation (ARAG) is an innovative framework that enhances the capabilities of language models (LMs) by integrating dynamic retrieval mechanisms, allowing for continuous information gathering during text generation. A notable implementation, Forward-Looking Active Retrieval Augmented Generation (FLARE), proposed by Jiang et al. (2023), iteratively generates predictions for upcoming sentences and retrieves relevant documents when low-confidence tokens are detected, significantly improving performance in knowledge-intensive tasks. Unlike traditional single-time retrieval methods, which are effective for short-form tasks but inadequate for long-form generation, ARAG frameworks enable more dynamic and context-aware information retrieval. Experimental results demonstrate that ARAG consistently outperforms traditional methods across various tasks, including multi-hop question answering and open-domain summarization. Future research may focus on refining retrieval strategies and exploring broader applications of ARAG in natural language processing.
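The control flow FLARE describes can be summarized in a few lines. The sketch below uses hypothetical `generate_sentence` and `retrieve` stubs and an assumed confidence threshold in place of the paper's LM, retriever, and hyperparameters.

```python
# Sketch of a forward-looking active retrieval loop; generate_sentence, retrieve,
# and CONF_THRESHOLD are hypothetical stand-ins, not FLARE's actual components.
from typing import List, Tuple

CONF_THRESHOLD = 0.8  # assumed threshold on per-token probability

def generate_sentence(prompt: str) -> Tuple[str, List[float]]:
    """Stub LM call: a tentative next sentence plus per-token probabilities."""
    return "A tentative next sentence.", [0.95, 0.6, 0.9, 0.99]

def retrieve(query: str) -> List[str]:
    """Stub retriever: documents relevant to the query."""
    return ["<retrieved document text>"]

def flare_generate(question: str, max_sentences: int = 5) -> str:
    answer = ""
    for _ in range(max_sentences):
        sentence, token_probs = generate_sentence(f"{question}\n{answer}")
        if min(token_probs) < CONF_THRESHOLD:
            # Low-confidence tokens trigger retrieval; the tentative sentence
            # itself serves as the forward-looking query.
            docs = retrieve(sentence)
            sentence, _ = generate_sentence(f"{' '.join(docs)}\n{question}\n{answer}")
        answer += sentence + " "
    return answer.strip()

print(flare_generate("Who proposed forward-looking active retrieval?"))
```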
# Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas is a retrieval-augmented language model specifically designed for few-shot learning, employing a dual-encoder architecture for document retrieval and a sequence-to-sequence model for output generation. It achieves competitive performance with significantly fewer parameters than other state-of-the-art models, demonstrating strong few-shot learning capabilities by attaining over 42% accuracy on the Natural Questions dataset with only 64 training examples and surpassing larger models like PaLM (540B parameters) by 3%. The architecture of Atlas facilitates easy updates to the document index, enhancing its adaptability to new information, and it has been evaluated across various benchmarks, including MMLU, KILT, and Natural Questions, confirming its effectiveness in both few-shot and resource-rich settings. The training process involves joint pre-training of the retriever and language model, which is essential for its few-shot performance. Various loss functions and pretext tasks are explored to optimize the model's capabilities, and extensive experiments, including ablation studies, underscore the significance of retrieval in improving few-shot learning.
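As a rough illustration of how the language model can supervise the retriever during joint training, the sketch below implements a perplexity-distillation-style objective of the kind Atlas explores; all scores are random placeholders and the exact loss used in the paper may differ.

```python
# Hedged sketch of a perplexity-distillation-style retriever objective: the
# retriever's distribution over retrieved passages is pushed toward the
# distribution implied by how much each passage helps the language model.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
retriever_scores = rng.normal(size=5)     # query-passage similarities from the dual encoder
lm_log_likelihoods = rng.normal(size=5)   # log p(answer | query, passage_k) from the seq2seq LM

p_retriever = softmax(retriever_scores)
p_lm = softmax(lm_log_likelihoods)

# KL(p_lm || p_retriever): in training, gradients flow into the retriever only,
# so the LM's usefulness judgments supervise retrieval.
kl = float(np.sum(p_lm * (np.log(p_lm) - np.log(p_retriever))))
print(f"distillation loss: {kl:.4f}")
```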
# Benchmarking Large Language Models in Retrieval-Augmented Generation
Benchmarking is essential for evaluating the performance of large language models (LLMs) in Retrieval-Augmented Generation (RAG) settings, as it identifies their strengths and weaknesses in effectively utilizing retrieved information, addressing a significant gap in existing research due to the lack of rigorous evaluation frameworks (Chen et al., 2024). This study evaluates six state-of-the-art LLMs—ChatGPT (OpenAI, 2022), ChatGLM-6B (THUDM, 2023a), ChatGLM2-6B (THUDM, 2023b), Vicuna-7B (Chiang et al., 2023), Qwen-7B-Chat (Bai et al., 2023), and BELLE-7B (BELLEGroup, 2023)—using the Retrieval-Augmented Generation Benchmark (RGB) and employs various metrics, including accuracy, rejection rate, error detection rate, and error correction rate, to assess their capabilities. The results reveal a notable decline in accuracy as noise ratios increase, with ChatGPT's accuracy dropping from 96.33% to 76.00% under varying noise conditions, while LLMs exhibited challenges with long-distance information, evidence uncertainty, and concept confusion. Additionally, the rejection rates were low, with a maximum of 45% for English and 43.33% for Chinese, indicating that LLMs often failed to reject irrelevant information and did not consistently adhere to rejection instructions. Furthermore, the models demonstrated weak performance in integrating information from multiple documents, achieving only 60% accuracy in English and 67% in Chinese without noise, which dropped to 43% and 55% with noise, respectively. Lastly, LLMs struggled to detect and correct factual errors in retrieved documents, often relying on misleading information, highlighting the need for further improvements in their capabilities within RAG contexts (Chen et al., 2024; Guu et al., 2020; Lewis et al., 2020).
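The evaluation protocol can be pictured as follows: documents are mixed at a controlled noise ratio before being handed to the model, and accuracy and rejection are scored on the response. The `ask_llm` stub, prompt handling, and rejection phrase below are assumptions for illustration, not the benchmark's actual code.

```python
# Illustrative noise-robustness setup in the style of RGB; the model call and
# the rejection phrase are placeholders, not the benchmark's implementation.
import random

def build_context(positive_docs, noise_docs, noise_ratio, total=5):
    """Mix relevant and irrelevant documents at the requested noise ratio."""
    n_noise = round(total * noise_ratio)
    docs = random.sample(noise_docs, n_noise) + random.sample(positive_docs, total - n_noise)
    random.shuffle(docs)
    return docs

def ask_llm(question, docs):
    """Stub for the model under evaluation."""
    return "I cannot answer the question because of insufficient information in the documents."

def score(question, answer_keys, docs, rejection_phrase="insufficient information"):
    response = ask_llm(question, docs)
    rejected = rejection_phrase in response.lower()
    correct = any(key.lower() in response.lower() for key in answer_keys)
    return {"accuracy": int(correct), "rejection": int(rejected)}

docs = build_context(["Relevant passage ..."] * 5, ["Distractor passage ..."] * 5, noise_ratio=0.6)
print(score("Example question?", ["example answer"], docs))
```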
# C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models
C-RAG (Certified Generation Risks for Retrieval-Augmented Language Models) is an innovative framework that certifies generation risks in retrieval-augmented generation (RAG) models by employing conformal risk analysis to establish a high-probability upper bound on generation risks, known as "conformal generation risk." This framework not only certifies risks associated with specific RAG configurations but also identifies valid configurations that maintain generation risks below a desired threshold. The theoretical foundation of C-RAG is based on conformal prediction methods, which ensure coverage for prediction sets (Vovk et al., 1999; 2005), and it extends these methods to handle bounded risk functions under test-time distribution shifts, thereby filling a significant gap in the literature. Empirical validation of C-RAG has been conducted across four widely-used NLP datasets—AESLC, CommonGen, DART, and E2E—demonstrating its soundness and tightness through extensive evaluations with various retrieval models, including BM25, BAAI/bge, and OpenAI/ada. The results consistently show that C-RAG achieves lower conformal generation risks compared to LLMs without retrieval, thereby reinforcing its theoretical contributions (Kang et al., 2024; Lewis et al., 2020; Bates et al., 2021).
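The flavor of such a certificate can be conveyed with a simplified Hoeffding-style bound on a bounded risk measured on a calibration set; this is a stand-in for, not a reproduction of, C-RAG's exact conformal procedure.

```python
# Simplified high-probability upper bound on generation risk: given bounded risk
# scores on a calibration set, a (1 - delta) bound on expected risk follows from
# a Hoeffding-style concentration argument. Not C-RAG's actual certification.
import math
import numpy as np

def conformal_risk_upper_bound(calibration_risks: np.ndarray, delta: float = 0.1) -> float:
    """Upper-bound E[risk] with probability >= 1 - delta, assuming risks lie in [0, 1]."""
    n = len(calibration_risks)
    empirical_risk = float(calibration_risks.mean())
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n))  # Hoeffding term
    return min(1.0, empirical_risk + margin)

# Example: risks of one RAG configuration measured on 500 calibration examples.
rng = np.random.default_rng(0)
risks = rng.uniform(0.0, 0.4, size=500)
print(f"certified risk at 90% confidence: {conformal_risk_upper_bound(risks, delta=0.1):.3f}")
```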
# Can Knowledge Graphs Reduce Hallucinations in LLMs? A Survey
The paper by Agrawal et al. (2024) investigates the integration of Knowledge Graphs (KGs) into Large Language Models (LLMs) to address the issue of hallucinations—outputs that appear plausible but are often incorrect or irrelevant. The authors categorize augmentation methods into three groups: Knowledge-Aware Inference, which enhances the inference process by incorporating KGs; Knowledge-Aware Learning, which improves training through pre-training and fine-tuning with KGs; and Knowledge-Aware Validation, which employs KGs for fact-checking outputs. Research indicates that smaller LLMs can significantly improve performance by augmenting their knowledge with KGs, achieving over 80% enhancement in answer correctness for question-answering tasks. Larger models benefit from Chain-of-Thought methodologies, with methods like IRCoT increasing accuracy from 66.8% to 85.7% in reasoning tasks. Knowledge-controlled generation methods have also shown superior performance in accuracy and contextual relevance, although they may produce incorrect outputs, necessitating further refinement. While pre-training and fine-tuning with KGs enhance domain-specific performance, they are resource-intensive and may limit transferability across tasks. Additionally, fact-checking mechanisms using KGs effectively reduce hallucinations but can increase computational load, indicating a need for ongoing research to optimize these techniques.
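As a toy illustration of the knowledge-aware validation category, the snippet below checks candidate (subject, relation, object) claims extracted from an LLM answer against a small knowledge graph; the triples and the extraction step are placeholders, not any specific method from the survey.

```python
# Toy knowledge-aware validation: flag LLM claims not supported by the KG.
KG = {
    ("Marie Curie", "awarded", "Nobel Prize in Physics"),
    ("Marie Curie", "awarded", "Nobel Prize in Chemistry"),
}

def validate_claims(claims):
    """Return the claims not supported by the KG, i.e. candidate hallucinations."""
    return [claim for claim in claims if claim not in KG]

llm_claims = [
    ("Marie Curie", "awarded", "Nobel Prize in Physics"),
    ("Marie Curie", "awarded", "Fields Medal"),  # unsupported -> flagged
]
print(validate_claims(llm_claims))
```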
# Dense Passage Retrieval for Open-Domain Question Answering
Dense Passage Retrieval (DPR) is an advanced method that leverages dense vector representations to enhance the efficiency and accuracy of passage retrieval in open-domain question answering (QA) systems, overcoming the limitations of traditional sparse methods like TF-IDF and BM25, which struggle with semantic matching (Karpukhin et al., 2020). By employing a dual-encoder framework, DPR encodes both questions and passages into dense vectors, resulting in significant improvements in retrieval accuracy; for instance, it achieves a top-20 accuracy of 78.4% on the Natural Questions dataset, compared to BM25's 59.1% (Karpukhin et al., 2020). Furthermore, DPR demonstrates high performance even with limited training data, outperforming BM25 with as few as 1,000 examples, and benefits from in-batch negative training to enhance its discriminative capabilities (Karpukhin et al., 2020). The model also exhibits robust generalization across various datasets, maintaining strong performance without extensive fine-tuning, and effectively captures semantic relationships, retrieving passages with synonyms or paraphrases (Karpukhin et al., 2020). However, there are instances where BM25 outperforms DPR, particularly when salient phrases are crucial, indicating a need for further refinement in DPR's ability to prioritize significant keywords (Karpukhin et al., 2020).
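The in-batch negative objective is compact enough to sketch directly: with a batch of B question-passage pairs, the B x B similarity matrix is scored with cross-entropy against the diagonal. The random linear encoders below stand in for DPR's two BERT encoders.

```python
# Minimal sketch of in-batch negative training for a dual encoder: every other
# passage in the batch acts as a negative for a given question.
import torch
import torch.nn.functional as F

B, dim, hidden = 8, 32, 16
question_encoder = torch.nn.Linear(dim, hidden)  # stand-in for the question BERT encoder
passage_encoder = torch.nn.Linear(dim, hidden)   # stand-in for the passage BERT encoder

questions = torch.randn(B, dim)
passages = torch.randn(B, dim)           # passages[i] is the positive for questions[i]

q = question_encoder(questions)          # (B, hidden)
p = passage_encoder(passages)            # (B, hidden)
scores = q @ p.T                         # (B, B) dot-product similarity matrix
targets = torch.arange(B)                # diagonal entries are the positives
loss = F.cross_entropy(scores, targets)  # in-batch negatives come for free
loss.backward()
print(float(loss))
```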
# DocPrompting: Generating Code by Retrieving the Docs
DocPrompting (Zhou et al., 2023) is a code-generation approach that retrieves relevant library documentation for a natural-language intent and conditions generation on it. In contrast to previous models that primarily retrieve NL-code pairs, DocPrompting emphasizes the retrieval of documentation, which is readily available even for newly released libraries, leading to superior generalization and accuracy in code generation compared to traditional methods. Zhou et al. (2023) demonstrate its effectiveness through extensive experiments on benchmarks such as the CoNaLa dataset for Python and a newly curated Bash dataset, revealing significant performance improvements in models like CodeT5 and GPT-Neo, with a 52% relative gain in pass@1 and a 30% relative gain in pass@10 on CoNaLa. These results underscore the potential of documentation retrieval to enhance the accuracy and generalization of code generation models.
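The overall pipeline shape is easy to sketch: retrieve documentation relevant to the natural-language intent, then prepend it to the code-generation prompt. The TF-IDF retriever and tiny documentation pool below are stand-ins for the paper's trained retriever and real library docs.

```python
# Hedged sketch of a retrieve-docs-then-prompt pipeline; retriever and doc pool
# are illustrative stand-ins, not DocPrompting's actual components.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_pool = [
    "pandas.DataFrame.groupby: group DataFrame using a mapper or by a series of columns.",
    "pandas.DataFrame.merge: merge DataFrame objects with a database-style join.",
    "os.path.join: join one or more path components intelligently.",
]

def retrieve_docs(intent: str, k: int = 2):
    vec = TfidfVectorizer().fit(doc_pool + [intent])
    sims = cosine_similarity(vec.transform([intent]), vec.transform(doc_pool))[0]
    return [doc_pool[i] for i in sims.argsort()[::-1][:k]]

intent = "group a pandas dataframe by a column and compute the mean"
prompt = "\n".join(retrieve_docs(intent)) + f"\n# Intent: {intent}\n# Code:\n"
print(prompt)  # this prompt would then be fed to a code LM such as CodeT5
```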
# Document Language Models, Query Models, and Risk Minimization for Information Retrieval
This literature review examines the advancements in Information Retrieval (IR) through the lens of Document Language Models (DLMs), Query Models (QMs), and risk minimization frameworks based on Bayesian decision theory. Pioneering work by Ponte and Croft (1998) introduced unigram language models for document representation, while Berger and Lafferty (1999) enhanced DLMs by incorporating statistical machine translation techniques to address synonymy. The integration of QMs with DLMs, particularly through Markov chains as proposed by Lafferty and Zhai (2001), has significantly improved retrieval performance, especially for short queries. The risk minimization framework formalizes the retrieval process as a decision-making problem aimed at minimizing expected loss, leading to better retrieval outcomes by focusing on the probability of relevance. Empirical evaluations, particularly on TREC collections, have demonstrated the effectiveness of these language modeling approaches compared to traditional vector space models, indicating substantial improvements in retrieval performance. Overall, the synthesis of DLMs, QMs, and risk minimization strategies marks a significant advancement in the field of IR, with future research poised to refine these models and explore their applications in diverse contexts (Berger & Lafferty, 1999; Lafferty & Zhai, 2001; Ponte & Croft, 1998).
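The core scoring rule of the language-modeling approach is the query likelihood under a smoothed document language model. The sketch below uses Dirichlet smoothing toward the collection model, a common choice rather than the original formulation, with an assumed smoothing parameter `mu`.

```python
# Unigram query-likelihood scoring with Dirichlet smoothing; mu is an assumed
# smoothing parameter and the toy corpus is for illustration only.
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, mu=2000.0):
    """log P(Q | theta_D) under a Dirichlet-smoothed document language model."""
    doc_tf = Counter(doc_terms)
    coll_tf = Counter(collection_terms)
    doc_len, coll_len = len(doc_terms), len(collection_terms)
    score = 0.0
    for t in query_terms:
        p_coll = coll_tf[t] / coll_len                     # collection (background) model
        p_doc = (doc_tf[t] + mu * p_coll) / (doc_len + mu) # smoothed document model
        score += math.log(p_doc) if p_doc > 0 else float("-inf")
    return score

collection = "language models rank documents by query likelihood in retrieval".split()
doc = "language models for retrieval".split()
print(query_likelihood("query likelihood retrieval".split(), doc, collection))
```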
# Evaluating Retrieval Quality in Retrieval-Augmented Generation
The paper introduces eRAG, a novel evaluation method that utilizes the large language model (LLM) within Retrieval-Augmented Generation (RAG) systems to generate document-level relevance labels based on downstream task performance, demonstrating a marked improvement in correlating retrieval quality with downstream performance, as evidenced by enhancements in Kendall’s tau correlation ranging from 0.168 to 0.494 (Salemi & Zamani, 2024). eRAG significantly outperforms traditional evaluation methods, such as human judgment and KILT Provenance, which often yield low correlation with actual RAG performance and are limited by cost and practicality (Zamani & Bendersky, 2022; Petroni et al., 2021). Furthermore, eRAG exhibits remarkable computational efficiency, consuming up to 50 times less memory and providing an average speedup of 2.468 times compared to end-to-end evaluation methods, thereby facilitating quicker iterations in model development and evaluation (Lewis et al., 2020). This study highlights the limitations of conventional evaluation approaches, which often lack transparency and fail to provide a comprehensive understanding of retrieval quality, complicating the optimization of retrieval models (Agrawal et al., 2023; Shuster et al., 2021).
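The mechanism is straightforward to sketch: each retrieved document is fed to the downstream LLM on its own, the downstream metric on that single-document run becomes the document's relevance label, and standard ranking metrics then aggregate over the list. The `run_llm` stub and exact-match metric below are illustrative stand-ins.

```python
# Hedged sketch of document-level relevance labels derived from downstream
# performance; the LLM call and metric are placeholders, not eRAG's actual code.
def run_llm(query: str, document: str) -> str:
    """Stub for the RAG system's own LLM, prompted with a single document."""
    return "paris" if "capital of France" in document else "unknown"

def exact_match(prediction: str, gold: str) -> int:
    return int(prediction.strip().lower() == gold.strip().lower())

def erag_labels(query: str, gold: str, retrieved_docs):
    """One relevance label per retrieved document, from single-document runs."""
    return [exact_match(run_llm(query, d), gold) for d in retrieved_docs]

docs = ["Paris is the capital of France.", "Berlin hosts the Bundestag."]
labels = erag_labels("What is the capital of France?", "Paris", docs)
precision_at_k = sum(labels) / len(labels)
print(labels, precision_at_k)  # these labels can feed any standard IR metric
```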