- A Neural Corpus Indexer for Document Retrieval (Wang et al., 2022)
  - Method: end-to-end seq2seq network with a Prefix-Aware Weight-Adaptive (PAWA) decoder, a query-generation network, and hierarchical k-means indexing (see the sketch below)
  - Datasets: NQ320k; TriviaQA
  - Results: +21.4% relative improvement in Recall@1 on NQ320k; +16.8% improvement in R-Precision on TriviaQA
  - Contribution: unifies training and indexing; introduces a novel decoder and realistic query-document pair generation for enhanced retrieval performance
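
To make the hierarchical k-means indexing step concrete, here is a minimal sketch (ours, not the paper's code) of how a recursive clusterer can assign prefix-structured semantic identifiers, so that similar documents share docid prefixes; function names and parameters are illustrative, and scikit-learn's KMeans stands in for the paper's clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_semantic_ids(embeddings, k=4, max_leaf=2, prefix=()):
    """Recursively cluster document embeddings; the path of cluster indices
    from root to leaf becomes each document's identifier, so semantically
    similar documents share docid prefixes (the property a prefix-aware
    decoder can exploit)."""
    n = len(embeddings)
    if n <= max_leaf:                        # leaf: enumerate remaining docs
        return {i: prefix + (i,) for i in range(n)}
    labels = KMeans(n_clusters=min(k, n), n_init=10).fit_predict(embeddings)
    ids = {}
    for c in set(labels):
        members = np.where(labels == c)[0]
        sub = assign_semantic_ids(embeddings[members], k, max_leaf,
                                  prefix + (int(c),))
        ids.update({int(members[j]): docid for j, docid in sub.items()})
    return ids

rng = np.random.default_rng(0)
print(assign_semantic_ids(rng.normal(size=(8, 16))))  # e.g. {3: (0, 1, 0), ...}
```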

- Active Retrieval Augmented Generation (Jiang et al., 2023)
  - Method: FLARE, a dynamic, iterative retrieval scheme integrated into generation; detects low-confidence tokens and retrieves additional context before regenerating (see the sketch below)
  - Datasets: knowledge-intensive tasks (e.g., multi-hop QA, open-domain summarization); specific datasets not detailed
  - Results: significant performance improvements on complex, long-form generation tasks
  - Contribution: introduces a forward-looking, active retrieval mechanism; moves beyond static, single-shot retrieval methods
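
The control flow can be illustrated with a toy loop (our sketch, not the authors' code): generate a tentative sentence, and if any token falls below a confidence threshold, retrieve with the draft as the query and regenerate. `generate_sentence` and `retrieve` are hypothetical stand-ins for an LM that exposes token probabilities and for a search backend.

```python
THRESHOLD = 0.4   # retrieve whenever some token probability drops below this

def generate_sentence(prompt, context):
    # placeholder LM: returns (sentence, per-token probabilities)
    return "Paris hosted the 2024 Summer Olympics.", [0.9, 0.95, 0.3, 0.8, 0.9]

def retrieve(query, k=3):
    # placeholder retriever: returns k supporting passages
    return [f"[passage relevant to {query!r}]"] * k

def flare_generate(question, max_sentences=5):
    answer, context = [], []
    for _ in range(max_sentences):
        sent, probs = generate_sentence(question + " ".join(answer), context)
        if min(probs) < THRESHOLD:       # low confidence: look ahead, retrieve
            context = retrieve(sent)     # the draft sentence is the query
            sent, probs = generate_sentence(question + " ".join(answer), context)
        if sent in answer:               # toy stopping criterion
            break
        answer.append(sent)
    return " ".join(answer)

print(flare_generate("Where were the 2024 Summer Olympics held? "))
```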

- Atlas: Few-shot Learning with Retrieval Augmented Language Models (Izacard et al., 2022)
  - Method: dual-encoder retriever combined with a sequence-to-sequence generator; joint pre-training of both components
  - Datasets: Natural Questions; MMLU; KILT benchmarks
  - Results: over 42% accuracy on Natural Questions with only 64 training examples; outperforms much larger models (e.g., PaLM) by 3%
  - Contribution: demonstrates effective few-shot learning with minimal data; offers an adaptable document index

- Benchmarking Large Language Models in Retrieval-Augmented Generation (Chen et al., 2024)
  - Method: RGB, an evaluation framework assessing how well LLMs (e.g., ChatGPT, ChatGLM, Vicuna) use retrieved context
  - Datasets: evaluation tasks in English and Chinese under varying noise conditions
  - Results: accuracy drops sharply under retrieval noise (e.g., ChatGPT from 96.33% to 76%); multi-document integration remains hard (accuracy falls to 43–55%)
  - Contribution: provides a rigorous benchmark for RAG settings; highlights error-detection and rejection behaviors in LLMs

- C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models (Kang et al., 2024)
  - Method: conformal risk analysis to certify generation risks; establishes a provable upper bound, the "conformal generation risk" (see the sketch below)
  - Datasets: AESLC; CommonGen; DART; E2E
  - Results: consistently lower conformal generation risks compared to non-retrieval models
  - Contribution: extends conformal prediction methods to RAG; provides a framework for risk certification in generation
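
The flavor of such a certificate can be shown with a simplified stand-in: a one-sided Hoeffding bound over calibration losses in [0, 1] yields a finite-sample upper bound on the expected generation risk. C-RAG's actual analysis is conformal and considerably more refined; the function below is only our illustration of the "empirical risk plus concentration term" shape of the guarantee.

```python
import math

def conformal_generation_risk(calibration_losses, delta=0.05):
    """With probability at least 1 - delta over the calibration draw, the
    true expected loss is no larger than the returned bound (one-sided
    Hoeffding inequality; losses must lie in [0, 1], e.g. 1 - ROUGE-L)."""
    n = len(calibration_losses)
    empirical_risk = sum(calibration_losses) / n
    return empirical_risk + math.sqrt(math.log(1 / delta) / (2 * n))

losses = [0.12, 0.30, 0.05, 0.22, 0.18, 0.09, 0.25, 0.14]
print(f"certified risk bound: {conformal_generation_risk(losses):.3f}")
```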

- Can Knowledge Graphs Reduce Hallucinations in LLMs?: A Survey (Agrawal et al., 2024)
  - Method: survey categorizing KG-integration methods into knowledge-aware inference, knowledge-aware learning, and knowledge-aware validation
  - Datasets: aggregated studies across multiple tasks; no single dataset
  - Results: up to 80% improvement in answer correctness in certain settings; improved chain-of-thought reasoning
  - Contribution: comprehensively categorizes KG-based augmentation methods; addresses hallucination reduction in LLMs

- Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020)
  - Method: dual-encoder dense vector representations for semantic matching, trained with in-batch negatives (see the sketch below)
  - Datasets: Natural Questions; other open-domain QA benchmarks
  - Results: top-20 accuracy of 78.4% on Natural Questions (vs. 59.1% for BM25)
  - Contribution: introduces practical dense retrieval; significantly improves semantic matching in QA systems
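
The in-batch negative objective is compact enough to sketch directly. This is our minimal rendering of the idea, not the authors' training code: for a batch of question/positive-passage pairs, every other passage in the batch serves as a negative, and the similarity matrix becomes classification logits.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb, p_emb):
    """DPR-style objective: for B (question, positive-passage) pairs, the
    other B - 1 passages in the batch act as negatives. The B x B matrix of
    dot products is treated as classification logits whose correct class
    sits on the diagonal."""
    scores = q_emb @ p_emb.T                 # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0))     # pair i's positive is passage i
    return F.cross_entropy(scores, labels)

q = torch.randn(4, 768)   # stand-in question-encoder outputs
p = torch.randn(4, 768)   # stand-in passage-encoder outputs (row-aligned)
print(in_batch_negative_loss(q, p))
```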

- DocPrompting: Generating Code by Retrieving the Docs (Zhou et al., 2023)
  - Method: retrieves library documentation to guide code generation, rather than relying only on paired NL-code examples
  - Datasets: CoNaLa (Python); a curated Bash dataset
  - Results: 52% relative gain in pass@1 and 30% relative gain in pass@10 on CoNaLa
  - Contribution: highlights the importance of documentation retrieval; boosts code-generation accuracy and generalization

- Document Language Models, Query Models, and Risk Minimization for Information Retrieval (Ponte & Croft, 1998; Berger & Lafferty, 1999; Lafferty & Zhai, 2001)
  - Method: combines unigram language models, statistical translation methods, and Markov-chain query models within a Bayesian risk-minimization framework (see the sketch below)
  - Results: significant improvements over traditional vector-space models
  - Contribution: laid the foundation for integrating document language models, query models, and risk minimization; influenced modern retrieval methods
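
The core scoring rule of this family, unigram query likelihood, fits in a few lines. The sketch below (ours) uses Dirichlet smoothing to back off to the collection language model for terms unseen in a document, in the spirit of the language-modeling approach these papers established; the toy corpus and `mu` value are illustrative.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000):
    """Rank a document by log P(query | doc) under a unigram document
    language model, smoothing term estimates with the collection model
    (Dirichlet smoothing)."""
    d, c = Counter(doc), Counter(collection)
    dlen, clen = sum(d.values()), sum(c.values())
    score = 0.0
    for t in query:
        p_collection = c[t] / clen
        p_doc = (d[t] + mu * p_collection) / (dlen + mu)  # smoothed estimate
        score += math.log(p_doc)
    return score

doc1 = "language models rank documents by query likelihood".split()
doc2 = "cats sit quietly on warm mats".split()
collection = doc1 + doc2
query = "query likelihood models".split()
print(query_likelihood(query, doc1, collection) >
      query_likelihood(query, doc2, collection))   # True: doc1 matches better
```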

- Evaluating Retrieval Quality in Retrieval-Augmented Generation (Salemi & Zamani, 2024)
  - Method: eRAG uses the downstream LLM itself to produce document-level relevance labels that correlate with end-task performance (see the sketch below)
  - Datasets: various downstream RAG tasks; exact datasets not specified
  - Results: Kendall's tau correlation with downstream performance increases from 0.168 to 0.494; up to 50× better memory efficiency and a 2.468× speedup
  - Contribution: proposes an evaluation metric that aligns retrieval quality with end-task performance; reduces computational overhead
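
The labeling idea can be shown with a toy (our sketch, not the authors' code): score each retrieved document by whether the downstream LLM answers correctly from that document alone, then treat those outcomes as relevance judgments for any standard IR metric. `llm_answer` is a hypothetical stand-in for the generation model.

```python
def llm_answer(question, document):
    # placeholder downstream LLM: answers using only this document
    return "Paris" if "Paris" in document else "unknown"

def erag_labels(question, gold_answer, retrieved_docs):
    """Document-level relevance: 1 if the LLM is correct using the document
    alone, else 0; P@k, MRR, etc. can then be computed on these labels."""
    return [int(llm_answer(question, d) == gold_answer) for d in retrieved_docs]

docs = ["The capital of France is Paris.", "Berlin is the capital of Germany."]
labels = erag_labels("What is the capital of France?", "Paris", docs)
print(labels, "precision@2 =", sum(labels) / len(labels))   # [1, 0] 0.5
```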

- Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
  - Method: comparative analysis of fine-tuning (FT) and RAG on long-tail knowledge
  - Results: RAG achieves higher accuracy for low-frequency entities; a hybrid FT+RAG setup yields the best results for smaller models
  - Contribution: highlights the benefits of RAG over fine-tuning alone; effective for less popular or emerging knowledge

- How Much Knowledge Can You Pack Into the Parameters of a Language Model? (Roberts et al., 2020)
  - Method: fine-tunes a large pre-trained model (T5) for closed-book open-domain QA, with salient span masking (SSM) as a pre-training objective
  - Datasets: Natural Questions; WebQuestions; TriviaQA
  - Results: larger models consistently outperform smaller ones; significant gains from SSM
  - Contribution: contrasts closed-book vs. open-book QA; demonstrates task-specific pre-training benefits

- Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021; Brown et al., 2020; Deng et al., 2009)
  - Method: Contrastive Language-Image Pre-training (CLIP) with joint image and text encoders (see the sketch below)
  - Datasets: 400M (image, text) pairs; evaluated on ImageNet and other benchmarks
  - Results: competitive zero-shot performance on ImageNet; robust to natural distribution shifts
  - Contribution: bridges visual and textual modalities; enables transferable visual representations via contrastive learning
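
The training objective is a symmetric InfoNCE loss, sketched below in a minimal form (ours): matched image-text pairs are pulled together and all other pairings in the batch pushed apart, averaging the image-to-text and text-to-image cross-entropies. The temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive objective: cosine similarities between all
    image/text pairs in the batch form logits; the matched pair is the
    correct class in both directions."""
    img = F.normalize(image_emb, dim=-1)     # cosine similarity via
    txt = F.normalize(text_emb, dim=-1)      # normalized dot products
    logits = img @ txt.T / temperature       # (B, B)
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```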

- Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Izacard & Grave, 2021; Roberts et al., 2020)
  - Method: Fusion-in-Decoder encodes retrieved passages independently and aggregates evidence in the decoder (see the sketch below)
  - Datasets: Natural Questions; TriviaQA
  - Results: state-of-the-art Exact Match scores; performance scales with the number of retrieved passages
  - Contribution: combines retrieval with generation to synthesize evidence across passages; improves open-domain QA accuracy
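
The shape of the computation is easy to show with stand-in modules (our sketch, not the paper's T5 implementation): each (question + passage) pair is encoded on its own, the token representations are concatenated, and one decoder attends over all of them, so cross-passage fusion happens only in the decoder.

```python
import torch

def fusion_in_decoder(encoder, decoder, question_ids, passage_ids_list):
    """Encode each passage independently with the question prepended, then
    let a single decoder attend over the concatenated representations."""
    encoded = [encoder(torch.cat([question_ids, p], dim=1))
               for p in passage_ids_list]     # N tensors of shape (1, L_i, H)
    fused = torch.cat(encoded, dim=1)         # (1, sum of L_i, H)
    return decoder(fused)

# toy stand-ins: an embedding layer as "encoder", mean-pooling as "decoder"
emb = torch.nn.Embedding(100, 16)
encoder = lambda ids: emb(ids)
decoder = lambda memory: memory.mean(dim=1)

question = torch.randint(0, 100, (1, 5))
passages = [torch.randint(0, 100, (1, 20)) for _ in range(4)]
print(fusion_in_decoder(encoder, decoder, question, passages).shape)  # (1, 16)
```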

- Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2023)
  - Method: HyDE, a two-step process: generate a hypothetical document with an instruction-following LM, then encode it with an unsupervised contrastive encoder for nearest-neighbor search (see the sketch below)
  - Datasets: web search, QA, and fact-verification tasks, including multilingual settings
  - Results: outperforms existing unsupervised dense retrieval models; competitive with fine-tuned models
  - Contribution: enables effective zero-shot retrieval without explicit relevance labels; leverages hypothetical document generation
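
A toy pipeline (ours, with stand-in components) shows the two steps: a placeholder "LM" drafts a hypothetical answer document, a placeholder encoder embeds it, and retrieval becomes document-to-document similarity, so no relevance labels are needed.

```python
import numpy as np

def generate_hypothetical_doc(query):
    return f"A passage answering the question: {query}"   # placeholder LM

def encode(text, dim=256):
    """Placeholder unsupervised encoder: hash words into a normalized
    bag-of-words vector (a real system would use a contrastive encoder)."""
    vec = np.zeros(dim)
    for word in text.lower().replace("?", " ").replace(".", " ").split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def hyde_search(query, corpus, k=1):
    hypothetical = encode(generate_hypothetical_doc(query))
    return sorted(corpus, key=lambda d: -(encode(d) @ hypothetical))[:k]

corpus = ["Paris is the capital of France.", "Whales are marine mammals.",
          "Tokyo is the capital of Japan."]
print(hyde_search("What is the capital of France?", corpus))
```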

- Re2G: Retrieve, Rerank, Generate (Glass et al., 2022; Lewis et al., 2020; Guu et al., 2020)
  - Method: integrated framework combining retrieval, reranking, and generation, trained with knowledge distillation (see the sketch below)
  - Datasets: various tasks (exact datasets not specified)
  - Results: enhanced selection of relevant passages; improved overall performance across tasks
  - Contribution: unifies retrieval, reranking, and generation in an end-to-end framework; improves evidence selection
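
A minimal pipeline sketch (ours) conveys the division of labor: a cheap, recall-oriented retriever proposes candidates, a more precise scorer reorders them, and only the top passages reach the generator. `first_stage`, `rerank_score`, and `generate` are hypothetical stand-ins for BM25/DPR, a cross-encoder, and a seq2seq reader.

```python
def first_stage(query, corpus, k=10):
    overlap = lambda d: sum(w in d for w in query.split())
    return sorted(corpus, key=overlap, reverse=True)[:k]

def rerank_score(query, doc):
    # toy precision-oriented score: overlap penalized by document length
    return sum(w in doc for w in query.split()) / (len(doc.split()) + 1)

def generate(query, passages):
    return f"answer to {query!r} grounded in: {passages[0]!r}"

def re2g(query, corpus, k_rerank=2):
    candidates = first_stage(query, corpus)                  # recall stage
    reranked = sorted(candidates, reverse=True,
                      key=lambda d: rerank_score(query, d))  # precision stage
    return generate(query, reranked[:k_rerank])

corpus = ["paris is the capital of france",
          "france is in europe and borders spain",
          "whales are mammals"]
print(re2g("capital of france", corpus))
```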

- REALM: Retrieval-Augmented Language Model Pre-Training (Guu et al., 2020; Devlin et al., 2018; Raffel et al., 2019)
  - Method: two-step process of retrieval followed by masked-language-model prediction, with the retriever trained end to end during pre-training
  - Datasets: Natural Questions; WebQuestions
  - Results: 4–16% absolute accuracy improvements on open-domain QA benchmarks
  - Contribution: integrates retrieval into pre-training; enhances prediction accuracy and model interpretability

- REST: Retrieval-Based Speculative Decoding (He et al., 2024; Miao et al., 2023; Chen et al., 2023)
  - Method: uses a non-parametric retrieval datastore to construct draft tokens for speculative decoding (see the sketch below)
  - Results: 1.62× to 2.36× speedup in token generation compared to standard autoregressive decoding
  - Contribution: improves generation speed without additional training; allows seamless integration with various LLMs
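
The draft-and-verify loop can be shown with a toy (our sketch, heavily simplified): look up the current context suffix in a datastore of corpus continuations, propose the stored continuation as draft tokens, and keep only the prefix the target model agrees with. `target_model_next` is a hypothetical stand-in for the verifier LLM.

```python
datastore = {("the", "capital", "of"): ["france", "is", "paris"]}

def target_model_next(context):
    # placeholder: the verifier LLM's greedy next token for a given context
    answers = {("the", "capital", "of"): "france",
               ("the", "capital", "of", "france"): "is",
               ("the", "capital", "of", "france", "is"): "paris"}
    return answers.get(tuple(context), "<eos>")

def rest_step(context):
    draft = datastore.get(tuple(context[-3:]), [])   # retrieve draft tokens
    accepted = []
    for tok in draft:                                # verify left to right
        if target_model_next(context + accepted) != tok:
            break                                    # first mismatch: stop
        accepted.append(tok)
    # fall back to one token from the target model so decoding always advances
    return accepted or [target_model_next(context)]

print(rest_step(["the", "capital", "of"]))   # ['france', 'is', 'paris']
```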

- Retrieval Augmentation Reduces Hallucination in Conversation (Shuster et al., 2021; Roller et al., 2021; Maynez et al., 2020; Lewis et al., 2020b; Dinan et al., 2019b; Zhou et al., 2021)
  - Method: integrates retrieval mechanisms into dialogue systems, fetching relevant documents to improve factuality
  - Datasets: knowledge-grounded conversational datasets (specific names not provided)
  - Results: reduced hallucination rates; higher factual accuracy compared to standard models
  - Contribution: demonstrates more reliable, factually grounded conversational responses

- Retrieval Augmented Code Generation and Summarization (Parvez et al., 2021; Karpukhin et al., 2020; Feng et al., 2020; Guo et al., 2021; Ahmad et al., 2021)
  - Method: REDCODER, a two-step framework combining retrieval with generation, built on pre-trained code models
  - Datasets: code-generation benchmarks (e.g., CoNaLa, CodeXGLUE), evaluated via BLEU, Exact Match, and CodeBLEU
  - Results: significant improvements in BLEU, Exact Match, and CodeBLEU scores
  - Contribution: enhances code generation and summarization; effectively retrieves relevant code snippets and integrates pre-trained models

- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020; Karpukhin et al., 2020; Guu et al., 2020; Petroni et al., 2019)
  - Method: combines a learned retriever with generative modeling, marginalizing over retrieved passages to synthesize external knowledge (see the formula below)
  - Datasets: various knowledge-intensive benchmarks (e.g., Natural Questions, TriviaQA)
  - Results: outperforms extractive and closed-book models in accuracy and robustness
  - Contribution: balances internal model knowledge with external retrieval; provides accurate and comprehensive answers
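
Concretely, the RAG-sequence model of Lewis et al. (2020) treats the retrieved passage as a latent variable and marginalizes over the top-k passages returned by the retriever:

$$
p(y \mid x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y \mid x, z),
$$

where $p_\eta(z \mid x)$ is the retriever's score for passage $z$ given query $x$, and $p_\theta(y \mid x, z)$ is the generator's likelihood of output $y$ given the query and that passage.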

- Retrieval-Enhanced Machine Learning (Zamani et al., 2022)
  - Method: REML framework jointly optimizing a prediction model and a retrieval model
  - Datasets: applied to domain adaptation and few-shot learning scenarios (datasets not specified)
  - Results: improves model generalization, scalability, and interpretability
  - Contribution: offloads memorization to a retrieval system; supports dynamic updates to knowledge bases

- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2024)
  - Method: adaptive retrieval with structured self-critique via special "reflection tokens" (control flow sketched below)
  - Datasets: open-domain QA, reasoning, and fact-verification tasks (specific datasets not provided)
  - Results: outperforms state-of-the-art models in factuality and citation accuracy
  - Contribution: introduces self-critique into the RAG pipeline; enables adaptive retrieval and improved output reliability
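
A heavily simplified sketch (ours) of the control flow: reflection decisions determine whether to retrieve and whether the draft is supported by the evidence. `lm` and `retrieve` are hypothetical stand-ins, with reflection tokens modeled as dictionary fields rather than actual vocabulary items.

```python
def lm(question, context=None):
    # placeholder LM: emits a draft plus reflection decisions
    if context is None:
        return {"text": "draft", "retrieve": True}          # ~ [Retrieve]=yes
    return {"text": "grounded answer", "supported": True}   # ~ [IsSup]=yes

def retrieve(query):
    return ["relevant evidence passage"]

def self_rag(question):
    step = lm(question)
    if step.get("retrieve"):              # reflection: do we need evidence?
        step = lm(question, context=retrieve(question))
    if not step.get("supported"):         # reflection: is the draft grounded?
        return "[abstained: unsupported]"
    return step["text"]

print(self_rag("Who wrote Hamlet?"))   # -> grounded answer
```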

- The Probabilistic Relevance Framework: BM25 and Beyond (Robertson & Zaragoza, 2009; Robertson & Sparck Jones, 1977; Robertson et al., 1994; Sparck Jones et al., 2000)
  - Method: probabilistic relevance modeling using term frequency, inverse document frequency, and document-length normalization (see the sketch below)
  - Datasets: TREC collections and other ad-hoc retrieval tasks
  - Results: robust performance as a standard benchmark for relevance estimation
  - Contribution: provides the theoretical foundation for modern IR systems; basis for the widely adopted BM25 scoring function
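
The BM25 scoring function itself is short enough to sketch (our minimal rendering; `k1` and `b` are the usual defaults, and the toy corpus is illustrative): each query term contributes its IDF times a saturating, length-normalized term-frequency component.

```python
import math
from collections import Counter

def bm25_score(query, doc, doc_freqs, n_docs, avgdl, k1=1.2, b=0.75):
    """Textbook BM25: term frequency saturates via k1, document length is
    normalized via b, and the +1 inside the log is the common variant that
    keeps IDF non-negative."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

docs = [d.split() for d in ("the cat sat on the mat",
                            "dogs and cats are loyal pets",
                            "bm25 ranks documents by term statistics")]
doc_freqs = Counter(t for d in docs for t in set(d))
avgdl = sum(map(len, docs)) / len(docs)
for d in docs:
    print(round(bm25_score("cat mat".split(), d, doc_freqs, len(docs), avgdl), 3))
```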

- TIARA: Multi-grained Retrieval for Robust Question Answering over Large Knowledge Base (Shu et al., 2022; Gu et al., 2021; Ye et al., 2021; Chen et al., 2021; Raffel et al., 2020; Devlin et al., 2019)
  - Method: multi-grained retrieval integrating entity, exemplary logical form, and schema retrieval, combined with constrained decoding
  - Results: significant improvements in compositional and zero-shot generalization; outperforms previous methods
  - Contribution: addresses KBQA challenges by retrieving multiple granularities of context; enhances accuracy and reliability of logical-form generation