Summary of surveyed papers. Each entry lists the paper and citation, its approach, the datasets used, headline results, and key contributions.
• A Neural Corpus Indexer for Document Retrieval (Wang et al., 2022)
  • Approach: end-to-end seq2seq network with a Prefix-Aware Weight-Adaptive (PAWA) decoder, combined with a query-generation network and hierarchical k-means indexing (sketched after this entry)
  • Datasets: NQ320k; TriviaQA
  • Results: +21.4% relative improvement in Recall@1 on NQ320k; +16.8% improvement in R-Precision on TriviaQA
  • Key contributions: unifies training and indexing; introduces a novel decoder and realistic query–document pair generation for enhanced retrieval performance
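To make the indexing step concrete, the following is a minimal sketch of hierarchical k-means document-ID assignment in the spirit of NCI's semantic identifiers. It assumes scikit-learn and NumPy; the function name, branching factor, and leaf size are illustrative choices rather than NCI's published configuration.

```python
# Minimal sketch of hierarchical k-means document-ID assignment, in the spirit
# of NCI's semantic identifiers. Assumes scikit-learn and NumPy; the branching
# factor and leaf size are illustrative, not NCI's published configuration.
import numpy as np
from sklearn.cluster import KMeans

def assign_semantic_ids(embeddings, branching=8, max_leaf=16, prefix=()):
    """Recursively cluster document embeddings; the path of cluster indices
    from root to leaf becomes each document's semantic ID (docid)."""
    n = len(embeddings)
    if n <= max_leaf:
        return {i: prefix + (i,) for i in range(n)}   # enumerate within a leaf
    labels = KMeans(n_clusters=branching, n_init=10).fit_predict(embeddings)
    ids = {}
    for c in range(branching):
        members = np.where(labels == c)[0]
        sub = assign_semantic_ids(embeddings[members], branching, max_leaf,
                                  prefix + (c,))
        for local, path in sub.items():
            ids[int(members[local])] = path           # map back to global index
    return ids

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = rng.normal(size=(200, 32))                 # stand-in embeddings
    print(assign_semantic_ids(docs)[0])               # e.g. (3, 1, 7)
```

Because the IDs share prefixes exactly when documents share clusters, a seq2seq decoder can generate them token by token, which is what motivates NCI's prefix-aware decoding.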
• Active Retrieval Augmented Generation (Jiang et al., 2023)
  • Approach: FLARE, a dynamic, iterative retrieval scheme integrated into generation; it detects low-confidence tokens and retrieves additional context before continuing (see the sketch after this entry)
  • Datasets: knowledge-intensive tasks (e.g., multihop QA, open-domain summarization); specific datasets not detailed
  • Results: significant performance improvements on complex, long-form generation tasks
  • Key contributions: introduces a forward-looking, active retrieval mechanism; moves beyond static, single-shot retrieval methods
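The following is a schematic sketch of a FLARE-style active retrieval loop. `lm_generate` and `search` are hypothetical placeholders rather than a real API, and the confidence threshold and query construction are simplified relative to the paper, which also masks low-confidence spans when forming queries.

```python
# Schematic FLARE-style loop: generate a tentative next sentence; if any token
# falls below a confidence threshold, retrieve fresh evidence (using the draft
# as the query) and regenerate that sentence before appending it.

def lm_generate(prompt: str) -> tuple[str, list[float]]:
    """Placeholder: return (tentative next sentence, per-token probabilities)."""
    raise NotImplementedError

def search(query: str, k: int = 3) -> list[str]:
    """Placeholder: return top-k evidence passages for `query`."""
    raise NotImplementedError

def flare_answer(question: str, theta: float = 0.6, max_sents: int = 10) -> str:
    answer = ""
    for _ in range(max_sents):
        draft, probs = lm_generate(question + answer)
        if not draft:                             # model chose to stop
            break
        if min(probs) < theta:                    # low-confidence token detected
            evidence = "\n".join(search(draft))   # retrieve with draft as query
            draft, _ = lm_generate(evidence + "\n" + question + answer)
        answer += " " + draft
    return answer.strip()
```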
• Atlas: Few-shot Learning with Retrieval Augmented Language Models (Izacard et al., 2022)
  • Approach: dual-encoder retrieval combined with a sequence-to-sequence generator, with joint pre-training of both components
  • Datasets: Natural Questions; MMLU; KILT benchmarks
  • Results: over 42% accuracy on Natural Questions with only 64 training examples; outperforms much larger models (e.g., the 540B-parameter PaLM) by 3%
  • Key contributions: demonstrates effective few-shot learning with minimal data; offers an adaptable document index
• Benchmarking Large Language Models in Retrieval-Augmented Generation (Chen et al., 2024)
  • Approach: evaluation framework (RGB) assessing how well LLMs (e.g., ChatGPT, ChatGLM, Vicuna) handle retrieved evidence
  • Datasets: evaluation tasks in English and Chinese under varying noise conditions
  • Results: marked accuracy drops under noise (e.g., ChatGPT falls from 96.33% to 76%); multi-document integration remains challenging (accuracy drops to 43–55%)
  • Key contributions: provides a rigorous benchmark for RAG settings; highlights error-detection and rejection behaviors in LLMs
• C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models (Kang et al., 2024)
  • Approach: conformal risk analysis to certify generation risks, establishing a provable upper bound (the "conformal generation risk"); the general idea is sketched below
  • Datasets: AESLC; CommonGen; DART; E2E
  • Results: consistently lower conformal generation risks compared to non-retrieval models
  • Key contributions: extends conformal prediction methods to RAG; provides a framework for risk certification in generation
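The sketch below illustrates the general conformal idea underlying such certificates, not C-RAG's exact construction: given per-example losses on a held-out calibration set, an adjusted empirical quantile yields a distribution-free upper bound that holds for exchangeable test data.

```python
# Simplified sketch of split-conformal risk calibration (the general idea
# behind C-RAG's certified bounds, not its exact construction).
import numpy as np

def conformal_risk_bound(cal_losses: np.ndarray, alpha: float = 0.1) -> float:
    """Return a level-(1 - alpha) upper bound on the test-time loss."""
    n = len(cal_losses)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # conformal-corrected rank
    if k > n:
        return float(np.max(cal_losses))      # too few samples: fall back to max
    return float(np.sort(cal_losses)[k - 1])  # k-th smallest calibration loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    losses = rng.beta(2, 5, size=500)         # stand-in generation risks in [0, 1]
    print(f"90% conformal risk bound: {conformal_risk_bound(losses):.3f}")
```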
• Can Knowledge Graphs Reduce Hallucinations in LLMs? A Survey (Agrawal et al., 2024)
  • Approach: survey categorizing KG-integration methods into knowledge-aware inference, knowledge-aware learning, and knowledge-aware validation
  • Datasets: aggregated studies across multiple tasks; no single dataset
  • Results: up to 80% improvement in answer correctness in certain settings; improved chain-of-thought reasoning
  • Key contributions: comprehensively categorizes KG-based augmentation methods; addresses hallucination reduction in LLMs
• Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020)
  • Approach: dual-encoder dense vector representations for semantic matching, trained with in-batch negatives (see the sketch after this entry)
  • Datasets: Natural Questions; other open-domain QA benchmarks
  • Results: top-20 accuracy of 78.4% on Natural Questions (vs. 59.1% for BM25)
  • Key contributions: introduces dense retrieval techniques; significantly improves semantic matching in QA systems
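The following is a minimal sketch of DPR's in-batch negative objective, assuming PyTorch and omitting the encoders themselves. Each question is paired with its positive passage; every other passage in the batch acts as a negative, so the loss reduces to cross-entropy over the batch similarity matrix with the diagonal as the target.

```python
# Minimal sketch of DPR-style in-batch negative training, assuming PyTorch.
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb, p_emb: (batch, dim) outputs of the question/passage encoders,
    where row i of p_emb is the positive passage for question i."""
    scores = q_emb @ p_emb.T                   # (batch, batch) dot-product scores
    labels = torch.arange(q_emb.size(0))       # positives sit on the diagonal
    return F.cross_entropy(scores, labels)

if __name__ == "__main__":
    q, p = torch.randn(8, 128), torch.randn(8, 128)   # stand-in encoder outputs
    print(in_batch_negative_loss(q, p))
```

In-batch negatives make every batch of B pairs yield B × (B − 1) negatives for free, which is a large part of why DPR trains efficiently.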
• DocPrompting: Generating Code by Retrieving the Docs (Zhou et al., 2023)
  • Approach: retrieval of library documentation to guide code generation, rather than relying on NL–code pairs alone
  • Datasets: CoNaLa (Python); a curated Bash dataset
  • Results: 52% relative gain in pass@1 and 30% relative gain in pass@10 on CoNaLa
  • Key contributions: highlights the importance of documentation retrieval; boosts code-generation accuracy and generalization
• Document Language Models, Query Models, and Risk Minimization for Information Retrieval (Ponte & Croft, 1998; Berger & Lafferty, 1999; Lafferty & Zhai, 2001)
  • Approach: combines unigram document language models, statistical translation methods, and Markov-chain query models under Bayesian risk minimization (a query-likelihood sketch follows this entry)
  • Datasets: TREC collections
  • Results: significant improvements over traditional vector-space models
  • Key contributions: laid the foundation for integrating document language models, query models, and risk minimization; influenced modern retrieval methods
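As a concrete instance of this family, here is a hedged sketch of query-likelihood ranking with Dirichlet smoothing, a standard estimator in this line of work though not necessarily the exact one used in each cited paper.

```python
# Query-likelihood scoring with a Dirichlet-smoothed unigram document model.
import math
from collections import Counter

def query_likelihood(query: list[str], doc: list[str],
                     collection_tf: Counter, collection_len: int,
                     mu: float = 2000.0) -> float:
    """log P(query | doc) under a Dirichlet-smoothed document language model."""
    dtf, dlen = Counter(doc), len(doc)
    score = 0.0
    for term in query:
        p_coll = collection_tf[term] / collection_len   # background model
        if p_coll == 0.0:
            continue                                    # term unseen anywhere
        p = (dtf[term] + mu * p_coll) / (dlen + mu)     # smoothed estimate
        score += math.log(p)
    return score

if __name__ == "__main__":
    docs = [["retrieval", "models", "rank", "documents"],
            ["neural", "networks", "learn", "representations"]]
    coll = Counter(t for d in docs for t in d)
    clen = sum(coll.values())
    q = ["retrieval", "documents"]
    ranking = sorted(range(len(docs)),
                     key=lambda i: -query_likelihood(q, docs[i], coll, clen))
    print(ranking)   # document 0 ranks first for this query
```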
• Evaluating Retrieval Quality in Retrieval-Augmented Generation (Salemi & Zamani, 2024)
  • Approach: eRAG, which uses LLMs to generate document-level relevance labels that correlate with downstream performance
  • Datasets: various downstream RAG tasks; exact datasets not specified
  • Results: correlation with downstream performance (Kendall's tau) rises from 0.168 to 0.494; up to 50× better memory efficiency and a 2.468× speedup
  • Key contributions: proposes a novel evaluation metric aligning retrieval quality with end-task performance; reduces computational overhead
• Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
  • Approach: comparative analysis of fine-tuning (FT) and RAG
  • Datasets: not explicitly specified
  • Results: RAG achieves higher accuracy for low-frequency entities; a hybrid FT+RAG setup yields the best results for smaller models
  • Key contributions: highlights the benefits of RAG over traditional fine-tuning; effective for less popular or emerging knowledge
• How Much Knowledge Can You Pack Into the Parameters of a Language Model? (Roberts et al., 2020)
  • Approach: salient span masking (SSM) as a pre-training objective, followed by fine-tuning for open-domain QA
  • Datasets: Natural Questions; WebQuestions; TriviaQA
  • Results: larger models outperform smaller ones; significant performance gains with SSM
  • Key contributions: contrasts closed-book vs. open-book QA; demonstrates the benefits of task-specific pre-training
• Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021; Brown et al., 2020; Deng et al., 2009)
  • Approach: Contrastive Language-Image Pre-training (CLIP) with joint image and text encoders (see the loss sketch after this entry)
  • Datasets: 400M (image, text) pairs for pre-training; evaluated on ImageNet and other benchmarks
  • Results: competitive zero-shot performance on ImageNet; robust to natural distribution shifts
  • Key contributions: bridges visual and textual modalities; enables transferable visual representations via contrastive learning
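The core of CLIP's training is a symmetric contrastive (InfoNCE) objective over a batch of matched image-text pairs. Below is a minimal sketch assuming PyTorch, with the encoders omitted; in the paper the temperature is a learned parameter rather than the fixed value used here.

```python
# Minimal sketch of CLIP's symmetric contrastive loss, assuming PyTorch.
# Matched image-text pairs sit on the diagonal of the similarity matrix;
# all other pairs in the batch act as negatives.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature           # (batch, batch) cosine logits
    labels = torch.arange(img.size(0))
    # Symmetric: classify the right caption per image and vice versa.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

if __name__ == "__main__":
    i, t = torch.randn(16, 512), torch.randn(16, 512)  # stand-in encoder outputs
    print(clip_loss(i, t))
```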
• Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Izacard & Grave, 2021; Roberts et al., 2020)
  • Approach: Fusion-in-Decoder, which encodes multiple retrieved passages independently and aggregates their evidence in the decoder (sketched below)
  • Datasets: Natural Questions; TriviaQA
  • Results: state-of-the-art Exact Match scores; performance scales with the number of retrieved passages
  • Key contributions: combines retrieval with generation to synthesize evidence; improves open-domain QA accuracy
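Here is a simplified sketch of the Fusion-in-Decoder pattern, assuming the Hugging Face transformers library with a T5 backbone. The released FiD code differs in details (batching, trained checkpoints, special tokens), so treat this as an illustration of the encode-separately, decode-jointly idea rather than the authors' implementation.

```python
# Fusion-in-Decoder sketch: encode each (question, passage) pair independently,
# concatenate the encoder states, and let the decoder attend over all of them.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

def fid_generate(question: str, passages: list[str], max_new_tokens: int = 32) -> str:
    inputs = [f"question: {question} context: {p}" for p in passages]
    enc = tok(inputs, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        enc_out = model.encoder(input_ids=enc.input_ids,
                                attention_mask=enc.attention_mask)
        # Fuse: (n_passages, seq, dim) -> (1, n_passages * seq, dim).
        fused = BaseModelOutput(last_hidden_state=enc_out.last_hidden_state
                                .reshape(1, -1, model.config.d_model))
        mask = enc.attention_mask.reshape(1, -1)
        ids = torch.tensor([[model.config.decoder_start_token_id]])
        for _ in range(max_new_tokens):          # greedy decoding, for brevity
            logits = model(encoder_outputs=fused, attention_mask=mask,
                           decoder_input_ids=ids).logits
            nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
            if nxt.item() == model.config.eos_token_id:
                break
            ids = torch.cat([ids, nxt], dim=1)
    return tok.decode(ids[0], skip_special_tokens=True)
```

Because each passage is encoded on its own, encoding cost grows linearly in the number of passages while the decoder still sees all the evidence at once.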
• Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2023)
  • Approach: HyDE, a two-step process that generates hypothetical documents with an instruction-following LM and then applies an unsupervised contrastive encoder for similarity search (see the sketch after this entry)
  • Datasets: various tasks: web search, QA, fact verification (multi-language settings)
  • Results: outperforms existing unsupervised dense retrieval models; competitive with fine-tuned models
  • Key contributions: enables effective zero-shot retrieval without explicit relevance labels; leverages hypothetical document generation
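The two HyDE steps fit in a few lines. In this hedged sketch, `lm_write` is a hypothetical placeholder for the instruction-following LM, and `encode` stands for any unsupervised sentence encoder (the paper used Contriever).

```python
# HyDE sketch: (1) have an LM write a hypothetical answer document,
# (2) embed it and search the corpus by vector similarity. The hypothetical
# text may contain wrong facts; only its embedding neighborhood matters.
import numpy as np

def lm_write(question: str) -> str:
    """Placeholder: prompt an LM for a passage that answers `question`."""
    raise NotImplementedError

def hyde_search(question: str, corpus_emb: np.ndarray, encode, k: int = 5):
    hypothetical = lm_write(question)
    q_vec = encode(hypothetical)          # encoder shared with the corpus
    scores = corpus_emb @ q_vec           # cosine if rows are L2-normalized
    return np.argsort(-scores)[:k]        # indices of the top-k documents
```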
• Re2G: Retrieve, Rerank, Generate (Glass et al., 2022; building on Lewis et al., 2020; Guu et al., 2020)
  • Approach: integrated framework combining retrieval, reranking, and generation, trained with knowledge distillation
  • Datasets: various tasks (exact datasets not specified)
  • Results: enhanced selection of relevant passages; improved overall performance across tasks
  • Key contributions: unifies retrieval, reranking, and generation in an end-to-end framework; improves evidence selection
• REALM: Retrieval-Augmented Language Model Pre-Training (Guu et al., 2020; Devlin et al., 2018; Raffel et al., 2019)
  • Approach: two-step process of retrieval followed by masked-language-model prediction
  • Datasets: Natural Questions; WebQuestions
  • Results: 4–16% absolute accuracy improvements on open-domain QA benchmarks
  • Key contributions: integrates retrieval into pre-training; enhances prediction accuracy and model interpretability
• REST: Retrieval-Based Speculative Decoding (He et al., 2024; Miao et al., 2023; Chen et al., 2023)
  • Approach: uses a non-parametric retrieval datastore to construct draft tokens for speculative decoding (see the sketch after this entry)
  • Datasets: HumanEval; MT-Bench
  • Results: 1.62×–2.36× speedup in token generation compared to standard autoregressive decoding
  • Key contributions: improves generation speed without additional training; integrates seamlessly with various LLMs
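The sketch below shows the retrieve-draft-verify loop in REST's spirit, heavily simplified: REST uses a suffix-indexed datastore and a token trie rather than the linear scan here, and `lm_next_token` is a hypothetical greedy-decoding stand-in for the target model.

```python
# Retrieval-based speculative decoding sketch: look up the current suffix in
# a datastore of token sequences, propose the stored continuation as draft
# tokens, then keep only the prefix the target model itself would produce.

def lm_next_token(context: list[int]) -> int:
    """Placeholder: the target LM's greedy next token for `context`."""
    raise NotImplementedError

def retrieve_draft(context: list[int], datastore: list[list[int]],
                   suffix_len: int = 4, draft_len: int = 8) -> list[int]:
    """Find a datastore sequence containing the last `suffix_len` tokens of
    `context` and return what followed it there."""
    suffix = context[-suffix_len:]
    for seq in datastore:
        for i in range(len(seq) - suffix_len):
            if seq[i:i + suffix_len] == suffix:
                return seq[i + suffix_len:i + suffix_len + draft_len]
    return []

def verify(context: list[int], draft: list[int]) -> list[int]:
    """Accept draft tokens while they agree with the target model. In real
    systems the whole draft is checked in one batched forward pass, which
    is where the speedup comes from."""
    accepted = []
    for tok in draft:
        expected = lm_next_token(context + accepted)
        if expected != tok:
            accepted.append(expected)   # first disagreement: take model's token
            break
        accepted.append(tok)
    return accepted
```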
• Retrieval Augmentation Reduces Hallucination in Conversation (Shuster et al., 2021; Roller et al., 2021; Maynez et al., 2020; Lewis et al., 2020b; Dinan et al., 2019b; Zhou et al., 2021)
  • Approach: integration of retrieval mechanisms into dialogue systems, fetching relevant documents to improve factuality
  • Datasets: knowledge-grounded conversational datasets; specific names not provided
  • Results: reduced hallucination rates; higher factual accuracy compared to standard models
  • Key contributions: demonstrates more reliable, factually grounded conversational responses
• Retrieval Augmented Code Generation and Summarization (Parvez et al., 2021; Karpukhin et al., 2020; Feng et al., 2020; Guo et al., 2021; Ahmad et al., 2021)
  • Approach: REDCODER, a two-step framework combining retrieval with generation on top of pre-trained code models
  • Datasets: code-generation benchmarks (e.g., CoNaLa, CodeXGLUE), evaluated via BLEU, Exact Match, and CodeBLEU
  • Results: significant improvements in BLEU, Exact Match, and CodeBLEU scores
  • Key contributions: enhances code generation and summarization; effectively retrieves relevant code snippets and integrates pre-trained models
• Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020; Karpukhin et al., 2020; Guu et al., 2020; Petroni et al., 2019)
  • Approach: combines retrieval with generative modeling to synthesize external knowledge (the core marginalization is shown after this entry)
  • Datasets: various knowledge-intensive benchmarks (e.g., Natural Questions, TriviaQA)
  • Results: outperforms extractive and closed-book models in accuracy and robustness
  • Key contributions: balances internal model knowledge with external retrieval; provides accurate and comprehensive answers
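For orientation, the central quantity in the RAG-Sequence formulation of Lewis et al. (2020) marginalizes the generator over the top-k retrieved passages:

```latex
% RAG-Sequence (Lewis et al., 2020): the retriever p_\eta scores passages z
% for input x, and the seq2seq generator p_\theta conditions on each passage;
% both components are trained jointly end to end.
p(y \mid x) \;\approx\; \sum_{z \in \operatorname{top-}k\left(p_\eta(\cdot \mid x)\right)}
  p_\eta(z \mid x)\, p_\theta(y \mid x, z)
```

The RAG-Token variant applies the same marginalization per generated token, letting different tokens draw on different passages.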
• Retrieval-Enhanced Machine Learning (Zamani et al., 2022)
  • Approach: REML framework for joint optimization of a prediction model and a retrieval model, applied in domain-adaptation and few-shot learning scenarios
  • Datasets: not specified
  • Results: improves model generalization, scalability, and interpretability
  • Key contributions: offloads memorization to a retrieval system; supports dynamic updates to knowledge bases
• Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2024)
  • Approach: adaptive retrieval with self-reflection via special "reflection tokens" and structured self-critique
  • Datasets: open-domain QA, reasoning, and fact-verification tasks; specific datasets not provided
  • Results: outperforms state-of-the-art models in factuality and citation accuracy
  • Key contributions: introduces self-critique into the RAG pipeline; enables adaptive retrieval and improved output reliability
• The Probabilistic Relevance Framework: BM25 and Beyond (Robertson & Sparck Jones, 1977; Robertson et al., 1994; Robertson & Zaragoza, 2009; Sparck Jones et al., 2000)
  • Approach: probabilistic relevance modeling using term frequency, inverse document frequency, and document-length normalization (a BM25 scoring sketch follows this entry)
  • Datasets: TREC collections and other ad-hoc retrieval tasks
  • Results: robust performance as a long-standing benchmark for relevance estimation
  • Key contributions: provides the theoretical foundation for modern IR systems; basis for the widely adopted BM25 scoring function
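The following is a minimal sketch of Okapi BM25 scoring with the standard k1 and b parameters; note that published variants differ in the exact IDF formulation (this one uses the non-negative form popularized by Lucene).

```python
# Minimal Okapi BM25 scoring sketch with parameters k1 and b.
import math
from collections import Counter

def bm25_score(query: list[str], doc: list[str], docs: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N                # average doc length
    df = Counter(t for d in docs for t in set(d))        # document frequencies
    tf, dlen = Counter(doc), len(doc)
    score = 0.0
    for term in query:
        if tf[term] == 0:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
        norm = tf[term] + k1 * (1 - b + b * dlen / avgdl)  # length-normalized tf
        score += idf * tf[term] * (k1 + 1) / norm
    return score

if __name__ == "__main__":
    corpus = [["bm25", "ranks", "documents"], ["neural", "retrieval", "models"]]
    print(bm25_score(["bm25", "documents"], corpus[0], corpus))
```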
• TIARA: Multi-grained Retrieval for Robust Question Answering over Large Knowledge Base (Shu et al., 2022; Gu et al., 2021; Ye et al., 2021; Chen et al., 2021; Raffel et al., 2020; Devlin et al., 2019)
  • Approach: multi-grained retrieval integrating entity, exemplary logical form, and schema retrieval, with constrained decoding
  • Datasets: GrailQA; WebQuestionsSP
  • Results: significant improvements in compositional and zero-shot generalization; outperforms previous methods
  • Key contributions: addresses KBQA challenges by retrieving context at multiple granularities; enhances the accuracy and reliability of logical-form generation