| 
          A Neural Corpus Indexer for Document Retrieval (Wang et al., 2022) | 
          End-to-end seq2seq network with a Prefix-Aware Weight-Adaptive (PAWA) decoder; query generation network; hierarchical k-means indexing |  | 
          +21.4% relative improvement in Recall@1 on NQ320k; +16.8% improvement in R-Precision on TriviaQA | 
          Unifies training and indexing; introduces a novel decoder and realistic query–document pair generation for enhanced retrieval performance | 
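The hierarchical k-means step that produces NCI's semantic document identifiers can be sketched as below. This is a toy reconstruction, not the paper's code: `kmeans` is a bare Lloyd's iteration with deterministic initialization (an assumption made for reproducibility), and real document embeddings would replace the small vectors shown here.

```python
def kmeans(points, k, iters=10):
    """Bare Lloyd's k-means over lists of floats, deterministic init."""
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [[sum(dim) / len(cl) for dim in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

def semantic_ids(docs, k=2, prefix=()):
    """Recursively cluster (name, vector) docs; the path of cluster
    indices from the root becomes each document's semantic identifier."""
    if len(docs) <= 1:
        return {name: prefix for name, _ in docs}
    centers = kmeans([vec for _, vec in docs], k)
    buckets = [[] for _ in range(k)]
    for name, vec in docs:
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centers[c])))
        buckets[j].append((name, vec))
    if any(len(b) == len(docs) for b in buckets):  # degenerate split: stop
        return {name: prefix for name, _ in docs}
    ids = {}
    for j, bucket in enumerate(buckets):
        ids.update(semantic_ids(bucket, k, prefix + (j,)))
    return ids
```

Documents close in embedding space share identifier prefixes, which is what allows a prefix-aware decoder to condition on the partial ID generated so far.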
    
    
      | 
          Active Retrieval Augmented Generation (Jiang et al., 2023) | 
          Dynamic, iterative retrieval integrated into generation (FLARE); detects low-confidence tokens and retrieves additional context | 
          Knowledge-intensive tasks (e.g., multihop QA, open-domain summarization); specific datasets not detailed | 
          Significant performance improvements in complex, long-form generation tasks | 
          Introduces a forward-looking, active retrieval mechanism; moves beyond static, single-time retrieval methods | 
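The FLARE loop described in this row can be sketched as follows; `step_generate` and `retrieve` are hypothetical stubs standing in for the LM and the retriever, and the confidence threshold `tau` is illustrative, not the paper's value.

```python
def flare_generate(question, step_generate, retrieve, tau=0.6, max_steps=8):
    """Active-retrieval sketch: generate one sentence at a time; if the
    minimum token probability of the draft falls below tau, retrieve
    extra context for it and regenerate before committing."""
    context, answer = [], []
    for _ in range(max_steps):
        sent, min_prob, done = step_generate(question, context, answer)
        if min_prob < tau:  # low confidence: look something up first
            context.extend(retrieve(sent))
            sent, min_prob, done = step_generate(question, context, answer)
        answer.append(sent)
        if done:
            break
    return " ".join(answer)
```

The key difference from single-shot RAG is that retrieval is re-triggered mid-generation, using the tentative next sentence as the query.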
    
    
      | 
          Atlas: Few-shot Learning with Retrieval Augmented Language Models (Izacard et al., 2023) | 
          Dual-encoder retrieval combined with a sequence-to-sequence generator; joint pre-training of both components | 
          Natural Questions; MMLU; KILT benchmarks | 
          Over 42% accuracy on Natural Questions with only 64 training examples; outperforms larger models (e.g., PaLM) by 3% | 
          Demonstrates effective few-shot learning with minimal data; offers an adaptable document index | 
    
    
      | 
          Benchmarking Large Language Models in Retrieval-Augmented Generation (Chen et al., 2024) | 
          Evaluation framework (RGB) assessing retrieval quality in LLMs (e.g., ChatGPT, ChatGLM, Vicuna) | 
          Evaluation tasks in English and Chinese under varying noise conditions | 
          Accuracy drop: e.g., ChatGPT from 96.33% to 76% with noise; multi-document integration challenges (accuracy drops to 43–55%) | 
          Provides a rigorous benchmark for RAG settings; highlights error detection and rejection behaviors in LLMs | 
    
    
      | 
          C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models (Kang et al., 2024) | 
          Conformal risk analysis to certify generation risks; establishes an upper bound (“conformal generation risk”) | 
          AESLC; CommonGen; DART; E2E | 
          Consistently lower conformal generation risks compared to non-retrieval models | 
          Extends conformal prediction methods to RAG; provides a framework for risk certification in generation | 
    
    
      | 
          Can Knowledge Graphs Reduce Hallucinations in LLMs: A Survey (Agrawal et al., 2024) | 
          Survey categorizing KG integration methods into:  • Knowledge-aware inference  • Knowledge-aware learning  • Knowledge-aware validation | 
          Aggregated studies across multiple tasks; no single dataset | 
          Up to 80% enhancement in answer correctness in certain settings; improved chain-of-thought reasoning | 
          Comprehensively categorizes KG-based augmentation methods; addresses hallucination reduction in LLMs | 
    
    
      | 
          Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020) | 
          Dual-encoder dense vector representations for semantic matching; utilizes in-batch negative training | 
          Natural Questions; other open-domain QA benchmarks | 
          Top-20 accuracy of 78.4% on Natural Questions (vs. 59.1% for BM25) | 
          Introduces dense retrieval techniques; significantly improves semantic matching in QA systems | 
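DPR's in-batch negative objective can be written down in a few lines. The sketch below assumes each query's positive passage sits at the same batch index; every other passage in the batch serves as a negative.

```python
import math

def in_batch_nll(q_vecs, p_vecs):
    """Average negative log-likelihood of the positive passage under a
    softmax over dot-product similarities with the whole batch."""
    losses = []
    for i, q in enumerate(q_vecs):
        sims = [sum(a * b for a, b in zip(q, p)) for p in p_vecs]
        m = max(sims)  # log-sum-exp stabilization
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        losses.append(log_z - sims[i])  # -log softmax at the positive index
    return sum(losses) / len(losses)
```

In training, `q_vecs` and `p_vecs` come from the two BERT encoders; reusing batch-mates as negatives is what makes the scheme cheap compared to mining negatives separately.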
    
    
      | 
          DocPrompting: Generating Code by Retrieving the Docs (Zhou et al., 2023) | 
          Retrieval of documentation to guide code generation; focuses on documentation rather than NL–code pairs | 
          CoNaLa (Python); curated Bash dataset | 
          52% relative gain in pass@1 and 30% relative gain in pass@10 on CoNaLa | 
          Highlights the importance of documentation retrieval; boosts code generation accuracy and generalization | 
    
    
      | 
          Document Language Models, Query Models, and Risk Minimization for Information Retrieval (Ponte & Croft, 1998; Berger & Lafferty, 1999; Lafferty & Zhai, 2001) | 
          Combines unigram language models, statistical translation methods, Markov chain query models, and Bayesian risk minimization |  | 
          Significant improvements over traditional vector space models | 
          Laid the foundation for integrating DLMs, QMs, and risk minimization; influenced modern retrieval methods | 
    
    
      | 
          Evaluating Retrieval Quality in Retrieval-Augmented Generation (Salemi & Zamani, 2024) | 
          eRAG: uses LLMs to generate document-level relevance labels; labels correlate with downstream performance | 
          Various downstream RAG tasks; exact datasets not specified | 
          Kendall’s tau increased from 0.168 to 0.494; up to 50× memory efficiency and 2.468× speedup | 
          Proposes a novel evaluation metric aligning retrieval quality with end-task performance; reduces computational overhead | 
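The Kendall's tau figures in this row measure rank agreement between a retrieval-quality score and downstream RAG performance. A plain tau-a computation (ties ignored, which is a simplification of the tie-aware variants usually reported) looks like:

```python
def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A jump from 0.168 to 0.494 means eRAG's document-level labels rank retrievers far more consistently with end-task results than the baseline metric does.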
    
    
      | 
          Fine Tuning vs Retrieval Augmented Generation for Less Popular Knowledge | 
          Comparative analysis between fine tuning (FT) and RAG |  | 
          RAG achieves higher accuracy for low-frequency entities; hybrid FT+RAG yields best results for smaller models | 
          Highlights benefits of RAG over traditional fine-tuning; effective for less popular or emerging knowledge | 
    
    
      | 
          How Much Knowledge Can You Pack Into the Parameters of a Language Model? (Roberts et al., 2020) | 
          Fine-tuning with salient span masking (SSM) as a pre-training objective; applied to open-domain QA | 
          Natural Questions; WebQuestions; TriviaQA | 
          Larger models outperform smaller ones; significant performance gains with SSM | 
          Contrasts closed-book vs. open-book QA; demonstrates task-specific pre-training benefits | 
    
    
      | 
          Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021; Brown et al., 2020; Deng et al., 2009) | 
          Contrastive Language–Image Pre-training (CLIP); joint image and text encoders | 
          400M (image, text) pairs; evaluated on ImageNet and other benchmarks | 
          Competitive zero-shot performance on ImageNet; robust to natural distribution shifts | 
          Bridges visual and textual modalities; enables transferable visual representations via contrastive learning | 
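CLIP's training signal is a symmetric InfoNCE loss over the in-batch image–text similarity matrix, with matched pairs on the diagonal. A toy version follows (plain Python lists instead of tensors; the temperature value is illustrative, and in CLIP it is a learned parameter):

```python
import math

def clip_loss(img_vecs, txt_vecs, temperature=0.07):
    """Symmetric InfoNCE sketch: average the image->text and
    text->image cross-entropies over the similarity matrix."""
    n = len(img_vecs)
    sims = [[sum(a * b for a, b in zip(iv, tv)) / temperature
             for tv in txt_vecs] for iv in img_vecs]

    def xent(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # log-sum-exp stabilization
            total += m + math.log(sum(math.exp(s - m) for s in row)) - row[i]
        return total / n

    cols = [list(c) for c in zip(*sims)]  # transpose for text->image
    return 0.5 * (xent(sims) + xent(cols))
```

Minimizing this pulls matched image/text embeddings together and pushes mismatched batch pairs apart, which is what makes zero-shot classification by text-prompt similarity possible.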
    
    
      | 
          Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Izacard & Grave, 2021; Roberts et al., 2020) | 
          Fusion-in-Decoder: independently encodes multiple passages; aggregates evidence in the decoder | 
          Natural Questions; TriviaQA | 
          State-of-the-art Exact Match scores; performance scales with more retrieved passages | 
          Combines retrieval with generation to synthesize evidence; improves open-domain QA accuracy | 
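Fusion-in-Decoder's core trick fits in a few lines: encode each (question, passage) pair independently, then let a single decoder attend over the concatenation. `encode` and `decode` below are hypothetical stand-ins for the T5 encoder and decoder.

```python
def fusion_in_decoder(question, passages, encode, decode):
    """Encode passages independently (cheap, parallelizable); fuse
    evidence only in the decoder by concatenating encoder outputs."""
    encodings = [encode(f"question: {question} context: {p}") for p in passages]
    fused = [h for enc in encodings for h in enc]  # concat on sequence axis
    return decode(fused)
```

Because cross-passage interaction happens only in the decoder, encoding cost grows linearly with the number of passages rather than quadratically, which is why performance can keep improving as more passages are retrieved.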
    
    
      | 
          Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2023) | 
          HyDE: two-step process; generates hypothetical documents using an instruction-following LM; applies unsupervised contrastive encoding | 
          Various tasks: web search, QA, fact verification (multi-language settings) | 
          Outperforms existing unsupervised dense retrieval models; competitive with fine-tuned models | 
          Enables effective zero-shot retrieval without explicit relevance labels; leverages hypothetical document generation | 
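HyDE's two steps reduce to: generate a hypothetical answer document, then retrieve real documents near its embedding. In this sketch, `generate_hypothetical` and `encode` are hypothetical stubs for the instruction-following LM and the unsupervised contrastive encoder.

```python
def hyde_retrieve(query, corpus, generate_hypothetical, encode, k=1):
    """Embed a generated pseudo-document instead of the query itself,
    then rank the corpus by dot-product similarity."""
    hypo = generate_hypothetical(query)
    q_vec = encode(hypo)
    ranked = sorted(corpus,
                    key=lambda d: -sum(a * b for a, b in zip(q_vec, encode(d))))
    return ranked[:k]
```

The point is that document-to-document matching is easier for an unsupervised encoder than question-to-document matching, so no relevance labels are needed at any stage.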
    
    
      | 
          Re2G: Retrieve, Rerank, Generate (Glass et al., 2022; Lewis et al., 2020; Guu et al., 2020) | 
          Integrated framework combining retrieval, reranking, and generation; uses knowledge distillation | 
          Various tasks (exact datasets not specified) | 
          Enhanced selection of relevant passages; improved overall performance across tasks | 
          Unifies retrieval, reranking, and generation in an end-to-end framework; improves evidence selection | 
    
    
      | 
          REALM: Retrieval-Augmented Language Model Pre-Training (Guu et al., 2020; Devlin et al., 2018; Raffel et al., 2019) | 
          Two-step process: retrieval followed by masked language model prediction | 
          Natural Questions; WebQuestions | 
          4–16% absolute accuracy improvements on open-domain QA benchmarks | 
          Integrates retrieval into pre-training; enhances prediction accuracy and model interpretability | 
    
    
      | 
          REST: Retrieval-Based Speculative Decoding (He et al., 2024; Miao et al., 2023; Chen et al., 2023) | 
          Uses a non-parametric retrieval datastore to construct draft tokens for speculative decoding |  | 
          1.62× to 2.36× speedup in token generation compared to standard autoregressive methods | 
          Improves generation speed without additional training; allows seamless integration with various LLMs | 
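One REST decoding step can be sketched as below. This simplifies the paper's method: the datastore here maps context suffixes directly to linear continuations (REST builds a draft tree from retrieved continuations), and `verify` is a hypothetical stand-in for the target model's acceptance check.

```python
def rest_step(prefix, datastore, verify, max_suffix=4, max_draft=4):
    """Draft tokens by longest-suffix lookup in a retrieval datastore,
    then keep only the prefix of the draft the target model accepts."""
    draft = []
    for n in range(min(len(prefix), max_suffix), 0, -1):  # longest suffix first
        key = tuple(prefix[-n:])
        if key in datastore:
            draft = datastore[key][:max_draft]
            break
    accepted = []
    for tok in draft:
        if verify(prefix + accepted, tok):  # does the target model agree?
            accepted.append(tok)
        else:
            break
    return accepted
```

Each accepted token is one autoregressive step saved, and because drafts come from a datastore rather than a draft model, no additional training is required.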
    
    
      | 
          Retrieval Augmentation Reduces Hallucination in Conversation (Roller et al., 2021; Maynez et al., 2020; Lewis et al., 2020b; Shuster et al., 2021; Dinan et al., 2019b; Zhou et al., 2021) | 
          Integration of retrieval mechanisms into dialogue systems; fetches relevant documents for improved factuality | 
          Knowledge-grounded conversational datasets; specific names not provided | 
          Reduced hallucination rates; higher factual accuracy compared to standard models | 
          Demonstrates more reliable, factually grounded conversational responses | 
    
    
      | 
          Retrieval Augmented Code Generation and Summarization (Parvez et al., 2021; Karpukhin et al., 2020; Feng et al., 2020; Guo et al., 2021; Ahmad et al., 2021) | 
          REDCODER framework: two-step process combining retrieval with generation; uses pre-trained code models | 
          Code generation benchmarks (e.g., CoNaLa, CodeXGLUE); evaluated via BLEU, Exact Match, CodeBLEU | 
          Significant improvements in BLEU, Exact Match, and CodeBLEU scores | 
          Enhances code generation and summarization; effectively retrieves relevant code snippets and integrates pre-trained models | 
    
    
      | 
          Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020; Karpukhin et al., 2020; Guu et al., 2020; Petroni et al., 2019) | 
          Combines retrieval with generative modeling to synthesize external knowledge | 
          Various knowledge-intensive benchmarks (e.g., Natural Questions, TriviaQA) | 
          Outperforms extractive and closed-book models in accuracy and robustness | 
          Balances internal model knowledge with external retrieval; provides accurate and comprehensive answers | 
    
    
      | 
          Retrieval-Enhanced Machine Learning (Zamani et al., 2022) | 
          REML framework: Joint optimization of a prediction model and a retrieval model | 
          Applied in domain adaptation and few-shot learning scenarios; datasets not specified | 
          Improves model generalization, scalability, and interpretability | 
          Offloads memorization to a retrieval system; supports dynamic updates to knowledge bases | 
    
    
      | 
          Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2024) | 
          Adaptive retrieval with self-reflection using “reflection tokens”; structured self-critique | 
          Open-domain QA, reasoning, and fact verification tasks; specific datasets not provided | 
          Outperforms state-of-the-art models in factuality and citation accuracy | 
          Introduces self-critique into the RAG pipeline; enables adaptive retrieval and improved output reliability | 
    
    
      | 
          The Probabilistic Relevance Framework: BM25 and Beyond (Robertson & Sparck Jones, 1977; Robertson et al., 1994; Robertson & Zaragoza, 2009; Sparck Jones et al., 2000) | 
          Probabilistic relevance modeling using term frequency, inverse document frequency, and document length normalization | 
          TREC collections and other ad-hoc retrieval tasks | 
          Demonstrated robust performance as a benchmark for relevance estimation | 
          Provides the theoretical foundation for modern IR systems; basis for the widely adopted BM25 scoring function | 
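The scoring function this framework culminates in, Okapi BM25, is compact enough to state directly. The sketch below uses the common non-negative IDF variant (the `+ 1` inside the log is a convention choice, not part of the original formulation) with the usual defaults k1=1.5, b=0.75, over pre-tokenized documents.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with Okapi BM25:
    IDF-weighted, saturating term frequency, length-normalized."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

k1 controls how quickly repeated term occurrences saturate, and b controls how strongly long documents are penalized; both fall out of the probabilistic relevance derivation rather than being ad hoc.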
    
    
      | 
          TIARA: Multi-grained Retrieval for Robust Question Answering over Large Knowledge Base (Gu et al., 2021; Ye et al., 2021; Chen et al., 2021; Raffel et al., 2020; Devlin et al., 2019) | 
          Multi-grained retrieval integrating entity, exemplary logical form, and schema retrieval; uses constrained decoding |  | 
          Significant improvements in compositional and zero-shot generalization; outperforms previous methods | 
          Addresses KBQA challenges by retrieving multiple granularities of context; enhances accuracy and reliability of logical form generation |