- A Neural Corpus Indexer for Document Retrieval (Wang et al., 2022)
  - Method: end-to-end seq2seq network with a Prefix-Aware Weight-Adaptive (PAWA) decoder, a query-generation network, and hierarchical k-means indexing (see the sketch below)
  - Datasets: NQ320k; TriviaQA
  - Results: +21.4% relative improvement in Recall@1 on NQ320k; +16.8% improvement in R-Precision on TriviaQA
  - Contribution: unifies training and indexing; introduces a novel decoder and realistic query-document pair generation for enhanced retrieval performance
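
To make the hierarchical k-means indexing step concrete, here is a minimal sketch (ours, not the paper's code) of how a recursive clusterer can assign prefix-structured semantic identifiers, so that similar documents share docid prefixes; function names and parameters are illustrative, and scikit-learn's KMeans stands in for the paper's clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_semantic_ids(embeddings, k=4, max_leaf=2, prefix=()):
    """Recursively cluster document embeddings; the path of cluster indices
    from root to leaf becomes each document's identifier, so semantically
    similar documents share docid prefixes (the property a prefix-aware
    decoder can exploit)."""
    n = len(embeddings)
    if n <= max_leaf:                        # leaf: enumerate remaining docs
        return {i: prefix + (i,) for i in range(n)}
    labels = KMeans(n_clusters=min(k, n), n_init=10).fit_predict(embeddings)
    ids = {}
    for c in set(labels):
        members = np.where(labels == c)[0]
        sub = assign_semantic_ids(embeddings[members], k, max_leaf,
                                  prefix + (int(c),))
        ids.update({int(members[j]): docid for j, docid in sub.items()})
    return ids

rng = np.random.default_rng(0)
print(assign_semantic_ids(rng.normal(size=(8, 16))))  # e.g. {3: (0, 1, 0), ...}
```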

- Active Retrieval Augmented Generation (Jiang et al., 2023)
  - Method: FLARE, a dynamic, iterative retrieval scheme integrated into generation; detects low-confidence tokens and retrieves additional context before regenerating (see the sketch below)
  - Datasets: knowledge-intensive tasks (e.g., multi-hop QA, open-domain summarization); specific datasets not detailed
  - Results: significant performance improvements on complex, long-form generation tasks
  - Contribution: introduces a forward-looking, active retrieval mechanism; moves beyond static, single-shot retrieval methods
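
The control flow can be illustrated with a toy loop (our sketch, not the authors' code): generate a tentative sentence, and if any token falls below a confidence threshold, retrieve with the draft as the query and regenerate. `generate_sentence` and `retrieve` are hypothetical stand-ins for an LM that exposes token probabilities and for a search backend.

```python
THRESHOLD = 0.4   # retrieve whenever some token probability drops below this

def generate_sentence(prompt, context):
    # placeholder LM: returns (sentence, per-token probabilities)
    return "Paris hosted the 2024 Summer Olympics.", [0.9, 0.95, 0.3, 0.8, 0.9]

def retrieve(query, k=3):
    # placeholder retriever: returns k supporting passages
    return [f"[passage relevant to {query!r}]"] * k

def flare_generate(question, max_sentences=5):
    answer, context = [], []
    for _ in range(max_sentences):
        sent, probs = generate_sentence(question + " ".join(answer), context)
        if min(probs) < THRESHOLD:       # low confidence: look ahead, retrieve
            context = retrieve(sent)     # the draft sentence is the query
            sent, probs = generate_sentence(question + " ".join(answer), context)
        if sent in answer:               # toy stopping criterion
            break
        answer.append(sent)
    return " ".join(answer)

print(flare_generate("Where were the 2024 Summer Olympics held? "))
```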

- Atlas: Few-shot Learning with Retrieval Augmented Language Models (Izacard et al., 2022)
  - Method: dual-encoder retriever combined with a sequence-to-sequence generator; joint pre-training of both components
  - Datasets: Natural Questions; MMLU; KILT benchmarks
  - Results: over 42% accuracy on Natural Questions with only 64 training examples; outperforms much larger models (e.g., PaLM) by 3%
  - Contribution: demonstrates effective few-shot learning with minimal data; offers an adaptable document index

- Benchmarking Large Language Models in Retrieval-Augmented Generation (Chen et al., 2024)
  - Method: RGB, an evaluation framework assessing how well LLMs (e.g., ChatGPT, ChatGLM, Vicuna) use retrieved context
  - Datasets: evaluation tasks in English and Chinese under varying noise conditions
  - Results: accuracy drops sharply under retrieval noise (e.g., ChatGPT from 96.33% to 76%); multi-document integration remains hard (accuracy falls to 43–55%)
  - Contribution: provides a rigorous benchmark for RAG settings; highlights error-detection and rejection behaviors in LLMs

- C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models (Kang et al., 2024)
  - Method: conformal risk analysis to certify generation risks; establishes a provable upper bound, the "conformal generation risk" (see the sketch below)
  - Datasets: AESLC; CommonGen; DART; E2E
  - Results: consistently lower conformal generation risks compared to non-retrieval models
  - Contribution: extends conformal prediction methods to RAG; provides a framework for risk certification in generation
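
The flavor of such a certificate can be shown with a simplified stand-in: a one-sided Hoeffding bound over calibration losses in [0, 1] yields a finite-sample upper bound on the expected generation risk. C-RAG's actual analysis is conformal and considerably more refined; the function below is only our illustration of the "empirical risk plus concentration term" shape of the guarantee.

```python
import math

def conformal_generation_risk(calibration_losses, delta=0.05):
    """With probability at least 1 - delta over the calibration draw, the
    true expected loss is no larger than the returned bound (one-sided
    Hoeffding inequality; losses must lie in [0, 1], e.g. 1 - ROUGE-L)."""
    n = len(calibration_losses)
    empirical_risk = sum(calibration_losses) / n
    return empirical_risk + math.sqrt(math.log(1 / delta) / (2 * n))

losses = [0.12, 0.30, 0.05, 0.22, 0.18, 0.09, 0.25, 0.14]
print(f"certified risk bound: {conformal_generation_risk(losses):.3f}")
```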

- Can Knowledge Graphs Reduce Hallucinations in LLMs?: A Survey (Agrawal et al., 2024)
  - Method: survey categorizing KG-integration methods into knowledge-aware inference, knowledge-aware learning, and knowledge-aware validation
  - Datasets: aggregated studies across multiple tasks; no single dataset
  - Results: up to 80% improvement in answer correctness in certain settings; improved chain-of-thought reasoning
  - Contribution: comprehensively categorizes KG-based augmentation methods; addresses hallucination reduction in LLMs

- Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020)
  - Method: dual-encoder dense vector representations for semantic matching, trained with in-batch negatives (see the sketch below)
  - Datasets: Natural Questions; other open-domain QA benchmarks
  - Results: top-20 accuracy of 78.4% on Natural Questions (vs. 59.1% for BM25)
  - Contribution: introduces practical dense retrieval; significantly improves semantic matching in QA systems
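
The in-batch negative objective is compact enough to sketch directly. This is our minimal rendering of the idea, not the authors' training code: for a batch of question/positive-passage pairs, every other passage in the batch serves as a negative, and the similarity matrix becomes classification logits.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb, p_emb):
    """DPR-style objective: for B (question, positive-passage) pairs, the
    other B - 1 passages in the batch act as negatives. The B x B matrix of
    dot products is treated as classification logits whose correct class
    sits on the diagonal."""
    scores = q_emb @ p_emb.T                 # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0))     # pair i's positive is passage i
    return F.cross_entropy(scores, labels)

q = torch.randn(4, 768)   # stand-in question-encoder outputs
p = torch.randn(4, 768)   # stand-in passage-encoder outputs (row-aligned)
print(in_batch_negative_loss(q, p))
```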

- DocPrompting: Generating Code by Retrieving the Docs (Zhou et al., 2023)
  - Method: retrieves library documentation to guide code generation, rather than relying only on paired NL-code examples
  - Datasets: CoNaLa (Python); a curated Bash dataset
  - Results: 52% relative gain in pass@1 and 30% relative gain in pass@10 on CoNaLa
  - Contribution: highlights the importance of documentation retrieval; boosts code-generation accuracy and generalization

- Document Language Models, Query Models, and Risk Minimization for Information Retrieval (Ponte & Croft, 1998; Berger & Lafferty, 1999; Lafferty & Zhai, 2001)
  - Method: combines unigram language models, statistical translation methods, and Markov-chain query models within a Bayesian risk-minimization framework (see the sketch below)
  - Results: significant improvements over traditional vector-space models
  - Contribution: laid the foundation for integrating document language models, query models, and risk minimization; influenced modern retrieval methods
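
The core scoring rule of this family, unigram query likelihood, fits in a few lines. The sketch below (ours) uses Dirichlet smoothing to back off to the collection language model for terms unseen in a document, in the spirit of the language-modeling approach these papers established; the toy corpus and `mu` value are illustrative.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000):
    """Rank a document by log P(query | doc) under a unigram document
    language model, smoothing term estimates with the collection model
    (Dirichlet smoothing)."""
    d, c = Counter(doc), Counter(collection)
    dlen, clen = sum(d.values()), sum(c.values())
    score = 0.0
    for t in query:
        p_collection = c[t] / clen
        p_doc = (d[t] + mu * p_collection) / (dlen + mu)  # smoothed estimate
        score += math.log(p_doc)
    return score

doc1 = "language models rank documents by query likelihood".split()
doc2 = "cats sit quietly on warm mats".split()
collection = doc1 + doc2
query = "query likelihood models".split()
print(query_likelihood(query, doc1, collection) >
      query_likelihood(query, doc2, collection))   # True: doc1 matches better
```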

- Evaluating Retrieval Quality in Retrieval-Augmented Generation (Salemi & Zamani, 2024)
  - Method: eRAG uses the downstream LLM itself to produce document-level relevance labels that correlate with end-task performance (see the sketch below)
  - Datasets: various downstream RAG tasks; exact datasets not specified
  - Results: Kendall's tau correlation with downstream performance increases from 0.168 to 0.494; up to 50× better memory efficiency and a 2.468× speedup
  - Contribution: proposes an evaluation metric that aligns retrieval quality with end-task performance; reduces computational overhead
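
The labeling idea can be shown with a toy (our sketch, not the authors' code): score each retrieved document by whether the downstream LLM answers correctly from that document alone, then treat those outcomes as relevance judgments for any standard IR metric. `llm_answer` is a hypothetical stand-in for the generation model.

```python
def llm_answer(question, document):
    # placeholder downstream LLM: answers using only this document
    return "Paris" if "Paris" in document else "unknown"

def erag_labels(question, gold_answer, retrieved_docs):
    """Document-level relevance: 1 if the LLM is correct using the document
    alone, else 0; P@k, MRR, etc. can then be computed on these labels."""
    return [int(llm_answer(question, d) == gold_answer) for d in retrieved_docs]

docs = ["The capital of France is Paris.", "Berlin is the capital of Germany."]
labels = erag_labels("What is the capital of France?", "Paris", docs)
print(labels, "precision@2 =", sum(labels) / len(labels))   # [1, 0] 0.5
```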

- Fine Tuning vs. Retrieval Augmented Generation for Less Popular Knowledge
  - Method: comparative analysis of fine-tuning (FT) and RAG on long-tail knowledge
  - Results: RAG achieves higher accuracy for low-frequency entities; a hybrid FT+RAG setup yields the best results for smaller models
  - Contribution: highlights the benefits of RAG over fine-tuning alone; effective for less popular or emerging knowledge

- How Much Knowledge Can You Pack Into the Parameters of a Language Model? (Roberts et al., 2020)
  - Method: fine-tunes a large pre-trained model (T5) for closed-book open-domain QA, with salient span masking (SSM) as a pre-training objective
  - Datasets: Natural Questions; WebQuestions; TriviaQA
  - Results: larger models consistently outperform smaller ones; significant gains from SSM
  - Contribution: contrasts closed-book vs. open-book QA; demonstrates task-specific pre-training benefits

- Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021; Brown et al., 2020; Deng et al., 2009)
  - Method: Contrastive Language-Image Pre-training (CLIP) with joint image and text encoders (see the sketch below)
  - Datasets: 400M (image, text) pairs; evaluated on ImageNet and other benchmarks
  - Results: competitive zero-shot performance on ImageNet; robust to natural distribution shifts
  - Contribution: bridges visual and textual modalities; enables transferable visual representations via contrastive learning
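
The training objective is a symmetric InfoNCE loss, sketched below in a minimal form (ours): matched image-text pairs are pulled together and all other pairings in the batch pushed apart, averaging the image-to-text and text-to-image cross-entropies. The temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive objective: cosine similarities between all
    image/text pairs in the batch form logits; the matched pair is the
    correct class in both directions."""
    img = F.normalize(image_emb, dim=-1)     # cosine similarity via
    txt = F.normalize(text_emb, dim=-1)      # normalized dot products
    logits = img @ txt.T / temperature       # (B, B)
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```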

- Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Izacard & Grave, 2021; Roberts et al., 2020)
  - Method: Fusion-in-Decoder encodes retrieved passages independently and aggregates evidence in the decoder (see the sketch below)
  - Datasets: Natural Questions; TriviaQA
  - Results: state-of-the-art Exact Match scores; performance scales with the number of retrieved passages
  - Contribution: combines retrieval with generation to synthesize evidence across passages; improves open-domain QA accuracy
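
The shape of the computation is easy to show with stand-in modules (our sketch, not the paper's T5 implementation): each (question + passage) pair is encoded on its own, the token representations are concatenated, and one decoder attends over all of them, so cross-passage fusion happens only in the decoder.

```python
import torch

def fusion_in_decoder(encoder, decoder, question_ids, passage_ids_list):
    """Encode each passage independently with the question prepended, then
    let a single decoder attend over the concatenated representations."""
    encoded = [encoder(torch.cat([question_ids, p], dim=1))
               for p in passage_ids_list]     # N tensors of shape (1, L_i, H)
    fused = torch.cat(encoded, dim=1)         # (1, sum of L_i, H)
    return decoder(fused)

# toy stand-ins: an embedding layer as "encoder", mean-pooling as "decoder"
emb = torch.nn.Embedding(100, 16)
encoder = lambda ids: emb(ids)
decoder = lambda memory: memory.mean(dim=1)

question = torch.randint(0, 100, (1, 5))
passages = [torch.randint(0, 100, (1, 20)) for _ in range(4)]
print(fusion_in_decoder(encoder, decoder, question, passages).shape)  # (1, 16)
```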

- Precise Zero-Shot Dense Retrieval without Relevance Labels (Gao et al., 2023)
  - Method: HyDE, a two-step process: generate a hypothetical document with an instruction-following LM, then encode it with an unsupervised contrastive encoder for nearest-neighbor search (see the sketch below)
  - Datasets: web search, QA, and fact-verification tasks, including multilingual settings
  - Results: outperforms existing unsupervised dense retrieval models; competitive with fine-tuned models
  - Contribution: enables effective zero-shot retrieval without explicit relevance labels; leverages hypothetical document generation
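
A toy pipeline (ours, with stand-in components) shows the two steps: a placeholder "LM" drafts a hypothetical answer document, a placeholder encoder embeds it, and retrieval becomes document-to-document similarity, so no relevance labels are needed.

```python
import numpy as np

def generate_hypothetical_doc(query):
    return f"A passage answering the question: {query}"   # placeholder LM

def encode(text, dim=256):
    """Placeholder unsupervised encoder: hash words into a normalized
    bag-of-words vector (a real system would use a contrastive encoder)."""
    vec = np.zeros(dim)
    for word in text.lower().replace("?", " ").replace(".", " ").split():
        vec[hash(word) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def hyde_search(query, corpus, k=1):
    hypothetical = encode(generate_hypothetical_doc(query))
    return sorted(corpus, key=lambda d: -(encode(d) @ hypothetical))[:k]

corpus = ["Paris is the capital of France.", "Whales are marine mammals.",
          "Tokyo is the capital of Japan."]
print(hyde_search("What is the capital of France?", corpus))
```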

- Re2G: Retrieve, Rerank, Generate (Glass et al., 2022; Lewis et al., 2020; Guu et al., 2020)
  - Method: integrated framework combining retrieval, reranking, and generation, trained with knowledge distillation (see the sketch below)
  - Datasets: various tasks (exact datasets not specified)
  - Results: enhanced selection of relevant passages; improved overall performance across tasks
  - Contribution: unifies retrieval, reranking, and generation in an end-to-end framework; improves evidence selection
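
A minimal pipeline sketch (ours) conveys the division of labor: a cheap, recall-oriented retriever proposes candidates, a more precise scorer reorders them, and only the top passages reach the generator. `first_stage`, `rerank_score`, and `generate` are hypothetical stand-ins for BM25/DPR, a cross-encoder, and a seq2seq reader.

```python
def first_stage(query, corpus, k=10):
    overlap = lambda d: sum(w in d for w in query.split())
    return sorted(corpus, key=overlap, reverse=True)[:k]

def rerank_score(query, doc):
    # toy precision-oriented score: overlap penalized by document length
    return sum(w in doc for w in query.split()) / (len(doc.split()) + 1)

def generate(query, passages):
    return f"answer to {query!r} grounded in: {passages[0]!r}"

def re2g(query, corpus, k_rerank=2):
    candidates = first_stage(query, corpus)                  # recall stage
    reranked = sorted(candidates, reverse=True,
                      key=lambda d: rerank_score(query, d))  # precision stage
    return generate(query, reranked[:k_rerank])

corpus = ["paris is the capital of france",
          "france is in europe and borders spain",
          "whales are mammals"]
print(re2g("capital of france", corpus))
```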

- REALM: Retrieval-Augmented Language Model Pre-Training (Guu et al., 2020; Devlin et al., 2018; Raffel et al., 2019)
  - Method: two-step process of retrieval followed by masked-language-model prediction, with the retriever trained end to end during pre-training
  - Datasets: Natural Questions; WebQuestions
  - Results: 4–16% absolute accuracy improvements on open-domain QA benchmarks
  - Contribution: integrates retrieval into pre-training; enhances prediction accuracy and model interpretability

- REST: Retrieval-Based Speculative Decoding (He et al., 2024; Miao et al., 2023; Chen et al., 2023)
  - Method: uses a non-parametric retrieval datastore to construct draft tokens for speculative decoding (see the sketch below)
  - Results: 1.62× to 2.36× speedup in token generation compared to standard autoregressive decoding
  - Contribution: improves generation speed without additional training; allows seamless integration with various LLMs
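
The draft-and-verify loop can be shown with a toy (our sketch, heavily simplified): look up the current context suffix in a datastore of corpus continuations, propose the stored continuation as draft tokens, and keep only the prefix the target model agrees with. `target_model_next` is a hypothetical stand-in for the verifier LLM.

```python
datastore = {("the", "capital", "of"): ["france", "is", "paris"]}

def target_model_next(context):
    # placeholder: the verifier LLM's greedy next token for a given context
    answers = {("the", "capital", "of"): "france",
               ("the", "capital", "of", "france"): "is",
               ("the", "capital", "of", "france", "is"): "paris"}
    return answers.get(tuple(context), "<eos>")

def rest_step(context):
    draft = datastore.get(tuple(context[-3:]), [])   # retrieve draft tokens
    accepted = []
    for tok in draft:                                # verify left to right
        if target_model_next(context + accepted) != tok:
            break                                    # first mismatch: stop
        accepted.append(tok)
    # fall back to one token from the target model so decoding always advances
    return accepted or [target_model_next(context)]

print(rest_step(["the", "capital", "of"]))   # ['france', 'is', 'paris']
```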

- Retrieval Augmentation Reduces Hallucination in Conversation (Shuster et al., 2021; Roller et al., 2021; Maynez et al., 2020; Lewis et al., 2020b; Dinan et al., 2019b; Zhou et al., 2021)
  - Method: integrates retrieval mechanisms into dialogue systems, fetching relevant documents to improve factuality
  - Datasets: knowledge-grounded conversational datasets (specific names not provided)
  - Results: reduced hallucination rates; higher factual accuracy compared to standard models
  - Contribution: demonstrates more reliable, factually grounded conversational responses

- Retrieval Augmented Code Generation and Summarization (Parvez et al., 2021; Karpukhin et al., 2020; Feng et al., 2020; Guo et al., 2021; Ahmad et al., 2021)
  - Method: REDCODER, a two-step framework combining retrieval with generation, built on pre-trained code models
  - Datasets: code-generation benchmarks (e.g., CoNaLa, CodeXGLUE), evaluated via BLEU, Exact Match, and CodeBLEU
  - Results: significant improvements in BLEU, Exact Match, and CodeBLEU scores
  - Contribution: enhances code generation and summarization; effectively retrieves relevant code snippets and integrates pre-trained models

- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020; Karpukhin et al., 2020; Guu et al., 2020; Petroni et al., 2019)
  - Method: combines a learned retriever with generative modeling, marginalizing over retrieved passages to synthesize external knowledge (see the formula below)
  - Datasets: various knowledge-intensive benchmarks (e.g., Natural Questions, TriviaQA)
  - Results: outperforms extractive and closed-book models in accuracy and robustness
  - Contribution: balances internal model knowledge with external retrieval; provides accurate and comprehensive answers
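
Concretely, the RAG-sequence model of Lewis et al. (2020) treats the retrieved passage as a latent variable and marginalizes over the top-k passages returned by the retriever:

$$
p(y \mid x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\; p_\theta(y \mid x, z),
$$

where $p_\eta(z \mid x)$ is the retriever's score for passage $z$ given query $x$, and $p_\theta(y \mid x, z)$ is the generator's likelihood of output $y$ given the query and that passage.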

- Retrieval-Enhanced Machine Learning (Zamani et al., 2022)
  - Method: REML framework jointly optimizing a prediction model and a retrieval model
  - Datasets: applied to domain adaptation and few-shot learning scenarios (datasets not specified)
  - Results: improves model generalization, scalability, and interpretability
  - Contribution: offloads memorization to a retrieval system; supports dynamic updates to knowledge bases

- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2024)
  - Method: adaptive retrieval with structured self-critique via special "reflection tokens" (control flow sketched below)
  - Datasets: open-domain QA, reasoning, and fact-verification tasks (specific datasets not provided)
  - Results: outperforms state-of-the-art models in factuality and citation accuracy
  - Contribution: introduces self-critique into the RAG pipeline; enables adaptive retrieval and improved output reliability
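
A heavily simplified sketch (ours) of the control flow: reflection decisions determine whether to retrieve and whether the draft is supported by the evidence. `lm` and `retrieve` are hypothetical stand-ins, with reflection tokens modeled as dictionary fields rather than actual vocabulary items.

```python
def lm(question, context=None):
    # placeholder LM: emits a draft plus reflection decisions
    if context is None:
        return {"text": "draft", "retrieve": True}          # ~ [Retrieve]=yes
    return {"text": "grounded answer", "supported": True}   # ~ [IsSup]=yes

def retrieve(query):
    return ["relevant evidence passage"]

def self_rag(question):
    step = lm(question)
    if step.get("retrieve"):              # reflection: do we need evidence?
        step = lm(question, context=retrieve(question))
    if not step.get("supported"):         # reflection: is the draft grounded?
        return "[abstained: unsupported]"
    return step["text"]

print(self_rag("Who wrote Hamlet?"))   # -> grounded answer
```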

- The Probabilistic Relevance Framework: BM25 and Beyond (Robertson & Zaragoza, 2009; Robertson & Sparck Jones, 1977; Robertson et al., 1994; Sparck Jones et al., 2000)
  - Method: probabilistic relevance modeling using term frequency, inverse document frequency, and document-length normalization (see the sketch below)
  - Datasets: TREC collections and other ad-hoc retrieval tasks
  - Results: robust performance as a standard benchmark for relevance estimation
  - Contribution: provides the theoretical foundation for modern IR systems; basis for the widely adopted BM25 scoring function
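
The BM25 scoring function itself is short enough to sketch (our minimal rendering; `k1` and `b` are the usual defaults, and the toy corpus is illustrative): each query term contributes its IDF times a saturating, length-normalized term-frequency component.

```python
import math
from collections import Counter

def bm25_score(query, doc, doc_freqs, n_docs, avgdl, k1=1.2, b=0.75):
    """Textbook BM25: term frequency saturates via k1, document length is
    normalized via b, and the +1 inside the log is the common variant that
    keeps IDF non-negative."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

docs = [d.split() for d in ("the cat sat on the mat",
                            "dogs and cats are loyal pets",
                            "bm25 ranks documents by term statistics")]
doc_freqs = Counter(t for d in docs for t in set(d))
avgdl = sum(map(len, docs)) / len(docs)
for d in docs:
    print(round(bm25_score("cat mat".split(), d, doc_freqs, len(docs), avgdl), 3))
```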

- TIARA: Multi-grained Retrieval for Robust Question Answering over Large Knowledge Base (Shu et al., 2022; Gu et al., 2021; Ye et al., 2021; Chen et al., 2021; Raffel et al., 2020; Devlin et al., 2019)
  - Method: multi-grained retrieval integrating entity, exemplary logical form, and schema retrieval, combined with constrained decoding
  - Results: significant improvements in compositional and zero-shot generalization; outperforms previous methods
  - Contribution: addresses KBQA challenges by retrieving multiple granularities of context; enhances accuracy and reliability of logical-form generation