Diffstat (limited to 'main.tex')
-rw-r--r-- | main.tex | 252 |
1 files changed, 252 insertions, 0 deletions
diff --git a/main.tex b/main.tex new file mode 100644 index 0000000..feb4841 --- /dev/null +++ b/main.tex @@ -0,0 +1,252 @@ +\documentclass[conference]{IEEEtran} +\IEEEoverridecommandlockouts +% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out. +\usepackage{cite} +\usepackage{amsmath,amssymb,amsfonts} +\usepackage{algorithmic} +\usepackage{graphicx} +\usepackage{textcomp} +\usepackage{xcolor} +\usepackage{hyperref} + +\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em + T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}} \begin{document} + +\title{A Case Study on Retrieval-Augmented Generation for AI-Generated Content} + +\author{\IEEEauthorblockN{Aditya Kumar} +\IEEEauthorblockA{\textit{University Institute of Engineering} \\ +\textit{Chandigarh University}\\ +Mohali, India \\ +24mai14003@cuchd.in} +} + +\maketitle + +\begin{abstract} + Improvements in model algorithms have led to the development of Artificial Intelligence Generated Content (AIGC), aided by the expansion of core models and the availability of high-quality datasets. Even with its noteworthy accomplishments, AIGC still confronts challenges that include keeping up with new information, managing large amounts of training and inference data, minimizing data leakage, and handling long-tail data. The paradigm known as Retrieval-Augmented Generation (RAG) has surfaced as a solution to these problems. Specifically, RAG introduces an information retrieval step that improves the generation process by obtaining pertinent objects from accessible data sources, resulting in increased robustness and accuracy. In this study, we present a thorough overview of previous attempts to integrate RAG methodologies into AIGC scenarios. In order to isolate the essential abstractions of the augmentation approaches for different retrievers and generators, we first categorize RAG foundations based on how the retriever augments the generator. This cohesive viewpoint covers all RAG scenarios and highlights developments and key technologies that support possible future breakthroughs. We also provide a summary of further RAG enhancement techniques that help with efficient RAG system deployment and engineering. Then, from a different angle, we survey real-world RAG applications across many modalities and tasks, providing insightful references for scholars and professionals. We also go over the shortcomings of existing RAG systems, present benchmarks for RAG, and make recommendations for future research directions. +\end{abstract} +\begin{IEEEkeywords} + Retrieval-augmented generation, AI-generated content, generative models, information retrieval. +\end{IEEEkeywords} +\section{Introduction} +\subsection{Background} +Artificial Intelligence Generated Content (AIGC) has seen a surge in attention in recent years. 
Large Language Models (LLMs) such as the GPT series \cite{DBLP:conf/nips/BrownMRSKDNSSAA20,DBLP:journals/corr/abs-2107-03374,DBLP:journals/corr/abs-2303-08774} and the LLaMA series \cite{LLaMA,DBLP:journals/corr/abs-2307-09288,DBLP:journals/corr/abs-2308-12950} for text and code, DALL-E \cite{DBLP:conf/icml/RameshPGGVRCS21,DBLP:journals/corr/abs-2204-06125,betker2023improving} and Stable Diffusion \cite{DBLP:conf/cvpr/RombachBLEO22} for images, and Sora \cite{openai/sora} for videos are just a few examples of the carefully designed content generation tools that can produce a wide range of outputs across different modalities. The term ``AIGC'' highlights that sophisticated generative models, instead of humans or rule-based methods, are used to construct the contents. Consequently, aided by cutting-edge model algorithms, enormous high-quality datasets, and foundation models of exponential scale, these generative models have demonstrated outstanding performance. In particular, image-generation tasks have moved from Generative Adversarial Networks (GANs) \cite{GAN} to Latent Diffusion Models (LDMs) \cite{DBLP:conf/cvpr/RombachBLEO22}, while sequence-to-sequence tasks have moved from Long Short-Term Memory (LSTM) networks \cite{DBLP:journals/neco/HochreiterS97} to Transformer-based models \cite{DBLP:conf/nips/VaswaniSPUJGKP17}. It is noteworthy that the scale of foundation models has expanded from millions of parameters initially \cite{DBLP:conf/iclr/GuoRLFT0ZDSFTDC21,DBLP:journals/jmlr/RaffelSRLNMZLL20} to billions or even trillions of parameters today \cite{DBLP:conf/nips/BrownMRSKDNSSAA20}, \cite{LLaMA}, \cite{Switch_transformers}. The availability of extensive, high-quality datasets \cite{DBLP:conf/nips/BrownMRSKDNSSAA20}, \cite{scalingLaw}, which offer enough training examples to fully tune model parameters, further supports these developments. + +Information retrieval is another essential application in computer science. In contrast to generation, retrieval seeks to identify pertinent items that already exist in a sizable pool of resources. Web search engines, which are primarily concerned with document retrieval, are the most common applications of retrieval \cite{DBLP:journals/ftir/RobertsonZ09}, \cite{DBLP:conf/emnlp/KarpukhinOMLWEC20}. Currently, billion-scale document collections can be handled by effective information retrieval systems \cite{DBLP:journals/tbd/JohnsonDJ21}, \cite{DBLP:conf/nips/ChenZWLLLYW21}. Retrieval has also been applied to many modalities beyond documents \cite{DBLP:journals/csur/DattaJLW08,radford2021learning,DBLP:conf/emnlp/FengGTDFGS0LJZ20,DBLP:conf/icassp/WuCZHBD23}. + +Even with major progress in generative models, AIGC still faces obstacles such as out-of-date knowledge, a lack of long-tail knowledge \cite{Adaptive-Retrieval-whennottrust}, and the possibility of private training data leaks \cite{DBLP:conf/uss/CarliniTWJHLRBS21}. Retrieval-Augmented Generation (RAG) uses a flexible data store to try to alleviate these problems \cite{C-RAG}. Retrievable knowledge serves as non-parametric memory that may encode sensitive information, is readily updated, and can handle a large amount of long-tail knowledge. Retrieval can also reduce the cost of generation. 
Large models can be made smaller with RAG \cite{Atlas}, extended contexts can be supported \cite{MemTransformer2022}, and certain generation steps can be removed \cite{REST}. + +The retriever receives an input query, finds pertinent data sources, and interacts with the generator so that the retrieved knowledge improves the generation process. Depending on how the retrieved results enhance the generation, there are various foundational paradigms (or foundations, for short): they can act as an enhanced input to the generator \cite{REALM}, \cite{2020RAG}; they can join as latent representations at an intermediate stage of generation \cite{FID}, \cite{RETRO}; they can contribute to the final generation results as logits \cite{KNN-LM}, \cite{Efficient-KNNLM}; they can even affect or omit certain generation steps \cite{REST}, \cite{GPTCache}. Researchers have also suggested a number of improvements to strengthen the basic RAG procedure. These techniques include targeted improvements for particular components as well as comprehensive improvements aimed at the pipeline as a whole. + +Furthermore, although the idea behind RAG first surfaced in text-to-text generation \cite{2020RAG}, this method has since found use in a wide range of fields, including code \cite{DBLP:conf/emnlp/ParvezACRC21,DBLP:conf/naacl/AhmadCRC21,DBLP:conf/iclr/Zhou0XJN23}, audio \cite{DBLP:journals/corr/abs-2012-07331}, \cite{DBLP:conf/icml/HuangHY0LLYLYZ23}, images \cite{tseng2020retrievegan,sarto2022retrieval,ramos2023smallcap}, videos \cite{DBLP:journals/tomccap/ChenPLYCM23}, \cite{DBLP:journals/corr/abs-2401-00789}, 3D \cite{DBLP:journals/corr/abs-2402-02972}, \cite{DBLP:conf/iccv/ZhangGPCHLYL23}, knowledge \cite{DBLP:conf/coling/HuWSQ22,DBLP:conf/emnlp/HuangKZ21,DBLP:conf/emnlp/DasZTGPLTPM21}, and artificial intelligence for science \cite{wang2022retrieval}, \cite{jin2023genegpt}. Specifically, the fundamental concept and methodology of RAG are substantially uniform across modalities; however, they require small modifications to the augmentation methods, and the choice of generators and retrievers changes based on the particular modalities and applications. + +In spite of the field's recent rapid expansion and broadening applications, a comprehensive assessment covering all foundations, advancements, and applications of RAG is still lacking, which impedes its progress. The practical relevance of research in this area is undermined by the lack of discussion on RAG foundations, which prevents RAG's full potential from being realized. Although query-based RAG in text-generation tasks has garnered the most research interest, it is important to recognize that other RAG foundations are equally useful and have a great deal of room for expansion. Another reason is that, without a broad overview of RAG applications, practitioners and academics tend to overlook RAG's advancements across a variety of modalities and remain unaware of its potential applications. While text generation is commonly regarded as the primary use case for RAG, RAG development in other modalities has also started to gain traction and has produced encouraging results. A number of modalities have a long history of being associated with retrieval procedures, which gives RAG unique qualities in those settings. Motivated by this, our goal in this study is to offer a thorough survey that presents a methodical summary of RAG. 
+ +\subsection{Contribution} +This case study provides a thorough introduction to RAG, addressing its origins, improvements, uses, benchmarks, constraints, and possible future paths. We extract the fundamentals of RAG foundations, seeing applications as modifications of these principles, notwithstanding differences in retrievers and generators across modalities and workloads. The goal of this paper is to provide scholars and practitioners with recommendations and references, along with insightful information that will help advance RAG techniques and related applications. To summarize, the contributions are as follows: +\begin{itemize} + \item We perform a thorough analysis of RAG and distill the foundational abstractions of RAG for different retrievers and generators. + \item We examine the improvements proposed in the RAG literature and outline the strategies used to make RAG systems more efficient. + \item We survey existing AIGC methods that use RAG techniques for different modalities and tasks, showing how RAG adds value to existing generative models. + \item We discuss RAG's limitations and research directions, which provide insight into possible future developments. +\end{itemize} + +\subsection{Related Work} +Numerous surveys have appeared as RAG has developed, although they only cover a portion of the subject. Specifically, they either cover only a small portion of RAG techniques for specific contexts, or they concentrate solely on one RAG foundation. Without a thorough examination of other modalities, the majority of the publications now available concentrate on text-related RAG tasks that are assisted by LLMs. A fundamental review of RAG is provided in the survey by Li et al. \cite{DBLP:journals/corr/abs-2202-01110}, which also covers particular applications related to text generation tasks. Similarly, Asai et al.'s tutorial \cite{retrieval-lm-tutorial} focuses on retrieval-based language models and describes their training approaches and architectures. Meanwhile, RAG is examined in the context of LLMs in a recent survey by Gao et al. \cite{DBLP:journals/corr/abs-2312-10997}, with a focus on query-based RAG optimization techniques. Our approach extends RAG's reach to the full AIGC ecosystem, acknowledging its expansion outside the text domain and enabling a more thorough coverage of RAG research. Another survey, put forth by Zhao et al. \cite{DBLP:conf/emnlp/ZhaoCWJLQDGLLJ23}, skips over the topic of RAG foundations and instead presents RAG applications across several modalities. Only a portion of other modalities' works are covered in another study \cite{ding2024survey}. Even though certain facets of RAG have been studied in previous research, a thorough overview covering the basics, improvements, and domain-specific applicability of RAG is still lacking. The goal of this paper is to close this gap by offering an organized analysis of RAG. + +\section{Preliminary} +\subsection{Overview} +The generator and the retriever are the two main modules that make up a RAG system. The generator produces the required contents, while the retriever looks for pertinent information in the data store. The RAG process proceeds as follows: (i) the query is first sent to the retriever, which looks for pertinent data; (ii) the original query and the retrieval results are then fed into the generator through a certain augmentation process; (iii) lastly, the generator produces the intended results. 
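+
+As a minimal illustrative sketch (not the implementation of any particular system discussed later), this three-step flow can be written as follows, where \texttt{retrieve}, \texttt{augment}, and \texttt{generate} are placeholders for a concrete retriever, augmentation process, and generator:
+\begin{verbatim}
+# Minimal sketch of the generic RAG loop; `retrieve`, `augment`,
+# and `generate` are placeholders, not a specific system's API.
+def rag(query, retrieve, augment, generate, k=5):
+    results = retrieve(query, top_k=k)   # (i) retrieval
+    inputs = augment(query, results)     # (ii) augmentation
+    return generate(inputs)              # (iii) generation
+\end{verbatim}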
+ +\subsection{Generator} +The era of AIGC has begun thanks to generative AI's outstanding performance on a variety of tasks. In a RAG system, the generation module is essential. Various generative models are used in different circumstances: for example, transformer models are used for text-to-text tasks, VisualGPT \cite{DBLP:conf/cvpr/ChenGY0E22} is used for image-to-text tasks, Stable Diffusion \cite{DBLP:conf/cvpr/RombachBLEO22} is used for text-to-image tasks, Codex \cite{DBLP:journals/corr/abs-2107-03374} is used for text-to-code tasks, and so on. Four generators commonly used in RAG are introduced here: the transformer model, LSTM, diffusion model, and GAN. + +\subsubsection{Transformer Model} +Transformer models, which combine self-attention mechanisms, feedforward networks, layer normalization modules, and residual connections, are among the highest performing models in the field of natural language processing (NLP) \cite{EfficientTransformers}. At each generation step, vocabulary classification is applied to a series of latent representations obtained from tokenization and embedding to construct the final output sequence. + +\subsubsection{LSTM} +Long Short-Term Memory (LSTM) \cite{lstm_survey} is a variant of the Recurrent Neural Network (RNN) model. Cell states and gating mechanisms are used to address the problems of exploding/vanishing gradients in long-term dependency processing. The three gates in the model (Input, Forget, and Output) filter data, while the central Cell State module stores and controls the data. It generates outputs autoregressively using the same vocabulary classification technique as transformer models. + +\subsubsection{Diffusion Model} +Diffusion models are a family of deep generative models capable of producing a wide range of realistic data samples, such as text, images, videos, molecules, and more \cite{yang2023diffsurvey}. In order to create new data from noise, diffusion models first add noise to the data gradually until it becomes random, then reverse the process. Neural networks and probabilistic modeling serve as the foundation for this procedure. + +\subsubsection{GAN} +Generative Adversarial Networks (GANs) \cite{GAN} are deep learning models that can generate realistic images, audio, and other data \cite{GAN_Survey}. They consist of a generator and a discriminator, which compete through adversarial learning. The generator continuously improves its ability to generate realistic samples, while the discriminator continuously improves its ability to distinguish real samples from generated ones. + +\subsection{Retriever} +Retrieval is the process of finding and obtaining pertinent information in response to an information need. In particular, consider data sources that can be viewed as a key-value store, in which every key is associated with a value (keys and values can be the same). The goal is to use a similarity function to find the top-$k$ keys most similar to a given query in order to extract the associated values. Existing retrieval techniques can be divided into sparse retrieval, dense retrieval, and other categories based on their similarity functions. The entire process of commonly used sparse and dense retrieval can be broken down into two separate stages: (i) each object is first encoded into a particular representation, and (ii) an index is created to organize the data source for effective search. 
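+
+To make these two stages concrete, the sketch below encodes keys into vectors and performs brute-force top-$k$ search with cosine similarity; it is only an illustrative stand-in for the ANN indices discussed below, and \texttt{encode} denotes any modality-specific encoder:
+\begin{verbatim}
+import numpy as np
+
+# Illustrative key-value retrieval: `encode` is any encoder that
+# maps an object to a vector; brute-force search stands in for
+# the ANN indices used at scale.
+def build_index(keys, encode):
+    vecs = np.stack([encode(key) for key in keys])            # stage (i)
+    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
+
+def retrieve(query, index, values, encode, k=5):
+    q = encode(query)
+    q = q / np.linalg.norm(q)
+    scores = index @ q                    # cosine similarity
+    top = np.argsort(-scores)[:k]         # stage (ii): search
+    return [values[i] for i in top]
+\end{verbatim}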
+ +\subsubsection{Sparse Retriever} +Sparse retrieval techniques are frequently employed in document retrieval, where the documents to be searched serve as the keys or values. These methods make use of term matching metrics that examine word statistics from texts and build inverted indices for effective searching, such as TF-IDF \cite{DBLP:conf/sigir/RobertsonW97}, query likelihood \cite{DBLP:conf/sigir/LaffertyZ01}, and BM25 \cite{DBLP:journals/ftir/RobertsonZ09}. In general, BM25 is a robust baseline for large-scale web search that incorporates query token occurrences, inverse document frequency weights, and other relevant metrics. Typically, sparse retrieval uses an inverted index to organize items in order to facilitate effective search: each term in the query looks up a list of candidate documents, which are then ranked according to their statistical scores. + +\subsubsection{Dense Retriever} +Dense retrieval techniques, in contrast to sparse retrieval, use dense embedding vectors to represent queries and keys and build an Approximate Nearest Neighbor (ANN) index to expedite the search. This applies to every modality. Recent developments in pre-trained models (like BERT \cite{DBLP:conf/iclr/GuoRLFT0ZDSFTDC21}) have been used to encode queries and keys separately for text data \cite{DBLP:journals/ftir/RobertsonZ09}; this method is commonly referred to as Dense Passage Retrieval (DPR). Much like for text, models for encoding code data \cite{DBLP:conf/emnlp/FengGTDFGS0LJZ20}, audio data \cite{DBLP:conf/icassp/HersheyCEGJMPPS17}, image data \cite{radford2021learning}, video data \cite{DBLP:conf/cvpr/DongLXJH0W19}, and other types of data have been proposed. Typically, measures like cosine similarity, inner product, and L2-distance are used to calculate the similarity score between dense representations. + +Dense retrievers are usually trained with contrastive learning, which pulls positive pairs closer together and pushes negative samples further apart. To improve model quality even more, a number of hard-negative techniques \cite{DBLP:conf/iclr/XiongXLTLBAO21} have been put forth. ANN algorithms are used for effective searching during inference. Tree-based indices \cite{bentley1975multidimensional}, \cite{li2023learning}, locality-sensitive hashing \cite{datar2004locality}, neighbor graph indices (e.g., HNSW \cite{malkov2018efficient}, DiskANN \cite{jayaram2019diskann}), and combined graph and inverted indices (e.g., SPANN \cite{DBLP:conf/nips/ChenZWLLLYW21}) are some of the indices created to support ANN search. + +\subsubsection{Others} +There are more techniques for obtaining pertinent objects besides sparse and dense retrieval \cite{DBLP:conf/nips/WangHWMWCXCZL0022}, \cite{DBLP:conf/nips/ZhangWCCZMHDMWP23}. Some studies employ the edit distance between natural language texts \cite{DBLP:conf/emnlp/HayatiOAYTN18} or abstract syntax trees (ASTs) of code snippets \cite{DBLP:conf/icse/ZhangW00020}, \cite{DBLP:conf/iclr/PoesiaP00SMG22} directly, in place of computing representations. 
Relationships between entities in knowledge graphs act as a pre-built index for retrieval. K-hop neighbor searches can therefore be used for retrieval in RAG approaches that use knowledge graphs \cite{DBLP:conf/acl/YeYHZX22}, \cite{DBLP:journals/corr/abs-2210-12925}. Named Entity Recognition (NER) \cite{lin2020bridging} is an additional retrieval technique in which the query is the input and the entities serve as keys. + +\section{Methodologies} +\subsection{RAG Foundations} +\subsubsection{Query-based RAG} +Originating from the concept of prompt augmentation, query-based RAG combines insights from retrieved data with the user's query and delivers them straight into the generator's input stage. This approach is often used in RAG applications: after retrieval, the content is combined with the user's initial query to form a composite input, which the generator processes to produce a response. Query-based RAG is frequently used across many different modalities. + +REALM \cite{REALM} uses a dual-BERT framework for text generation, combining knowledge retrievers with pre-trained models to expedite knowledge retrieval and integration. Lewis et al. \cite{lewis2020retrieval} used DPR for information retrieval and BART as the generator to efficiently improve generation. SELF-RAG \cite{Self-RAG} uses a critique module to assess whether retrieval is necessary. In addition to being interoperable with local generators, query-based RAG can be used in situations that access LLMs through API calls. By treating the language model as a ``black box,'' REPLUG \cite{REPLUG} adheres to this paradigm and successfully incorporates pertinent external documents into the query. In-Context RALM \cite{RALM} leverages BM25 for document retrieval and a trained reranker to reorder and integrate the top-ranked documents. + +The query-based paradigm has been used in a number of publications \cite{DBLP:conf/iclr/Zhou0XJN23}, \cite{DBLP:conf/emnlp/ZanCLGWL22,DBLP:conf/icse/NashidSM23,DBLP:conf/sigsoft/JinSTSLSS23,DBLP:conf/acl/LuDHGHS22} in the field of code to improve the efficacy of downstream tasks by incorporating contextual information from text or code into the prompt. + +Recent studies on Knowledge Base Question Answering (KBQA) have also demonstrated the important benefits of integrating language and retrieval models. For example, by combining queries and retrieved data into prompts, Uni-Parser \cite{DBLP:conf/acl/Liu22}, RNG-KBQA \cite{DBLP:conf/acl/YeYHZX22}, and ECBRF \cite{DBLP:conf/eacl/YangDCC23} successfully increase the accuracy and performance of QA systems. + +Chat-Orthopedist \cite{shi2023retrieval}, a tool in the AI-for-Science space, uses retrieved data in model prompts, facilitating shared decision-making for adolescents with idiopathic scoliosis and increasing the efficacy and accuracy of LLMs. + +RetrieveGAN \cite{tseng2020retrievegan} incorporates retrieved data, such as selected image patches and their bounding boxes, into the generator's input stage to increase the relevance and accuracy of generated images in the image generation task. IC-GAN \cite{casanova2021instance} concatenates noise vectors with instance features, which adjusts the particular conditions and details of the generated images. + +RetDream \cite{DBLP:journals/corr/abs-2402-02972} uses CLIP \cite{DBLP:journals/corr/abs-2204-06125} to first retrieve pertinent 3D elements for 3D generation. During the input phase, the retrieved contents are combined with the user input. 
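+
+For query-based RAG, the augmentation step therefore reduces to prompt construction. The following hypothetical sketch simply prepends retrieved passages to the user query before calling a (possibly black-box) LLM; the template is illustrative and is not the exact prompt format of REPLUG, In-Context RALM, or any other cited system:
+\begin{verbatim}
+# Hypothetical query-based augmentation: retrieved passages are
+# concatenated with the user query; the template is illustrative.
+def augment(query, passages):
+    context = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
+    return ("Answer the question using the passages below.\n\n"
+            f"{context}\n\nQuestion: {query}\nAnswer:")
+\end{verbatim}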
+ +Frequently used in conjunction with LLM generators, query-based RAG provides modular flexibility that enables the integration of pretrained components for rapid deployment. Using the retrieved data in this setting mainly requires careful prompt design. + +\subsubsection{Latent Representation-based RAG} +In the latent representation-based RAG framework, the retrieved objects are incorporated into generative models as latent representations, thereby improving the quality of the generated content and strengthening the model's understanding capabilities. + +FiD \cite{FID} and RETRO \cite{RETRO} are two classic structures of latent representation-based RAG in the text domain, upon which numerous later works have built. FiD \cite{FID} processes each retrieved passage, together with its title and the query, through separate encoders and then fuses the resulting latent representations in a single decoder to generate the final output. After retrieving pertinent data for each segmented chunk of the input, RETRO \cite{RETRO} uses a new module called Chunked Cross-Attention (CCA) to combine the retrieved data with the tokens of each chunk. Other significant innovative structures fall under the purview of latent representation-based RAG as well. In order to enable input chunking and, in theory, overcome the long-criticized context length limits of Transformer models, a number of studies \cite{MemTransformer2022}, \cite{Bertsch2023UnlimiformerLT} have integrated k-Nearest-Neighbor (kNN) search into transformer blocks. Kuratov et al. \cite{RMT-R} combined the Transformer with an RNN, using the intermediate output of the model as the retrieval content. + +FiD has become widely used in the disciplines of science and code, with applications in a variety of code-related domains \cite{DBLP:conf/kbse/LiL000J21,DBLP:conf/icsm/YuYCLZ22,DBLP:conf/nips/HashimotoGOL18,DBLP:conf/kbse/WeiLLXJ20,DBLP:conf/emnlp/ShiW0DZHZ022} and AI-for-Science \cite{wang2022retrieval}. + +Several studies \cite{chen2022re,sheynin2022knn,blattmann2022retrieval,rombach2022text} use cross-attention techniques in the visual domain to integrate latent representations and merge retrieval outcomes. In contrast, Li et al. \cite{li2022memory} use an Affine Combination Module (ACM) that directly concatenates the hidden features of text and images. + +Numerous studies \cite{DBLP:conf/naacl/OguzCKPOSGMY22,DBLP:conf/iclr/YuZNZL0HWWX23,DBLP:conf/cikm/DongLWZXX23,DBLP:journals/corr/abs-2308-13259,DBLP:conf/sigir/YuY23} have used FiD and its derivatives for downstream tasks inside the knowledge domain. While TOME \cite{TOME} shifts to a nuanced encoding of mentions, giving mention granularity precedence over entity representations alone, EaE \cite{EaE} improves the generator's comprehension through entity-specific parameterization. + +ReMoDiffuse \cite{DBLP:conf/iccv/ZhangGPCHLYL23} advances the field of 3D generation by introducing a semantics-modulated attention mechanism that improves the precision of producing comparable 3D motions from textual descriptions. By combining the original diffusion process with a reference diffusion process, AMD \cite{jing2023amd} successfully converts text to 3D motion. + +Koizumi et al. \cite{DBLP:journals/corr/abs-2012-07331} used an LLM in the audio domain, guiding the generation of audio captions by integrating encoded dense features in the attention module. 
ReAudioLDM \cite{DBLP:journals/corr/abs-2309-08051} extracts deep features from text and audio using different encoders, and these features are then incorporated into the attention mechanism of the Latent Diffusion Model (LDM). + +R-ConvED \cite{DBLP:journals/tomccap/ChenPLYCM23} processes retrieved video-sentence pairs using an attention mechanism and a convolutional encoder-decoder network, creating hidden states to generate captions for videos. CARE \cite{DBLP:journals/tip/YangCZ23} integrates concept representations into a hybrid attention mechanism and introduces a concept detector to generate concept probabilities. EgoInstructor \cite{DBLP:journals/corr/abs-2401-00789} enhances the coherence and relevance of captions for egocentric videos by combining text and visual elements via gated cross-attention. Latent representation-based RAG combines retriever and generator hidden states and is flexible across modalities and tasks, although it necessitates extra training to align the latent spaces. It makes it possible to design sophisticated algorithms that smoothly integrate the retrieved data. + +\subsubsection{Logit-based RAG} +In logit-based RAG, generative models incorporate retrieval information via logits during the decoding phase. The logits are usually merged via simple summation or learned models to compute the probabilities for step-wise generation. + +In the text domain, kNN-LM \cite{KNN-LM} and its variants \cite{Efficient-KNNLM} combine, at each decoding step, the language model probabilities with probabilities derived from the retrieval distances of similar prefixes. TRIME \cite{TRIME} and NPM \cite{NPM} are radical extensions of conventional kNN-LM techniques that use highly aligned tokens from a local database as output, improving performance especially in long-tail distribution scenarios. + +In addition to text, logit-based RAG is also used in other modalities such as code and images. + +A number of studies \cite{DBLP:conf/icse/ZhangW00020}, \cite{DBLP:conf/emnlp/Zhang0YC23} have also used the kNN concept in the code domain to improve control over the final output and attain better performance. Additionally, EDITSUM \cite{DBLP:conf/kbse/LiL000J21} incorporates prototype summaries at the logit level to enhance the quality of code summarization. MA \cite{fei2021memory} uses the kNN-LM framework to solve the image captioning problem with positive outcomes. Because it uses previous data to infer current states and combines information at the logit level, logit-based RAG is well suited to sequence generation. It emphasizes generator training and leaves room for novel techniques that take advantage of probability distributions in future tasks. + +\subsubsection{Speculative RAG} +Speculative RAG looks for ways to economize resources and speed up response times by using retrieval rather than pure generation. REST \cite{REST} allows for the creation of drafts by substituting retrieval for the small models used in speculative decoding \cite{Speculative_Decoding}. GPTCache \cite{GPTCache} creates a semantic cache to store LLM responses, thereby alleviating the problem of excessive latency when using LLM APIs. In order to retrieve words or phrases from documents rather than generating them, COG \cite{COG} breaks down the text generation process into a sequence of copy-and-paste operations. Cao et al. 
\cite{RetrievalisAccurateGeneration} suggested a novel paradigm that substitutes directly retrieved phrase-level content for generation in order to remove the final result's reliance on the quality of the first-stage retrieved content. + +Sequential data is currently the main use case of speculative RAG. Separating the generator and the retriever makes it possible to employ pre-trained models directly as components. Within this framework, a greater variety of tactics for making efficient use of the retrieved content can be investigated. + +\subsection{RAG Enhancements} +\subsubsection{Input Enhancement} +The initial input fed into the retriever has a significant impact on the outcome of the retrieval stage. This section presents query transformation and data augmentation as two input enhancement techniques. + +\textit{Query Transformation:} By altering the input query, query transformation can improve the retrieval outcome. + +Query2doc \cite{Query2doc} and HyDE \cite{HyDE} use the original query to create a pseudo-document, which is then used as the retrieval query. The pseudo-document contains richer, pertinent information, which aids in retrieving more precise results. + +Using the retrieved contents, TOC \cite{TOC} breaks down an ambiguous query into several distinct sub-queries, which are then sent to the generator and combined to yield the final output. + +RQ-RAG \cite{RQ-RAG} decomposes complex or ambiguous queries into distinct sub-queries for fine-grained retrieval and combines the answers to provide a coherent response to the original query. Tayal et al. \cite{tayal2024dynamic} improved the generator's understanding of user intent by refining the original query using retrieved context and dynamic few-shot samples. + +\textit{Data Augmentation:} Data augmentation improves the data prior to retrieval, using methods such as removing ambiguity, updating outdated documents, synthesizing new data, and removing extraneous information. + +Make-An-Audio \cite{DBLP:conf/icml/HuangHY0LLYLYZ23} adds random concept audio to enhance the original audio and employs captioning and audio-text retrieval to create captions for language-free audio in order to reduce data sparsity. In order to improve model performance in response to instructional prompts, LESS \cite{LESS} analyzes gradient information to optimize dataset selection for downstream tasks. To pre-train the code retrieval model, ReACC \cite{DBLP:conf/acl/LuDHGHS22} uses data augmentation techniques including renaming and dead code insertion. By using a ``Vocabulary for 3GPP Specifications'' and matching its terms to user queries with a router module, TelcoRAG \cite{Telco-RAG} improves retrieval accuracy. + +\subsubsection{Retriever Enhancement} +The information fed into the generators of RAG systems is determined by the quality of the retrieved content. Lower content quality increases the likelihood of model hallucinations or other degradation. We present useful strategies to improve retrieval efficacy in this section. + +\textit{Recursive Retrieval:} This method involves conducting several rounds of search to obtain more comprehensive and higher-quality content. + +ReAct \cite{ReAct} provides deeper information by decomposing questions for recursive retrieval using Chain-of-Thought (CoT) \cite{COT}. RATP \cite{RATP} chooses the best retrieved material using Monte-Carlo Tree Search simulations; the content is then templated and sent to the generator for output. 
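+
+A rough sketch of such a recursive loop, assuming three illustrative stubs (\texttt{decompose}, \texttt{retrieve}, and \texttt{answer\_step}) rather than the exact procedures of ReAct or RATP, looks as follows:
+\begin{verbatim}
+# Sketch of recursive retrieval via question decomposition;
+# `decompose`, `retrieve`, and `answer_step` are illustrative
+# stubs, not the APIs of ReAct or RATP.
+def recursive_retrieve(question, decompose, retrieve, answer_step, k=3):
+    evidence, answers = [], []
+    for sub_q in decompose(question):     # CoT-style sub-questions
+        docs = retrieve(sub_q, top_k=k)   # one retrieval per step
+        evidence.extend(docs)
+        answers.append(answer_step(sub_q, docs, answers))
+    return answers, evidence
+\end{verbatim}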
+ +\textit{Chunk Optimization:} Chunk optimization refers to adjusting chunk size for improved retrieval results. + +One of the chunk optimization techniques used by LlamaIndex \cite{LlamaIndex} is based on the ``small to big'' principle: finding finer-grained content while returning richer information. Sentence-window retrieval, for example, retrieves brief text passages and provides a window of pertinent sentences that surround the retrieved section. For auto-merge retrieval, documents are organized in a tree structure; by first retrieving a child node, the method obtains its parent node, which contains the content of all its child nodes. RAPTOR \cite{RAPTOR} uses recursive embedding, clustering, and summarization of text chunks until further clustering is impractical, creating a multi-level tree structure that addresses the lack of contextual information. By creating a table of contents beforehand, PromptRAG \cite{Prompt-RAG} improves retrieval accuracy by allowing the model to choose pertinent chapters on its own based on the query. To increase recall and produce better outcomes, Raina et al. \cite{raina2024question} divide text fragments into smaller, more atomic assertions. + +\textit{Retriever Finetuning:} The retriever, the core component of the RAG system, depends on an effective embedding model \cite{bge_embedding,bge_m3,cocktail,llm_embedder} to represent relevant content and feed it to the generator, improving system performance. + +Furthermore, embedding models with high expressive power can be fine-tuned on domain-specific or task-related data in order to improve performance in certain domains. REPLUG \cite{REPLUG} treats the LM as a black box and updates the retriever model based on the final results. APICoder \cite{DBLP:conf/emnlp/ZanCLGWL22} refines the retriever using Python files, API names, signatures, and descriptions. + +EDITSUM \cite{DBLP:conf/kbse/LiL000J21} optimizes the retriever to reduce the Jaccard distance between summaries after retrieval. SYNCHROMESH \cite{DBLP:conf/iclr/PoesiaP00SMG22} uses Target Similarity Tuning (TST) to fine-tune the retriever after adding the tree distance of ASTs to the loss. R-ConvED \cite{DBLP:journals/tomccap/ChenPLYCM23} optimizes the retriever using the same data as the generator. Kulkarni et al. \cite{RL4RAG} used the InfoNCE loss to optimize the retriever. + +\textit{Hybrid Retrieval:} Hybrid retrieval refers to the simultaneous use of a wide range of retrieval techniques or the extraction of data from several different sources. + +To increase the quality of retrieval, RAP-Gen \cite{DBLP:conf/sigsoft/Wang0JH23}, BlendedRAG \cite{Blended-RAG}, and ReACC \cite{DBLP:conf/acl/LuDHGHS22} employ both dense and sparse retrievers. Rencos \cite{DBLP:conf/icse/ZhangW00020} retrieves similar code snippets on a syntactic level using a sparse retriever and on a semantic level using a dense retriever. BASHEXPLAINER \cite{DBLP:conf/icsm/YuYCLZ22} first gathers semantic information using a dense retriever and then gathers lexical information using a sparse retriever. RetDream \cite{DBLP:journals/corr/abs-2402-02972} retrieves using text first, followed by image embeddings. 
A retrieval evaluator in CRAG \cite{CRAG} determines the relevance of documents to queries and triggers one of three retrieval responses based on confidence: direct use of the results for Knowledge Refinement if they are judged correct, Web Search if they are incorrect, and a hybrid approach for ambiguous cases. By adding DKS (Dense Knowledge Similarity) and RAC (Retriever as Answer Classifier) to the retrieval phase and assessing answer relevance and knowledge applicability, Huang et al. \cite{RAGAE} enhanced question answering. A new type of token known as the ``acting token,'' which establishes the source from which to obtain information, is introduced by UniMS-RAG \cite{UniMS-RAG}. By combining text and sketches for fine-grained retrieval, Koley et al. \cite{koley2024you} improve image retrieval and produce better outcomes. + +\textit{Reranking:} Reranking refers to rearranging the retrieved content in order to increase diversity and improve outcomes. In order to lessen the impact of the information loss caused by compressing text into vectors, Re2G \cite{Re2G} applies a re-ranker \cite{ReRanker} model after the conventional retriever. In order to eliminate redundant programs and produce a diversified set of retrieved programs, AceCoder \cite{li2023acecoder} reranks the programs using a selector. Following retrieval, XRICL \cite{DBLP:conf/emnlp/0010Z0L22} employs a distillation-based exemplar reranker. Rangan et al. \cite{rangan2024fine} evaluate the similarity of data subsets and rerank retrieval results using the Quantized Influence Measure, which measures statistical biases between a query and a reference. In order to create a cohesive retriever, UDAPDR \cite{UDAPDR} uses multi-teacher knowledge distillation in conjunction with LLMs to economically produce synthetic queries that train domain-specific rerankers. By using a static LLM for document rating and reward model training in addition to knowledge distillation, LLM-R \cite{LLM-R} iteratively improves its retriever; the retriever's incremental improvement with each training cycle enables progressive optimization. Finardi et al. \cite{finardi2024chronicles} used monoT5 as a reranker to maximize the quality of the results and incorporated reciprocal rank into the retrieval process for improved text chunk relevancy. Li et al. \cite{li2024enhancing} improve the retrieval quality and factual accuracy of LLMs by incorporating a reranking module into their end-to-end RAG system. + +\textit{Retrieval Transformation:} Retrieval transformation is the process of rewording retrieved content in order to better engage the generator's potential and produce better output. + +In order to simplify the generator's task and enable precise answer prediction, FILCO \cite{FILCO} effectively removes unnecessary content from retrieved text, keeping only the relevant supporting material. In order to significantly reduce latency, FiD-Light \cite{FiD-Light} first uses an encoder to transform the retrieved content into a vector, which it subsequently compresses. Using a template, RRR \cite{RRR} combines the current query with the top-k documents in each round before restructuring it using pre-trained LLMs (e.g., GPT-3.5-Turbo). + +\textit{Others:} There are more optimization techniques for the retrieval process in addition to the ones mentioned above. 
+ +For instance, metadata filtering \cite{Pinecone} is a technique that aids in processing retrieved documents by filtering them with metadata (such as time, purpose, etc.) for better outcomes. By asking an LLM to produce documents in response to a specific query, GENREAD \cite{GENREAD} and GRG \cite{GRG} present a novel method that replaces or enhances the retrieval process. In order to improve retrieval accuracy, Multi-Head-RAG \cite{Multi-Head-RAG} uses a multi-head attention layer to capture distinct informational features and multiple embedding models to project the same text chunk into different vector spaces. + +\subsection{Generator Enhancement} +The quality of the output of a RAG system is frequently dictated by the quality of the generator. As a result, the generator's capability determines the upper bound on the effectiveness of the entire RAG system. + +\textit{Prompt Engineering:} LLM generators in RAG systems can benefit from prompt engineering techniques \cite{Prompt_Engineering_Guide} that concentrate on enhancing the output quality of LLMs, such as prompt compression, Step-Back Prompting \cite{StepBack-Prompting}, Active Prompting \cite{active-prompt}, and Chain-of-Thought prompting \cite{COT}. + +In order to speed up model inference, LLMLingua \cite{LLMLingua} uses a small model to compress the overall length of the query; this lessens the detrimental effect of extraneous information on the model and mitigates the ``Lost in the Middle'' \cite{Lost_in_the_middle} issue. Using ChatGPT, ReMoDiffuse \cite{DBLP:conf/iccv/ZhangGPCHLYL23} breaks down intricate descriptions into anatomical text scripts. To improve outcomes, ASAP \cite{ahmed2024automatic} adds exemplar tuples (input code, function definitions, analysis findings, and related comments) to prompts. CEDAR \cite{DBLP:conf/icse/NashidSM23} arranges code demonstrations, the question, and natural language instructions into a prompt using a pre-made prompt template. XRICL \cite{DBLP:conf/emnlp/0010Z0L22} adds translation pairs, using CoT as an intermediate stage in cross-lingual semantic parsing and inference. ACTIVERAG \cite{ActiveRAG} uses the Cognition Nexus method to calibrate LLMs' internal cognition, and a CoT prompt is applied when generating answers. Make-An-Audio \cite{DBLP:conf/icml/HuangHY0LLYLYZ23} can use other modalities as input, which can yield far more detailed information for the subsequent process. + +\textit{Generator Finetuning:} Among other changes, decoding tuning entails improving generator control by adjusting hyperparameters for greater diversity and limiting the output vocabulary. + +InferFix \cite{DBLP:conf/sigsoft/JinSTSLSS23} adjusts the decoder's temperature to balance the diversity and quality of outputs. SYNCHROMESH \cite{DBLP:conf/iclr/PoesiaP00SMG22} uses a completion engine to remove implementation flaws and restricts the decoder's output vocabulary. Finetuning the generator can improve the model's ability to fit the retriever more closely or to acquire more precise domain knowledge. + +RETRO \cite{RETRO} fixes the retriever's parameters and uses the chunked cross-attention mechanism in the generator to combine the content of the query and the retrieved chunks. The generator CODEGEN-MONO 350M \cite{CODEGEN-MONO} is improved by APICoder \cite{DBLP:conf/icse/NashidSM23} using a shuffled new file along with code blocks and API metadata. 
CARE \cite{DBLP:journals/tip/YangCZ23} first trains encoders on image-, audio-, and video-text pairs, then keeps the encoders and retriever fixed while optimizing the decoder (generator) to jointly reduce the caption and concept detection losses. After using image data to optimize the video generator, Animate-A-Story \cite{DBLP:journals/corr/abs-2307-06940} fine-tunes a LoRA \cite{LoRA} adapter to capture the specifics of the character's appearance. RetDream \cite{DBLP:journals/corr/abs-2402-02972} uses the produced images to refine a LoRA adapter \cite{LoRA}. + +\subsection{Result Enhancement} +In many situations, RAG results might not have the desired impact; nevertheless, there are methods for improving results that can help mitigate this issue. + +\textit{Output Rewrite:} Output rewrite refers to rewriting the material produced by the generator in certain situations so that it satisfies the requirements of downstream tasks. In order to better match the real-world code context, SARGAM \cite{DBLP:journals/corr/abs-2306-06490} uses a dedicated Transformer in conjunction with Deletion, Placeholder, and Insertion Classifiers to enhance outputs in code-related tasks. By reranking candidates according to the average of the log probabilities produced by the generator for each token, Ring \cite{DBLP:conf/aaai/JoshiSG0VR23} obtains diverse results. By aligning the generated relations with those present in the knowledge graph's immediate neighborhood of the query entity, CBR-KBQA \cite{DBLP:conf/emnlp/DasZTGPLTPM21} revises the result. + +\subsection{RAG Pipeline Enhancement} +RAG pipeline enhancement is the process of streamlining the entire RAG process to improve performance outcomes. + +\textit{Adaptive Retrieval:} Some RAG research suggests that retrieval does not always improve the outcome. When the model's intrinsic parameterized knowledge is sufficient to address the question at hand, over-retrieval may result in wasted resources and potential confusion. This subsection therefore covers rule-based and model-based techniques for assessing whether retrieval is needed. + +\textit{Rule-based:} Using probability, FLARE \cite{FLARE} actively determines when and whether to search during the generation process. Efficient-KNNLM \cite{Efficient-KNNLM} combines the generation probabilities of kNN-LM \cite{KNN-LM} and NPM \cite{NPM} with a hyperparameter $\lambda$ that controls the proportion of generation and retrieval. + +Mallen et al. \cite{Adaptive-Retrieval-whennottrust} used statistical analysis to answer high-frequency questions directly while applying RAG to low-frequency questions. Jiang et al. \cite{lm-calibration} assessed model confidence using fit statistics, model uncertainty, and fit uncertainty to inform retrieval decisions. Kandpal et al. \cite{LLM_Struggle_to_Learn_Long-Tail_Knowledge} investigated the relationship between the amount of relevant text and the model's grasp of the corresponding knowledge, which helps determine whether retrieval is appropriate. + +\textit{Model-based:} Self-RAG \cite{Self-RAG} uses a trained generator to decide, based on a special retrieve token, whether to perform retrieval for a given user query. Ren et al. \cite{LLM-Knowledge-Boundary} employed ``Judgement Prompting'' to assess LLMs' ability to answer pertinent queries and the accuracy of their responses, which helps determine whether retrieval is required. 
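+
+As a hypothetical sketch of such adaptive behavior (a simple confidence threshold rather than the exact criteria used by the systems discussed here), retrieval can be skipped whenever the generator is sufficiently confident on its own:
+\begin{verbatim}
+# Hypothetical adaptive retrieval: retrieve only when the model's
+# own confidence falls below a threshold. `confidence`, `retrieve`,
+# and `generate` are placeholders, not any cited system's API.
+def adaptive_answer(query, confidence, retrieve, generate, tau=0.75):
+    if confidence(query) >= tau:
+        return generate(query)            # rely on parametric knowledge
+    docs = retrieve(query, top_k=5)       # otherwise fall back to RAG
+    context = "\n".join(docs)
+    return generate(f"{context}\n\n{query}")
+\end{verbatim}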
+ +SKR \cite{SKR} makes use of LLMs' inherent ability to determine beforehand whether they can answer a query; if they can, no retrieval is performed. In order to ascertain whether information retrieval is necessary, Rowen \cite{Rowen} translates a query into several languages and checks whether the responses are consistent across these languages. AdaptiveRAG \cite{AdaptiveRAG} uses a classifier, which is a smaller LM, to dynamically determine whether to retrieve based on the query difficulty. + +\textit{Iterative RAG:} Instead of a single round, iterative RAG cycles through the retrieval and generation phases repeatedly to gradually improve results. + +In order to effectively utilize scattered information and enhance results, RepoCoder \cite{DBLP:conf/emnlp/ZhangCZKLZMLC23} refines queries using previously generated code through an iterative retrieval-generation approach to code completion. By employing the generator's output to identify knowledge gaps, retrieve pertinent data, and inform subsequent generation cycles, ITER-RETGEN \cite{ITER-RETGEN} iteratively improves the quality of the content. Using an iterative retrieval-augmented generator, SelfMemory \cite{SelfMemory} creates a large memory pool from which a memory selector chooses an output to feed the subsequent generation cycle. RAT \cite{RAT} uses a zero-shot CoT prompt to first generate material with an LLM, then retrieves information from an external knowledge store to update each thinking step. + +\section{Discussion} +Despite the widespread adoption of RAG, it suffers from several limitations by design. + +\subsection{Noises in Retrieval Results} +Information loss in item representations and ANN search makes information retrieval fundamentally imperfect. RAG systems may experience failure points due to the unavoidable noise, which may appear as irrelevant content or false information \cite{DBLP:journals/corr/abs-2401-05856}. Nevertheless, although increasing retrieval accuracy seems an obvious requirement for RAG efficacy, recent research surprisingly finds that noisy retrieval results may improve generation quality \cite{DBLP:journals/corr/abs-2401-14887}. One explanation is that diverse retrieval outcomes may facilitate prompt construction \cite{qiu2022evaluating}. As a result, it is unclear how retrieval noise affects real applications, which causes confusion regarding metric selection and retriever-generator interaction. + +\subsection{Extra Overhead} +In most situations, retrieval has non-negligible overhead, even if it can occasionally lower generation costs \cite{Atlas,MemTransformer2022,REST}. Stated differently, latency is necessarily increased by the retrieval and interaction operations. This is amplified when RAG is used in conjunction with sophisticated enhancement techniques such as iterative RAG \cite{DBLP:conf/emnlp/ZhangCZKLZMLC23} and recursive retrieval \cite{Query_Expansion_by_Prompting_LLMs}. Moreover, the complexity of access and storage will rise in tandem with the size of retrieval sources \cite{EA}. This overhead severely hampers the usefulness of RAG for latency-sensitive real-time systems. + +\subsection{The Gap between Generators and Retrievers} +The interplay between retrievers and generators necessitates careful design and optimization because their latent spaces and goals may not coincide. Present methods either separate generation and retrieval or combine them at an intermediate stage. The latter could benefit from joint training but may hinder generality, whereas the former is more modular. 
Choosing a cost-effective interaction strategy to close the gap is difficult and requires careful consideration in real-world situations. + +\subsection{Increased System Complexity} +The complexity of the system and the number of hyper-parameters to adjust inevitably rise with the addition of retrieval. In query-based RAG, for example, a recent study discovered that employing top-k retrieval rather than a single retrieved document enhances attribution but degrades fluency \cite{DBLP:journals/corr/abs-2302-05578}. Other factors, such as metric selection, are not yet fully investigated. Therefore, when RAG is involved, tuning the generation service calls for greater skill. + +\subsection{Lengthy Context} +One of RAG's main drawbacks, especially for query-based RAG, is that it greatly lengthens the context, rendering it unworkable for generators with constrained context lengths. Furthermore, the extended context often slows down the generation process. These issues have been somewhat alleviated by research developments in long-context support \cite{DBLP:journals/corr/abs-2308-16137} and prompt compression \cite{LLMLingua}, but with a minor trade-off in cost or accuracy. + +\section{Conclusion} +This case study presented an extensive and in-depth analysis of RAG in the framework of AIGC, paying special attention to the foundations, improvements, and applications of retrieval augmentation. We started by methodically classifying and summarizing the fundamental RAG concepts, offering insights into how retrievers and generators interact. Next, we looked at the improvements made to RAG that further increase its efficacy, whether applied to the pipeline as a whole or to individual components. We demonstrated real-world RAG implementations across a variety of tasks and modalities to aid researchers from a wide range of fields. + +\bibliography{refs}{} + +\bibliographystyle{IEEEtran}{} + + +\end{document}