| author | Aditya <bluenerd@protonmail.com> | 2025-02-10 22:30:39 +0530 |
|---|---|---|
| committer | Aditya <bluenerd@protonmail.com> | 2025-02-10 22:30:39 +0530 |
| commit | f60b215860b040b039222f8a23e58c79111976d3 | |
| tree | 2edff5fda2834e35ab24128936b90a0878c1e6dd | |
| parent | 93ee9c739c9dbe6ce281f544d428df807d476964 | |
| -rw-r--r-- | README.md | 141 |

1 file changed, 141 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..abaa239
--- /dev/null
+++ b/README.md
@@ -0,0 +1,141 @@

# **Document Ingestion and Semantic Query System Using Retrieval-Augmented Generation (RAG)**

## **Overview**

This application implements a **Retrieval-Augmented Generation (RAG)-based Question Answering System** using Streamlit for the user interface, ChromaDB for vector storage, and Ollama for generating responses. The system allows users to upload **PDF documents**, process them into **text chunks**, store them as **vector embeddings**, and retrieve relevant information to generate AI-powered responses.

---

## **System Components**

Hedged Python sketches of the components described below are collected under **Illustrative Sketches** at the end of this section.

### **1. File Processing and Text Chunking**

**Function:** `process_document(uploaded_file: UploadedFile) -> list[Document]`

- Takes a user-uploaded **PDF file** and processes it into **smaller text chunks**.
- Uses **PyMuPDFLoader** to extract text from PDFs.
- Splits the extracted text into **overlapping segments** using **RecursiveCharacterTextSplitter**.
- Returns a list of **Document objects** containing text chunks and metadata.

**Key Steps:**
1. Save the uploaded file to a **temporary file**.
2. Load the content using **PyMuPDFLoader**.
3. Split the text using **RecursiveCharacterTextSplitter**.
4. Delete the temporary file.
5. Return the **list of Document objects**.

---

### **2. Vector Storage and Retrieval (ChromaDB)**

#### **Creating a ChromaDB Collection**

**Function:** `get_vector_collection() -> chromadb.Collection`

- Initializes **ChromaDB** with a **persistent vector store**.
- Uses **OllamaEmbeddingFunction** to generate vector embeddings.
- Retrieves or creates a collection for storing **document embeddings**.
- Uses **cosine similarity** for querying documents.

**Key Steps:**
1. Define the **OllamaEmbeddingFunction** for embedding generation.
2. Initialize a **ChromaDB PersistentClient**.
3. Retrieve or create a **ChromaDB collection** for storing vectors.
4. Return the **collection object**.

#### **Adding Documents to the Vector Store**

**Function:** `add_to_vector_collection(all_splits: list[Document], file_name: str)`

- Takes a list of document chunks and stores them in **ChromaDB**.
- Each document is stored with a **unique ID** derived from the file name.
- A success message is displayed via **Streamlit**.

**Key Steps:**
1. Retrieve the ChromaDB collection using `get_vector_collection()`.
2. Convert the document chunks into lists of **text, metadata, and unique IDs**.
3. Use `upsert()` to store the document embeddings.
4. Display a success message.

#### **Querying the Vector Collection**

**Function:** `query_collection(prompt: str, n_results: int = 10) -> dict`

- Queries **ChromaDB** with a user-provided search query.
- Returns the **top n most relevant documents** ranked by similarity.

**Key Steps:**
1. Retrieve the ChromaDB collection.
2. Perform the query using `collection.query()`.
3. Return the **retrieved documents and metadata**.

---

### **3. Language Model Interaction (Ollama API)**

#### **Generating Responses Using the AI Model**

**Function:** `call_llm(context: str, prompt: str)`

- Calls **Ollama**'s language model to generate a **context-aware response**.
- Uses a **system prompt** to guide the model's behavior.
- Streams the AI-generated response in **chunks**.

**Key Steps:**
1. Send the **system prompt** and user query to **Ollama**.
2. Retrieve and yield the streamed response chunks.
3. Display the results in **Streamlit**.

---

### **4. Cross-Encoder Based Re-Ranking**

**Function:** `re_rank_cross_encoders(documents: list[str]) -> tuple[str, list[int]]`

- Uses a **CrossEncoder (MS MARCO MiniLM model)** to **re-rank the retrieved documents**.
- Selects the **top 3 most relevant documents**.
- Returns the **concatenated relevant text** and the **document indices**.

**Key Steps:**
1. Load the **MS MARCO MiniLM CrossEncoder model**.
2. Rank the documents using **cross-encoder re-ranking**.
3. Extract the **top-ranked documents**.
4. Return the **concatenated text** and **indices**.
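---

### **Illustrative Sketches**

The sketches below show one plausible Python shape for each component described above. They are minimal illustrations rather than the repository's exact code: chunk sizes, model names, paths, and collection names are assumptions.

First, the chunking step from **File Processing and Text Chunking**, assuming LangChain's `PyMuPDFLoader` and `RecursiveCharacterTextSplitter`; the chunk size and overlap are illustrative values.

```python
# Sketch of process_document(); chunk_size and chunk_overlap are assumed values.
import os
import tempfile

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


def process_document(uploaded_file) -> list[Document]:
    # Persist the Streamlit upload to a temporary file so PyMuPDFLoader can read it.
    with tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False) as tmp:
        tmp.write(uploaded_file.read())
        tmp_path = tmp.name

    try:
        docs = PyMuPDFLoader(tmp_path).load()  # one Document per PDF page
    finally:
        os.unlink(tmp_path)  # always delete the temporary file

    # Split into overlapping chunks so context survives chunk boundaries.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return splitter.split_documents(docs)
```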
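Next, the ChromaDB helpers from **Vector Storage and Retrieval**. The local Ollama embedding endpoint, the `nomic-embed-text` model, the persistence path, and the collection name are all assumptions.

```python
# Sketch of the vector-store helpers; URL, model, path, and names are assumed.
import chromadb
import streamlit as st
from chromadb.utils.embedding_functions.ollama_embedding_function import (
    OllamaEmbeddingFunction,
)
from langchain_core.documents import Document


def get_vector_collection() -> chromadb.Collection:
    # Embeddings are generated by a locally running Ollama server.
    embedding_fn = OllamaEmbeddingFunction(
        url="http://localhost:11434/api/embeddings",
        model_name="nomic-embed-text",
    )
    client = chromadb.PersistentClient(path="./demo-rag-chroma")
    return client.get_or_create_collection(
        name="rag_app",
        embedding_function=embedding_fn,
        metadata={"hnsw:space": "cosine"},  # cosine similarity for queries
    )


def add_to_vector_collection(all_splits: list[Document], file_name: str) -> None:
    collection = get_vector_collection()
    documents, metadatas, ids = [], [], []
    for idx, split in enumerate(all_splits):
        documents.append(split.page_content)
        metadatas.append(split.metadata)
        ids.append(f"{file_name}_{idx}")  # unique ID derived from the file name
    # upsert() inserts new chunks and overwrites any with matching IDs.
    collection.upsert(documents=documents, metadatas=metadatas, ids=ids)
    st.success("Data added to the vector store!")


def query_collection(prompt: str, n_results: int = 10) -> dict:
    collection = get_vector_collection()
    return collection.query(query_texts=[prompt], n_results=n_results)
```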
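The streaming call from **Language Model Interaction**, assuming the `ollama` Python client and a locally pulled model; the model name and the system prompt wording are assumptions.

```python
# Sketch of call_llm(); the model name and system prompt are assumed.
import ollama

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the user's question using only the "
    "provided context. If the context does not contain the answer, say so."
)


def call_llm(context: str, prompt: str):
    # stream=True yields the response incrementally instead of as one blob.
    response = ollama.chat(
        model="llama3.2:3b",
        stream=True,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {prompt}",
            },
        ],
    )
    for chunk in response:
        if chunk["done"] is False:
            yield chunk["message"]["content"]
```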
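Finally, the re-ranking step from **Cross-Encoder Based Re-Ranking**. For self-containment this sketch takes the user prompt as an explicit parameter, whereas the documented signature accepts only the documents; `top_k=3` matches the top-3 selection described above.

```python
# Sketch of the cross-encoder re-ranking; the prompt parameter is added here
# for self-containment and is not part of the documented signature.
from sentence_transformers import CrossEncoder


def re_rank_cross_encoders(prompt: str, documents: list[str]) -> tuple[str, list[int]]:
    # The MS MARCO MiniLM cross-encoder scores (query, document) pairs directly.
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder.rank(prompt, documents, top_k=3)

    relevant_text = ""
    relevant_ids: list[int] = []
    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]]
        relevant_ids.append(rank["corpus_id"])
    return relevant_text, relevant_ids
```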
---

## **User Interface (Streamlit)**

### **1. Document Uploading and Processing**

- The sidebar allows **PDF file upload**.
- The user clicks **Process** to extract text and store embeddings.
- The file name is **normalized** before processing.
- The extracted **text chunks** are stored in **ChromaDB**.

### **2. Question Answering System**

- The main interface displays a **text area** where users enter questions.
- Clicking **Ask** triggers the retrieval and response-generation process (see the appendix for a wiring sketch):
  1. **Query ChromaDB** to retrieve relevant documents.
  2. **Re-rank the documents** using the **cross-encoder**.
  3. **Pass the relevant text** and the **question** to the **LLM**.
  4. Stream and display the AI-generated response.
  5. Provide options to view the **retrieved documents and rankings**.

---

## **Technologies Used**

- **Streamlit** → UI framework for the interactive user interface.
- **PyMuPDF** → PDF text extraction.
- **ChromaDB** → Vector database for semantic search.
- **Ollama** → LLM API for generating responses.
- **LangChain** → Document processing utilities.
- **Sentence Transformers (CrossEncoder)** → Document re-ranking.

---

## **Error Handling & Edge Cases**

- **File I/O errors**: Temporary-file read/write issues are handled cleanly.
- **ChromaDB errors**: Database consistency is maintained and query failures are managed.
- **Ollama API failures**: API unavailability and timeouts are detected and handled.
- **Empty documents**: Empty files are rejected before processing.
- **Invalid queries**: Feedback is provided for low-relevance queries.

---

## **Conclusion**

This application provides a **RAG-based interactive Q&A system** that combines **retrieval, re-ranking, and generation** to deliver **relevant AI-generated responses**. The architecture supports efficient document processing, vector storage, and intelligent answer generation using state-of-the-art models and embeddings.
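---

## **Appendix: Wiring the Ask Flow**

A minimal sketch of how the components might be wired together in the Streamlit interface, assuming the functions from the earlier sketches are in scope; widget labels are illustrative.

```python
# Sketch of the "Ask" flow; assumes query_collection, re_rank_cross_encoders,
# and call_llm from the sketches above are importable in this module.
import streamlit as st

st.header("RAG Question Answering")
prompt = st.text_area("Ask a question about your uploaded documents:")

if st.button("Ask") and prompt:
    results = query_collection(prompt)            # 1. retrieve candidates
    candidates = results["documents"][0]          # raw chunk texts
    relevant_text, relevant_ids = re_rank_cross_encoders(prompt, candidates)  # 2. re-rank
    response = call_llm(context=relevant_text, prompt=prompt)  # 3. generate
    st.write_stream(response)                     # 4. stream the answer

    with st.expander("See retrieved documents"):  # 5. inspect retrieval
        st.write(results)
    with st.expander("See most relevant document IDs"):
        st.write(relevant_ids)
```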