| author | Aditya <bluenerd@protonmail.com> | 2025-02-10 22:30:39 +0530 |
|---|---|---|
| committer | Aditya <bluenerd@protonmail.com> | 2025-02-10 22:30:39 +0530 |
| commit | f60b215860b040b039222f8a23e58c79111976d3 | |
| tree | 2edff5fda2834e35ab24128936b90a0878c1e6dd | |
| parent | 93ee9c739c9dbe6ce281f544d428df807d476964 | |
| -rw-r--r-- | README.md | 141 |

1 file changed, 141 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..abaa239
--- /dev/null
+++ b/README.md
@@ -0,0 +1,141 @@

# **Document Ingestion and Semantic Query System Using Retrieval-Augmented Generation (RAG)**

## **Overview**

This application implements a **Retrieval-Augmented Generation (RAG)-based Question Answering System** using Streamlit for the user interface, ChromaDB for vector storage, and Ollama for generating responses. The system allows users to upload **PDF documents**, process them into **text chunks**, store them as **vector embeddings**, and retrieve relevant information to generate AI-powered responses.

---

## **System Components**

Hedged Python sketches of the components described below are collected under **Illustrative Sketches** at the end of this section.

### **1. File Processing and Text Chunking**

**Function:** `process_document(uploaded_file: UploadedFile) -> list[Document]`

- Takes a user-uploaded **PDF file** and processes it into **smaller text chunks**.
- Uses **PyMuPDFLoader** to extract text from PDFs.
- Splits the extracted text into **overlapping segments** using **RecursiveCharacterTextSplitter**.
- Returns a list of **Document objects** containing text chunks and metadata.

**Key Steps:**
1. Save the uploaded file to a **temporary file**.
2. Load the content using **PyMuPDFLoader**.
3. Split the text using **RecursiveCharacterTextSplitter**.
4. Delete the temporary file.
5. Return the **list of Document objects**.

---

### **2. Vector Storage and Retrieval (ChromaDB)**

#### **Creating a ChromaDB Collection**

**Function:** `get_vector_collection() -> chromadb.Collection`

- Initializes **ChromaDB** with a **persistent vector store**.
- Uses **OllamaEmbeddingFunction** to generate vector embeddings.
- Retrieves or creates a collection for storing **document embeddings**.
- Uses **cosine similarity** for querying documents.

**Key Steps:**
1. Define the **OllamaEmbeddingFunction** for embedding generation.
2. Initialize a **ChromaDB PersistentClient**.
3. Retrieve or create a **ChromaDB collection** for storing vectors.
4. Return the **collection object**.

#### **Adding Documents to the Vector Store**

**Function:** `add_to_vector_collection(all_splits: list[Document], file_name: str)`

- Takes a list of document chunks and stores them in **ChromaDB**.
- Each document is stored with a **unique ID** derived from the file name.
- A success message is displayed via **Streamlit**.

**Key Steps:**
1. Retrieve the ChromaDB collection using `get_vector_collection()`.
2. Convert the document chunks into lists of **text, metadata, and unique IDs**.
3. Use `upsert()` to store the document embeddings.
4. Display a success message.

#### **Querying the Vector Collection**

**Function:** `query_collection(prompt: str, n_results: int = 10) -> dict`

- Queries **ChromaDB** with a user-provided search query.
- Returns the **top n most relevant documents** ranked by similarity.

**Key Steps:**
1. Retrieve the ChromaDB collection.
2. Perform the query using `collection.query()`.
3. Return the **retrieved documents and metadata**.

---

### **3. Language Model Interaction (Ollama API)**

#### **Generating Responses Using the AI Model**

**Function:** `call_llm(context: str, prompt: str)`

- Calls **Ollama**'s language model to generate a **context-aware response**.
- Uses a **system prompt** to guide the model's behavior.
- Streams the AI-generated response in **chunks**.

**Key Steps:**
1. Send the **system prompt** and user query to **Ollama**.
2. Retrieve and yield the streamed response chunks.
3. Display the results in **Streamlit**.

---

### **4. Cross-Encoder Based Re-Ranking**

**Function:** `re_rank_cross_encoders(documents: list[str]) -> tuple[str, list[int]]`

- Uses a **CrossEncoder (MS MARCO MiniLM model)** to **re-rank the retrieved documents**.
- Selects the **top 3 most relevant documents**.
- Returns the **concatenated relevant text** and the **document indices**.

**Key Steps:**
1. Load the **MS MARCO MiniLM CrossEncoder model**.
2. Rank the documents using **cross-encoder re-ranking**.
3. Extract the **top-ranked documents**.
4. Return the **concatenated text** and **indices**.
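---

### **Illustrative Sketches**

The sketches below show one plausible Python shape for each component described above. They are minimal illustrations rather than the repository's exact code: chunk sizes, model names, paths, and collection names are assumptions.

First, the chunking step from **File Processing and Text Chunking**, assuming LangChain's `PyMuPDFLoader` and `RecursiveCharacterTextSplitter`; the chunk size and overlap are illustrative values.

```python
# Sketch of process_document(); chunk_size and chunk_overlap are assumed values.
import os
import tempfile

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter


def process_document(uploaded_file) -> list[Document]:
    # Persist the Streamlit upload to a temporary file so PyMuPDFLoader can read it.
    with tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False) as tmp:
        tmp.write(uploaded_file.read())
        tmp_path = tmp.name

    try:
        docs = PyMuPDFLoader(tmp_path).load()  # one Document per PDF page
    finally:
        os.unlink(tmp_path)  # always delete the temporary file

    # Split into overlapping chunks so context survives chunk boundaries.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return splitter.split_documents(docs)
```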
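Next, the ChromaDB helpers from **Vector Storage and Retrieval**. The local Ollama embedding endpoint, the `nomic-embed-text` model, the persistence path, and the collection name are all assumptions.

```python
# Sketch of the vector-store helpers; URL, model, path, and names are assumed.
import chromadb
import streamlit as st
from chromadb.utils.embedding_functions.ollama_embedding_function import (
    OllamaEmbeddingFunction,
)
from langchain_core.documents import Document


def get_vector_collection() -> chromadb.Collection:
    # Embeddings are generated by a locally running Ollama server.
    embedding_fn = OllamaEmbeddingFunction(
        url="http://localhost:11434/api/embeddings",
        model_name="nomic-embed-text",
    )
    client = chromadb.PersistentClient(path="./demo-rag-chroma")
    return client.get_or_create_collection(
        name="rag_app",
        embedding_function=embedding_fn,
        metadata={"hnsw:space": "cosine"},  # cosine similarity for queries
    )


def add_to_vector_collection(all_splits: list[Document], file_name: str) -> None:
    collection = get_vector_collection()
    documents, metadatas, ids = [], [], []
    for idx, split in enumerate(all_splits):
        documents.append(split.page_content)
        metadatas.append(split.metadata)
        ids.append(f"{file_name}_{idx}")  # unique ID derived from the file name
    # upsert() inserts new chunks and overwrites any with matching IDs.
    collection.upsert(documents=documents, metadatas=metadatas, ids=ids)
    st.success("Data added to the vector store!")


def query_collection(prompt: str, n_results: int = 10) -> dict:
    collection = get_vector_collection()
    return collection.query(query_texts=[prompt], n_results=n_results)
```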
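The streaming call from **Language Model Interaction**, assuming the `ollama` Python client and a locally pulled model; the model name and the system prompt wording are assumptions.

```python
# Sketch of call_llm(); the model name and system prompt are assumed.
import ollama

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the user's question using only the "
    "provided context. If the context does not contain the answer, say so."
)


def call_llm(context: str, prompt: str):
    # stream=True yields the response incrementally instead of as one blob.
    response = ollama.chat(
        model="llama3.2:3b",
        stream=True,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {prompt}",
            },
        ],
    )
    for chunk in response:
        if chunk["done"] is False:
            yield chunk["message"]["content"]
```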
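Finally, the re-ranking step from **Cross-Encoder Based Re-Ranking**. For self-containment this sketch takes the user prompt as an explicit parameter, whereas the documented signature accepts only the documents; `top_k=3` matches the top-3 selection described above.

```python
# Sketch of the cross-encoder re-ranking; the prompt parameter is added here
# for self-containment and is not part of the documented signature.
from sentence_transformers import CrossEncoder


def re_rank_cross_encoders(prompt: str, documents: list[str]) -> tuple[str, list[int]]:
    # The MS MARCO MiniLM cross-encoder scores (query, document) pairs directly.
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder.rank(prompt, documents, top_k=3)

    relevant_text = ""
    relevant_ids: list[int] = []
    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]]
        relevant_ids.append(rank["corpus_id"])
    return relevant_text, relevant_ids
```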
---

## **User Interface (Streamlit)**

### **1. Document Uploading and Processing**

- The sidebar allows **PDF file upload**.
- The user clicks **Process** to extract text and store embeddings.
- The file name is **normalized** before processing.
- The extracted **text chunks** are stored in **ChromaDB**.

### **2. Question Answering System**

- The main interface displays a **text area** where users enter questions.
- Clicking **Ask** triggers the retrieval and response-generation process (see the appendix for a wiring sketch):
  1. **Query ChromaDB** to retrieve relevant documents.
  2. **Re-rank the documents** using the **cross-encoder**.
  3. **Pass the relevant text** and the **question** to the **LLM**.
  4. Stream and display the AI-generated response.
  5. Provide options to view the **retrieved documents and rankings**.

---

## **Technologies Used**

- **Streamlit** → UI framework for the interactive user interface.
- **PyMuPDF** → PDF text extraction.
- **ChromaDB** → Vector database for semantic search.
- **Ollama** → LLM API for generating responses.
- **LangChain** → Document processing utilities.
- **Sentence Transformers (CrossEncoder)** → Document re-ranking.

---

## **Error Handling & Edge Cases**

- **File I/O errors**: Temporary-file read/write issues are handled cleanly.
- **ChromaDB errors**: Database consistency is maintained and query failures are managed.
- **Ollama API failures**: API unavailability and timeouts are detected and handled.
- **Empty documents**: Empty files are rejected before processing.
- **Invalid queries**: Feedback is provided for low-relevance queries.

---

## **Conclusion**

This application provides a **RAG-based interactive Q&A system** that combines **retrieval, re-ranking, and generation** to deliver **relevant AI-generated responses**. The architecture supports efficient document processing, vector storage, and intelligent answer generation using state-of-the-art models and embeddings.
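---

## **Appendix: Wiring the Ask Flow**

A minimal sketch of how the components might be wired together in the Streamlit interface, assuming the functions from the earlier sketches are in scope; widget labels are illustrative.

```python
# Sketch of the "Ask" flow; assumes query_collection, re_rank_cross_encoders,
# and call_llm from the sketches above are importable in this module.
import streamlit as st

st.header("RAG Question Answering")
prompt = st.text_area("Ask a question about your uploaded documents:")

if st.button("Ask") and prompt:
    results = query_collection(prompt)            # 1. retrieve candidates
    candidates = results["documents"][0]          # raw chunk texts
    relevant_text, relevant_ids = re_rank_cross_encoders(prompt, candidates)  # 2. re-rank
    response = call_llm(context=relevant_text, prompt=prompt)  # 3. generate
    st.write_stream(response)                     # 4. stream the answer

    with st.expander("See retrieved documents"):  # 5. inspect retrieval
        st.write(results)
    with st.expander("See most relevant document IDs"):
        st.write(relevant_ids)
```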