# **Document Ingestion and Semantic Query System Using Retrieval-Augmented Generation (RAG)**
## **Overview**
This application implements a **Retrieval-Augmented Generation (RAG)-based Question Answering System** using Streamlit for the user interface, ChromaDB for vector storage, and Ollama for generating responses. The system allows users to upload **PDF documents**, process them into **text chunks**, store them as **vector embeddings**, and retrieve relevant information to generate AI-powered responses.
---
## **System Components**
### **1. File Processing and Text Chunking**
**Function:** `process_document(uploaded_file: UploadedFile) -> list[Document]`
- Takes a user-uploaded **PDF file** and processes it into **smaller text chunks**.
- Uses **PyMuPDFLoader** to extract text from PDFs.
- Splits extracted text into **overlapping segments** using **RecursiveCharacterTextSplitter**.
- Returns a list of **Document objects** containing text chunks and metadata.
**Key Steps:**
1. Save uploaded file to a **temporary file**.
2. Load content using **PyMuPDFLoader**.
3. Split text using **RecursiveCharacterTextSplitter**.
4. Delete the temporary file.
5. Return the **list of Document objects**.
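The overlapping-segment idea behind **RecursiveCharacterTextSplitter** can be sketched in plain Python (the chunk sizes are illustrative, and the real splitter additionally respects separators such as paragraph breaks):

```python
def split_with_overlap(text: str, chunk_size: int = 400, chunk_overlap: int = 100) -> list[str]:
    """Naive fixed-window splitter illustrating the chunk_size/chunk_overlap idea."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - chunk_overlap  # advance less than a full chunk so windows overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

chunks = split_with_overlap("0123456789" * 100, chunk_size=400, chunk_overlap=100)
# Consecutive chunks share their last/first 100 characters.
```

The overlap preserves context that would otherwise be cut at chunk boundaries, which improves retrieval quality for passages spanning two chunks.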
---
### **2. Vector Storage and Retrieval (ChromaDB)**
#### **Creating a ChromaDB Collection**
**Function:** `get_vector_collection() -> chromadb.Collection`
- Initializes **ChromaDB** with a **persistent vector store**.
- Uses **OllamaEmbeddingFunction** to generate vector embeddings.
- Retrieves or creates a collection for storing **document embeddings**.
- Uses **cosine similarity** for querying documents.
**Key Steps:**
1. Define **OllamaEmbeddingFunction** for embedding generation.
2. Initialize **ChromaDB PersistentClient**.
3. Retrieve or create a **ChromaDB collection** for storing vectors.
4. Return the **collection object**.
#### **Adding Documents to Vector Store**
**Function:** `add_to_vector_collection(all_splits: list[Document], file_name: str)`
- Takes a list of document chunks and stores them in **ChromaDB**.
- Each chunk is stored with a **unique ID** derived from the file name and chunk index.
- A success message is displayed via **Streamlit**.
**Key Steps:**
1. Retrieve ChromaDB collection using `get_vector_collection()`.
2. Convert document chunks into a list of **text embeddings, metadata, and unique IDs**.
3. Use `upsert()` to store document embeddings.
4. Display success message.
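Building the parallel lists that `upsert()` expects can be sketched without ChromaDB itself; IDs are made unique by combining the file name with the chunk index (function and field names are illustrative):

```python
def build_upsert_payload(chunks: list[str], file_name: str):
    """Return (documents, metadatas, ids) lists ready for collection.upsert()."""
    documents, metadatas, ids = [], [], []
    for idx, text in enumerate(chunks):
        documents.append(text)
        metadatas.append({"source": file_name})
        ids.append(f"{file_name}_{idx}")  # unique per file and chunk
    return documents, metadatas, ids

docs, metas, ids = build_upsert_payload(["chunk one", "chunk two"], "report.pdf")
# ids == ["report.pdf_0", "report.pdf_1"]
```

Because `upsert()` overwrites entries with matching IDs, re-processing the same file replaces its old chunks instead of duplicating them.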
#### **Querying the Vector Collection**
**Function:** `query_collection(prompt: str, n_results: int = 10) -> dict`
- Queries **ChromaDB** with a user-provided search query.
- Returns the **top `n` most relevant documents** based on cosine similarity.
**Key Steps:**
1. Retrieve ChromaDB collection.
2. Perform query using `collection.query()`.
3. Return **retrieved documents and metadata**.
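The ranking that `collection.query()` performs internally amounts to cosine similarity between the query embedding and the stored embeddings; a toy pure-Python version of that ranking:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vec: list[float], doc_vecs: list[list[float]], n: int = 2) -> list[int]:
    """Return indices of the n most similar document vectors, best first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:n]

# The document vector pointing the same way as the query ranks first.
top_n([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.9, 0.9]], n=2)
```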
---
### **3. Language Model Interaction (Ollama API)**
#### **Generating Responses using the AI Model**
**Function:** `call_llm(context: str, prompt: str)`
- Calls **Ollama**'s language model to generate a **context-aware response**.
- Uses a **system prompt** to guide the model’s behavior.
- Streams the AI-generated response in **chunks**.
**Key Steps:**
1. Send **system prompt** and user query to **Ollama**.
2. Retrieve and yield streamed responses.
3. Display results in **Streamlit**.
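Ollama's chat API streams responses as a sequence of chunk dictionaries; the yield loop can be sketched against a stubbed stream (the chunk shape mirrors `ollama.chat(..., stream=True)`, which yields pieces of the form `{"message": {"content": ...}}`):

```python
def stream_response(chunks):
    """Yield the text of each streamed chunk, mirroring call_llm's loop.

    `chunks` stands in for the iterator returned by ollama.chat(stream=True).
    """
    for chunk in chunks:
        content = chunk.get("message", {}).get("content", "")
        if content:
            yield content

fake_stream = [{"message": {"content": "Hello"}}, {"message": {"content": " world"}}]
"".join(stream_response(fake_stream))  # → "Hello world"
```

In the app, the generator is handed to Streamlit so the answer renders incrementally rather than after the full response arrives.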
---
### **4. Cross-Encoder Based Re-Ranking**
**Function:** `re_rank_cross_encoders(documents: list[str]) -> tuple[str, list[int]]`
- Uses **CrossEncoder (MS MARCO MiniLM model)** to **re-rank retrieved documents**.
- Selects the **top 3 most relevant documents**.
- Returns **concatenated relevant text** and **document indices**.
**Key Steps:**
1. Load **MS MARCO MiniLM CrossEncoder model**.
2. Rank documents using **cross-encoder re-ranking**.
3. Extract the **top-ranked documents**.
4. Return **concatenated text** and **indices**.
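Given per-document relevance scores from the cross-encoder, the top-3 selection and concatenation step reduces to the following (the scores below are stand-ins for `CrossEncoder.predict` output):

```python
def select_top_k(documents: list[str], scores: list[float], k: int = 3) -> tuple[str, list[int]]:
    """Pick the k highest-scoring documents; return their joined text and indices."""
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    relevant_text = "\n\n".join(documents[i] for i in top)
    return relevant_text, top

text, idxs = select_top_k(["a", "b", "c", "d"], [0.1, 0.9, 0.4, 0.7], k=3)
# idxs == [1, 3, 2]
```

Cross-encoder re-ranking scores each (query, document) pair jointly, so it is more accurate than the bi-encoder retrieval step, at the cost of running the model once per candidate.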
---
## **User Interface (Streamlit)**
### **1. Document Uploading and Processing**
- Sidebar allows **PDF file upload**.
- User clicks **Process** to extract text and store embeddings.
- File name is **normalized** before processing.
- Extracted **text chunks** are stored in **ChromaDB**.
### **2. Question Answering System**
- Main interface displays a **text area** for users to enter questions.
- Clicking **Ask** triggers the retrieval and response generation process:
1. **Query ChromaDB** to retrieve relevant documents.
2. **Re-rank documents** using **cross-encoder**.
3. **Pass relevant text** and **question** to the **LLM**.
4. Stream and display the AI-generated response.
5. Provide options to view **retrieved documents and rankings**.
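With retrieval, re-ranking, and generation injected as callables, the Ask flow above composes into one function (the stubs below stand in for the real components):

```python
def answer_question(prompt, query_fn, rerank_fn, llm_fn):
    """Wire together the Ask flow: retrieve -> re-rank -> generate."""
    results = query_fn(prompt)                # step 1: query ChromaDB
    documents = results["documents"][0]       # chunks returned for the query
    context, _indices = rerank_fn(documents)  # step 2: cross-encoder re-rank
    return llm_fn(context, prompt)            # step 3: pass context + question to the LLM

# Stub components to show the plumbing:
answer = answer_question(
    "What is RAG?",
    query_fn=lambda p: {"documents": [["retrieval augments generation"]]},
    rerank_fn=lambda docs: (docs[0], [0]),
    llm_fn=lambda ctx, p: f"Answer based on: {ctx}",
)
# answer == "Answer based on: retrieval augments generation"
```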
---
## **Technologies Used**
- **Streamlit** → Framework for the interactive web UI.
- **PyMuPDF** → PDF text extraction.
- **ChromaDB** → Vector database for semantic search.
- **Ollama** → LLM API for generating responses.
- **LangChain** → Document processing utilities.
- **Sentence Transformers (CrossEncoder)** → Document re-ranking.
---
## **Error Handling & Edge Cases**
- **File I/O Errors**: Proper handling of **temporary file read/write issues**.
- **ChromaDB Errors**: Manages **query failures** and maintains **database consistency**.
- **Ollama API Failures**: Detects and **handles API unavailability or timeouts**.
- **Empty Document Handling**: Ensures that **no empty files** are processed.
- **Invalid Queries**: Provides **feedback for low-relevance queries**.
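One way to realize these guards is a thin wrapper that catches failures from any stage and surfaces a user-facing message instead of a traceback (the stage names and message format here are illustrative):

```python
def safe_call(stage: str, fn, *args, **kwargs):
    """Run one pipeline stage; on failure return (None, error message)."""
    try:
        return fn(*args, **kwargs), None
    except Exception as exc:  # e.g. file I/O, ChromaDB, or Ollama API errors
        return None, f"{stage} failed: {exc}"

result, error = safe_call("Query", lambda: 1 / 0)
# result is None; error starts with "Query failed:"
```

In the Streamlit app, a non-`None` error would be shown with `st.error` so the user gets actionable feedback rather than a crash.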
---
## **Conclusion**
This application provides a **RAG-based interactive Q&A system**, leveraging **retrieval, ranking, and generation** techniques to deliver highly **relevant AI-generated responses**. The architecture ensures efficient document processing, vector storage, and intelligent answer generation using state-of-the-art models and embeddings.