Diffstat (limited to 'README.md')
-rw-r--r--  README.md  284
1 file changed, 147 insertions, 137 deletions
@@ -1,141 +1,151 @@
-# **Document Ingestion and Semantic Query System Using Retrieval-Augmented Generation (RAG)**
-
-## **Overview**
-This application implements a **Retrieval-Augmented Generation (RAG) based Question Answering System** using Streamlit for the user interface, ChromaDB for vector storage, and Ollama for generating responses. The system allows users to upload **PDF documents**, process them into **text chunks**, store them as **vector embeddings**, and retrieve relevant information to generate AI-powered responses.
-
----
-
-## **System Components**
-
-### **1. File Processing and Text Chunking**
-**Function:** `process_document(uploaded_file: UploadedFile) -> list[Document]`
-
-- Takes a user-uploaded **PDF file** and processes it into **smaller text chunks**.
-- Uses **PyMuPDFLoader** to extract text from PDFs.
-- Splits extracted text into **overlapping segments** using **RecursiveCharacterTextSplitter**.
-- Returns a list of **Document objects** containing text chunks and metadata.
-
-**Key Steps:**
-1. Save uploaded file to a **temporary file**.
-2. Load content using **PyMuPDFLoader**.
-3. Split text using **RecursiveCharacterTextSplitter**.
-4. Delete the temporary file.
-5. Return the **list of Document objects**.
-
----
-
-### **2. Vector Storage and Retrieval (ChromaDB)**
-
-#### **Creating a ChromaDB Collection**
-**Function:** `get_vector_collection() -> chromadb.Collection`
-
-- Initializes **ChromaDB** with a **persistent vector store**.
-- Uses **OllamaEmbeddingFunction** to generate vector embeddings.
-- Retrieves or creates a collection for storing **document embeddings**.
-- Uses **cosine similarity** for querying documents.
-
-**Key Steps:**
-1. Define **OllamaEmbeddingFunction** for embedding generation.
-2. Initialize **ChromaDB PersistentClient**.
-3. Retrieve or create a **ChromaDB collection** for storing vectors.
-4. Return the **collection object**.
-
-#### **Adding Documents to Vector Store**
-**Function:** `add_to_vector_collection(all_splits: list[Document], file_name: str)`
-
-- Takes a list of document chunks and stores them in **ChromaDB**.
-- Each document is stored with **unique IDs** based on file name.
-- Success message displayed via **Streamlit**.
-
-**Key Steps:**
-1. Retrieve ChromaDB collection using `get_vector_collection()`.
-2. Convert document chunks into a list of **text embeddings, metadata, and unique IDs**.
-3. Use `upsert()` to store document embeddings.
-4. Display success message.
-
-#### **Querying the Vector Collection**
-**Function:** `query_collection(prompt: str, n_results: int = 10) -> dict`
-
-- Queries **ChromaDB** with a user-provided search query.
-- Returns the **top n most relevant documents** based on similarity.
-
-**Key Steps:**
-1. Retrieve ChromaDB collection.
-2. Perform query using `collection.query()`.
-3. Return **retrieved documents and metadata**.
-
----
-
-### **3. Language Model Interaction (Ollama API)**
-
-#### **Generating Responses using the AI Model**
-**Function:** `call_llm(context: str, prompt: str)`
-
-- Calls **Ollama**'s language model to generate a **context-aware response**.
-- Uses a **system prompt** to guide the model's behavior.
-- Streams the AI-generated response in **chunks**.
-
-**Key Steps:**
-1. Send **system prompt** and user query to **Ollama**.
-2. Retrieve and yield streamed responses.
-3. Display results in **Streamlit**.
+# extract.py
+
+## Overview
+This program processes text documents, extracts key concepts using a language model, constructs a graph representation of these concepts, and visualizes the resulting network using Pyvis and NetworkX. The extracted relationships between terms are stored in CSV files, and the final graph is displayed in an interactive HTML file.
+
+## Dependencies
+The program requires the following Python libraries:
+- `pyvis.network`
+- `seaborn`
+- `networkx`
+- `pandas`
+- `numpy`
+- `os`
+- `pathlib`
+- `random`
+- `sys`
+- `subprocess`
+- `langchain.document_loaders`
+- `langchain.text_splitter`
+- `helpers.df_helpers`
+
+## Workflow
+
+### 1. Input Handling
+The program expects command-line arguments containing text data. It stores the input data in a specified directory (`data_input`) and creates the necessary output directories (`data_output`).
+
+### 2. Document Loading and Splitting
+- The program loads documents using `langchain.document_loaders.DirectoryLoader`.
+- Text is split into chunks using `RecursiveCharacterTextSplitter` with a chunk size of 1,500 characters and an overlap of 150 characters.
+- The extracted text chunks are converted into a Pandas DataFrame, as sketched below.
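+
+A minimal sketch of this step, assuming the input lives in `data_input` and that each chunk is tagged with a UUID; the actual helper in `helpers.df_helpers` may use different column names:
+
+```python
+import uuid
+
+import pandas as pd
+from langchain.document_loaders import DirectoryLoader
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+# Load every document found in the input directory.
+loader = DirectoryLoader("data_input", show_progress=True)
+documents = loader.load()
+
+# Split into overlapping chunks, matching the sizes described above.
+splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
+pages = splitter.split_documents(documents)
+
+# One row per chunk; the chunk_id lets graph edges point back to their source chunk.
+df = pd.DataFrame(
+    [
+        {
+            "text": page.page_content,
+            "source": page.metadata.get("source"),
+            "chunk_id": uuid.uuid4().hex,
+        }
+        for page in pages
+    ]
+)
+```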
+
+### 3. Graph Generation
+- If `regenerate` is set to `True`, the extracted text chunks are processed with `df2Graph` to generate a concept graph.
+- The extracted relationships are stored in a CSV file (`graph.csv`).
+- The extracted text chunks are stored in `chunks.csv`.
+
+### 4. Contextual Proximity Calculation
+The `contextual_proximity` function:
+- Establishes relationships between terms appearing in the same text chunk.
+- Generates additional edges in the graph based on co-occurrence within chunks.
+- Drops edges that occur only once.
+- Assigns the label `contextual proximity` to these relationships.
+
+### 5. Graph Construction
+- A `networkx.Graph` object is created.
+- Nodes and edges are added, with edge weights normalized by dividing by 4.
+- Communities in the graph are detected using the Girvan-Newman algorithm.
+- Each community is assigned a unique color, as in the sketch below.
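+
+A condensed sketch of this step; the edge DataFrame `dfg`, its column names, and the CSV path are assumptions, and only the first Girvan-Newman split is kept:
+
+```python
+import networkx as nx
+import pandas as pd
+import seaborn as sns
+from networkx.algorithms.community import girvan_newman
+
+# Assumed edge list: one row per (node_1, node_2) pair with a co-occurrence count.
+dfg = pd.read_csv("data_output/graph.csv")
+
+G = nx.Graph()
+for _, row in dfg.iterrows():
+    # Edge weights are normalized by dividing by 4, as noted above.
+    G.add_edge(row["node_1"], row["node_2"], weight=row["count"] / 4)
+
+# girvan_newman() yields successively finer partitions; take the first one.
+communities = next(girvan_newman(G))
+palette = sns.color_palette("hls", len(communities)).as_hex()
+for color, community in zip(palette, communities):
+    for node in community:
+        G.nodes[node]["color"] = color  # picked up by Pyvis when exporting
+```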
+
+### 6. Graph Visualization
+- Pyvis is used to create an interactive visualization of the graph.
+- The visualization is saved as `index.html` inside the `docs` directory.
+- The layout uses the `force_atlas_2based` physics algorithm to position nodes.
+
+## Output
+- Processed document data (`chunks.csv`).
+- Extracted concept relationships (`graph.csv`).
+- Interactive graph visualization (`index.html`).
+- Notifications are sent via `wsl-notify-send.exe` when processing starts and completes.
+
+## Usage
+Execute the script with an argument pointing to the text input:
+```bash
+python extract.py path/to/file
+```
+
+## Notes
+- The program creates the necessary directories if they do not exist.
+- If `regenerate` is `False`, the program reads precomputed relationships from `graph.csv` instead of generating them anew.
+- Community detection enhances the visualization by grouping related terms.
+- The visualization can be viewed in a web browser by opening `docs/index.html`.
 ---
-### **4. Cross-Encoder Based Re-Ranking**
-**Function:** `re_rank_cross_encoders(documents: list[str]) -> tuple[str, list[int]]`
-
-- Uses **CrossEncoder (MS MARCO MiniLM model)** to **re-rank retrieved documents**.
-- Selects the **top 3 most relevant documents**.
-- Returns **concatenated relevant text** and **document indices**.
-
-**Key Steps:**
-1. Load **MS MARCO MiniLM CrossEncoder model**.
-2. Rank documents using **cross-encoder re-ranking**.
-3. Extract the **top-ranked documents**.
-4. Return **concatenated text** and **indices**.
-
----
-
-## **User Interface (Streamlit)**
-
-### **1. Document Uploading and Processing**
-- Sidebar allows **PDF file upload**.
-- User clicks **Process** to extract text and store embeddings.
-- File name is **normalized** before processing.
-- Extracted **text chunks** are stored in **ChromaDB**.
-
-### **2. Question Answering System**
-- Main interface displays a **text area** for users to enter questions.
-- Clicking **Ask** triggers the retrieval and response generation process:
- 1. **Query ChromaDB** to retrieve relevant documents.
- 2. **Re-rank documents** using **cross-encoder**.
- 3. **Pass relevant text** and **question** to the **LLM**.
- 4. Stream and display the AI-generated response.
- 5. Provide options to view **retrieved documents and rankings**.
-
----
-
-## **Technologies Used**
-- **Streamlit** → UI framework for interactive user interface.
-- **PyMuPDF** → PDF text extraction.
-- **ChromaDB** → Vector database for semantic search.
-- **Ollama** → LLM API for generating responses.
-- **LangChain** → Document processing utilities.
-- **Sentence Transformers (CrossEncoder)** → Document re-ranking.
-
----
-
-## **Error Handling & Edge Cases**
-- **File I/O Errors**: Proper handling of **temporary file read/write issues**.
-- **ChromaDB Errors**: Ensures **database consistency and query failures** are managed.
-- **Ollama API Failures**: Detects and **handles API unavailability or timeouts**.
-- **Empty Document Handling**: Ensures that **no empty files** are processed.
-- **Invalid Queries**: Provides **feedback for low-relevance queries**.
-
----
-
-## **Conclusion**
-This application provides a **RAG-based interactive Q&A system**, leveraging **retrieval, ranking, and generation** techniques to deliver highly **relevant AI-generated responses**. The architecture ensures efficient document processing, vector storage, and intelligent answer generation using state-of-the-art models and embeddings.
-
+# gradio-app.py
+
+## Overview
+This program implements a Retrieval-Augmented Generation (RAG) system that allows users to upload PDF documents, extract and store textual information in a vector database, and query the system to retrieve contextually relevant information. It also integrates a knowledge graph generation mechanism to visualize extracted knowledge.
+
+## Dependencies
+The program utilizes the following libraries:
+- `gradio`: For building an interactive web-based interface.
+- `chromadb`: For vector storage and retrieval.
+- `ollama`: For handling LLM-based responses.
+- `langchain_community`: For PDF document loading and text processing.
+- `sentence_transformers`: For cross-encoder-based document re-ranking.
+- `subprocess`, `tempfile`, and `os`: For handling system-level tasks.
+
+## Workflow
+1. **Document Processing**
+   - A PDF file is uploaded via the Gradio interface.
+   - The `process_document` function extracts text from the PDF and splits it into chunks using `RecursiveCharacterTextSplitter`.
+   - The extracted text chunks are stored in a ChromaDB vector collection.
+
+2. **Query Processing** (see the sketch after this list)
+   - A user enters a query via the Gradio interface.
+   - The `query_collection` function retrieves relevant text chunks from the vector collection.
+   - The retrieved chunks are re-ranked using a cross-encoder model.
+   - The most relevant text is passed to an LLM for generating a response.
+
+3. **Knowledge Graph Generation**
+   - The generated response is saved to a temporary file.
+   - The `extract.py` script is executed to create a knowledge graph.
+   - The system notifies the user of success or failure.
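+
+A condensed sketch of the query path in step 2, reusing the helper functions described under Core Functions below; the exact cross-encoder checkpoint is an assumption:
+
+```python
+from sentence_transformers import CrossEncoder
+
+def process_question(prompt: str) -> str:
+    # Retrieve candidate chunks from the vector collection.
+    results = query_collection(prompt)
+    documents = results["documents"][0]
+
+    # Re-rank the candidates against the question and keep the top 3.
+    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
+    ranks = encoder.rank(prompt, documents, top_k=3)
+    context = "\n\n".join(documents[r["corpus_id"]] for r in ranks)
+
+    # Hand the concatenated context and the question to the LLM.
+    return call_llm(context, prompt)
+```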
+
+## Core Functions
+### `process_document(file_path: str) -> list[Document]`
+Extracts text from a PDF and splits it into chunks for further processing.
+
+### `get_vector_collection() -> chromadb.Collection`
+Creates or retrieves a ChromaDB collection for vector-based semantic search.
+
+### `add_to_vector_collection(all_splits: list[Document], file_name: str)`
+Adds processed document chunks to the vector collection with metadata.
+
+### `query_collection(prompt: str, n_results: int = 10) -> dict`
+Queries the vector collection to retrieve contextually relevant documents.
+
+### `call_llm(context: str, prompt: str) -> str`
+Passes the context and question to the `deepseek-r1` LLM for response generation.
+
+### `re_rank_cross_encoders(documents: list[str], prompt: str) -> tuple[str, list[int]]`
+Uses a cross-encoder model to re-rank retrieved documents based on query relevance.
+
+### `process_question(prompt: str) -> str`
+Combines querying, re-ranking, and LLM response generation.
+
+### `create_knowledge_graph(response: str) -> str`
+Executes the `extract.py` script to generate a knowledge graph from the LLM output.
+
+### `process_pdf(file_path: str)`
+Processes an uploaded PDF, extracts text, and adds it to the vector collection.
+
+## Gradio Interface
+The Gradio-based UI consists of:
+- **File Upload Section**: Users upload a PDF for processing.
+- **Query Section**: Users ask questions related to the uploaded content.
+- **Knowledge Graph Section**: Users can generate a visual representation of extracted knowledge.
+
+## Execution
+To run the program:
+```sh
+python gradio-app.py
+```
+This launches the Gradio interface, allowing document uploads and question answering.
+
+## Error Handling
+- If document processing fails, an error message is displayed in the UI.
+- If no relevant documents are found for a query, the system returns an appropriate message.
+- If knowledge graph generation fails, the error is captured and displayed.
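+
+For reference, a minimal sketch of the vector-store setup described in `get_vector_collection`; the persist path, collection name, and embedding model are assumptions, not taken from the source:
+
+```python
+import chromadb
+from chromadb.utils.embedding_functions import OllamaEmbeddingFunction
+
+def get_vector_collection() -> chromadb.Collection:
+    # Embeddings are produced by a locally running Ollama instance.
+    embedding_fn = OllamaEmbeddingFunction(
+        url="http://localhost:11434/api/embeddings",
+        model_name="nomic-embed-text",
+    )
+    # Persistent client so embeddings survive restarts.
+    client = chromadb.PersistentClient(path="./demo-rag-chroma")
+    return client.get_or_create_collection(
+        name="rag_app",
+        embedding_function=embedding_fn,
+        metadata={"hnsw:space": "cosine"},  # cosine similarity, as described
+    )
+```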