# extract.py
## Overview
This program processes text documents, extracts key concepts using a language model, constructs a graph representation of these concepts, and visualizes the resulting network using Pyvis and NetworkX. The extracted relationships between terms are stored in CSV files, and the final graph is displayed in an interactive HTML file.
## Dependencies
The program requires the following Python libraries:
- `pyvis.network`
- `seaborn`
- `networkx`
- `pandas`
- `numpy`
- `os`
- `pathlib`
- `random`
- `sys`
- `subprocess`
- `langchain.document_loaders`
- `langchain.text_splitter`
- `helpers.df_helpers`
## Workflow
### 1. Input Handling
The program takes its input via a command-line argument (see Usage below), stores the input under the `data_input` directory, and creates the output directory (`data_output`) if it does not exist.
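A minimal sketch of this setup. Whether the argument is raw text or a path to a text file is ambiguous in the description; the Usage section suggests a path, so that reading is assumed here, and the copy step is illustrative:
```python
import shutil
import sys
from pathlib import Path

data_input = Path("data_input")
data_output = Path("data_output")
data_input.mkdir(exist_ok=True)   # created if missing (see Notes)
data_output.mkdir(exist_ok=True)

# Assumption: argv[1] points at the input text file; copying it into
# data_input lets DirectoryLoader pick it up in the next step.
shutil.copy(sys.argv[1], data_input)
```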
### 2. Document Loading and Splitting
- The program loads documents using `langchain.document_loaders.DirectoryLoader`.
- Text is split into chunks using `RecursiveCharacterTextSplitter` with a chunk size of 1500 and overlap of 150 characters.
- The extracted text chunks are converted into a Pandas DataFrame.
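A minimal sketch of this step; the DataFrame column names are assumptions:
```python
from pathlib import Path

import pandas as pd
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_and_chunk(input_dir: Path) -> pd.DataFrame:
    # Load every document stored under data_input.
    documents = DirectoryLoader(str(input_dir), show_progress=True).load()

    # Split into ~1500-character chunks overlapping by 150 characters,
    # matching the parameters listed above.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
    chunks = splitter.split_documents(documents)

    # One row per chunk; "text" and "source" are illustrative column names.
    return pd.DataFrame(
        [{"text": c.page_content, "source": c.metadata.get("source")} for c in chunks]
    )
```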
### 3. Graph Generation
- If `regenerate` is set to `True`, extracted text chunks are processed to generate a concept graph using `df2Graph`.
- The relationships are stored in a CSV file (`graph.csv`).
- The extracted text chunks are stored in `chunks.csv`.
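A sketch of the regenerate branch, assuming `df` is the chunk DataFrame from step 2 and that `df2Graph` returns node-pair records; its exact return shape is project-specific:
```python
import pandas as pd

from helpers.df_helpers import df2Graph  # project-local helper

regenerate = True  # set False to reuse a previously written graph.csv

if regenerate:
    # df2Graph sends each chunk to the language model and returns concept
    # pairs; node_1 / node_2 / edge columns are assumed here.
    dfg = pd.DataFrame(df2Graph(df))
    dfg.to_csv("data_output/graph.csv", index=False)
    df.to_csv("data_output/chunks.csv", index=False)
else:
    dfg = pd.read_csv("data_output/graph.csv")
```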
### 4. Contextual Proximity Calculation
The `contextual_proximity` function:
- Establishes relationships between terms appearing in the same text chunk.
- Generates additional edges in the graph based on co-occurrence in chunks.
- Drops edges with only one occurrence.
- Assigns the label `contextual proximity` to these relationships.
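A pandas sketch of this function, assuming the edge DataFrame carries `node_1`, `node_2`, and `chunk_id` columns:
```python
import pandas as pd

def contextual_proximity(dfg: pd.DataFrame) -> pd.DataFrame:
    # Melt to one node per row, then self-join on chunk_id so every pair
    # of terms sharing a chunk becomes a candidate edge.
    nodes = dfg.melt(
        id_vars=["chunk_id"], value_vars=["node_1", "node_2"], value_name="node"
    ).drop(columns=["variable"])
    pairs = nodes.merge(nodes, on="chunk_id", suffixes=("_1", "_2"))
    pairs = pairs[pairs["node_1"] != pairs["node_2"]]

    # Count co-occurrences; the count doubles as the edge weight. Note
    # that each unordered pair appears here in both orders.
    edges = (
        pairs.groupby(["node_1", "node_2"])
        .agg(count=("chunk_id", "size"))
        .reset_index()
    )

    # Drop single-occurrence edges and label the survivors.
    edges = edges[edges["count"] > 1]
    edges["edge"] = "contextual proximity"
    return edges
```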
### 5. Graph Construction
- A `networkx.Graph` object is created.
- Nodes and edges are added, with each edge weight scaled down by a factor of 4.
- Communities in the graph are detected using the Girvan-Newman algorithm.
- Each community is assigned a unique color.
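A sketch of this step, assuming the merged edge list has `node_1`, `node_2`, `edge`, and `count` columns (the last produced by the proximity step):
```python
import networkx as nx
import seaborn as sns
from networkx.algorithms import community as nx_community

G = nx.Graph()
for node in set(dfg["node_1"]) | set(dfg["node_2"]):
    G.add_node(str(node))
for _, row in dfg.iterrows():
    # Edge weight is the co-occurrence count divided by 4, per the step above.
    G.add_edge(str(row["node_1"]), str(row["node_2"]),
               title=row["edge"], weight=row["count"] / 4)

# Girvan-Newman yields successively finer partitions; take the first.
communities = next(nx_community.girvan_newman(G))

# Assign one distinct color (and group id) per detected community.
palette = sns.color_palette("hls", len(communities)).as_hex()
for group, (color, members) in enumerate(zip(palette, communities)):
    for node in members:
        G.nodes[node]["group"] = group
        G.nodes[node]["color"] = color
```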
### 6. Graph Visualization
- Pyvis is used to create an interactive visualization of the graph.
- The visualization is saved as `index.html` inside the `docs` directory.
- The layout uses Pyvis's `force_atlas_2based` physics solver to position nodes, as sketched below.
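A sketch of the rendering step; the solver parameters here are illustrative, not the program's own values:
```python
from pyvis.network import Network

net = Network(height="900px", width="100%", cdn_resources="remote")
net.from_nx(G)  # reuse the NetworkX graph built above

# force_atlas_2based exposes its physics constants as keyword arguments.
net.force_atlas_2based(central_gravity=0.015, gravity=-31)
net.save_graph("docs/index.html")
```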
## Output
- Processed document data (`chunks.csv`).
- Extracted concept relationships (`graph.csv`).
- Interactive graph visualization (`index.html`).
- Notifications are sent via `wsl-notify-send.exe` when processing starts and completes.
## Usage
Run the script with the input as its command-line argument:
```bash
python extract.py path/to/file
```
## Notes
- The program creates necessary directories if they do not exist.
- If `regenerate` is `False`, the program reads precomputed relationships from `graph.csv` instead of generating them anew.
- Community detection enhances graph visualization by grouping related terms.
- The visualization can be viewed in a web browser by opening `docs/index.html`.
---
# gradio-app.py
## Overview
This program implements a Retrieval-Augmented Generation (RAG) system that allows users to upload PDF documents, extract and store textual information in a vector database, and query the system to retrieve contextually relevant information. It also integrates a knowledge graph generation mechanism to visualize extracted knowledge.
## Dependencies
The program utilizes the following libraries:
- `gradio`: For building an interactive web-based interface.
- `chromadb`: For vector storage and retrieval.
- `ollama`: For handling LLM-based responses.
- `langchain_community`: For PDF document loading and text processing.
- `sentence_transformers`: For cross-encoder-based document re-ranking.
- `subprocess`, `tempfile`, and `os`: For handling system-level tasks.
## Workflow
1. **Document Processing**
- A PDF file is uploaded via the Gradio interface.
- The `process_document` function extracts text from the PDF and splits it into chunks using `RecursiveCharacterTextSplitter`.
- The extracted text chunks are stored in a ChromaDB vector collection.
2. **Query Processing**
- A user enters a query via the Gradio interface.
- The `query_collection` function retrieves relevant text chunks from the vector collection.
- The retrieved chunks are re-ranked using a cross-encoder model.
- The most relevant text is passed to an LLM for generating a response.
3. **Knowledge Graph Generation**
- The generated response is saved temporarily.
- The `extract.py` script is executed to create a knowledge graph.
- The system notifies the user of success or failure.
## Core Functions
### `process_document(file_path: str) -> list[Document]`
Extracts text from a PDF and splits it into chunks for further processing.
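A sketch under the assumption that one of `langchain_community`'s PDF loaders (here `PyMuPDFLoader`) does the extraction; the chunking parameters are illustrative:
```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_document(file_path: str) -> list[Document]:
    # Extract page-level text from the uploaded PDF.
    docs = PyMuPDFLoader(file_path).load()
    # Chunk size and overlap are assumptions, not the program's values.
    splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=100)
    return splitter.split_documents(docs)
```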
### `get_vector_collection() -> chromadb.Collection`
Creates or retrieves a ChromaDB collection for vector-based semantic search.
### `add_to_vector_collection(all_splits: list[Document], file_name: str)`
Adds processed document chunks to the vector collection with metadata.
### `query_collection(prompt: str, n_results: int = 10) -> dict`
Queries the vector collection to retrieve contextually relevant documents.
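A combined sketch of the three collection helpers above; the persistence path, collection name, and embedding model are all assumptions:
```python
import chromadb
from chromadb.utils.embedding_functions import OllamaEmbeddingFunction
from langchain_core.documents import Document

def get_vector_collection() -> chromadb.Collection:
    # Embeddings are assumed to come from a local Ollama model.
    embedding_fn = OllamaEmbeddingFunction(
        url="http://localhost:11434/api/embeddings",
        model_name="nomic-embed-text",
    )
    client = chromadb.PersistentClient(path="./chroma-db")
    return client.get_or_create_collection(
        name="rag_app", embedding_function=embedding_fn
    )

def add_to_vector_collection(all_splits: list[Document], file_name: str):
    # Per-file IDs keep re-uploads of the same document idempotent.
    get_vector_collection().upsert(
        documents=[d.page_content for d in all_splits],
        metadatas=[d.metadata for d in all_splits],
        ids=[f"{file_name}_{i}" for i in range(len(all_splits))],
    )

def query_collection(prompt: str, n_results: int = 10) -> dict:
    return get_vector_collection().query(query_texts=[prompt], n_results=n_results)
```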
### `call_llm(context, prompt) -> str`
Passes the context and question to the `deepseek-r1` LLM for response generation.
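A sketch of the call; the system-prompt wording is an assumption:
```python
import ollama

def call_llm(context: str, prompt: str) -> str:
    response = ollama.chat(
        model="deepseek-r1",
        messages=[
            {"role": "system",
             "content": "Answer the question using only the given context."},
            {"role": "user",
             "content": f"Context: {context}\n\nQuestion: {prompt}"},
        ],
    )
    return response["message"]["content"]
```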
### `re_rank_cross_encoders(documents: list[str], prompt) -> tuple[str, list[int]]`
Uses a cross-encoder model to re-rank retrieved documents based on query relevance.
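A sketch using the `rank` helper built into `sentence_transformers`; the checkpoint name is an assumption:
```python
from sentence_transformers import CrossEncoder

def re_rank_cross_encoders(documents: list[str], prompt: str) -> tuple[str, list[int]]:
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # rank() scores every document against the query and returns the
    # top_k hits as dicts carrying a corpus_id index and a score.
    ranks = encoder.rank(prompt, documents, top_k=3)

    relevant_text, relevant_ids = "", []
    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]]
        relevant_ids.append(rank["corpus_id"])
    return relevant_text, relevant_ids
```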
### `process_question(prompt: str) -> str`
Combines querying, re-ranking, and LLM response generation.
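How the pieces compose, reusing the sketches above:
```python
def process_question(prompt: str) -> str:
    results = query_collection(prompt)
    documents = results["documents"][0]  # Chroma nests results per query
    if not documents:
        return "No relevant documents found for this question."
    relevant_text, _ = re_rank_cross_encoders(documents, prompt)
    return call_llm(context=relevant_text, prompt=prompt)
```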
### `create_knowledge_graph(response: str) -> str`
Executes the `extract.py` script to generate a knowledge graph based on LLM output.
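A sketch of the hand-off; the temp-file suffix and the status messages are illustrative:
```python
import subprocess
import sys
import tempfile

def create_knowledge_graph(response: str) -> str:
    # Persist the LLM response so extract.py can read it (see its Usage).
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(response)
        tmp_path = f.name

    result = subprocess.run(
        [sys.executable, "extract.py", tmp_path],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return f"Knowledge graph generation failed: {result.stderr}"
    return "Knowledge graph generated: open docs/index.html"
```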
### `process_pdf(file_path: str)`
Processes an uploaded PDF, extracts text, and adds it to the vector collection.
## Gradio Interface
The Gradio-based UI consists of:
- **File Upload Section**: Users upload a PDF for processing.
- **Query Section**: Users ask questions related to the uploaded content.
- **Knowledge Graph Section**: Users can generate a visual representation of extracted knowledge.
## Execution
To run the program:
```sh
python gradio-app.py
```
This launches the Gradio interface, allowing document uploads and question answering.
## Error Handling
- If document processing fails, an error message is displayed in the UI.
- If no relevant documents are found for a query, the system returns an appropriate message.
- If knowledge graph generation fails, the error is captured and displayed.