extract.py
Overview
This program processes text documents, extracts key concepts using a language model, constructs a graph representation of these concepts, and visualizes the resulting network using Pyvis and NetworkX. The extracted relationships between terms are stored in CSV files, and the final graph is displayed in an interactive HTML file.
Dependencies
The program requires the following Python libraries:
- `pyvis.network`
- `seaborn`
- `networkx`
- `pandas`
- `numpy`
- `os`, `pathlib`, `random`, `sys`, `subprocess` (standard library)
- `langchain.document_loaders`
- `langchain.text_splitter`
- `helpers.df_helpers` (project helper module)
Workflow
1. Input Handling
The program expects a command-line argument containing text data. It stores the input data in a specified directory (`data_input`) and creates the necessary output directory (`data_output`).
2. Document Loading and Splitting
- The program loads documents using `langchain.document_loaders.DirectoryLoader`.
- Text is split into chunks using `RecursiveCharacterTextSplitter` with a chunk size of 1500 characters and an overlap of 150 characters.
- The extracted text chunks are converted into a Pandas DataFrame (see the sketch below).
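A minimal sketch of this step, assuming the loader and splitter named above; the input directory and the DataFrame columns are illustrative choices, not taken from the script:

```python
import pandas as pd
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every document found in the input directory.
loader = DirectoryLoader("data_input", show_progress=True)
documents = loader.load()

# Split into overlapping chunks, matching the sizes stated above.
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
chunks = splitter.split_documents(documents)

# One row per chunk; the chunk_id column is an assumption used by later steps.
df = pd.DataFrame(
    [{"text": c.page_content, "chunk_id": i} for i, c in enumerate(chunks)]
)
```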
3. Graph Generation
- If `regenerate` is set to `True`, extracted text chunks are processed to generate a concept graph using `df2Graph`.
- The relationships are stored in a CSV file (`graph.csv`).
- The extracted text chunks are stored in `chunks.csv` (see the sketch below).
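A hedged sketch of this step; `df2Graph` is the project helper named above, and its exact signature and return shape (a list of node_1/node_2/edge records) are assumptions:

```python
import pandas as pd
from helpers.df_helpers import df2Graph  # project helper; signature assumed

regenerate = True
if regenerate:
    # Ask the LLM to extract (node_1, node_2, edge) triples from each chunk.
    concepts = df2Graph(df)
    dfg = pd.DataFrame(concepts)
    dfg.to_csv("data_output/graph.csv", index=False)
    df.to_csv("data_output/chunks.csv", index=False)
else:
    # Reuse precomputed relationships (see Notes below).
    dfg = pd.read_csv("data_output/graph.csv")
```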
4. Contextual Proximity Calculation
The `contextual_proximity` function:
- Establishes relationships between terms appearing in the same text chunk.
- Generates additional edges in the graph based on co-occurrence in chunks.
- Drops edges with only one occurrence.
- Assigns the label `contextual proximity` to these relationships (see the sketch below).
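A sketch of how such a function can be implemented in pandas, assuming the relationship DataFrame carries `node_1`, `node_2`, and `chunk_id` columns:

```python
import pandas as pd

def contextual_proximity(df: pd.DataFrame) -> pd.DataFrame:
    # Melt node_1/node_2 into one column so each term is paired with its chunk.
    long_df = pd.melt(
        df, id_vars=["chunk_id"], value_vars=["node_1", "node_2"], value_name="node"
    ).drop(columns=["variable"])
    # Self-join on chunk_id: any two terms sharing a chunk become a candidate edge.
    pairs = pd.merge(long_df, long_df, on="chunk_id", suffixes=("_1", "_2"))
    pairs = pairs[pairs["node_1"] != pairs["node_2"]]  # remove self-loops
    # Count co-occurrences and drop pairs that appear only once.
    edges = (
        pairs.groupby(["node_1", "node_2"], as_index=False)
        .agg(count=("chunk_id", "count"))
    )
    edges = edges[edges["count"] > 1]
    edges["edge"] = "contextual proximity"
    return edges
```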
5. Graph Construction
- A `networkx.Graph` object is created.
- Nodes and edges are added, with edge weights normalized by dividing by 4.
- Communities in the graph are detected using the Girvan-Newman algorithm.
- Each community is assigned a unique color (see the sketch below).
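A sketch under the same assumptions (`dfg` holds `node_1`, `node_2`, `edge`, and `count` columns); the seaborn palette is one illustrative way to give each community a unique color:

```python
import networkx as nx
import seaborn as sns
from networkx.algorithms.community import girvan_newman

G = nx.Graph()
for _, row in dfg.iterrows():
    G.add_edge(
        row["node_1"],
        row["node_2"],
        title=row["edge"],
        weight=row["count"] / 4,  # edge weight normalized by dividing by 4
    )

# Girvan-Newman yields successive splits; take the first partition.
communities = next(girvan_newman(G))
palette = sns.color_palette("hls", len(communities)).as_hex()
for color, nodes in zip(palette, communities):
    for node in nodes:
        G.nodes[node]["color"] = color  # Pyvis reads the 'color' attribute
```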
6. Graph Visualization
- Pyvis is used to create an interactive visualization of the graph.
- The visualization is saved as `index.html` inside the `docs` directory.
- The layout uses the `force_atlas_2based` algorithm to position nodes (see the sketch below).
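A sketch of the visualization step; the canvas size and physics parameters are illustrative, not values taken from the script:

```python
from pyvis.network import Network

net = Network(height="900px", width="100%", notebook=False)
net.from_nx(G)  # import the NetworkX graph built above
net.force_atlas_2based(central_gravity=0.015, gravity=-31)
net.save_graph("docs/index.html")
```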
Output
- Processed document data (`chunks.csv`).
- Extracted concept relationships (`graph.csv`).
- Interactive graph visualization (`index.html`).
- Notifications are sent via `wsl-notify-send.exe` when processing starts and completes.
Usage
Execute the script with an argument containing text input:
python extract.py path/to/file
Notes
- The program creates necessary directories if they do not exist.
- If `regenerate` is `False`, the program reads precomputed relationships from `graph.csv` instead of generating them anew.
- Community detection enhances graph visualization by grouping related terms.
- The visualization can be viewed in a web browser by opening `docs/index.html`.
gradio-app.py
Overview
This program implements a Retrieval-Augmented Generation (RAG) system that allows users to upload PDF documents, extract and store textual information in a vector database, and query the system to retrieve contextually relevant information. It also integrates a knowledge graph generation mechanism to visualize extracted knowledge.
Dependencies
The program utilizes the following libraries:
- `gradio`: For building an interactive web-based interface.
- `chromadb`: For vector storage and retrieval.
- `ollama`: For handling LLM-based responses.
- `langchain_community`: For PDF document loading and text processing.
- `sentence_transformers`: For cross-encoder-based document re-ranking.
- `subprocess`, `tempfile`, and `os`: For handling system-level tasks.
Workflow
1. Document Processing
   - A PDF file is uploaded via the Gradio interface.
   - The `process_document` function extracts text from the PDF and splits it into chunks using `RecursiveCharacterTextSplitter`.
   - The extracted text chunks are stored in a ChromaDB vector collection.
2. Query Processing
   - A user enters a query via the Gradio interface.
   - The `query_collection` function retrieves relevant text chunks from the vector collection.
   - The retrieved chunks are re-ranked using a cross-encoder model.
   - The most relevant text is passed to an LLM for generating a response.
3. Knowledge Graph Generation
   - The generated response is saved temporarily.
   - The `extract.py` script is executed to create a knowledge graph.
   - The system notifies the user of success or failure.
Core Functions
process_document(file_path: str) -> list[Document]
Extracts text from a PDF and splits it into chunks for further processing.
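A minimal sketch, assuming `PyMuPDFLoader` as the PDF loader (the docs above only say `langchain_community` handles PDF loading) and illustrative chunk sizes:

```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def process_document(file_path: str) -> list[Document]:
    # Extract text from the PDF, then split it into overlapping chunks.
    docs = PyMuPDFLoader(file_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=100)
    return splitter.split_documents(docs)
```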
get_vector_collection() -> chromadb.Collection
Creates or retrieves a ChromaDB collection for vector-based semantic search.
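A sketch; the persistence path, collection name, and distance metric are assumptions:

```python
import chromadb

def get_vector_collection() -> chromadb.Collection:
    client = chromadb.PersistentClient(path="./chroma_db")
    return client.get_or_create_collection(
        name="rag_app",
        metadata={"hnsw:space": "cosine"},  # cosine distance is an assumed choice
    )
```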
add_to_vector_collection(all_splits: list[Document], file_name: str)
Adds processed document chunks to the vector collection with metadata.
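A sketch; the id scheme is a hypothetical choice:

```python
def add_to_vector_collection(all_splits: list[Document], file_name: str):
    collection = get_vector_collection()
    documents, metadatas, ids = [], [], []
    for idx, split in enumerate(all_splits):
        documents.append(split.page_content)
        metadatas.append(split.metadata)
        ids.append(f"{file_name}_{idx}")  # hypothetical id scheme
    # Upsert so re-processing the same file does not duplicate entries.
    collection.upsert(documents=documents, metadatas=metadatas, ids=ids)
```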
query_collection(prompt: str, n_results: int = 10) -> dict
Queries the vector collection to retrieve contextually relevant documents.
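A sketch of the retrieval call; ChromaDB embeds the query text with the collection's embedding function:

```python
def query_collection(prompt: str, n_results: int = 10) -> dict:
    collection = get_vector_collection()
    return collection.query(query_texts=[prompt], n_results=n_results)
```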
call_llm(context, prompt) -> str
Passes the context and question to the `deepseek-r1` LLM for response generation.
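A sketch using the `ollama` client; the system prompt wording is an assumption, while the model name comes from the description above:

```python
import ollama

def call_llm(context: str, prompt: str) -> str:
    response = ollama.chat(
        model="deepseek-r1",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {prompt}"},
        ],
    )
    return response["message"]["content"]
```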
re_rank_cross_encoders(documents: list[str], prompt) -> tuple[str, list[int]]
Uses a cross-encoder model to re-rank retrieved documents based on query relevance.
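A sketch; the specific cross-encoder model and `top_k` value are assumptions:

```python
from sentence_transformers import CrossEncoder

def re_rank_cross_encoders(documents: list[str], prompt: str) -> tuple[str, list[int]]:
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder.rank(prompt, documents, top_k=3)  # keep the 3 best matches
    relevant_ids = [r["corpus_id"] for r in ranks]
    relevant_text = "\n\n".join(documents[i] for i in relevant_ids)
    return relevant_text, relevant_ids
```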
process_question(prompt: str) -> str
Combines querying, re-ranking, and LLM response generation.
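A sketch composing the functions above; the fallback message is an assumption:

```python
def process_question(prompt: str) -> str:
    results = query_collection(prompt)
    documents = results.get("documents", [[]])[0]  # ChromaDB nests results per query
    if not documents:
        return "No relevant documents found."
    relevant_text, _ = re_rank_cross_encoders(documents, prompt)
    return call_llm(relevant_text, prompt)
```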
create_knowledge_graph(response: str) -> str
Executes the `extract.py` script to generate a knowledge graph based on the LLM output.
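A sketch; the temp-file handoff follows the workflow above, while the return messages are assumptions:

```python
import subprocess
import tempfile

def create_knowledge_graph(response: str) -> str:
    # Save the LLM response to a temporary file, then hand it to extract.py.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(response)
        tmp_path = f.name
    result = subprocess.run(
        ["python", "extract.py", tmp_path], capture_output=True, text=True
    )
    return "Knowledge graph created." if result.returncode == 0 else result.stderr
```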
process_pdf(file_path: str)
Processes an uploaded PDF, extracts text, and adds it to the vector collection.
Gradio Interface
The Gradio-based UI consists of:
- File Upload Section: Users upload a PDF for processing.
- Query Section: Users ask questions related to the uploaded content.
- Knowledge Graph Section: Users can generate a visual representation of extracted knowledge (see the sketch below).
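A sketch of how the three sections might be wired together with Gradio Blocks; the component names and layout are assumptions:

```python
import gradio as gr

with gr.Blocks() as demo:
    # File Upload Section
    pdf = gr.File(label="Upload PDF", file_types=[".pdf"], type="filepath")
    status = gr.Textbox(label="Processing status")
    pdf.upload(process_pdf, inputs=pdf, outputs=status)

    # Query Section
    question = gr.Textbox(label="Ask a question")
    answer = gr.Textbox(label="Answer")
    question.submit(process_question, inputs=question, outputs=answer)

    # Knowledge Graph Section
    graph_btn = gr.Button("Generate knowledge graph")
    graph_status = gr.Textbox(label="Graph status")
    graph_btn.click(create_knowledge_graph, inputs=answer, outputs=graph_status)

demo.launch()
```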
Execution
To run the program:
python gradio-app.py
This launches the Gradio interface, allowing document uploads and question answering.
Error Handling
- If document processing fails, an error message is displayed in the UI.
- If no relevant documents are found for a query, the system returns an appropriate message.
- If knowledge graph generation fails, the error is captured and displayed.