extract.py
Overview
This program processes text documents, extracts key concepts using a language model, constructs a graph representation of these concepts, and visualizes the resulting network using Pyvis and NetworkX. The extracted relationships between terms are stored in CSV files, and the final graph is displayed in an interactive HTML file.
Dependencies
The program requires the following Python libraries:
- `pyvis.network`
- `seaborn`
- `networkx`
- `pandas`
- `numpy`
- `os`, `pathlib`, `random`, `sys`, `subprocess` (standard library)
- `langchain.document_loaders`
- `langchain.text_splitter`
- `helpers.df_helpers` (project helper module)
Workflow
1. Input Handling
The program expects a command-line argument containing text data. It stores the input data in a specified directory (`data_input`) and creates the necessary output directory (`data_output`).
2. Document Loading and Splitting
- The program loads documents using `langchain.document_loaders.DirectoryLoader`.
- Text is split into chunks using `RecursiveCharacterTextSplitter` with a chunk size of 1500 characters and an overlap of 150 characters.
- The extracted text chunks are converted into a Pandas DataFrame (see the sketch below).
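A minimal sketch of this step, assuming the loader and splitter named above; the input directory and the DataFrame columns are illustrative choices, not taken from the script:

```python
import pandas as pd
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every document found in the input directory.
loader = DirectoryLoader("data_input", show_progress=True)
documents = loader.load()

# Split into overlapping chunks, matching the sizes stated above.
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
chunks = splitter.split_documents(documents)

# One row per chunk; the chunk_id column is an assumption used by later steps.
df = pd.DataFrame(
    [{"text": c.page_content, "chunk_id": i} for i, c in enumerate(chunks)]
)
```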
3. Graph Generation
- If `regenerate` is set to `True`, extracted text chunks are processed to generate a concept graph using `df2Graph`.
- The relationships are stored in a CSV file (`graph.csv`).
- The extracted text chunks are stored in `chunks.csv` (see the sketch below).
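A hedged sketch of this step; `df2Graph` is the project helper named above, and its exact signature and return shape (a list of node_1/node_2/edge records) are assumptions:

```python
import pandas as pd
from helpers.df_helpers import df2Graph  # project helper; signature assumed

regenerate = True
if regenerate:
    # Ask the LLM to extract (node_1, node_2, edge) triples from each chunk.
    concepts = df2Graph(df)
    dfg = pd.DataFrame(concepts)
    dfg.to_csv("data_output/graph.csv", index=False)
    df.to_csv("data_output/chunks.csv", index=False)
else:
    # Reuse precomputed relationships (see Notes below).
    dfg = pd.read_csv("data_output/graph.csv")
```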
4. Contextual Proximity Calculation
The `contextual_proximity` function:
- Establishes relationships between terms appearing in the same text chunk.
- Generates additional edges in the graph based on co-occurrence in chunks.
- Drops edges with only one occurrence.
- Assigns the label `contextual proximity` to these relationships (see the sketch below).
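A sketch of how such a function can be implemented in pandas, assuming the relationship DataFrame carries `node_1`, `node_2`, and `chunk_id` columns:

```python
import pandas as pd

def contextual_proximity(df: pd.DataFrame) -> pd.DataFrame:
    # Melt node_1/node_2 into one column so each term is paired with its chunk.
    long_df = pd.melt(
        df, id_vars=["chunk_id"], value_vars=["node_1", "node_2"], value_name="node"
    ).drop(columns=["variable"])
    # Self-join on chunk_id: any two terms sharing a chunk become a candidate edge.
    pairs = pd.merge(long_df, long_df, on="chunk_id", suffixes=("_1", "_2"))
    pairs = pairs[pairs["node_1"] != pairs["node_2"]]  # remove self-loops
    # Count co-occurrences and drop pairs that appear only once.
    edges = (
        pairs.groupby(["node_1", "node_2"], as_index=False)
        .agg(count=("chunk_id", "count"))
    )
    edges = edges[edges["count"] > 1]
    edges["edge"] = "contextual proximity"
    return edges
```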
5. Graph Construction
- A `networkx.Graph` object is created.
- Nodes and edges are added, with edge weights normalized by dividing by 4.
- Communities in the graph are detected using the Girvan-Newman algorithm.
- Each community is assigned a unique color (see the sketch below).
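A sketch under the same assumptions (`dfg` holds `node_1`, `node_2`, `edge`, and `count` columns); the seaborn palette is one illustrative way to give each community a unique color:

```python
import networkx as nx
import seaborn as sns
from networkx.algorithms.community import girvan_newman

G = nx.Graph()
for _, row in dfg.iterrows():
    G.add_edge(
        row["node_1"],
        row["node_2"],
        title=row["edge"],
        weight=row["count"] / 4,  # edge weight normalized by dividing by 4
    )

# Girvan-Newman yields successive splits; take the first partition.
communities = next(girvan_newman(G))
palette = sns.color_palette("hls", len(communities)).as_hex()
for color, nodes in zip(palette, communities):
    for node in nodes:
        G.nodes[node]["color"] = color  # Pyvis reads the 'color' attribute
```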
6. Graph Visualization
- Pyvis is used to create an interactive visualization of the graph.
- The visualization is saved as `index.html` inside the `docs` directory.
- The layout uses the `force_atlas_2based` algorithm to position nodes (see the sketch below).
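A sketch of the visualization step; the canvas size and physics parameters are illustrative, not values taken from the script:

```python
from pyvis.network import Network

net = Network(height="900px", width="100%", notebook=False)
net.from_nx(G)  # import the NetworkX graph built above
net.force_atlas_2based(central_gravity=0.015, gravity=-31)
net.save_graph("docs/index.html")
```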
Output
- Processed document data (`chunks.csv`).
- Extracted concept relationships (`graph.csv`).
- Interactive graph visualization (`index.html`).
- Notifications are sent via `wsl-notify-send.exe` when processing starts and completes.
Usage
Execute the script with an argument containing text input:
python extract.py path/to/file
Notes
- The program creates necessary directories if they do not exist.
- If `regenerate` is `False`, the program reads precomputed relationships from `graph.csv` instead of generating them anew.
- Community detection enhances graph visualization by grouping related terms.
- The visualization can be viewed in a web browser by opening `docs/index.html`.
gradio-app.py
Overview
This program implements a Retrieval-Augmented Generation (RAG) system that allows users to upload PDF documents, extract and store textual information in a vector database, and query the system to retrieve contextually relevant information. It also integrates a knowledge graph generation mechanism to visualize extracted knowledge.
Dependencies
The program utilizes the following libraries:
- `gradio`: For building an interactive web-based interface.
- `chromadb`: For vector storage and retrieval.
- `ollama`: For handling LLM-based responses.
- `langchain_community`: For PDF document loading and text processing.
- `sentence_transformers`: For cross-encoder-based document re-ranking.
- `subprocess`, `tempfile`, and `os`: For handling system-level tasks.
Workflow
1. Document Processing
   - A PDF file is uploaded via the Gradio interface.
   - The `process_document` function extracts text from the PDF and splits it into chunks using `RecursiveCharacterTextSplitter`.
   - The extracted text chunks are stored in a ChromaDB vector collection.
2. Query Processing
   - A user enters a query via the Gradio interface.
   - The `query_collection` function retrieves relevant text chunks from the vector collection.
   - The retrieved chunks are re-ranked using a cross-encoder model.
   - The most relevant text is passed to an LLM for generating a response.
3. Knowledge Graph Generation
   - The generated response is saved temporarily.
   - The `extract.py` script is executed to create a knowledge graph.
   - The system notifies the user of success or failure.
Core Functions
process_document(file_path: str) -> list[Document]
Extracts text from a PDF and splits it into chunks for further processing.
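A minimal sketch, assuming `PyMuPDFLoader` as the PDF loader (the docs above only say `langchain_community` handles PDF loading) and illustrative chunk sizes:

```python
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def process_document(file_path: str) -> list[Document]:
    # Extract text from the PDF, then split it into overlapping chunks.
    docs = PyMuPDFLoader(file_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=100)
    return splitter.split_documents(docs)
```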
get_vector_collection() -> chromadb.Collection
Creates or retrieves a ChromaDB collection for vector-based semantic search.
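A sketch; the persistence path, collection name, and distance metric are assumptions:

```python
import chromadb

def get_vector_collection() -> chromadb.Collection:
    client = chromadb.PersistentClient(path="./chroma_db")
    return client.get_or_create_collection(
        name="rag_app",
        metadata={"hnsw:space": "cosine"},  # cosine distance is an assumed choice
    )
```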
add_to_vector_collection(all_splits: list[Document], file_name: str)
Adds processed document chunks to the vector collection with metadata.
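A sketch; the id scheme is a hypothetical choice:

```python
def add_to_vector_collection(all_splits: list[Document], file_name: str):
    collection = get_vector_collection()
    documents, metadatas, ids = [], [], []
    for idx, split in enumerate(all_splits):
        documents.append(split.page_content)
        metadatas.append(split.metadata)
        ids.append(f"{file_name}_{idx}")  # hypothetical id scheme
    # Upsert so re-processing the same file does not duplicate entries.
    collection.upsert(documents=documents, metadatas=metadatas, ids=ids)
```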
query_collection(prompt: str, n_results: int = 10) -> dict
Queries the vector collection to retrieve contextually relevant documents.
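A sketch of the retrieval call; ChromaDB embeds the query text with the collection's embedding function:

```python
def query_collection(prompt: str, n_results: int = 10) -> dict:
    collection = get_vector_collection()
    return collection.query(query_texts=[prompt], n_results=n_results)
```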
call_llm(context, prompt) -> str
Passes the context and question to the `deepseek-r1` LLM for response generation.
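A sketch using the `ollama` client; the system prompt wording is an assumption, while the model name comes from the description above:

```python
import ollama

def call_llm(context: str, prompt: str) -> str:
    response = ollama.chat(
        model="deepseek-r1",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {prompt}"},
        ],
    )
    return response["message"]["content"]
```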
re_rank_cross_encoders(documents: list[str], prompt) -> tuple[str, list[int]]
Uses a cross-encoder model to re-rank retrieved documents based on query relevance.
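A sketch; the specific cross-encoder model and `top_k` value are assumptions:

```python
from sentence_transformers import CrossEncoder

def re_rank_cross_encoders(documents: list[str], prompt: str) -> tuple[str, list[int]]:
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder.rank(prompt, documents, top_k=3)  # keep the 3 best matches
    relevant_ids = [r["corpus_id"] for r in ranks]
    relevant_text = "\n\n".join(documents[i] for i in relevant_ids)
    return relevant_text, relevant_ids
```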
process_question(prompt: str) -> str
Combines querying, re-ranking, and LLM response generation.
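A sketch composing the functions above; the fallback message is an assumption:

```python
def process_question(prompt: str) -> str:
    results = query_collection(prompt)
    documents = results.get("documents", [[]])[0]  # ChromaDB nests results per query
    if not documents:
        return "No relevant documents found."
    relevant_text, _ = re_rank_cross_encoders(documents, prompt)
    return call_llm(relevant_text, prompt)
```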
create_knowledge_graph(response: str) -> str
Executes the `extract.py` script to generate a knowledge graph based on the LLM output.
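A sketch; the temp-file handoff follows the workflow above, while the return messages are assumptions:

```python
import subprocess
import tempfile

def create_knowledge_graph(response: str) -> str:
    # Save the LLM response to a temporary file, then hand it to extract.py.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(response)
        tmp_path = f.name
    result = subprocess.run(
        ["python", "extract.py", tmp_path], capture_output=True, text=True
    )
    return "Knowledge graph created." if result.returncode == 0 else result.stderr
```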
process_pdf(file_path: str)
Processes an uploaded PDF, extracts text, and adds it to the vector collection.
Gradio Interface
The Gradio-based UI consists of:
- File Upload Section: Users upload a PDF for processing.
- Query Section: Users ask questions related to the uploaded content.
- Knowledge Graph Section: Users can generate a visual representation of extracted knowledge (see the sketch below).
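A sketch of how the three sections might be wired together with Gradio Blocks; the component names and layout are assumptions:

```python
import gradio as gr

with gr.Blocks() as demo:
    # File Upload Section
    pdf = gr.File(label="Upload PDF", file_types=[".pdf"], type="filepath")
    status = gr.Textbox(label="Processing status")
    pdf.upload(process_pdf, inputs=pdf, outputs=status)

    # Query Section
    question = gr.Textbox(label="Ask a question")
    answer = gr.Textbox(label="Answer")
    question.submit(process_question, inputs=question, outputs=answer)

    # Knowledge Graph Section
    graph_btn = gr.Button("Generate knowledge graph")
    graph_status = gr.Textbox(label="Graph status")
    graph_btn.click(create_knowledge_graph, inputs=answer, outputs=graph_status)

demo.launch()
```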
Execution
To run the program:
python gradio-app.py
This launches the Gradio interface, allowing document uploads and question answering.
Error Handling
- If document processing fails, an error message is displayed in the UI.
- If no relevant documents are found for a query, the system returns an appropriate message.
- If knowledge graph generation fails, the error is captured and displayed.