A modular and lightweight RAG (Retrieval-Augmented Generation) agent that processes PDF documents and provides intelligent question-answering capabilities using open-source models.
- PDF Text Extraction: Uses `markitdown` to extract text from PDF documents
- Smart Chunking: Creates overlapping chunks with Pydantic models for structured data
- Semantic Search: Uses lightweight Hugging Face embedding models for vector search
- Vector Storage: ChromaDB for persistent vector storage with cosine similarity
- Text Generation: Free, open-source LLMs from Hugging Face
- Retrieval Display: Shows top-k relevant chunks with similarity scores
- Modular Design: Clean, maintainable code structure
- Interactive CLI: Easy-to-use command-line interface
```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  PDF Document │ --> │   Document    │ --> │  Text Chunks  │
│               │     │   Processor   │     │(with overlap) │
└───────────────┘     └───────────────┘     └───────────────┘
                                                    |
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│   User Query  │ --> │   Embedding   │ --> │    Vector     │
│               │     │     Model     │     │  Embeddings   │
└───────────────┘     └───────────────┘     └───────────────┘
                                                    |
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│     Final     │ <-- │ LLM Generator │ <-- │   ChromaDB    │
│    Response   │     │               │     │(Vector Store) │
└───────────────┘     └───────────────┘     └───────────────┘
```
- Clone or download the repository
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Place your PDF document in the project directory (e.g., `practical_guide_to_building_agents_notes.pdf`)
Run the interactive command-line interface:

```bash
python cli.py
```

Available commands:

- `load <pdf_path>` - Load a PDF document into the system
- `query <question>` - Ask a question about the loaded document
- `stats` - Show database statistics
- `clear` - Clear the database
- `help` - Show available commands
- `quit` - Exit the program
Example session:

```
rag> load practical_guide_to_building_agents_notes.pdf
Processing document: practical_guide_to_building_agents_notes.pdf
Extracted 50000 characters
Split into 25 pages
Created 75 chunks
Successfully loaded 75 chunks

rag> query What are the main concepts in this document?
Query: What are the main concepts in this document?
Response: Based on the document, the main concepts include...
Retrieved 2 chunks:
--- Chunk 1 (Similarity: 0.8234) ---
Page: 5
Content: The fundamental concepts of building agents include...
--- Chunk 2 (Similarity: 0.7891) ---
Page: 12
Content: Key principles for agent development involve...
```
```python
from rag_agent import RAGAgent

# Initialize the RAG agent
rag_agent = RAGAgent()

# Load a document
chunks = rag_agent.load_document("your_document.pdf")

# Query the system
response = rag_agent.query("What are the key points?")
print(f"Response: {response.response}")
print(f"Retrieved {len(response.retrieved_chunks)} relevant chunks")

# Display retrieved chunks
for i, result in enumerate(response.retrieved_chunks):
    print(f"Chunk {i+1} (Score: {result.similarity_score:.4f})")
    print(f"Content: {result.chunk.content[:200]}...")
```

You can customize the models used by the RAG agent:
```python
rag_agent = RAGAgent(
    embedding_model_name="all-MiniLM-L6-v2",    # Fast, lightweight embedding model
    llm_model_name="microsoft/DialoGPT-medium", # Conversational LLM
    collection_name="my_documents",
    persist_directory="./my_vector_db"
)
```

Embedding Models (from Hugging Face):

- `all-MiniLM-L6-v2` - Default, good balance of speed and quality
- `all-MiniLM-L12-v2` - Better quality, slightly slower
- `paraphrase-MiniLM-L6-v2` - Good for paraphrase detection

Language Models (from Hugging Face):

- `microsoft/DialoGPT-medium` - Default, good for conversational responses
- `distilgpt2` - Lighter, faster alternative
- `gpt2` - Standard GPT-2 model
```python
chunks = rag_agent.load_document(
    "document.pdf",
    chunk_size=1000,     # Maximum characters per chunk
    overlap_size=200,    # Overlap between consecutive chunks
    clear_existing=True  # Clear existing data in database
)
```

`document_processor.py`:

- Extracts text from PDFs using `markitdown`
markitdown - Splits text into pages
- Creates overlapping chunks with metadata
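To make the chunking behaviour concrete, here is a minimal sketch of overlapping character-based chunking (illustrative only; metadata fields beyond the content and page number are assumptions, and the actual logic in `document_processor.py` may differ):

```python
from typing import Dict, List

def chunk_page(text: str, page_number: int,
               chunk_size: int = 1000, overlap_size: int = 200) -> List[Dict]:
    """Split one page of text into chunks of at most chunk_size characters,
    where consecutive chunks share overlap_size characters of context."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "content": text[start:end],
            "page_number": page_number,
            "start_char": start,   # hypothetical metadata fields
            "end_char": end,
        })
        if end == len(text):
            break
        start = end - overlap_size  # step back to create the overlap
    return chunks
```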
`embedding_model.py`:

- Loads sentence transformer models from Hugging Face
- Generates vector embeddings for text
- Calculates cosine similarity
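As a point of reference, the core of this component can be sketched with the `sentence-transformers` library (a simplified illustration, not necessarily the exact code in `embedding_model.py`):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight embedding model from Hugging Face
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts):
    """Return L2-normalised embeddings, so a dot product equals cosine similarity."""
    return model.encode(texts, normalize_embeddings=True)

query_vec, chunk_vec = embed([
    "What are the key points?",
    "The fundamental concepts of building agents include...",
])
print(f"Cosine similarity: {float(np.dot(query_vec, chunk_vec)):.4f}")
```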
`vector_database.py`:

- Uses ChromaDB for persistent vector storage
- Implements cosine similarity search
- Manages document chunks and metadata
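A minimal sketch of how ChromaDB can be used this way (the collection name and metadata fields are illustrative assumptions; see `vector_database.py` for the actual implementation):

```python
import chromadb

# Persistent client writes data to disk (the chroma_db/ directory in this project)
client = chromadb.PersistentClient(path="./chroma_db")

# "hnsw:space": "cosine" makes the collection use cosine distance for search
collection = client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"},
)

# Store a chunk: its embedding, its text, and its metadata
collection.add(
    ids=["chunk-1"],
    embeddings=[[0.1, 0.2, 0.3]],  # toy vector; MiniLM embeddings are 384-dimensional
    documents=["The fundamental concepts of building agents include..."],
    metadatas=[{"page_number": 5}],
)

# Retrieve the nearest chunks for a query embedding
results = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=1)
print(results["documents"][0])
print(results["distances"][0])  # cosine distance = 1 - cosine similarity
```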
`llm_generator.py`:

- Loads language models from Hugging Face
- Generates responses based on query and context
- Handles prompt formatting and response cleaning
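The generation step can be sketched with the Hugging Face `transformers` pipeline (a simplified illustration; the prompt format and generation settings in `llm_generator.py` may differ):

```python
from transformers import pipeline

# Load a small, free text-generation model from Hugging Face
generator = pipeline("text-generation", model="distilgpt2")

def generate_answer(query: str, context_chunks: list) -> str:
    """Build a prompt from the retrieved context and return only the new text."""
    context = "\n\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    output = generator(prompt, max_new_tokens=100, do_sample=False)[0]["generated_text"]
    return output[len(prompt):].strip()  # drop the echoed prompt

print(generate_answer(
    "What are the main concepts in this document?",
    ["The fundamental concepts of building agents include..."],
))
```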
`rag_agent.py`:

- Main orchestrator class
- Coordinates all components
- Provides high-level API
`models.py`:

- Pydantic models for type safety
- `DocumentChunk`: Represents a text chunk with metadata
- `RetrievalResult`: Search result with similarity score
- `RAGResponse`: Complete response with metadata
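A rough sketch of what these models could look like (field names not mentioned elsewhere in this README, such as `chunk_index`, are assumptions; `models.py` holds the actual definitions):

```python
from typing import List
from pydantic import BaseModel

class DocumentChunk(BaseModel):
    """A text chunk plus the metadata needed to trace it back to the source PDF."""
    content: str
    page_number: int
    chunk_index: int  # assumed field

class RetrievalResult(BaseModel):
    """One search hit: the chunk and its similarity to the query."""
    chunk: DocumentChunk
    similarity_score: float

class RAGResponse(BaseModel):
    """The generated answer together with the chunks retrieved to produce it."""
    query: str
    response: str
    retrieved_chunks: List[RetrievalResult]
```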
```
RAG Agent/
├── cli.py                 # Interactive command-line interface
├── rag_agent.py           # Main RAG agent orchestrator
├── document_processor.py  # PDF processing and chunking
├── embedding_model.py     # Text embedding functionality
├── vector_database.py     # ChromaDB vector storage
├── llm_generator.py       # Language model for generation
├── models.py              # Pydantic data models
├── requirements.txt       # Python dependencies
├── README.md              # This file
└── chroma_db/             # ChromaDB data directory (auto-created)
```
- Python 3.8+
- Dependencies listed in `requirements.txt`
- At least 2GB RAM (for model loading)
- GPU recommended but not required
- Import errors: Make sure all dependencies are installed:

  ```bash
  pip install -r requirements.txt
  ```

- Memory issues: If models are too large, try smaller alternatives:

  ```python
  # Use lighter models
  rag_agent = RAGAgent(
      embedding_model_name="all-MiniLM-L6-v2",
      llm_model_name="distilgpt2"
  )
  ```

- PDF processing errors: Ensure your PDF is text-based (not scanned images)
- Slow performance: Consider using GPU if available, or reduce chunk sizes
- Use GPU acceleration if available
- Reduce `chunk_size` for faster processing (see the example after this list)
- Use lighter models for better speed
- Increase `overlap_size` for better context retention
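For example, a speed-oriented configuration combining these tips might look like this (the parameter values are illustrative, not tuned recommendations):

```python
# Lighter models plus smaller chunks trade some answer quality for speed
rag_agent = RAGAgent(
    embedding_model_name="all-MiniLM-L6-v2",
    llm_model_name="distilgpt2",
)
chunks = rag_agent.load_document(
    "document.pdf",
    chunk_size=500,    # smaller chunks embed and retrieve faster
    overlap_size=100,  # scale the overlap down with the chunk size
)
```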
Feel free to contribute by:
- Adding support for more document types
- Implementing better chunking strategies
- Adding more embedding/LLM model options
- Improving the user interface
- Adding evaluation metrics
This project is open-source and available under the MIT License.
- markitdown: For PDF text extraction
- ChromaDB: For vector database functionality
- Hugging Face: For pre-trained models
- Sentence Transformers: For embedding models
- Transformers: For language models
- Pydantic: For data validation and modeling