A powerful pipeline for extracting, processing, and querying structured data from PDF documents using Azure Document Intelligence and local LLMs.
- Document Analysis: Extract text, tables, and layout information from PDFs using Azure Document Intelligence
- Structured Output: Save extracted data in both Excel and JSON formats
- Natural Language Querying: Ask questions about your documents using local LLMs via Ollama
- RAG Integration: Implements Retrieval-Augmented Generation for accurate question answering
- Python 3.8+
- Azure subscription with Document Intelligence service
- Ollama installed and running locally
- Hugging Face account and API key
- Clone the repository:

      git clone https://github.com/yourusername/document-intelligence-pipeline.git
      cd document-intelligence-pipeline

- Create and activate a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows use: .\venv\Scripts\activate

- Install the required packages:

      pip install -r requirements.txt

- Install Ollama and download the model:

      # Install Ollama from https://ollama.ai/
      ollama pull gemma3:1b  # or your preferred model
- Create a `.env` file in the project root with your credentials:

      AZURE_ENDPOINT=your_azure_endpoint
      AZURE_KEY=your_azure_key
      PDF_FILE_PATH=path/to/your/document.pdf
      HUGGINGFACEHUB_API=your_huggingface_api_key
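A missing or misspelled variable in `.env` tends to surface later as an obscure authentication error, so it can help to check the four settings up front. The following is a minimal sketch, not the project's actual code: `read_settings` and `REQUIRED_VARS` are hypothetical names, and the function assumes the values are already in the process environment (for example after calling `load_dotenv()` from python-dotenv, which is listed in the dependencies).

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ("AZURE_ENDPOINT", "AZURE_KEY", "PDF_FILE_PATH", "HUGGINGFACEHUB_API")


def read_settings(env=None):
    """Return the required settings as a dict, or raise if any are missing.

    `env` defaults to os.environ; pass a plain dict to try the function
    without touching the real environment.
    """
    env = os.environ if env is None else env
    # Treat empty strings the same as absent variables.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing settings in .env: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `read_settings()` at the top of a script fails fast with a message that names every missing variable at once.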

Run `python analyze_layout.py` to process your PDF and generate:

- `document_analysis.xlsx`: Excel file with tables and page information
- `tables_data.json`: Structured JSON of extracted tables
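Azure Document Intelligence reports each table as a row count, a column count, and a flat list of cells, each carrying a row index, a column index, and its text content. Turning that into rows is the step that makes a sheet or a JSON table possible. The sketch below illustrates the idea under stated assumptions: `cells_to_grid` is a hypothetical helper, the cells are plain dicts whose keys mirror the SDK attribute names (`row_index`, `column_index`, `content`), merged cells are ignored, and the actual layout of `tables_data.json` may differ.

```python
def cells_to_grid(row_count, column_count, cells):
    """Arrange a flat list of table cells into a row-major grid of strings."""
    # Start with an empty grid so cells the service did not report stay blank.
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    return grid


# Hypothetical cells for a 2x2 table (illustrative data only).
example_cells = [
    {"row_index": 0, "column_index": 0, "content": "Item"},
    {"row_index": 0, "column_index": 1, "content": "Price"},
    {"row_index": 1, "column_index": 0, "content": "Widget"},
    {"row_index": 1, "column_index": 1, "content": "4.50"},
]
grid = cells_to_grid(2, 2, example_cells)
```

Each inner list of `grid` is one table row, which maps directly onto a worksheet row or a JSON array.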

Run `python llm_prompt.py` to process the questions from `std_queries.py` and save the answers to `output_file.json`.
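Retrieval-Augmented Generation comes down to two steps before the model is called: pick the document chunks most relevant to the question, then place them in the prompt as context. The sketch below shows that flow using only the standard library. It is not the code in `llm_prompt.py`: the function names are hypothetical, and plain word-count vectors stand in for the dense embeddings a library such as sentence-transformers would provide.

```python
import math
import re
from collections import Counter


def _vector(text):
    """Toy embedding: counts of lowercase alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = _vector(question)
    ranked = sorted(chunks, key=lambda c: _cosine(q, _vector(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Combine the retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The string returned by `build_prompt` is what would be sent to the local model served by Ollama. Replacing `_vector` and `_cosine` with real embeddings changes the ranking quality but not the overall shape of the flow.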

- `analyze_layout.py` - Extracts text and tables from PDFs
- `llm_prompt.py` - Implements RAG-based question answering
- `std_queries.py` - Contains predefined questions for document querying
- `requirements.txt` - Python dependencies
- `.env.example` - Example environment configuration
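The split between `std_queries.py` (the questions) and `llm_prompt.py` (the answering logic) suggests a simple batch loop: answer each predefined question and write all results to one JSON file. The following is a rough sketch of that loop, not the project's implementation. `answer_all` is a hypothetical name, `answer_fn` stands for whatever function produces an answer from a question, and the question/answer record layout is an assumption about `output_file.json`.

```python
import json


def answer_all(questions, answer_fn, output_path):
    """Answer each question and write the results to output_path as JSON."""
    # One record per question keeps the file easy to consume elsewhere.
    results = [{"question": q, "answer": answer_fn(q)} for q in questions]
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2, ensure_ascii=False)
    return results
```

Passing the answering function in as a parameter keeps the loop independent of the model, so it can be exercised with a stub before Ollama is involved.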
- azure-ai-documentintelligence
- PyPDF2
- openpyxl
- python-dotenv
- sentence-transformers
- torch
- transformers
- huggingface-hub
- `document_analysis.xlsx`: a sheet for each table found, plus document metadata and layout information
- `tables_data.json`: structured, machine-readable data organized by tables and document sections
- `output_file.json`: answers to the standard questions in JSON format, easy to integrate with other applications