CLI • Web UI • API — Extract structured data from PDFs, compare OCR & document-processing libraries, and benchmark conversion quality before building your RAG or AI pipelines.
Supports: PyMuPDF4LLM • Unstructured • Marker • Docling • Tesseract OCR • More coming soon
PDFStract is a developer toolkit to:
- 🧾 Extract structured text, tables, and metadata from PDFs
- 🔍 Benchmark & compare multiple PDF/OCR extraction libraries side-by-side
- 🧪 Visualize results before integrating into RAG / AI pipelines
- 🌐 Run as a CLI, Web UI, or API service
Use it to choose the best extraction engine for your dataset instead of guessing.
PDFStract can run as a local web UI; it ships with a FastAPI backend and a React frontend.
- 🚀 10+ Conversion Libraries: PyMuPDF4LLM, MarkItDown, Marker, Docling, PaddleOCR, DeepSeek-OCR, Tesseract, MinerU, Unstructured, and more
- 📱 Modern React UI: Beautiful, responsive design with Tailwind CSS
- 💻 Command-Line Interface: Full CLI with batch processing, multi-library comparison, and automation
- 🎯 Multiple Output Formats: Markdown, JSON, and Plain Text
- ⏱️ Performance Benchmarking: Real-time timer shows conversion speed for each library
- 👁️ Live Preview: View converted content with syntax highlighting
- 🔄 Library Status Dashboard: See which libraries are available/unavailable with error messages
- 💾 Easy Download: Download results in your preferred format
- 🐳 Docker Support: One-command deployment
- 🔗 REST API: Programmatic access to conversion features
- ⚡ Batch Processing: Parallel conversion of 100+ PDFs with detailed reporting
- 🌙 Dark Mode Ready: Works seamlessly in light and dark themes
| Library | Version | Type | Status | Notes |
|---|---|---|---|---|
| pymupdf4llm | >=0.0.26 | Text Extraction | Fast | Best for simple PDFs |
| markitdown | >=0.1.2 | Markdown | Balanced | Microsoft's conversion tool |
| marker | >=1.8.1 | Advanced ML | High Quality | Excellent results, slower |
| docling | >=2.41.0 | Document Intelligence | Advanced | IBM's document platform |
| paddleocr | >=3.3.2 | OCR | Accurate | Great for scanned PDFs |
| unstructured | >=0.15.0 | Document Parsing | Smart | Intelligent element extraction |
| deepseekocr | Latest | GPU OCR | Fast (GPU only) | Requires CUDA GPU |
| pytesseract | >=0.3.10 | OCR | Classic | Tesseract-based (requires system binary) |
- Python: 3.11+
- UV: Fast Python package manager
- Node.js: 20+ (for frontend development)
- Docker (optional): For containerized deployment
- Clone the repository:
git clone https://github.com/aksarav/pdfstract.git
cd pdfstract
- Install Python dependencies:
uv sync
- Install frontend dependencies:
cd frontend
npm install
cd ..
Terminal 1: Start the FastAPI Backend
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Terminal 2: Start the React Frontend (Development)
cd frontend
npm run dev
Access the Application:
- Frontend: http://localhost:5173 (with hot-reload)
- Backend API: http://localhost:8000
Note: The frontend development server proxies API calls to the backend at port 8000 (configured in frontend/vite.config.js)
To build the React app for production:
cd frontend
npm run build
This creates an optimized build in frontend/dist/ which gets copied to /static by the Docker build process.
docker-compose up --build
The application will be available at http://localhost:8000
PDFStract includes a powerful CLI for batch processing and automation.
# List available libraries
pdfstract libs
# Convert a single PDF
pdfstract convert document.pdf --library unstructured --output result.md
# Compare multiple libraries on one PDF
pdfstract compare sample.pdf -l unstructured -l marker -l pymupdf4llm --output ./comparison
# Batch convert 100+ PDFs in parallel
pdfstract batch ./documents --library unstructured --output ./converted --parallel 4
# Test which library works best on your corpus
pdfstract batch-compare ./papers -l marker -l unstructured --max-files 50 --output ./test
✨ Full Features:
- Single file conversion
- Multi-library comparison
- Parallel batch processing (1-16 workers)
- Batch quality testing across corpus
- JSON reporting with detailed statistics
- Error handling and retry options
- Progress indicators and rich formatting
📊 Batch Processing:
- Convert 1000+ PDFs with parallel workers
- Detailed JSON reports (success rate, per-file status)
- Automatic error handling and logging
- Perfect for production jobs and legacy migrations
→ Full CLI documentation: see CLI_README.md for the complete guide with real-world examples
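The batch command shown earlier can also be driven from a Python script, for example inside a scheduled job. The following is a minimal sketch, not part of PDFStract itself: the helper names `build_batch_command` and `run_batch` are hypothetical, and it uses only the `batch` flags that appear in this README (`--library`, `--output`, `--parallel`). The meaning of the process exit code is not documented here, so the sketch simply returns it.

```python
import shlex
import subprocess


def build_batch_command(input_dir, output_dir, library="unstructured", parallel=4):
    """Build the argv list for `pdfstract batch` using flags shown in this README."""
    if not 1 <= parallel <= 16:
        # The feature list states a range of 1-16 parallel workers.
        raise ValueError("parallel must be between 1 and 16")
    return [
        "pdfstract", "batch", str(input_dir),
        "--library", library,
        "--output", str(output_dir),
        "--parallel", str(parallel),
    ]


def run_batch(input_dir, output_dir, library="unstructured", parallel=4):
    """Run the batch job and return the process exit code (hypothetical helper)."""
    cmd = build_batch_command(input_dir, output_dir, library, parallel)
    print("Running:", shlex.join(cmd))
    # check=False: let the caller decide how to react to a non-zero exit code.
    return subprocess.run(cmd, check=False).returncode
```

Building the command as a list (rather than a shell string) avoids quoting problems when directory names contain spaces.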
Check available libraries:
curl http://localhost:8000/libraries
Response:
{
"libraries": [
{
"name": "pymupdf4llm",
"available": true,
"error": null
},
{
"name": "deepseekocr",
"available": false,
"error": "GPU required but not available"
}
]
}
Convert a PDF:
curl -X POST \
-F "file=@sample.pdf" \
-F "library=unstructured" \
-F "output_format=markdown" \
http://localhost:8000/convert
Response:
{
"success": true,
"library_used": "unstructured",
"filename": "sample.pdf",
"format": "markdown",
"content": "# Document Title\n\n... extracted markdown ..."
}
For Batch Processing: Use the CLI instead
pdfstract batch ./documents --library unstructured --output ./converted --parallel 4
Advantages of CLI for batch jobs:
- Parallel processing with configurable workers
- JSON report with statistics (success rate, per-file status)
- Error handling and retry options
- Perfect for production automation
- See CLI_README.md for full batch documentation
| Endpoint | Method | Description | Parameters |
|---|---|---|---|
| / | GET | Web interface | - |
| /health | GET | Health check | - |
| /libraries | GET | List available libraries | - |
| /convert | POST | Convert PDF | file, library, output_format |
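For callers who prefer Python over curl, the `/libraries` and `/convert` endpoints can be reached with the standard library alone. This is a rough sketch under stated assumptions: the function names are hypothetical, the response shapes are taken from the JSON examples in this README, and the multipart body is assembled by hand, so a real client may prefer an HTTP library that handles file uploads.

```python
import json
import os
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # backend address used throughout this README


def available_libraries(payload):
    """Return names of libraries marked available in a /libraries response."""
    return [lib["name"] for lib in payload["libraries"] if lib["available"]]


def encode_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode text fields plus one part named 'file' as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_pdf(path, library="unstructured", output_format="markdown", base_url=BASE_URL):
    """POST a PDF to /convert and return the parsed JSON response."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart(
            {"library": library, "output_format": output_format},
            file_name=os.path.basename(path),
            file_bytes=fh.read(),
        )
    request = urllib.request.Request(
        f"{base_url}/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the backend running, `convert_pdf("sample.pdf")["content"]` should hold the extracted text, assuming the response matches the example shown earlier.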
Use the built-in timer feature to benchmark:
| Library | Speed | Quality | Best For |
|---|---|---|---|
| pymupdf4llm | ⚡⚡⚡ | ⭐⭐ | Simple text extraction |
| unstructured | ⚡⚡ | ⭐⭐⭐ | Complex layouts |
| markitdown | ⚡⚡ | ⭐⭐⭐ | Balanced performance |
| marker | ⚡ | ⭐⭐⭐⭐ | Highest quality (ML-based) |
| docling | ⚡ | ⭐⭐⭐⭐ | Document intelligence |
| paddleocr | ⚡ | ⭐⭐⭐ | Scanned PDFs |
| deepseekocr | ⚡ | ⭐⭐⭐ | Scanned PDFs |
| pytesseract | ⚡ | ⭐⭐⭐ | Scanned PDFs |
NOTE: These ratings reflect each library running with the application's default settings. Actual speed and quality vary with PDF complexity and library configuration.
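Because results depend on your own documents and settings, it can help to time candidate libraries on a representative PDF. The sketch below is a generic timing harness, not PDFStract's built-in timer: `benchmark` and `rank_by_speed` are hypothetical names, and each converter is assumed to be a plain callable that takes a file path.

```python
import time


def benchmark(converters, pdf_path, repeats=3):
    """Time each converter on one PDF; return {name: best wall-clock seconds}."""
    results = {}
    for name, convert in converters.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            convert(pdf_path)
            timings.append(time.perf_counter() - start)
        # Keep the best run to reduce noise from caching and background load.
        results[name] = min(timings)
    return results


def rank_by_speed(results):
    """Return library names ordered from fastest to slowest."""
    return sorted(results, key=results.get)
```

Speed is only half the picture: inspect the output of each library as well, since the fastest converter may not produce the most usable text.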
- File uploads are stored temporarily and deleted after conversion
- No data is persisted or logged
- Use HTTPS in production
- API endpoints are not authenticated (add authentication for production)
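Since the endpoints ship without authentication, a production deployment needs some form of access check in front of them. One simple option is a shared API key; the sketch below shows only the comparison step and makes several assumptions: the `X-API-Key` header name and the `is_authorized` helper are hypothetical, and wiring it into the FastAPI app (for example as a dependency) is left out.

```python
import hmac


def is_authorized(headers, expected_key):
    """Return True when the X-API-Key header matches the expected key.

    `headers` is assumed to be a plain dict, so the lookup is case-sensitive.
    """
    if not expected_key:
        # Refuse all requests when no key is configured.
        return False
    supplied = headers.get("X-API-Key", "")
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

Load the expected key from configuration or an environment variable rather than hard-coding it, and serve the API over HTTPS so the key is not sent in clear text.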
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
If you encounter issues or have questions, please open an issue on the repository.
- FastAPI: Modern Python web framework
- React: UI library
- Tailwind CSS: Utility-first CSS framework
- Lucide Icons: Beautiful icon library
- All the amazing PDF extraction libraries (PyMuPDF, Marker, Docling, etc.)
**Made with ❤️ for PDF enthusiasts**


