A desktop application for automated data extraction from various source files (images, PDFs) and compilation into Excel worksheets.
A comprehensive solution for healthcare organizations to automate data extraction from various source files and compile results into properly formatted Excel worksheets. The application provides an intuitive interface that enables users to configure extraction parameters without source code modifications.
- PyQt6 Interface - Responsive user interface with real-time processing feedback
- Configurable Extraction Rules - User-defined patterns and data transformations
- Multi-format Support - Image processing via OCR and PDF text extraction
- Excel Output Generation - Automated column generation based on extraction rules
- Progress Monitoring - Status tracking, progress indicators, and detailed activity logging
- Localization Support - Built-in support for Turkish language data mapping
- Concurrent Processing - Multi-threaded extraction for enhanced performance
- File Structure Navigation - Automated handling of nested directory structures
Screenshots: Main Interface, Manage Rule Creation, Edit Rule Creation, Test Rule Creation.
cde-gui/
├── src/
│ ├── ui/ # PyQt6 user interface components
│ │ ├── main_window.py # Main application window
│ │ └── settings_window.py # Rules management window
│ ├── core/ # Core data processing logic
│ │ ├── text_extractor.py # OCR and PDF text extraction
│ │ ├── data_processor.py # Data processing and Excel generation
│ │ └── extraction_engine.py # Main extraction coordinator
│ └── utils/ # Utility functions and helpers
│ ├── config_manager.py # Configuration management
│ ├── data_transformer.py # Data transformation logic
│ └── file_navigator.py # File system navigation
├── config/ # Configuration files
│ ├── app_config.json # Application settings
│ └── default_rules.json # Default extraction rules
├── main.py # Application entry point
└── requirements.txt # Python dependencies
`pip install -r requirements.txt`
- Windows: Download the Tesseract OCR installer from UB Mannheim
- Add the installation directory to the system PATH
`python main.py`

- Application Launch
  - `python main.py`
- Input Configuration
  - Select the root data directory using the "Browse" button
  - Specify the subject list file (.txt format) containing subject identifiers
- Target File Configuration
  - Define the target filename pattern (e.g., `A_RAPOR_1.jpg`, `summary.pdf`)
- Extraction Rules Management
  - Open the "Manage Rules" interface to configure data extraction patterns
  - Configure, test, and validate regex patterns
  - Apply data transformation rules as required
- Processing Execution
  - Start the extraction process via "Start Extraction"
  - Monitor real-time progress and status updates
  - Review detailed processing logs
- Results Export
  - Generate the Excel output using "Export to Excel"
  - Specify the output location and filename
Root Folder/
├── SubjectID_PatientName/
│ ├── 1/
│ │ ├── target_file.jpg
│ │ └── other_files...
│ ├── 2/
│ │ └── target_file.jpg
│ └── ...
├── AnotherSubject_Name/
│ └── 1/
│ └── target_file.jpg
└── ...
Subject identifier file format (one ID per line):
001
002
003
SUBJ_123
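The folder layout and subject list above can be walked with a few lines of standard-library Python. This is a minimal sketch, not the application's own `file_navigator.py`: the function names `read_subject_ids` and `find_target_files` are hypothetical, and it assumes subject folders are named `<SubjectID>_<anything>` with target files sitting one level down in the numbered subfolders.

```python
from pathlib import Path


def read_subject_ids(list_file):
    """Return the non-empty lines of the subject list file (one ID per line)."""
    text = Path(list_file).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def find_target_files(root, subject_ids, target_name):
    """Map each subject ID to the target files found under its folder.

    Assumption of this sketch: folders are named "<SubjectID>_<anything>"
    and target files live directly inside numbered subfolders (1/, 2/, ...).
    """
    root = Path(root)
    found = {}
    for subject_id in subject_ids:
        matches = []
        for folder in sorted(root.glob(f"{subject_id}_*")):
            if folder.is_dir():
                # "*/target" looks exactly one directory level below the subject folder.
                matches.extend(sorted(folder.glob(f"*/{target_name}")))
        found[subject_id] = matches
    return found
```

A subject ID with no matching folder maps to an empty list, which makes missing subjects easy to report.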
Extraction rules are configured through the management interface and consist of:
- Field Name: Output column identifier (e.g., "Age", "Gender")
- Search Pattern: Regular expression with a capture group (e.g., `Age\s*:\s*([\d.]+)`)
- Transformation: Data transformation options:
  - `none` - No transformation applied
  - `age_round` - Age rounding (rounds up when the decimal part exceeds 0.50)
  - `gender_turkish` - Turkish-to-English gender term mapping
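To make the three rule fields concrete, here is a rough sketch of how a single rule could be applied to extracted text. It is an illustration under assumptions, not the application's `data_transformer.py`: `apply_rule`, `TRANSFORMS`, and the contents of `GENDER_MAP` are hypothetical, and the rounding follows the "up if decimal > 0.50" description literally.

```python
import math
import re

# Assumed term list for illustration only; the application's real mapping may differ.
GENDER_MAP = {"erkek": "Male", "kadın": "Female", "kadin": "Female"}


def age_round(value):
    """Round up only when the fractional part is greater than 0.50."""
    number = float(value)
    whole = math.floor(number)
    return whole + 1 if number - whole > 0.5 else whole


def gender_turkish(value):
    """Translate a Turkish gender term; unknown terms pass through unchanged."""
    cleaned = value.strip()
    return GENDER_MAP.get(cleaned.lower(), cleaned)


TRANSFORMS = {
    "none": lambda value: value,
    "age_round": age_round,
    "gender_turkish": gender_turkish,
}


def apply_rule(rule, text):
    """Return the transformed first capture group, or None if the pattern does not match."""
    match = re.search(rule["pattern"], text)
    if match is None:
        return None
    return TRANSFORMS[rule["transform"]](match.group(1))


if __name__ == "__main__":
    sample = "Age: 34.6\nGender: Kadın"
    age_rule = {"name": "Age", "pattern": r"Age\s*:\s*([\d.]+)", "transform": "age_round"}
    print(apply_rule(age_rule, sample))  # 35
```

Returning `None` for a non-match keeps failed extractions distinguishable from empty strings later on.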
[
{
"name": "Age",
"pattern": "Age\\s*:\\s*([\\d.]+)",
"transform": "age_round"
},
{
"name": "Gender",
"pattern": "Gender\\s*:\\s*(\\w+)",
"transform": "gender_turkish"
},
{
"name": "Date of Test",
"pattern": "(?:Date of Test|Test Date|Date)\\s*:\\s*([\\d\\-\\/\\.]+)",
"transform": "none"
},
{
"name": "Clinician",
"pattern": "(?:Clinician|Doctor|Physician|Dr\\.)\\s*:?\\s*([A-Za-z\\s\\.]+)",
"transform": "none"
}
]

Configuration options in `config/app_config.json`:
- Window dimensions and interface settings
- Tesseract OCR configuration parameters
- Supported file format specifications
- Processing performance optimization settings
Excel output includes:
- Data Sheet: Extracted information with configurable column structure
- Summary Sheet: Processing statistics and success metrics
- Formatted Headers: Professional styling with optimized column widths
- Data Validation: Clear distinction between successful and failed extractions
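As a rough idea of what the Summary Sheet statistics might contain, the sketch below counts successful and failed extractions per field. The function name `summarize`, the column labels, and the convention that `None` marks a failed extraction are all assumptions of this example, not documented behavior of the application.

```python
def summarize(records, fields):
    """Build summary rows with per-field success and failure counts.

    Assumption of this sketch: a value of None marks a failed extraction.
    """
    rows = [("Field", "Extracted", "Failed", "Success rate (%)")]
    total = len(records)
    for field in fields:
        ok = sum(1 for record in records if record.get(field) is not None)
        rate = round(100 * ok / total, 1) if total else 0.0
        rows.append((field, ok, total - ok, rate))
    return rows


if __name__ == "__main__":
    records = [
        {"Age": 35, "Gender": "Female"},
        {"Age": None, "Gender": "Male"},
    ]
    for row in summarize(records, ["Age", "Gender"]):
        print(row)
```

Each tuple is already in row order, so with openpyxl the rows could be written to a worksheet one at a time using `Worksheet.append`.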
- Dependency Errors
  - Verify installation: `pip install -r requirements.txt`
  - Confirm Python version compatibility (3.8+)
- Tesseract OCR Issues
  - Ensure Tesseract OCR is installed
  - Verify the system PATH configuration or set the `TESSDATA_PREFIX` environment variable
- File Access Problems
  - Confirm read permissions for data directories
  - Verify write permissions for output locations
- OCR Accuracy Issues
  - Check source image quality and resolution
  - Adjust Tesseract configuration parameters
  - Consider image preprocessing to improve recognition
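The Tesseract checks above can be scripted with the standard library. This is a small diagnostic sketch; the helper name `check_tesseract` is hypothetical, and it only inspects the environment without calling Tesseract itself.

```python
import os
import shutil


def check_tesseract():
    """Report where the tesseract executable is and whether TESSDATA_PREFIX is set."""
    return {
        # shutil.which returns None when "tesseract" is not found on PATH.
        "executable": shutil.which("tesseract"),
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    status = check_tesseract()
    print("tesseract executable:", status["executable"] or "not found on PATH")
    print("TESSDATA_PREFIX:", status["tessdata_prefix"] or "not set")
```

If the executable is reported as not found, the installation directory is most likely missing from the system PATH.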
- Python 3.8+
- PyQt6
- pytesseract (requires Tesseract OCR)
- PyMuPDF
- openpyxl
- pandas
- Pillow
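A quick way to confirm the requirements above is to check the interpreter version and whether each package can be located. The sketch below is an assumption-labeled helper (`check_environment` is not part of the project); note that PyMuPDF is imported as `fitz` and Pillow as `PIL`.

```python
import sys
from importlib.util import find_spec

# Package name -> import name (they differ for PyMuPDF and Pillow).
REQUIRED_MODULES = {
    "PyQt6": "PyQt6",
    "pytesseract": "pytesseract",
    "PyMuPDF": "fitz",
    "openpyxl": "openpyxl",
    "pandas": "pandas",
    "Pillow": "PIL",
}


def check_environment():
    """Return a report of the Python version check and locatable packages."""
    report = {"python_ok": sys.version_info >= (3, 8)}
    for package, module in REQUIRED_MODULES.items():
        # find_spec locates a top-level module without importing it.
        report[package] = find_spec(module) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

Any entry reported as missing can usually be resolved by re-running `pip install -r requirements.txt` in the active environment.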
The application provides enterprise-grade features for healthcare environments:
- Intuitive Interface for non-technical personnel
- Configurable Processing Rules without code modification requirements
- Batch Processing Capabilities for high-volume workflows
- Comprehensive Audit Logging for compliance requirements
- Excel Compatibility with existing healthcare information systems
- Localization Support for international deployments
For technical assistance:
- Review application activity logs
- Analyze console output for error details
- Verify file permissions and directory structure compliance



