govdoc-scanner

📚 Documentation

For detailed documentation, including setup guides, API references, contribution guidelines, and GSoC 2025 information, visit our comprehensive documentation site.

Project Overview

The Problem

In Greece, essential public company data is spread across thousands of unstructured documents on the Γ.Ε.ΜΗ. (GEMI, the General Commercial Registry) portal. This creates significant barriers for:

Citizens seeking transparency in corporate activities
Researchers analyzing business trends and economic patterns
Policymakers requiring data-driven insights for legislation
Journalists investigating corporate structures and ownership

The current format limits transparency and makes systematic analysis nearly impossible.

The Solution

GovDoc Scanner is an open-source tool designed to convert unstructured GEMI portal PDFs into a fully searchable database accessible via a REST API. It automates the complete document processing pipeline with AI-powered extraction and production-ready infrastructure:

Smart Crawling: Automated document discovery and download from GEMI portal with advanced filtering
AI Extraction: Google Gemini 2.5 Flash processes Greek legal documents with specialized prompts
Structured Data: Comprehensive metadata extraction including representatives, ownership, and change tracking
Full-Text Search: OpenSearch integration with Greek language analyzers for powerful querying
REST API: Production-ready server with authentication, rate limiting, and comprehensive documentation
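
As an illustration of the Full-Text Search item above, the sketch below shows what an index definition using OpenSearch's built-in greek language analyzer might look like. The field names (company_name, gemi_id) are assumptions made for this example and are not taken from the project's actual index schema.

```javascript
// Hypothetical index definition; the field names are illustrative only
// and are not taken from the project's actual index schema.
const companyIndex = {
  mappings: {
    properties: {
      // Free-text field analyzed with OpenSearch's built-in "greek" analyzer
      company_name: { type: "text", analyzer: "greek" },
      // Exact-match identifier, stored without analysis
      gemi_id: { type: "keyword" },
    },
  },
};

// JSON body that could be sent when creating the index
console.log(JSON.stringify(companyIndex, null, 2));
```

A language analyzer applies stemming and stop-word removal suited to Greek, so a query for one inflected form of a word can match documents containing another.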

Current Functionality/Implementation

The repository currently includes five main applications:

cli: A unified command-line interface that orchestrates the complete workflow, combining crawling and scanning with interactive prompts and automated batch processing (recommended for most users).
doc-scanner: Processes .pdf, .doc and .docx documents for a given GEMI company, extracting comprehensive metadata with chronological processing and intelligent representative tracking using Gemini 2.5 Flash Lite.
crawler: Scrapes the GEMI portal to search for companies using advanced filters and downloads all available public documents with enhanced date extraction, intelligent file management, and robust retry mechanisms.
api: Fastify-based REST API server providing search endpoints for companies and representatives with OpenSearch integration.
opensearch: Complete OpenSearch integration with development and production configurations for searchable data indexing.

All applications are organized under the apps/ directory for better project structure and maintainability.

All tools are implemented in Node.js and use a combination of CLI interfaces and automated scripts. The project uses npm workspaces for managing multiple applications.
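
The crawler description above mentions retry mechanisms without detailing them. One common approach is retrying with exponential backoff; the sketch below shows that technique under assumed defaults. The names withRetry and backoffDelay, the attempt count, and the delays are hypothetical and do not describe the crawler's actual implementation.

```javascript
// Delay before retrying after the given attempt (1-based):
// baseMs * 2^(attempt - 1), capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

// Resolve after ms milliseconds.
function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: run an async task, retrying on failure
// until it succeeds or maxAttempts attempts have been made.
async function withRetry(task, maxAttempts = 3, sleep = defaultSleep) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one
      if (attempt < maxAttempts) await sleep(backoffDelay(attempt));
    }
  }
  throw lastError;
}
```

A download step could then be wrapped as withRetry(() => downloadDocument(url)), where downloadDocument stands in for whatever function performs the request.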

Usage Instructions

Requirements

Node.js: v18.x or newer (recommended: v20.x). Check your version with:

```
node --version
```

Docker & Docker Compose (optional, for OpenSearch and the REST API): Required only if you use the OpenSearch integration for search and analytics.
.env file: Copy the example environment file and update it with your Gemini API key:

```
cp .env.example .env
```

Then, open .env and set:

```
GEMINI_API_KEY=your_gemini_api_key_here

# Optional: Set custom working directory (default: ~/.govdoc)
WORKING_DIR=~/.govdoc
```
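
The requirements above can be checked before a run. The following is a minimal sketch of such a preflight check, assuming the values from .env have already been loaded into process.env. The helper names preflight and expandHome are hypothetical and are not part of the project.

```javascript
const os = require("node:os");
const path = require("node:path");

// Expand a leading "~" to the current user's home directory.
function expandHome(p) {
  if (p === "~") return os.homedir();
  if (p.startsWith("~/")) return path.join(os.homedir(), p.slice(2));
  return p;
}

// Hypothetical helper: validate the environment described above
// and resolve the working directory (default: ~/.govdoc).
function preflight(env = process.env, nodeVersion = process.versions.node) {
  const major = Number(nodeVersion.split(".")[0]);
  if (major < 18) {
    throw new Error(`Node.js 18 or newer is required, found ${nodeVersion}`);
  }
  if (!env.GEMINI_API_KEY) {
    throw new Error("GEMINI_API_KEY is not set; add it to your .env file");
  }
  const workingDir = path.resolve(expandHome(env.WORKING_DIR || "~/.govdoc"));
  return { workingDir };
}
```

Passing env and nodeVersion as parameters keeps the check easy to exercise without touching the real environment.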

Quick Start

Install Dependencies
```
npm install
```
Run the Tool (choose one of the following)

Most Common Usage - Interactive Workflow:
```
npm start govdoc
```
This runs an interactive CLI that guides you through the complete workflow. Use -- to pass arguments through to the CLI (for example, npm start govdoc -- --help).

Just Search & Download Documents:
```
npm start crawler
```
Just Process Existing Documents:
```
npm start scanner
```
Get Help:
```
npm start help
```

Detailed Usage

1. Interactive Workflow (Recommended)

```
npm start govdoc
```

This launches an interactive CLI that guides you through the process:

File Input: Process GEMI IDs from a .gds file
Manual Entry: Enter specific GEMI IDs directly
Random Selection: Process random companies with date-based search filters

2. Command Line Usage (for automation)

```
# Process from file
npm start govdoc -- --input ./companies.gds

# Process random companies
npm start govdoc -- --company-random 10

# Show help
npm start govdoc -- --help
```

The command line mode:

Runs without interactive prompts, which makes it suitable for scripts and scheduled jobs
Accepts the same input methods as interactive mode
Provides the same processing and output capabilities

Both modes:

Show clear progress tracking with visual indicators
Provide comprehensive summary when complete
Save output in organized working directories (default: ~/.govdoc/)
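
To make the split between interactive and command-line mode concrete, here is a sketch of how the flags shown above could be parsed with parseArgs from node:util (available in Node.js 18.3 and newer). The function parseCliArgs and the returned mode objects are assumptions for illustration; the project's real argument handling may differ.

```javascript
const { parseArgs } = require("node:util");

// Hypothetical parser for the flags shown above; not the project's actual code.
function parseCliArgs(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      input: { type: "string" },
      "company-random": { type: "string" },
      help: { type: "boolean" },
    },
  });

  if (values.help) return { mode: "help" };
  if (values.input) return { mode: "file", file: values.input };

  if (values["company-random"] !== undefined) {
    const count = Number.parseInt(values["company-random"], 10);
    if (!Number.isInteger(count) || count <= 0) {
      throw new Error("--company-random expects a positive integer");
    }
    return { mode: "random", count };
  }

  // No flags given: fall back to the interactive prompts
  return { mode: "interactive" };
}
```

For example, parseCliArgs(process.argv.slice(2)) would yield a file mode object for --input ./companies.gds. Unknown flags raise an error because parseArgs is strict by default.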

3. Manual Workflow

If you prefer to run each step separately, set LOG_LEVEL=DEBUG when running the individual apps to get detailed output:

Step 1: Search & Download

```
npm start crawler
```

Use the interactive CLI to search for companies or download documents by GEMI ID(s)
Results are saved in ~/.govdoc/crawler/downloads/{GEMI_ID}/ (or custom WORKING_DIR)

Step 2: Process Documents

```
npm start scanner
```

Place documents in ~/.govdoc/doc-scanner/input/{GEMI_ID}/ (or custom WORKING_DIR)
Output is generated in ~/.govdoc/doc-scanner/output/{GEMI_ID}/
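
When running the two steps by hand, the crawler's downloads need to end up in the scanner's input folder. The sketch below builds the directory paths listed above and copies one company's downloads across. The helpers gemiPaths and stageForScanner are hypothetical conveniences, not part of the project, and assume the scanner does not pick up the crawler's output on its own.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Directory layout as described above, relative to the working directory.
function gemiPaths(workingDir, gemiId) {
  return {
    downloads: path.join(workingDir, "crawler", "downloads", gemiId),
    scannerInput: path.join(workingDir, "doc-scanner", "input", gemiId),
    scannerOutput: path.join(workingDir, "doc-scanner", "output", gemiId),
  };
}

// Hypothetical helper: copy a company's downloaded documents
// into the scanner's input folder and return that folder's path.
function stageForScanner(workingDir, gemiId) {
  const { downloads, scannerInput } = gemiPaths(workingDir, gemiId);
  fs.mkdirSync(scannerInput, { recursive: true });
  fs.cpSync(downloads, scannerInput, { recursive: true });
  return scannerInput;
}
```

Copying rather than moving leaves the crawler's download folder intact, so a later crawl can still see which files it already has.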

Alternative Commands

You can also run commands directly:

npm run crawler (same as npm start crawler)
npm run scanner (same as npm start scanner)
npm run govdoc (same as npm start govdoc)

OpenSearch + REST API integration

OpenSearch quick setup: see apps/opensearch/README.md
OpenSearch detailed guide: OpenSearch Installation Documentation
REST API quick setup: see apps/api/README.md
REST API detailed guide: REST API Installation Documentation

Features Offered

Unified CLI Tool: Complete end-to-end workflow with both interactive and command-line modes for different use cases and automation needs.
Automated Document Downloading: Bulk or single download of all public documents for any Greek company listed in GEMI with enhanced date extraction for proper filename organization.
Advanced Company Search: Filter by name, legal type, status, location, and more.
Intelligent Metadata Extraction: Uses Gemini 2.5 Flash Lite for accurate extraction of company information, representative details, and ownership data from Greek legal documents.
Chronological Processing: Processes documents in date order to track company evolution and changes over time.
Representative Tracking: Accurately identifies company representatives, their active status, and ownership percentages with advanced duplicate prevention.
Change Tracking: Automatically summarizes significant changes between document versions, including role changes, ownership transfers, and address updates with intelligent processing optimization.
Incremental Processing: Skip processing when metadata indicates all documents are up to date, reducing unnecessary API calls and processing time.
Greek Legal Optimization: Specialized for Greek corporate legal terminology and GEMI document structures.
Enhanced Reliability: Robust retry mechanisms and improved error handling for stable operation.
Interactive CLI: User-friendly command-line interfaces with guided prompts for all workflows.
Multiple Input Methods: Support for file input, manual entry, and random selection with date-based search filters.
Progress Tracking: Unified progress bar and summary for batch operations.
OpenSearch Integration: Optional integration with OpenSearch 3.1+ for full-text search, analytics, and data visualization with automated index management and bulk operations.
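
The Chronological Processing and Incremental Processing features above can be sketched together. The example assumes, purely for illustration, that each file name begins with an ISO date (such as 2021-03-15_statute.pdf) and that previously processed files are recorded in a metadata object with a processedFiles array. Neither the naming scheme nor the metadata shape is confirmed by the project.

```javascript
// Assumption: a file name starts with an ISO date, e.g. "2021-03-15_statute.pdf".
function dateOf(fileName) {
  const match = /^(\d{4}-\d{2}-\d{2})/.exec(fileName);
  return match ? match[1] : null;
}

// Oldest first; undated files go last, ties keep their original order.
function chronological(fileNames) {
  return [...fileNames].sort((a, b) => {
    const da = dateOf(a) ?? "9999-99-99";
    const db = dateOf(b) ?? "9999-99-99";
    return da < db ? -1 : da > db ? 1 : 0;
  });
}

// Hypothetical metadata shape: { processedFiles: ["..."] }.
// Returns the files that still need processing, in date order.
function pendingFiles(fileNames, metadata) {
  const done = new Set(metadata?.processedFiles ?? []);
  return chronological(fileNames).filter((name) => !done.has(name));
}
```

An empty result from pendingFiles means everything is up to date, so the run can skip the extraction calls for that company.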

Documentation

This project includes comprehensive documentation built with Docusaurus. The documentation provides:

Getting Started Guide: Step-by-step setup and usage instructions
Development Setup: Detailed guide for contributors and developers
Code Examples: Practical examples for each application component
GSoC 2025 Overview: Project background and future roadmap

Accessing the Documentation

Online: Visit the project documentation site

Local Development: To run the documentation locally:

```
cd docs-site
npm install
npm start
```

The documentation site will be available at http://localhost:3000 with live reloading for development.
