HCSO Submission to the Eurostat MNE Discovery Challenge

This repository contains the official submission of the Hungarian Central Statistical Office (HCSO) for the Eurostat Multinational Enterprise (MNE) Discovery Challenge, focused on automatically locating publicly available financial documents of multinational enterprise groups on the web.

Competition Page

https://statistics-awards.eu/competitions/23

Challenge Overview

Multinational enterprise groups play a major role in the European economy. In all EU and EFTA countries, they contribute substantially to the production of goods and services, employment and investments. Due to their importance, they are closely monitored by the National Statistical Institutes and Eurostat. According to the data of the Euro Groups Register (the European statistical business register on MNE groups created by the European Statistical System and managed by Eurostat), for the reference year 2022, MNE groups employed over 47 million people in EU-EFTA countries. This means that around 28 % of people employed in Europe worked for a multinational enterprise group. The majority of them (82 %) worked in a small number of large multinational enterprise groups.

The goal of the Multinational Enterprise Group Data Discovery Challenge is to develop approaches that automatically identify sources of annual financial data on the World Wide Web for MNE Groups.

Team Members

Róbert Lakatos
László Mészáros
István Porupsánszki
Miklós Salánki
István Lakatos

Achievements

Our solution received two awards from Eurostat:

1st Place – Innovativeness Award: awarded to the team demonstrating the most original and creative approach.
3rd Place – Reusability Award: recognizing submissions with strong potential for scaling and integration into European statistical production.

Methodology

Our system consists of a modular, configurable pipeline with three main stages:

1. Web Discovery of Financial Reports

We automatically query a search engine to locate publicly available PDF documents likely to contain financial reports for a given MNE.

Searches are formulated as: "<MNE name> financial report" filetype:pdf
Script: query-reports.py
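
The query format above can be sketched as a small helper. This is a minimal illustration and not the code in query-reports.py; the function name `build_query` and the sample company name are hypothetical, and the whitespace and quote cleanup is an assumption added so the quoted phrase stays intact.

```python
def build_query(mne_name: str) -> str:
    """Build a query of the form "<MNE name> financial report" filetype:pdf."""
    # Assumption: drop embedded double quotes and collapse repeated whitespace
    # so the name cannot break out of the quoted phrase.
    cleaned = " ".join(mne_name.replace('"', " ").split())
    if not cleaned:
        raise ValueError("MNE name must not be empty")
    return f'"{cleaned} financial report" filetype:pdf'


if __name__ == "__main__":
    # Hypothetical company name, used only for illustration.
    print(build_query("Example Holdings  Group"))
```

The resulting string can be passed to whichever search engine client the pipeline is configured with.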

2. PDF Parsing and Text Extraction

Located PDF documents are processed to extract text and tables. The system supports multiple extraction backends:

pdfplumber
PyPDF2
pytesseract

Extracted content is split into smaller text chunks for downstream analysis.

Script: pdf-text-extraction.py
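
As a rough sketch of this stage, the snippet below extracts text with pdfplumber and splits it into overlapping character windows. It is not the code in pdf-text-extraction.py: the function names, the chunk size of 2000 characters, and the 200-character overlap are all assumptions, since the chunking strategy actually used is not described here.

```python
from typing import List


def extract_text(pdf_path: str) -> str:
    """Extract the text layer of every page with pdfplumber."""
    import pdfplumber  # imported lazily so the chunking helper works without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def split_into_chunks(text: str, max_chars: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into overlapping character windows (sizes are assumed values)."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars]
        if chunk.strip():  # skip windows that contain only whitespace
            chunks.append(chunk)
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a window boundary visible in both neighbouring chunks, at the cost of some repeated text.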

3. LLM-Based Document Assessment

Each chunk is evaluated using a large language model (LLM) to determine relevance and content type. Example questions include:

Does the text contain financial information?
Does the text refer to the specified MNE?

We used the gemini-2.0-flash-lite-001 model from Google.

Script: llm.py
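
The assessment step can be illustrated with a model-agnostic sketch. The two questions come from the list above; everything else is an assumption made for illustration, including the prompt wording, the YES/NO answer format, the function names, and the rule that a document counts as relevant when any single chunk passes both checks. The `ask` parameter is a hypothetical callable that sends a prompt to an LLM and returns its text reply, so no specific client API is implied.

```python
from typing import Callable, Dict, Iterable

# The two example questions from the README, keyed by hypothetical labels.
QUESTIONS = {
    "has_financial_info": "Does the text contain financial information?",
    "mentions_mne": "Does the text refer to the specified MNE?",
}


def build_prompt(question: str, mne_name: str, chunk: str) -> str:
    """Assemble one prompt per question (wording is an assumption)."""
    return (
        f"MNE: {mne_name}\n"
        f"Question: {question}\n"
        "Answer with YES or NO only.\n\n"
        f"Text:\n{chunk}"
    )


def parse_yes_no(answer: str) -> bool:
    """Treat any reply that starts with 'yes' (case-insensitive) as positive."""
    return answer.strip().upper().startswith("YES")


def assess_chunk(chunk: str, mne_name: str, ask: Callable[[str], str]) -> Dict[str, bool]:
    """Ask every question about one chunk and collect boolean verdicts."""
    return {
        key: parse_yes_no(ask(build_prompt(question, mne_name, chunk)))
        for key, question in QUESTIONS.items()
    }


def document_is_relevant(chunks: Iterable[str], mne_name: str, ask: Callable[[str], str]) -> bool:
    """Assumed rule: relevant if at least one chunk passes both checks."""
    return any(all(assess_chunk(c, mne_name, ask).values()) for c in chunks)
```

Passing the model call in as `ask` keeps the scoring logic testable with a stub and makes it easy to swap models.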

Project Structure

The solution is highly modular and easily configurable. Intermediate and final outputs are stored in the data/ directory.

data/
README.md
llm.py
pdf-text-extraction.py
query-reports.py
requirements.txt
search_results.csv

Repository: daergoth/Eurostat-MNE-Discovery

Folders and files

Latest commit

History

Repository files navigation

HCSO Submission to the Eurostat MNE Discovery Challenge

Competition Page

Challenge Overview

Team Members

Achievements

Methodology

1. Web Discovery of Financial Reports

2. PDF Parsing and Text Extraction

3. LLM-Based Document Assessment

Project Structure

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Languages