
daergoth/Eurostat-MNE-Discovery


HCSO Submission to the Eurostat MNE Discovery Challenge

This repository contains the official submission of the Hungarian Central Statistical Office (HCSO) for the Eurostat Multinational Enterprise (MNE) Discovery Challenge, focused on automatically locating publicly available financial documents of multinational enterprise groups on the web.

Competition Page

https://statistics-awards.eu/competitions/23


Challenge Overview

Multinational enterprise (MNE) groups play a major role in the European economy. In all EU and EFTA countries, they contribute substantially to the production of goods and services, employment, and investment. Because of their importance, they are closely monitored by the National Statistical Institutes and Eurostat. According to the Euro Groups Register (the European statistical business register on MNE groups, created by the European Statistical System and managed by Eurostat), MNE groups employed over 47 million people in EU-EFTA countries in the reference year 2022. This means that around 28 % of people employed in Europe worked for a multinational enterprise group. The majority of them (82 %) worked in a small number of large multinational enterprise groups.

The goal of the Multinational Enterprise Group Data Discovery Challenge is to develop approaches that automatically identify sources of annual financial data on the World Wide Web for MNE Groups.


Team Members

  • Róbert Lakatos
  • László Mészáros
  • István Porupsánszki
  • Miklós Salánki
  • István Lakatos

Achievements

Our solution received two awards from Eurostat:

  • 1st Place – Innovativeness Award: awarded to the team demonstrating the most original and creative approach.

  • 3rd Place – Reusability Award: recognizing submissions with strong potential for scaling and integration into European statistical production.


Methodology

Our system consists of a modular, configurable pipeline with three main stages:

1. Web Discovery of Financial Reports

We automatically query a search engine to locate publicly available PDF documents likely to contain financial reports for a given MNE.

  • Searches are formulated as: "<MNE name> financial report" filetype:pdf
  • Script: query-reports.py
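
The query format above can be sketched in a few lines. This is a minimal illustration, not the contents of query-reports.py: the function name `build_query` and the whitespace handling are assumptions, and the call to the search engine itself is left out because the README does not say which engine or API is used.

```python
def build_query(mne_name: str) -> str:
    """Build a search query in the form shown above.

    Hypothetical helper: the name and the whitespace normalization are
    illustrative assumptions, not taken from query-reports.py.
    """
    # Collapse repeated whitespace so the quoted phrase stays clean.
    name = " ".join(mne_name.split())
    if not name:
        raise ValueError("mne_name must not be empty")
    # Quoted phrase plus the filetype operator restricting results to PDFs.
    return f'"{name} financial report" filetype:pdf'


if __name__ == "__main__":
    print(build_query("Example Group SE"))
```

The resulting string would then be passed to whichever search backend the pipeline is configured to use.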

2. PDF Parsing and Text Extraction

Located PDF documents are processed to extract text and tables. The system supports multiple extraction backends:

  • pdfplumber
  • PyPDF2
  • pytesseract

Extracted content is split into smaller text chunks for downstream analysis.

  • Script: pdf-text-extraction.py
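
The README does not specify how the text is split, so the following is only one plausible sketch of the chunking step: fixed-size character windows with a small overlap. The function name `chunk_text` and the default values for `max_chars` and `overlap` are assumptions for illustration and may differ from what pdf-text-extraction.py does.

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows.

    Hypothetical helper: the window size and overlap are assumed defaults,
    not values taken from the repository.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks: list[str] = []
    step = max_chars - overlap  # how far each window advances
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", max_chars=4, overlap=1):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of some repeated text sent downstream.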

3. LLM-Based Document Assessment

Each chunk is evaluated using a large language model (LLM) to determine relevance and content type. Example questions include:

  • Does the text contain financial information?
  • Does the text refer to the specified MNE?

We used the gemini-2.0-flash-lite-001 model from Google.

  • Script: llm.py
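
As a rough sketch of how the assessment step could be wired up, the example below builds a prompt from the two example questions and parses yes/no answers. Everything here is an assumption made for illustration: the prompt wording, the result keys, and the `ask_llm` callable (a stand-in for whatever client llm.py uses to call the Gemini model) do not come from the repository.

```python
from typing import Callable

# Example questions from the list above; the dictionary keys are hypothetical.
QUESTIONS = {
    "has_financial_info": "Does the text contain financial information?",
    "refers_to_mne": "Does the text refer to {mne_name}?",
}


def build_prompt(chunk: str, mne_name: str) -> str:
    """Assemble a prompt asking for one yes/no answer per line (assumed format)."""
    lines = ["Answer each question with 'yes' or 'no', one answer per line.", ""]
    for number, question in enumerate(QUESTIONS.values(), start=1):
        lines.append(f"{number}. {question.format(mne_name=mne_name)}")
    lines += ["", "Text:", chunk]
    return "\n".join(lines)


def parse_answers(reply: str) -> dict[str, bool]:
    """Map reply lines to booleans; missing answers are treated as 'no'."""
    answers = [line.strip().lower() for line in reply.splitlines() if line.strip()]
    result = {key: False for key in QUESTIONS}
    for key, answer in zip(QUESTIONS, answers):
        # Drop an optional "1. " style prefix before checking the answer.
        result[key] = answer.lstrip("0123456789. ").startswith("yes")
    return result


def assess_chunk(chunk: str, mne_name: str, ask_llm: Callable[[str], str]) -> dict[str, bool]:
    """Run one chunk through the model; ask_llm is a placeholder for the real client."""
    return parse_answers(ask_llm(build_prompt(chunk, mne_name)))


if __name__ == "__main__":
    # A fake model keeps the sketch runnable without network access.
    fake_llm = lambda prompt: "1. yes\n2. no"
    print(assess_chunk("Revenue rose in 2022.", "Example Group SE", fake_llm))
```

Passing the model call in as a function keeps the parsing logic testable offline and makes it straightforward to swap in a different model.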

Project Structure

The solution is highly modular and easily configurable. Intermediate and final outputs are stored in the data/ directory.
