This repository contains the official submission of the Hungarian Central Statistical Office (HCSO) for the Eurostat Multinational Enterprise (MNE) Discovery Challenge, focused on automatically locating publicly available financial documents of multinational enterprise groups on the web.
https://statistics-awards.eu/competitions/23
Multinational enterprise groups play a major role in the European economy. In all EU and EFTA countries, they contribute substantially to the production of goods and services, employment and investments. Due to their importance, they are closely monitored by the National Statistical Institutes and Eurostat. According to the data of the Euro Groups Register (the European statistical business register on MNE groups created by the European Statistical System and managed by Eurostat), for the reference year 2022, MNE groups employed over 47 million people in EU-EFTA countries. This means that around 28 % of people employed in Europe worked for a multinational enterprise group. The majority of them (82 %) worked in a small number of large multinational enterprise groups.
The goal of the Multinational Enterprise Group Data Discovery Challenge is to develop approaches that automatically identify sources of annual financial data on the World Wide Web for MNE Groups.
- Róbert Lakatos
- László Mészáros
- István Porupsánszki
- Miklós Salánki
- István Lakatos
Our solution received two awards from Eurostat:
- 1st Place – Innovativeness Award: awarded to the team demonstrating the most original and creative approach.
- 3rd Place – Reusability Award: recognizing submissions with strong potential for scaling and integration into European statistical production.
Our system consists of a modular, configurable pipeline with three main stages:
We automatically query a search engine to locate publicly available PDF documents likely to contain financial reports for a given MNE.
- Searches are formulated as: "<MNE name> financial report" filetype:pdf
- Script: query-reports.py
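The query format above can be illustrated with a minimal sketch. This is not the content of query-reports.py; the function names `build_query` and `encode_query` are hypothetical, and the search engine and its URL are deliberately left out because the text does not name them.

```python
from urllib.parse import quote_plus


def build_query(mne_name: str) -> str:
    """Build a search string in the form described above.

    Hypothetical helper: the name and signature are assumptions.
    """
    name = " ".join(mne_name.split())  # collapse stray whitespace
    if not name:
        raise ValueError("MNE name must not be empty")
    return f'"{name} financial report" filetype:pdf'


def encode_query(query: str) -> str:
    """Percent-encode the query so it can be used as a URL parameter value."""
    return quote_plus(query)


if __name__ == "__main__":
    q = build_query("Example Group")
    print(q)
    print(encode_query(q))
```

Keeping query construction separate from the HTTP request makes it easy to change the search phrase or the file type without touching the code that talks to the search engine.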
Located PDF documents are processed to extract text and tables. The system supports multiple extraction backends:
- pdfplumber
- pypdf2
- pytesseract
Extracted content is split into smaller text chunks for downstream analysis.
- Script: pdf-text-extraction.py
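The text does not say how chunks are formed, so the following is only a sketch of one common approach: fixed-size word windows with a small overlap. The function name `chunk_text` and the parameters `max_words` and `overlap` are assumptions, not settings taken from pdf-text-extraction.py.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    Hypothetical helper: sizes and strategy are assumptions.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, max_words=4, overlap=1):
        print(chunk)
```

Smaller chunks keep each LLM request short, while the overlap reduces the chance that a figure and its label end up in different chunks.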
Each chunk is evaluated using a large language model (LLM) to determine relevance and content type. Example questions include:
- Does the text contain financial information?
- Does the text refer to the specified MNE?
We used the gemini-2.0-flash-lite-001 model from Google.
- Script: llm.py
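The per-chunk evaluation can be sketched as a set of yes/no questions sent to a model. This is an illustration under assumptions, not the contents of llm.py: the prompt wording, the names `build_prompt`, `parse_yes_no` and `evaluate_chunk`, and the `ask_llm` callable are all hypothetical. The model call is passed in as a plain function so the sketch does not depend on any particular client library.

```python
from typing import Callable

# The two example questions from the text, keyed by a hypothetical label.
QUESTIONS = {
    "has_financial_info": "Does the text contain financial information?",
    "mentions_mne": "Does the text refer to the specified MNE?",
}


def build_prompt(question: str, chunk: str, mne_name: str) -> str:
    """Assemble one prompt; the wording is an assumption."""
    return (
        f"Specified MNE: {mne_name}\n"
        f"Text:\n{chunk}\n\n"
        f"Question: {question}\n"
        "Answer with a single word: yes or no."
    )


def parse_yes_no(reply: str) -> bool:
    """Treat any reply that starts with 'yes' (case-insensitive) as True."""
    return reply.strip().lower().startswith("yes")


def evaluate_chunk(
    chunk: str, mne_name: str, ask_llm: Callable[[str], str]
) -> dict[str, bool]:
    """Ask every question about one chunk and collect boolean answers."""
    return {
        key: parse_yes_no(ask_llm(build_prompt(question, chunk, mne_name)))
        for key, question in QUESTIONS.items()
    }


if __name__ == "__main__":
    # Stand-in for a real model call, used only to make the sketch runnable.
    def fake_llm(prompt: str) -> str:
        return "Yes" if "revenue" in prompt.lower() else "No"

    print(evaluate_chunk("Total revenue rose in 2023.", "Example Group", fake_llm))
```

In a real run, `ask_llm` would wrap a request to the chosen model, and the boolean answers could then be used to filter or rank candidate documents.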
The solution is highly modular and easily configurable.
Intermediate and final outputs are stored in the data/ directory.