Repository: https://github.com/chirindaopensource/measuring_corruption_from_text_data
Copyright © 2025 Craig Chirinda (Open Source Projects)
This repository contains an independent, professional-grade Python implementation of the research methodology from the 2025 paper entitled "Measuring Corruption from Text Data" by:
- Arieda Muço (Central European University)
The project provides a complete, end-to-end computational framework for replicating the paper's findings. It delivers a modular, auditable, and extensible pipeline that executes the entire research workflow: from the heuristic extraction of irregularities from unstructured audit reports and dictionary-based classification to dimensionality reduction via PCA, rigorous econometric validation against human experts, and supervised learning robustness checks.
- Introduction
- Theoretical Background
- Features
- Methodology Implemented
- Core Components (Notebook Structure)
- Key Callable: `execute_full_research_pipeline`
- Prerequisites
- Installation
- Input Data Structure
- Usage
- Output Structure
- Project Structure
- Customization
- Contributing
- Recommended Extensions
- License
- Citation
- Acknowledgments
This project provides a Python implementation of the analytical framework presented in Muço (2025). The core of this repository is the iPython Notebook measuring_corruption_from_text_data_draft.ipynb, which contains a comprehensive suite of functions to replicate the paper's findings. The pipeline is designed to be a generalizable toolkit for quantifying institutional quality from unstructured administrative text, specifically focusing on the Brazilian municipal audit program (CGU).
The paper addresses the fundamental challenge of measuring corruption—a hidden phenomenon—by leveraging the "administrative exhaust" of government audits. This codebase operationalizes the paper's framework, allowing users to:
- Rigorously validate and manage the entire experimental configuration via a single `config.yaml` file.
- Extract granular irregularity segments from heterogeneous PDF-derived text using regex-based heuristics.
- Classify irregularities as "severe" or "non-severe" using a domain-specific Portuguese dictionary.
- Construct a latent Corruption Index via Principal Component Analysis (PCA) on text-derived features.
- Validate the automated measure against hand-coded datasets (Ferraz & Finan, Timmons & Garfias) using fixed-effects regression models.
- Verify robustness using supervised machine learning classifiers (Logistic Regression, Naive Bayes) and Leave-One-Out (LOO) sensitivity analysis.
The implemented methods combine techniques from Natural Language Processing (NLP), Unsupervised Learning, and Econometrics.
1. Text-as-Data Extraction & Classification: The pipeline treats audit reports as data. It handles structural shifts in reporting (introduction of summaries in later lotteries) to isolate "irregularity segments."
- Dictionary Method: A deterministic rule classifies an irregularity $I_{ij}$ as severe if it contains specific n-grams (e.g., "empresa fantasma", "fraud") from a curated lexicon $\mathcal{L}$:

  $$ Severe(I_{ij}) = \mathbb{1}\{\exists\, \ell \in \mathcal{L} : Match(\ell, \phi(I_{ij}))\} $$

  where $\phi(\cdot)$ represents the text normalization pipeline (stemming, stopword removal).
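To make the classification rule concrete, below is a minimal sketch of such a dictionary classifier, under stated assumptions: the helper names (`phi`, `classify_severe`) are illustrative, NLTK's Snowball Portuguese stemmer stands in for whatever stemmer the notebook configures via `stemmer_algorithm`, and matching is simple normalized-substring containment.

```python
import re
import unicodedata

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Assumption: Snowball's Portuguese stemmer stands in for the notebook's
# configured stemmer.
_STEMMER = SnowballStemmer("portuguese")


def _strip_accents(text: str) -> str:
    """NFD-decompose, then drop combining marks (accent stripping)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Accent-strip the stopword list so it lives in the same matching space
# as the normalized tokens.
_STOPWORDS = {_strip_accents(w) for w in stopwords.words("portuguese")}


def phi(text: str) -> str:
    """Normalization pipeline: lowercase, strip accents, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", _strip_accents(text.lower()))
    return " ".join(_STEMMER.stem(t) for t in tokens if t not in _STOPWORDS)


def classify_severe(irregularity: str, lexicon: list[str]) -> bool:
    """Severe(I) = 1 iff some normalized lexicon n-gram occurs in phi(I)."""
    normalized = phi(irregularity)
    return any(phi(term) in normalized for term in lexicon)


classify_severe("Indícios de empresa fantasma na licitação.", ["empresa fantasma"])  # -> True
```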
2. Dimensionality Reduction (PCA): To synthesize a single measure from correlated text features (image counts, page counts, severe irregularity counts), PCA is applied to the standardized feature matrix, and the leading principal component is retained as the Corruption Index.
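A minimal sketch of this step with scikit-learn follows. `severe_count` is a hypothetical column name for the pipeline's per-report count of severe irregularities, and retaining only the first principal component is one natural reading of the single-index construction.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def build_corruption_index(df: pd.DataFrame, feature_cols: list[str]) -> pd.Series:
    """Standardize the text-derived features and project onto the first
    principal component to obtain a one-dimensional corruption index."""
    X = StandardScaler().fit_transform(df[feature_cols])
    scores = PCA(n_components=1).fit_transform(X)[:, 0]
    return pd.Series(scores, index=df.index, name="corruption_index")


# Tiny synthetic demonstration; real inputs come from the audit corpus.
df_demo = pd.DataFrame({
    "image_count": [0, 4, 12],
    "page_count": [20, 35, 80],
    "severe_count": [1, 3, 9],
})
print(build_corruption_index(df_demo, ["image_count", "page_count", "severe_count"]))
```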
3. Econometric Validation: The automated index is validated by regressing human-coded corruption counts (from the Ferraz & Finan and Timmons & Garfias datasets) on the text-derived index, using fixed-effects regressions with municipal covariates as controls.
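In equation form, a validation regression of this kind can be sketched as (the exact specification is governed by `validation_fixed_effects` and `robust_se_type` in `config.yaml`):

$$ C_m = \alpha + \beta\,\text{Index}_m + X_m'\gamma + \delta_{s(m)} + \varepsilon_m $$

where $C_m$ is a hand-coded corruption count for municipality $m$ (Ferraz & Finan or Timmons & Garfias), $X_m$ collects municipal covariates such as literacy and GDP per capita, and $\delta_{s(m)}$ denotes the configured fixed effects. A positive and stable $\hat{\beta}$ indicates that the automated index tracks expert coding.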
Below is a diagram that summarizes the proposed approach:
The provided iPython Notebook (measuring_corruption_from_text_data_draft.ipynb) implements the full research pipeline, including:
- Modular, Multi-Task Architecture: The pipeline is decomposed into 19 distinct, modular tasks, each with its own orchestrator function.
- Configuration-Driven Design: All study parameters (lottery cutoffs, regex markers, PCA thresholds) are managed in an external `config.yaml` file.
- Rigorous Data Validation: A multi-stage validation process checks schema integrity, key uniqueness, and logical consistency of the corpus and validation data.
- Advanced NLP Pipeline: Implements NFD normalization, accent stripping, and Porter stemming tailored for Portuguese administrative text.
- Robustness Verification: Includes automated Leave-One-Out (LOO) analysis and a parallel Supervised Learning pipeline to cross-validate the dictionary-based measure.
- Reproducible Artifacts: Generates structured dictionaries and serializable outputs for every intermediate result, ensuring full auditability.
The core analytical steps directly implement the methodology from the paper:
- Validation & Cleansing (Tasks 1-6): Ingests raw corpus and validation data, normalizes identifiers (7-digit IBGE codes), cleanses text encoding, and validates numeric metadata.
- Extraction & NLP (Tasks 7-9): Applies heuristic parsing to extract irregularity segments, normalizes text and lexicon into a shared matching space, and classifies irregularities as severe/non-severe.
- Index Construction (Task 10): Builds the feature matrix, standardizes data, computes PCA, and generates the Corruption Index.
- Econometric Validation (Tasks 11-14): Merges the index with external datasets, constructs agreement samples, and estimates validation regressions (Tables 1, 2, and 3).
- Robustness Checks (Tasks 16-19): Performs LOO sensitivity analysis and executes a full supervised learning pipeline (training classifiers, rebuilding the index with ML predictions) to confirm result stability.
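The LOO sensitivity analysis admits a compact sketch. Here `refit_pipeline` is a hypothetical callable standing in for the notebook's rebuild-the-index-and-reestimate step; it returns the key coefficient and $R^2$ for a given lexicon.

```python
from typing import Callable, Iterable


def leave_one_out(
    lexicon: Iterable[str],
    refit_pipeline: Callable[[list[str]], tuple[float, float]],
) -> tuple[list[dict], dict]:
    """Drop each lexicon term in turn, rerun the index construction and
    validation regression, and summarize the spread of the estimates."""
    lexicon = list(lexicon)
    detailed_results = []
    for term in lexicon:
        reduced = [t for t in lexicon if t != term]
        beta, r2 = refit_pipeline(reduced)
        detailed_results.append({"dropped_term": term, "beta": beta, "r2": r2})
    betas = [r["beta"] for r in detailed_results]
    r2s = [r["r2"] for r in detailed_results]
    summary = {
        "beta_range": (min(betas), max(betas)),
        "r2_range": (min(r2s), max(r2s)),
    }
    return detailed_results, summary
```

Tight `beta_range` and `r2_range` values indicate that no single lexicon term drives the results, which is what the `loo_analysis` summary in the output dictionary reports.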
The measuring_corruption_from_text_data_draft.ipynb notebook is structured as a logical pipeline with modular orchestrator functions for each of the 19 major tasks. All functions are self-contained, fully documented with type hints and docstrings, and designed for professional-grade execution.
The project is designed around a single, top-level user-facing interface function:
`execute_full_research_pipeline`: This master orchestrator runs the entire automated research pipeline end-to-end. A single call to this function reproduces the complete computational portion of the project, managing data flow between the main analysis, LOO robustness, and supervised learning robustness modules.
- Python 3.9+
- Core dependencies:
`pandas`, `numpy`, `scikit-learn`, `statsmodels`, `nltk`, `pyyaml`.
1. Clone the repository:

   ```bash
   git clone https://github.com/chirindaopensource/measuring_corruption_from_text_data.git
   cd measuring_corruption_from_text_data
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install Python dependencies:

   ```bash
   pip install pandas numpy scikit-learn statsmodels nltk pyyaml
   ```

4. Download NLTK data: The pipeline will attempt to download the necessary NLTK data (stopwords) automatically, but you can pre-install it:

   ```python
   import nltk
   nltk.download('stopwords')
   ```
The pipeline requires two primary DataFrames:
- `df_raw_corpus`: The corpus of audit reports with columns `report_id`, `municipality_id`, `report_full_text`, `report_summary_text`, `lottery`, `year`, `image_count`, `page_count`, etc.
- `df_validation_raw`: External validation data with columns `municipality_id`, `ff_corruption_count`, `gt_corruption_count`, `cgu_severe_count`, and municipal covariates (`literacy_rate`, `gdp_per_capita`, etc.).
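For orientation, here is a minimal synthetic example of the expected schemas (all values are illustrative placeholders, not real audit data):

```python
import pandas as pd

df_raw_corpus = pd.DataFrame({
    "report_id": ["R0001"],
    "municipality_id": ["3550308"],        # 7-digit IBGE code
    "report_full_text": ["...texto completo do relatório..."],
    "report_summary_text": [""],           # empty for early lotteries without summaries
    "lottery": [5],
    "year": [2004],
    "image_count": [12],
    "page_count": [48],
})

df_validation_raw = pd.DataFrame({
    "municipality_id": ["3550308"],
    "ff_corruption_count": [3],            # Ferraz & Finan hand-coded count
    "gt_corruption_count": [2],            # Timmons & Garfias hand-coded count
    "cgu_severe_count": [4],
    "literacy_rate": [0.91],
    "gdp_per_capita": [18500.0],
})
```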
The measuring_corruption_from_text_data_draft.ipynb notebook provides a complete, step-by-step guide. The primary workflow is to execute the final cell of the notebook, which demonstrates how to use the top-level execute_full_research_pipeline orchestrator:
```python
# Final cell of the notebook.
# This block serves as the main entry point for the entire project.
if __name__ == '__main__':
    # 1. Load the master configuration from the YAML file.
    import yaml
    with open('config.yaml', 'r') as f:
        study_config = yaml.safe_load(f)

    # 2. Load raw datasets (example using the synthetic generator provided
    #    in the notebook). In production, load from CSV/Parquet: pd.read_csv(...)
    df_raw_corpus = ...
    df_validation_raw = ...

    # 3. Define the lexicon.
    raw_lexicon_list = ["empresa fantasma", "fraud", "conluio", ...]

    # 4. Execute the entire replication study.
    results = execute_full_research_pipeline(
        df_raw_corpus=df_raw_corpus,
        df_validation_raw=df_validation_raw,
        language="Portuguese",
        raw_lexicon_list=raw_lexicon_list,
        study_configuration=study_config,
    )

    # 5. Access results.
    print(f"Main Pipeline R2 (Strict Agreement): "
          f"{results['main_pipeline']['table1_results']['Strict Agreement']['R2']}")
```

The pipeline returns a master dictionary containing all analytical artifacts:
- `main_pipeline`: Contains `df_corpus_with_index` (the final index), `table1_results` (validation regressions), `table2_results` (CGU validation), `table3_results` (correlates), and `pca_artifacts`.
- `loo_analysis`: Contains `detailed_results` (per-iteration stats) and `summary` (min/max ranges for $\beta$ and $R^2$).
- `supervised_robustness`: Contains `classification_reports`, `ml_pca_index`, and `comparison_stats` (correlation between the dictionary and ML indices).
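Continuing the usage example, the robustness artifacts can be pulled straight from the returned dictionary:

```python
# LOO stability: min/max ranges for beta and R^2 across dropped terms.
print(results["loo_analysis"]["summary"])

# Agreement between the dictionary-based and ML-based indices.
print(results["supervised_robustness"]["comparison_stats"])
```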
```
measuring_corruption_from_text_data/
│
├── measuring_corruption_from_text_data_draft.ipynb   # Main implementation notebook
├── config.yaml                                       # Master configuration file
├── requirements.txt                                  # Python package dependencies
│
├── LICENSE                                           # MIT project license file
└── README.md                                         # This file
```
The pipeline is highly customizable via the `config.yaml` file. Users can modify study parameters such as:

- Parsing Logic: `lottery_cutoff`, `regex_start_marker`.
- NLP Settings: `stemmer_algorithm`, `dictionary_ngram_range`.
- PCA Settings: `pca_input_features`, `eigenvalue_threshold`.
- Econometrics: `robust_se_type`, `validation_fixed_effects`.
Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.
Future extensions could include:
- LLM Integration: Replacing the dictionary classifier with Large Language Models (e.g., BERT, GPT) to test if contextual embeddings improve severity classification.
- Temporal Analysis: Extending the model to analyze trends in corruption severity over time.
- Cross-National Application: Adapting the dictionary and extraction logic for audit reports from other countries (e.g., Mexico, Puerto Rico).
This project is licensed under the MIT License. See the LICENSE file for details.
If you use this code or the methodology in your research, please cite the original paper:
```bibtex
@article{muco2025measuring,
  title={Measuring Corruption from Text Data},
  author={Muço, Arieda},
  journal={arXiv preprint arXiv:2512.09652},
  year={2025}
}
```

For the implementation itself, you may cite this repository:
Chirinda, C. (2025). Automated Quantification of Institutional Quality: An Open Source Implementation.
GitHub repository: https://github.com/chirindaopensource/measuring_corruption_from_text_data
- Credit to Arieda Muço for the foundational research that forms the entire basis for this computational replication.
- This project is built upon the exceptional tools provided by the open-source community. Sincere thanks to the developers of the scientific Python ecosystem, including Pandas, NumPy, Scikit-Learn, and Statsmodels.
---
This README was generated based on the structure and content of the measuring_corruption_from_text_data_draft.ipynb notebook and follows best practices for research software documentation.
