Skip to content

End-to-end Python implementation of Ma et al.'s (2025) matrix-variate diffusion index models for macroeconomic forecasting. Features α-PCA factor extraction, supervised screening, and ILS estimation for high-dimensional forecasting with preserved structural information.

License

Notifications You must be signed in to change notification settings

chirindaopensource/high_dimensional_matrix_variate_diffusion_index_models

Repository files navigation

README.md

High-Dimensional Matrix-Variate Diffusion Index Models

License: MIT Python Version Code style: black Imports: isort Type Checking: mypy Pandas NumPy SciPy Statsmodels Scikit-learn Jupyter arXiv Research Discipline Methodology Year

Repository: https://github.com/chirindaopensource/high_dimensional_matrix_variate_diffusion_index_models

Owner: 2025 Craig Chirinda (Open Source Projects)

This repository contains an independent, professional-grade Python implementation of the research methodology from the 2025 paper entitled "High-Dimensional Matrix-Variate Diffusion Index Models for Time Series Forecasting" by:

  • Zhiren Ma
  • Qian Zhao
  • Riquan Zhang
  • Zhaoxing Gao

The project provides a complete, end-to-end computational framework for forecasting a scalar time series using a high-dimensional, matrix-valued panel of predictors. It moves beyond traditional vectorized factor models by preserving the intrinsic row-column structure of the data, offering a more powerful and nuanced approach to dimension reduction in modern data-rich environments. The goal is to provide a transparent, robust, and computationally efficient toolkit for researchers and practitioners to replicate, validate, and extend the paper's findings.

Table of Contents

Introduction

This project provides a Python implementation of the methodologies presented in the 2025 paper "High-Dimensional Matrix-Variate Diffusion Index Models for Time Series Forecasting." The core of this repository is the iPython Notebook high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb, which contains a comprehensive suite of functions to replicate the paper's findings, from initial data validation and cleansing to the final generation of performance tables, diagnostic plots, and a full reproducibility report.

Traditional diffusion index models require vectorizing predictor panels, which can destroy valuable structural information (e.g., the relationship between different economic indicators for a single country). This project implements a matrix-variate approach that preserves this structure, potentially leading to more powerful factors and more accurate forecasts.

This codebase enables users to:

  • Rigorously validate and prepare complex panel datasets, ensuring stationarity and preventing data leakage in a forecasting context.
  • Extract low-dimensional latent factor matrices from a high-dimensional data tensor using the flexible α-PCA methodology.
  • Estimate a bilinear forecasting model using a numerically stable Iterative Least Squares (ILS) algorithm.
  • Apply a novel supervised screening technique to refine the predictor set and improve forecast accuracy.
  • Conduct a full-scale Monte Carlo simulation to validate the statistical properties of the estimators.
  • Perform a comprehensive empirical study, including benchmark comparisons and robustness checks.
  • Automatically generate all key tables and figures from the paper for direct comparison and validation.

Theoretical Background

The implemented methods are grounded in modern high-dimensional econometrics, extending classical factor model theory to matrix- and tensor-valued time series.

1. The Matrix-Variate Diffusion Index Model: The core model assumes that a high-dimensional predictor matrix $X_t \in \mathbb{R}^{p \times q}$ and a future scalar outcome $y_{t+h}$ are driven by a common low-dimensional latent factor matrix $F_t \in \mathbb{R}^{k \times r}$.

  • Observation Equation: $X_t = R F_t C' + E_t$
  • Forecasting Equation: $y_{t+h} = \alpha' F_t \beta + e_{t+h}$

2. α-Principal Component Analysis (α-PCA): Unlike standard PCA which only considers the covariance matrix (second moments), α-PCA constructs aggregation matrices that are a weighted average of both first and second moments of the data. This allows the factor extraction to be sensitive to both the mean structure and the variance structure of the predictors. The key constructs are the moment aggregation matrices: $$ \widehat{\boldsymbol{M}}R = \frac{1}{pq} \left[ (1+\alpha) \overline{\boldsymbol{X}} \overline{\boldsymbol{X}}^\prime + \frac{1}{T} \sum{t=1}^T (\boldsymbol{X}_t - \overline{\boldsymbol{X}}) (\boldsymbol{X}_t - \overline{\boldsymbol{X}})^\prime \right] $$ The loading matrices $R$ and $C$ are derived from the eigendecomposition of $\widehat{\boldsymbol{M}}_R$ and its column-wise equivalent $\widehat{\boldsymbol{M}}_C$.

3. Supervised Screening: To improve the signal-to-noise ratio, the paper proposes a supervised pre-processing step. It computes the correlation of each individual predictor series $x_{ij,t}$ with the target $y_t$. Rows (e.g., countries) and columns (e.g., indicators) with low average absolute correlation are removed before the α-PCA is applied, focusing the dimension reduction on the most relevant parts of the data.

Features

The provided iPython Notebook (high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb) implements the full research pipeline, including:

  • Data Pipeline: A robust, leak-free validation and preparation module that performs stationarity testing, transformation, and centralization appropriate for forecasting.
  • High-Performance Analytics: Elite-grade, vectorized implementations of the α-PCA and ILS algorithms using advanced NumPy and SciPy features.
  • Statistical Rigor: A complete suite of benchmark models (AR, Vec-OLS, Vec-Lasso) and a robust Diebold-Mariano test with small-sample corrections for fair and accurate model comparison.
  • Automated Orchestration: A master function that runs the entire end-to-end workflow, including the empirical study, simulations, and robustness checks, with a single call.
  • Comprehensive Reporting: Automated generation of publication-quality summary tables (replicating Tables 1, 2, 4, 5 from the paper), diagnostic plots, and a full reproducibility report.
  • Full Research Lifecycle: The codebase covers the entire research process from data ingestion to final output generation, providing a complete and transparent replication package.

Methodology Implemented

The core analytical steps directly implement the methodology from the paper:

  1. Data Validation and Preparation (Tasks 1-2): The pipeline ingests the raw panel data, performs structural and quality checks, and applies a leak-free stationarity and centralization protocol.
  2. α-PCA Factor Extraction (Tasks 3-6): It computes sample statistics, constructs the moment aggregation matrices, performs eigendecomposition to find the loadings, and projects the data to recover the latent factor matrices.
  3. LSE Parameter Estimation (Tasks 7-8): It prepares the training data and uses a numerically stable Iterative Least Squares algorithm to estimate the forecasting parameters $\alpha$ and $\beta$.
  4. Supervised Screening (Tasks 9-10): It computes training-data-only correlations, applies thresholds to filter the data, and prepares a refined data tensor.
  5. Out-of-Sample Forecasting and Evaluation (Tasks 12-14): It generates forecasts on unseen data and evaluates them using MSFE and the Diebold-Mariano test.
  6. Simulation Study (Tasks 16-17): It implements the full Monte Carlo simulation framework to generate synthetic data and validate the estimators' statistical properties.
  7. Orchestration and Reporting (Tasks 18-20): Master functions orchestrate the empirical study, robustness checks, and the generation of all final tables and reports.

Core Components (Notebook Structure)

The high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb notebook is structured as a logical pipeline with modular functions for each task, from Task 1 (Data Validation) to Task 20 (Results Compilation).

Key Callable: run_complete_study

The central function in this project is run_complete_study. It orchestrates the entire analytical workflow from raw data to a final, comprehensive report object.

def run_complete_study(
    country_data: Dict[str, pd.DataFrame],
    y_series: pd.Series,
    study_manifest: Dict[str, Any],
    run_empirical_study: bool = True,
    run_simulation_study: bool = True,
    generate_reports: bool = True
) -> Dict[str, Any]:
    """
    Executes the entire research pipeline from data to final report.
    """
    # ... (implementation is in the notebook)

Prerequisites

  • Python 3.9+
  • Core dependencies: pandas, numpy, scipy, statsmodels, scikit-learn, matplotlib, joblib, tqdm.

Installation

  1. Clone the repository:

    git clone https://github.com/chirindaopensource/high_dimensional_matrix_variate_diffusion_index_models.git
    cd high_dimensional_matrix_variate_diffusion_index_models
  2. Create and activate a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install Python dependencies:

    pip install pandas numpy scipy statsmodels scikit-learn matplotlib joblib tqdm

Input Data Structure

The pipeline requires two primary data inputs passed to the run_complete_study function:

  1. country_data: A Python dictionary where keys are string identifiers for the row entities (e.g., countries) and values are pandas.DataFrames. Each DataFrame must have a DatetimeIndex and identical column names representing the predictor variables.
  2. y_series: A pandas.Series containing the scalar target variable, with a DatetimeIndex that is identical to those in the country_data DataFrames.
  3. study_manifest: A nested Python dictionary that controls all parameters of the analysis. A fully specified example is provided in the notebook.

Usage

The high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb notebook provides a complete, step-by-step guide. The core workflow is:

  1. Prepare Inputs: Load your panel data into the required dictionary and Series formats. Define the study_manifest dictionary.
  2. Execute Pipeline: Call the master orchestrator function:
    final_results = run_complete_study(
        country_data=my_raw_country_data,
        y_series=my_raw_y_series,
        study_manifest=my_study_manifest,
        run_empirical_study=True,
        run_simulation_study=False, # Optional: very time-consuming
        generate_reports=True
    )
  3. Inspect Outputs: Programmatically access any result from the returned final_results dictionary. For example, to view the main results table for the unscreened model:
    table4_panel_a = final_results['reports']['Table 4']['Panel A']
    # In a Jupyter Notebook, this will render the styled table
    display(table4_panel_a)

Output Structure

The run_complete_study function returns a single, comprehensive dictionary with the following top-level keys:

  • empirical_study: A deeply nested dictionary containing all raw numerical results from the empirical analysis, including performance DataFrames for the main model (screened and unscreened), benchmark results, and significance tests.
  • simulation_study: A dictionary containing the raw pd.DataFrame from the Monte Carlo simulation (if run).
  • reports: A dictionary containing all generated outputs, including styled pd.DataFrame objects for each table, matplotlib figure objects for plots, and the final reproducibility report.

Project Structure

high_dimensional_matrix_variate_diffusion_index_models/
│
├── high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb  # Main implementation notebook   
├── requirements.txt                                                      # Python package dependencies
├── LICENSE                                                               # MIT license file
└── README.md                                                             # This documentation file

Customization

The pipeline is highly customizable via the master study_config dictionary. Users can easily modify:

  • The alpha_grid and factor_dimensions_grid for the empirical study.
  • The train_test_split_config to change the out-of-sample period.
  • All parameters for the Monte Carlo simulations (p, q, T, DGPs, etc.).
  • The screening_thresholds for the supervised refinement step.

Contributing

Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use this code or the methodology in your research, please cite the original paper:

@article{ma2025high,
  title={High-Dimensional Matrix-Variate Diffusion Index Models for Time Series Forecasting},
  author={Ma, Zhiren and Zhao, Qian and Zhang, Riquan and Gao, Zhaoxing},
  journal={arXiv preprint arXiv:2508.04259},
  year={2025}
}

For the implementation itself, you may cite this repository:

Chirinda, C. (2025). A Python Implementation of "High-Dimensional Matrix-Variate Diffusion Index Models for Time Series Forecasting". 
GitHub repository: https://github.com/chirindaopensource/high_dimensional_matrix_variate_diffusion_index_models

Acknowledgments

  • Credit to Zhiren Ma, Qian Zhao, Riquan Zhang, and Zhaoxing Gao for their clear and insightful research.
  • Thanks to the developers of the scientific Python ecosystem (numpy, pandas, scipy, statsmodels, scikit-learn) that makes this work possible.

--

This README was generated based on the structure and content of high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb and follows best practices for research software documentation.

About

End-to-end Python implementation of Ma et al.'s (2025) matrix-variate diffusion index models for macroeconomic forecasting. Features α-PCA factor extraction, supervised screening, and ILS estimation for high-dimensional forecasting with preserved structural information.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published