Repository: https://github.com/chirindaopensource/high_dimensional_matrix_variate_diffusion_index_models
Owner: 2025 Craig Chirinda (Open Source Projects)
This repository contains an independent, professional-grade Python implementation of the research methodology from the 2025 paper entitled "High-Dimensional Matrix-Variate Diffusion Index Models for Time Series Forecasting" by:
- Zhiren Ma
- Qian Zhao
- Riquan Zhang
- Zhaoxing Gao
The project provides a complete, end-to-end computational framework for forecasting a scalar time series using a high-dimensional, matrix-valued panel of predictors. It moves beyond traditional vectorized factor models by preserving the intrinsic row-column structure of the data, offering a more powerful and nuanced approach to dimension reduction in modern data-rich environments. The goal is to provide a transparent, robust, and computationally efficient toolkit for researchers and practitioners to replicate, validate, and extend the paper's findings.
- Introduction
- Theoretical Background
- Features
- Methodology Implemented
- Core Components (Notebook Structure)
- Key Callable: run_complete_study
- Prerequisites
- Installation
- Input Data Structure
- Usage
- Output Structure
- Project Structure
- Customization
- Contributing
- License
- Citation
- Acknowledgments
This project provides a Python implementation of the methodologies presented in the 2025 paper "High-Dimensional Matrix-Variate Diffusion Index Models for Time Series Forecasting." The core of this repository is the iPython Notebook high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb, which contains a comprehensive suite of functions to replicate the paper's findings, from initial data validation and cleansing to the final generation of performance tables, diagnostic plots, and a full reproducibility report.
Traditional diffusion index models require vectorizing predictor panels, which can destroy valuable structural information (e.g., the relationship between different economic indicators for a single country). This project implements a matrix-variate approach that preserves this structure, potentially leading to more powerful factors and more accurate forecasts.
This codebase enables users to:
- Rigorously validate and prepare complex panel datasets, ensuring stationarity and preventing data leakage in a forecasting context.
- Extract low-dimensional latent factor matrices from a high-dimensional data tensor using the flexible α-PCA methodology.
- Estimate a bilinear forecasting model using a numerically stable Iterative Least Squares (ILS) algorithm.
- Apply a novel supervised screening technique to refine the predictor set and improve forecast accuracy.
- Conduct a full-scale Monte Carlo simulation to validate the statistical properties of the estimators.
- Perform a comprehensive empirical study, including benchmark comparisons and robustness checks.
- Automatically generate all key tables and figures from the paper for direct comparison and validation.
The implemented methods are grounded in modern high-dimensional econometrics, extending classical factor model theory to matrix- and tensor-valued time series.
1. The Matrix-Variate Diffusion Index Model:
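The core model assumes that a high-dimensional predictor matrix $X_t$ (of dimension $p \times q$) is driven by a low-dimensional latent factor matrix $F_t$, which in turn drives the scalar target $y_{t+h}$:
- Observation Equation: $X_t = R F_t C' + E_t$
- Forecasting Equation: $y_{t+h} = \alpha' F_t \beta + e_{t+h}$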
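To make the two equations concrete, here is a minimal sketch that simulates data from them with NumPy. The function name `simulate_matrix_di`, the default dimensions, the Gaussian draws, and the noise scales are illustrative assumptions; this is not the paper's data-generating process or the notebook's code.

```python
import numpy as np


def simulate_matrix_di(T=200, p=10, q=8, k=2, r=2, h=1, seed=0):
    """Simulate X_t = R F_t C' + E_t and y_{t+h} = alpha' F_t beta + e_{t+h}.

    Hypothetical helper for illustration only; requires h >= 1.
    """
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(p, k))            # row loading matrix
    C = rng.normal(size=(q, r))            # column loading matrix
    F = rng.normal(size=(T, k, r))         # latent factor matrices F_t
    E = 0.5 * rng.normal(size=(T, p, q))   # idiosyncratic noise (assumed scale)

    # Observation equation, applied to every t at once.
    X = np.einsum("pk,tkr,qr->tpq", R, F, C) + E

    alpha = rng.normal(size=k)
    beta = rng.normal(size=r)
    signal = np.einsum("k,tkr,r->t", alpha, F, beta)  # alpha' F_t beta

    # Forecasting equation: y at time t+h depends on F_t.
    y = np.full(T, np.nan)
    y[h:] = signal[:-h] + 0.1 * rng.normal(size=T - h)
    return X, y, F


X, y, F = simulate_matrix_di()
```

The first `h` entries of `y` are left as NaN because no factor observation precedes them in the simulated sample.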
2. α-Principal Component Analysis (α-PCA):
Unlike standard PCA which only considers the covariance matrix (second moments), α-PCA constructs aggregation matrices that are a weighted average of both first and second moments of the data. This allows the factor extraction to be sensitive to both the mean structure and the variance structure of the predictors. The key constructs are the moment aggregation matrices:
$$
\widehat{\boldsymbol{M}}_R = \frac{1}{pq} \left[ (1+\alpha) \overline{\boldsymbol{X}} \, \overline{\boldsymbol{X}}^\prime + \frac{1}{T} \sum_{t=1}^T (\boldsymbol{X}_t - \overline{\boldsymbol{X}}) (\boldsymbol{X}_t - \overline{\boldsymbol{X}})^\prime \right]
$$
The loading matrices are then obtained from an eigendecomposition of these aggregation matrices.
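The α-PCA step can be sketched in a few lines of NumPy. Only the row aggregation matrix is given above, so three parts of this sketch are assumptions: the column-wise analogue `M_C` (built by symmetry), the $\sqrt{p}$ and $\sqrt{q}$ normalisation of the loadings, and the projection used to recover the factors. The function name `alpha_pca` is hypothetical and the notebook's implementation may differ.

```python
import numpy as np


def alpha_pca(X, k, r, alpha=0.0):
    """Sketch of alpha-PCA for a data tensor X of shape (T, p, q).

    Returns row loadings (p, k), column loadings (q, r), factors (T, k, r).
    """
    T, p, q = X.shape
    X_bar = X.mean(axis=0)     # (p, q) sample mean matrix
    Xc = X - X_bar             # centred observations

    # Row aggregation matrix, following the displayed formula.
    M_R = ((1 + alpha) * X_bar @ X_bar.T
           + np.einsum("tpq,tsq->ps", Xc, Xc) / T) / (p * q)
    # Column analogue (assumed by symmetry; not shown in the text above).
    M_C = ((1 + alpha) * X_bar.T @ X_bar
           + np.einsum("tpq,tps->qs", Xc, Xc) / T) / (p * q)

    # eigh returns eigenvalues in ascending order, so reverse the columns.
    _, vec_R = np.linalg.eigh(M_R)
    _, vec_C = np.linalg.eigh(M_C)
    R_hat = np.sqrt(p) * vec_R[:, ::-1][:, :k]
    C_hat = np.sqrt(q) * vec_C[:, ::-1][:, :r]

    # Project the data onto the loading spaces to recover F_t.
    F_hat = np.einsum("pk,tpq,qr->tkr", R_hat, X, C_hat) / (p * q)
    return R_hat, C_hat, F_hat
```

With `alpha = -1` the mean term vanishes and the sketch reduces to a PCA of the centred second moments, which is one way to see how α trades off first and second moments.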
3. Supervised Screening:
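To improve the signal-to-noise ratio, the paper proposes a supervised pre-processing step. It computes the correlation of each individual predictor series with the target variable, using training data only, and applies a threshold to these correlations to filter the predictor set before factor extraction.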
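A minimal sketch of such a screening step is shown below. The function name `screen_predictors`, the use of the absolute Pearson correlation, and the default threshold are assumptions made for illustration. The caller is also assumed to have aligned `X_train[t]` with the target at the forecast horizon.

```python
import numpy as np


def screen_predictors(X_train, y_train, threshold=0.1):
    """Correlation screening on training data only (illustrative sketch).

    X_train: (T, p, q) predictor tensor, y_train: (T,) aligned target.
    Returns a boolean keep-mask of shape (p, q) and the correlation matrix.
    """
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()

    # Pearson correlation of every (i, j) predictor series with the target.
    cov = np.einsum("tpq,t->pq", Xc, yc)
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

    keep = np.abs(corr) >= threshold
    return keep, corr
```

How the mask is applied is a further design choice. Multiplying the tensor by `keep` zeroes out screened entries while preserving the matrix shape, which is one option; the notebook may instead drop rows or columns.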
The provided iPython Notebook (high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb) implements the full research pipeline, including:
- Data Pipeline: A robust, leak-free validation and preparation module that performs stationarity testing, transformation, and centralization appropriate for forecasting.
- High-Performance Analytics: Vectorized implementations of the α-PCA and ILS algorithms built on NumPy and SciPy.
- Statistical Rigor: A complete suite of benchmark models (AR, Vec-OLS, Vec-Lasso) and a robust Diebold-Mariano test with small-sample corrections for fair and accurate model comparison.
- Automated Orchestration: A master function that runs the entire end-to-end workflow, including the empirical study, simulations, and robustness checks, with a single call.
- Comprehensive Reporting: Automated generation of publication-quality summary tables (replicating Tables 1, 2, 4, 5 from the paper), diagnostic plots, and a full reproducibility report.
- Full Research Lifecycle: The codebase covers the entire research process from data ingestion to final output generation, providing a complete and transparent replication package.
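The model comparison mentioned in the feature list rests on the mean squared forecast error (MSFE) and the Diebold-Mariano test. The sketch below uses squared-error loss, a rectangular-kernel long-run variance, and the Harvey-Leybourne-Newbold small-sample correction with a Student-t reference distribution. That the notebook uses exactly this correction is an assumption, and the function names are hypothetical.

```python
import numpy as np
from scipy import stats


def msfe(y_true, y_pred):
    """Mean squared forecast error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


def diebold_mariano(e1, e2, h=1):
    """DM test on squared-error loss with the HLN correction (sketch).

    e1, e2: forecast errors of two competing models, h: forecast horizon.
    A positive statistic means model 1 has the larger squared-error loss.
    """
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    d = e1 ** 2 - e2 ** 2          # loss differential
    n = d.size
    d_bar = d.mean()
    dc = d - d_bar

    # Long-run variance using autocovariances up to lag h - 1.
    # Note: for h > 1 this estimate is not guaranteed to be positive.
    lrv = np.sum(dc ** 2) / n
    for lag in range(1, h):
        lrv += 2.0 * np.sum(dc[lag:] * dc[:-lag]) / n

    dm = d_bar / np.sqrt(lrv / n)
    correction = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    stat = correction * dm
    p_value = 2.0 * stats.t.sf(abs(stat), df=n - 1)
    return float(stat), float(p_value)
```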
The core analytical steps directly implement the methodology from the paper:
- Data Validation and Preparation (Tasks 1-2): The pipeline ingests the raw panel data, performs structural and quality checks, and applies a leak-free stationarity and centralization protocol.
- α-PCA Factor Extraction (Tasks 3-6): It computes sample statistics, constructs the moment aggregation matrices, performs eigendecomposition to find the loadings, and projects the data to recover the latent factor matrices.
- LSE Parameter Estimation (Tasks 7-8): It prepares the training data and uses a numerically stable Iterative Least Squares algorithm to estimate the forecasting parameters $\alpha$ and $\beta$.
- Supervised Screening (Tasks 9-10): It computes training-data-only correlations, applies thresholds to filter the data, and prepares a refined data tensor.
- Out-of-Sample Forecasting and Evaluation (Tasks 12-14): It generates forecasts on unseen data and evaluates them using MSFE and the Diebold-Mariano test.
- Simulation Study (Tasks 16-17): It implements the full Monte Carlo simulation framework to generate synthetic data and validate the estimators' statistical properties.
- Orchestration and Reporting (Tasks 18-20): Master functions orchestrate the empirical study, robustness checks, and the generation of all final tables and reports.
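The parameter estimation step for the bilinear forecasting equation $y_{t+h} = \alpha' F_t \beta + e_{t+h}$ can be sketched as an alternating least squares loop: with $\beta$ fixed the model is linear in $\alpha$, and vice versa. The function name, the random initialisation, the unit-norm constraint on $\alpha$ (to pin down the scale indeterminacy), and the stopping rule are assumptions of this sketch, not a description of the notebook's routine.

```python
import numpy as np


def iterative_least_squares(F, y, n_iter=200, tol=1e-10, seed=0):
    """Alternating least squares for y_t = alpha' F_t beta + e_t (sketch).

    F: (T, k, r) factor matrices, y: (T,) target aligned with F.
    Returns alpha (k,) with unit norm and beta (r,).
    """
    T, k, r = F.shape
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=r)      # random start (assumption)
    alpha = np.zeros(k)
    prev_loss = np.inf

    for _ in range(n_iter):
        # Step 1: beta fixed, regress y on Z_t = F_t beta to update alpha.
        Z = F @ beta                                  # (T, k)
        alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)
        norm = np.linalg.norm(alpha)
        if norm > 0:
            alpha = alpha / norm                      # fix the scale

        # Step 2: alpha fixed, regress y on W_t = F_t' alpha to update beta.
        W = np.einsum("tkr,k->tr", F, alpha)          # (T, r)
        beta, *_ = np.linalg.lstsq(W, y, rcond=None)

        loss = np.mean((y - W @ beta) ** 2)
        if prev_loss - loss < tol:                    # no further improvement
            break
        prev_loss = loss

    return alpha, beta
```

Using `np.linalg.lstsq` rather than explicitly inverting the normal equations keeps each sub-step well behaved when the regressors are nearly collinear.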
The high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb notebook is structured as a logical pipeline with modular functions for each task, from Task 1 (Data Validation) to Task 20 (Results Compilation).
The central function in this project is run_complete_study. It orchestrates the entire analytical workflow from raw data to a final, comprehensive report object.
from typing import Any, Dict

import pandas as pd


def run_complete_study(
    country_data: Dict[str, pd.DataFrame],
    y_series: pd.Series,
    study_manifest: Dict[str, Any],
    run_empirical_study: bool = True,
    run_simulation_study: bool = True,
    generate_reports: bool = True
) -> Dict[str, Any]:
    """
    Executes the entire research pipeline from data to final report.
    """
    # ... (implementation is in the notebook)
- Python 3.9+
- Core dependencies: pandas, numpy, scipy, statsmodels, scikit-learn, matplotlib, joblib, tqdm.
- Clone the repository:
  git clone https://github.com/chirindaopensource/high_dimensional_matrix_variate_diffusion_index_models.git
  cd high_dimensional_matrix_variate_diffusion_index_models
- Create and activate a virtual environment (recommended):
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install Python dependencies:
  pip install pandas numpy scipy statsmodels scikit-learn matplotlib joblib tqdm
The pipeline requires two primary data inputs and one configuration dictionary, all passed to the run_complete_study function:
- country_data: A Python dictionary where keys are string identifiers for the row entities (e.g., countries) and values are pandas.DataFrames. Each DataFrame must have a DatetimeIndex and identical column names representing the predictor variables.
- y_series: A pandas.Series containing the scalar target variable, with a DatetimeIndex that is identical to those in the country_data DataFrames.
- study_manifest: A nested Python dictionary that controls all parameters of the analysis. A fully specified example is provided in the notebook.
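As an illustration of the input formats described above, the sketch below builds a small synthetic `country_data` dictionary and `y_series`, checks the stated index and column requirements, and stacks the panel into a tensor of shape (T, p, q). The country codes, variable names, and the helpers `check_inputs` and `to_tensor` are hypothetical and are not part of the notebook's API.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=24, freq="MS")
variables = ["gdp_growth", "inflation", "unemployment"]   # hypothetical names
countries = ["USA", "DEU", "JPN", "GBR"]                  # hypothetical keys

# One DataFrame per row entity, all sharing the same index and columns.
country_data = {
    c: pd.DataFrame(rng.normal(size=(len(dates), len(variables))),
                    index=dates, columns=variables)
    for c in countries
}
y_series = pd.Series(rng.normal(size=len(dates)), index=dates, name="target")


def check_inputs(country_data, y_series):
    """Raise if the DataFrames violate the documented input requirements."""
    ref_cols = None
    for name, df in country_data.items():
        if not isinstance(df.index, pd.DatetimeIndex):
            raise TypeError(f"{name}: index must be a DatetimeIndex")
        if not df.index.equals(y_series.index):
            raise ValueError(f"{name}: index differs from y_series")
        if ref_cols is None:
            ref_cols = list(df.columns)
        elif list(df.columns) != ref_cols:
            raise ValueError(f"{name}: column names differ")


def to_tensor(country_data):
    """Stack the panel into an array of shape (T, p, q)."""
    return np.stack([df.to_numpy() for df in country_data.values()], axis=1)


check_inputs(country_data, y_series)
X = to_tensor(country_data)   # T = 24 dates, p = 4 countries, q = 3 variables
```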
The high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb notebook provides a complete, step-by-step guide. The core workflow is:
- Prepare Inputs: Load your panel data into the required dictionary and Series formats. Define the study_manifest dictionary.
- Execute Pipeline: Call the master orchestrator function:
  final_results = run_complete_study(
      country_data=my_raw_country_data,
      y_series=my_raw_y_series,
      study_manifest=my_study_manifest,
      run_empirical_study=True,
      run_simulation_study=False,  # Optional: very time-consuming
      generate_reports=True
  )
- Inspect Outputs: Programmatically access any result from the returned final_results dictionary. For example, to view the main results table for the unscreened model:
  table4_panel_a = final_results['reports']['Table 4']['Panel A']
  # In a Jupyter Notebook, this will render the styled table
  display(table4_panel_a)
The run_complete_study function returns a single, comprehensive dictionary with the following top-level keys:
- empirical_study: A deeply nested dictionary containing all raw numerical results from the empirical analysis, including performance DataFrames for the main model (screened and unscreened), benchmark results, and significance tests.
- simulation_study: A dictionary containing the raw pd.DataFrame from the Monte Carlo simulation (if run).
- reports: A dictionary containing all generated outputs, including styled pd.DataFrame objects for each table, matplotlib figure objects for plots, and the final reproducibility report.
high_dimensional_matrix_variate_diffusion_index_models/
│
├── high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb # Main implementation notebook
├── requirements.txt # Python package dependencies
├── LICENSE # MIT license file
└── README.md # This documentation file
The pipeline is customizable via the master study_manifest dictionary. Users can modify:
- The alpha_grid and factor_dimensions_grid for the empirical study.
- The train_test_split_config to change the out-of-sample period.
- All parameters for the Monte Carlo simulations (p, q, T, DGPs, etc.).
- The screening_thresholds for the supervised refinement step.
Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.
This project is licensed under the MIT License. See the LICENSE file for details.
If you use this code or the methodology in your research, please cite the original paper:
@article{ma2025high,
title={High-Dimensional Matrix-Variate Diffusion Index Models for Time Series Forecasting},
author={Ma, Zhiren and Zhao, Qian and Zhang, Riquan and Gao, Zhaoxing},
journal={arXiv preprint arXiv:2508.04259},
year={2025}
}
For the implementation itself, you may cite this repository:
Chirinda, C. (2025). A Python Implementation of "High-Dimensional Matrix-Variate Diffusion Index Models for Time Series Forecasting".
GitHub repository: https://github.com/chirindaopensource/high_dimensional_matrix_variate_diffusion_index_models
- Credit to Zhiren Ma, Qian Zhao, Riquan Zhang, and Zhaoxing Gao for their clear and insightful research.
- Thanks to the developers of the scientific Python ecosystem (numpy, pandas, scipy, statsmodels, scikit-learn) that makes this work possible.
--
This README was generated based on the structure and content of high_dimensional_matrix_variate_diffusion_index_models_draft.ipynb and follows best practices for research software documentation.