34 changes: 2 additions & 32 deletions .github/workflows/main.yml
@@ -5,33 +5,7 @@ on:
branches: [ main ]

jobs:
Check-MDN-Changes:
runs-on: ubuntu-latest
outputs:
mdn_changed: ${{ steps.check.outputs.mdn_changed }}
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 2
- name: Check for MDN-related file changes
id: check
run: |
# Get list of changed files in this push
CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD)
echo "Changed files:"
echo "$CHANGED_FILES"

# Check if any MDN-related files were changed
if echo "$CHANGED_FILES" | grep -qE "(mdn|MDN)"; then
echo "mdn_changed=true" >> $GITHUB_OUTPUT
echo "MDN-related files were changed"
else
echo "mdn_changed=false" >> $GITHUB_OUTPUT
echo "No MDN-related files were changed"
fi

Test:
needs: Check-MDN-Changes
runs-on: ubuntu-latest
strategy:
matrix:
@@ -56,12 +30,8 @@ jobs:
run: |
sudo Rscript -e 'install.packages("StatMatch", repos="https://cloud.r-project.org")'
sudo Rscript -e 'install.packages("clue", repos="https://cloud.r-project.org")'
- name: Install full dependencies without MDN (Python 3.13)
if: matrix.python-version == '3.13' && needs.Check-MDN-Changes.outputs.mdn_changed != 'true'
run: |
uv pip install -e ".[dev,docs,matching,images]" --system
- name: Install full dependencies with MDN (Python 3.13)
if: matrix.python-version == '3.13' && needs.Check-MDN-Changes.outputs.mdn_changed == 'true'
- name: Install full dependencies (Python 3.13)
if: matrix.python-version == '3.13'
run: |
uv pip install -e ".[dev,docs,matching,mdn,images]" --system
- name: Install minimal dependencies (Python 3.12)
1 change: 1 addition & 0 deletions README.md
@@ -9,6 +9,7 @@ Microimpute enables variable imputation through a variety of statistical methods
- **Ordinary Least Squares (OLS)**: Linear regression-based imputation
- **Quantile Regression**: Distribution-aware regression imputation
- **Quantile Random Forests (QRF)**: Non-parametric forest-based approach
- **Mixture Density Networks (MDN)**: Neural network with Gaussian mixture approximation head

### Automated method selection
- **AutoImpute**: Automatically compares and selects the best imputation method for your data
4 changes: 4 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,4 @@
- bump: minor
changes:
added:
- Updates to documentation and Myst deployment.
52 changes: 44 additions & 8 deletions docs/imputation-benchmarking/metrics.md
@@ -116,6 +116,8 @@ $$W_p(P, Q) = \left(\inf_{\gamma \in \Pi(P, Q)} \int_{X \times Y} d(x, y)^p \, d\gamma(x, y)\right)^{1/p}$$

where $\Pi(P, Q)$ denotes the set of all joint distributions whose marginals are $P$ and $Q$ respectively. The Wasserstein distance measures the minimum "work" required to transform one distribution into another, where work is the amount of distribution mass moved times the distance moved. Lower values indicate better preservation of the original distribution's shape.

When sample weights are provided, the weighted Wasserstein distance accounts for varying observation importance, which is essential when comparing survey data with different sampling designs. We use scipy's `wasserstein_distance` implementation, which supports sample weights via the `u_weights` and `v_weights` parameters.
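
For illustration, here is a minimal sketch of the weighted comparison using scipy directly; the array names are placeholders rather than anything defined by the package:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Placeholder arrays standing in for an original and an imputed numeric variable
donor_vals = np.array([10.0, 12.0, 15.0, 40.0])
receiver_vals = np.array([11.0, 13.0, 14.0, 38.0])

# Survey-style sample weights giving observations different importance
donor_w = np.array([1.0, 2.0, 1.0, 0.5])
receiver_w = np.array([1.5, 1.0, 1.0, 1.0])

# Weighted Wasserstein distance: each observation's mass is scaled by its weight
dist = wasserstein_distance(
    donor_vals, receiver_vals, u_weights=donor_w, v_weights=receiver_w
)
print(dist)
```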

### Kullback-Leibler divergence

For discrete distributions (categorical and boolean variables), KL divergence quantifies how one probability distribution diverges from a reference:
@@ -124,24 +126,56 @@ $$D_{KL}(P||Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)$$

where $P$ is the reference distribution (original data), $Q$ is the approximation (imputed data), and $\mathcal{X}$ is the set of all possible categorical values. KL divergence measures how much information is lost when using the imputed distribution to approximate the true distribution. Lower values indicate better preservation of the original categorical distribution.

When sample weights are provided, the probability distributions are computed as weighted proportions rather than simple counts, ensuring proper comparison of weighted survey data.
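
For intuition, here is a small sketch of the weighted proportions, mirroring the weighted-count logic used in `kl_divergence` (the values and weights are illustrative only):

```python
import numpy as np
import pandas as pd

values = np.array(["a", "a", "b", "c"])
weights = np.array([2.0, 1.0, 1.0, 4.0])

# Unweighted proportions: simple relative frequencies
unweighted = pd.Series(values).value_counts(normalize=True)

# Weighted proportions: each category's share of the total weight
grouped = (
    pd.DataFrame({"value": values, "weight": weights})
    .groupby("value")["weight"]
    .sum()
)
weighted = grouped / grouped.sum()

print(unweighted)
print(weighted)
```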

### kl_divergence

Computes the Kullback-Leibler divergence between two categorical distributions, with optional sample weights.

```python
def kl_divergence(
donor_values: np.ndarray,
receiver_values: np.ndarray,
donor_weights: Optional[np.ndarray] = None,
receiver_weights: Optional[np.ndarray] = None,
) -> float
```

| Parameter | Type | Default used | Description |
|-----------|------|---------|-------------|
| donor_values | np.ndarray | - | Categorical values from donor data (reference distribution) |
| receiver_values | np.ndarray | - | Categorical values from receiver data (approximation) |
| donor_weights | np.ndarray | None | Optional sample weights for donor values |
| receiver_weights | np.ndarray | None | Optional sample weights for receiver values |

Returns KL divergence value (float >= 0), where 0 indicates identical distributions.
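
A minimal usage sketch, assuming `kl_divergence` can be imported from `microimpute.comparisons.metrics` and using made-up arrays:

```python
import numpy as np
from microimpute.comparisons.metrics import kl_divergence

donor_values = np.array(["yes", "no", "no", "yes", "yes"])
receiver_values = np.array(["yes", "no", "yes", "no", "no"])

# Unweighted comparison
d_plain = kl_divergence(donor_values, receiver_values)

# Weighted comparison with survey-style sample weights
donor_weights = np.array([1.0, 0.5, 2.0, 1.0, 1.0])
receiver_weights = np.array([1.0, 1.0, 1.0, 2.0, 0.5])
d_weighted = kl_divergence(
    donor_values,
    receiver_values,
    donor_weights=donor_weights,
    receiver_weights=receiver_weights,
)
```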

### compare_distributions

Compares distributions between donor and receiver data, automatically selecting the appropriate metric based on variable type and supporting sample weights for survey data.

```python
def compare_distributions(
donor_data: pd.DataFrame,
receiver_data: pd.DataFrame,
imputed_variables: List[str],
donor_weights: Optional[Union[pd.Series, np.ndarray]] = None,
receiver_weights: Optional[Union[pd.Series, np.ndarray]] = None,
) -> pd.DataFrame
```

| Parameter | Type | Description |
|-----------|------|-------------|
| donor_data | pd.DataFrame | Original donor data |
| receiver_data | pd.DataFrame | Receiver data with imputations |
| imputed_variables | List[str] | Variables to compare |
| Parameter | Type | Default used | Description |
|-----------|------|---------|-------------|
| donor_data | pd.DataFrame | - | Original donor data |
| receiver_data | pd.DataFrame | - | Receiver data with imputations |
| imputed_variables | List[str] | - | Variables to compare |
| donor_weights | pd.Series or np.ndarray | None | Sample weights for donor data (must match donor_data length) |
| receiver_weights | pd.Series or np.ndarray | None | Sample weights for receiver data (must match receiver_data length) |

Returns a DataFrame with columns `Variable`, `Metric`, and `Distance`. The function automatically selects Wasserstein distance for numerical variables and KL divergence for categorical variables.

Note that data must not contain null or infinite values. If your data contains such values, filter them before calling this function.
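
As a short sketch of that requirement (frame and column names are illustrative), rows containing nulls in the compared variables can be dropped first, taking care to subset any weights with the same mask:

```python
# Illustrative only: drop rows with nulls in the compared variables,
# keeping the sample weights aligned with the filtered rows.
cols = ["income", "wealth"]
donor_mask = donor_data[cols].notna().all(axis=1)
receiver_mask = receiver_data[cols].notna().all(axis=1)

dist_df = compare_distributions(
    donor_data=donor_data[donor_mask],
    receiver_data=receiver_data[receiver_mask],
    imputed_variables=cols,
    donor_weights=donor_data.loc[donor_mask, "sample_weight"],
    receiver_weights=receiver_data.loc[receiver_mask, "sample_weight"],
)
```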

## Predictor analysis

Understanding which predictors contribute most to imputation quality helps with feature selection and model interpretation. These tools analyze predictor-target relationships and evaluate sensitivity to predictor selection.
@@ -251,11 +285,13 @@ metrics_df = compare_metrics(
imputed_variables=imputed_variables
)

# Evaluate distributional match
dist_df = compare_distributions(
# Evaluate distributional match with survey weights
dist_df_weighted = compare_distributions(
donor_data=donor,
receiver_data=receiver_with_imputations,
imputed_variables=imputed_variables
imputed_variables=imputed_variables,
donor_weights=donor["sample_weight"],
receiver_weights=receiver["sample_weight"],
)

# Analyze predictor importance
14 changes: 10 additions & 4 deletions docs/myst.yml
@@ -28,22 +28,28 @@ project:
- file: models/quantreg/index
children:
- file: models/quantreg/quantreg-imputation
- file: models/mdn/index
children:
- file: models/mdn/mdn-imputation
- title: Imputation and benchmarking
children:
- file: imputation-benchmarking/index
children:
- file: imputation-benchmarking/preprocessing
- file: imputation-benchmarking/cross-validation
- file: imputation-benchmarking/metrics
- file: imputation-benchmarking/visualizations
- file: imputation-benchmarking/benchmarking-methods
- file: imputation-benchmarking/imputing-across-surveys
- title: AutoImpute
children:
- file: autoimpute/index
children:
- file: autoimpute/autoimpute
- title: SCF to CPS example
- title: Use cases
children:
- file: examples/scf_to_cps/index
- file: use_cases/index
children:
- file: examples/scf_to_cps/imputing-from-scf-to-cps
- file: use_cases/scf_to_cps/imputing-from-scf-to-cps
site:
options:
logo: logo.png
23 changes: 14 additions & 9 deletions microimpute/comparisons/autoimpute_helpers.py
@@ -163,15 +163,20 @@ def prepare_data_for_imputation(
predictor_log = [c for c in log_cols if c in predictors]
predictor_asinh = [c for c in asinh_cols if c in predictors]

transformed_imputing, _ = preprocess_data(
imputing_data[predictors],
full_data=True,
train_size=train_size,
test_size=test_size,
normalize=predictor_normalize if predictor_normalize else False,
log_transform=predictor_log if predictor_log else False,
asinh_transform=predictor_asinh if predictor_asinh else False,
)
if predictor_normalize or predictor_log or predictor_asinh:
transformed_imputing, _ = preprocess_data(
imputing_data[predictors],
full_data=True,
train_size=train_size,
test_size=test_size,
normalize=(
predictor_normalize if predictor_normalize else False
),
log_transform=predictor_log if predictor_log else False,
asinh_transform=predictor_asinh if predictor_asinh else False,
)
else:
transformed_imputing = imputing_data[predictors].copy()

training_data = transformed_training
if weight_col:
95 changes: 85 additions & 10 deletions microimpute/comparisons/metrics.py
@@ -8,7 +8,7 @@
"""

import logging
from typing import Dict, List, Literal, Optional, Tuple
from typing import Dict, List, Literal, Optional, Tuple, Union

import numpy as np
import pandas as pd
@@ -497,7 +497,10 @@ def compare_metrics(


def kl_divergence(
donor_values: np.ndarray, receiver_values: np.ndarray
donor_values: np.ndarray,
receiver_values: np.ndarray,
donor_weights: Optional[np.ndarray] = None,
receiver_weights: Optional[np.ndarray] = None,
) -> float:
"""Calculate Kullback-Leibler (KL) Divergence between two categorical distributions.

@@ -512,6 +515,10 @@ Args:
Args:
donor_values: Array of categorical values from donor data (reference distribution P).
receiver_values: Array of categorical values from receiver data (approximation Q).
donor_weights: Optional weights for donor values. If provided, computes
weighted probability distribution.
receiver_weights: Optional weights for receiver values. If provided,
computes weighted probability distribution.

Returns:
KL divergence value >= 0, where 0 indicates identical distributions
@@ -536,9 +543,30 @@
np.unique(donor_values), np.unique(receiver_values)
)

# Calculate probability distributions
donor_counts = pd.Series(donor_values).value_counts(normalize=True)
receiver_counts = pd.Series(receiver_values).value_counts(normalize=True)
# Calculate probability distributions (weighted if weights provided)
if donor_weights is not None:
# Compute weighted probabilities
donor_df = pd.DataFrame(
{"value": donor_values, "weight": donor_weights}
)
donor_grouped = donor_df.groupby("value")["weight"].sum()
donor_total = donor_grouped.sum()
donor_counts = donor_grouped / donor_total
else:
donor_counts = pd.Series(donor_values).value_counts(normalize=True)

if receiver_weights is not None:
# Compute weighted probabilities
receiver_df = pd.DataFrame(
{"value": receiver_values, "weight": receiver_weights}
)
receiver_grouped = receiver_df.groupby("value")["weight"].sum()
receiver_total = receiver_grouped.sum()
receiver_counts = receiver_grouped / receiver_total
else:
receiver_counts = pd.Series(receiver_values).value_counts(
normalize=True
)

# Create probability arrays for all categories
p_donor = np.array([donor_counts.get(cat, 0.0) for cat in all_categories])
@@ -563,6 +591,8 @@ def compare_distributions(
donor_data: pd.DataFrame,
receiver_data: pd.DataFrame,
imputed_variables: List[str],
donor_weights: Optional[Union[pd.Series, np.ndarray]] = None,
receiver_weights: Optional[Union[pd.Series, np.ndarray]] = None,
) -> pd.DataFrame:
"""Compare distributions between donor and receiver data for imputed variables.

@@ -574,6 +604,10 @@
donor_data: DataFrame containing original donor data.
receiver_data: DataFrame containing receiver data with imputations.
imputed_variables: List of variable names to compare.
donor_weights: Optional array or Series of sample weights for donor data.
Must have same length as donor_data.
receiver_weights: Optional array or Series of sample weights for receiver
data. Must have same length as receiver_data.

Returns:
DataFrame with columns 'Variable', 'Metric', and 'Distance' containing
@@ -608,14 +642,45 @@
receiver_data, imputed_variables, "receiver_data"
)

# Convert weights to numpy arrays if provided
donor_weights_arr = None
receiver_weights_arr = None
if donor_weights is not None:
donor_weights_arr = np.asarray(donor_weights)
if len(donor_weights_arr) != len(donor_data):
raise ValueError(
f"donor_weights length ({len(donor_weights_arr)}) must match "
f"donor_data length ({len(donor_data)})"
)
if receiver_weights is not None:
receiver_weights_arr = np.asarray(receiver_weights)
if len(receiver_weights_arr) != len(receiver_data):
raise ValueError(
f"receiver_weights length ({len(receiver_weights_arr)}) must "
f"match receiver_data length ({len(receiver_data)})"
)

results = []

# Detect metric type and compute distance for each variable
detector = VariableTypeDetector()
for var in imputed_variables:
# Get values from both datasets
donor_values = donor_data[var].dropna().values
receiver_values = receiver_data[var].dropna().values
donor_values = donor_data[var].values
receiver_values = receiver_data[var].values

# Check for null values - these are not allowed when comparing
if np.any(pd.isna(donor_values)):
raise ValueError(
f"Variable '{var}' in donor_data contains null values. "
"Please remove or impute null values before comparing "
"distributions."
)
if np.any(pd.isna(receiver_values)):
raise ValueError(
f"Variable '{var}' in receiver_data contains null values. "
"Please remove or impute null values before comparing "
"distributions."
)

if len(donor_values) == 0 or len(receiver_values) == 0:
log.warning(
@@ -633,14 +698,24 @@
if var_type in ["bool", "categorical", "numeric_categorical"]:
# Use KL Divergence for categorical
metric_name = "kl_divergence"
distance = kl_divergence(donor_values, receiver_values)
distance = kl_divergence(
donor_values,
receiver_values,
donor_weights=donor_weights_arr,
receiver_weights=receiver_weights_arr,
)
log.debug(
f"KL divergence for categorical variable '{var}': {distance:.6f}"
)
else:
# Use Wasserstein Distance for numerical
metric_name = "wasserstein_distance"
distance = wasserstein_distance(donor_values, receiver_values)
distance = wasserstein_distance(
donor_values,
receiver_values,
u_weights=donor_weights_arr,
v_weights=receiver_weights_arr,
)
log.debug(
f"Wasserstein distance for numerical variable '{var}': {distance:.6f}"
)