Conversation

@zsusswein zsusswein commented Nov 25, 2025

As promised!

The motivation here is that I've been interested/excited lately about
CORP decomposition of scores. This is a version of the classic decomposition:

score = miscalibration - discrimination + uncertainty

for a negatively oriented score, so a forecast is rewarded for better (i.e., higher) discrimination.
The particular CORP decomposition has some nice properties. See the paper
here for an explanation of why it should be preferred: https://arxiv.org/abs/2008.03033

Approach

The algorithm is implemented in the `{reliabilitydiag}` package. All I
did was take the targets pipeline, pull out the model data, and subset
to 25A.

I expanded out the variant frequencies to successes/failures for
scoring.
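
Roughly, the expansion and scoring step looks like this (a minimal sketch, not the actual script -- `counts_df`, the count columns, and the per-model prediction columns are made-up names):

```r
library(dplyr)
library(tidyr)
library(reliabilitydiag)

# Expand per-date clade counts into one Bernoulli outcome per sequence:
# the `sequences` observed 25A sequences become y = 1, the remainder y = 0.
bernoulli_df <- bind_rows(
  counts_df |>
    mutate(n = sequences) |>
    uncount(n) |>
    mutate(y = 1),
  counts_df |>
    mutate(n = total_sequences - sequences) |>
    uncount(n) |>
    mutate(y = 0)
)

# One mean-forecast column per model; reliabilitydiag() takes several at once
# and summary() returns the CORP components for each of them.
rd <- reliabilitydiag(
  baseline = bernoulli_df$baseline_pred,
  HMLR     = bernoulli_df$hmlr_pred,
  y        = bernoulli_df$y
)
summary(rd) # mean_score, miscalibration, discrimination, uncertainty
```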

However -- the `all_model_outputs_for_heatmap` target was too big for me
to load into RAM and run anything else. I had to subset to 25A and
convert UGA-multicast from samples to means in DuckDB
and save it to a parquet. I imagine you don't have this RAM limitation
on your machine and could just run it.
variant-25a.parquet.zip
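
For reference, the DuckDB step was roughly this shape (a sketch with assumed paths and column names, not the actual target definitions):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "
  COPY (
    SELECT location, nowcast_date, target_date, clade, model_id,
           AVG(value) AS prediction -- collapses UGA-multicast samples to a mean;
                                    -- single-row (mean) outputs pass through unchanged
    FROM read_parquet('all_model_outputs_for_heatmap.parquet')
    WHERE clade = '25A'
    GROUP BY ALL
  ) TO 'variant-25a.parquet' (FORMAT PARQUET)
")
dbDisconnect(con, shutdown = TRUE)
```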

Results

[Plots: corp_decomposition_plot_CA, corp_decomposition_plot_Other]

(plot formatting taken from https://arxiv.org/pdf/2311.14122 for their CRPS decomposition plots)

I really like the description in `?summary.reliabilitydiag`:

>     ‘miscalibration’   a measure of miscalibration
>                        (how reliable is the prediction method?),
>                        smaller is better.
>     ‘discrimination’   a measure of discrimination
>                        (how variable are the recalibrated predictions?),
>                        larger is better.
>     ‘uncertainty’      the mean score of a constant prediction at the
>                        value of the average observation.

There's more good stuff in there so take a look!

But my gist is something like "the HMLR and CADPH-CATaLog models have
more information content about clade 25A emergence than baseline after recalibration.
However, both are less reliable than baseline, and in the HMLR model this miscalibration
is severe enough that its overall Brier score ends up worse than baseline's."

There are some fairly extensive limitations here:

  • First and most importantly, this relies on the same approach as `brier_point`.
    It assumes "25A against all else" and scores on that, ignoring any
    multinomial structure.
  • It's also using only the mean forecasts, not the uncertainty. So this
    captures the structural binomial uncertainty but not the model
    uncertainty (I imagine this might particularly penalize the HMLR).
    • It's possible we could rectify this by including multiple samples
      from each model? I'm not sure and need to read through some more
      to see if that's ok.
  • Needs to run on complete cases. I dropped CATaMaran because it wasn't present
    for the whole beginning bit of 25A and the decomposition isn't
    comparable across subsets of data (see the sketch after this list).
  • Manually took the mean over UGA-multicast samples. Not sure if this makes sense.
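
The complete-cases restriction above is just something like this (sketch; `preds_wide` and `model_cols` are placeholder names):

```r
library(dplyr)

# Keep only rows where every model column has a prediction, so the
# decomposition is computed on exactly the same observations for each model.
preds_complete <- preds_wide |>
  filter(if_all(all_of(model_cols), ~ !is.na(.x)))
```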

I promise I won't be offended if this doesn't make it in -- I really
did it for fun! If it's interesting enough to do more on, I'm happy to try to stick this in your pipeline if RAM permits.


coderabbitai bot commented Nov 25, 2025

Walkthrough

Introduces a new R analysis script that generates CORP decompositions of Brier scores for forecast evaluation using reliabilitydiag. The script loads data, performs reliability analysis, and produces visualisation plots for California and other US regions separately.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| CORP decomposition analysis: `analysis/generate_corp_decomposition.R` | Adds new analysis script defining `run_corp_analysis` function to construct CORP decompositions of Brier scores using reliabilitydiag. Performs data preparation, reliability analysis, and generates calibration visualisation plots with model-specific metrics for distinct geographic regions. |

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check ✅ Passed: The title clearly and specifically describes the main change: implementing a CORP decomposition analysis of Brier scores for clade 25A, which directly matches the new analysis script added to the repository.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
analysis/generate_corp_decomposition.R (4)

5-13: Missing cli library in explicit loads.

The script uses cli::cli_alert_info at line 80, but cli is not loaded with the other libraries. While the namespace-qualified call will work if cli is installed, consider adding library(cli) for consistency with the other library declarations.

 library(glue)
 library(arrow)
+library(cli)

60-69: Consider a more robust approach for identifying model columns.

The current approach relies on maintaining a complete list of non_model_cols. If new columns are added to the joined data (e.g., from upstream changes), they might be incorrectly treated as model columns. Consider using a naming convention or attribute to explicitly mark model columns.


100-104: Guard against potential data integrity issue with negative failures.

If sequences > total_sequences due to data errors upstream, failures would be negative, causing uncount() to fail or produce unexpected results. Consider adding a validation check.

     df_failure <- data_joined |>
         mutate(failures = total_sequences - sequences) |>
+        filter(failures >= 0) |>  # Guard against data errors
         select(all_of(c(model_cols, "failures"))) |>
         uncount(failures) |>
         mutate(y = 0)

Alternatively, add an assertion to catch data issues early:

stopifnot(all(data_joined$sequences <= data_joined$total_sequences))

124-126: Redundant column creation.

Creating brier_score = mean_score is essentially a rename. Consider using rename() instead for clarity, or simply use mean_score directly in subsequent code.

-    corp_summary <- corp_summary |>
-        mutate(brier_score = mean_score)
+    corp_summary <- corp_summary |>
+        rename(brier_score = mean_score)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5efba98 and 682ab8a.

📒 Files selected for processing (1)
  • analysis/generate_corp_decomposition.R (1 hunks)
🔇 Additional comments (4)
analysis/generate_corp_decomposition.R (4)

22-37: LGTM!

The function signature and observations preparation logic are well-structured. Using a function parameter for location filtering provides good flexibility for the different analysis runs.


49-54: Verify pivot_wider handles potential duplicates gracefully.

If there are multiple rows with the same (location, nowcast_date, target_date, clade, model_id) combination, pivot_wider will produce list-columns or a warning. Consider adding values_fn = mean or similar to handle potential duplicates explicitly.

     preds_wide <- preds_long |>
-        pivot_wider(names_from = model_id, values_from = prediction)
+        pivot_wider(
+            names_from = model_id,
+            values_from = prediction,
+            values_fn = list(prediction = mean)  # or first, to handle duplicates
+        )

205-215: LGTM!

The California-specific analysis configuration is well-structured. Excluding CADPH-CATaMaran and requiring CADPH-CATaLog presence aligns with the PR description's note about handling incomplete coverage.


217-229: LGTM!

The configuration for non-California states appropriately removes columns that are entirely NA and excludes California-specific CADPH models.

Comment on lines +17 to +19
all_model_outputs_for_heatmap <- read_parquet(
"~/variant-25a.parquet"
)

⚠️ Potential issue | 🟠 Major

Hardcoded user-specific path breaks portability.

The path "~/variant-25a.parquet" is specific to your home directory. Other collaborators or CI systems will not find this file. Consider using a project-relative path or parameterising the file location.

-all_model_outputs_for_heatmap <- read_parquet(
-    "~/variant-25a.parquet"
-)
+# Option 1: Use a project-relative path
+all_model_outputs_for_heatmap <- read_parquet(
+    here::here("data", "variant-25a.parquet")
+)
+
+# Option 2: Accept as a parameter or environment variable
+parquet_path <- Sys.getenv("VARIANT_PARQUET_PATH", "data/variant-25a.parquet")
+all_model_outputs_for_heatmap <- read_parquet(parquet_path)

Committable suggestion skipped: line range outside the PR's diff.


Comment on lines +187 to +189
            btitle = glue::glue(
                "Location: {location_subset_name}; Variant: 25A; Uncertainty = {unc_str}"
            ),

⚠️ Potential issue | 🟡 Minor

btitle is not a standard ggplot2 label and will be ignored.

The btitle argument in labs() is not a recognised ggplot2 aesthetic. This appears to be dead code that will be silently ignored. If this was intended for a custom purpose, consider documenting it; otherwise, remove it.

         labs(
             title = "CORP Decomposition of Brier Score",
             subtitle = glue::glue(
                 "Location: {location_subset_name}; Variant: 25A; Dates: {min_date} to {max_date}; Uncertainty = {unc_str}"
             ),
-            btitle = glue::glue(
-                "Location: {location_subset_name}; Variant: 25A; Uncertainty = {unc_str}"
-            ),
             x = "Miscalibration (MCB) - Lower is better",
             y = "Discrimination (DSC) - Higher is better"
         ) +

Committable suggestion skipped: line range outside the PR's diff.



@kaitejohnson kaitejohnson left a comment


@zsusswein Thanks for including this -- it is really interesting. Sadly, I am slightly more confused about how this works now that I have seen the code and am trying to think of how we

  1. might do this on the energy score -- which would have to compare the multinomial draws of the predicted observations to the true observations
  2. might incorporate multiple samples in this framework still using the Brier score (e.g. instead of getting the mean, do this for each sample and then summarise??)

I am happy to merge this and then consider if/how we want to include it in the targets pipeline as a separate PR.

library(arrow)

# Load the necessary data from the targets store
tar_load(clean_variant_data_final_all_states)
Collaborator

Meep, tiny problem here: this is the final data -- it's not the rolling evaluation datasets we are using to evaluate the nowcasts (again, you will probably have a RAM issue). What you actually want to load in is `variant_data_eval_all` (all nowcast dates, all states) or `variant_data_for_eval_mult_nowcasts` (the range around 25A emergence for 3 states).

@zsusswein (Author)

All good questions.

I'm out getting pies this morning, but I'll try to make some tweaks to clarify once I'm back.


zsusswein commented Nov 28, 2025

Ok, I'm going to give a little more detail on how this works, then go through the questions below. Taking from the paper linked above:

The decomposition takes (1) the forecasted probabilities $x$ and (2) the observed Bernoulli outcomes $y$. It uses those to score the forecast with the Brier score $S_X = BS(x, y)$. Then it decomposes the Brier score into the discrimination, miscalibration, and uncertainty components.

This decomposition is performed with the provided forecast and additional bounds on performance from two more forecasts: $r$ and $c$.

The "reference" forecast $r$ is constructed using the mean over the observed $y$ values

$$r = \sum^{n}_{i = 1} \frac{y_{i}}{n}$$

One of the papers linked above refers to this as a "climatological" forecast in weather. I read this as something like "If all we knew was it rained 20% of days, we could say every day has a 20% chance of rain and it would be a properly calibrated baseline." It also wouldn't be a very good baseline! So this is used as a lower bound on performance -- the uncertainty component. The uncertainty component is produced by scoring $r$: $S_R = BS(r, y) = \text{uncertainty}$.
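
As a side note (my quick derivation, not a quote from the paper): because $y_i^2 = y_i$ for binary outcomes and $r = \bar{y}$, this uncertainty term has a closed form,

$$\text{uncertainty} = S_R = \frac{1}{n}\sum_{i=1}^{n} (r - y_i)^2 = \bar{y} - \bar{y}^2 = \bar{y}(1 - \bar{y}),$$

so it depends only on the average observed frequency.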

The other forecast $c$ is a calibrated version of the original forecast $x$. Mechanically, it takes the original provided forecast $x$ and bins the predicted probabilities to produce the event probability of $y=1$ conditional on the forecast value $x$. The details of the binning procedure are the point of the paper but they show doing it using their CORP method has nice properties.

But the key bit is if the binned probabilities don't match the conditional event probabilities (CEPs) then the forecast is miscalibrated. We're predicting something happens 10% of the time when it really happens 20% of the time. So it recalibrates the forecast by adjusting $x$ to become $c$. Visually, this is taking the below plot (of the Hub-baseline forecasts and 25A) and projecting the red points of the true forecast CEPs horizontally onto the 1-to-1 line.

[reliability diagram of the Hub-baseline 25A forecasts]

(this plot is produced by {reliabilitydiag} for each model to visually evaluate model calibration across the range of outcomes)

In the best case, $c$ would be stretched all along the 1-1 line and be great at distinguishing across regimes in $y$. In the worst case, $c$ would be tightly clustered around the marginal probability $r$ used as the reference forecast and have very little ability to distinguish anything beyond the average probability.

The recalibrated forecast $c$ is scored $S_C = BS(c, y)$ and, together with $S_X$ and $S_R$, is used for the decomposition.

[screenshot of the decomposition from the paper]
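
In symbols (my paraphrase of the decomposition, using the scores $S_X$, $S_C$, and $S_R$ defined above):

$$\text{MCB} = S_X - S_C, \qquad \text{DSC} = S_R - S_C, \qquad \text{UNC} = S_R,$$

which recombine to give back the original score:

$$S_X = \text{MCB} - \text{DSC} + \text{UNC}.$$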

So MCB says "how much of my potential score am I giving up due to miscalibration?", DSC says "how good could my score possibly be after correcting for miscalibration?" and UNC says "how hard is my forecasting problem?"


> I think it's fine to do this on the mean as a starting point, but I am a little bit confused fundamentally how the "recalibration" works when we are only assessing a point forecast/nowcast? I understand that the metric isn't actually recalibrating the forecast, but how do I interpret this?

> Ok, update, because you are taking into account the uncertainty in the multinomial observation process I think I can maybe follow this, though I still don't think that's the "right"/only uncertainty we want to evaluate against as we want to evaluate the calibration of the forecaster's posterior estimate of the latent clade proportions (and here we are just evaluating their mean).

I agree. I hope the above explanation clears up how the recalibration works, but for sure it's only recalibrating errors in the mean not posterior uncertainty in the mean's value or the full multinomial posterior predictive. I don't think this is the right/only thing in an ideal world but, as you say, I think the only way around that is to score the multinomial outcomes.

It's tricky because these are Bernoulli outcomes -- the mean defines the variance so errors in the mean are errors in calibration. We can't provide a distribution of draws as independent outcomes because that would tell the recalibration procedure we observed $n_{draws}$ times as many outcomes as we really did.

But provided you're comfortable using the Brier score to evaluate one-against-all (and I think you are!), using the mean is probably the right thing here. It allows us to identify systematic errors and biases and their contributions to the mean score. I'm not sure what decomposing by draw would add on top of that?

If we wanted to do further breakdowns, I think it would make sense to partition and do the decomposition on the means by nowcast horizon or by clade emerging/dominant.

> Is this required just for evaluating the Brier score since it expects binary data (unless you use the calculation we used in the global variant sequencing paper where you can compute the brier score with the sequence counts by clade and total sequences)?

Yes, the CORP Brier score decomposition works on predictions in [0, 1] and binary outcomes in {0, 1}.

> You labeled the figures as being on the Brier score -- but what about this uses the Brier score?

Answered above, hopefully!

> How do I interpret a mean uncertainty across all the models?

The comment is misleading -- this is a good flag.

Because the decomposition is only comparable across sets of forecasts on the same data, I subset to only dates with all models submitting. The uncertainty is then the Brier score of a constant forecast at the average 25A frequency across all nowcast dates, which is the same for all models.

> might do this on the energy score -- which would have to compare the multinomial draws of the predicted observations to the true observations

This is a really good question. There's an extension to do this decomposition on the CRPS. It's the same intuition except (1) it uses the empirical CDF of $y$ (rather than the mean) as the reference forecast for the uncertainty component and (2) it does take an ensemble of draws (the full posterior predictive).

Because the CRPS is a special case of the more general energy score, I bet the CRPS version of the decomposition could be extended to the energy score. But I don't know how! It seems tricky to me to set up the baseline properly and I don't know what exactly the recalibration would do with the correlation bits.

> might incorporate multiple samples in this framework still using the Brier score (e.g. instead of getting the mean, do this for each sample and then summarise??)

Same as above -- I agree it's totally feasible and maybe adds something? But not sure what!

@kaitejohnson (Collaborator)

Thanks for the explanation -- this is making a lot more sense now, and I am now following the calibration. I see why it might not really make sense to extend to the full posterior distribution on the Brier score.

I do think it would be interesting but outside of the scope of this work to see if the decomposition works on the energy score. If we wanted something on the full distribution, we could compute the decomposition of the CRPS?

I think this might be a really nice add on to the 25A emergence figure -- we could add a set of panels with the miscalibration, discrimination, and uncertainty? A bit confusing because they don't neatly sum to the score but we could plot each separately?

[Figure: fig_zoom_25A]

@zsusswein (Author)

> I do think it would be interesting but outside of the scope of this work to see if the decomposition works on the energy score. If we wanted something on the full distribution, we could compute the decomposition of the CRPS?

Yeah +1 to all of this

> A bit confusing because they don't neatly sum to the score but we could plot each separately

No they should sum exactly to the Brier score? They won't sum to the Energy score though

> I think this might be a really nice add on to the 25A emergence figure -- we could add a set of panels with the miscalibration, discrimination, and uncertainty? A bit confusing because they don't neatly sum to the score but we could plot each separately?

Sounds like a fun idea. I put some options below. My preference is for the one averaged over the 3 states but I defer to your vision here!

Here's one averaged over all 3 locations for your set of dates

[plot: decomposition averaged over the 3 locations]

And here's one per-state:

[per-state plots: CA, IL, MN]

@zsusswein (Author)

And this one is a little much but it puts them all on the same plot

[combined plot: all three states on one panel]


kaitejohnson commented Dec 1, 2025

> No they should sum exactly to the Brier score? They won't sum to the Energy score though

I just meant they're not additive so you can't just make a stacked bar chart because score = mcb - dsc + unc

Ok, I need to think about this a bit more and discuss further with the rest of the authors whether/how we want to include results from this decomposition in the main text/supplement. But I think we should merge this as is, and I can deal with inserting it into the targets pipeline in the right place(s) -- both in the 25A emergence-focused section and in the overall results summaries.

Do you have a hypothesis as to why UMass-HMLR has better discrimination overall across the 3 states, but then within the individual states it looks worse than the others? I am struggling a bit to interpret in light of that?


zsusswein commented Dec 1, 2025

> Do you have a hypothesis as to why UMass-HMLR has better discrimination overall across the 3 states, but then within the individual states it looks worse than the others? I am struggling a bit to interpret in light of that?

Yeah I was also confused by that. I don't know what's causing it.

If I were to speculate, I would think it's possible that the HMLR is better at discriminating between state trends while maintaining calibration. Or, in other words, the HMLR has between-state structure that is preserved in the recalibration procedure, but that structure doesn't survive (as much) in the other methods because either it's not present or it can't be distinguished from noise.

An alternative theory would be that it's individual sampling noise from the low-ish number of forecasts from this 25A period and the decomposition resolves better with a larger sample size -- we could check this one.

It could also be a bug. I could have left an extra something in the "all" data frame by mistake. I'll double check this morning.
