CORP decomposition of "point" Brier scores for clade 25A #44
base: main
Conversation
As promised! The motivation here is that I've been interested/excited lately about CORP decomposition of scores. This is a version of the classic decomposition

score = miscalibration - discrimination + uncertainty

for a negatively oriented score, so better discrimination (the subtracted term) reduces, i.e. improves, the score. The particular CORP decomposition has some nice properties. See the paper here for an explanation of why it should be preferred: https://arxiv.org/abs/2008.03033

Approach

The algorithm is implemented in the `{reliabilitydiag}` package. All I did was take the targets pipeline, pull out the model data, and subset to 25A. I expanded out the variant frequencies to successes/failures for scoring.

However -- the `all_model_outputs_for_heatmap` target was too big for me to load into RAM and run anything else. I had to subset to 25A, convert UGA-multicast from samples to means in DuckDB, and save it to a parquet. I imagine you don't have this RAM limitation on your machine and could just run it.

variant-25a.parquet.zip

Results

(plot formatting taken from https://arxiv.org/pdf/2311.14122 for their CRPS decomposition plots)

I really like the description in `?summary.reliabilitydiag`:

> ‘miscalibration’ a measure of miscalibration (_how reliable is the prediction method?_), smaller is better.
> ‘discrimination’ a measure of discrimination (_how variable are the recalibrated predictions?_), larger is better.
> ‘uncertainty’ the mean score of a constant prediction at the value of the average observation.

There's more good stuff in there so take a look! But my gist is something like "the HMLR and CADPH-CATaLog models have more information content about clade 25A emergence than baseline after recalibration. However, both are less reliable than baseline, with this miscalibration severe enough in the HMLR model to reduce its score below baseline."

There are some fairly extensive limitations here:

* First and most importantly, this relies on the same approach as `brier_point`. It assumes "25A against all else" and scores on that, ignoring any multinomial structure (see https://arxiv.org/pdf/2311.14122 for an explanation).
* It's also using only the mean forecasts, not the uncertainty. So this captures the structural binomial uncertainty but not the model uncertainty (I imagine this might particularly penalize the HMLR).
* It's possible we could rectify this by including multiple samples from each model? I'm not sure and need to read through some more to see if that's ok.
* Needs to run on complete cases. I dropped CATaMaran because it wasn't present for the whole beginning bit of 25A and the decomposition isn't comparable across subsets of data.

I promise I won't be offended if this doesn't make it in -- I really did it for fun! If it's interesting enough to do more on, I'm happy to try to stick this in your pipeline if RAM permits.
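A minimal sketch of the expand-and-decompose step described above, assuming a data frame `df` with `sequences`, `total_sequences`, and one mean-forecast column per model (the object and model column names here are placeholders, not the script's actual ones):

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(reliabilitydiag)

# Expand counts into one 0/1 row per sequence ("25A against all else")
df_success <- df |>
  mutate(y = 1) |>
  uncount(sequences)
df_failure <- df |>
  mutate(failures = total_sequences - sequences, y = 0) |>
  uncount(failures)
df_binary <- bind_rows(df_success, df_failure)

# One reliability curve per model, all scored against the same binary outcome
rd <- reliabilitydiag(
  `UMass-HMLR` = df_binary$`UMass-HMLR`,
  baseline = df_binary$baseline,
  y = df_binary$y
)

# CORP decomposition: mean_score = miscalibration - discrimination + uncertainty
summary(rd, score = "brier")
autoplot(rd)
```

`summary()` returns the mean score plus the miscalibration, discrimination, and uncertainty components per forecast, which is where the decomposition numbers come from.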
Walkthrough: Introduces a new R analysis script that generates CORP decompositions of Brier scores for forecast evaluation using reliabilitydiag. The script loads data, performs reliability analysis, and produces visualisation plots for California and other US regions separately.
Actionable comments posted: 2
🧹 Nitpick comments (4)
analysis/generate_corp_decomposition.R (4)
5-13: Missing `cli` library in explicit loads.

The script uses `cli::cli_alert_info` at line 80, but `cli` is not loaded with the other libraries. While the namespace-qualified call will work if `cli` is installed, consider adding `library(cli)` for consistency with the other library declarations.

```diff
 library(glue)
 library(arrow)
+library(cli)
```
60-69: Consider a more robust approach for identifying model columns.

The current approach relies on maintaining a complete list of `non_model_cols`. If new columns are added to the joined data (e.g., from upstream changes), they might be incorrectly treated as model columns. Consider using a naming convention or attribute to explicitly mark model columns.
100-104: Guard against potential data integrity issue with negative failures.

If `sequences > total_sequences` due to data errors upstream, `failures` would be negative, causing `uncount()` to fail or produce unexpected results. Consider adding a validation check.

```diff
 df_failure <- data_joined |>
   mutate(failures = total_sequences - sequences) |>
+  filter(failures >= 0) |> # Guard against data errors
   select(all_of(c(model_cols, "failures"))) |>
   uncount(failures) |>
   mutate(y = 0)
```

Alternatively, add an assertion to catch data issues early:

```r
stopifnot(all(data_joined$sequences <= data_joined$total_sequences))
```
124-126: Redundant column creation.

Creating `brier_score = mean_score` is essentially a rename. Consider using `rename()` instead for clarity, or simply use `mean_score` directly in subsequent code.

```diff
-corp_summary <- corp_summary |>
-  mutate(brier_score = mean_score)
+corp_summary <- corp_summary |>
+  rename(brier_score = mean_score)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
analysis/generate_corp_decomposition.R (1 hunks)
🔇 Additional comments (4)
analysis/generate_corp_decomposition.R (4)
22-37: LGTM!

The function signature and observations preparation logic are well-structured. Using a function parameter for location filtering provides good flexibility for the different analysis runs.
49-54: Verify `pivot_wider` handles potential duplicates gracefully.

If there are multiple rows with the same `(location, nowcast_date, target_date, clade, model_id)` combination, `pivot_wider` will produce list-columns or a warning. Consider adding `values_fn = mean` or similar to handle potential duplicates explicitly.

```diff
 preds_wide <- preds_long |>
-  pivot_wider(names_from = model_id, values_from = prediction)
+  pivot_wider(
+    names_from = model_id,
+    values_from = prediction,
+    values_fn = list(prediction = mean) # or first, to handle duplicates
+  )
```
205-215: LGTM!

The California-specific analysis configuration is well-structured. Excluding `CADPH-CATaMaran` and requiring `CADPH-CATaLog` presence aligns with the PR description's note about handling incomplete coverage.
217-229: LGTM!

The configuration for non-California states appropriately removes columns that are entirely NA and excludes California-specific CADPH models.
```r
all_model_outputs_for_heatmap <- read_parquet(
  "~/variant-25a.parquet"
)
```
Hardcoded user-specific path breaks portability.
The path "~/variant-25a.parquet" is specific to your home directory. Other collaborators or CI systems will not find this file. Consider using a project-relative path or parameterising the file location.
```diff
-all_model_outputs_for_heatmap <- read_parquet(
-  "~/variant-25a.parquet"
-)
+# Option 1: Use a project-relative path
+all_model_outputs_for_heatmap <- read_parquet(
+  here::here("data", "variant-25a.parquet")
+)
+
+# Option 2: Accept as a parameter or environment variable
+parquet_path <- Sys.getenv("VARIANT_PARQUET_PATH", "data/variant-25a.parquet")
+all_model_outputs_for_heatmap <- read_parquet(parquet_path)
```

Committable suggestion skipped: line range outside the PR's diff.
```r
btitle = glue::glue(
  "Location: {location_subset_name}; Variant: 25A; Uncertainty = {unc_str}"
),
```
`btitle` is not a standard ggplot2 label and will be ignored.

The `btitle` argument in `labs()` is not a recognised ggplot2 aesthetic. This appears to be dead code that will be silently ignored. If this was intended for a custom purpose, consider documenting it; otherwise, remove it.
```diff
 labs(
   title = "CORP Decomposition of Brier Score",
   subtitle = glue::glue(
     "Location: {location_subset_name}; Variant: 25A; Dates: {min_date} to {max_date}; Uncertainty = {unc_str}"
   ),
-  btitle = glue::glue(
-    "Location: {location_subset_name}; Variant: 25A; Uncertainty = {unc_str}"
-  ),
   x = "Miscalibration (MCB) - Lower is better",
   y = "Discrimination (DSC) - Higher is better"
 ) +
```

Committable suggestion skipped: line range outside the PR's diff.
kaitejohnson left a comment
@zsusswein Thanks for including this -- it is really interesting. Sadly, I am slightly more confused about how this works now that I have seen the code, and am trying to think of how we
- might do this on the energy score -- which would have to compare the multinomial draws of the predicted observations to the true observations
- might incorporate multiple samples in this framework still using the Brier score (e.g. instead of getting the mean, do this for each sample and then summarise??)
I am happy to merge this and then consider if/how we want to include it in the targets pipeline as a separate PR.
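On the second bullet, a very rough sketch of what "do this for each sample and then summarise" could look like, assuming a long data frame `samples_binary` of per-draw forecasts already expanded to 0/1 outcomes (all names here are hypothetical, and whether averaging the CORP components over samples is statistically meaningful is exactly the open question):

```r
library(dplyr)
library(purrr)
library(reliabilitydiag)

# samples_binary is assumed to have one 0/1 row per sequence with columns:
# sample_id (posterior draw), prediction (that draw's forecast), y (outcome)
decompose_one_sample <- function(df_sample) {
  rd <- reliabilitydiag(x = df_sample$prediction, y = df_sample$y)
  summary(rd, score = "brier")
}

per_sample <- samples_binary |>
  group_split(sample_id) |>
  map(decompose_one_sample) |>
  list_rbind()

# Summarise the decomposition components across posterior samples
per_sample |>
  summarise(across(
    c(mean_score, miscalibration, discrimination, uncertainty),
    mean
  ))
```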
```r
library(arrow)


# Load the necessary data from the targets store
tar_load(clean_variant_data_final_all_states)
```
Meep, tiny problem here is that this is the final data -- it's not the rolling evaluation datasets we are using to evaluate the nowcasts (again, you will probably have a RAM issue). What you actually want to load in is `variant_data_eval_all` (all nowcast dates, all states) or `variant_data_for_eval_mult_nowcasts` (the range around 25A emergence for 3 states).
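A minimal sketch of that swap, with the target names taken from this comment and the rest of the script assumed unchanged:

```r
library(targets)

# current: final data only
# tar_load(clean_variant_data_final_all_states)

# all nowcast dates, all states:
tar_load(variant_data_eval_all)

# or, the window around 25A emergence for 3 states:
# tar_load(variant_data_for_eval_mult_nowcasts)
```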
All good questions. I'm out getting pies this morning, but I'll try to make some tweaks to clarify once I'm back.
I just meant they're not additive, so you can't just make a stacked bar chart, because the discrimination term is subtracted rather than added.

Ok, I need to think about this a bit more and discuss further with the rest of the authors on whether/how we want to include results from this decomposition in the main text/supplement. But I think we should merge this as is and I can deal with inserting it into the targets pipeline in the right place(s) -- both in the 25A emergence focused section and in the overall results summaries.

Do you have a hypothesis as to why UMass-HMLR has better discrimination overall across the 3 states, but then within the individual states it looks worse than the others? I am struggling a bit to interpret it in light of that.
Yeah I was also confused by that. I don't know what's causing it. If I were to speculate, I would think it's possible that the HMLR is better at discriminating between state trends while maintaining calibration. Or, in other words, the HMLR has between-state structure that is preserved in the recalibration procedure, but that structure doesn't survive (as much) in the other methods because either it's not present or it can't be distinguished from noise.

An alternative theory would be that it's individual sampling noise from the low-ish number of forecasts from this 25A period and the decomposition resolves better with a larger sample size -- we could check this one.

It could also be a bug. I could have left an extra something in the "all" data frame by mistake. I'll double check this morning.
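One way the sampling-noise theory could be checked is to bootstrap over nowcast dates and see how much the discrimination estimate moves. A sketch below, assuming the expanded 0/1 data frame (`df_binary`) carries a `nowcast_date` column and one forecast column per model (all names are placeholders):

```r
library(dplyr)
library(purrr)
library(reliabilitydiag)

boot_dsc <- function(df_binary, model_col, n_boot = 200) {
  dates <- unique(df_binary$nowcast_date)
  map_dbl(seq_len(n_boot), function(i) {
    # Resample nowcast dates with replacement, keeping every row for each drawn date
    resampled <- tibble(nowcast_date = sample(dates, replace = TRUE)) |>
      inner_join(df_binary, by = "nowcast_date", relationship = "many-to-many")
    rd <- reliabilitydiag(x = resampled[[model_col]], y = resampled$y)
    summary(rd, score = "brier")$discrimination
  })
}

# Wide intervals relative to the between-model differences would point to sampling noise
dsc_draws <- boot_dsc(df_binary, "UMass-HMLR")
quantile(dsc_draws, c(0.05, 0.5, 0.95))
```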








As promised!
The motivation here is that I've been interested/excited lately about
CORP decomposition of scores. This is a version of the classic decomposition:
score = miscalibration - descrimination + uncertainty
for a negatively oriented score, so it's rewarded for better (i.e., lower) discrimination.
The particular CORP decomposition has some nice properties. See the paper
here for an explanation of why it should be preferred: https://arxiv.org/abs/2008.03033
Approach
The algorithm is implemented in the
{reliabilitydiag}package. All Idid was take the targets pipeline and pull out the model data and subset
to 25A.
I expanded out the variant frequencies to successes/failures for
scoring.
However -- the
all_model_outputs_for_heatmaptarget was too big for meto load into RAM and run anything else. I had to subset to 25A and
convert UGA-multicast from samples to means in DuckDB
and save it to a parquet. I imagine you don't have this RAM limitation
on your machine and could just run it.
variant-25a.parquet.zip
Results
(plot formatting taken from https://arxiv.org/pdf/2311.14122 for their CRPS decomposition plots)
I really like the description in
?summary.reliabilitydiagThere's more good stuff in there so take a look!
But my gist is something like "the HMLR and CADPH-CATaLog models have
more information content about clade 25A emergence than baseline after recalibration.
However, both are less reliable than baseline, with this miscalibration
severe enough in the HMLR model to reduce its score below baseline."
There are some fairly extensive limitations here:
brier_point.It assumes "25A against all else" and scores on that, ignoring any
multinomial structure.
para of https://arxiv.org/pdf/2311.14122 for an explanation.
captures the structural binomial uncertainty but not the model
uncertainty (I imagine this might particularly penalize the HMLR).
from each model? I'm not sure and need to read through some more
to see if that's ok.
for the whole beginning bit of 25A and the decomposition isn't
comparable across subsets of data.
I promise I won't be offended if this doesn't make it in -- I really
did it for fun! If it's interesting enough to do more on, I'm happy to try to stick this in your pipeline if RAM permits.