Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
71e3e38
updated for anomaly score input
devonjkohler Apr 13, 2025
81a9a1f
Spectronaut Anomaly Model
devonjkohler Apr 29, 2025
d1059ee
data health function
devonjkohler May 13, 2025
e1c19cc
exposed iso forest hyperparams to user
devonjkohler May 13, 2025
47b758f
added auto option for max_depth
devonjkohler May 13, 2025
a1552b9
flexible temporal params
devonjkohler May 23, 2025
7db28eb
handle missing temporal values
devonjkohler May 26, 2025
843c586
bug in column selection
devonjkohler May 27, 2025
a0908ac
Merge branch 'feature-anomaly_model' of https://github.com/Vitek-Lab/…
devonjkohler May 27, 2025
c28dcfa
fix(anomaly-model): Add n_trees and other variables for parallelization
tonywu1999 Jun 18, 2025
2841f12
run devtools::document after renaming runAnomalyModel file name
tonywu1999 Jun 19, 2025
11d48af
fix namespace
tonywu1999 Jun 19, 2025
3704c0c
fix build by leaving default option for .summarizeMultipleMeasurements
tonywu1999 Jun 19, 2025
371535a
added logging
devonjkohler Jun 23, 2025
830df2b
fragment loss type
devonjkohler Jul 8, 2025
6db1141
revert
devonjkohler Jul 8, 2025
d3e3750
documentation for anomaly model
devonjkohler Jul 11, 2025
845bd24
updated for anomaly score input
devonjkohler Apr 13, 2025
6ad87bc
Spectronaut Anomaly Model
devonjkohler Apr 29, 2025
b2d1df4
data health function
devonjkohler May 13, 2025
e82625a
exposed iso forest hyperparams to user
devonjkohler May 13, 2025
4803d4e
added auto option for max_depth
devonjkohler May 13, 2025
fdb863d
bug in column selection
devonjkohler May 27, 2025
71bad8d
flexible temporal params
devonjkohler May 23, 2025
e6bcd1b
handle missing temporal values
devonjkohler May 26, 2025
8ca9b36
fix(anomaly-model): Add n_trees and other variables for parallelization
tonywu1999 Jun 18, 2025
fc641fc
run devtools::document after renaming runAnomalyModel file name
tonywu1999 Jun 19, 2025
0b9c6ea
fix namespace
tonywu1999 Jun 19, 2025
6ab7fa2
fix build by leaving default option for .summarizeMultipleMeasurements
tonywu1999 Jun 19, 2025
b2041ce
added logging
devonjkohler Jun 23, 2025
ddd19c6
fragment loss type
devonjkohler Jul 8, 2025
1744ad4
revert
devonjkohler Jul 8, 2025
e325f43
documentation for anomaly model
devonjkohler Jul 11, 2025
e12c16a
tests(spectronaut): add unit testing code for anamoly model
tonywu1999 Jul 29, 2025
0121d24
Merge branch 'feature-anomaly_model' of https://github.com/Vitek-Lab/…
devonjkohler Aug 4, 2025
f1e2e4a
test(anomaly_model): finalized unit tests for anomaly subfunctions
tonywu1999 Aug 22, 2025
d58e315
example data
devonjkohler Sep 4, 2025
da8fa5f
clip anomaly scores
devonjkohler Sep 4, 2025
98706b3
refactor validation code to be robust to other converters
tonywu1999 Sep 8, 2025
9ebd83e
Merge pull request #110 from Vitek-Lab/tony-feature-anomaly_model
devonjkohler Sep 9, 2025
aa1e754
Update R/converters_SpectronauttoMSstatsFormat.R
devonjkohler Sep 9, 2025
2ccd44b
added imports
devonjkohler Sep 9, 2025
3decde3
merge
devonjkohler Sep 9, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 9 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -24,13 +24,16 @@ Imports:
methods,
checkmate,
utils,
stringi
stringi,
Rcpp,
parallel
Suggests:
tinytest,
covr,
knitr,
rmarkdown,
arrow
arrow,
rmarkdown
LinkingTo: Rcpp
Collate:
'clean_ProteinProspector.R'
'clean_Metamorpheus.R'
Expand All @@ -46,6 +49,7 @@ Collate:
'clean_MaxQuant.R'
'clean_DIAUmpire.R'
'MSstatsConvert_core_functions.R'
'RcppExports.R'
'converters_DIANNtoMSstatsFormat.R'
'converters_DIAUmpiretoMSstatsFormat.R'
'converters_FragPipetoMSstatsFormat.R'
Expand All @@ -60,10 +64,12 @@ Collate:
'converters_SpectronauttoMSstatsFormat.R'
'utils_MSstatsConvert.R'
'utils_annotation.R'
'utils_anomaly_score.R'
'utils_balanced_design.R'
'utils_checks.R'
'utils_classes.R'
'utils_clean_features.R'
'utils_data_health.R'
'utils_documentation.R'
'utils_dt_operations.R'
'utils_filtering.R'
Expand Down
5 changes: 5 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,11 @@

S3method(as.data.frame,MSstatsValidated)
S3method(as.data.table,MSstatsValidated)
export(CheckDataHealth)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

New export CheckDataHealth — looks good; verify man and implementation exist

Export aligns with the PR goals. Please confirm the function is defined and documented so R CMD check doesn’t warn about missing objects.


🏁 Script executed:

#!/bin/bash
rg -nP '(^|::)CheckDataHealth\b' R
rg -n '\bCheckDataHealth\b' man | rg '\.Rd$' -n

Length of output: 72


Add documentation for exported function CheckDataHealth
Implementation found at R/MSstatsConvert_core_functions.R:579, but no corresponding man page was detected. Please add a man/CheckDataHealth.Rd file to prevent R CMD check warnings.

🤖 Prompt for AI Agents
In NAMESPACE around line 5, the function CheckDataHealth is exported but lacks a
man page; create man/CheckDataHealth.Rd documenting the exported function
implemented in R/MSstatsConvert_core_functions.R at or near line 579. Add
standard Rd sections: \name, \alias (CheckDataHealth), \title, \description,
\usage (matching the R function signature), \arguments (document each
parameter), \value (describe return value), \examples (minimal reproducible
example), and \seealso or \references if relevant, and include \keyword and
\export if your build uses those; ensure the usage/signature and argument names
match the source so R CMD check warnings are resolved.

export(DIANNtoMSstatsFormat)
export(DIAUmpiretoMSstatsFormat)
export(FragPipetoMSstatsFormat)
export(MSstatsAnomalyScores)
export(MSstatsBalancedDesign)
export(MSstatsClean)
export(MSstatsImport)
Expand All @@ -25,7 +27,9 @@ export(getDataType)
export(getInputFile)
exportMethods(getDataType)
exportMethods(getInputFile)
import(Rcpp)
import(data.table)
import(parallel)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

parallel import: either use it or drop it; also list it in DESCRIPTION

If you’re not calling parallel::* explicitly, remove this import to avoid check notes. If you do use it, add parallel to DESCRIPTION Imports.

Apply if needed (DESCRIPTION):

 Imports: 
     data.table,
     log4r,
     methods,
     checkmate,
     utils,
     stringi,
     Rcpp,
+    parallel,

🏁 Script executed:

#!/bin/bash
rg -nP '\bparallel::|\blibrary\s*\(\s*parallel\s*\)' -g 'R/**'

Length of output: 197


Add parallel to DESCRIPTION Imports
Add parallel to the Imports: field in DESCRIPTION to match the import(parallel) in NAMESPACE and avoid CRAN check notes.

DESCRIPTION:

 Imports:
     data.table,
     log4r,
     methods,
     checkmate,
     utils,
     stringi,
     Rcpp,
+    parallel,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import(parallel)
Description:
...
Imports:
data.table,
log4r,
methods,
checkmate,
utils,
stringi,
Rcpp,
parallel,
...
🤖 Prompt for AI Agents
In NAMESPACE around line 32, you have import(parallel) declared but the
DESCRIPTION file's Imports: field doesn't include parallel; update the
DESCRIPTION file by adding parallel to the Imports: section (e.g., add
"parallel" to the comma-separated list of packages) so CRAN checks no longer
warn, and ensure the DESCRIPTION maintains valid formatting and dependency
ordering.

importFrom(data.table,as.data.table)
importFrom(data.table,fread)
importFrom(data.table,melt)
Expand All @@ -37,3 +41,4 @@ importFrom(log4r,file_appender)
importFrom(methods,new)
importFrom(stats,na.omit)
importFrom(utils,sessionInfo)
useDynLib(MSstatsConvert, .registration = TRUE)
99 changes: 93 additions & 6 deletions R/MSstatsConvert_core_functions.R
Original file line number Diff line number Diff line change
Expand Up @@ -322,6 +322,7 @@ setMethod("MSstatsClean", signature = "MSstatsProteinProspectorFiles",
#' names defined by the names of this list and values corresponding to its elements
#' will be added to the output `data.frame`.
#' @param aggregate_isotopic logical. If `TRUE`, isotopic peaks will by summed.
#' @param anomaly_metrics character vector of names of columns with quality metrics. Default is missing and is not required if anomaly model not run.
#' @param ... additional parameters to `data.table::fread`.
#'
#' @return data.table
Expand Down Expand Up @@ -363,7 +364,7 @@ MSstatsPreprocess = function(
summarize_multiple_psms = max),
score_filtering = list(), exact_filtering = list(),
pattern_filtering = list(), columns_to_fill = list(),
aggregate_isotopic = FALSE, ...
aggregate_isotopic = FALSE, anomaly_metrics = c(), ...
) {
.checkMSstatsParams(input, annotation, feature_columns,
remove_shared_peptides,
Expand All @@ -380,8 +381,10 @@ MSstatsPreprocess = function(
input = .handleIsotopicPeaks(input, aggregate_isotopic)
input = .filterFewMeasurements(input, 1, FALSE)
input = .handleSharedPeptides(input, remove_shared_peptides)
input = .cleanByFeature(input, feature_columns, feature_cleaning)
input = .handleSingleFeaturePerProtein(input, remove_single_feature_proteins)
input = .cleanByFeature(input, feature_columns,
feature_cleaning, anomaly_metrics)
input = .handleSingleFeaturePerProtein(input,
remove_single_feature_proteins)
input = .mergeAnnotation(input, annotation)
.fillValues(input, columns_to_fill)
.adjustIntensities(input)
Expand All @@ -406,6 +409,7 @@ MSstatsPreprocess = function(
#' If "na_to_zero", missing values will be replaced by zeros.
#' @param remove_few lgl, if TRUE, features with one or two measurements
#' across runs will be removed.
#' @param anomaly_metrics character vector of names of columns with quality metrics
#'
#' @export
#' @return data.frame of class `MSstatsValidated`
Expand All @@ -422,7 +426,7 @@ MSstatsPreprocess = function(
#'
MSstatsBalancedDesign = function(input, feature_columns, fill_incomplete = TRUE,
handle_fractions = TRUE, fix_missing = NULL,
remove_few = TRUE) {
remove_few = TRUE, anomaly_metrics = c()) {
feature = NULL

input[, feature := do.call(".combine", .SD), .SDcols = feature_columns]
Expand All @@ -435,7 +439,7 @@ MSstatsBalancedDesign = function(input, feature_columns, fill_incomplete = TRUE,
getOption("MSstatsLog")("INFO", msg_fractions)
getOption("MSstatsMsg")("INFO", msg_fractions)
}
input = .makeBalancedDesign(input, fill_incomplete)
input = .makeBalancedDesign(input, fill_incomplete, anomaly_metrics)
msg_balanced = paste("** Updated quantification data to make balanced design.",
"Missing values are marked by NA")
getOption("MSstatsLog")("INFO", msg_balanced)
Expand All @@ -445,7 +449,7 @@ MSstatsBalancedDesign = function(input, feature_columns, fill_incomplete = TRUE,
with = FALSE]

getOption("MSstatsLog")("INFO", "\n")
.MSstatsFormat(input)
.MSstatsFormat(input, anomaly_metrics)
}


Expand Down Expand Up @@ -512,3 +516,86 @@ MSstatsMakeAnnotation = function(input, annotation, ...) {
getOption("MSstatsMsg")("INFO", msg)
annotation
}

#' Run Anomaly Model
#'
#' @param input data.table preprocessed by the MSstatsBalancedDesign function
#' @param quality_metrics character vector of quality metrics to use in the model
#' @param temporal_direction character vector of same length as quality_metrics indicating temporal feature to create.
#' @param missing_run_count numeric, maximum allowed fraction of missing runs per feature.
#' @param n_feat numeric, maximum number of features per protein to use in the model.
#' @param run_order data.frame with two columns: Run and Order. Order should be numeric and indicate the order of runs.
#' @param n_trees numeric, number of trees to use in the isolation forest model. Default is 100.
#' @param max_depth numeric or "auto", maximum depth of each tree. Default is "auto" which sets depth to log2(N) where N is the number of runs.
#' @param cores numeric, number of cores to use for parallel processing. Default is 1.
#' @useDynLib MSstatsConvert, .registration = TRUE
#'
#' @return data.table
#' @export
MSstatsAnomalyScores = function(input, quality_metrics, temporal_direction,
missing_run_count, n_feat, run_order, n_trees,
max_depth, cores){

input = .prepareSpectronautAnomalyInput(input, quality_metrics,
run_order, n_feat,
missing_run_count)
input$PSM = paste0(input$PeptideSequence, input$PrecursorCharge)

for (i in seq_along(quality_metrics)){
if (temporal_direction[i] != FALSE){
quality_metrics = c(quality_metrics,
paste0(quality_metrics[i], ".",
temporal_direction[i]))
}
}

input = .runAnomalyModel(input,
n_trees=n_trees,
max_depth=max_depth,
cores=cores,
split_column="PSM",
quality_metrics=quality_metrics)

subset_cols = c("Run", "ProteinName", "PeptideSequence",
"PrecursorCharge", "FragmentIon",
"ProductCharge", "IsotopeLabelType",
"Condition", "BioReplicate",
"Fraction", "Intensity", "AnomalyScores",
quality_metrics)

subset_cols = subset_cols[subset_cols %in% names(input)]
input = input[, ..subset_cols]

return(input)

}

#' Takes as input the output of the SpectronauttoMSstatsFormat function and calculates various quality metrics to assess the health of the data. Requires Anomaly Detection model to be fit.
#'
#' @param input MSstats input which is the output of Spectronaut converter
#' @return list of two data.tables
#'
#' @export
CheckDataHealth = function(input){

input = as.data.table(input)

# All intensity characteristics
missing_percent = .checkMissing(input)
zero_truncated = .checkIntensityDistribution(input)

# Feature specific characteristics
input$Feature = paste(input$PeptideSequence,
input$PrecursorCharge,
input$FragmentIon,
input$ProductCharge, sep="_")
feature_data = .checkFeatureSD(input)
outlier_info = .checkFeatureOutliers(input, feature_data)
feature_data = outlier_info[[1]]
outlier_summary = outlier_info[[2]]
feature_data = .checkFeatureCoverage(input, feature_data)

skew_results = .checkAnomalySkew(input)

return(list(feature_data, skew_results))
}
7 changes: 7 additions & 0 deletions R/RcppExports.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

calculate_anomaly_score <- function(df, n_trees, max_depth) {
.Call(`_MSstatsConvert_calculate_anomaly_score`, df, n_trees, max_depth)
}

25 changes: 16 additions & 9 deletions R/clean_Spectronaut.R
Original file line number Diff line number Diff line change
@@ -1,32 +1,39 @@
#' Clean raw Spectronaut output.
#' @param msstats_object an object of class `MSstatsSpectronautFiles`.
#' @param intensity chr, specifies which column will be used for Intensity.
#' @param calculateAnomalyScores logical, whether to calculate anomaly scores
#' @param anomalyModelFeatures character vector, specifies which columns will be used for anomaly detection model. Can be NULL if calculateAnomalyScores=FALSE.
#' @return `data.table`
Comment on lines +4 to 6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add defaults to new parameters for backward compatibility

Set calculateAnomalyScores = FALSE and anomalyModelFeatures = NULL in the signature; update roxygen to document defaults. Prevents breakage at indirect call sites.

-#' @param calculateAnomalyScores logical, whether to calculate anomaly scores
-#' @param anomalyModelFeatures character vector, specifies which columns will be used for anomaly detection model. Can be NULL if calculateAnomalyScores=FALSE.
+.#' @param calculateAnomalyScores logical (default: FALSE), whether to calculate anomaly scores
+.#' @param anomalyModelFeatures character vector or NULL (default: NULL); columns used for anomaly detection model. Ignored when calculateAnomalyScores=FALSE.
@@
-.cleanRawSpectronaut = function(msstats_object, intensity,
-                                calculateAnomalyScores, 
-                                anomalyModelFeatures) {
+.cleanRawSpectronaut = function(msstats_object, intensity,
+                                calculateAnomalyScores = FALSE,
+                                anomalyModelFeatures = NULL) {

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In R/clean_Spectronaut.R around lines 4 to 6, the new parameters
calculateAnomalyScores and anomalyModelFeatures lack defaults which can break
indirect callers; change the function signature to set calculateAnomalyScores =
FALSE and anomalyModelFeatures = NULL, and update the roxygen param tags to
indicate these defaults (e.g., @param calculateAnomalyScores logical, whether to
calculate anomaly scores, default FALSE; @param anomalyModelFeatures character
vector..., default NULL). Ensure any internal uses respect the defaults and run
R CMD check/tests to confirm no further changes are required.

#' @keywords internal
.cleanRawSpectronaut = function(msstats_object, intensity) {
.cleanRawSpectronaut = function(msstats_object, intensity,
calculateAnomalyScores,
anomalyModelFeatures) {
FFrgLossType = FExcludedFromQuantification = NULL

spec_input = getInputFile(msstats_object, "input")
.validateSpectronautInput(spec_input)
spec_input = spec_input[FFrgLossType == "noloss", ]

if (is.character(spec_input$FExcludedFromQuantification)) {
spec_input = spec_input[FExcludedFromQuantification == "False", ]
} else {
spec_input = spec_input[!(as.logical(FExcludedFromQuantification)), ]
}

f_charge_col = .findAvailable(c("FCharge", "FFrgZ"), colnames(spec_input))
pg_qval_col = .findAvailable(c("PGQvalue"), colnames(spec_input))
interference_col = .findAvailable(c("FPossibleInterference"),
colnames(spec_input))
exclude_col = .findAvailable(c("FExcludedFromQuantification"),
colnames(spec_input))
cols = c("PGProteinGroups", "EGModifiedSequence", "FGCharge", "FFrgIon",
f_charge_col, "RFileName", "RCondition", "RReplicate",
"EGQvalue", pg_qval_col, paste0("F", intensity))
"EGQvalue", pg_qval_col, interference_col, exclude_col,
paste0("F", intensity))
if (calculateAnomalyScores){
cols = c(cols, anomalyModelFeatures)
}
Comment on lines +27 to +29
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fail fast if requested anomalyModelFeatures are missing

Currently missing features are silently dropped via intersect(), which can mask configuration errors. Validate when calculateAnomalyScores is TRUE.

-  if (calculateAnomalyScores){
-    cols = c(cols, anomalyModelFeatures)
-  }
+  if (isTRUE(calculateAnomalyScores)) {
+    if (is.null(anomalyModelFeatures) || length(anomalyModelFeatures) == 0L) {
+      stop("calculateAnomalyScores=TRUE requires non-empty anomalyModelFeatures.")
+    }
+    missing_feats = setdiff(anomalyModelFeatures, colnames(spec_input))
+    if (length(missing_feats)) {
+      stop(sprintf("Requested anomalyModelFeatures not found in Spectronaut input: %s",
+                   paste(missing_feats, collapse = ", ")))
+    }
+    cols = c(cols, anomalyModelFeatures)
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (calculateAnomalyScores){
cols = c(cols, anomalyModelFeatures)
}
if (isTRUE(calculateAnomalyScores)) {
if (is.null(anomalyModelFeatures) || length(anomalyModelFeatures) == 0L) {
stop("calculateAnomalyScores=TRUE requires non-empty anomalyModelFeatures.")
}
missing_feats = setdiff(anomalyModelFeatures, colnames(spec_input))
if (length(missing_feats)) {
stop(sprintf("Requested anomalyModelFeatures not found in Spectronaut input: %s",
paste(missing_feats, collapse = ", ")))
}
cols = c(cols, anomalyModelFeatures)
}
🤖 Prompt for AI Agents
In R/clean_Spectronaut.R around lines 27 to 29, when calculateAnomalyScores is
TRUE the code currently appends anomalyModelFeatures after an intersect which
can silently drop missing features; instead, validate presence first by
computing missing <- setdiff(anomalyModelFeatures, cols) and if length(missing)
> 0 call stop() with a clear message listing the missing features, otherwise
safely append anomalyModelFeatures to cols.

cols = intersect(cols, colnames(spec_input))
spec_input = spec_input[, cols, with = FALSE]
data.table::setnames(
spec_input,
c("PGProteinGroups", "EGModifiedSequence", "FGCharge", "FFrgIon",
f_charge_col, "RFileName", paste0("F", intensity), "RCondition", "RReplicate"),
f_charge_col, "RFileName", paste0("F", intensity),
"RCondition", "RReplicate"),
c("ProteinName", "PeptideSequence", "PrecursorCharge", "FragmentIon",
"ProductCharge", "Run", "Intensity", "Condition", "BioReplicate"),
skip_absent = TRUE)
Expand Down
Loading
Loading