Component(s)
processor/cumulativetodelta
What happened?
Description
When configuring max_staleness without a duration suffix (e.g., max_staleness: 300 instead of max_staleness: 300s), the processor silently accepts the configuration but interprets the value as nanoseconds instead of the intended unit. This causes the processor's state tracking to expire almost immediately, breaking delta calculation without any error or warning.
The processor appears to work correctly—metrics flow through, AggregationTemporality is changed to Delta—but the actual delta values remain identical to the cumulative values because state is lost between scrapes.
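The underlying Go behavior can be shown outside the collector. Below is a minimal, standalone sketch (not collector code) of why a bare 300 ends up meaning 300 nanoseconds:

package main

import (
    "fmt"
    "time"
)

func main() {
    // time.Duration is declared as "type Duration int64" and counts
    // nanoseconds, so an untyped integer assigns directly with no unit
    // conversion.
    var maxStaleness time.Duration = 300
    fmt.Println(maxStaleness)               // 300ns
    fmt.Println(maxStaleness < time.Second) // true: state would expire almost immediately

    // What the user intended: an explicit unit suffix parsed by time.ParseDuration.
    intended, _ := time.ParseDuration("300s")
    fmt.Println(intended) // 5m0s, i.e. 300 seconds
}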
Steps to Reproduce
- Configure the cumulativetodelta processor with a bare integer for max_staleness:

  processors:
    cumulativetodelta:
      include:
        metric_types:
          - "histogram"
      initial_value: "keep"
      max_staleness: 300 # Missing 's' suffix - interpreted as 300 nanoseconds!

- Send histogram metrics through the processor
- Observe that while AggregationTemporality shows Delta, the Count, Sum, and BucketCounts values remain cumulative (identical between scrapes when no new data arrives, instead of showing 0)
Expected Result
Either:
- Option A (Preferred): The collector should fail at startup with a clear validation error indicating that duration values require a unit suffix
- Option B: The collector should log a warning when a duration value seems unreasonably small (e.g., < 1 second for max_staleness)
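For Option B, a minimal sketch of what such a warning could look like, assuming the resolved Config and a *zap.Logger are both in scope (e.g. in the processor factory); this is illustrative only, not existing collector code:

// Hypothetical warning; the one-second threshold is an assumption.
if cfg.MaxStaleness > 0 && cfg.MaxStaleness < time.Second {
    logger.Warn("max_staleness is less than one second; did you forget a unit suffix such as 's' or 'm'?",
        zap.Duration("max_staleness", cfg.MaxStaleness))
}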
Actual Result
The configuration is silently accepted. The value 300 is interpreted as 300 nanoseconds (not 300 seconds), causing:
- State entries to expire in ~300ns (essentially immediately)
- Delta calculation to fail silently because previous values are never retained
- Metrics to pass through with AggregationTemporality: Delta but with incorrect (cumulative) values
Root Cause Analysis
The issue stems from how Go's type system interacts with the configuration parsing:
- Config struct definition (config.go):

  type Config struct {
      MaxStaleness time.Duration `mapstructure:"max_staleness"`
      // ...
  }

- confmap decoder configuration (from opentelemetry-collector/confmap/confmap.go):

  dc := &mapstructure.DecoderConfig{
      WeaklyTypedInput: false,
      DecodeHook: composehook.ComposeDecodeHookFunc(
          // ...
          mapstructure.StringToTimeDurationHookFunc(), // Only handles strings
          // ...
      ),
  }

- The parsing chain:
  - YAML parses max_staleness: 300 as an integer (not a string)
  - WeaklyTypedInput: false prevents automatic int→string conversion
  - StringToTimeDurationHookFunc() only activates for string inputs, so it's bypassed
  - Go's time.Duration is type Duration int64 (nanoseconds), allowing direct integer assignment
  - Result: 300 becomes 300 nanoseconds
- Why the fix works: max_staleness: 300s is parsed by YAML as a string, triggering StringToTimeDurationHookFunc(), which correctly calls time.ParseDuration("300s") → 300 seconds
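The decode path can be approximated in isolation with the upstream mapstructure package (the collector's confmap wraps its own copy of this decoder, so this is a simplified sketch, not the actual collector code):

package main

import (
    "fmt"
    "time"

    "github.com/mitchellh/mapstructure"
)

type cfg struct {
    MaxStaleness time.Duration `mapstructure:"max_staleness"`
}

func decode(input map[string]any) (cfg, error) {
    var out cfg
    dec, err := mapstructure.NewDecoder(&mapstructure.DecoderConfig{
        WeaklyTypedInput: false,
        DecodeHook:       mapstructure.StringToTimeDurationHookFunc(),
        Result:           &out,
    })
    if err != nil {
        return out, err
    }
    return out, dec.Decode(input)
}

func main() {
    // Bare integer: the string-to-duration hook never fires, and the int is
    // assigned directly because time.Duration is an int64 underneath.
    c1, _ := decode(map[string]any{"max_staleness": 300})
    fmt.Println(c1.MaxStaleness) // 300ns

    // String with a unit suffix: the hook calls time.ParseDuration.
    c2, _ := decode(map[string]any{"max_staleness": "300s"})
    fmt.Println(c2.MaxStaleness) // 5m0s, i.e. 300 seconds
}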
This is a known class of problem documented in go-yaml/yaml#200.
Proposed Solution
Add validation in the processor's Validate() method to catch unreasonably small duration values:
func (c *Config) Validate() error {
    // Existing validation...

    // Validate max_staleness is reasonable (if set)
    if c.MaxStaleness > 0 && c.MaxStaleness < time.Second {
        return fmt.Errorf(
            "max_staleness value %v appears to be in nanoseconds; "+
                "duration values require a unit suffix (e.g., '300s', '5m')",
            c.MaxStaleness,
        )
    }
    return nil
}

Alternatively, this could be addressed at the confmap level in the core collector by adding a decode hook that rejects raw integers for time.Duration fields, which would benefit all components.
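A rough sketch of what such a hook could look like (the function name and wiring are hypothetical, and it assumes the usual fmt, reflect, time, and mapstructure imports):

// Hypothetical sketch: reject bare numbers when the target field is a
// time.Duration, forcing an explicit unit suffix such as "300s" or "5m".
func rejectNumericDurationHookFunc() mapstructure.DecodeHookFuncType {
    return func(from reflect.Type, to reflect.Type, data any) (any, error) {
        if to == reflect.TypeOf(time.Duration(0)) {
            switch from.Kind() {
            case reflect.Int, reflect.Int64, reflect.Uint, reflect.Uint64, reflect.Float64:
                return nil, fmt.Errorf(
                    "raw number %v is not a valid duration; use a unit suffix such as \"300s\" or \"5m\"", data)
            }
        }
        return data, nil
    }
}

Composed alongside the existing hooks, something like this would turn the silent misconfiguration into a startup error for every component with time.Duration fields, not just this processor.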
Collector version
v0.131.0 (otel/opentelemetry-collector-contrib:0.131.0)
Environment information
OS: macOS Darwin 25.1.0 (also reproduced in Docker linux/arm64)
Collector: otel/opentelemetry-collector-contrib:0.131.0
OpenTelemetry Collector configuration
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "worker"
          metrics_path: "/metrics"
          scrape_interval: "30s"
          static_configs:
            - targets: ["host.docker.internal:9394"]

processors:
  batch:
    send_batch_size: 8192
    timeout: 10s
  cumulativetodelta:
    include:
      metric_types:
        - "histogram"
    initial_value: "keep"
    max_staleness: 300 # BUG: Missing 's' suffix

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [cumulativetodelta, batch]
      exporters: [debug]

Log output
Debug output showing the issue (note AggregationTemporality: Delta but values remain cumulative):
Metric #0
Descriptor:
-> Name: sidekiq_job_runtime_seconds
-> Unit: seconds
-> DataType: Histogram
-> AggregationTemporality: Delta
HistogramDataPoints #0
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC # <-- Epoch zero indicates state was lost
Timestamp: 2025-12-11 07:15:46.773 +0000 UTC
Count: 7 # <-- Should be 0 if no new jobs, but shows cumulative total
Sum: 43.749000 # <-- Should be 0.0 if no new jobs, but shows cumulative total
After fix (max_staleness: 300s):
Metric #0
Descriptor:
-> Name: sidekiq_job_runtime_seconds
-> DataType: Histogram
-> AggregationTemporality: Delta
HistogramDataPoints #0
StartTimestamp: 2025-12-11 07:19:16.773 +0000 UTC # <-- Proper timestamp
Timestamp: 2025-12-11 07:19:46.773 +0000 UTC
Count: 0 # <-- Correct delta (no new jobs)
Sum: 0.000000 # <-- Correct delta
Additional context
This issue is particularly problematic because:
- No startup error - The collector starts successfully
- No runtime error - Metrics flow through without issues
- Subtle misbehavior - The output looks correct (shows Delta temporality) but values are wrong
- Hard to debug - Requires deep understanding of the processor internals to diagnose
The fix is trivial once identified (300 → 300s), but discovering the root cause required tracing through the confmap parsing logic, the mapstructure configuration, and understanding that Go's time.Duration is a defined type over int64.
A validation check would save users significant debugging time and surface the misconfiguration immediately at startup.