
[processor/cumulativetodelta] Silent misconfiguration when max_staleness lacks duration suffix #44901

@saviogl

Component(s)

processor/cumulativetodelta

What happened?

Description

When configuring max_staleness without a duration suffix (e.g., max_staleness: 300 instead of max_staleness: 300s), the processor silently accepts the configuration but interprets the value as nanoseconds instead of the intended unit. This causes the processor's state tracking to expire almost immediately, breaking delta calculation without any error or warning.

The processor appears to work correctly—metrics flow through, AggregationTemporality is changed to Delta—but the actual delta values remain identical to the cumulative values because state is lost between scrapes.

Steps to Reproduce

  1. Configure the cumulativetodelta processor with a bare integer for max_staleness:

     processors:
       cumulativetodelta:
         include:
           metric_types:
             - "histogram"
         initial_value: "keep"
         max_staleness: 300    # Missing 's' suffix - interpreted as 300 nanoseconds!
  2. Send histogram metrics through the processor
  3. Observe that while AggregationTemporality shows Delta, the Count, Sum, and BucketCounts values remain cumulative (identical between scrapes when no new data arrives, instead of showing 0)

Expected Result

Either:

  • Option A (Preferred): The collector should fail at startup with a clear validation error indicating that duration values require a unit suffix
  • Option B: The collector should log a warning when a duration value seems unreasonably small (e.g., < 1 second for max_staleness)

Actual Result

The configuration is silently accepted. The value 300 is interpreted as 300 nanoseconds (not 300 seconds), causing:

  • State entries to expire in ~300ns (essentially immediately)
  • Delta calculation to fail silently because previous values are never retained
  • Metrics to pass through with AggregationTemporality: Delta but with incorrect (cumulative) values

Root Cause Analysis

The issue stems from how Go's type system interacts with the configuration parsing:

  1. Config struct definition (config.go):

    type Config struct {
        MaxStaleness time.Duration `mapstructure:"max_staleness"`
        // ...
    }
  2. confmap decoder configuration (from opentelemetry-collector/confmap/confmap.go):

    dc := &mapstructure.DecoderConfig{
        WeaklyTypedInput: false,
        DecodeHook: composehook.ComposeDecodeHookFunc(
            // ...
            mapstructure.StringToTimeDurationHookFunc(),  // Only handles strings
            // ...
        ),
    }
  3. The parsing chain:

    • YAML parses max_staleness: 300 as an integer (not a string)
    • WeaklyTypedInput: false prevents automatic int→string conversion
    • StringToTimeDurationHookFunc() only activates for string inputs, so it's bypassed
    • Go's time.Duration is type Duration int64 (nanoseconds), allowing direct integer assignment
    • Result: 300 becomes 300 nanoseconds
  4. Why the fix works: max_staleness: 300s is parsed by YAML as a string, triggering StringToTimeDurationHookFunc() which correctly calls time.ParseDuration("300s") → 300 seconds

This is a known class of problem documented in go-yaml/yaml#200.

Proposed Solution

Add validation in the processor's Validate() method to catch unreasonably small duration values:

func (c *Config) Validate() error {
    // Existing validation...
    
    // Validate max_staleness is reasonable (if set)
    if c.MaxStaleness > 0 && c.MaxStaleness < time.Second {
        return fmt.Errorf(
            "max_staleness value %v appears to be in nanoseconds; "+
            "duration values require a unit suffix (e.g., '300s', '5m')",
            c.MaxStaleness,
        )
    }
    
    return nil
}

Alternatively, this could be addressed at the confmap level in the core collector by adding a decode hook that rejects raw integers for time.Duration fields, which would benefit all components.

Collector version

v0.131.0 (otel/opentelemetry-collector-contrib:0.131.0)

Environment information

OS: macOS Darwin 25.1.0 (also reproduced in Docker linux/arm64)
Collector: otel/opentelemetry-collector-contrib:0.131.0

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "worker"
          metrics_path: "/metrics"
          scrape_interval: "30s"
          static_configs:
            - targets: ["host.docker.internal:9394"]

processors:
  batch:
    send_batch_size: 8192
    timeout: 10s
  cumulativetodelta:
    include:
      metric_types:
        - "histogram"
    initial_value: "keep"
    max_staleness: 300    # BUG: Missing 's' suffix

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [cumulativetodelta, batch]
      exporters: [debug]

Log output

Debug output showing the issue (note AggregationTemporality: Delta but values remain cumulative):

Metric #0
Descriptor:
     -> Name: sidekiq_job_runtime_seconds
     -> Unit: seconds
     -> DataType: Histogram
     -> AggregationTemporality: Delta
HistogramDataPoints #0
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC   # <-- Epoch zero indicates state was lost
Timestamp: 2025-12-11 07:15:46.773 +0000 UTC
Count: 7           # <-- Should be 0 if no new jobs, but shows cumulative total
Sum: 43.749000     # <-- Should be 0.0 if no new jobs, but shows cumulative total

After fix (max_staleness: 300s):

Metric #0
Descriptor:
     -> Name: sidekiq_job_runtime_seconds
     -> DataType: Histogram
     -> AggregationTemporality: Delta
HistogramDataPoints #0
StartTimestamp: 2025-12-11 07:19:16.773 +0000 UTC   # <-- Proper timestamp
Timestamp: 2025-12-11 07:19:46.773 +0000 UTC
Count: 0           # <-- Correct delta (no new jobs)
Sum: 0.000000      # <-- Correct delta

Additional context

This issue is particularly problematic because:

  1. No startup error - The collector starts successfully
  2. No runtime error - Metrics flow through without issues
  3. Subtle misbehavior - The output looks correct (shows Delta temporality) but values are wrong
  4. Hard to debug - Requires deep understanding of the processor internals to diagnose

The fix is trivial once identified (300 → 300s), but discovering the root cause required tracing through the confmap parsing logic, mapstructure configuration, and understanding Go's time.Duration type alias.

A validation check would save users significant debugging time and surface the misconfiguration immediately at startup.
