
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Dec 17, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Review | Updated (UTC)
rivetkit-serverless | Error | Error | Dec 17, 2025 9:42pm

3 Skipped Deployments

Project | Deployment | Review | Updated (UTC)
rivet-cloud | Ignored | Ignored | Dec 17, 2025 9:42pm
rivet-inspector | Ignored | Ignored (Preview) | Dec 17, 2025 9:42pm
rivet-site | Ignored | Ignored (Preview) | Dec 17, 2025 9:42pm

Contributor Author

MasterPtato commented Dec 17, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Dec 17, 2025

PR Review: Revert OTEL Metrics to Prometheus

Summary

This PR reverts the metrics infrastructure from OpenTelemetry (OTEL) to Prometheus, introducing a pull-based metrics collection system instead of the previous push-based OTLP approach. The change simplifies the metrics pipeline and removes the need for a separate otel-collector-client service.


Code Quality & Architecture

Strengths

  1. Simplified Architecture: Removing the otel-collector-client service reduces complexity and the number of moving parts in the development environment.

  2. Clean Template-Based Configuration: Changes follow the repository's pattern of modifying templates in engine/docker/template/ and regenerating configs, which is good practice per CLAUDE.md.

  3. Prometheus Registry Implementation: The new registry.rs (engine/packages/metrics/src/registry.rs:1-11) properly initializes a Prometheus registry with standard labels including service name, pod ID, and source hash (an illustrative sketch follows this list).

  4. HTTP Metrics Server: The standalone metrics server (engine/packages/metrics/src/server.rs:14-37) properly exposes Prometheus metrics on a configurable endpoint.
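
For reference, a registry carrying constant labels of this shape can be built with the prometheus crate's Registry::new_custom. The snippet below is only an illustrative sketch; the function name and label values are placeholders, not the actual registry.rs code:

    use std::collections::HashMap;
    use prometheus::Registry;

    // Hypothetical helper: build a registry whose gathered metrics all carry
    // the standard labels described above (service, pod_id, source_hash).
    fn build_registry(service: &str, pod_id: &str, source_hash: &str) -> Registry {
        let labels = HashMap::from([
            ("service".to_string(), service.to_string()),
            ("pod_id".to_string(), pod_id.to_string()),
            ("source_hash".to_string(), source_hash.to_string()),
        ]);
        Registry::new_custom(None, Some(labels)).expect("valid registry options")
    }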


Issues & Concerns

🔴 Critical Issues

  1. Hardcoded Port in OTEL Config (engine/docker/template/src/services/edge/otel-collector.ts:29)
    Problem: The metrics port 6430 is hardcoded in the OTEL collector scrape config. This should reference the config value used in the metrics server to avoid drift between server and scraper configurations.

  2. Missing Error Handling in Metrics Encoding (engine/packages/metrics/src/server.rs:46-48)
    Problem: Using .expect() will crash the entire metrics server if encoding fails. Should return a proper error response instead.

  3. Response Building Could Panic (engine/packages/metrics/src/server.rs:50-54)
    Problem: Response building also relies on .expect(), which can panic inside the server. Handle the error and fall back to an error response instead; a sketch addressing both this and the encoding issue follows below.
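
A minimal sketch of how the handler could degrade gracefully instead of panicking, assuming the prometheus crate plus the http and tracing crates; the function name and signature are illustrative, not the actual server.rs API:

    use prometheus::{Encoder, Registry, TextEncoder};

    // Hypothetical handler body: on encoding failure, log and return a 500
    // instead of crashing the metrics server with .expect().
    fn render_metrics(registry: &Registry) -> http::Response<Vec<u8>> {
        let encoder = TextEncoder::new();
        let mut buffer = Vec::new();

        if let Err(err) = encoder.encode(&registry.gather(), &mut buffer) {
            tracing::error!(?err, "failed to encode prometheus metrics");
            let mut res = http::Response::new(Vec::new());
            *res.status_mut() = http::StatusCode::INTERNAL_SERVER_ERROR;
            return res;
        }

        let mut res = http::Response::new(buffer);
        res.headers_mut().insert(
            http::header::CONTENT_TYPE,
            http::HeaderValue::from_static("text/plain; version=0.0.4"),
        );
        res
    }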

⚠️ Medium Priority Issues

  1. OTEL Collector Still Receives OTLP (engine/docker/template/src/services/edge/otel-collector.ts:38-46)
    The OTEL collector still has OTLP receivers configured for both gRPC and HTTP. Question: Is this intentional for traces/logs?

  2. Empty Prometheus Scrape Configs (engine/docker/template/src/services/core/prometheus.ts:12)
    The Prometheus service has no scrape configs defined. All scraping is done by OTEL collector via Prometheus receiver and sent via remote write. This is architecturally valid but unconventional. Document this design choice if intentional.

  3. Potential Label Cardinality Issue (engine/packages/metrics/src/registry.rs:8)
    Pod IDs as labels can cause high cardinality in Kubernetes environments with frequent pod churn. Consider if this label is necessary globally.

💡 Minor Issues & Style

  1. TODO Comment Left In Code (engine/packages/metrics/src/server.rs:11)
    Either implement this or create a tracking issue and reference it.

  2. Resource to Telemetry Conversion (engine/docker/template/src/services/edge/otel-collector.ts:99-101)
    This converts resource attributes to metric labels, which could increase cardinality. Ensure this is intentional.


Performance Considerations

Positive Changes

  1. Pull-Based Metrics: Prometheus's pull model is more efficient than pushing metrics for the OTEL collector to aggregate.

  2. Direct Prometheus Registry: Using the Prometheus client library directly avoids the overhead of converting metrics through OTLP.

  3. Batch Processing: OTEL collector batch processor with 5s timeout and 10k batch size is well-configured.

⚠️ Concerns

  1. Scrape Interval Hardcoded: The 15s scrape interval is hardcoded in multiple places. Consider making this configurable.

  2. No Metrics Endpoint Rate Limiting: The metrics server has no rate limiting. While scraping is typically controlled, consider if protection is needed.


Security Concerns

⚠️ Issues

  1. Insecure TLS Configuration (engine/docker/template/src/services/edge/otel-collector.ts:96-98)
    TLS is set to insecure for dev environment. Severity: Low (dev only). Add a comment indicating this is acceptable for dev but should be changed for production.

  2. No Authentication on Metrics Endpoint: The metrics endpoint has no authentication. This is typical for Prometheus but worth documenting.


Test Coverage

Missing Tests

  1. No Tests for Metrics Server: The new server.rs file has no associated tests. Add tests for successful metrics scraping, encoding, and error handling.

  2. No Integration Tests: No tests verify the OTEL collector can successfully scrape the Prometheus endpoint.


Documentation

⚠️ Missing Documentation

  1. Architecture Decision: This significant change lacks explanation about why the revert was needed.

  2. Migration Guide: No guidance for existing deployments on how this affects their metrics.

  3. Old Dashboards Preserved: The PR moves old dashboards to grafana-dashboards-old/ but doesn't document what changed or why both are kept.


Recommendations

High Priority

  1. Fix panic-inducing .expect() calls in metrics server
  2. Extract hardcoded port to shared config
  3. Add error handling for metrics encoding
  4. Add tests for metrics server

Medium Priority

  1. Document why OTLP receivers remain in OTEL collector
  2. Document the Prometheus remote-write architecture
  3. Review label cardinality (especially kubernetes_pod_id)
  4. Consider making scrape interval configurable

Low Priority

  1. Add TODO tracking issue reference
  2. Add security comments for insecure TLS in dev
  3. Clean up old dashboard directory or document retention reason

Verdict

Recommend: Merge with minor fixes

The architectural change is sound and simplifies the metrics pipeline. The main concerns are:

  1. Error handling that could crash the metrics server
  2. Hardcoded values that could drift
  3. Missing tests and documentation

These should be addressed before or shortly after merge, but they don't block the core functionality.


Note: This review follows the conventions in CLAUDE.md and focuses on the Rust/Docker configuration aspects of the change. The extensive Grafana dashboard JSON changes were not reviewed in detail as they appear to be auto-generated.

@pkg-pr-new

pkg-pr-new bot commented Dec 17, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3656

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3656

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3656

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3656

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3656

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3656

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3656

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3656

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3656

commit: 5a3e37b

@claude

claude bot commented Dec 17, 2025

Pull Request Review: Revert OTEL Metrics to Prometheus

Summary

This PR reverts the metrics collection system from OpenTelemetry metrics back to native Prometheus metrics. The changes span the entire codebase, affecting metrics collection, Docker infrastructure, and Grafana dashboards.


Code Quality & Best Practices

✅ Strengths

  1. Consistent Migration Pattern: The migration from OTEL to Prometheus follows a consistent pattern across all packages (an illustrative sketch follows this list):

    • Changed from opentelemetry::metrics::* to rivet_metrics::{REGISTRY, prometheus::*}
    • Replaced OTEL meter creation with Prometheus lazy_static! macros
    • Updated all metric registration to use register_*_with_registry! macros
  2. Centralized Configuration: New metrics configuration properly added to engine/packages/config/src/config/metrics.rs with sensible defaults.

  3. Proper Metric Naming: All metrics follow Prometheus naming conventions with appropriate prefixes (e.g., gasoline_, api_, cache_).
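
As a concrete illustration of this registration pattern (metric name taken from the notes later in this review; the label set is an assumption, and imports are shown against the prometheus crate directly for brevity):

    use lazy_static::lazy_static;
    use prometheus::{register_int_counter_vec_with_registry, IntCounterVec};
    use rivet_metrics::REGISTRY;

    lazy_static! {
        // Registered against the shared REGISTRY on first access; the
        // .expect() message follows the recommendation made later in this review.
        pub static ref WORKFLOW_TOTAL: IntCounterVec =
            register_int_counter_vec_with_registry!(
                "gasoline_workflow_total",
                "Total number of workflows dispatched.",
                &["workflow_name"],
                *REGISTRY
            )
            .expect("failed to register metric: gasoline_workflow_total");
    }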

⚠️ Issues Found

  1. Indentation Inconsistency (engine/packages/api-builder/src/middleware.rs:189)

     			metrics::API_REQUEST_DURATION  // Extra space before metrics

    This line has inconsistent indentation. Should align with the surrounding code.

  2. Missing Error Handling (engine/packages/metrics/src/server.rs:46-48)

    encoder
        .encode(&metric_families, &mut buffer)
        .expect("encode");

    Using .expect() for encoding could panic. While unlikely to fail, consider logging the error instead.

  3. Hard-Coded Port (engine/docker/template/src/services/edge/otel-collector.ts:29)
    Port 6430 is hard-coded but defined as a constant in defaults.rs. Should reference the constant or add a comment indicating this must match defaults::ports::METRICS.


Potential Bugs

  1. Metric Label Cardinality Risk (Multiple files)

    Several metrics use potentially high-cardinality labels:

    • engine/packages/api-builder/src/metrics.rs:22: path label could create many time series if paths are dynamic
    • engine/packages/gasoline/src/metrics.rs:7: worker_id as a label

    Recommendation: Monitor cardinality and consider using path templates instead of full request paths for API metrics (see the sketch after this list).

  2. Empty String for Error Code (engine/packages/api-builder/src/middleware.rs:182-188)

    let error_code: String = if status.is_success() {
        String::new()
    } else if let Some(err) = &error {
        format!("{}.{}", err.group, err.code)
    } else {
        String::new()
    };

    Using empty strings for label values is valid but less clear than using a constant like "none" or "-". Empty strings can make Grafana queries more confusing.

  3. Unwrap on Metric Registration (All metrics files)

    All metrics use .unwrap() on registration. While this is common in lazy_static! blocks, consider adding panic messages for better debugging:

    .expect("failed to register metric: gasoline_workflow_total")

Performance Considerations

✅ Good

  1. Efficient Scraping: Prometheus pull-based model is more efficient than OTEL push for high-frequency metrics.

  2. Proper Bucketing: Histogram buckets are well-defined in engine/packages/metrics/src/buckets.rs, with different granularities for different use cases (an illustrative example follows this list).

  3. Batch Processing: OTEL collector configuration includes proper batching (timeout: 5s, batch size: 10000).
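
For context on what such bucket constants look like, a histogram bucket vector can be derived with the prometheus crate's helper; the values below are illustrative and are not the ones defined in buckets.rs:

    use lazy_static::lazy_static;

    lazy_static! {
        // Illustrative micro-latency buckets: 1µs doubling up to ~0.5s.
        pub static ref MICRO_BUCKETS: Vec<f64> =
            prometheus::exponential_buckets(0.000_001, 2.0, 20)
                .expect("bucket parameters are valid");
    }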

⚠️ Consider

  1. Metric Collection Overhead: The middleware in api-builder/src/middleware.rs increments metrics on every request. With high request rates, this could add overhead. Consider:

    • Using sampling for very high-traffic endpoints
    • Monitoring the performance impact
  2. Label Combinations: Some metrics carry four labels (e.g., API_REQUEST_DURATION with method, path, status, and error_code), so the number of time series is the product of the distinct values of each label. Monitor cardinality growth; an illustrative estimate follows.
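
As an illustrative worst-case estimate (all counts below are assumptions, not measurements from this codebase): with 7 HTTP methods, 150 distinct path values, 15 status codes, and 25 error codes, a single histogram could grow to 7 × 150 × 15 × 25 ≈ 394,000 label combinations, each of which is then multiplied by the number of histogram buckets exposed on the wire. Templating the path label is the main lever for keeping that product bounded.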


Security Concerns

✅ Good

  1. Metrics Endpoint Binding: Uses configurable host/port with sensible defaults (IPv6 unspecified, port 6430).

  2. No Sensitive Data: Metrics don't expose sensitive user data or secrets.

⚠️ Minor Issues

  1. No Authentication on Metrics Endpoint (engine/packages/metrics/src/server.rs)

    The metrics server has no authentication. While this is common for internal Prometheus endpoints, ensure:

    • The metrics port (6430) is not exposed externally
    • Network policies restrict access to authorized scrapers

    Consider adding a TODO comment about potential authentication in production environments.

  2. ClickHouse Password in Environment (engine/docker/template/src/services/edge/otel-collector.ts:79)
    Using environment variables is correct, but ensure this is properly documented and rotated in production.


Test Coverage

⚠️ Missing

  1. No Tests for Metrics Server: The engine/packages/metrics/src/server.rs has no visible tests. Consider adding:

    • Test that the server binds correctly
    • Test that metrics endpoint returns valid Prometheus format
    • Test error handling for binding failures
  2. Metric Registration Tests: No tests verify that all metrics are registered without conflicts or panics.

  3. Integration Tests: No visible tests for the end-to-end metrics flow (application → Prometheus → Grafana).

Recommendation: Add at least basic unit tests for the metrics server and registration; a minimal sketch follows.
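
A minimal sketch of such a unit test, written against the prometheus crate only (names are illustrative and not tied to the actual server.rs API):

    use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

    #[test]
    fn registry_encodes_to_prometheus_text_format() {
        // Use an isolated registry so the test does not depend on global state.
        let registry = Registry::new();
        let counter = IntCounter::new("test_requests_total", "Test counter.").unwrap();
        registry.register(Box::new(counter.clone())).unwrap();
        counter.inc();

        // Encode the same way a /metrics handler would and check the output.
        let mut buffer = Vec::new();
        TextEncoder::new()
            .encode(&registry.gather(), &mut buffer)
            .expect("encoding a healthy registry should not fail");

        let body = String::from_utf8(buffer).unwrap();
        assert!(body.contains("test_requests_total 1"));
    }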


Architecture & Design

✅ Strengths

  1. Clean Separation: Metrics package is well-isolated with clear responsibilities.

  2. Reusable Constants: BUCKETS, MICRO_BUCKETS, and TASK_POLL_BUCKETS are defined centrally.

  3. Proper Labels: Registry includes useful default labels (service, pod_id, source_hash).

💡 Suggestions

  1. Docker Template Warning: The CLAUDE.md states not to edit docker configs directly but to use templates. Verify that all docker changes in this PR were generated from templates by running:

    cd engine/docker/template && pnpm start

    and confirming there are no unexpected diffs.

  2. Documentation: Consider adding a comment in metrics/src/lib.rs explaining the migration from OTEL to Prometheus and the reasoning.

  3. Metric Cleanup: The TODO comment at line 4 in api-builder/src/metrics.rs mentions tracking request/response body size. Consider creating an issue to track this future enhancement.


Summary & Verdict

Overall Assessment: The PR is well-structured and follows a consistent pattern across the codebase. The migration from OTEL to Prometheus is thorough.

Required Changes:

  1. Fix indentation inconsistency at api-builder/src/middleware.rs:189
  2. Replace the .unwrap() calls on metric registration with .expect() and a descriptive message

Recommended Changes:

  1. Consider using a constant label value such as "-" instead of an empty string for missing error codes
  2. Add basic unit tests for the metrics server
  3. Document port 6430 constant or reference it from defaults.rs
  4. Add authentication consideration comment for production deployment
  5. Verify all docker config changes were generated from templates

Performance: Monitor metric cardinality after deployment, especially for path labels in API metrics.

Security: Ensure metrics endpoint is not exposed externally.

The changes are safe to merge after addressing the required indentation fix and improving error messages.

