
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Dec 17, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Review | Updated (UTC)
rivetkit-serverless | Error | Error | Dec 17, 2025 9:42pm

3 Skipped Deployments

Project | Deployment | Review | Updated (UTC)
rivet-cloud | Ignored | Ignored | Dec 17, 2025 9:42pm
rivet-inspector | Ignored | Ignored (Preview) | Dec 17, 2025 9:42pm
rivet-site | Ignored | Ignored (Preview) | Dec 17, 2025 9:42pm

Contributor Author

MasterPtato commented Dec 17, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Dec 17, 2025

PR Review: Revert OTEL Metrics to Prometheus

Summary

This PR reverts the metrics infrastructure from OpenTelemetry (OTEL) to Prometheus, introducing a pull-based metrics collection system instead of the previous push-based OTLP approach. The change simplifies the metrics pipeline and removes the need for a separate otel-collector-client service.


Code Quality & Architecture

Strengths

  1. Simplified Architecture: Removing the otel-collector-client service reduces complexity and the number of moving parts in the development environment.

  2. Clean Template-Based Configuration: Changes follow the repository's pattern of modifying templates in engine/docker/template/ and regenerating configs, which is good practice per CLAUDE.md.

  3. Prometheus Registry Implementation: The new registry.rs (engine/packages/metrics/src/registry.rs:1-11) properly initializes a Prometheus registry with standard labels including service name, pod ID, and source hash (an illustrative sketch follows this list).

  4. HTTP Metrics Server: The standalone metrics server (engine/packages/metrics/src/server.rs:14-37) properly exposes Prometheus metrics on a configurable endpoint.
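
For reference, a registry carrying constant labels of this shape can be built with the prometheus crate's Registry::new_custom. The snippet below is only an illustrative sketch; the function name and label values are placeholders, not the actual registry.rs code:

    use std::collections::HashMap;
    use prometheus::Registry;

    // Hypothetical helper: build a registry whose gathered metrics all carry
    // the standard labels described above (service, pod_id, source_hash).
    fn build_registry(service: &str, pod_id: &str, source_hash: &str) -> Registry {
        let labels = HashMap::from([
            ("service".to_string(), service.to_string()),
            ("pod_id".to_string(), pod_id.to_string()),
            ("source_hash".to_string(), source_hash.to_string()),
        ]);
        Registry::new_custom(None, Some(labels)).expect("valid registry options")
    }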


Issues & Concerns

🔴 Critical Issues

  1. Hardcoded Port in OTEL Config (engine/docker/template/src/services/edge/otel-collector.ts:29)
    Problem: The metrics port 6430 is hardcoded in the OTEL collector scrape config. This should reference the config value used in the metrics server to avoid drift between server and scraper configurations.

  2. Missing Error Handling in Metrics Encoding (engine/packages/metrics/src/server.rs:46-48)
    Problem: Using .expect() will crash the entire metrics server if encoding fails. Should return a proper error response instead.

  3. Response Building Could Panic (engine/packages/metrics/src/server.rs:50-54)
    Problem: Response building also relies on .expect(), which can panic inside the server. Handle the error and fall back to an error response instead; a sketch addressing both this and the encoding issue follows below.
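
A minimal sketch of how the handler could degrade gracefully instead of panicking, assuming the prometheus crate plus the http and tracing crates; the function name and signature are illustrative, not the actual server.rs API:

    use prometheus::{Encoder, Registry, TextEncoder};

    // Hypothetical handler body: on encoding failure, log and return a 500
    // instead of crashing the metrics server with .expect().
    fn render_metrics(registry: &Registry) -> http::Response<Vec<u8>> {
        let encoder = TextEncoder::new();
        let mut buffer = Vec::new();

        if let Err(err) = encoder.encode(&registry.gather(), &mut buffer) {
            tracing::error!(?err, "failed to encode prometheus metrics");
            let mut res = http::Response::new(Vec::new());
            *res.status_mut() = http::StatusCode::INTERNAL_SERVER_ERROR;
            return res;
        }

        let mut res = http::Response::new(buffer);
        res.headers_mut().insert(
            http::header::CONTENT_TYPE,
            http::HeaderValue::from_static("text/plain; version=0.0.4"),
        );
        res
    }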

⚠️ Medium Priority Issues

  1. OTEL Collector Still Receives OTLP (engine/docker/template/src/services/edge/otel-collector.ts:38-46)
    The OTEL collector still has OTLP receivers configured for both gRPC and HTTP. Question: Is this intentional for traces/logs?

  2. Empty Prometheus Scrape Configs (engine/docker/template/src/services/core/prometheus.ts:12)
    The Prometheus service has no scrape configs defined. All scraping is done by OTEL collector via Prometheus receiver and sent via remote write. This is architecturally valid but unconventional. Document this design choice if intentional.

  3. Potential Label Cardinality Issue (engine/packages/metrics/src/registry.rs:8)
    Pod IDs as labels can cause high cardinality in Kubernetes environments with frequent pod churn. Consider if this label is necessary globally.

💡 Minor Issues & Style

  1. TODO Comment Left In Code (engine/packages/metrics/src/server.rs:11)
    Either implement this or create a tracking issue and reference it.

  2. Resource to Telemetry Conversion (engine/docker/template/src/services/edge/otel-collector.ts:99-101)
    This converts resource attributes to metric labels, which could increase cardinality. Ensure this is intentional.


Performance Considerations

Positive Changes

  1. Pull-Based Metrics: Prometheus's pull model is more efficient than pushing metrics for the OTEL collector to aggregate.

  2. Direct Prometheus Registry: Using the Prometheus client library directly avoids the overhead of converting metrics through OTLP.

  3. Batch Processing: OTEL collector batch processor with 5s timeout and 10k batch size is well-configured.

⚠️ Concerns

  1. Scrape Interval Hardcoded: The 15s scrape interval is hardcoded in multiple places. Consider making this configurable.

  2. No Metrics Endpoint Rate Limiting: The metrics server has no rate limiting. While scraping is typically controlled, consider if protection is needed.


Security Concerns

⚠️ Issues

  1. Insecure TLS Configuration (engine/docker/template/src/services/edge/otel-collector.ts:96-98)
    TLS is set to insecure for dev environment. Severity: Low (dev only). Add a comment indicating this is acceptable for dev but should be changed for production.

  2. No Authentication on Metrics Endpoint: The metrics endpoint has no authentication. This is typical for Prometheus but worth documenting.


Test Coverage

Missing Tests

  1. No Tests for Metrics Server: The new server.rs file has no associated tests. Add tests for successful metrics scraping, encoding, and error handling.

  2. No Integration Tests: No tests verify the OTEL collector can successfully scrape the Prometheus endpoint.


Documentation

⚠️ Missing Documentation

  1. Architecture Decision: This significant change lacks explanation about why the revert was needed.

  2. Migration Guide: No guidance for existing deployments on how this affects their metrics.

  3. Old Dashboards Preserved: The PR moves old dashboards to grafana-dashboards-old/ but doesn't document what changed or why both are kept.


Recommendations

High Priority

  1. Fix panic-inducing .expect() calls in metrics server
  2. Extract hardcoded port to shared config
  3. Add error handling for metrics encoding
  4. Add tests for metrics server

Medium Priority

  1. Document why OTLP receivers remain in OTEL collector
  2. Document the Prometheus remote-write architecture
  3. Review label cardinality (especially kubernetes_pod_id)
  4. Consider making scrape interval configurable

Low Priority

  1. Add TODO tracking issue reference
  2. Add security comments for insecure TLS in dev
  3. Clean up old dashboard directory or document retention reason

Verdict

Recommend: Merge with minor fixes

The architectural change is sound and simplifies the metrics pipeline. The main concerns are:

  1. Error handling that could crash the metrics server
  2. Hardcoded values that could drift
  3. Missing tests and documentation

These should be addressed before or shortly after merge, but they don't block the core functionality.


Note: This review follows the conventions in CLAUDE.md and focuses on the Rust/Docker configuration aspects of the change. The extensive Grafana dashboard JSON changes were not reviewed in detail as they appear to be auto-generated.

@pkg-pr-new

pkg-pr-new bot commented Dec 17, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3656

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3656

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3656

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3656

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3656

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3656

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3656

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3656

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3656

commit: 5a3e37b

@claude

claude bot commented Dec 17, 2025

Pull Request Review: Revert OTEL Metrics to Prometheus

Summary

This PR reverts the metrics collection system from OpenTelemetry metrics back to native Prometheus metrics. The changes span the entire codebase, affecting metrics collection, Docker infrastructure, and Grafana dashboards.


Code Quality & Best Practices

✅ Strengths

  1. Consistent Migration Pattern: The migration from OTEL to Prometheus follows a consistent pattern across all packages (an illustrative sketch follows this list):

    • Changed from opentelemetry::metrics::* to rivet_metrics::{REGISTRY, prometheus::*}
    • Replaced OTEL meter creation with Prometheus lazy_static! macros
    • Updated all metric registration to use register_*_with_registry! macros
  2. Centralized Configuration: New metrics configuration properly added to engine/packages/config/src/config/metrics.rs with sensible defaults.

  3. Proper Metric Naming: All metrics follow Prometheus naming conventions with appropriate prefixes (e.g., gasoline_, api_, cache_).
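
As a concrete illustration of this registration pattern (metric name taken from the notes later in this review; the label set is an assumption, and imports are shown against the prometheus crate directly for brevity):

    use lazy_static::lazy_static;
    use prometheus::{register_int_counter_vec_with_registry, IntCounterVec};
    use rivet_metrics::REGISTRY;

    lazy_static! {
        // Registered against the shared REGISTRY on first access; the
        // .expect() message follows the recommendation made later in this review.
        pub static ref WORKFLOW_TOTAL: IntCounterVec =
            register_int_counter_vec_with_registry!(
                "gasoline_workflow_total",
                "Total number of workflows dispatched.",
                &["workflow_name"],
                *REGISTRY
            )
            .expect("failed to register metric: gasoline_workflow_total");
    }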

⚠️ Issues Found

  1. Indentation Inconsistency (engine/packages/api-builder/src/middleware.rs:189)

     			metrics::API_REQUEST_DURATION  // Extra space before metrics

    This line has inconsistent indentation. Should align with the surrounding code.

  2. Missing Error Handling (engine/packages/metrics/src/server.rs:46-48)

    encoder
        .encode(&metric_families, &mut buffer)
        .expect("encode");

    Using .expect() for encoding could panic. While unlikely to fail, consider logging the error instead.

  3. Hard-Coded Port (engine/docker/template/src/services/edge/otel-collector.ts:29)
    Port 6430 is hard-coded but defined as a constant in defaults.rs. Should reference the constant or add a comment indicating this must match defaults::ports::METRICS.


Potential Bugs

  1. Metric Label Cardinality Risk (Multiple files)

    Several metrics use potentially high-cardinality labels:

    • engine/packages/api-builder/src/metrics.rs:22: path label could create many time series if paths are dynamic
    • engine/packages/gasoline/src/metrics.rs:7: worker_id as a label

    Recommendation: Monitor cardinality and consider using path templates instead of full request paths for API metrics (see the sketch after this list).

  2. Empty String for Error Code (engine/packages/api-builder/src/middleware.rs:182-188)

    let error_code: String = if status.is_success() {
        String::new()
    } else if let Some(err) = &error {
        format!("{}.{}", err.group, err.code)
    } else {
        String::new()
    };

    Using empty strings for label values is valid but less clear than using a constant like "none" or "-". Empty strings can make Grafana queries more confusing.

  3. Unwrap on Metric Registration (All metrics files)

    All metrics use .unwrap() on registration. While this is common in lazy_static! blocks, consider adding panic messages for better debugging:

    .expect("failed to register metric: gasoline_workflow_total")

Performance Considerations

✅ Good

  1. Efficient Scraping: Prometheus pull-based model is more efficient than OTEL push for high-frequency metrics.

  2. Proper Bucketing: Histogram buckets are well-defined in engine/packages/metrics/src/buckets.rs, with different granularities for different use cases (an illustrative example follows this list).

  3. Batch Processing: OTEL collector configuration includes proper batching (timeout: 5s, batch size: 10000).
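
For context on what such bucket constants look like, a histogram bucket vector can be derived with the prometheus crate's helper; the values below are illustrative and are not the ones defined in buckets.rs:

    use lazy_static::lazy_static;

    lazy_static! {
        // Illustrative micro-latency buckets: 1µs doubling up to ~0.5s.
        pub static ref MICRO_BUCKETS: Vec<f64> =
            prometheus::exponential_buckets(0.000_001, 2.0, 20)
                .expect("bucket parameters are valid");
    }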

⚠️ Consider

  1. Metric Collection Overhead: The middleware in api-builder/src/middleware.rs increments metrics on every request. With high request rates, this could add overhead. Consider:

    • Using sampling for very high-traffic endpoints
    • Monitoring the performance impact
  2. Label Combinations: Some metrics carry four labels (e.g., API_REQUEST_DURATION with method, path, status, and error_code), so the number of time series is the product of the distinct values of each label. Monitor cardinality growth; an illustrative estimate follows.
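
As an illustrative worst-case estimate (all counts below are assumptions, not measurements from this codebase): with 7 HTTP methods, 150 distinct path values, 15 status codes, and 25 error codes, a single histogram could grow to 7 × 150 × 15 × 25 ≈ 394,000 label combinations, each of which is then multiplied by the number of histogram buckets exposed on the wire. Templating the path label is the main lever for keeping that product bounded.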


Security Concerns

✅ Good

  1. Metrics Endpoint Binding: Uses configurable host/port with sensible defaults (IPv6 unspecified, port 6430).

  2. No Sensitive Data: Metrics don't expose sensitive user data or secrets.

⚠️ Minor Issues

  1. No Authentication on Metrics Endpoint (engine/packages/metrics/src/server.rs)

    The metrics server has no authentication. While this is common for internal Prometheus endpoints, ensure:

    • The metrics port (6430) is not exposed externally
    • Network policies restrict access to authorized scrapers

    Consider adding a TODO comment about potential authentication in production environments.

  2. ClickHouse Password in Environment (engine/docker/template/src/services/edge/otel-collector.ts:79)
    Using environment variables is correct, but ensure this is properly documented and rotated in production.


Test Coverage

⚠️ Missing

  1. No Tests for Metrics Server: The engine/packages/metrics/src/server.rs has no visible tests. Consider adding:

    • Test that the server binds correctly
    • Test that metrics endpoint returns valid Prometheus format
    • Test error handling for binding failures
  2. Metric Registration Tests: No tests verify that all metrics are registered without conflicts or panics.

  3. Integration Tests: No visible tests for the end-to-end metrics flow (application → Prometheus → Grafana).

Recommendation: Add at least basic unit tests for the metrics server and registration; a minimal sketch follows.
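
A minimal sketch of such a unit test, written against the prometheus crate only (names are illustrative and not tied to the actual server.rs API):

    use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

    #[test]
    fn registry_encodes_to_prometheus_text_format() {
        // Use an isolated registry so the test does not depend on global state.
        let registry = Registry::new();
        let counter = IntCounter::new("test_requests_total", "Test counter.").unwrap();
        registry.register(Box::new(counter.clone())).unwrap();
        counter.inc();

        // Encode the same way a /metrics handler would and check the output.
        let mut buffer = Vec::new();
        TextEncoder::new()
            .encode(&registry.gather(), &mut buffer)
            .expect("encoding a healthy registry should not fail");

        let body = String::from_utf8(buffer).unwrap();
        assert!(body.contains("test_requests_total 1"));
    }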


Architecture & Design

✅ Strengths

  1. Clean Separation: Metrics package is well-isolated with clear responsibilities.

  2. Reusable Constants: BUCKETS, MICRO_BUCKETS, and TASK_POLL_BUCKETS are defined centrally.

  3. Proper Labels: Registry includes useful default labels (service, pod_id, source_hash).

💡 Suggestions

  1. Docker Template Warning: The CLAUDE.md states not to edit docker configs directly but to use templates. Verify that all docker changes in this PR were generated from templates by running:

    cd engine/docker/template && pnpm start

    and confirming there are no unexpected diffs.

  2. Documentation: Consider adding a comment in metrics/src/lib.rs explaining the migration from OTEL to Prometheus and the reasoning.

  3. Metric Cleanup: The TODO comment at line 4 in api-builder/src/metrics.rs mentions tracking request/response body size. Consider creating an issue to track this future enhancement.


Summary & Verdict

Overall Assessment: The PR is well-structured and follows a consistent pattern across the codebase. The migration from OTEL to Prometheus is thorough.

Required Changes:

  1. Fix indentation inconsistency at api-builder/src/middleware.rs:189
  2. Replace the .unwrap() calls on metric registration with .expect() and a descriptive message

Recommended Changes:

  1. Consider using a constant label value such as "-" instead of an empty string for missing error codes
  2. Add basic unit tests for the metrics server
  3. Document port 6430 constant or reference it from defaults.rs
  4. Add authentication consideration comment for production deployment
  5. Verify all docker config changes were generated from templates

Performance: Monitor metric cardinality after deployment, especially for path labels in API metrics.

Security: Ensure metrics endpoint is not exposed externally.

The changes are safe to merge after addressing the required indentation fix and improving error messages.

