@samikshya-db
Contributor

Part 5 of the 7-part telemetry implementation stack

This layer adds HTTP export with retry logic and circuit breaker integration.

Summary

Implements DatabricksTelemetryExporter for reliable HTTP export to Databricks with retry logic, exponential backoff, and circuit breaker protection.

Components

DatabricksTelemetryExporter (lib/telemetry/DatabricksTelemetryExporter.ts)

HTTP export with resilience patterns:

Endpoints (a later commit in this PR shortens these paths to /telemetry-ext and /telemetry-unauth):

  • /api/2.0/sql/telemetry-ext - Authenticated export (default)
  • /api/2.0/sql/telemetry-unauth - Unauthenticated export (fallback)

Export Flow:

  1. Check circuit breaker state (skip if OPEN)
  2. Execute export with circuit breaker protection
  3. Retry on retryable errors with exponential backoff
  4. Circuit breaker tracks success/failure for endpoint health
  5. All exceptions swallowed and logged at debug level
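
The five steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `Breaker` interface, the `exportMetrics` function, and the injected `send`, `debug`, and `sleep` parameters are hypothetical names, and recording a single failure per export (after retries are exhausted) is an assumption.

```typescript
// Hypothetical shapes for illustration only; the real classes live in lib/telemetry.
interface Breaker {
  isOpen(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

const RETRYABLE = new Set<number>([429, 500, 502, 503, 504]);

async function exportMetrics(
  breaker: Breaker,
  send: () => Promise<number>, // performs the HTTP POST and resolves to the status code
  debug: (message: string) => void,
  maxRetries: number = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    }),
): Promise<void> {
  try {
    // Step 1: skip the export entirely while the circuit is OPEN.
    if (breaker.isOpen()) {
      debug('circuit OPEN, skipping telemetry export');
      return;
    }
    for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
      // Step 2: execute the export.
      const status = await send();
      if (status >= 200 && status < 300) {
        breaker.recordSuccess(); // Step 4: report endpoint health.
        return;
      }
      // Terminal status, or retries exhausted: give up and report the failure.
      if (!RETRYABLE.has(status) || attempt === maxRetries) {
        breaker.recordFailure(); // Step 4
        debug(`telemetry export failed with HTTP ${status}`);
        return;
      }
      // Step 3: exponential backoff with jitter before the next attempt.
      await sleep(100 * Math.pow(2, attempt) + Math.random() * 100);
    }
  } catch (err) {
    // Step 5: swallow everything (e.g. network errors); log at debug only.
    breaker.recordFailure();
    debug(`telemetry export error: ${String(err)}`);
  }
}
```

Injecting `sleep` keeps the retry loop testable without real delays; the caller never needs a try/catch around `exportMetrics`.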

Retry Strategy:

  • Max retries: 3 (configurable via telemetryMaxRetries)
  • Backoff: Exponential with jitter (100ms * 2^attempt + random(0-100ms))
  • Terminal errors → No retry (401, 403, 404, 400, AuthenticationError)
  • Retryable errors → Retry with backoff (429, 500, 502, 503, 504, timeouts)
  • Prevents thundering herd with jitter
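
For reference, the delay formula and status classification listed above could look like the sketch below. The helper names (`isTerminalStatus`, `isRetryableStatus`, `backoffDelayMs`) are hypothetical and not taken from this PR; only the status codes and the `100ms * 2^attempt + random(0-100ms)` formula come from the description.

```typescript
// Hypothetical helpers; only the codes and the formula come from the PR text.
const TERMINAL_STATUS = new Set<number>([400, 401, 403, 404]);
const RETRYABLE_STATUS = new Set<number>([429, 500, 502, 503, 504]);

function isTerminalStatus(status: number): boolean {
  return TERMINAL_STATUS.has(status);
}

function isRetryableStatus(status: number): boolean {
  return RETRYABLE_STATUS.has(status);
}

// attempt is zero-based; `random` is injectable so tests can be deterministic.
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const baseMs = 100;
  const jitterMs = random() * 100; // uniform in [0, 100)
  return baseMs * Math.pow(2, attempt) + jitterMs;
}
```

With jitter ignored, attempts 0, 1, and 2 wait 100ms, 200ms, and 400ms respectively.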

Circuit Breaker Integration:

  • Success → circuitBreaker.recordSuccess()
  • Failure → circuitBreaker.recordFailure()
  • Circuit OPEN → Skip export, log at debug, save bandwidth
  • Automatic recovery via HALF_OPEN state testing

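Payload Format (initial version; a later commit in this PR switches to uploadTime, items, and protoLogs and removes workspace_id):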

{
  "workspace_id": "...",
  "session_id": "...",
  "driver_version": "...",
  "metrics": [...]
}

Critical Requirements ✅

  • NEVER throws exceptions - All errors caught and swallowed
  • Debug-only logging - No warn/error logs from telemetry
  • No console logging - Uses IDBSQLLogger only
  • Driver resilience - Driver continues when telemetry fails

Testing

  • 24 comprehensive unit tests (96% statement, 84% branch coverage)
  • Tests verify exception swallowing (CRITICAL)
  • Tests verify retry logic with backoff
  • Tests verify circuit breaker integration
  • Tests verify authenticated vs unauthenticated endpoints
  • TelemetryExporterStub for integration testing

Performance Characteristics

  • Circuit breaker prevents wasted calls to failing endpoints
  • Exponential backoff reduces server load during failures
  • Jitter prevents synchronized retry storms
  • Batch export amortizes HTTP overhead

Next Steps

This PR is followed by:

  • [6/7] Integration: Wire into DBSQLClient/Operation/Session
  • [7/7] Testing & Documentation

Dependencies

Depends on:

samikshya-db and others added 9 commits January 29, 2026 20:21
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Defaults: 5 failures open the circuit, a 60s timeout applies while OPEN, and 2 successes close it again
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level
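
A rough sketch of the three-state machine described above, using the stated defaults. The class name `SimpleCircuitBreaker` and its methods are hypothetical, and two details are assumptions rather than facts from this PR: the 60s timeout is treated as the time spent in OPEN before HALF_OPEN, and a success in CLOSED resets the failure count.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical, simplified breaker; not the CircuitBreaker class from this stack.
class SimpleCircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private openTimeoutMs: number;
  private successThreshold: number;
  private now: () => number;

  constructor(
    failureThreshold: number = 5,
    openTimeoutMs: number = 60000,
    successThreshold: number = 2,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.openTimeoutMs = openTimeoutMs;
    this.successThreshold = successThreshold;
    this.now = now;
  }

  // Moves OPEN -> HALF_OPEN lazily once the timeout has elapsed.
  getState(): BreakerState {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.openTimeoutMs) {
      this.state = 'HALF_OPEN';
      this.successes = 0;
    }
    return this.state;
  }

  recordSuccess(): void {
    const state = this.getState();
    if (state === 'HALF_OPEN') {
      this.successes += 1;
      if (this.successes >= this.successThreshold) {
        this.state = 'CLOSED';
        this.failures = 0;
      }
    } else if (state === 'CLOSED') {
      this.failures = 0; // assumption: only consecutive failures count
    }
  }

  recordFailure(): void {
    const state = this.getState();
    if (state === 'OPEN') return;
    this.failures += 1;
    // Any failure while HALF_OPEN, or reaching the threshold, opens the circuit.
    if (state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
      this.failures = 0;
    }
  }
}
```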

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero
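
The TTL and reference-counting behaviour above can be illustrated with a small sketch. `SimpleFeatureFlagCache` and its method names are hypothetical, and the fetcher is synchronous here purely to keep the example short; the actual cache fetches over HTTP.

```typescript
interface FlagContext {
  refCount: number;
  enabled?: boolean;
  fetchedAt?: number;
}

// Hypothetical, simplified cache; not the FeatureFlagCache class from this stack.
class SimpleFeatureFlagCache {
  private contexts = new Map<string, FlagContext>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = 15 * 60 * 1000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Called when a connection to `host` opens.
  acquire(host: string): void {
    const ctx = this.contexts.get(host);
    if (ctx) {
      ctx.refCount += 1;
    } else {
      this.contexts.set(host, { refCount: 1 });
    }
  }

  // Called when a connection closes; the context is dropped at refCount zero.
  release(host: string): void {
    const ctx = this.contexts.get(host);
    if (!ctx) return;
    ctx.refCount -= 1;
    if (ctx.refCount <= 0) this.contexts.delete(host);
  }

  // Refetches only when nothing is cached yet or the cached value has expired.
  isEnabled(host: string, fetchFlag: () => boolean): boolean {
    const ctx = this.contexts.get(host);
    if (!ctx) return false;
    const expired = ctx.fetchedAt === undefined || this.now() - ctx.fetchedAt >= this.ttlMs;
    if (expired) {
      ctx.enabled = fetchFlag();
      ctx.fetchedAt = this.now();
    }
    return ctx.enabled === true;
  }

  has(host: string): boolean {
    return this.contexts.has(host);
  }
}
```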

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier

This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management
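
A minimal sketch of the one-client-per-host lifecycle described above. `RefCountedProvider`, `getOrCreate`, and `release` are hypothetical names chosen for illustration and may not match TelemetryClientProvider's actual API.

```typescript
interface Closable {
  close(): void;
}

// Hypothetical generic provider; illustrates the lifecycle only.
class RefCountedProvider<T extends Closable> {
  private entries = new Map<string, { client: T; refCount: number }>();
  private create: (host: string) => T;

  constructor(create: (host: string) => T) {
    this.create = create;
  }

  // First caller for a host creates the client; later callers share it.
  getOrCreate(host: string): T {
    let entry = this.entries.get(host);
    if (!entry) {
      entry = { client: this.create(host), refCount: 0 };
      this.entries.set(host, entry);
    }
    entry.refCount += 1;
    return entry.client;
  }

  // Last caller to release a host closes and removes its client.
  release(host: string): void {
    const entry = this.entries.get(host);
    if (!entry) return;
    entry.refCount -= 1;
    if (entry.refCount <= 0) {
      this.entries.delete(host);
      entry.client.close();
    }
  }

  size(): number {
    return this.entries.size;
  }
}
```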

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure

This is part 4 of 7 in the telemetry implementation stack.

Components:
- TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter
- MetricsAggregator: Per-statement aggregation with batch processing

TelemetryEventEmitter:
- Event-driven architecture using Node.js EventEmitter
- Type-safe event emission methods
- Respects telemetryEnabled configuration flag
- All exceptions swallowed and logged at debug level
- Zero impact when disabled

Event Types:
- connection.open: On successful connection
- statement.start: On statement execution
- statement.complete: On statement finish
- cloudfetch.chunk: On chunk download
- error: On exception with terminal classification

MetricsAggregator:
- Per-statement aggregation by statement_id
- Connection events emitted immediately (no aggregation)
- Statement events buffered until completeStatement() called
- Terminal exceptions flushed immediately
- Retryable exceptions buffered until statement complete
- Batch size (default 100) triggers flush
- Periodic timer (default 5s) triggers flush
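
The buffering rules above can be sketched as follows. `SimpleAggregator` and its method names other than `completeStatement` are hypothetical, the `Metric` shape is a placeholder, and the periodic timer is left out; in practice it would simply call `flush()` on an interval.

```typescript
interface Metric {
  name: string;
}

// Hypothetical, simplified aggregator; not the MetricsAggregator class itself.
class SimpleAggregator {
  private pending: Metric[] = []; // ready for export
  private byStatement = new Map<string, Metric[]>(); // held until completion
  private flushFn: (batch: Metric[]) => void;
  private batchSize: number;

  constructor(flushFn: (batch: Metric[]) => void, batchSize: number = 100) {
    this.flushFn = flushFn;
    this.batchSize = batchSize;
  }

  // Statement events are buffered per statement id.
  recordStatementEvent(statementId: string, metric: Metric): void {
    const buffered = this.byStatement.get(statementId) ?? [];
    buffered.push(metric);
    this.byStatement.set(statementId, buffered);
  }

  // Terminal errors go out immediately; retryable ones wait for the statement.
  recordError(statementId: string, metric: Metric, terminal: boolean): void {
    if (terminal) {
      this.pending.push(metric);
      this.flush();
    } else {
      this.recordStatementEvent(statementId, metric);
    }
  }

  completeStatement(statementId: string): void {
    const buffered = this.byStatement.get(statementId) ?? [];
    this.byStatement.delete(statementId);
    for (const metric of buffered) this.enqueue(metric);
  }

  flush(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.flushFn(batch);
  }

  // Reaching the batch size triggers a flush.
  private enqueue(metric: Metric): void {
    this.pending.push(metric);
    if (this.pending.length >= this.batchSize) this.flush();
  }
}
```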

Batching Strategy:
- Optimizes export efficiency
- Reduces HTTP overhead
- Smart flushing based on error criticality
- Memory efficient with bounded buffers

Testing:
- 31 comprehensive unit tests for TelemetryEventEmitter
- 32 comprehensive unit tests for MetricsAggregator
- 100% function coverage, >90% line/branch coverage
- Tests verify exception swallowing
- Tests verify debug-only logging

Dependencies:
- Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management

This is part 5 of 7 in the telemetry implementation stack.

Components:
- DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker
- TelemetryExporterStub: Test stub for integration tests

DatabricksTelemetryExporter:
- Exports telemetry metrics to Databricks via HTTP POST
- Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth)
- Integrates with CircuitBreaker for per-host endpoint protection
- Retry logic with exponential backoff and jitter
- Exception classification (terminal vs retryable)

Export Flow:
1. Check circuit breaker state (skip if OPEN)
2. Execute with circuit breaker protection
3. Retry on retryable errors with backoff
4. Circuit breaker tracks success/failure
5. All exceptions swallowed and logged at debug level

Retry Strategy:
- Max retries: 3 (default, configurable)
- Exponential backoff: 100ms * 2^attempt
- Jitter: Random 0-100ms to prevent thundering herd
- Terminal errors: No retry (401, 403, 404, 400)
- Retryable errors: Retry with backoff (429, 500, 502, 503, 504)

Circuit Breaker Integration:
- Success → Record success with circuit breaker
- Failure → Record failure with circuit breaker
- Circuit OPEN → Skip export, log at debug
- Automatic recovery via HALF_OPEN state

Critical Requirements:
- All exceptions swallowed (NEVER throws)
- All logging at LogLevel.debug ONLY
- No console logging
- Driver continues when telemetry fails

Testing:
- 24 comprehensive unit tests
- 96% statement coverage, 84% branch coverage
- Tests verify exception swallowing
- Tests verify retry logic
- Tests verify circuit breaker integration
- TelemetryExporterStub for integration tests

Dependencies:
- Builds on all previous layers [1/7] through [4/7]

Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Use getAuthHeaders() method for authenticated endpoint requests
- Remove TODO comments about missing authentication
- Add auth headers when telemetryAuthenticatedExport is true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Use NODEJS client type instead of OSS_NODEJS for feature flags
- Use /telemetry-ext and /telemetry-unauth (not /api/2.0/sql/...)
- Update payload to match proto: system_configuration with snake_case
- Add URL utility to handle protocol correctly

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing
- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
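
To illustrate the restructured payload mentioned in this commit, a hedged sketch follows. Only the three top-level field names and the fact that protoLogs holds JSON-stringified log objects come from the commit message; the `TelemetryRequest` type name, the `buildRequest` helper, the field types, and leaving `items` empty are all assumptions.

```typescript
// Hypothetical type; field types are assumed, not confirmed by the PR.
interface TelemetryRequest {
  uploadTime: number; // assumed: epoch milliseconds
  items: string[]; // assumed: left empty
  protoLogs: string[]; // each entry is a JSON-stringified log object
}

function buildRequest(
  logs: object[],
  now: () => number = () => Date.now(),
): TelemetryRequest {
  return {
    uploadTime: now(),
    items: [],
    protoLogs: logs.map((log) => JSON.stringify(log)),
  };
}
```

Note that no workspace_id field appears, matching the removal described above.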
@samikshya-db samikshya-db force-pushed the telemetry-4-event-aggregation branch from 31f847e to 29376a6 Compare January 29, 2026 20:21
Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.