
Conversation

@volokluev
Member

This is entirely vibecoded; don't review yet.

Problem

The EndpointGetTraces endpoint has a performance bottleneck when retrieving metadata for large numbers of trace IDs: the current approach passes an IN clause containing potentially thousands of trace IDs, which can hit ClickHouse query size limits and cause failures.

Solution

Implemented a subquery optimization that eliminates large IN clauses by using SQL subqueries instead of passing trace ID lists back and forth.

Key Changes:

  • Added runtime configuration use_subquery_optimization_for_traces to enable/disable the optimization
  • Implemented _get_metadata_for_traces_with_subquery() method that uses placeholder substitution
  • Added helper functions for SQL extraction and direct execution
  • Supports both single-item and cross-item queries

Technical Approach:

  1. Generate trace ID subquery SQL using dry run mode
  2. Generate metadata query with UUID placeholder instead of actual trace IDs
  3. Replace placeholder pattern with subquery SQL: replaceAll(toString(trace_id), '-', '') IN (SELECT ...)
  4. Execute combined query directly and map result columns appropriately
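
A minimal, runnable sketch of these four steps follows; the three helpers are stand-ins for the ones named in this PR (_get_subquery_sql, the dry-run metadata formatting, and _execute_direct_sql), and the table and column names are illustrative only.

# Sketch only: helper bodies and SQL below are stand-ins, not the PR's code.
PLACEHOLDER = "in(trace_id, ['00000000-0000-0000-0000-000000000000'])"

def _get_subquery_sql() -> str:
    # Step 1: dry-run formatting of the trace ID query (no execution).
    return "SELECT DISTINCT trace_id FROM items WHERE project_id = 1"

def _get_metadata_sql_with_placeholder() -> str:
    # Step 2: dry-run formatting of the metadata query, with a placeholder
    # UUID standing in for the real trace ID list.
    return f"SELECT trace_id, count() FROM items WHERE {PLACEHOLDER} GROUP BY trace_id"

def _execute_direct_sql(sql: str) -> None:
    # Step 4: direct ClickHouse execution of the combined query.
    print(sql)

# Step 3: splice the subquery into the placeholder IN clause.
subquery_sql = _get_subquery_sql()
metadata_sql = _get_metadata_sql_with_placeholder()
in_clause = f"replaceAll(toString(trace_id), '-', '') IN ({subquery_sql})"
final_sql = metadata_sql.replace(PLACEHOLDER, in_clause)
_execute_direct_sql(final_sql)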

Performance Impact

  • Eliminates query size limits: the metadata query is no longer bounded by the number of trace IDs
  • Reduces query complexity: a single combined query instead of multiple round trips
  • Backward compatible: falls back to the original approach when disabled
  • Runtime configurable: can be enabled per environment

Testing

  • ✅ All 14 existing trace tests pass
  • ✅ New test validates optimization for cross-event queries
  • ✅ Verified all tests pass with optimization enabled globally

Configuration

# Enable optimization (disabled by default)
set_config("use_subquery_optimization_for_traces", True)
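
At request time the endpoint checks this flag and falls back to the original approach when it is off. A hedged sketch of that gate; the import path and the fallback helper name are assumptions, not the PR's actual code:

# Sketch only: get_config is assumed to be snuba's runtime-config reader and
# _get_metadata_for_traces_with_in_clause is an illustrative name for the
# original code path.
from snuba.state import get_config

def get_metadata_for_traces(request):
    if get_config("use_subquery_optimization_for_traces", 0):
        # Optimized path: one combined query built around a SQL subquery.
        return _get_metadata_for_traces_with_subquery(request)
    # Default path: original IN-clause approach (backward compatible).
    return _get_metadata_for_traces_with_in_clause(request)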

@volokluev volokluev requested review from a team as code owners January 8, 2026 01:10
@volokluev volokluev marked this pull request as draft January 8, 2026 01:10
"""
filter_expressions = []
for trace_filter in trace_filters:
    assert hasattr(trace_filter, "fitler")

Bug: A typo in an assertion, hasattr(trace_filter, "fitler") instead of "filter", will cause cross-item queries to crash when a specific feature flag is enabled.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

In the _build_cross_item_query function, an assertion checks for a misspelled attribute fitler on the trace_filter object. However, the correct attribute name is filter. This typo will cause hasattr(trace_filter, "fitler") to return False, triggering an AssertionError. This will cause any cross-item query to fail when the use_subquery_optimization_for_traces feature flag is enabled, as this code path is executed under that condition.

💡 Suggested Fix

Correct the typo in the assert statement on line 492 from fitler to filter. The line should be assert hasattr(trace_filter, "filter").

🤖 Prompt for AI Agent

Review the code at the location below and verify whether this is a real issue. If it is, propose a fix; if not, explain why it is not valid.

Location: snuba/web/rpc/v1/endpoint_get_traces.py#L492

Comment on lines +1055 to +1061
placeholder_pattern = "in(trace_id, ['00000000-0000-0000-0000-000000000000'])"

# Use IN clause with subquery instead of placeholder
# Both queries use hex format for trace_id comparison
# Replace the placeholder IN clause with the subquery directly
in_subquery_clause = f"replaceAll(toString(trace_id), '-', '') IN ({subquery_sql})"
final_sql = metadata_sql.replace(placeholder_pattern, in_subquery_clause)

Bug: Splicing the subquery into the SQL via fragile string replacement can fail silently, returning empty results. The function also returns metadata from a dry run, not from the actual executed query.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

When subquery optimization is enabled, the code constructs a final SQL query by replacing a hardcoded placeholder string. This replacement is fragile and can fail silently if the SQL formatter's output changes, causing the query to execute with a placeholder UUID and return no results. Additionally, the function returns metadata and statistics from a dry_run query, not the actual executed query. This results in incorrect debug information and metrics, as the reported SQL and stats will not match what was actually run against the database.

💡 Suggested Fix

Instead of string replacement, use a more robust method to combine the subquery and the main query, such as modifying the query's abstract syntax tree (AST) before formatting. To fix the metadata issue, capture the results and metadata from the final _execute_direct_sql call and use that to construct the QueryResult that is returned.
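
As a narrower interim mitigation (distinct from the AST-level fix suggested above), the string replacement can at least be made to fail loudly when the formatter no longer emits the expected placeholder; a self-contained sketch:

# Sketch only: guards the fragile replacement so a formatter change surfaces
# as an error instead of a query against the dummy UUID.
PLACEHOLDER = "in(trace_id, ['00000000-0000-0000-0000-000000000000'])"

def splice_subquery_or_raise(metadata_sql: str, subquery_sql: str) -> str:
    if PLACEHOLDER not in metadata_sql:
        raise ValueError(
            "placeholder IN clause not found in formatted metadata SQL; "
            "the SQL formatter output may have changed"
        )
    in_clause = f"replaceAll(toString(trace_id), '-', '') IN ({subquery_sql})"
    return metadata_sql.replace(PLACEHOLDER, in_clause, 1)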

🤖 Prompt for AI Agent

Review the code at the location below and verify whether this is a real issue. If it is, propose a fix; if not, explain why it is not valid.

Location: snuba/web/rpc/v1/endpoint_get_traces.py#L1055-L1061

Comment on lines +629 to +638
trace_item_filters_expression = trace_item_filters_to_expression(
    TraceItemFilter(
        and_filter=AndFilter(
            filters=[f.filter for f in in_msg.filters],
        ),
    ),
    attribute_key_to_expression,
)
selected_columns: list[SelectedExpression] = [
    SelectedExpression(

Bug: The subquery optimization path does not set the sampling tier, causing it to query the wrong storage table when trace sampling is enabled, leading to incorrect results.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

When both enable_trace_sampling and use_subquery_optimization_for_traces are enabled, the subquery optimization path fails to apply the sampling tier setting. The _get_subquery_sql function generates SQL for the subquery without calling settings.set_sampling_tier(), causing it to default to the non-sampled storage. However, the main query may be routed to a downsampled storage tier. This mismatch of storage tiers between the subquery and the main query can lead to incorrect results, schema mismatches, or query failures.

💡 Suggested Fix

In the _get_subquery_sql function, ensure the sampling_tier from the routing_decision is applied to the HTTPQuerySettings before the dry run query is executed. This will ensure the subquery is generated against the correct storage tier, matching the main query.
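
A sketch of that part of the fix; set_sampling_tier comes from the analysis above, while the shape of routing_decision (a .tier attribute) is an assumption and may not match the PR's code:

from typing import Any

def apply_sampling_tier(settings: Any, routing_decision: Any) -> None:
    # Pin the dry-run subquery to the same storage tier chosen for the main
    # query so both read from the same (possibly downsampled) table.
    sampling_tier = getattr(routing_decision, "tier", None)
    if sampling_tier is not None:
        settings.set_sampling_tier(sampling_tier)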

🤖 Prompt for AI Agent

Review the code at the location below and verify whether this is a real issue. If it is, propose a fix; if not, explain why it is not valid.

Location: snuba/web/rpc/v1/endpoint_get_traces.py#L629-L638
