
Conversation

@xurui-c xurui-c commented Jan 5, 2026

The product team reports that pagination is slow. We're fairly sure it's because the ORDER BY clause generated by the GetTrace endpoint doesn't exactly match the sort key of the ClickHouse table. This is related to the pre-existing bug uncovered as part of #7586.

We won't know how much this speeds up pagination until we deploy it, so in the meantime I can only spot-check. I got a trace from the product team, clicked on it a few times, then filtered by my email in the trace explorer and found a trace that contains the pagination SQL queries (pasted in the Notion doc below).
I then pasted each of those pagination SQL queries into the tracing tool, rewrote the ORDER BY according to this PR, and compared performance. This Notion doc tracks the work: https://www.notion.so/sentry/pagination-speed-ups-2e08b10e4b5d803596dbf6620ebd0db4?source=copy_link
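The mismatch can be illustrated with a toy sketch (plain Python, not Snuba or ClickHouse code; column values are made up): when rows are stored in the table's sort-key order, an ORDER BY that is a prefix of that key is already satisfied and the rows can be streamed as stored, while ordering by timestamp alone forces a re-sort.

```python
# Toy illustration of why ORDER BY should match the table sort key.
sort_key = ("organization_id", "project_id", "item_type", "timestamp", "trace_id", "item_id")

rows = [
    {"organization_id": 1, "project_id": 1, "item_type": 1, "timestamp": 10, "trace_id": "a", "item_id": "x"},
    {"organization_id": 1, "project_id": 1, "item_type": 1, "timestamp": 20, "trace_id": "a", "item_id": "y"},
    {"organization_id": 1, "project_id": 2, "item_type": 1, "timestamp": 5,  "trace_id": "b", "item_id": "z"},
]

def key(cols):
    return lambda r: tuple(r[c] for c in cols)

# Rows as ClickHouse would store them: ordered by the full sort key.
stored = sorted(rows, key=key(sort_key))

# An ORDER BY that is a prefix of the sort key is already satisfied.
assert stored == sorted(stored, key=key(sort_key[:4]))

# ORDER BY timestamp alone does NOT match the storage order here.
assert stored != sorted(stored, key=key(("timestamp",)))
```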

@xurui-c xurui-c marked this pull request as ready for review January 5, 2026 22:31
@xurui-c xurui-c requested a review from a team as a code owner January 5, 2026 22:31
Copilot AI review requested due to automatic review settings January 5, 2026 22:31
@xurui-c xurui-c changed the title fix(eap): GetTrace endpoint uses correct order by for pagination fix(eap): speed up GetTrace endpoint pagination Jan 5, 2026
Comment on lines 288 to +295
        ),
    ]
    new_order_by = [
        OrderBy(
            direction=OrderByDirection.ASC,
            expression=column("timestamp"),
        ),
        OrderBy(
            direction=OrderByDirection.ASC,
            expression=column("item_timestamp"),
        ),
        OrderBy(
            direction=OrderByDirection.ASC,
            expression=column("item_id"),
        ),
        # we add organization_id and project_id to the order by to optimize data reading
        # https://clickhouse.com/docs/sql-reference/statements/select/order-by#optimization-of-data-reading
        OrderBy(direction=OrderByDirection.ASC, expression=column("organization_id")),
        OrderBy(direction=OrderByDirection.ASC, expression=column("project_id")),
        OrderBy(direction=OrderByDirection.ASC, expression=column("item_type")),

Member
I think we still need item_timestamp in the ORDER BY (between trace_id and item_id). It's a more precise timestamp than timestamp, but we want to ensure its ordering. You can double-check with Pierre, and you'll have to verify the implications for performance.

Did you already test the performance of the existing ORDER BY using set_skip_transform_order_by? If not, I think that would be worth testing before changing the ORDER BY.

Copilot AI left a comment

Pull request overview

This PR fixes a pagination performance issue in the GetTrace endpoint by aligning the ORDER BY clause with the ClickHouse table's sort key. The mismatch was causing slow pagination as ClickHouse couldn't efficiently skip data.

  • Updated ORDER BY to match table sort key: organization_id, project_id, item_type, timestamp, trace_id, item_id
  • Added set_skip_transform_order_by(True) to prevent column transformations that would break the alignment
  • Added test to verify the SQL ORDER BY clause matches the expected ordering

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

snuba/web/rpc/v1/endpoint_get_trace.py
  Updated ORDER BY clause to match the ClickHouse table sort key (added organization_id, project_id, item_type, trace_id; removed the item_timestamp alias). Added set_skip_transform_order_by(True) to preserve exact column names in ORDER BY.

tests/web/rpc/v1/test_endpoint_get_trace.py
  Added new test test_no_transformation_on_order_by to verify that the ORDER BY clause in the generated SQL matches the expected column order without transformations. Added necessary imports (QueryResult, run_query).


OrderBy(direction=OrderByDirection.ASC, expression=column("organization_id")),
OrderBy(direction=OrderByDirection.ASC, expression=column("project_id")),
OrderBy(direction=OrderByDirection.ASC, expression=column("item_type")),
OrderBy(direction=OrderByDirection.ASC, expression=column("timestamp")),
Copilot AI Jan 5, 2026

The new ORDER BY clause doesn't align with the existing pagination filter logic (page_token_filter at lines 254-283). The page_token_filter uses item_timestamp (a computed alias) and timestamp to determine the next page, but the new ORDER BY doesn't include item_timestamp. This mismatch means pagination will not work correctly.

For correct pagination behavior, either:

  1. Update the page_token_filter to match this new ORDER BY (filtering on organization_id, project_id, item_type, timestamp, trace_id, item_id)
  2. Include item_timestamp in the ORDER BY if precision-based pagination is required
  3. Ensure the page_token stores and filters on all ORDER BY columns in the correct order

Currently, the page_token_filter uses timestamp and item_timestamp, but this ORDER BY only uses timestamp without item_timestamp, which will cause incorrect pagination boundaries.

Suggested change
- OrderBy(direction=OrderByDirection.ASC, expression=column("timestamp")),
+ OrderBy(direction=OrderByDirection.ASC, expression=column("timestamp")),
+ OrderBy(direction=OrderByDirection.ASC, expression=column("item_timestamp")),
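A hedged sketch of the keyset-pagination idea behind this comment (the function name and string-building here are illustrative only, not the actual page_token_filter code): for an ascending ORDER BY over columns (c1, ..., cn), the "rows after the last seen row" predicate is the lexicographic comparison (c1, ..., cn) > (v1, ..., vn), expanded into AND/OR form — which is why the page token must store and filter on every ORDER BY column, in order.

```python
def keyset_predicate(columns, last_seen):
    """Build (c1 > v1) OR (c1 = v1 AND c2 > v2) OR ... as a SQL-like string."""
    clauses = []
    for i, (col, val) in enumerate(zip(columns, last_seen)):
        # All earlier columns are pinned to equality; this column strictly advances.
        eqs = [f"{c} = {v!r}" for c, v in zip(columns[:i], last_seen[:i])]
        clauses.append(" AND ".join(eqs + [f"{col} > {val!r}"]))
    return " OR ".join(f"({c})" for c in clauses)

print(keyset_predicate(["timestamp", "item_id"], [100, "abc"]))
# (timestamp > 100) OR (timestamp = 100 AND item_id > 'abc')
```

If a column appears in the ORDER BY but not in this predicate (or vice versa), page boundaries no longer line up with the sort order, which is exactly the mismatch the comment is flagging.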

Comment on lines +622 to +659
def test_no_transformation_on_order_by(self, setup_teardown: Any, monkeypatch: Any) -> None:
    # Wrap the real run_query to capture the actual QueryResult while still hitting ClickHouse.
    captured: dict[str, Any] = {}

    def wrapper(dataset, request, timer, robust: bool = False, concurrent_queries_gauge=None) -> QueryResult:  # type: ignore[no-untyped-def]
        qr = run_query(dataset, request, timer, robust, concurrent_queries_gauge)
        captured["query_result"] = qr
        return qr

    monkeypatch.setattr("snuba.web.rpc.v1.endpoint_get_trace.run_query", wrapper)

    ts = Timestamp(seconds=int(_BASE_TIME.timestamp()))
    three_hours_later = int((_BASE_TIME + timedelta(hours=3)).timestamp())
    message = GetTraceRequest(
        meta=RequestMeta(
            project_ids=[1, 2, 3],
            organization_id=1,
            cogs_category="something",
            referrer="something",
            start_timestamp=ts,
            end_timestamp=Timestamp(seconds=three_hours_later),
            request_id=_REQUEST_ID,
        ),
        trace_id=_TRACE_ID,
        items=[
            GetTraceRequest.TraceItem(
                item_type=TraceItemType.TRACE_ITEM_TYPE_SPAN,
            )
        ],
    )

    EndpointGetTrace().execute(message)

    qr = captured["query_result"]
    assert (
        "ORDER BY organization_id ASC, project_id ASC, item_type ASC, timestamp ASC, trace_id ASC, item_id ASC"
        in qr.extra["sql"]
    )
Copilot AI Jan 5, 2026

Consider explicitly setting state.set_config("enable_trace_pagination", 1) at the beginning of this test to make it clear that it's testing the new pagination order by behavior. While pagination is enabled by default (ENABLE_TRACE_PAGINATION_DEFAULT = 1), explicit configuration makes the test intent clearer and more resilient to future default changes.

@volokluev
Member

What will you monitor to see if this has an impact on the product in aggregate?


# Update last seen pagination values
last_seen_item_type = int(row.pop("item_type", TraceItemType.TRACE_ITEM_TYPE_UNSPECIFIED))
last_seen_timestamp = row.pop("integer_timestamp", None)

Bug: The variable last_seen_timestamp can become None if integer_timestamp is missing from a result row, causing a TypeError during protobuf serialization for pagination.
Severity: CRITICAL

🔍 Detailed Analysis

The ProcessedResults NamedTuple defines last_seen_timestamp as a float. However, it is assigned a value using row.pop("integer_timestamp", None), which can result in last_seen_timestamp being None. When a paginated response requires creating a page token, this None value is passed to EndpointGetTracePageToken. The subsequent attempt to serialize this token to protobuf will fail with a TypeError when trying to assign None to a double field, causing a crash. This occurs when pagination is enabled and a result row is missing the integer_timestamp field.

💡 Suggested Fix

Change the default value in the .pop() call to match the expected type and initialization value. Update the line to last_seen_timestamp = row.pop("integer_timestamp", 0.0) to ensure it is always a float.
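A minimal stand-alone illustration of the failure mode and the fix (plain Python; float() stands in for the protobuf double-field assignment, which similarly rejects None):

```python
# A result row missing the "integer_timestamp" field, as in the bug report.
row = {"item_type": 1}

# Buggy version: default of None violates the declared float type.
last_seen = dict(row).pop("integer_timestamp", None)   # -> None
# Fixed version: default matches the expected type.
fixed = dict(row).pop("integer_timestamp", 0.0)        # -> 0.0

assert last_seen is None
assert fixed == 0.0

# Assigning None where a float is required fails, much like setting a
# protobuf double field to None would.
try:
    float(last_seen)
except TypeError:
    print("TypeError: None cannot be used as a float")
```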

Location: snuba/web/rpc/v1/endpoint_get_trace.py#L548

@phacops phacops added bug Something isn't working and removed bug Something isn't working labels Jan 12, 2026
