Data source filtering can cause a significant amount of data to be scanned #2371

@scholtzan

Description

It seems like the BigQuery optimizer is struggling with the date filter on the data sources:

AND ds.submission_date BETWEEN '2024-10-31' AND '2024-12-19'
AND ds.submission_date BETWEEN
    DATE_ADD(e.enrollment_date, INTERVAL e.analysis_window_start DAY)
    AND DATE_ADD(e.enrollment_date, INTERVAL e.analysis_window_end DAY)

It ends up scanning all partitions between 2024-10-31 and 2024-12-19, even though for daily runs only the second part of the filter is relevant and restricts the scan to a single date. When I changed the filter to a specific date, e.g. ds.submission_date = '2024-12-19', it scanned significantly less data and used far fewer slots.
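To illustrate the observation above (a sketch with hypothetical dates, not the actual query-generation code): for a daily run, the enrollment-relative window intersected with the overall date filter often collapses to a single date, which could in principle be computed ahead of time and passed to BigQuery as an explicit date so partition pruning applies.

```python
from datetime import date, timedelta

def effective_date_range(overall_start, overall_end,
                         enrollment_date, window_start_days, window_end_days):
    """Intersect the overall submission_date filter with the
    enrollment-relative analysis window, mirroring the two BETWEEN
    clauses in the generated SQL. Returns None if the windows are
    disjoint (no rows would match)."""
    win_start = enrollment_date + timedelta(days=window_start_days)
    win_end = enrollment_date + timedelta(days=window_end_days)
    start = max(overall_start, win_start)
    end = min(overall_end, win_end)
    return (start, end) if start <= end else None

# Hypothetical daily run: a one-day analysis window relative to an
# enrollment on 2024-12-18 collapses the filter to the single date
# 2024-12-19, even though the overall filter spans 2024-10-31..2024-12-19.
rng = effective_date_range(date(2024, 10, 31), date(2024, 12, 19),
                           date(2024, 12, 18), 1, 1)
print(rng)  # -> a one-day range: 2024-12-19 to 2024-12-19
```

A range like this could then be emitted as `ds.submission_date = '2024-12-19'` (or a tight BETWEEN), which is the form the optimizer was observed to prune correctly.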

The BigQuery optimizer also appears to base its decisions on the size of the source data, so this issue doesn't consistently affect all experiments. One experiment that was affected was default-ui-experiment.

We may need to rethink how this filter should work.

Issue is synchronized with this Jira Task
