
Columnar query engine additional filter optimization opportunities using IDMask #1550

@albertlockett

Description

In #1523 we added the explicit IDMask enum as the return type of AttributeFilterExec::execute. There are a few places where we can use the returned variant to eliminate some work.

  1. In FilterExec::execute, if AttributeFilterExec::execute returns either the None or All variants of the IDMask, we don't need to compute the selection_vector here from the ID column. We can skip over all this:

    let id_col = match get_id_col_from_parent(root_rb, attrs_filter.payload_type())? {
        Some(id_col) => id_col,
        None => {
            // None of the records have any attributes
            return Ok(BooleanArray::new(
                if self.missing_attrs_pass {
                    BooleanBuffer::new_set(root_rb.num_rows())
                } else {
                    BooleanBuffer::new_unset(root_rb.num_rows())
                },
                None,
            ));
        }
    };
    let id_mask = attrs_filter.execute(otap_batch, session_ctx, false)?;
    let mut attrs_selection_vec_builder = BooleanBufferBuilder::new(root_rb.num_rows());
    // we append to the selection vector in contiguous segments rather than doing it 1-by-1
    // for each value, as this is a faster way to build up the BooleanBuffer
    let mut segment_validity = false;
    let mut segment_len = 0usize;
    for index in 0..id_col.len() {
        let row_validity = if id_col.is_valid(index) {
            id_mask.contains(id_col.value(index) as u32)
        } else {
            // attribute does not exist
            self.missing_attrs_pass
        };
        if segment_validity != row_validity {
            if segment_len > 0 {
                attrs_selection_vec_builder.append_n(segment_len, segment_validity);
            }
            segment_validity = row_validity;
            segment_len = 0;
        }
        segment_len += 1;
    }
    // append the last segment
    if segment_len > 0 {
        attrs_selection_vec_builder.append_n(segment_len, segment_validity);
    }
    let attr_selection_vec = BooleanArray::new(attrs_selection_vec_builder.finish(), None);
    selection_vec = Some(match selection_vec {
        // update the result selection_vec to be the intersection of what's already filtered
        // and the attributes filters
        Some(selection_vec) => and(&selection_vec, &attr_selection_vec)?,
        // no predicate was applied to root batch, so we are just filtering by attributes
        None => attr_selection_vec,
    });

  2. In #1514 (Columnar query engine optimization for attribute filtering) we added the optimizer that turns Composite<FilterExec> into Composite<AttributeFilterExec> where that makes sense from a performance perspective. It didn't make sense to do this for Composite::<FilterExec>::Not: to invert a Composite<AttributeFilterExec>, we needed a double scan over the parent ID column to compute the inverted ID mask, so nothing was gained. However, now that it's cheap to create IdMask::NotSome, this optimization might make more sense.
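To make both ideas concrete, here is a rough, std-only sketch. The `IdMask` enum below is a hypothetical stand-in for the real one from #1523 (its actual variants and payload types may differ), `Vec<bool>` stands in for the Arrow `BooleanBuffer`, and `selection_vector` and `invert` are illustrative names, not existing functions. It assumes the same per-row semantics as the loop above: a null ID always resolves to `missing_attrs_pass`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the IDMask enum from #1523.
#[derive(Debug, Clone, PartialEq)]
enum IdMask {
    All,
    None,
    Some(HashSet<u32>),
    NotSome(HashSet<u32>),
}

impl IdMask {
    fn contains(&self, id: u32) -> bool {
        match self {
            IdMask::All => true,
            IdMask::None => false,
            IdMask::Some(ids) => ids.contains(&id),
            IdMask::NotSome(ids) => !ids.contains(&id),
        }
    }

    /// Item 2: inverting only swaps variants, with no scan over the parent ID column.
    fn invert(self) -> IdMask {
        match self {
            IdMask::All => IdMask::None,
            IdMask::None => IdMask::All,
            IdMask::Some(ids) => IdMask::NotSome(ids),
            IdMask::NotSome(ids) => IdMask::Some(ids),
        }
    }
}

/// Item 1: for All / None the result depends only on whether each ID is null,
/// so no per-row mask lookup is needed.
fn selection_vector(
    id_col: &[Option<u32>],
    mask: &IdMask,
    missing_attrs_pass: bool,
) -> Vec<bool> {
    match mask {
        // every row that has attributes passes; null rows follow missing_attrs_pass
        IdMask::All => id_col
            .iter()
            .map(|id| id.is_some() || missing_attrs_pass)
            .collect(),
        // no row that has attributes passes; null rows follow missing_attrs_pass
        IdMask::None => id_col
            .iter()
            .map(|id| id.is_none() && missing_attrs_pass)
            .collect(),
        // general path: same per-row logic as the loop in FilterExec::execute
        _ => id_col
            .iter()
            .map(|id| match id {
                Some(id) => mask.contains(*id),
                None => missing_attrs_pass,
            })
            .collect(),
    }
}
```

In the real code the All / None branches would presumably be derived from the ID column's validity bitmap (or be a constant buffer when the column has no nulls) instead of iterating row by row; the sketch only shows that the mask lookup itself can be skipped.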


Labels: bug (Something isn't working), query-engine (Query Engine / Transform related tasks), query-engine-columnar (Columnar query engine which uses DataFusion to process OTAP Batches)
