-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add ScalarValue::RunEndEncoded variant
#19895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| /// (run-ends field, value field, value) | ||
| RunEndEncoded(FieldRef, FieldRef, Box<ScalarValue>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mimicking the arrow type where it stores fields:
I tried initially only storing the index DataType and the ScalarValue value, but figured it would be better to try be as accurate as possible 🤔
| | DataType::Decimal32(_, _) | ||
| | DataType::Decimal64(_, _) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Little fix since we were missing these
|
|
||
| // Unsupported types for now | ||
| _ => { | ||
| DataType::ListView(_) | DataType::LargeListView(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just getting rid of the catch-all to be more rigorous
| _ => unreachable!("Invalid dictionary keys type: {}", key_type), | ||
| } | ||
| } | ||
| DataType::RunEndEncoded(run_ends_field, value_field) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're building the runarray efficiently here, as unlike dictionary above which would require keeping a hashmap of values to build an efficient dictionary array, run arrays are simpler in that we just need to track when a new run starts.
Most of the verbosity here is related to destructuring input ScalarValues and ensuring we have consistent types from them.
| let run_ends = PrimitiveArray::<R>::from_iter_values(run_ends); | ||
| let values = ScalarValue::iter_to_array(value_scalars)?; | ||
|
|
||
| // Using ArrayDataBuilder so we can maintain the fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the only way to construct runarrays with fields we want, since try_new creates the fields for us:
| ); | ||
| let err = scalar.eq_array(&run_array, 1).unwrap_err(); | ||
| let expected = "Internal error: could not cast array of type Float32 to arrow_array::array::primitive_array::PrimitiveArray<arrow_array::types::Float64Type>"; | ||
| assert!(err.to_string().starts_with(expected)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to use starts_with since backtrace feature can affect the error message, so direct equality can succeed for cargo test but fail in CI
| return Err(Error::General( | ||
| "Proto serialization error: The RunEndEncoded data type is not yet supported".to_owned() | ||
| )) | ||
| DataType::Decimal32(precision, scale) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only change here is for RunEndEncoded; for some reason other formatting changes were applied for the other arms here
|
FYI @brancz |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't say I reviewed every line of this PR carefully but I did read them all and they look structurally good to me -- thank you for pushing this along @Jefffrey
| } | ||
| (Dictionary(_, _), _) => None, | ||
| (RunEndEncoded(rf1, vf1, v1), RunEndEncoded(rf2, vf2, v2)) => { | ||
| // Don't compare if the run ends fields don't match (it is effectively a different datatype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is exactly what we want. The run arrays could be logically identical, but their index types might differ. I don't think we'd want the scalar not to equal in that case. I realize that's not what we have for dictionaries either, but is that really the intention of scalars? My understanding has always been that the integer width of codes should be irrelevant from a logical equality perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we want that logic as this level; for example if we fix PartialOrd here to compare REE/Dicts based on inner values only, then we'd probably have to do the same for PartialEq right? But then we run into an issue with Hash not being consistent unless we also fix Hash 🤔
I think it might be better to leave these as is, and if we want proper comparison it would make more sense to do at a high level (e.g. via type coercion)
Which issue does this PR close?
Rationale for this change
Support RunEndEncoded scalar values, similar to how we support for Dictionary.
What changes are included in this PR?
ScalarValue::RunEndEncodedenum variantScalarValue::new_defaultto supportDecimal32andDecimal64ScalarValuemessage andArrowTypemessageAre these changes tested?
Added tests.
Are there any user-facing changes?
New variant for
ScalarValueProtobuf changes to support RunEndEncoded type