-
Notifications
You must be signed in to change notification settings - Fork 1.1k
arrow-cast: support packing to Dictionary(_, Utf8View/BinaryView) #9220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Fix cast_to_dictionary failing for view value types by packing to Utf8/Binary and recasting; add regression tests for view inputs and key overflow.
arrow-cast/src/cast/dictionary.rs
Outdated
| }; | ||
|
|
||
| let dict_base = cast_to_dictionary::<K>(array, &base_value_type, cast_options)?; | ||
| cast_with_options( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this equal to dictionary_cast::<K>(dict_base.as_ref(), &DataType::Dictionary(Box::new(K::Data_TYPE), Box::new(DataType::Utf8View), cast_options))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, effectively equivalent. cast_with_options(dict, Dictionary(..)) hits the (Dictionary(..), _) arm, which calls dictionary_cast::<K>, then routes to dictionary_to_dictionary_cast since the target is also Dictionary(..). Same code path as calling dictionary_cast::<K> directly.
Only difference: cast_with_options short-circuits when from == to (:769-772), which a direct dictionary_cast call skips. Doesn't affect actual dict to dict conversions.
Happy to switch to explicit dictionary_cast::<K>(...) if you prefer, makes the dict-to-dict intent clearer. Let me know.
Avoid redundant cast dispatch when packing to Dictionary(_, Utf8View/BinaryView). Strengthen dictionary view cast tests to assert keys/values directly and add regressions distinguishing None from Some("null") for both String and Utf8View inputs.
klion26
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks for the contribution.
|
Thanks @ethan-tyler and @klion26
Why wouldn't we just pack to Utf8View directly? |
Good question @alamb - Packing directly to Utf8View would require a dictionary builder for view types (dedup and incremental construction over a block/buffer-indexed view layout, plus view invariants like prefix/offset correctness). The two step path reuses the existing Dictionary(K, Utf8/Binary) packing, then reuses the existing cast machinery to produce Dictionary(K, Utf8View/BinaryView). When the dictionary values are Utf8/Binary and offsets fit, the cast can build views over the existing values buffer via append_block/view_from_dict_values (no extra value buffer copy). For Large* / oversized values, it can fall back to the general cast path (potentially copying into view blocks) instead of trying to force a zero copy view representation. |
I think you can use The one thing that might be tricky is knowing what the pre-existing index was Specifically: https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html#method.with_deduplicate_strings I am not sure what you mean by "invariants like prefix/offset correctness"
|
You're right that Packing directly to For this PR I kept the two step path: pack to Also, "prefix/offset correctness" was poor wording, I just meant constructing valid |
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good approach, since we're adding some missing functionality; we can always try optimize in later PRs.
Main concern is the verbosity of the tests, I feel we can condense some of the test code and remove some unnecessary tests here.
arrow-cast/src/cast/mod.rs
Outdated
| let dict_array = cast_array | ||
| .as_any() | ||
| .downcast_ref::<DictionaryArray<UInt16Type>>() | ||
| .unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let dict_array = cast_array | |
| .as_any() | |
| .downcast_ref::<DictionaryArray<UInt16Type>>() | |
| .unwrap(); | |
| let dict_array = cast_array.as_dictionary::<UInt16Type>(); |
Less verbose casting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - Switched to cast_array.as_dictionary::<UInt16Type>() in the dict view tests to remove the manual downcast boilerplate
|
|
||
| let cast_type = | ||
| DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8View)); | ||
| let cast_array = cast(&array, &cast_type).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also assert can_cast_types alongside these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added assert!(can_cast_types(..)) the successful casts (including the key overflow case)
arrow-cast/src/cast/mod.rs
Outdated
| let array = StringArray::from(vec![Some("one"), None, Some("null"), Some("one")]); | ||
| let view = cast(&array, &DataType::Utf8View).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well just construct the view array directly instead of casting it from a string array? Since we're testing dict casting
Actually how does this test differ from the above one? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the view input tests to construct StringViewArray/BinaryViewArray directly, they strictly test View → Dictionary. To avoid duplication the null vs literal‑null case stays only in the StringArray -> Dictionary(Utf8View) test .
Fix arrow-cast failing to cast into Dictionary(K, Utf8View) and Dictionary(K, BinaryView) even though can_cast_types returns true.
Which issue does this PR close?
Rationale for this change
Downstream pipelines may require view-typed schemas (e.g. schema_force_view_types) while still dictionary-encoding columns for memory efficiency. Even with view types, dictionary encoding can reduce per-row memory (1–4 byte keys vs 16-byte views) for low-cardinality columns. Today there's no way to materialize Dictionary(K, Utf8View/BinaryView) via arrow-cast.
What changes are included in this PR?
Add Utf8View and BinaryView support in cast_to_dictionary using a two-step pack to recast approach:
Are these changes tested?
Are there any user-facing changes?
No