Skip to content

Conversation

@ethan-tyler
Copy link

Fix arrow-cast failing to cast into Dictionary(K, Utf8View) and Dictionary(K, BinaryView) even though can_cast_types returns true.

Which issue does this PR close?

Rationale for this change

Downstream pipelines may require view-typed schemas (e.g. schema_force_view_types) while still dictionary-encoding columns for memory efficiency. Even with view types, dictionary encoding can reduce per-row memory (1–4 byte keys vs 16-byte views) for low-cardinality columns. Today there's no way to materialize Dictionary(K, Utf8View/BinaryView) via arrow-cast.

What changes are included in this PR?

Add Utf8View and BinaryView support in cast_to_dictionary using a two-step pack to recast approach:

  • Pack to Dictionary(K, Utf8/Binary)
  • Cast dictionary values to Utf8View/BinaryView via existing cast machinery

Are these changes tested?

  • StringArray/StringViewArray to Dictionary(UInt16, Utf8View) (including sliced inputs)
  • BinaryArray/BinaryViewArray to Dictionary(UInt16, BinaryView) (including sliced inputs)
  • Key overflow for Dictionary(UInt8, Utf8View)
  • Empty and all-null inputs

Are there any user-facing changes?

No

Fix cast_to_dictionary failing for view value types by packing to Utf8/Binary and recasting; add regression tests for view inputs and key overflow.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 19, 2026
};

let dict_base = cast_to_dictionary::<K>(array, &base_value_type, cast_options)?;
cast_with_options(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this equal to dictionary_cast::<K>(dict_base.as_ref(), &DataType::Dictionary(Box::new(K::Data_TYPE), Box::new(DataType::Utf8View), cast_options))?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, effectively equivalent. cast_with_options(dict, Dictionary(..)) hits the (Dictionary(..), _) arm, which calls dictionary_cast::<K>, then routes to dictionary_to_dictionary_cast since the target is also Dictionary(..). Same code path as calling dictionary_cast::<K> directly.

Only difference: cast_with_options short-circuits when from == to (:769-772), which a direct dictionary_cast call skips. Doesn't affect actual dict to dict conversions.

Happy to switch to explicit dictionary_cast::<K>(...) if you prefer, makes the dict-to-dict intent clearer. Let me know.

ethan-tyler and others added 3 commits January 20, 2026 15:33
Avoid redundant cast dispatch when packing to Dictionary(_, Utf8View/BinaryView). Strengthen dictionary view cast tests to assert keys/values directly and add regressions distinguishing None from Some("null") for both String and Utf8View inputs.
Copy link
Member

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks for the contribution.

@alamb
Copy link
Contributor

alamb commented Jan 25, 2026

Thanks @ethan-tyler and @klion26

Add Utf8View and BinaryView support in cast_to_dictionary using a two-step pack to recast approach:

Pack to Dictionary(K, Utf8/Binary)
Cast dictionary values to Utf8View/BinaryView via existing cast machinery

Why wouldn't we just pack to Utf8View directly?

@ethan-tyler
Copy link
Author

Why wouldn't we just pack to Utf8View directly?

Good question @alamb - Packing directly to Utf8View would require a dictionary builder for view types (dedup and incremental construction over a block/buffer-indexed view layout, plus view invariants like prefix/offset correctness).

The two step path reuses the existing Dictionary(K, Utf8/Binary) packing, then reuses the existing cast machinery to produce Dictionary(K, Utf8View/BinaryView). When the dictionary values are Utf8/Binary and offsets fit, the cast can build views over the existing values buffer via append_block/view_from_dict_values (no extra value buffer copy).

For Large* / oversized values, it can fall back to the general cast path (potentially copying into view blocks) instead of trying to force a zero copy view representation.

@alamb
Copy link
Contributor

alamb commented Jan 26, 2026

Why wouldn't we just pack to Utf8View directly?

Good question @alamb - Packing directly to Utf8View would require a dictionary builder for view types (dedup and incremental construction over a block/buffer-indexed view layout, plus view invariants like prefix/offset correctness).

I think you can use StringViewBuilder to do this: https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html

The one thing that might be tricky is knowing what the pre-existing index was

Specifically: https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html#method.with_deduplicate_strings

I am not sure what you mean by "invariants like prefix/offset correctness"

The two step path reuses the existing Dictionary(K, Utf8/Binary) packing, then reuses the existing cast machinery to produce Dictionary(K, Utf8View/BinaryView). When the dictionary values are Utf8/Binary and offsets fit, the cast can build views over the existing values buffer via append_block/view_from_dict_values (no extra value buffer copy).

For Large* / oversized values, it can fall back to the general cast path (potentially copying into view blocks) instead of trying to force a zero copy view representation.

@ethan-tyler
Copy link
Author

I think you can use StringViewBuilder to do this: https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html

The one thing that might be tricky is knowing what the pre-existing index was

Specifically: https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html#method.with_deduplicate_strings

I am not sure what you mean by "invariants like prefix/offset correctness"

You're right that StringViewBuilder supports with_deduplicate_strings, but that's deduplication of the backing storage for a view array. It doesn't provide the value→dictionary-index ("get_or_insert") we need to build the keys, and the API doesn't expose which prior entry was reused.

Packing directly to Dictionary<K, Utf8View> would therefore need a view dictionary builder (or a StringViewBuilder API that returns the existing/new index without appending duplicates).

For this PR I kept the two step path: pack to Dictionary<K, Utf8/Binary> (reusing GenericByteDictionaryBuilder), then cast to view values. When the dictionary values are Utf8/Binary and offsets fit, dictionary_cast can use view_from_dict_values (append_block + views) so the values buffer is reused zero-copy. Large* / oversized falls back to the general cast path.

Also, "prefix/offset correctness" was poor wording, I just meant constructing valid ByteView { prefix, buffer_index, offset, length } descriptors, which the existing cast path handles.

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good approach, since we're adding some missing functionality; we can always try optimize in later PRs.

Main concern is the verbosity of the tests, I feel we can condense some of the test code and remove some unnecessary tests here.

Comment on lines 5730 to 5733
let dict_array = cast_array
.as_any()
.downcast_ref::<DictionaryArray<UInt16Type>>()
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let dict_array = cast_array
.as_any()
.downcast_ref::<DictionaryArray<UInt16Type>>()
.unwrap();
let dict_array = cast_array.as_dictionary::<UInt16Type>();

Less verbose casting

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - Switched to cast_array.as_dictionary::<UInt16Type>() in the dict view tests to remove the manual downcast boilerplate


let cast_type =
DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8View));
let cast_array = cast(&array, &cast_type).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also assert can_cast_types alongside these tests?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added assert!(can_cast_types(..)) the successful casts (including the key overflow case)

Comment on lines 5775 to 5776
let array = StringArray::from(vec![Some("one"), None, Some("null"), Some("one")]);
let view = cast(&array, &DataType::Utf8View).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well just construct the view array directly instead of casting it from a string array? Since we're testing dict casting

Actually how does this test differ from the above one? 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the view input tests to construct StringViewArray/BinaryViewArray directly, they strictly test View → Dictionary. To avoid duplication the null vs literal‑null case stays only in the StringArray -> Dictionary(Utf8View) test .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[arrow-cast] can_cast_types allows Dictionary(_, Utf8View/BinaryView) but cast fails

4 participants