Skip to content

Conversation

@scovich
Copy link
Contributor

@scovich scovich commented Jan 24, 2026

Which issue does this PR close?

N/A - pathfinding related to

Rationale for this change

NOTE: Stacked on top of the other PR, so only look at the last four commits of this PR!

Exploring #9021 (comment)

What changes are included in this PR?

Enhance the customization support to allow things like:

  • Turn incompatible types to NULL instead of error
  • Special parsing behavior for complex types when presented with a string
  • Ability to apply specific custom decoders to specific fields based on field metadata
  • Ability to apply specific custom decoders to specific fields in a struct based on their path through the schema

Most of the changes are in a new test file that demos them. Important supporting changes are made to arrow-json.

If we decide to pursue this direction, the PR should be split into at least three separate PR, in addition to the original PR this one builds on.

Are these changes tested?

Yes, new unit tests

Are there any user-facing changes?

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet-variant parquet-variant* crates labels Jan 24, 2026
@scovich
Copy link
Contributor Author

scovich commented Jan 24, 2026

@debugmiller and @alamb, I'd love your thoughts on this pathfinding? It builds on top of @debugmiller excellent starting point and would merge mostly separately (afterward).

The main consideration is that it would impact the public API we create, so it would be nice to get that part right on the first try.

Copy link
Contributor Author

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review.

Also, I just realized there's a missing test, to validate recursive propagation of decoder factories. I'll try to add that soon.

impl<O: OffsetSizeTrait> ListArrayDecoder<O> {
pub fn new(
data_type: DataType,
data_type: &DataType,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed all the decoder paths to pass borrowed refs because it was leading to a lot of clone calls otherwise.

Plus, one of the example decoders relies on the datatype instances to have stable memory locations during the recursive make_decoder call.

is_nullable: bool,
struct_mode: StructMode,
decoder_factory: Option<Arc<dyn DecoderFactory>>,
decoder_factory: Option<&dyn DecoderFactory>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed because this change relies on forwarding decoder factories from a trait provided method, and &Arc<Self> receiver is not dyn-compatible.

Comment on lines 53 to 54
Some(field.clone()),
field.data_type().clone(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one of the redundant clones I mentioned earlier, eliminated

Comment on lines 52 to 57
field.data_type(),
field.is_nullable(),
field.metadata(),
coerce_primitive,
strict_mode,
field.is_nullable(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reordered to match the Field::new

}
}

impl<'a> Hash for DataTypeIdentity<'a> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
impl<'a> Hash for DataTypeIdentity<'a> {
impl Hash for DataTypeIdentity<'_> {

Comment on lines +587 to +589
enum DataTypeIdentity<'a> {
FieldRef(FieldRef),
DataType(&'a DataType),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took a long time to figure out this approach. It allows creating a hash map whose keys compare by pointer instead of by value, while still respecting lifetimes -- pointers are only created very briefly during actual hashing and comparisons.

Comment on lines +616 to +621
/// A factory that routes to custom decoders based on specific field paths.
///
/// This allows fine-grained control: customize specific fields by name without
/// polluting the schema with metadata or affecting all fields of a given type.
#[derive(Debug)]
struct PathBasedDecoderFactory {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really powerful factory -- it allows individually customizing every single field in a schema, without altering the schema itself.

Relies on the fact that the updated decoder code passes the exact same DataType reference to the decoder factory's constructor as will eventually be passed to the make_custom_decoder method calls. Once the decoders have actually been created, we no longer care what happens to the data type.

// Walk the fields and associate DataTypeIdentity::FieldRef with factory for O(1) lookup
let mut routes = HashMap::new();
for (path, factory) in path_routes {
let parts: Vec<&str> = path.split('.').collect();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A real implementation would use a more robust mechanism for expressing nested column names, but I'm not aware of anything already in arrow-rs that would fit. The variant paths used by e.g. variant_get are the closest thing I could find, but they also support array indexing, which is not meaningful here. And they're anyway in the wrong crate.

/// Recursively find a Field by following a path of field names.
fn find_field_by_path(fields: &Fields, path: &[&str]) -> Option<FieldRef> {
let (first, rest) = path.split_first()?;
let field = fields.iter().find(|f| f.name() == *first)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: very annoying that Fields doesn't have an efficient way to find a field by name...

@github-actions github-actions bot removed the arrow-flight Changes to the arrow-flight crate label Jan 26, 2026

impl DecoderFactory for AlwaysNullStringArrayDecoderFactory {
fn make_custom_decoder(
&self,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of the idea that the factory gets registered for each column (rather than each type)? #9021 (comment)

Copy link
Contributor Author

@scovich scovich Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After playing around in this PR, I think we need both path-based and type-based factories:

  • Type-based -- for extension types, generally lenient parsing preferences, etc. The behavior of such decoders isn't related to its location in the schema, and expressing them path-based would require traversing the schema to find them would be annoying.
  • Path-based -- for "quirks-mode" parsing of specific badly-bahaved fields, flexible parsing of free-form fields while leaving the rest of the schema strongly typed, etc.

NOTE: Although one could use variant for a lot of these use cases, it adds an extra layer of indirection and mostly pushes off the problem to whoever consumes the resulting variant column -- if they want to do anything fancy, they'll just end up dealing with e.g. Variant::String instead of TapeElement::String. Not an obvious win. Granted, variant_get adds a lot of capability flexibility and error tolerance (e.g. type mismatches just return NULL instead of errors), so it might be worth using variant instead of custom decoders in simple path-based cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me

alamb pushed a commit that referenced this pull request Jan 27, 2026
# Which issue does this PR close?

- Part of #8987

# Rationale for this change

Today's json decoder helper, `make_decoder`, takes an owned data type
whose components are cloned at every level during the recursive decoder
initialization process. This breaks pointer stability of the resulting
`DataType` instances that a custom JSON decoder factory would see, vs.
those of the schema it and the reader builder were initialized with.

The lack of pointer stability prevents users from creating "path based"
decoder factories, that are able to customize decoder behavior based not
only on type, but also on the field's path in the schema. See the
`PathBasedDecoderFactory` in arrow-json/tests/custom_decoder_tests.rs of
#9259, for an example.

# What changes are included in this PR?

By passing `&DataType` instead, we change code like this:
```rust
let decoder = make_decoder(field.data_type().clone(), ...);
Ok(Self {
    data_type,
    decoder,
      ...
})
```

to this:
```rust
let child_decoder = make_decoder(field.data_type(), ...);
Ok(Self {
    data_type: data_type.clone(),
    decoder,
      ...
})
```

Result: Every call to `make_decoder` receives a reference to the actual
original data type from the builder's input schema. The final decoder
`Self` is unchanged -- it already received a clone and continues to do
so.

NOTE: There is one additional clone of the top-level `DataType::Struct`
we create for normal (`!is_field`) builders. But that's a cheap arc
clone of a `Fields` member.

# Are these changes tested?

Yes, existing unit tests validate the change.

# Are there any user-facing changes?

No. All functions and data types involved are private -- the array
decoders are marked `pub` but are defined in a private mod with no
public re-export that would make them available outside the crate.
@debugmiller
Copy link

From my end this looks very good. Builds nicely on top of #9021 and covers some of the sharp corners.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants