[WIP] Explore fancier custom JSON decoding #9259

Draft

scovich wants to merge 16 commits into apache:main from scovich:decoder-factory-json-extras

+1,539 −141

Contributor

scovich commented Jan 24, 2026 (edited)

Which issue does this PR close?

N/A - pathfinding related to "Allow extensions to arrow-json decoder and include an extension for variant" (#9021)

Rationale for this change

NOTE: Stacked on top of the other PR, so only look at the last four commits of this PR!

Exploring #9021 (comment)

What changes are included in this PR?

Enhance the customization support to allow things like:

- Turn incompatible types to NULL instead of error
- Special parsing behavior for complex types when presented with a string
- Ability to apply specific custom decoders to specific fields based on field metadata
- Ability to apply specific custom decoders to specific fields in a struct based on their path through the schema
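
As a rough illustration of the first item, here is a minimal std-only sketch of "NULL instead of error". `JsonValue` and `decode_i64_column` are hypothetical stand-ins invented for this example; they are not arrow-json types or functions, and the real decoders operate on the reader's internal tape rather than on values like these.

```rust
/// Hypothetical stand-in for a parsed JSON value; not an arrow-json type.
#[derive(Debug, Clone, PartialEq)]
enum JsonValue {
    Null,
    Number(i64),
    String(String),
}

/// Decodes a column of i64 values from the stand-in JSON values.
///
/// With `lenient == false`, a type mismatch is an error (strict behavior).
/// With `lenient == true`, a mismatch becomes `None` (NULL) instead.
fn decode_i64_column(values: &[JsonValue], lenient: bool) -> Result<Vec<Option<i64>>, String> {
    values
        .iter()
        .enumerate()
        .map(|(row, value)| match value {
            JsonValue::Null => Ok(None),
            JsonValue::Number(n) => Ok(Some(*n)),
            // Incompatible value: swallow it in lenient mode.
            _ if lenient => Ok(None),
            // Incompatible value: report it in strict mode.
            other => Err(format!("row {row}: expected a number, got {other:?}")),
        })
        .collect()
}
```

In this sketch the lenient/strict switch is a plain flag; the PR instead expresses such preferences through custom decoder factories.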

Most of the changes are in a new test file that demos them. Important supporting changes are made to arrow-json.

If we decide to pursue this direction, the PR should be split into at least three separate PRs, in addition to the original PR this one builds on.

Are these changes tested?

Yes, new unit tests

Are there any user-facing changes?

cht42 and others added 13 commits

December 18, 2025 16:50


          Add custom decoder in arrow-json

21581ea


          include field

5fe3642


          change export

53bffa2


          cleanup

66b7512


          more obvious

6aeb892


          fix doctest and more exhaustive check

93ad08c


          more lint fixes

1fb7c47


          Merge branch 'main' into decoder-factory-json

e48e1c9


          Merge remote-tracking branch 'oss/main' into decoder-factory-json-extras

41a0dc2


          move extension type construction logic out of Field

1f50f34


          checkpoint - reworked signatures

6290db9


          checkpoint - add metadata stripping

06fb5f5


          Ditch annotations. Got it working

f7b36d7

github-actions bot added arrow arrow-flight parquet-variant labels

Contributor Author

scovich commented Jan 24, 2026 (edited)

@debugmiller and @alamb, I'd love your thoughts on this pathfinding. It builds on top of @debugmiller's excellent starting point and would merge mostly separately (afterward).

The main consideration is that it would impact the public API we create, so it would be nice to get that part right on the first try.

scovich commented

Contributor Author

scovich left a comment

Self-review.

Also, I just realized there's a missing test, to validate recursive propagation of decoder factories. I'll try to add that soon.

arrow-flight/tests/flight_sql_client_cli.rs Outdated (resolved)

arrow-json/src/reader/list_array.rs

    
      impl<O: OffsetSizeTrait> ListArrayDecoder<O> {
          pub fn new(
    -         data_type: DataType,
    +         data_type: &DataType,

Contributor Author

scovich Jan 24, 2026

I changed all the decoder paths to pass borrowed refs because it was leading to a lot of clone calls otherwise.

Plus, one of the example decoders relies on the datatype instances to have stable memory locations during the recursive make_decoder call.
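
To make the "stable memory locations" point concrete, here is a small std-only sketch. `Node`, `visit_borrowed`, and `visit_owned` are hypothetical names made up for illustration; `Node` merely stands in for a nested data type and is not arrow's `DataType`. The borrowed recursion sees the addresses of the original tree, while the owned recursion only ever sees temporaries.

```rust
/// Hypothetical stand-in for a nested data type; not arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    children: Vec<Node>,
}

/// Visits the tree by reference and records the address of every node seen.
/// The recorded addresses are those of the caller's original nodes.
fn visit_borrowed(node: &Node, seen: &mut Vec<*const Node>) {
    seen.push(node as *const Node);
    for child in &node.children {
        visit_borrowed(child, seen);
    }
}

/// Visits the tree by value, cloning each child before recursing, the way an
/// owned-argument recursion would. The recorded addresses belong to
/// short-lived copies, so they never match the caller's original nodes.
/// The pointers are only compared later, never dereferenced.
fn visit_owned(node: Node, seen: &mut Vec<*const Node>) {
    seen.push(&node as *const Node);
    for child in &node.children {
        visit_owned(child.clone(), seen);
    }
}
```

Comparing the recorded addresses against the original tree with `std::ptr::eq` shows that only the borrowed traversal preserves identity.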

arrow-json/src/reader/list_array.rs Outdated

    
              is_nullable: bool,
              struct_mode: StructMode,
    -         decoder_factory: Option<Arc<dyn DecoderFactory>>,
    +         decoder_factory: Option<&dyn DecoderFactory>,

Contributor Author

scovich Jan 24, 2026

Needed because this change relies on forwarding decoder factories from a trait provided method, and &Arc<Self> receiver is not dyn-compatible.
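
A simplified sketch of the borrowed-factory shape, under stated assumptions: `Ty`, `Factory`, `LenientText`, and `build` are hypothetical names for this example only and do not mirror the real `DecoderFactory` signature. It shows that a plain `&self` receiver keeps the trait usable as `dyn Factory`, and that `Option<&dyn Factory>` is `Copy`, so the same borrowed factory can be forwarded to child decoders without any `Arc` clones.

```rust
/// Hypothetical, heavily simplified type tree; not arrow's `DataType`.
enum Ty {
    Int,
    Text,
    List(Box<Ty>),
}

/// Hypothetical factory trait. The plain `&self` receiver keeps it usable as
/// `dyn Factory`.
trait Factory {
    /// Returns a custom decoder description for `ty`, or `None` to fall back
    /// to the default decoder.
    fn make_custom(&self, ty: &Ty) -> Option<String>;
}

/// Example factory that only overrides text columns.
struct LenientText;

impl Factory for LenientText {
    fn make_custom(&self, ty: &Ty) -> Option<String> {
        match ty {
            Ty::Text => Some("lenient-text".to_string()),
            _ => None,
        }
    }
}

/// Builds a decoder description recursively. `Option<&dyn Factory>` is `Copy`,
/// so the same borrowed factory is forwarded to the child unchanged.
fn build(ty: &Ty, factory: Option<&dyn Factory>) -> String {
    if let Some(custom) = factory.and_then(|f| f.make_custom(ty)) {
        return custom;
    }
    match ty {
        Ty::Int => "int".to_string(),
        Ty::Text => "text".to_string(),
        Ty::List(item) => format!("list<{}>", build(item, factory)),
    }
}
```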

arrow-json/src/reader/list_array.rs Outdated

Comment on lines 53 to 54

    
                          Some(field.clone()),

                          field.data_type().clone(),

Contributor Author

scovich Jan 24, 2026

one of the redundant clones I mentioned earlier, eliminated

arrow-json/src/reader/list_array.rs Outdated

Comment on lines 52 to 57

    
                          field.data_type(),

                          field.is_nullable(),

                          field.metadata(),

                          coerce_primitive,

                          strict_mode,

                          field.is_nullable(),

Contributor Author

scovich Jan 24, 2026

reordered to match the Field::new

arrow-json/tests/custom_decoder_tests.rs

    
                  }

              }

              impl<'a> Hash for DataTypeIdentity<'a> {

Contributor Author

scovich Jan 24, 2026

Suggested change

    - impl<'a> Hash for DataTypeIdentity<'a> {
    + impl Hash for DataTypeIdentity<'_> {

arrow-json/tests/custom_decoder_tests.rs

Comment on lines +587 to +589

    
              enum DataTypeIdentity<'a> {

                  FieldRef(FieldRef),

                  DataType(&'a DataType),

Contributor Author

scovich Jan 24, 2026

It took a long time to figure out this approach. It allows creating a hash map whose keys compare by pointer instead of by value, while still respecting lifetimes -- pointers are only created very briefly during actual hashing and comparisons.
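
For readers unfamiliar with the technique, here is a generic std-only sketch of map keys that compare by pointer. It is not the PR's `DataTypeIdentity`; `ByAddress` and `identity_lookup_demo` are hypothetical names for this example. The key holds an ordinary borrow, so lifetimes are still checked, and a raw pointer exists only inside `eq` and `hash`.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical key wrapper: two keys are equal only when they borrow the
/// very same object, regardless of the value stored in it.
struct ByAddress<'a, T>(&'a T);

impl<T> PartialEq for ByAddress<'_, T> {
    fn eq(&self, other: &Self) -> bool {
        // The raw pointers exist only for the duration of this comparison.
        std::ptr::eq(self.0, other.0)
    }
}

impl<T> Eq for ByAddress<'_, T> {}

impl<T> Hash for ByAddress<'_, T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the address, not the pointee, to stay consistent with `eq`.
        std::ptr::hash(self.0, state);
    }
}

/// Two equal-valued strings are distinct keys; only the original reference
/// finds the entry.
fn identity_lookup_demo() -> (Option<u32>, Option<u32>) {
    let a = String::from("same");
    let b = String::from("same");
    let mut map = HashMap::new();
    map.insert(ByAddress(&a), 1);
    (
        map.get(&ByAddress(&a)).copied(),
        map.get(&ByAddress(&b)).copied(),
    )
}
```

Calling `identity_lookup_demo()` returns `(Some(1), None)`: `a` and `b` hold equal values but live at different addresses.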

arrow-json/tests/custom_decoder_tests.rs

Comment on lines +616 to +621

    
              /// A factory that routes to custom decoders based on specific field paths.

              ///

              /// This allows fine-grained control: customize specific fields by name without

              /// polluting the schema with metadata or affecting all fields of a given type.

              #[derive(Debug)]

              struct PathBasedDecoderFactory {

Contributor Author

scovich Jan 24, 2026

This is a really powerful factory -- it allows individually customizing every single field in a schema, without altering the schema itself.

Relies on the fact that the updated decoder code passes the exact same DataType reference to the decoder factory's constructor as will eventually be passed to the make_custom_decoder method calls. Once the decoders have actually been created, we no longer care what happens to the data type.

arrow-json/tests/custom_decoder_tests.rs

    
                      // Walk the fields and associate DataTypeIdentity::FieldRef with factory for O(1) lookup

                      let mut routes = HashMap::new();

                      for (path, factory) in path_routes {

                          let parts: Vec<&str> = path.split('.').collect();

Contributor Author

scovich Jan 24, 2026

A real implementation would use a more robust mechanism for expressing nested column names, but I'm not aware of anything already in arrow-rs that would fit. The variant paths used by e.g. variant_get are the closest thing I could find, but they also support array indexing, which is not meaningful here. And they're in the wrong crate anyway.

arrow-json/tests/custom_decoder_tests.rs

    
                  /// Recursively find a Field by following a path of field names.

                  fn find_field_by_path(fields: &Fields, path: &[&str]) -> Option<FieldRef> {

                      let (first, rest) = path.split_first()?;

                      let field = fields.iter().find(|f| f.name() == *first)?;

Contributor Author

scovich Jan 24, 2026

aside: very annoying that Fields doesn't have an efficient way to find a field by name...
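
A self-contained sketch of the full recursion, for reference. The `Field` struct here is a hypothetical stand-in (a name plus child fields), not arrow's `Field` or `Fields`, and the `find_field` wrapper assumes field names contain no dots. Each step is the same linear scan by name shown in the excerpt above.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for arrow's `Field`; a field is either a leaf or a
/// struct with child fields.
#[derive(Debug)]
struct Field {
    name: String,
    children: Vec<Arc<Field>>,
}

/// Recursively finds a field by following a path of field names.
/// Each step is a linear scan over the siblings.
fn find_field_by_path(fields: &[Arc<Field>], path: &[&str]) -> Option<Arc<Field>> {
    let (first, rest) = path.split_first()?;
    let field = fields.iter().find(|f| f.name == *first)?;
    if rest.is_empty() {
        // Last path component: hand back a shared handle to this exact field.
        Some(Arc::clone(field))
    } else {
        // More components remain: descend into the children.
        find_field_by_path(&field.children, rest)
    }
}

/// Convenience wrapper that splits a dotted path such as "a.b.c".
/// Assumes field names contain no dots.
fn find_field(fields: &[Arc<Field>], dotted: &str) -> Option<Arc<Field>> {
    let parts: Vec<&str> = dotted.split('.').collect();
    find_field_by_path(fields, &parts)
}
```

Returning a clone of the `Arc` rather than a copy of the field keeps the result pointer-identical to the field in the schema, which is what an identity-keyed route table needs.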


          Fix plumbing for nested factories

a5cad3f

scovich mentioned this pull request

Allow extensions to arrow-json decoder and include an extension for variant #9021

Open

scovich added 2 commits

January 26, 2026 09:12


          got it all working nice!

41583ef


          tidy up

1994cd5

github-actions bot removed the arrow-flight label

scovich mentioned this pull request

make_decoder accepts borrowed DataType instead of owned #9270

Merged

alamb reviewed

arrow-json/src/reader/mod.rs

    
                      impl DecoderFactory for AlwaysNullStringArrayDecoderFactory {

                          fn make_custom_decoder(

                              &self,

Contributor

alamb Jan 26, 2026

What do you think of the idea that the factory gets registered for each column (rather than each type)? #9021 (comment)

Contributor Author

scovich Jan 27, 2026 (edited)

After playing around in this PR, I think we need both path-based and type-based factories:

- Type-based -- for extension types, generally lenient parsing preferences, etc. The behavior of such decoders isn't related to their location in the schema, and expressing them as path-based routes would require an annoying traversal of the schema to find every matching field.
- Path-based -- for "quirks-mode" parsing of specific badly behaved fields, flexible parsing of free-form fields while leaving the rest of the schema strongly typed, etc.

NOTE: Although one could use variant for a lot of these use cases, it adds an extra layer of indirection and mostly pushes off the problem to whoever consumes the resulting variant column -- if they want to do anything fancy, they'll just end up dealing with e.g. Variant::String instead of TapeElement::String. Not an obvious win. Granted, variant_get adds a lot of capability, flexibility, and error tolerance (e.g. type mismatches just return NULL instead of errors), so it might be worth using variant instead of custom decoders in simple path-based cases.
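
One way the two kinds of factory could coexist, sketched with std-only stand-ins. `Kind`, `Router`, and `resolve` are hypothetical names, decoders are represented by plain strings, and paths are plain dotted strings rather than the pointer-identity keys used in the PR's test file. The precedence (path-specific beats type-wide beats default) is an assumption of this sketch, not something the PR specifies.

```rust
use std::collections::HashMap;

/// Hypothetical type tag; stands in for an arrow data type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Kind {
    Int,
    Text,
}

/// Hypothetical router holding both kinds of customization.
#[derive(Default)]
struct Router {
    /// Path-based routes, keyed by dotted field path.
    by_path: HashMap<String, &'static str>,
    /// Type-based routes, keyed by type tag.
    by_kind: HashMap<Kind, &'static str>,
}

impl Router {
    /// Picks a decoder name. Assumption of this sketch: a path-specific route
    /// wins over a type-wide one, which wins over the default.
    fn resolve(&self, path: &str, kind: Kind) -> &'static str {
        self.by_path
            .get(path)
            .or_else(|| self.by_kind.get(&kind))
            .copied()
            .unwrap_or("default")
    }
}
```

With a type-wide route for `Kind::Text` and a path route for one badly behaved field, every other text field still gets the type-wide decoder and untouched fields fall through to the default.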

Contributor

alamb Jan 27, 2026

That makes sense to me

scovich mentioned this pull request

[WIP] Json decoder factory v2 #9272

Open

alamb pushed a commit that referenced this pull request


          make_decoder accepts borrowed DataType instead of owned (#9270)

fe126d0

# Which issue does this PR close?

- Part of #8987

# Rationale for this change

Today's json decoder helper, `make_decoder`, takes an owned data type
whose components are cloned at every level during the recursive decoder
initialization process. This breaks pointer stability of the resulting
`DataType` instances that a custom JSON decoder factory would see, vs.
those of the schema it and the reader builder were initialized with.

The lack of pointer stability prevents users from creating "path based"
decoder factories, that are able to customize decoder behavior based not
only on type, but also on the field's path in the schema. See the
`PathBasedDecoderFactory` in arrow-json/tests/custom_decoder_tests.rs of
#9259, for an example.

# What changes are included in this PR?

By passing `&DataType` instead, we change code like this:
```rust
let decoder = make_decoder(field.data_type().clone(), ...);
Ok(Self {
    data_type,
    decoder,
    ...
})
```

to this:
```rust
let decoder = make_decoder(field.data_type(), ...);
Ok(Self {
    data_type: data_type.clone(),
    decoder,
    ...
})
```

Result: Every call to `make_decoder` receives a reference to the actual
original data type from the builder's input schema. The final decoder
`Self` is unchanged -- it already received a clone and continues to do
so.

NOTE: There is one additional clone of the top-level `DataType::Struct`
we create for normal (`!is_field`) builders. But that's a cheap arc
clone of a `Fields` member.

# Are these changes tested?

Yes, existing unit tests validate the change.

# Are there any user-facing changes?

No. All functions and data types involved are private -- the array
decoders are marked `pub` but are defined in a private mod with no
public re-export that would make them available outside the crate.

debugmiller commented Jan 27, 2026

From my end this looks very good. Builds nicely on top of #9021 and covers some of the sharp corners.

Labels

arrow parquet-variant