-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[WIP] Explore fancier custom JSON decoding #9259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@debugmiller and @alamb, I'd love your thoughts on this pathfinding? It builds on top of @debugmiller excellent starting point and would merge mostly separately (afterward). The main consideration is that it would impact the public API we create, so it would be nice to get that part right on the first try. |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-review.
Also, I just realized there's a missing test, to validate recursive propagation of decoder factories. I'll try to add that soon.
| impl<O: OffsetSizeTrait> ListArrayDecoder<O> { | ||
| pub fn new( | ||
| data_type: DataType, | ||
| data_type: &DataType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed all the decoder paths to pass borrowed refs because it was leading to a lot of clone calls otherwise.
Plus, one of the example decoders relies on the datatype instances to have stable memory locations during the recursive make_decoder call.
arrow-json/src/reader/list_array.rs
Outdated
| is_nullable: bool, | ||
| struct_mode: StructMode, | ||
| decoder_factory: Option<Arc<dyn DecoderFactory>>, | ||
| decoder_factory: Option<&dyn DecoderFactory>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed because this change relies on forwarding decoder factories from a trait provided method, and &Arc<Self> receiver is not dyn-compatible.
arrow-json/src/reader/list_array.rs
Outdated
| Some(field.clone()), | ||
| field.data_type().clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one of the redundant clones I mentioned earlier, eliminated
arrow-json/src/reader/list_array.rs
Outdated
| field.data_type(), | ||
| field.is_nullable(), | ||
| field.metadata(), | ||
| coerce_primitive, | ||
| strict_mode, | ||
| field.is_nullable(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reordered to match the Field::new
| } | ||
| } | ||
|
|
||
| impl<'a> Hash for DataTypeIdentity<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| impl<'a> Hash for DataTypeIdentity<'a> { | |
| impl Hash for DataTypeIdentity<'_> { |
| enum DataTypeIdentity<'a> { | ||
| FieldRef(FieldRef), | ||
| DataType(&'a DataType), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took a long time to figure out this approach. It allows creating a hash map whose keys compare by pointer instead of by value, while still respecting lifetimes -- pointers are only created very briefly during actual hashing and comparisons.
| /// A factory that routes to custom decoders based on specific field paths. | ||
| /// | ||
| /// This allows fine-grained control: customize specific fields by name without | ||
| /// polluting the schema with metadata or affecting all fields of a given type. | ||
| #[derive(Debug)] | ||
| struct PathBasedDecoderFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really powerful factory -- it allows individually customizing every single field in a schema, without altering the schema itself.
Relies on the fact that the updated decoder code passes the exact same DataType reference to the decoder factory's constructor as will eventually be passed to the make_custom_decoder method calls. Once the decoders have actually been created, we no longer care what happens to the data type.
| // Walk the fields and associate DataTypeIdentity::FieldRef with factory for O(1) lookup | ||
| let mut routes = HashMap::new(); | ||
| for (path, factory) in path_routes { | ||
| let parts: Vec<&str> = path.split('.').collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A real implementation would use a more robust mechanism for expressing nested column names, but I'm not aware of anything already in arrow-rs that would fit. The variant paths used by e.g. variant_get are the closest thing I could find, but they also support array indexing, which is not meaningful here. And they're anyway in the wrong crate.
| /// Recursively find a Field by following a path of field names. | ||
| fn find_field_by_path(fields: &Fields, path: &[&str]) -> Option<FieldRef> { | ||
| let (first, rest) = path.split_first()?; | ||
| let field = fields.iter().find(|f| f.name() == *first)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: very annoying that Fields doesn't have an efficient way to find a field by name...
|
|
||
| impl DecoderFactory for AlwaysNullStringArrayDecoderFactory { | ||
| fn make_custom_decoder( | ||
| &self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of the idea that the factory gets registered for each column (rather than each type)? #9021 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After playing around in this PR, I think we need both path-based and type-based factories:
- Type-based -- for extension types, generally lenient parsing preferences, etc. The behavior of such decoders isn't related to its location in the schema, and expressing them path-based would require traversing the schema to find them would be annoying.
- Path-based -- for "quirks-mode" parsing of specific badly-bahaved fields, flexible parsing of free-form fields while leaving the rest of the schema strongly typed, etc.
NOTE: Although one could use variant for a lot of these use cases, it adds an extra layer of indirection and mostly pushes off the problem to whoever consumes the resulting variant column -- if they want to do anything fancy, they'll just end up dealing with e.g. Variant::String instead of TapeElement::String. Not an obvious win. Granted, variant_get adds a lot of capability flexibility and error tolerance (e.g. type mismatches just return NULL instead of errors), so it might be worth using variant instead of custom decoders in simple path-based cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me
# Which issue does this PR close? - Part of #8987 # Rationale for this change Today's json decoder helper, `make_decoder`, takes an owned data type whose components are cloned at every level during the recursive decoder initialization process. This breaks pointer stability of the resulting `DataType` instances that a custom JSON decoder factory would see, vs. those of the schema it and the reader builder were initialized with. The lack of pointer stability prevents users from creating "path based" decoder factories, that are able to customize decoder behavior based not only on type, but also on the field's path in the schema. See the `PathBasedDecoderFactory` in arrow-json/tests/custom_decoder_tests.rs of #9259, for an example. # What changes are included in this PR? By passing `&DataType` instead, we change code like this: ```rust let decoder = make_decoder(field.data_type().clone(), ...); Ok(Self { data_type, decoder, ... }) ``` to this: ```rust let child_decoder = make_decoder(field.data_type(), ...); Ok(Self { data_type: data_type.clone(), decoder, ... }) ``` Result: Every call to `make_decoder` receives a reference to the actual original data type from the builder's input schema. The final decoder `Self` is unchanged -- it already received a clone and continues to do so. NOTE: There is one additional clone of the top-level `DataType::Struct` we create for normal (`!is_field`) builders. But that's a cheap arc clone of a `Fields` member. # Are these changes tested? Yes, existing unit tests validate the change. # Are there any user-facing changes? No. All functions and data types involved are private -- the array decoders are marked `pub` but are defined in a private mod with no public re-export that would make them available outside the crate.
|
From my end this looks very good. Builds nicely on top of #9021 and covers some of the sharp corners. |
Which issue does this PR close?
N/A - pathfinding related to
Rationale for this change
NOTE: Stacked on top of the other PR, so only look at the last four commits of this PR!
Exploring #9021 (comment)
What changes are included in this PR?
Enhance the customization support to allow things like:
Most of the changes are in a new test file that demos them. Important supporting changes are made to arrow-json.
If we decide to pursue this direction, the PR should be split into at least three separate PR, in addition to the original PR this one builds on.
Are these changes tested?
Yes, new unit tests
Are there any user-facing changes?