-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Fix data loss in partial variant shredding #15087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@huaxingao @aihuaxu could you guys please take a look? |
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
| } else if (unshredded != null) { | ||
| for (String name : unshredded.fieldNames()) { | ||
| boolean replaced = shreddedFields.containsKey(name) || removedFields.contains(name); | ||
| boolean replaced = this.shreddedFields.containsKey(name) || removedFields.contains(name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @dirtysalt! Though admittedly I'm a bit confused why our linter didn't flag the hidden fields, maybe something to check separately. I think I'd prefer to rename the fields to something like currentShreddedFields, currentUnshreddedFields (something that better indicates that it's state). That makes the code a bit more clear imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this. Let's use different names to make it cleaner. Thanks for fixing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on renaming the field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why the name conflict was allowed is that this is the constructor so the check assumes that shreddedFields is used to initialize this.shreddedFields, which is true.
I agree that this should use a different name for the method argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way of renaming the field or the method argument sounds good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's fewer changes if you rename the argument. Just two lines.
| } | ||
|
|
||
| @Test | ||
| public void testPartialShreddingWithShreddedObject() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a more direct way to test this in TestShreddedObject? I think it's fine to have an end-to-end test, but I'd like to see a more direct one that doesn't go through Parquet.
| .named("typed_value"); | ||
|
|
||
| // Write and read back | ||
| List<Record> actual = writeAndRead(partialShredding, records); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an end-to-end test should live in TestVariantReaders rather than TestVariantWriters. The problem you're trying to recreate is in the read path when a partially shredded object is being reconstructed. I think that the tests should also be in the read path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this a bit more and ended up producing a test for TestShreddedObject that reproduces the problem. Here's my full diff of the test and the fix:
diff --git a/core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java b/core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java
index c1fba073cf..5d2bfc48fa 100644
--- a/core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java
+++ b/core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java
@@ -153,12 +153,12 @@ public class ShreddedObject implements VariantObject {
private SerializationState(
VariantMetadata metadata,
VariantObject unshredded,
- Map<String, VariantValue> shreddedFields,
+ Map<String, VariantValue> shredded,
Set<String> removedFields) {
this.metadata = metadata;
// field ID size is the size needed to store the largest field ID in the data
this.fieldIdSize = VariantUtil.sizeOf(metadata.dictionarySize());
- this.shreddedFields = Maps.newHashMap(shreddedFields);
+ this.shreddedFields = Maps.newHashMap(shredded);
int totalDataSize = 0;
// get the unshredded field names and values as byte buffers
diff --git a/core/src/test/java/org/apache/iceberg/variants/TestShreddedObject.java b/core/src/test/java/org/apache/iceberg/variants/TestShreddedObject.java
index 1ebb433a54..0cb043d420 100644
--- a/core/src/test/java/org/apache/iceberg/variants/TestShreddedObject.java
+++ b/core/src/test/java/org/apache/iceberg/variants/TestShreddedObject.java
@@ -213,6 +213,28 @@ public class TestShreddedObject {
.isEqualTo(DateTimeUtil.isoDateToDays("2024-10-12"));
}
+ @Test
+ public void testPartiallyShreddedUnserializedObjectSerializationMinimalBuffer() {
+ ShreddedObject partial = createUnserializedObject(FIELDS);
+ VariantMetadata metadata = partial.metadata();
+
+ // replace field c with a new value
+ partial.put("c", Variants.ofIsoDate("2024-10-12"));
+ partial.remove("b");
+
+ VariantValue value = roundTripMinimalBuffer(partial, metadata);
+
+ assertThat(value).isInstanceOf(SerializedObject.class);
+ SerializedObject actual = (SerializedObject) value;
+
+ assertThat(actual.get("a")).isInstanceOf(VariantPrimitive.class);
+ assertThat(actual.get("a").asPrimitive().get()).isEqualTo(34);
+ assertThat(actual.get("c")).isInstanceOf(VariantPrimitive.class);
+ assertThat(actual.get("c").type()).isEqualTo(PhysicalType.DATE);
+ assertThat(actual.get("c").asPrimitive().get())
+ .isEqualTo(DateTimeUtil.isoDateToDays("2024-10-12"));
+ }
+
@Test
public void testPartiallyShreddedObjectSerializationLargeBuffer() {
ShreddedObject partial = createUnshreddedObject(FIELDS);
@@ -370,6 +392,12 @@ public class TestShreddedObject {
return Variants.value(metadata, slice);
}
+ private static ShreddedObject createUnserializedObject(Map<String, VariantValue> fields) {
+ ByteBuffer metadataBuffer = VariantTestUtil.createMetadata(fields.keySet(), false);
+ VariantMetadata metadata = SerializedMetadata.from(metadataBuffer);
+ return new ShreddedObject(metadata, createShreddedObject(metadata, fields));
+ }
+
/** Creates a ShreddedObject with fields in its shredded map, using the given metadata */
private static ShreddedObject createShreddedObject(
VariantMetadata metadata, Map<String, VariantValue> fields) {Feel free to use that test case. It would also be nice to have a Parquet test like this one. If you end up keeping it, please use helper methods to create the Parquet schema (see ParquetVariantUtil).
When using variantShreddingFunc to partially shred variant fields, unshredded fields were being lost during serialization. The bug was in ShreddedObject constructor where local variable
shreddedFieldsshadowed the instance fieldthis.shreddedFields, causing unshredded fields to be added to the local map instead of the instance field.This resulted in the binary value field containing only metadata headers without actual field data, causing IndexOutOfBoundsException on read and permanent data loss.
Fix: Changed all references in the problematic code block to use
this.shreddedFieldsexplicitly, ensuring unshredded fields are properly preserved in the instance field and serialized correctly.Added test case testPartialShreddingWithShreddedObject that reproduces the exact scenario from issue #15086.