-
Notifications
You must be signed in to change notification settings - Fork 196
[755] Support column stats for paimon #767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
xtable-core/src/main/java/org/apache/xtable/paimon/PaimonDataFileExtractor.java
Outdated
Show resolved
Hide resolved
| List<String> colNames = file.valueStatsCols(); | ||
| // log.info("valueStatsCols: {}", colNames); | ||
| if (colNames == null || colNames.isEmpty()) { | ||
| // if column names are not present, we assume all columns in the schema are present in the same order as the schema - TODO: validate this assumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q to Paimon experts: Is this assumption valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mao-liu are you in contact with anyone on the Paimon side to get these questions answered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikedias do you know the answers to any of these Paimon questions on this PR by any chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I don't have these answers...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @the-other-tim-brown , apologies I haven't been very active on this PR until this week.
I have just emailed the Paimon user group about these questions, and hoping to hear back soon.
We have been busy test-driving this change, and happy to report it's working well thus far!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xtable-core/src/main/java/org/apache/xtable/paimon/PaimonDataFileExtractor.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/paimon/PaimonDataFileExtractor.java
Outdated
Show resolved
Hide resolved
c0cabf3 to
2757735
Compare
2757735 to
6e927e2
Compare
| // TODO: Implement logic to extract column stats from the file meta | ||
| // https://github.com/apache/incubator-xtable/issues/755 | ||
| return Collections.emptyList(); | ||
| private List<ColumnStat> toColumnStats(DataFileMeta file, InternalSchema internalSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move the column stats conversion to its own class? In the future if we add Paimon as a target then we will also need to convert to the Paimon representation and it would be nice to have all this stats logic in its own class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thanks for the early review @the-other-tim-brown !
I do wonder though, if it is even possible to have Paimon as a target... Paimon has a pretty unique file layout, and might not be as easily "tricked" as other formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know enough about Paimon to say. Hudi also has a unique native layout structure to allow for update heavy workloads though and we were able to make this work.
Mainly we do this separation to keep the logic isolated though. As not necessarily relevant to Paimon, but if a table format changes how they represent stats in a new version, we can plug in the appropriate converter based on the version.
xtable-core/src/main/java/org/apache/xtable/paimon/PaimonDataFileExtractor.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/paimon/PaimonStatsExtractor.java
Outdated
Show resolved
Hide resolved
xtable-core/src/main/java/org/apache/xtable/paimon/PaimonStatsExtractor.java
Outdated
Show resolved
Hide resolved
|
Hey @the-other-tim-brown , we have validated the assumptions in this PR with responses from Paimon maintainers - no more TODOs from me, ready for your review now :) |
| Object min = getValue(minValues, i, type, field.getSchema()); | ||
| Object max = getValue(maxValues, i, type, field.getSchema()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the getValue returns null, should we set the range value to null as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, added this
| InternalType type = field.getSchema().getDataType(); | ||
| Object min = getValue(minValues, i, type, field.getSchema()); | ||
| Object max = getValue(maxValues, i, type, field.getSchema()); | ||
| Long nullCount = (nullCounts != null && i < nullCounts.size()) ? nullCounts.getLong(i) : 0L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: make this a primitive long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
| TestPaimonTable.createTable("test_table", "level", tempDir, new Configuration(), false); | ||
| paimonTable = testTable.getPaimonTable(); | ||
|
|
||
| // just the partition field matters for this test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious why the test setup is changing as part of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous tests did not accurately represent the schema of the test tables - the tables had certain columns, but the "testSchema" objects were empty.
The stats extractor rely on the table schema to parse binary stats data from Paimon metadata, hence the change to extract the schema properly
| // compaction create commits that are DELETE and ADD on the same file | ||
| // with `manifest.delete-file-drop-stats` enabled, this means stats are empty after compaction | ||
| // this is a smoke test to ensure exceptions aren't raised for this scenario | ||
| // TODO: Question for Paimon experts - is this the expected behaviour? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has this question been answered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hasn't been validated with the Paimon community, though empirically this is the behaviour of the manifest.delete-file-drop-stats configuration today.
I have created a bug report at apache/paimon#7026 , and replaced the TODO with a link to the GH issue
| import org.apache.xtable.model.storage.InternalDataFile; | ||
|
|
||
| @Log4j2 | ||
| public class TestPaimonStatsExtractor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add a test case where the schema has nested fields?
If it will cause a lot of changes, feel free to suggest we handle this in a follow up pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, I don't believe Paimon collects stats on nested fields or complex fields
I have added a test case showing this is the case.
| // Insert some data to create files | ||
| testTable.insertRows(5); | ||
|
|
||
| List<InternalDataFile> result = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some basic sanity checks that the stats are non-null/empty for the result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call! This has been added
98012d5 to
ad5b35a
Compare
ad5b35a to
4867167
Compare
|
@the-other-tim-brown thanks for your review! I have replied to your comments and made the suggested changes, thanks again! |
|
@mao-liu things look good overall but it looks like the tests are not passing. Can you take a look? |
@the-other-tim-brown Sorry fixed a test! Missed it somehow on my local. Shall we try the build again? |
| <version>${spark.version}</version> | ||
| <scope>provided</scope> | ||
| </dependency> | ||
| <dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mao-liu the Paimon dependency brings in transitive dependency on spark-catalyst. I had to pin the version to ensure it was consistent to solve the test failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks heaps @the-other-tim-brown
The latest CI run also had some more test failures in xtable-utilities, I've fixed them up now
b10488e to
1183121
Compare
|
I'm tracking the CI issues in this ticket #787 |
The CI Issue is now resolved, please rebase your PR when you have a chance. |
Thanks @the-other-tim-brown , done! |
Important Read
Closes #755
What is the purpose of the pull request
Adds support for Paimon metadata stats
Brief change log
Verify this pull request
This change added tests and can be verified as follows:
Dev Notes