-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Implement source-ids to deal with multi arguments transforms #12897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@rdblue @Fokko @RussellSpitzer ^^ Thanks ! |
| fields.add(new PartitionField(sourceId, fieldId, name, transform)); | ||
| Builder add(List<Integer> sourceIds, int fieldId, String name, Transform<?, ?> transform) { | ||
| // we use the first entry in the source-ids list here | ||
| checkAndAddPartitionName(name, sourceIds.get(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work for multi arg partition field?
Do we need a logic that accepts the sourceIds and resolve the name of multi arg transform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sourceId here is to resolve conflict (no impact on name):
if (sourceColumnId != null) {
// for identity transform case we allow conflicts between partition and schema field name
// as
// long as they are sourced from the same schema field
Preconditions.checkArgument(
schemaField == null || schemaField.fieldId() == sourceColumnId,
"Cannot create identity partition sourced from different field in schema: %s",
name);
Let me check if we should compare fieldId with each column id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current logic looks like
if (check conflicts). {
if identity {
make sure we aren't using a different field name for this identity transform
} else {
make sure we aren't matching any other column name
}
}
Make sure it's not empty
Make sure we haven't already used this name for another partition
Add partitionI have no idea why those last 2 checks aren't part of the "if check conflicts" branch
Anyyyyyyway. I think this whole validation probably should be rewritten. The first branch we are checking basically based a lot of implicit assumptions when we should just be passing in the transform. But we don't have to do any of that now.
For now I think we should pass in sourceIds and just have the first branch include a "if sourceIds.length == 1"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let me refactore this part.
a5b96f2 to
549c197
Compare
|
|
||
| for (UnboundPartitionField field : fields) { | ||
| Type fieldType = schema.findType(field.sourceId); | ||
| Type fieldType = schema.findType(field.sourceIds.get(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we don't have a multi-arg transform to resolve here yet but this doesn't seem like the right thing to do. I think Transforms.fromString needs to be modified to accept fromString(list[types], transformName)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agree. Let me update that.
| .isInstanceOf(IllegalArgumentException.class) | ||
| .hasMessage( | ||
| "Cannot parse partition field, either source-id or source-ids has to be present"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have some tests for "toJson" as well?
We also need a check that the validation for identity transforms still holds true. Ie you cannot make a multi-arg transform with the same name as any of the columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test in TestPartitionSpecParser testing when neither source-id and source-ids are provided: testFromJsonWithoutSourceIdAndSourceIds().
I will add additional tests for toJson path too.
RussellSpitzer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks pretty good to me. I think we are missing some testing but the approach looks correct. There are a few utility methods we need to fix up as well to accept sourceIds
|
Thanks @RussellSpitzer ! Let me fix the util methods and add tests. Thanks again ! |
549c197 to
1552f6c
Compare
| static void checkCompatibility(PartitionSpec spec, Schema schema) { | ||
| for (PartitionField field : spec.fields) { | ||
| Type sourceType = schema.findType(field.sourceId()); | ||
| Type sourceType = schema.findType(field.sourceIds().get(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be extending this logic as well to handle multi arg sources
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. I'm doing a new update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced new canTransform(List) and getResultType(List) to support multi-args in transforms.
I didn't remove the "old" methods, and I propose to deprecate it.
Thoughts ?
|
Hey guys. I was traveling this week. I'm now back on this pr, updating according to the comments. |
|
I did a first update to introduce multi-args in |
ccc3e48 to
b1b407f
Compare
|
Finally back from several trips, so resuming work on this PR. |
|
I fixed the tests. I'm addressing the pending comments. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
I'm resuming the work on this one (I have on vacation for the last two weeks). |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
29ccf37 to
9fe5338
Compare
|
I just did a quick "rebase quick fix". I'm resuming my work on this one. |
1d5c435 to
c3ce9fc
Compare
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
|
Here we go again 😄 |
|
Back again :) |
Following the update on the spec regarding
source-idandsource-ids(thanks again @Fokko 😄 ), here's the PR to introducesource-idsfield in partition spec.A few notes:
source-idssource-idandsource-idselements in the json (exclusive), both populatingsource-idsinternal representation (asList<Integer>)TestPartitionSpecParseris still testingsource-id, but alsosource-idsparsing, and neithersource-idandsource-idspresence (throwingIllegalArgumentExceptionin that case)