Conversation

Member

@jbonofre jbonofre commented Apr 25, 2025

Following the update on the spec regarding source-id and source-ids (thanks again @Fokko 😄 ), here's the PR to introduce source-ids field in partition spec.

A few notes:

  1. Internal representation is now based on source-ids
  2. Serialization/deserialization supports either the source-id or the source-ids element in the JSON (the two are mutually exclusive); both populate the internal source-ids representation (as List<Integer>)
  3. TestPartitionSpecParser still tests source-id parsing, and now also covers source-ids parsing and the case where neither source-id nor source-ids is present (an IllegalArgumentException is thrown in that case)
  4. For backward compatibility (especially for the transforms), source-id is still supported (using the first element of the internal representation), and some methods have been flagged as deprecated to encourage the use of source-ids.
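
To make note 2 concrete, here is a minimal sketch of the parsing rule, not the code in this PR. A plain Map stands in for the JSON object, and the class and method names (SourceIdsSketch, parseSourceIds) are hypothetical. Rejecting a field that carries both elements is an assumption drawn from the word "exclusive" above; the error message for the "neither" case is the one asserted in the test further down this thread.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the parser: a Map plays the role of the JSON object.
final class SourceIdsSketch {
  private SourceIdsSketch() {}

  // Returns the internal source-ids list from either "source-id" or "source-ids".
  static List<Integer> parseSourceIds(Map<String, Object> field) {
    boolean hasSingle = field.containsKey("source-id");
    boolean hasMulti = field.containsKey("source-ids");
    if (!hasSingle && !hasMulti) {
      throw new IllegalArgumentException(
          "Cannot parse partition field, either source-id or source-ids has to be present");
    }
    if (hasSingle && hasMulti) {
      // Assumption: "exclusive" means a field carrying both elements is rejected.
      throw new IllegalArgumentException(
          "Cannot parse partition field, source-id and source-ids are mutually exclusive");
    }
    if (hasSingle) {
      // A single source-id becomes a one-element list.
      return List.of((Integer) field.get("source-id"));
    }
    @SuppressWarnings("unchecked")
    List<Integer> ids = (List<Integer>) field.get("source-ids");
    return List.copyOf(ids);
  }
}
```

With this shape, a legacy field such as {"source-id": 1} and a multi-argument field such as {"source-ids": [1, 2]} both end up as a List<Integer>, which is what lets the rest of the code work on one representation.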

@jbonofre
Member Author

@rdblue @Fokko @RussellSpitzer ^^ Thanks !

@jbonofre jbonofre changed the title Implement source-ids to deal with multi arguments transforms Core: Implement source-ids to deal with multi arguments transforms Apr 25, 2025
      fields.add(new PartitionField(sourceId, fieldId, name, transform));
    Builder add(List<Integer> sourceIds, int fieldId, String name, Transform<?, ?> transform) {
      // we use the first entry in the source-ids list here
      checkAndAddPartitionName(name, sourceIds.get(0));
Member

How does this work for a multi-arg partition field?
Do we need logic that accepts the sourceIds and resolves the name of a multi-arg transform?

Member Author

@jbonofre jbonofre Apr 25, 2025

The sourceId here is to resolve conflict (no impact on name):

        if (sourceColumnId != null) {
          // for identity transform case we allow conflicts between partition and schema field name
          // as long as they are sourced from the same schema field
          Preconditions.checkArgument(
              schemaField == null || schemaField.fieldId() == sourceColumnId,
              "Cannot create identity partition sourced from different field in schema: %s",
              name);

Let me check if we should compare fieldId with each column id.

Member

The current logic looks like

if (check conflicts) {
  if identity {
    make sure we aren't using a different field name for this identity transform
  } else {
    make sure we aren't matching any other column name
  }
}
Make sure it's not empty
Make sure we haven't already used this name for another partition
Add partition

I have no idea why those last 2 checks aren't part of the "if check conflicts" branch

Anyway, I think this whole validation should probably be rewritten. The first branch relies on a lot of implicit assumptions when we should just be passing in the transform. But we don't have to do any of that now.

For now I think we should pass in sourceIds and just have the first branch include an "if sourceIds.size() == 1" check.
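
A minimal sketch of what that guard could look like, following the order of checks in the outline above. Everything here is a simplification for illustration: the class name PartitionNameCheckSketch, the boolean isIdentity flag, and the schema modeled as a name-to-field-id map are all hypothetical, and only the first error message is taken from the snippet quoted earlier; the other messages are illustrative.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified builder: the schema is a map from column name to field id.
final class PartitionNameCheckSketch {
  private final Map<String, Integer> schemaFieldIdsByName;
  private final Set<String> partitionNames = new HashSet<>();

  PartitionNameCheckSketch(Map<String, Integer> schemaFieldIdsByName) {
    this.schemaFieldIdsByName = schemaFieldIdsByName;
  }

  void checkAndAddPartitionName(String name, List<Integer> sourceIds, boolean isIdentity) {
    Integer schemaFieldId = schemaFieldIdsByName.get(name);
    if (isIdentity && sourceIds.size() == 1) {
      // Identity may reuse a column name only when it is sourced from that same column.
      if (schemaFieldId != null && !schemaFieldId.equals(sourceIds.get(0))) {
        throw new IllegalArgumentException(
            "Cannot create identity partition sourced from different field in schema: " + name);
      }
    } else if (schemaFieldId != null) {
      // Any other transform, including a multi-arg one, must not reuse a column name.
      throw new IllegalArgumentException("Partition name conflicts with schema column: " + name);
    }
    if (name.isEmpty()) {
      throw new IllegalArgumentException("Cannot use empty partition name");
    }
    if (!partitionNames.add(name)) {
      throw new IllegalArgumentException("Cannot use partition name more than once: " + name);
    }
  }
}
```

Because a multi-arg field never enters the identity branch, it falls through to the stricter rule and cannot share a name with any column.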

Member Author

OK, let me refactor this part.

@jbonofre jbonofre force-pushed the multi-arg-transforms branch 3 times, most recently from a5b96f2 to 549c197 Compare April 25, 2025 13:29

    for (UnboundPartitionField field : fields) {
-     Type fieldType = schema.findType(field.sourceId);
+     Type fieldType = schema.findType(field.sourceIds.get(0));
Member

I know we don't have a multi-arg transform to resolve here yet but this doesn't seem like the right thing to do. I think Transforms.fromString needs to be modified to accept fromString(list[types], transformName)
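
A rough sketch of the shape being suggested: resolve the type of every source id first, then hand the whole list to the transform lookup. This is not the existing Transforms API. The class SourceTypesSketch, both method signatures, and the use of plain strings for types are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: types are plain strings and the schema is an id-to-type map.
final class SourceTypesSketch {
  private SourceTypesSketch() {}

  // Stand-in for calling schema.findType on every source id instead of only the first.
  static List<String> findTypes(Map<Integer, String> schemaTypesById, List<Integer> sourceIds) {
    List<String> types = new ArrayList<>(sourceIds.size());
    for (Integer id : sourceIds) {
      String type = schemaTypesById.get(id);
      if (type == null) {
        throw new IllegalArgumentException("Cannot find source column for id: " + id);
      }
      types.add(type);
    }
    return types;
  }

  // Stand-in for a list-based fromString(types, transformName); it only describes the binding.
  static String fromString(List<String> sourceTypes, String transformName) {
    return transformName + "(" + String.join(", ", sourceTypes) + ")";
  }
}
```

The point of the two-step shape is that no caller needs to pick get(0): a single-argument transform simply receives a one-element list.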

Member Author

Yes, agree. Let me update that.

        .isInstanceOf(IllegalArgumentException.class)
        .hasMessage(
            "Cannot parse partition field, either source-id or source-ids has to be present");
  }
Member

We should have some tests for "toJson" as well?

We also need a check that the validation for identity transforms still holds true, i.e., you cannot make a multi-arg transform with the same name as any of the columns.

Member Author

@jbonofre jbonofre Apr 26, 2025

I added a test in TestPartitionSpecParser that covers the case where neither source-id nor source-ids is provided: testFromJsonWithoutSourceIdAndSourceIds().
I will add additional tests for toJson path too.
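
For the toJson path, a test would need a writer behavior to assert against. The sketch below shows one possible behavior and is an assumption, not what this PR implements: it writes source-id when there is exactly one source and source-ids otherwise, so the two elements stay exclusive. The class name PartitionFieldJsonSketch and the Map-based output are hypothetical stand-ins for the real generator.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer: a LinkedHashMap stands in for the generated JSON object.
final class PartitionFieldJsonSketch {
  private PartitionFieldJsonSketch() {}

  static Map<String, Object> toJson(
      List<Integer> sourceIds, int fieldId, String name, String transform) {
    if (sourceIds.isEmpty()) {
      throw new IllegalArgumentException("Partition field requires at least one source id");
    }
    Map<String, Object> json = new LinkedHashMap<>();
    json.put("name", name);
    json.put("transform", transform);
    if (sourceIds.size() == 1) {
      // Assumption: single-source fields keep the legacy element for older readers.
      json.put("source-id", sourceIds.get(0));
    } else {
      json.put("source-ids", List.copyOf(sourceIds));
    }
    json.put("field-id", fieldId);
    return json;
  }
}
```

A toJson test could then assert that exactly one of the two keys is present for each field.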

Member

@RussellSpitzer RussellSpitzer left a comment

Overall this looks pretty good to me. I think we are missing some testing but the approach looks correct. There are a few utility methods we need to fix up as well to accept sourceIds

@jbonofre
Member Author

Thanks @RussellSpitzer ! Let me fix the util methods and add tests. Thanks again !

@jbonofre jbonofre force-pushed the multi-arg-transforms branch from 549c197 to 1552f6c Compare April 26, 2025 04:28
  static void checkCompatibility(PartitionSpec spec, Schema schema) {
    for (PartitionField field : spec.fields) {
-     Type sourceType = schema.findType(field.sourceId());
+     Type sourceType = schema.findType(field.sourceIds().get(0));
Member

We should be extending this logic as well to handle multi arg sources

Member Author

Yup. I'm doing a new update.

Member Author

I introduced new canTransform(List) and getResultType(List) to support multi-args in transforms.
I didn't remove the "old" methods, and I propose to deprecate them.
Thoughts ?
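
One way the proposal could be laid out, as a sketch only: the list-based overloads are default methods that delegate to the single-argument ones, so existing transforms keep working unchanged while a multi-arg transform overrides them. The interface and class names (MultiArgTransformSketch, IdentitySketch, CompatibilitySketch) are hypothetical, and types are plain strings rather than the real Type class.

```java
import java.util.List;

// Hypothetical, simplified transform interface; types are plain strings here.
interface MultiArgTransformSketch {
  @Deprecated
  boolean canTransform(String sourceType);

  @Deprecated
  String getResultType(String sourceType);

  // Default: a transform that was written for one argument accepts only one-element lists.
  default boolean canTransform(List<String> sourceTypes) {
    return sourceTypes.size() == 1 && canTransform(sourceTypes.get(0));
  }

  default String getResultType(List<String> sourceTypes) {
    if (!canTransform(sourceTypes)) {
      throw new IllegalArgumentException("Cannot transform source types: " + sourceTypes);
    }
    return getResultType(sourceTypes.get(0));
  }
}

// Existing single-argument transform: no changes needed to support the list overloads.
final class IdentitySketch implements MultiArgTransformSketch {
  @Override
  public boolean canTransform(String sourceType) {
    return true;
  }

  @Override
  public String getResultType(String sourceType) {
    return sourceType;
  }
}

final class CompatibilitySketch {
  private CompatibilitySketch() {}

  // Compatibility check that looks at every source type instead of only the first one.
  static void checkCompatibility(MultiArgTransformSketch transform, List<String> sourceTypes) {
    if (!transform.canTransform(sourceTypes)) {
      throw new IllegalArgumentException("Invalid source types for transform: " + sourceTypes);
    }
  }
}
```

With this layout, checkCompatibility no longer needs sourceIds().get(0): it passes the full list, and a single-argument transform rejects a list with more than one entry.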

@jbonofre
Member Author

Hey guys. I was traveling this week. I'm now back on this pr, updating according to the comments.

@jbonofre
Member Author

I did a first update to introduce multi-args in Transform. I will check/update the tests too.

@jbonofre jbonofre force-pushed the multi-arg-transforms branch 2 times, most recently from ccc3e48 to b1b407f Compare May 17, 2025 06:12
@jbonofre
Member Author

Finally back from several trips, so resuming work on this PR.

@jbonofre jbonofre added this to the Iceberg 1.10.0 milestone May 30, 2025
@jbonofre
Member Author

jbonofre commented Jun 2, 2025

I fixed the tests. I'm addressing the pending comments.

@github-actions

github-actions bot commented Aug 2, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 2, 2025
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Aug 10, 2025
@jbonofre jbonofre reopened this Aug 18, 2025
@jbonofre
Member Author

I'm resuming the work on this one (I have been on vacation for the last two weeks).

@github-actions github-actions bot removed the stale label Aug 19, 2025
@github-actions github-actions bot added the stale label Sep 19, 2025
@jbonofre jbonofre force-pushed the multi-arg-transforms branch from 29ccf37 to 9fe5338 Compare September 22, 2025 12:00
@jbonofre
Member Author

I just did a quick "rebase quick fix". I'm resuming my work on this one.

@jbonofre jbonofre force-pushed the multi-arg-transforms branch from 1d5c435 to c3ce9fc Compare September 22, 2025 15:33
@github-actions github-actions bot removed the stale label Sep 23, 2025
@github-actions github-actions bot added the stale label Oct 23, 2025
@github-actions github-actions bot closed this Oct 30, 2025
@jbonofre
Member Author

jbonofre commented Nov 3, 2025

Here we go again 😄

@jbonofre jbonofre reopened this Nov 3, 2025
@singhpk234 singhpk234 added not-stale and removed stale labels Nov 3, 2025
@jbonofre
Member Author

jbonofre commented Dec 1, 2025

Back again :)

6 participants