Core: TableMetadata Projection #14502

singhpk234 · 2025-11-04T18:22:35Z

About the change [Updated]

Introduces TableMetadataProjection on top of the TableMetadata.

Use cases:
1/ The snapshot summary contains information such as total files and total records, and when partition summaries are enabled it also contains per-partition summaries. That is sensitive information when the table is protected by FGAC, especially a Row Access Policy. The projection lets the RESTCatalog drop the snapshot summary before the table metadata is sent to the (untrusted) client as part of the LoadTable response.
2/ Dropping the summary from the snapshot object can also reduce the transfer size.

Testing

Unit tests with and without lazy loading, and for parsing.

singhpk234 · 2025-11-05T02:48:46Z

I was talking to @stevenzwu, and he suggested a nice alternative inspired by StructProjection. Adding it here for the broader forum; technically it is indeed a projection.

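class TableMetadataProjection {

  static TableMetadata create(TableMetadata metadata, Function<Snapshot, Snapshot> transformer) {
    // Sketch: how the wrapped metadata's members reach the subclass is left open here.
    return new SnapshotTransformer(metadata, transformer);
  }

  private static class SnapshotTransformer extends TableMetadata {
    private final Function<Snapshot, Snapshot> transformer;

    // Constructor omitted: it would pass the members of the wrapped metadata to super(...).

    @Override
    public List<Snapshot> snapshots() {
      List<Snapshot> s = super.snapshots();
      // Apply the transform to each snapshot in s.
      return s.stream().map(transformer).collect(Collectors.toList());
    }
  }
}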

In this approach we will need to give all the members getters; they can be package-private for sure! ~~We would additionally want to serialize the transformer too if we want to send this back as part of the LoadTableResponse?~~
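For readers following the thread, here is a minimal, self-contained sketch of the accessor-level projection idea described above. `SketchSnapshot`, `SketchMetadata`, and `SketchMetadataProjection` are hypothetical stand-ins invented for illustration; they are not the Iceberg classes and do not reflect the code in this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for a snapshot: an id plus a summary map.
final class SketchSnapshot {
  private final long id;
  private final Map<String, String> summary;

  SketchSnapshot(long id, Map<String, String> summary) {
    this.id = id;
    this.summary = Map.copyOf(summary);
  }

  long id() {
    return id;
  }

  Map<String, String> summary() {
    return summary;
  }
}

// Hypothetical stand-in for table metadata: it only holds snapshots.
class SketchMetadata {
  private final List<SketchSnapshot> snapshots;

  SketchMetadata(List<SketchSnapshot> snapshots) {
    this.snapshots = List.copyOf(snapshots);
  }

  List<SketchSnapshot> snapshots() {
    return snapshots;
  }
}

// The projection subclasses the metadata type and transforms snapshots on read.
class SketchMetadataProjection extends SketchMetadata {
  private final Function<SketchSnapshot, SketchSnapshot> transformer;

  SketchMetadataProjection(
      SketchMetadata base, Function<SketchSnapshot, SketchSnapshot> transformer) {
    super(base.snapshots());
    this.transformer = transformer;
  }

  @Override
  List<SketchSnapshot> snapshots() {
    // The stored snapshots are untouched; only the returned list is transformed.
    return super.snapshots().stream().map(transformer).collect(Collectors.toList());
  }
}

class ProjectionDemo {
  public static void main(String[] args) {
    SketchMetadata base =
        new SketchMetadata(List.of(new SketchSnapshot(1L, Map.of("total-records", "10"))));
    // Transformer that drops the whole summary.
    SketchMetadata projected =
        new SketchMetadataProjection(base, s -> new SketchSnapshot(s.id(), Map.of()));
    System.out.println(projected.snapshots().get(0).summary().isEmpty());
  }
}
```

Note that in this sketch the projection is itself a valid `SketchMetadata`, so any code accepting the base type will also accept the redacted view.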

github-actions · 2025-12-26T00:19:01Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

rdblue · 2026-01-16T23:43:48Z

core/src/main/java/org/apache/iceberg/TableMetadata.java

+    this.snapshotsById =
+        this.snapshots != null
+            ? indexAndValidateSnapshots(this.snapshots, lastSequenceNumber)
+            : ImmutableMap.of();


Why does this need to change TableMetadata? I think it can be dangerous if now there are cases where snapshots is null and that doesn't represent the actual table state. If I understand the intent of this PR, it is to have an alternative to TableMetadata that suppresses some information.

Ah, I see that this is because TableMetadataProjection inherits from TableMetadata. I think that pattern would cause problems as I mentioned above because TableMetadataProjection is a perfectly valid TableMetadata.

Is this really needed? If the idea is to transform snapshots when they are returned by accessors, you may not need to change this class at all.

singhpk234 · 2026-01-17

Agree, this is not needed for this PR at all. It is mostly for the case when snapshots is null and we call indexAndValidateSnapshots, which fails with an NPE here:

private static Map<Long, Snapshot> indexAndValidateSnapshots(
    List<Snapshot> snapshots, long lastSequenceNumber) {
  for (Snapshot snap : snapshots) {
    ....
  }
  return builder.build();
}

Happy to remove it or create a separate PR to fix this; I caught it while testing the projection.

rdblue · 2026-01-20

I think that there should never be a case where snapshots is null. I'd probably keep that assumption in the code and, if anything, just add a check in the constructor that the list is valid.
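A constructor check of the kind suggested here could look roughly like the sketch below. `SnapshotListHolder` is a hypothetical class used only for illustration; it is not `TableMetadata`, and the exception type and message are assumptions.

```java
import java.util.List;

// Hypothetical holder used to illustrate fail-fast validation in a constructor.
class SnapshotListHolder {
  private final List<Long> snapshotIds;

  SnapshotListHolder(List<Long> snapshotIds) {
    // Reject null up front with a clear message, rather than hitting a
    // NullPointerException later when the list is indexed.
    if (snapshotIds == null) {
      throw new IllegalArgumentException("Invalid snapshots: null");
    }
    // An empty list is valid; a defensive copy keeps the holder immutable.
    this.snapshotIds = List.copyOf(snapshotIds);
  }

  List<Long> snapshotIds() {
    return snapshotIds;
  }
}

class ValidationDemo {
  public static void main(String[] args) {
    System.out.println(new SnapshotListHolder(List.of(1L, 2L)).snapshotIds());
    try {
      new SnapshotListHolder(null);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

This keeps the "snapshots is never null" assumption in one place instead of adding null branches where the list is used.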

rdblue · 2026-01-16T23:52:15Z

I don't know that I think this is a good idea. I think that the primary problem is that the snapshot summary may persist partition information that could be sensitive. To me, the right solution is to stop embedding partition information in the snapshot summary and instead capture that data (if it is needed) using the metrics reporting framework and REST endpoint. That solution to getting partition metrics keeps partition info out of the snapshot summaries and tracks it through a separate path where it can be transient or protected differently.

If the primary reason for introducing this is to stop leaking partition summary information in snapshots, then I'd recommend solving that problem more directly with something like a catalog override that suppresses them. Or just drop them at the catalog level when processing AddSnapshot changes.

singhpk234 · 2026-01-20T08:14:51Z

Thank you for the feedback, @rdblue!

> the right solution is to stop embedding partition information in the snapshot summary and instead capture that data (if it is needed) using the metrics reporting framework and REST endpoint

Agree, I think it's an anti-pattern that we leak this information, especially when there are multiple ways to achieve the same thing. I am not sure we have a clear way to ban such writers; maybe the end user built a dashboard on top of it because it's convenient for them?

> I'd recommend solving that problem more directly with something like a catalog override that suppresses them.

IIUC, there can be cases where a table was unprotected when a snapshot containing partition stats was added, but is now protected (we could always enforce not adding partition summaries, irrespective of whether the table is protected). Maybe this is a check we would then need to do as part of policy (RAP) attachment, to make the attachment fail. But I think policies are sometimes attached via tags, so maybe it becomes a runtime failure: "hey, this table is protected but it has sensitive info which the catalog can't hide", where we throw a 403 and prompt the user to fix it. Would expiring the snapshot be the only solution then? Or do we expect the user to rewrite the metadata.json without such summaries and then do a force register?

> Or just drop them at the catalog level when processing AddSnapshot changes.

My understanding is that unless we spec this out, it would be hard to enforce across catalogs. Take federation, for example, where one defines a policy on a federated table (catalog C1 federating to catalog C2): we will run into cases where AddSnapshot in C2 didn't enforce this, so the table can't be queried and we fail at runtime when it is queried from C1, since that is where the policies are defined.

Hence I thought that having something like a metadata projection would give catalogs some flexibility to properly redact info (since the snapshot summary is optional) without burdening the end user.

Please let me know your thoughts considering the above.

rdblue · 2026-01-20T17:23:20Z

> I am not sure we have a clear way to ban such writers

I would have the REST catalog service remove any `partition.` keys from snapshot metadata.

> there can be cases such as a table was un-protected when the snapshot was added which contained partition stats

I agree, but snapshots usually don't last very long (days, unless it is the current version). So I'd expect that transition to enforcing no partition stats in snapshot summaries on the server side would fix this fairly quickly. You could also check for this when loading a table for a client.

If you detect this, you can also remove the stats on the server side. That may look similar to what is here, but I think that it is better to have limited code in the service for this rather than introduce something in the library. We don't want people using this to edit old snapshots, which are effectively immutable right now. Introducing this would change that guarantee.
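As an illustration of the "limited code in the service" option, the sketch below shows a server-side helper that detects and strips summary entries by key prefix before a response is built. The class name `SummaryRedactor` and the `"partitions."` prefix used in the demo are assumptions for this example; a real service would use whatever keys its writers actually emit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical server-side helper; not part of the Iceberg library.
class SummaryRedactor {

  // True if any summary key starts with the given prefix.
  static boolean hasKeysWithPrefix(Map<String, String> summary, String prefix) {
    return summary.keySet().stream().anyMatch(key -> key.startsWith(prefix));
  }

  // Returns a copy of the summary without the prefixed keys; the input is not modified.
  static Map<String, String> withoutKeysWithPrefix(Map<String, String> summary, String prefix) {
    Map<String, String> redacted = new LinkedHashMap<>();
    summary.forEach(
        (key, value) -> {
          if (!key.startsWith(prefix)) {
            redacted.put(key, value);
          }
        });
    return redacted;
  }
}

class SummaryRedactorDemo {
  public static void main(String[] args) {
    // Assumed key layout, for illustration only.
    String prefix = "partitions.";
    Map<String, String> summary = new LinkedHashMap<>();
    summary.put("total-records", "10");
    summary.put("partitions.region=eu", "added-records=4");

    if (SummaryRedactor.hasKeysWithPrefix(summary, prefix)) {
      System.out.println(SummaryRedactor.withoutKeysWithPrefix(summary, prefix));
    }
  }
}
```

Because the helper returns a copy, the stored snapshot is never edited; only the outgoing response differs.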

> hard to enforce across catalog

I think we'd need to define the use case a bit more clearly, but my initial take is that it's up to the catalog. If the primary catalog (the source of truth) drops the partition stats, then other catalogs should receive snapshots without them. And if the primary doesn't drop stats, then it is up to the receiving catalog how it chooses to handle that case. If it needs to drop stats for its own security model then I don't see how that would be a problem.

amogh-jahagirdar · 2026-01-20

I think there may really be a few different conversations happening here; I'll primarily keep my response on what I recall the primary driver of this PR being, which is exposing a way in the library to selectively redact (or inversely, project) fields in Snapshots for FGAC use cases, to prevent leakages where a user could make inferences from metadata that they shouldn't be able to. These are primarily meant to be APIs that server implementations can leverage when sending metadata back to the client. The argument being that without such a thing, server implementations may have to use messy techniques like reflection to set these fields how they want (more on this later, since I don't think it has to be reflection).

Initially, I was a proponent of having a projection or some kind of builder APIs at the snapshot level (not really the TableMetadata level as done in this PR), but after a lot of thought I think I'm at the same conclusion @rdblue is at:

> If you detect this, you can also remove the stats on the server side. That may look similar to what is here, but I think that it is better to have limited code in the service for this rather than introduce something in the library. We don't want people using this to edit old snapshots, which are effectively immutable right now. Introducing this would change that guarantee.

I think that's a pretty good reason we don't expose the ability to build new snapshots and why that's fairly abstracted in the library. Now, the next question in my head is: "If we can't expose ways to build Snapshots and mutate them, or give the impression of mutating them via 'transforming', are there other things worth exposing in the Iceberg library to make this use case easier?" I thought about whether there's a minimal projection-like API we could somehow express to SnapshotParser#toJson, and I still come back to: ultimately, no, this is probably better served by narrow implementations in servers.

While I can understand the desire to avoid reflection, I don't think it necessarily has to be the way. E.g. if you're using something like Jackson, when serializing responses it's possible to specify a TokenFilter in the mapper write, and the token filter implementation could reside in the server and redact the desired fields. I think a lot of the JSON serialization libraries have a similar mechanism. It's probably still best to combine this, though, with reflection at a DTO level or something in the server, so it's clear at the in-memory representation layer on the server that things are masked.
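To make the serialization-time redaction idea concrete without depending on a JSON library, the sketch below filters a generic JSON-like tree (nested maps, lists, and scalars) and drops every field with a given name. This is a plain-Java stand-in for the kind of filtering a Jackson TokenFilter could do during the write; it does not use the Jackson API, and `JsonTreeRedactor` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: redacts a field from a JSON-like tree before it is serialized.
class JsonTreeRedactor {

  // Returns a deep copy of the tree without any object field named fieldToDrop.
  // Maps are treated as JSON objects, lists as JSON arrays, everything else as a scalar.
  static Object dropField(Object node, String fieldToDrop) {
    if (node instanceof Map<?, ?>) {
      Map<String, Object> copy = new LinkedHashMap<>();
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
        String name = String.valueOf(entry.getKey());
        if (!name.equals(fieldToDrop)) {
          copy.put(name, dropField(entry.getValue(), fieldToDrop));
        }
      }
      return copy;
    }
    if (node instanceof List<?>) {
      List<Object> copy = new ArrayList<>();
      for (Object item : (List<?>) node) {
        copy.add(dropField(item, fieldToDrop));
      }
      return copy;
    }
    return node;
  }
}

class JsonTreeRedactorDemo {
  public static void main(String[] args) {
    Map<String, Object> snapshot = new LinkedHashMap<>();
    snapshot.put("snapshot-id", 1L);
    snapshot.put("summary", Map.of("total-records", "10"));

    Map<String, Object> metadata = new LinkedHashMap<>();
    metadata.put("format-version", 2);
    metadata.put("snapshots", List.of(snapshot));

    System.out.println(JsonTreeRedactor.dropField(metadata, "summary"));
  }
}
```

The sketch drops the named field at any depth; a real filter would more likely be path-aware so that it only touches the summary under each snapshot entry.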

singhpk234 · 2026-01-22T02:36:00Z

> If you detect this, you can also remove the stats on the server side. That may look similar to what is here, but I think that it is better to have limited code in the service for this rather than introduce something in the library. We don't want people using this to edit old snapshots, which are effectively immutable right now. Introducing this would change that guarantee.

I thought about it a lot, and I think it's fair not to include this in the library then. TableMetadataProjection would have circumvented this interpretation, but I think the boundary becomes blurred, since the projection is only what the server sees; for the client it is still the immutable TableMetadata. I think for new snapshots, when we know the table is protected, we can just plainly reject them and ask the customer to fix it, and fail at runtime if we detect that any snapshot has a partition summary and the table is protected, prompting the customer to fix or expire the snapshot (not sure if they would be open to it). All in all, I think I got the standpoint that we don't want to be opinionated about catalog behaviour, and that from the library's POV we work under the assumption of the TableMetadata being immutable. I will think more about whether I can come up with something generic enough in the context of protected tables so that we have some spec language. Thank you for the feedback here, @rdblue @amogh-jahagirdar.

> if you're using something like Jackson, when serializing responses it's possible to specify a TokenFilter in the mapper write

Definitely, I think maybe I should just write a serializer and register it in my Jackson toJson to strip summaries out of snapshots.

While we do this, do we think it's valuable to remove writing the partition summary from the SDK now that we have partition stats? See write.summary.partition-limit in https://iceberg.apache.org/docs/nightly/configuration/#write-properties.

github-actions bot added the core label Nov 4, 2025

sfc-gh-prsingh force-pushed the feature/snapshot-transformer branch 2 times, most recently from 728959e to 76306db Compare November 4, 2025 19:15

singhpk234 requested a review from amogh-jahagirdar November 4, 2025 19:24

singhpk234 closed this Nov 4, 2025

singhpk234 reopened this Nov 4, 2025

singhpk234 closed this Nov 4, 2025

singhpk234 reopened this Nov 5, 2025

singhpk234 requested a review from stevenzwu November 5, 2025 02:48

sfc-gh-prsingh force-pushed the feature/snapshot-transformer branch from 76306db to af17886 Compare November 5, 2025 14:49

Core: Add snapshotTransformer API for TableMetadata

be138e7

sfc-gh-prsingh force-pushed the feature/snapshot-transformer branch from af17886 to be138e7 Compare November 5, 2025 15:29

singhpk234 marked this pull request as ready for review November 6, 2025 21:22

singhpk234 requested review from danielcweeks and nastra November 7, 2025 07:19

Add projection for TableMetadata

0ccb33d

sfc-gh-prsingh force-pushed the feature/snapshot-transformer branch from 1b1e9bd to 0ccb33d Compare November 25, 2025 01:56

singhpk234 changed the title ~~Core: Add snapshotTransformer API for TableMetadata~~ Core: TableMetadata Projection Nov 25, 2025

github-actions bot added the stale label Dec 26, 2025

singhpk234 added not-stale and removed stale labels Dec 26, 2025

singhpk234 mentioned this pull request Dec 30, 2025

REST: Reuse table metadata as part of LoadTable in serializable table #14944

Open

singhpk234 added this to the Iceberg 1.11.0 milestone Jan 11, 2026

singhpk234 requested a review from RussellSpitzer January 16, 2026 22:49

rdblue reviewed Jan 16, 2026


amogh-jahagirdar reviewed Jan 20, 2026
