-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: TableMetadata Projection #14502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
728959e to
76306db
Compare
|
I was talking to @stevenzwu, he suggested a nice alternative here inspired by StructProjection adding it here for broder forum, techinically indeed its a projection In this approach we will need to make all the members getters, they can be package private for sure ! |
76306db to
af17886
Compare
af17886 to
be138e7
Compare
1b1e9bd to
0ccb33d
Compare
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
| this.snapshotsById = | ||
| this.snapshots != null | ||
| ? indexAndValidateSnapshots(this.snapshots, lastSequenceNumber) | ||
| : ImmutableMap.of(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to change TableMetadata? I think it can be dangerous if now there are cases where snapshots is null and that doesn't represent the actual table state. If I understand the intent of this PR, it is to have an alternative to TableMetadata that suppresses some information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see that this is because TableMetadataProjection inherits from TableMetadata. I think that pattern would cause problems as I mentioned above because TableMetadataProjection is a perfectly valid TableMetadata.
Is this really needed? If the idea is to transform snapshots when they are returned by accessors, you may not need to change this class at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, this is not needed for this pr at all this is mostly for case when snapshot is null and we call indexAndValidateSnapshots which fails with an NPE here
private static Map<Long, Snapshot> indexAndValidateSnapshots(
List<Snapshot> snapshots, long lastSequenceNumber) {
for (Snapshot snap : snapshots) {
....
}
return builder.build();
}
Happy to remove it or create a seperate pr for fixing this, caught this while testing projection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that there should never be a case where snapshots is null. I'd probably keep that assumption in the code and, if anything, just add a check in the constructor that the list is valid.
|
I don't know that I think this is a good idea. I think that the primary problem is that the snapshot summary may persist partition information that could be sensitive. To me, the right solution is to stop embedding partition information in the snapshot summary and instead capture that data (if it is needed) using the metrics reporting framework and REST endpoint. That solution to getting partition metrics keeps partition info out of the snapshot summaries and tracks it through a separate path where it can be transient or protected differently. If the primary reason for introducing this is to stop leaking partition summary information in snapshots, then I'd recommend solving that problem more directly with something like a catalog override that suppresses them. Or just drop them at the catalog level when processing |
|
Thank you for the feedbacks @rdblue !
Agree, i think its an anti-pattern here were we leak stuff specially there are multiple ways to achieve the same. I am not sure we have a clear way to ban such writers, may be the end user made a dashboard on top of it because its convenient for them ?
IIUC, there can be cases such as a table was un-protected when the snapshot was added which contained partition stats, but now it is protected (we can enforce always to not add partition summary irrespective if the table is protected or not), may be this is a check then we would need to do as part of policy (RAP) attachment to make the attachment fail, but i think policy is sometimes attached via TAGs, may be failure at runtime then that "hey this table is protected but it has sensitive info which catalog can't hide", we throw 403 and prompt user to fix it. would expiring the snapshot be only solution then ? or we expect the user to rewrite the
My understanding is unless we spec this out, it would hard to enforce across catalog, for example the cases of federation where one defines a policy on a federated table (catalog C1 federating to catalog C2) in will run into cases where AddSnapshot in C2 didn't enforce this and hence the table can't be queried now and we fail at runtime when queried from C1 since the policies are defined here. Hence i thought having something like metadata projection would give some flexibility to the catalogs to properly redact info (since snapshot summary is optional) without burdening the end-user. Please let me know your thoughts considering above. |
I would have the REST catalog service remove any
I agree, but snapshots usually don't last very long (days, unless it is the current version). So I'd expect that transition to enforcing no partition stats in snapshot summaries on the server side would fix this fairly quickly. You could also check for this when loading a table for a client. If you detect this, you can also remove the stats on the server side. That may look similar to what is here, but I think that it is better to have limited code in the service for this rather than introduce something in the library. We don't want people using this to edit old snapshots, which are effectively immutable right now. Introducing this would change that guarantee.
I think we'd need to define the use case a bit more clearly, but my initial take is that it's up to the catalog. If the primary catalog (the source of truth) drops the partition stats, then other catalogs should receive snapshots without them. And if the primary doesn't drop stats, then it is up to the receiving catalog how it chooses to handle that case. If it needs to drop stats for its own security model then I don't see how that would be a problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there may really be a few different conversations happening here; I'll primarily keep my response on what I recall the primary driver of this PR being, which is exposing a way in the library to selectively redact (or inversely, project) fields in Snapshots for FGAC use cases to prevent leakages where a user could make some inferences from metadata that they shouldn't be able to. This primarily is meant to be APIs that server implementations can leverage when sending metadata back to the client. The argument being without such a thing, server implementations may have to use messy techniques like reflection to set these fields how they want (more on this later, since i don't think it has to be reflection).
Initially, I was a proponent of having a projection or some kind of builder APIs at the snapshot level (not really the tablemetadata level like done here in this PR), but after a lot of thought I think I'm at the same conclusion @rdblue is at
If you detect this, you can also remove the stats on the server side. That may look similar to what is here, but I think that it is better to have limited code in the service for this rather than introduce something in the library. We don't want people using this to edit old snapshots, which are effectively immutable right now. Introducing this would change that guarantee.
I think that's a pretty good reason we don't expose the ability to build new snapshots and why that's fairly abstracted in the library. Now, the next question in my head is "If we can't expose ways to build Snapshots and mutate them, or give the impression of mutating them via "transforming", are there other things that are worth exposing in the Iceberg library to make this use case easier". I thought about if there's a minimal projection like API we could somehow express to SnapshotParser#tojson , and I still come back to ultimately, no, this probably is better served by narrow implementations in servers.
While I can understand the desire to avoid reflection, I don't think it necessarily has to be the way. E.g. if you're using something like Jackson, when serializing responses it's possible to specify a TokenFilter in the mapper write, and the token filter implementation could reside in the server and redact the desired fields. I think a lot of the JSON serialization libraries have a similar mechanism. It's probably still best to combine this though with reflection at a DTO level or something in the server, so it's clear at the in-memory representation layer on the server that things are masked,.
I thought about it a lot, i think fair to not include this in the library then,
Definitely, I think may be i should i just write a serializer and register that in my jackson while we do this do we think its valuable to remove writing partition summary from sdk now that we have partition stats ? |
About the change [Updated]
Introduces TableMetadataProjection on top of the TableMetadata.
Use cases:
1/ snapshot summary contains information like total files / total-records etc and when partition summary is enabled it contains partition summary which is a sensitive information incase the table is protected against FGAC specially Row Access Policy, this essentially will help in dropping the snapshot summary by the RESTCatlaog before the table metadata is being sent to the being sent to the client (untrusted) as part of LoadTable response.
2/ Dropping the summary from snapshot obj can reduce the size transfer cost.
Testing
unit tests for w/wo lazy loading and parsing