Implement Iceberg Kafka Connect with Delta Writer Support in DV Mode for in-batch deduplication #14797
Conversation
bryanck left a comment:
Thanks for the PR.
As you may know, the original (non-Apache) sink had delta writer support that relied on equality deletes. When the sink was contributed to this project, the community decided it was best to remove that functionality, as using it can result in severely degraded performance. This can lead to poor user experience, reflect badly on the project, and increase support-related questions. Also, there are alternative solutions that don't have the same issues.
We should revisit those discussions and resolve those concerns before we proceed with this.
Thanks for the reply!
I believe this feature is absolutely essential.
Thanks for contributing to the Iceberg ecosystem.
For what it's worth, I tested it under a modest CDC load, and it seems to be working fine.
We tested this PR branch with the sink connector properties in Kafka Connect on AWS and found that the table was created as format version 2 in the metadata file. So is this upsert & DV support for version 2 tables only?
Yeah, it's a limitation. I should probably have made a note about it above, but it's somewhat out of scope for this PR. What I ended up doing was writing some code that interfaces with the Iceberg catalog outside of Kafka Connect and initializes the table ahead of time with the correct format version.
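In case it helps others, here is a minimal sketch of that pre-creation step using the Iceberg catalog API. This is illustrative, not the code I actually run: the Hadoop catalog, warehouse path, `db.events` identifier, and schema are all placeholders, and any `Catalog` implementation exposes the same `createTable` call.

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class PreCreateTable {
  public static void main(String[] args) {
    // The schema must match what the connector will write (placeholder fields here).
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "payload", Types.StringType.get()));

    HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3://bucket/warehouse");
    catalog.createTable(
        TableIdentifier.of("db", "events"),
        schema,
        PartitionSpec.unpartitioned(),
        // Format version 3 is what enables DVs; new tables otherwise default to v2.
        Map.of("format-version", "3"));
  }
}
```

Once the table exists with the right format version, the connector simply writes into it.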
@bryanck @t3hw Great to see the progress here! Our team is looking to use this connector for a project that requires Iceberg sink connector upsert/CDC support. Is there any guidance on whether the implementation in this PR will be included in a later release? Any insight into the timeline or priority would be very helpful for our planning.
@bryanck @t3hw Great progress here. We are looking to use the Apache Iceberg sink connector, and we need sink connector upsert functionality. Could you please clarify whether there are plans to support or merge this functionality in the future? If this is on the roadmap, we can proceed with this approach; otherwise, we may need to consider alternative solutions, such as using Flink. It would be helpful to understand the expected direction or future of this PR. Thanks for the guidance.
@bryanck +1 in strong support of this PR. The addition of Delta Writer and upsert/CDC support in the Kafka Connect sink unlocks important production use cases and significantly simplifies incremental and CDC-based data pipelines. Many users, including our team, are looking to adopt Iceberg for CDC/upsert workflows, and having this merged into main would meaningfully improve adoption while reducing long-term maintenance and workarounds.

We've also tested this PR under a sustained load of ~100 TPS for about 30 minutes and observed a stable lag in the range of 7–10 minutes. Notably, the lag did not scale proportionally with load, which is a promising signal for production readiness.

Thanks for the thoughtful work on this. Could the community consider moving this change forward in the near future? We'd be excited to see it included in an upcoming release.
We need the ability to have updates merged, and this PR solves that. I understand that review and merge can take time, so I'm not asking to rush it, but we would really benefit from feedback from the community. We are blocked on our implementation journey pending this PR merge, hence asking whether this is confirmed to be on the roadmap or not. Appreciate your urgent attention and everything you and the other contributors do. Thank you!
@bryanck could you review this and, if you're comfortable, approve? We're seeing growing interest from teams that manage their own Kafka infrastructure in using this sink. While our Data Platform teams can achieve similar outcomes with Spark Streaming or the older Iceberg sink (https://github.com/databricks/iceberg-kafka-connect), I believe the current Iceberg sink simplifies day-to-day work for a lot of acquisition teams.
@hladush This is something the Iceberg community has decided, including PMC members, so we'd need to get the community on board in order to proceed. You can raise this as a topic on the dev list (again) or in a community sync if you want. |
…the CdcConstants from the SMT classes.
@bryanck thanks a lot for your response, could you please share any docs on how to do that?
Added a small clarification to the PR description.
Pasting my comment from another discussion about this PR: If this gets accepted, the documentation should be updated to let users know that compaction is highly recommended. Given that equality deletes + compaction is how Flink handles CDC (and is the industry standard), would this approach be acceptable, provided the documentation makes the compaction requirement explicit?
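To make the compaction requirement concrete, here is a minimal sketch of a periodic rewrite job using Iceberg's Spark actions. The `db.events` table name and the option value are illustrative assumptions, not anything defined by this PR.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class CompactCdcTable {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-cdc-compaction")
        .getOrCreate();

    Table table = Spark3Util.loadIcebergTable(spark, "db.events");

    // Rewrites small files and applies accumulated deletes to the rewritten
    // data files, bounding the read amplification from equality deletes.
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option("delete-file-threshold", "1") // rewrite any file that has deletes
        .execute();
  }
}
```

Running something like this on a schedule is what keeps a CDC table readable over time.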
Thank you @t3hw for the quick fix.
Removed the use-dv property in favor of automatic format-version detection (cc @hladush @rajansadasivan) |
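For context, a hypothetical sketch of what such detection can look like (this is illustrative, not the PR's actual code): deletion vectors are only defined from format version 3 onward, so the writer can key off the table metadata.

```java
import org.apache.iceberg.BaseTable;
import org.apache.iceberg.Table;

class FormatVersionCheck {
  // DVs require format version 3+; older tables must fall back
  // to position/equality delete files.
  static boolean supportsDeletionVectors(Table table) {
    return table instanceof BaseTable
        && ((BaseTable) table).operations().current().formatVersion() >= 3;
  }
}
```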
Force-pushed from 2dc6b8f to 6f410bd
Force-pushed from 6f410bd to 6e5d469
Introduce Delta Writer functionality for both unpartitioned and partitioned tables, enabling CDC and upsert modes. Enhance configuration options for CDC fields, upsert mode, and DV usage.
Inspired by #12070
Resolves #10842
@bryanck
edit:
"Using DVs for CDC": DVs only help with in-batch deduplication; out-of-batch deletes/updates fall back to equality deletes.
Partitioning the table is highly recommended, and periodically compacting the table when using CDC mode is mandatory.