
Conversation

@sapienza88
Contributor

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

(For example: This pull request implements the sync for delta format.)

Brief change log

(for example:)

  • Fixed JSON parsing error when persisting state
  • Added unit tests for schema evolution

Verify this pull request

(Please pick one of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added TestConversionController to verify the change.
  • Manually verified the change by running a job locally.

@sapienza88 sapienza88 changed the title from "Parquet Incremental Sync: Given a parquet file return data from a certain modification time" to "Parquet Incremental Sync" on Dec 10, 2025
@rahil-c
Contributor

rahil-c commented Dec 15, 2025

I can do the first review for this @the-other-tim-brown @vinishjail97

@vinishjail97 vinishjail97 self-requested a review December 16, 2025 08:31
Comment on lines 245 to 259
try (ParquetWriter<Group> writer =
    new ParquetWriter<Group>(
        outputFile,
        new GroupWriteSupport(),
        parquetFileConfig.getCodec(),
        (int) parquetFileConfig.getRowGroupSize(),
        pageSize,
        pageSize, // dictionaryPageSize
        true, // enableDictionary
        false, // enableValidation
        ParquetWriter.DEFAULT_WRITER_VERSION,
        conf)) {
  Group currentGroup = null;
  while ((currentGroup = (Group) reader.read()) != null) {
    writer.write(currentGroup);
Contributor


Why are we writing new parquet files again like this through the writer? I think there's some misunderstanding of the parquet incremental sync feature here.

Parquet Incremental Sync Requirements.

  1. You have a target table where parquet files [p1/f1.parquet, p1/f2.parquet, p2/f1.parquet] have been synced to hudi, iceberg and delta for example.
  2. In the source, some changes have been made: a new file in partition p1 was added and p2's file was deleted. The incremental sync should now sync these new changes incrementally.

@sapienza88 It's better to align on the approach first here before we push PRs. Can you add the approach for parquet incremental sync to the PR description or a Google doc if possible?
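For illustration only, here is a minimal sketch of the file-level diff this requirement implies, using the example paths above (plain Java collections; nothing here is XTable API):

import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the requirement: compare the file set at the
// last sync with the current listing and derive what needs to be synced.
public class IncrementalDiffSketch {
  public static void main(String[] args) {
    Set<String> lastSynced = Set.of("p1/f1.parquet", "p1/f2.parquet", "p2/f1.parquet");
    Set<String> current = Set.of("p1/f1.parquet", "p1/f2.parquet", "p1/f3.parquet");

    Set<String> added = new HashSet<>(current);
    added.removeAll(lastSynced); // p1/f3.parquet -> generate metadata for it
    Set<String> deleted = new HashSet<>(lastSynced);
    deleted.removeAll(current); // p2/f1.parquet -> mark as removed in metadata

    System.out.println("added=" + added + " deleted=" + deleted);
  }
}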

Contributor

@vinishjail97 vinishjail97 Dec 22, 2025


@sapienza88 XTable shouldn't be writing any new data or parquet files; it operates at a metadata level. Can you see this comment for reference?
#550 (comment)
Fetch the parquet files that have been added since the last syncInstant to retrieve the change log. We can do this via the same list call; filtering files based on their creationTime is the simplest way, but it's expensive.
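A minimal sketch of that list-and-filter approach with the Hadoop FileSystem API, assuming a hypothetical lastSyncInstantMillis watermark (not XTable's actual API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Hypothetical helper: retrieve the change log by listing the table path and
// keeping parquet files newer than the last sync instant. On object stores the
// modification time of an immutable object is effectively its creation time.
public class ParquetChangeLogSketch {
  public static List<FileStatus> filesAddedSince(
      Configuration conf, Path basePath, long lastSyncInstantMillis) throws IOException {
    FileSystem fs = basePath.getFileSystem(conf);
    List<FileStatus> changed = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(basePath, true /* recursive */);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      if (status.getPath().getName().endsWith(".parquet")
          && status.getModificationTime() > lastSyncInstantMillis) {
        changed.add(status);
      }
    }
    return changed;
  }
}

This is the single-list-call version: it stays O(files in the table) per sync, which is the cost concern mentioned above.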

Contributor Author

@sapienza88 sapienza88 Dec 23, 2025


@vinishjail97 thanks for the suggestion, but that isn't helping. Could you elaborate on that idea, and on how you would manage only the metadata for the task of retrieving data from a particular modification time? At the very least, the current ConversionSource wasn't coded with that in mind.

@sapienza88
Contributor Author

@vinishjail97 I added some comments on the functions so that the approach is clearer. All of the above suggestions were also taken into account in my last commit.

@vinishjail97
Contributor

XTable shouldn't be writing any new data or parquet files; it operates at a metadata level. Can you see this comment for reference? I had written a few approaches for doing incremental parquet sync.
#550 (comment)

@vinishjail97
Contributor

@sapienza88 I'm adding a more detailed design and a class-level structure to unblock this PR.

Design Principle
XTable operates at a metadata level only. The current PR approach of writing new Parquet files with filtered data is incorrect. XTable should:

  • Discover existing Parquet files from storage
  • Generate table format metadata (Hudi, Iceberg, Delta) for those files
  • NEVER write new Parquet files or transform data.

Architecture

  ┌────────────────────────────────────────────────────────────┐
  │                  ParquetConversionSource                   │
  │  - Uses ParquetFileDiscovery to find files                 │
  │  - Converts file metadata to InternalDataFile              │
  │  - Returns snapshots and table changes                     │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │              ParquetFileDiscovery (new class)              │
  │  - Lists all .parquet files from filesystem                │
  │  - Filters files by modification time                      │
  │  - Returns lightweight file metadata                       │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │            FileSystem (HDFS/S3/GCS/Azure)                  │
  │  - fs.listFiles(basePath, recursive=true)                  │
  └────────────────────────────────────────────────────────────┘

Use file modification time as the commit identifier; that way you will be able to identify which files have been synced and which haven't. The files that haven't been synced need metadata generated for them. Future functionality, like optimizing this and handling parquet files deleted from storage, can be added incrementally; I'm hoping to keep the scope low for this PR.
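A rough sketch of what the ParquetFileDiscovery layer could look like under this design (the FileMetadata holder and the method names are assumptions for illustration, not a final API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch of the discovery layer: lists files, filters by modification time,
// and returns lightweight metadata only. No data is read or written.
public class ParquetFileDiscovery {
  // Hypothetical lightweight holder consumed by ParquetConversionSource.
  public static class FileMetadata {
    public final Path path;
    public final long sizeBytes;
    public final long modificationTime; // doubles as the commit identifier

    FileMetadata(Path path, long sizeBytes, long modificationTime) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.modificationTime = modificationTime;
    }
  }

  private final FileSystem fs;

  public ParquetFileDiscovery(FileSystem fs) {
    this.fs = fs;
  }

  // Files with modificationTime > sinceModificationTime are "unsynced" and
  // need table-format metadata generated for them.
  public List<FileMetadata> discoverFiles(Path basePath, long sinceModificationTime)
      throws IOException {
    List<FileMetadata> result = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(basePath, true /* recursive */);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      if (status.getPath().getName().endsWith(".parquet")
          && status.getModificationTime() > sinceModificationTime) {
        result.add(
            new FileMetadata(
                status.getPath(), status.getLen(), status.getModificationTime()));
      }
    }
    return result;
  }
}

ParquetConversionSource would then map each FileMetadata to an InternalDataFile, and the maximum modification time seen in a run could serve as the sync instant for the next incremental run.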

Contributor

@vinishjail97 vinishjail97 left a comment


@sapienza88 Should I push a scaffolding PR for basic functionality with incremental sync and you can take it forward from there?

return parquetFileConfig;
}
// TODO add safeguards for possible empty parquet files
// append a file (merges two files into one .parquet under a partition folder)
Contributor


Why are we merging two parquet files into one? We are not building a compaction service, right?

Comment on lines 84 to 85
// write the initial table with the appended file to add into the outputPath
writer.start();
Contributor


This is again writing actual data?

@sapienza88
Contributor Author

@sapienza88 Should I push a scaffolding PR for basic functionality with incremental sync and you can take it forward from there?

No, I'd rather do it from scratch.

@sapienza88
Contributor Author

@vinishjail97 @the-other-tim-brown please check the latest changes for a review. Thanks.
