
Conversation


@wecharyu commented Dec 12, 2025

Rationale for this change

Limit the byte size of a row group when writing Parquet files.

What changes are included in this PR?

Add a new config parquet::WriterProperties::max_row_group_bytes.

Are these changes tested?

Yes, a unit test is added.

Are there any user-facing changes?

Yes, users can set the new config to limit the row group size.
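For illustration, a minimal usage sketch (the builder setter name is assumed to mirror the existing max_row_group_length() setter; treat it as hypothetical until the final API lands):

#include "parquet/properties.h"

// Sketch: cap each row group at 128 MiB of encoded data in addition to
// the existing row-count cap. The max_row_group_bytes() setter is assumed.
std::shared_ptr<parquet::WriterProperties> props =
    parquet::WriterProperties::Builder()
        .max_row_group_length(1024 * 1024)        // existing: rows per group
        ->max_row_group_bytes(128 * 1024 * 1024)  // new: bytes per group
        ->build();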

@wecharyu requested a review from wgtmac as a code owner · December 12, 2025 07:56
@github-actions

⚠️ GitHub issue #48467 has been automatically assigned in GitHub to PR creator.

@tusharbhatt7

Thanks for working on this! Since I'm still new to the Arrow codebase, I reviewed the PR at a high level and it helped me understand how WriterProperties and row group configuration are implemented. I don’t have enough experience yet to provide a full technical review, but the approach looks consistent with the design discussed in the issue.

Thanks again for sharing this!

};

// Max number of rows allowed in a row group.
const int64_t max_row_group_length = this->properties().max_row_group_length();
Author

Not sure whether we should validate that this config value is positive.
If it is set to 0, the writer would never exit the loop.

Member

I think we should validate this in properties.h when it is being set?
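A possible shape for that validation (a sketch only, assuming the setter and member variable follow the existing Builder conventions in properties.h):

Builder* max_row_group_bytes(int64_t max_bytes) {
  // Reject non-positive values up front so the write loop can always make
  // progress (a value of 0 would otherwise never exit the loop).
  if (max_bytes <= 0) {
    throw ParquetException("max_row_group_bytes must be positive, got " +
                           std::to_string(max_bytes));
  }
  max_row_group_bytes_ = max_bytes;
  return this;
}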

@HuaHuaY (Contributor) left a comment

LGTM

row_group_writer_->num_rows();
chunk_size = std::min(
    chunk_size,
    static_cast<int64_t>(this->properties().max_row_group_bytes() / avg_row_size));
@HuaHuaY (Contributor) commented Dec 16, 2025

Will there already be rows written in row_group_writer_? The condition contains row_group_writer_->num_rows() > 0, so I guess the answer is yes. Then why don't we need to subtract those buffered bytes, e.g.:

int64_t buffered_bytes = row_group_writer_->current_buffered_bytes();
double avg_row_bytes = buffered_bytes * 1.0 / group_rows;
chunk_size = std::min(
    chunk_size,
    static_cast<int64_t>((this->properties().max_row_group_bytes() - buffered_bytes) /
                         avg_row_bytes));

Author

Actually, each batch will be written to a new row group; we just use avg_row_bytes to estimate batch_size. The data will not be appended to the existing row group.
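To make that concrete, a hedged sketch of the estimate (total_buffered_bytes() is the renamed accessor discussed further down; the surrounding loop is elided):

// The current group's buffered bytes only serve as a sample for the average
// encoded row size. The next batch opens a fresh row group, so the full
// max_row_group_bytes budget applies and nothing needs to be subtracted.
double avg_row_bytes =
    static_cast<double>(row_group_writer_->total_buffered_bytes()) /
    row_group_writer_->num_rows();
chunk_size = std::min(
    chunk_size,
    static_cast<int64_t>(this->properties().max_row_group_bytes() / avg_row_bytes));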

@HuaHuaY (Contributor) Dec 16, 2025

I get it.

return this;
}

/// Specify the max number of bytes to put in a single row group.
Member

The size is after encoding and compression, right? It would be good to document this.
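If the answer is yes, the docstring could spell that out along these lines (the wording is only a suggestion):

/// Specify the max number of bytes to put in a single row group.
/// The limit is checked against the encoded and compressed size of the
/// buffered data, not against the in-memory Arrow representation.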

return total_compressed_bytes_written;
}

int64_t estimated_buffered_value_bytes() const override {
Member

Suggested change:
- int64_t estimated_buffered_value_bytes() const override {
+ int64_t EstimatedBufferedValueBytes() const override {

This is not a trivial getter, so we need to use the initial-capitalized camel form. Unfortunately, functions like total_compressed_bytes() have already used the wrong form, so it looks confusing. :/

return contents_->total_compressed_bytes_written();
}

int64_t RowGroupWriter::current_buffered_bytes() const {
Member

The function name is a little misleading because readers may think it is the same as contents_->estimated_buffered_value_bytes().

Author

Renamed to total_buffered_bytes().

chunk_size = this->properties().max_row_group_length();
}
// max_row_group_bytes is applied only after the row group has accumulated data.
if (row_group_writer_ != nullptr && row_group_writer_->num_rows() > 0) {
Member

row_group_writer_->num_rows() > 0 can only happen when the current row group writer is in buffered mode. Users calling WriteTable will usually never use buffered mode, so this approach does not seem to work in the majority of cases.
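For context, a sketch of the two call patterns (writer is a parquet::arrow::FileWriter; setup of table and batch elided):

// Unbuffered path: WriteTable() starts a new row group per chunk, so by the
// time the size check runs, row_group_writer_->num_rows() is typically 0.
ARROW_RETURN_NOT_OK(writer->WriteTable(*table, /*chunk_size=*/64 * 1024));

// Buffered path: rows accumulate in the current group across batches, so
// num_rows() > 0 can be observed mid-group.
ARROW_RETURN_NOT_OK(writer->NewBufferedRowGroup());
ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));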

Member

Instead, can we gather this information from all written row groups (if available)?

Author

@wgtmac If the user uses the static WriteTable function, the Arrow FileWriter is always recreated, so we cannot gather the previously written row groups:

Status WriteTable(const ::arrow::Table& table, ::arrow::MemoryPool* pool,
                  std::shared_ptr<::arrow::io::OutputStream> sink, int64_t chunk_size,
                  std::shared_ptr<WriterProperties> properties,
                  std::shared_ptr<ArrowWriterProperties> arrow_properties) {
  std::unique_ptr<FileWriter> writer;
  ARROW_ASSIGN_OR_RAISE(
      writer, FileWriter::Open(*table.schema(), pool, std::move(sink),
                               std::move(properties), std::move(arrow_properties)));
  RETURN_NOT_OK(writer->WriteTable(table, chunk_size));
  return writer->Close();
}

If the user uses the internal WriteTable function, we can get avg_row_bytes from the last row_group_writer_ or by gathering all previous row group writers:

Status WriteTable(const Table& table, int64_t chunk_size) override {
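
A rough sketch of what gathering the estimate from all previously written row groups could look like (num_written_rows_ and num_written_bytes_ are hypothetical members, updated whenever a row group is closed):

// Hypothetical running totals maintained by the writer:
//   num_written_rows_  += closed_group_num_rows;
//   num_written_bytes_ += closed_group_compressed_bytes;
int64_t sample_rows = num_written_rows_;
int64_t sample_bytes = num_written_bytes_;
if (row_group_writer_ != nullptr && row_group_writer_->num_rows() > 0) {
  // Fold in the currently buffered group (buffered mode only).
  sample_rows += row_group_writer_->num_rows();
  sample_bytes += row_group_writer_->total_buffered_bytes();
}
if (sample_rows > 0) {
  double avg_row_bytes = static_cast<double>(sample_bytes) / sample_rows;
  chunk_size = std::min(
      chunk_size,
      static_cast<int64_t>(this->properties().max_row_group_bytes() / avg_row_bytes));
}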

int64_t group_rows = row_group_writer_->num_rows();
int64_t batch_size =
    std::min(max_row_group_length - group_rows, batch.num_rows() - offset);
if (group_rows > 0) {
Member

Similar to my comment above, should we consider all written row groups as well to estimate the average row size?

Author

If we change to use all written row groups, then the first row group's size can only be bounded by max_row_group_length. Is that OK, or should we just use the current row group writer's buffered data?
