[arrow-avro] Add AsyncWriter #9241

GaneshPatil7517 · 2026-01-21T15:16:13Z

Summary

This PR implements a fully functional async Avro writer for arrow-avro, providing a symmetric and idiomatic async API that mirrors the existing synchronous Writer while following Arrow's established async patterns (consistent with Parquet's async writer).

Fixes: #9212

Design Overview

New Types

AsyncFileWriter trait: Minimal abstraction for async I/O sinks
- write(Bytes) -> BoxFuture<Result<()>>
- complete() -> BoxFuture<Result<()>>
- Blanket impl for tokio::io::AsyncWrite + Unpin + Send
- Matches Parquet async_writer pattern
AsyncWriter<W, F> struct: Generic async writer
- W: Any AsyncFileWriter (tokio types, custom implementations, etc.)
- F: Any AvroFormat (OCF or SOE)
- Full API parity with sync Writer
Type aliases:
- AsyncAvroWriter<W> - OCF (Object Container File) format
- AsyncAvroStreamWriter<W> - SOE (Single Object Encoding) format
AsyncWriterBuilder: Configuration builder
- with_compression() - All codecs (Deflate, Snappy, ZStandard, etc.)
- with_fingerprint_strategy() - SOE fingerprinting
- with_capacity() - Buffer sizing
- Async build() method

Key Implementation Details

Sync Encoding Reuse: Leverages existing RecordEncoder - no re-implementation of Avro encoding
Buffer Staging: Encodes to Vec<u8>, converts to Bytes, flushes asynchronously
Header on Construction: OCF headers written and flushed immediately in builder
Compression Preserved: Identical logic to sync writer for compression application
Feature Gated: Requires async feature with tokio, futures, bytes dependencies

API Parity

The async writer provides identical methods to the sync writer:

// Create writers
let mut writer = AsyncAvroWriter::new(sink, schema).await?;
let mut writer = AsyncAvroStreamWriter::new(sink, schema).await?;

// Write batches
writer.write(&batch).await?;
writer.write_batches(&[&batch1, &batch2]).await?;

// Finish and retrieve sink
writer.finish().await?;
let sink = writer.into_inner();

Test Coverage

7 comprehensive tests covering:

✅ OCF round-trip with sync reader
✅ SOE stream writing
✅ Multiple batch accumulation
✅ Builder configuration
✅ Writer consumption with into_inner()
✅ Schema mismatch error handling
✅ Deflate compression

All tests verify data integrity through round-trip with sync ReaderBuilder, not byte-for-byte equality (OCF sync markers are random).

Feature Gating

[features]
async = ["tokio", "futures", "bytes"]

[dependencies]
tokio = { version = "1", features = ["io-util"], optional = true }
futures = { version = "0.3.31", optional = true }
bytes = { version = "1.10.1", optional = true }

All async code is guarded with #[cfg(feature = "async")].

Commits

[arrow-avro] add async writer module with feature gating - Core implementation
[arrow-avro] add comprehensive async writer tests - Test coverage
[arrow-avro] format code with rustfmt - Formatting
[arrow-avro] improve async writer documentation - Docs

Testing

# Run async writer tests
cargo test -p arrow-avro --lib --features async async_writer

# Build with all features
cargo build -p arrow-avro --all-features

# Check documentation
cargo doc -p arrow-avro --features async --no-deps

All tests pass: ✅
Clippy: No warnings ✅
Rustfmt: Clean ✅

Files Modified

arrow-avro/Cargo.toml - Added async feature and dependencies
arrow-avro/src/writer/mod.rs - Module exports
arrow-avro/src/writer/async_writer.rs - New module (486 lines, 7 tests)

Future Work

Optional object_store feature for cloud storage integration (S3, GCS, Azure)
- Would follow same pattern as Parquet's ParquetObjectWriter
- Can be added in follow-up PR

Example Usage

use arrow_avro::writer::AsyncAvroWriter;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create("output.avro").await?;
    let mut writer = AsyncAvroWriter::new(file, schema).await?;
    
    writer.write(&batch1).await?;
    writer.write(&batch2).await?;
    
    writer.finish().await?;
    Ok(())
}

References

Issue: [arrow-avro] Add AsyncWriter #9212
Similar async pattern: Parquet AsyncArrowWriter (parquet/src/arrow/async_writer/)
Sync API: arrow-avro/src/writer/mod.rs
Related: AsyncAvroReader PR feat: Implement an AsyncReader for avro using ObjectStore #8930

- Introduce AsyncFileWriter trait for async sink abstraction - Implement AsyncWriter<W, F> generic over async sink and format - Provide AsyncAvroWriter and AsyncAvroStreamWriter type aliases - Support OCF and SOE formats with identical API to sync Writer - Add AsyncWriterBuilder for configuration - Include comprehensive tests for OCF, SOE, and batch writing - Gate behind 'async' feature with tokio/futures/bytes dependencies

- Test OCF and stream writing modes - Test multiple batch writing with write_batches - Test builder configuration and capacity settings - Test schema mismatch error handling - Test deflate compression with conditional feature gate - Test into_inner to verify writer consumption

- Add comprehensive feature list in module docs - Add note about future object_store integration - Update code formatting

Copilot

Pull request overview

This PR adds a comprehensive async writer API for the arrow-avro crate, providing an idiomatic async counterpart to the existing synchronous writer. The implementation mirrors the sync writer's API while following established Arrow async patterns (consistent with Parquet's async writer).

Changes:

Added async writer feature with tokio, futures, and bytes dependencies
Implemented AsyncFileWriter trait and AsyncWriter generic struct with type aliases for OCF and SOE formats
Added 7 comprehensive tests covering OCF/SOE round-trips, multiple batches, compression, and error handling

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
arrow-avro/Cargo.toml	Added `async` feature with optional dependencies for tokio, futures, and bytes
arrow-avro/src/writer/mod.rs	Added feature-gated public exports for async writer types
arrow-avro/src/writer/async_writer.rs	New module implementing full async writer API with trait, builder, writer struct, and comprehensive tests

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

arrow-avro/src/writer/async_writer.rs

arrow-avro/Cargo.toml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- Add detailed schema resolution behavior documentation - Explain metadata key usage and fallback to conversion - Document default settings and customization options - Match documentation level with sync WriterBuilder

jecsand838

@GaneshPatil7517

Thanks you so much for getting this up so quickly! I went ahead and left some initial comments, but overall this is a really solid start.

jecsand838 · 2026-01-24T08:28:37Z

arrow-avro/src/writer/async_writer.rs

+pub struct AsyncWriterBuilder {
+    schema: Schema,
+    codec: Option<CompressionCodec>,
+    capacity: usize,
+    fingerprint_strategy: Option<FingerprintStrategy>,
+}


Maintainability thought: AsyncWriterBuilder duplicates the same knobs/fields as the existing sync WriterBuilder (schema/codec/capacity/fingerprint_strategy).

To avoid future drift, could we factor shared builder logic, or add something like WriterBuilder::build_async(...) in mod.rs behind cfg(feature="async") so there is a single source of truth for configuration + schema selection?

jecsand838 · 2026-01-24T08:31:16Z

arrow-avro/src/writer/async_writer.rs

+    async fn test_async_writer_into_inner() -> Result<(), Box<dyn std::error::Error>> {
+        let schema = Schema::new(vec![Field::new("id", DataType::Int32, false)]);
+        let batch = RecordBatch::try_new(
+            Arc::new(schema.clone()),
+            vec![Arc::new(Int32Array::from(vec![99])) as ArrayRef],
+        )?;
+
+        let mut buffer = Vec::new();
+        {
+            let mut writer = AsyncAvroWriter::new(&mut buffer, schema).await?;
+            writer.write(&batch).await?;
+            writer.finish().await?;
+        }
+
+        assert!(!buffer.is_empty());
+        Ok(())
+    }


This test is named *_into_inner but never calls into_inner. Would you be open to either renaming it (e.g. test_async_writer) or updating it to exercise into_inner?

One way is to use an owned test sink (custom AsyncFileWriter that collects Bytes) and assert that writer.into_inner() returns it with non-empty contents.

jecsand838 · 2026-01-24T08:33:05Z

arrow-avro/src/writer/async_writer.rs

+    #[tokio::test]
+    async fn test_async_avro_stream_writer() -> Result<(), Box<dyn std::error::Error>> {
+        let schema = Schema::new(vec![Field::new("x", DataType::Int32, false)]);
+
+        let batch = RecordBatch::try_new(
+            Arc::new(schema.clone()),
+            vec![Arc::new(Int32Array::from(vec![10, 20, 30])) as ArrayRef],
+        )?;
+
+        let mut buffer = Vec::new();
+        let mut writer = AsyncAvroStreamWriter::new(&mut buffer, schema).await?;
+        writer.write(&batch).await?;
+        writer.finish().await?;
+
+        assert!(!buffer.is_empty());
+        Ok(())
+    }


This SOE test is currently a non-empty buffer smoke test. What do you think about strengthening it to catch regressions in SOE framing by asserting on exact matches?

Ideas:

Validate the stream prefix matches the expected SOE magic + fingerprint (e.g. compare against schema::SINGLE_OBJECT_MAGIC and ensure the fingerprint bytes are present), and/or

Do a real round-trip decode using the existing streaming decode path (Decoder + SchemaStore) with the writer schema registered.

jecsand838 · 2026-01-24T08:36:16Z

arrow-avro/src/writer/async_writer.rs

+    async fn write_ocf_block(
+        &mut self,
+        batch: &RecordBatch,
+        sync: &[u8; 16],
+    ) -> Result<(), ArrowError> {
+        let mut buf = Vec::<u8>::with_capacity(self.capacity);
+        self.encoder.encode(&mut buf, batch)?;
+        let encoded = match self.compression {
+            Some(codec) => codec.compress(&buf)?,
+            None => buf,
+        };
+
+        let mut block_buf = Vec::<u8>::new();
+        write_long(&mut block_buf, batch.num_rows() as i64)?;
+        write_long(&mut block_buf, encoded.len() as i64)?;
+        block_buf.extend_from_slice(&encoded);
+        block_buf.extend_from_slice(sync);
+
+        self.writer.write(Bytes::from(block_buf)).await
+    }


write_ocf_block currently does a full memcpy of the encoded/compressed payload via block_buf.extend_from_slice(&encoded). For large batches this can significantly increase peak memory (encode/compress buffer + block_buf copy) and adds extra CPU.

What do you think about writing in multiple chunks (header bytes, then Bytes::from(encoded) to move without copy, then sync marker) to avoid copying, instead of re-buffering the entire block into block_buf?

jecsand838 · 2026-01-24T08:37:44Z

arrow-avro/src/writer/async_writer.rs

+        let mut header_buf = Vec::<u8>::with_capacity(256);
+        format.start_stream(&mut header_buf, &schema, self.codec)?;
+        writer.write(Bytes::from(header_buf)).await?;


Minor nit: for formats where start_stream writes no header bytes, this will still call write with an empty buffer. Might be worth guarding with if !header_buf.is_empty() to avoid a no-op write (some sinks treat zero-length writes oddly).

jecsand838 · 2026-01-24T08:40:48Z

arrow-avro/src/writer/async_writer.rs

+/// The asynchronous interface used by [`AsyncWriter`] to write Avro files.
+pub trait AsyncFileWriter: Send {
+    /// Write the provided bytes to the underlying writer
+    fn write(&mut self, bs: Bytes) -> BoxFuture<'_, Result<(), ArrowError>>;
+
+    /// Flush any buffered data and finish the writing process.
+    ///
+    /// After `complete` returns `Ok(())`, the caller SHOULD not call write again.
+    fn complete(&mut self) -> BoxFuture<'_, Result<(), ArrowError>>;
+}


Maybe worth expanding the trait docs a bit to clarify intended semantics imo.
Specifically that implementations may buffer internally (or write immediately), and may implement retry logic, and that write is expected to append all bytes or return an error.

This becomes especially important once we add an object_store-backed implementation.

jecsand838 · 2026-01-24T08:44:37Z

arrow-avro/src/writer/async_writer.rs

+    /// Create a new builder with default settings.
+    ///
+    /// The Avro schema used for writing is determined as follows:
+    /// 1) If the Arrow schema metadata contains `avro::schema` (see `SCHEMA_METADATA_KEY`),


Suggested change

/// 1) If the Arrow schema metadata contains `avro::schema` (see `SCHEMA_METADATA_KEY`),

/// 1) If the Arrow schema metadata contains `avro.schema` (see `SCHEMA_METADATA_KEY`),

jecsand838 · 2026-01-24T08:46:07Z

arrow-avro/src/writer/async_writer.rs

+//! ```no_run
+//! use std::sync::Arc;
+//! use arrow_array::{ArrayRef, Int64Array, RecordBatch};
+//! use arrow_schema::{DataType, Field, Schema};
+//! use arrow_avro::writer::AsyncAvroWriter;
+//! use bytes::Bytes;


use bytes::Bytes; appears unused in the example. Also I'd recommend we make all of our examples runnable imo.

Suggested change

//! ```no_run

//! use std::sync::Arc;

//! use arrow_array::{ArrayRef, Int64Array, RecordBatch};

//! use arrow_schema::{DataType, Field, Schema};

//! use arrow_avro::writer::AsyncAvroWriter;

//! use bytes::Bytes;

//! ```

//! use std::sync::Arc;

//! use arrow_array::{ArrayRef, Int64Array, RecordBatch};

//! use arrow_schema::{DataType, Field, Schema};

//! use arrow_avro::writer::AsyncAvroWriter;

GaneshPatil7517 added 4 commits January 21, 2026 15:56

[arrow-avro] format code with rustfmt

cde1239

[arrow-avro] improve async writer documentation

c4c2274

- Add comprehensive feature list in module docs - Add note about future object_store integration - Update code formatting

Copilot AI review requested due to automatic review settings January 21, 2026 15:16

Merge branch 'main' into arrow-avro-async-writer

97d5c11

github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Jan 21, 2026

Copilot started reviewing on behalf of GaneshPatil7517 January 21, 2026 15:20 View session

Copilot AI reviewed Jan 21, 2026

View reviewed changes

arrow-avro/src/writer/async_writer.rs Outdated Show resolved Hide resolved

arrow-avro/src/writer/async_writer.rs Show resolved Hide resolved

arrow-avro/Cargo.toml Outdated Show resolved Hide resolved

GaneshPatil7517 and others added 4 commits January 21, 2026 21:14

Update arrow-avro/Cargo.toml

e8c647b

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update arrow-avro/src/writer/async_writer.rs

c677bcb

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[arrow-avro] improve AsyncWriterBuilder documentation

bf9d65f

- Add detailed schema resolution behavior documentation - Explain metadata key usage and fallback to conversion - Document default settings and customization options - Match documentation level with sync WriterBuilder

Merge branch 'main' into arrow-avro-async-writer

3520299

jecsand838 reviewed Jan 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[arrow-avro] Add AsyncWriter #9241

[arrow-avro] Add AsyncWriter #9241

GaneshPatil7517 commented Jan 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jecsand838 left a comment

Uh oh!

jecsand838 Jan 24, 2026

Uh oh!

jecsand838 Jan 24, 2026

Uh oh!

jecsand838 Jan 24, 2026

Uh oh!

jecsand838 Jan 24, 2026

Uh oh!

jecsand838 Jan 24, 2026

Uh oh!

jecsand838 Jan 24, 2026

Uh oh!

jecsand838 Jan 24, 2026

Uh oh!

jecsand838 Jan 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	/// 1) If the Arrow schema metadata contains `avro::schema` (see `SCHEMA_METADATA_KEY`),
	/// 1) If the Arrow schema metadata contains `avro.schema` (see `SCHEMA_METADATA_KEY`),

[arrow-avro] Add AsyncWriter #9241

Are you sure you want to change the base?

[arrow-avro] Add AsyncWriter #9241

Conversation

GaneshPatil7517 commented Jan 21, 2026

Summary

Design Overview

New Types

Key Implementation Details

API Parity

Test Coverage

Feature Gating

Commits

Testing

Files Modified

Future Work

Example Usage

References

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jecsand838 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants