feat: Implement an AsyncReader for avro using ObjectStore #8930
Conversation
jecsand838 left a comment:
Flushing a partial review with some high level thoughts.
I'll wait for you to finish before resuming.
Honestly I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved, as it's a public API and I don't want it to be volatile.
100% I'm working on that right now and won't stop until I have a PR. That was a solid catch. The schema logic is an area of the code I've been meaning to fully refactor (or would welcome a full refactor of). I knew it would eventually come back.
Sorry, I haven't dropped it, just found myself in a really busy week! The generic reader support does not seem too hard to implement from the dabbling I've done, and I still need to get to the builder pattern change.
…, separate object store file reader into a featuregated struct and use a generic async file reader trait
@jecsand838 I believe this is now ready for a proper review.
@EmilyMatt Thank you so much for getting these changes up!
I left a few comments. Let me know what you think.
EDIT: Should have mentioned that this is looking really good overall and I'm very excited for the AsyncReader!
@jecsand838 and @EmilyMatt -- how is this PR looking?
I had actually just returned to work on it two days ago. I'm still having some issues with the schema now being provided, due to the problems I've described; @jecsand838 suggested removing the Arrow schema, and I'm starting to think that is the only viable way for now.
Hope to push another version today and address some of the things above.
I get a failure with this test:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arrow_schema::{DataType, Field, Schema, SchemaRef};

fn get_int_array_schema() -> SchemaRef {
    let schema = Schema::new(vec![Field::new(
        "int_array",
        DataType::List(Arc::new(Field::new("element", DataType::Int32, true))),
        true,
    )])
    .with_metadata(HashMap::from([("avro.name".into(), "table".into())]));
    Arc::new(schema)
}

// `arrow_test_data` and `read_async_file` are helpers from the surrounding test module.
#[tokio::test]
async fn test_bad_varint_bug() {
    let file = arrow_test_data("avro/bad-varint-bug.avro");
    let schema = get_int_array_schema();
    let batches = read_async_file(&file, 1024, None, schema).await.unwrap();
    let _batch = &batches[0];
}
```

The Avro file, readable by Spark: bad-varint-bug.avro.gz
I have checked that the Avro file is readable with Python avro 1.12.1.
I don't think this is a bug in the async reader, and I can confirm the following test passes. The issue is probably in the `AvroSchema::from` conversion.
My test provides the Arrow reader schema and the top-level Avro record name in the metadata, which should be sufficient. |
It is not necessarily sufficient, but you should open a bug for this, since a reader schema with nullables and a writer schema with non-nullables should be compatible.
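For concreteness, a sketch of the kind of schema pair this rule covers (the schemas below are illustrative assumptions, not the ones from the failing file):

```rust
// Hypothetical writer/reader schema pair illustrating the rule above:
// Avro schema resolution lets a writer's non-union type match a reader's
// union when one branch matches, so the writer's non-nullable "int" items
// should resolve against the reader's ["null", "int"] items.
fn main() {
    let writer_schema = r#"{"type": "record", "name": "table", "fields": [
        {"name": "int_array", "type": {"type": "array", "items": "int"}}
    ]}"#;
    let reader_schema = r#"{"type": "record", "name": "table", "fields": [
        {"name": "int_array", "type": {"type": "array", "items": ["null", "int"]}}
    ]}"#;
    println!("writer: {writer_schema}\nreader: {reader_schema}");
}
```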
I think this is a schema resolution bug based on a quick glance over the details you provided. That being said, there are limitations with using an Arrow schema as the reader schema.
The biggest challenge to overcome relates to the lossy behavior inherent to Arrow -> Avro schema conversion, i.e. Arrow not having the concept of named types, etc.
100%, it's absolutely not related to this PR. Sorry about not jumping in sooner to call that out. As an aside, I just created #9233, which proposes an approach for modularizing `arrow-avro`.
jecsand838 left a comment:
@EmilyMatt This looks good! I left some final feedback and recommendations, but I think it's at a place to re-run the CI/CD jobs if you want to follow up on these. Once the pipelines pass, I'll approve.
CC: @alamb
```rust
// If a projection exists, project the reader schema; if no reader schema is
// provided, parse it from the header (i.e. the raw writer schema) and project
// that. The projected schema becomes the schema used for reading.
let projected_reader_schema = self
    .projection
    .as_deref()
    .map(|projection| {
        let base_schema = if let Some(reader_schema) = &self.reader_schema {
            reader_schema.clone()
        } else {
            let raw = header.get(SCHEMA_METADATA_KEY).ok_or_else(|| {
                ArrowError::ParseError("No Avro schema present in file header".to_string())
            })?;
            let json_string = std::str::from_utf8(raw)
                .map_err(|e| {
                    ArrowError::ParseError(format!("Invalid UTF-8 in Avro schema header: {e}"))
                })?
                .to_string();
            AvroSchema::new(json_string)
        };
        base_schema.project(projection)
    })
    .transpose()?;
```
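As a usage illustration of this path, a minimal sketch (the builder and method names are assumptions based on this thread, not necessarily the final API):

```rust
// Hypothetical: project to the first and third columns. With no reader
// schema supplied, the writer schema parsed from the file header is
// projected instead, per the logic above.
let stream = ReaderBuilder::new(async_file_reader)
    .with_projection(vec![0, 2])
    .build()
    .await?;
```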
Added tests. Not all the error cases are covered in the builder, but it looks better now.
@jecsand838 I've removed the parquet changes and synced with main. I believe this is ready for last reviews and test runs before being merged. CC: @alamb
I started the tests.
Thx, most failures were technicalities; I believe I fixed all of them.
jecsand838 left a comment:
@EmilyMatt LGTM!
At this point I'm fine with anything else that comes up being a follow-up issue if you are, @mzabaluev (unless it's major).
I just left a few final comments related to improving the docs for this PR.
````rust
//! is enabled, [`AvroObjectReader`] provides integration with object storage services
//! such as S3 via the [object_store] crate.
//!
//! ```ignore
````
Let's make this runnable
````diff
-//! ```ignore
+//! ```
````
I don't know how to do it without failing the tests, because all the code here is feature-gated, and the doctests also run without features enabled.
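One general rustdoc workaround (a sketch, not necessarily what was adopted here) is to hide a `cfg`-gated block inside the doctest, so the body only compiles when the feature is enabled and the test degenerates to an empty `main` otherwise:

````rust
//! ```
//! # #[cfg(feature = "object_store")]
//! # {
//! // Imports of feature-gated items live inside the gated block, so the
//! // doctest still compiles (as an empty body) when the feature is off.
//! use arrow_avro::reader::async_reader::ReaderBuilder; // path assumed
//! // ... example body using the object_store integration ...
//! # }
//! ```
````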
```rust
/// An asynchronous Avro file reader that implements `Stream<Item = Result<RecordBatch, ArrowError>>`.
///
/// This uses an [`AsyncFileReader`] to fetch data ranges as needed, starting with the header,
/// then reading all the blocks in the provided range, where it:
/// 1. Reads and decodes data until the header is fully decoded.
/// 2. Searches from `range.start` for the first sync marker and starts with the following block
///    (if `range.start` is less than the header length, it starts at the header length minus the sync marker bytes).
/// 3. Reads blocks sequentially, decoding them into `RecordBatch`es.
/// 4. If a block is incomplete (because the range ends mid-block), fetches the remaining bytes from the [`AsyncFileReader`].
/// 5. If no range was originally provided, reads the full file.
/// 6. If the range is empty, `file_size` is 0, or `range.end` is less than the header length, finishes immediately.
pub struct AsyncAvroFileReader<R> {
```
I'd recommend adding a good runnable example for AsyncAvroFileReader here.
Added. Also moved everything to use `AvroError`.
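The requested example presumably resembles something like the following sketch (type names follow this thread; the method names, signatures, and error type are assumptions):

```rust
use futures::TryStreamExt;

// Hypothetical end-to-end use: build the reader from any `AsyncFileReader`
// and drain the stream of decoded `RecordBatch`es.
async fn read_avro<R: AsyncFileReader>(reader: R) -> Result<(), ArrowError> {
    let mut stream = ReaderBuilder::new(reader).build().await?;
    while let Some(batch) = stream.try_next().await? {
        println!("decoded {} rows", batch.num_rows());
    }
    Ok(())
}
```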
```text
# Conflicts:
#   arrow-avro/Cargo.toml
#   arrow-avro/src/reader/mod.rs
```
```rust
const DEFAULT_HEADER_SIZE_HINT: u64 = 16 * 1024; // 16 KB

/// Builder for an asynchronous Avro file reader.
pub struct AsyncAvroFileReaderBuilder<R> {
```
This name is unwieldy, though it does not need to be imported.
I'd rather do the idiomatic Rust thing: expose the module as public and export the builder with a terse name under the module path: `crate::reader::async_reader::ReaderBuilder`.
This is also because I want to add another builder typestate to this API in a follow-up PR, and I don't want its name to be even longer.
This is already kind of what's happening. Renamed the builder to `ReaderBuilder`.
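A sketch of the layout being settled on (file paths assumed):

```rust
// arrow-avro/src/reader/mod.rs (assumed): the async module is public, so the
// terse builder name is disambiguated by its module path rather than a prefix.
pub mod async_reader;

// arrow-avro/src/reader/async_reader.rs:
pub struct ReaderBuilder<R> { /* ... */ }

// Callers then write:
// use arrow_avro::reader::async_reader::ReaderBuilder;
```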
```diff
 let current_data = self.reader.get_bytes(range_to_fetch.clone()).await.map_err(|err| {
     AvroError::General(format!(
-        "Error fetching Avro header from object store: {err}"
+        "Error fetching Avro header from object store(range: {range_to_fetch:?}): {err}"
```
This looks like a debugging artefact; add a space after "store" at least?
Which issue does this PR close?
Rationale for this change
Allows for proper file splitting within an asynchronous context.
What changes are included in this PR?
The raw implementation, allowing for file splitting, starting mid-block (reading until a sync marker is found), and further reading until the end of the block is found.
This reader currently requires that a `reader_schema` be provided if type promotion, schema evolution, or projection is desired.
This is done because #8928 currently blocks proper parsing from an Arrow `Schema`.
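To make the splitting contract concrete, a hedged illustration (the numbers and variable names are hypothetical):

```rust
fn main() {
    // Two workers read disjoint byte ranges of one file. Each scans forward
    // from `range.start` to the first sync marker, then decodes whole blocks
    // until it finishes the block containing `range.end`, so every block is
    // decoded by exactly one worker.
    let file_size: u64 = 10_000_000;
    let mid = file_size / 2;
    let worker_ranges = [0..mid, mid..file_size];
    println!("{worker_ranges:?}");
}
```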
Are these changes tested?
Yes
Are there any user-facing changes?
Only the addition of the new reader; other changes are internal to the crate (namely the way `Decoder` is created from parts).