feat: Implement an AsyncReader for avro using ObjectStore #8930
Conversation
jecsand838 left a comment:
Flushing a partial review with some high level thoughts.
I'll wait for you to finish before resuming.
Honestly I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved, as it's a public API and I don't want it to be volatile.
100% I'm working on that right now and won't stop until I have a PR. That was a solid catch. The schema logic is an area of the code I've been meaning to fully refactor (or would welcome a full refactor of). I knew it would eventually come back.
Sorry, I haven't dropped it, just found myself in a really busy week! The generic reader support does not seem too hard to implement from the dabbling I've done, and I still need to get to the builder pattern change.
…, separate object store file reader into a featuregated struct and use a generic async file reader trait
@jecsand838 I believe this is now ready for a proper review.
@EmilyMatt Thank you so much for getting these changes up!
I left a few comments. Let me know what you think.
EDIT: Should have mentioned that this is looking really good overall and I'm very excited for the AsyncReader!
@jecsand838 and @EmilyMatt -- how is this PR looking?
I had actually just returned to work on it two days ago. I'm still having some issues with the schema now being provided, due to the problems I've described; @jecsand838 suggested removing the Arrow schema, and I'm starting to think that is the only viable way for now.
Hope to push another version today and address some of the things above.
I get a failure with this test:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arrow_schema::{DataType, Field, Schema, SchemaRef};

fn get_int_array_schema() -> SchemaRef {
    let schema = Schema::new(vec![Field::new(
        "int_array",
        DataType::List(Arc::new(Field::new("element", DataType::Int32, true))),
        true,
    )])
    .with_metadata(HashMap::from([("avro.name".into(), "table".into())]));
    Arc::new(schema)
}

// `arrow_test_data` and `read_async_file` are helpers from the surrounding test module.
#[tokio::test]
async fn test_bad_varint_bug() {
    let file = arrow_test_data("avro/bad-varint-bug.avro");
    let schema = get_int_array_schema();
    let batches = read_async_file(&file, 1024, None, schema).await.unwrap();
    let _batch = &batches[0];
}
```

The Avro file, readable by Spark: bad-varint-bug.avro.gz
I have checked that the Avro file is readable with Python avro 1.12.1.
I don't think this is a bug in the async reader, and I can confirm the following test passes. The issue is probably in the `AvroSchema::from` conversion.
My test provides the Arrow reader schema and the top-level Avro record name in the metadata, which should be sufficient. |
It is not necessarily sufficient, but you should open a bug for this, since a reader schema with nullables and a writer schema with non-nullables should be compatible.
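For concreteness, a sketch of the kind of schema pair this rule covers (the schemas below are illustrative assumptions, not the ones from the failing file):

```rust
// Hypothetical writer/reader schema pair illustrating the rule above:
// Avro schema resolution lets a writer's non-union type match a reader's
// union when one branch matches, so the writer's non-nullable "int" items
// should resolve against the reader's ["null", "int"] items.
fn main() {
    let writer_schema = r#"{"type": "record", "name": "table", "fields": [
        {"name": "int_array", "type": {"type": "array", "items": "int"}}
    ]}"#;
    let reader_schema = r#"{"type": "record", "name": "table", "fields": [
        {"name": "int_array", "type": {"type": "array", "items": ["null", "int"]}}
    ]}"#;
    println!("writer: {writer_schema}\nreader: {reader_schema}");
}
```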
I think this is a schema resolution bug based on a quick glance over the details you provided. That being said, there are limitations with using an Arrow schema as the reader schema.
The biggest challenge to overcome relates to the lossy behavior inherent to Arrow -> Avro schema conversion, i.e. Arrow not having the concept of named types, etc.
100%, it's absolutely not related to this PR. Sorry about not jumping in sooner to call that out. As an aside, I just created #9233, which proposes an approach for modularizing `arrow-avro`.
jecsand838 left a comment:
@EmilyMatt This looks good! I left some final feedback and recommendations, but I think it's at a place to re-run the CI/CD jobs if you want to follow up on these. Once the pipelines pass, I'll approve.
CC: @alamb
```rust
// If a projection exists, project the reader schema; if no reader schema is
// provided, parse it from the header (i.e. the raw writer schema) and project
// that. The projected schema becomes the schema used for reading.
let projected_reader_schema = self
    .projection
    .as_deref()
    .map(|projection| {
        let base_schema = if let Some(reader_schema) = &self.reader_schema {
            reader_schema.clone()
        } else {
            let raw = header.get(SCHEMA_METADATA_KEY).ok_or_else(|| {
                ArrowError::ParseError("No Avro schema present in file header".to_string())
            })?;
            let json_string = std::str::from_utf8(raw)
                .map_err(|e| {
                    ArrowError::ParseError(format!("Invalid UTF-8 in Avro schema header: {e}"))
                })?
                .to_string();
            AvroSchema::new(json_string)
        };
        base_schema.project(projection)
    })
    .transpose()?;
```
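As a usage illustration of this path, a minimal sketch (the builder and method names are assumptions based on this thread, not necessarily the final API):

```rust
// Hypothetical: project to the first and third columns. With no reader
// schema supplied, the writer schema parsed from the file header is
// projected instead, per the logic above.
let stream = ReaderBuilder::new(async_file_reader)
    .with_projection(vec![0, 2])
    .build()
    .await?;
```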
Added tests. Not all the error cases are covered in the builder, but it looks better now.
@jecsand838 I've removed the parquet changes and synced with main. I believe this is ready for last reviews and test runs before being merged. CC: @alamb
I started the tests.
Thx, most failures were technicalities; I believe I fixed all of them.
jecsand838 left a comment:
@EmilyMatt LGTM!
At this point I'm fine with anything else that comes up being a follow-up issue if you are, @mzabaluev (unless it's major).
I just left a few final comments related to improving the docs for this PR.
````rust
//! is enabled, [`AvroObjectReader`] provides integration with object storage services
//! such as S3 via the [object_store] crate.
//!
//! ```ignore
````
Let's make this runnable
````diff
-//! ```ignore
+//! ```
````
I don't know how to do it without failing the tests, because all the code here is feature-gated, and the doctests also run without features enabled.
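One general rustdoc workaround (a sketch, not necessarily what was adopted here) is to hide a `cfg`-gated block inside the doctest, so the body only compiles when the feature is enabled and the test degenerates to an empty `main` otherwise:

````rust
//! ```
//! # #[cfg(feature = "object_store")]
//! # {
//! // Imports of feature-gated items live inside the gated block, so the
//! // doctest still compiles (as an empty body) when the feature is off.
//! use arrow_avro::reader::async_reader::ReaderBuilder; // path assumed
//! // ... example body using the object_store integration ...
//! # }
//! ```
````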
```rust
/// An asynchronous Avro file reader that implements `Stream<Item = Result<RecordBatch, ArrowError>>`.
///
/// This uses an [`AsyncFileReader`] to fetch data ranges as needed, starting with the header,
/// then reading all the blocks in the provided range, where it:
/// 1. Reads and decodes data until the header is fully decoded.
/// 2. Searches from `range.start` for the first sync marker and starts with the following block
///    (if `range.start` is less than the header length, it starts at the header length minus the sync marker bytes).
/// 3. Reads blocks sequentially, decoding them into `RecordBatch`es.
/// 4. If a block is incomplete (because the range ends mid-block), fetches the remaining bytes from the [`AsyncFileReader`].
/// 5. If no range was originally provided, reads the full file.
/// 6. If the range is empty, `file_size` is 0, or `range.end` is less than the header length, finishes immediately.
pub struct AsyncAvroFileReader<R> {
```
I'd recommend adding a good runnable example for AsyncAvroFileReader here.
Added. Also moved everything to use `AvroError`.
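The requested example presumably resembles something like the following sketch (type names follow this thread; the method names, signatures, and error type are assumptions):

```rust
use futures::TryStreamExt;

// Hypothetical end-to-end use: build the reader from any `AsyncFileReader`
// and drain the stream of decoded `RecordBatch`es.
async fn read_avro<R: AsyncFileReader>(reader: R) -> Result<(), ArrowError> {
    let mut stream = ReaderBuilder::new(reader).build().await?;
    while let Some(batch) = stream.try_next().await? {
        println!("decoded {} rows", batch.num_rows());
    }
    Ok(())
}
```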
```text
# Conflicts:
#   arrow-avro/Cargo.toml
#   arrow-avro/src/reader/mod.rs
```
```rust
const DEFAULT_HEADER_SIZE_HINT: u64 = 16 * 1024; // 16 KB

/// Builder for an asynchronous Avro file reader.
pub struct AsyncAvroFileReaderBuilder<R> {
```
This name is unwieldy, though it does not need to be imported.
I'd rather do the idiomatic Rust thing: expose the module as public and export the builder with a terse name under the module path: `crate::reader::async_reader::ReaderBuilder`.
This is also because I want to add another builder typestate to this API in a follow-up PR, and I don't want its name to be even longer.
This is already kind of what's happening. Renamed the builder to `ReaderBuilder`.
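A sketch of the layout being settled on (file paths assumed):

```rust
// arrow-avro/src/reader/mod.rs (assumed): the async module is public, so the
// terse builder name is disambiguated by its module path rather than a prefix.
pub mod async_reader;

// arrow-avro/src/reader/async_reader.rs:
pub struct ReaderBuilder<R> { /* ... */ }

// Callers then write:
// use arrow_avro::reader::async_reader::ReaderBuilder;
```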
```diff
 let current_data = self.reader.get_bytes(range_to_fetch.clone()).await.map_err(|err| {
     AvroError::General(format!(
-        "Error fetching Avro header from object store: {err}"
+        "Error fetching Avro header from object store(range: {range_to_fetch:?}): {err}"
```
This looks like a debugging artefact; add a space after "store" at least?
Which issue does this PR close?
Rationale for this change
Allows for proper file splitting within an asynchronous context.
What changes are included in this PR?
The raw implementation, allowing for file splitting, starting mid-block (reading until a sync marker is found), and further reading until the end of the block is found.
This reader currently requires that a `reader_schema` be provided if type promotion, schema evolution, or projection is desired.
This is done because #8928 currently blocks proper parsing from an Arrow `Schema`.
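To make the splitting contract concrete, a hedged illustration (the numbers and variable names are hypothetical):

```rust
fn main() {
    // Two workers read disjoint byte ranges of one file. Each scans forward
    // from `range.start` to the first sync marker, then decodes whole blocks
    // until it finishes the block containing `range.end`, so every block is
    // decoded by exactly one worker.
    let file_size: u64 = 10_000_000;
    let mid = file_size / 2;
    let worker_ranges = [0..mid, mid..file_size];
    println!("{worker_ranges:?}");
}
```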
Are these changes tested?
Yes
Are there any user-facing changes?
Only the addition of the new reader; other changes are internal to the crate (namely the way `Decoder` is created from parts).