Module reader

Core functionality for reading Avro data into Arrow arrays.

Implements the primary reader interface and record decoding logic: facilities to read Apache Avro–encoded data into Arrow's RecordBatch format.

§Limitations

  • Avro unions with > 127 branches are not supported. When decoding Avro unions to Arrow UnionArray, Arrow stores the union type identifiers in an 8‑bit signed buffer (i8). This implies a practical limit of 127 distinct branch ids. Inputs that resolve to more than 127 branches will return an error. If you truly need more, model the schema as a union of unions, per the Arrow format spec.

    See: Arrow Columnar Format — Dense Union (“types buffer: 8‑bit signed; a union with more than 127 possible types can be modeled as a union of unions”).

This module exposes three layers of the API surface, from highest to lowest level:

  • ReaderBuilder: configures how Avro is read (batch size, strict union handling, string representation, reader schema, etc.) and produces either:
    • a Reader for Avro Object Container Files (OCF) read from any BufRead, or
    • a low-level Decoder for single‑object encoded Avro bytes and Confluent Schema Registry framed messages.
  • Reader: a convenient, synchronous iterator over RecordBatch decoded from an OCF input. Implements Iterator<Item = Result<RecordBatch, ArrowError>> and RecordBatchReader.
  • Decoder: a push‑based row decoder that consumes raw Avro bytes and yields ready RecordBatch values when batches fill. This is suitable for integrating with async byte streams, network protocols, or other custom data sources.

§Encodings and when to use which type

Use Reader (built with ReaderBuilder::build) when the input is an Avro Object Container File; use Decoder (built with ReaderBuilder::build_decoder) when the input is a stream of single‑object encoded or Confluent‑framed messages. The sections below walk through each case.
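
A minimal sketch contrasting the two construction paths. The file path "data.avro" is hypothetical, and the SchemaStore would be populated with the stream's writer schemas before decoding (see the registration examples below).

use std::fs::File;
use std::io::BufReader;
use arrow_avro::reader::ReaderBuilder;
use arrow_avro::schema::SchemaStore;

// OCF path: a `Reader` over any `BufRead` (here, a buffered file).
let file = BufReader::new(File::open("data.avro")?);
let reader = ReaderBuilder::new().build(file)?;

// Streaming path: a `Decoder` for single-object / Confluent framed bytes.
let store = SchemaStore::new(); // register writer schemas here before decoding
let decoder = ReaderBuilder::new()
    .with_writer_schema_store(store)
    .build_decoder()?;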

§Basic file usage (OCF)

Use ReaderBuilder::build to construct a Reader from any BufRead. The doctest below creates a tiny OCF in memory using AvroWriter and then reads it back.

use std::io::Cursor;
use std::sync::Arc;
use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use arrow_avro::writer::AvroWriter;
use arrow_avro::reader::ReaderBuilder;

// Build a minimal Arrow schema and batch
let schema = Schema::new(vec![Field::new("id", DataType::Int32, false)]);
let batch = RecordBatch::try_new(
    Arc::new(schema.clone()),
    vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
)?;

// Write an Avro OCF to memory
let buffer: Vec<u8> = Vec::new();
let mut writer = AvroWriter::new(buffer, schema.clone())?;
writer.write(&batch)?;
writer.finish()?;
let bytes = writer.into_inner();

// Read it back with ReaderBuilder
let mut reader = ReaderBuilder::new().build(Cursor::new(bytes))?;
let out = reader.next().unwrap()?;
assert_eq!(out.num_rows(), 3);
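
Reader also implements RecordBatchReader, so the Arrow schema is available before reading and any remaining batches can be drained with the iterator API. A short continuation of the example above (it only adds the RecordBatchReader trait import):

use arrow_array::RecordBatchReader;

let schema = reader.schema();  // Arrow schema derived from the file's writer schema
for batch in reader {          // drain whatever batches remain
    assert_eq!(batch?.schema(), schema);
}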

§Streaming usage (single‑object / Confluent)

The Decoder lets you integrate Avro decoding with any source of bytes by periodically calling Decoder::decode with new data and calling Decoder::flush to get a RecordBatch once at least one row is complete.

The example below shows how to decode from an arbitrary stream of bytes::Bytes using futures utilities. Note: this is illustrative and keeps a single in‑memory Bytes buffer for simplicity—real applications typically maintain a rolling buffer.

use bytes::{Buf, Bytes};
use futures::{Stream, StreamExt};
use std::task::{Poll, ready};
use arrow_array::RecordBatch;
use arrow_schema::ArrowError;
use arrow_avro::reader::Decoder;

/// Decode a stream of Avro-framed bytes into RecordBatch values.
fn decode_stream<S: Stream<Item = Bytes> + Unpin>(
    mut decoder: Decoder,
    mut input: S,
) -> impl Stream<Item = Result<RecordBatch, ArrowError>> {
    let mut buffered = Bytes::new();
    futures::stream::poll_fn(move |cx| {
        loop {
            if buffered.is_empty() {
                buffered = match ready!(input.poll_next_unpin(cx)) {
                    Some(b) => b,
                    None => break, // EOF
                };
            }
            // Feed as much as possible
            let decoded = match decoder.decode(buffered.as_ref()) {
                Ok(n) => n,
                Err(e) => return Poll::Ready(Some(Err(e))),
            };
            let read = buffered.len();
            buffered.advance(decoded);
            if decoded != read {
                // Not all bytes were consumed; stop feeding and let the
                // `flush` below emit any rows that completed.
                break
            }
        }
        // Return a batch if one or more rows are complete
        Poll::Ready(decoder.flush().transpose())
    })
}

§Building and using a Decoder for single‑object encoding (Rabin fingerprints)

The doctest below writes a single‑object framed record using the Avro writer (no manual varints) for the writer schema ({"type":"record","name":"User","fields":[{"name":"id","type":"long"}]}) and then decodes it into a RecordBatch.

use std::sync::Arc;
use std::collections::HashMap;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use arrow_avro::schema::{AvroSchema, SchemaStore, SCHEMA_METADATA_KEY, FingerprintStrategy};
use arrow_avro::writer::{WriterBuilder, format::AvroBinaryFormat};
use arrow_avro::reader::ReaderBuilder;

// Register the writer schema (Rabin fingerprint by default).
let mut store = SchemaStore::new();
let avro_schema = AvroSchema::new(r#"{"type":"record","name":"User","fields":[
  {"name":"id","type":"long"}]}"#.to_string());
let _fp = store.register(avro_schema.clone())?;

// Create a single-object framed record { id: 42 } with the Avro writer.
let mut md = HashMap::new();
md.insert(SCHEMA_METADATA_KEY.to_string(), avro_schema.json_string.clone());
let arrow = Schema::new_with_metadata(vec![Field::new("id", DataType::Int64, false)], md);
let batch = RecordBatch::try_new(
    Arc::new(arrow.clone()),
    vec![Arc::new(Int64Array::from(vec![42])) as ArrayRef],
)?;
let mut w = WriterBuilder::new(arrow)
    .with_fingerprint_strategy(FingerprintStrategy::Rabin) // SOE prefix
    .build::<_, AvroBinaryFormat>(Vec::new())?;
w.write(&batch)?;
w.finish()?;
let frame = w.into_inner(); // C3 01 + fp + Avro body

// Decode with a `Decoder`
let mut dec = ReaderBuilder::new()
  .with_writer_schema_store(store)
  .with_batch_size(1024)
  .build_decoder()?;

dec.decode(&frame)?;
let out = dec.flush()?.expect("one batch");
assert_eq!(out.num_rows(), 1);

See Avro 1.11.1 “Single object encoding” for details of the 2‑byte marker and little‑endian CRC‑64‑AVRO fingerprint: https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding

§Building and using a Decoder for Confluent Schema Registry framing

The Confluent wire format is: 1‑byte magic 0x00, then a 4‑byte big‑endian schema ID, then the Avro body. The doctest below crafts two messages for the same schema ID and decodes them into a single RecordBatch with two rows.

use std::sync::Arc;
use std::collections::HashMap;
use arrow_array::{ArrayRef, Int64Array, StringArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use arrow_avro::schema::{AvroSchema, SchemaStore, Fingerprint, FingerprintAlgorithm, SCHEMA_METADATA_KEY, FingerprintStrategy};
use arrow_avro::writer::{WriterBuilder, format::AvroBinaryFormat};
use arrow_avro::reader::ReaderBuilder;

// Set up a store keyed by numeric IDs (Confluent).
let mut store = SchemaStore::new_with_type(FingerprintAlgorithm::None);
let schema_id = 7u32;
let avro_schema = AvroSchema::new(r#"{"type":"record","name":"User","fields":[
  {"name":"id","type":"long"}, {"name":"name","type":"string"}]}"#.to_string());
store.set(Fingerprint::Id(schema_id), avro_schema.clone())?;

// Write two Confluent-framed messages {id:1,name:"a"} and {id:2,name:"b"}.
fn msg(id: i64, name: &str, schema: &AvroSchema, schema_id: u32) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let mut md = HashMap::new();
    md.insert(SCHEMA_METADATA_KEY.to_string(), schema.json_string.clone());
    let arrow = Schema::new_with_metadata(
        vec![Field::new("id", DataType::Int64, false), Field::new("name", DataType::Utf8, false)],
        md,
    );
    let batch = RecordBatch::try_new(
        Arc::new(arrow.clone()),
        vec![
          Arc::new(Int64Array::from(vec![id])) as ArrayRef,
          Arc::new(StringArray::from(vec![name])) as ArrayRef,
        ],
    )?;
    let mut w = WriterBuilder::new(arrow)
        .with_fingerprint_strategy(FingerprintStrategy::Id(schema_id)) // 0x00 + ID + body
        .build::<_, AvroBinaryFormat>(Vec::new())?;
    w.write(&batch)?; w.finish()?;
    Ok(w.into_inner())
}
let m1 = msg(1, "a", &avro_schema, schema_id)?;
let m2 = msg(2, "b", &avro_schema, schema_id)?;

// Decode both into a single batch.
let mut dec = ReaderBuilder::new()
  .with_writer_schema_store(store)
  .with_batch_size(1024)
  .build_decoder()?;
dec.decode(&m1)?;
dec.decode(&m2)?;
let batch = dec.flush()?.expect("batch");
assert_eq!(batch.num_rows(), 2);

See Confluent’s “Wire format” notes: magic byte 0x00, 4‑byte big‑endian schema ID, then the Avro‑encoded payload. https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html#wire-format

§Schema resolution (reader vs. writer schemas)

Avro supports resolving data written with one schema (“writer”) into another (“reader”) using rules like field aliases, default values, and numeric promotions. In practice this lets you evolve schemas over time while remaining compatible with old data.

Spec background: See Avro’s Schema Resolution (aliases, defaults) and the Confluent Wire format (magic 0x00 + big‑endian schema id + Avro body). https://avro.apache.org/docs/1.11.1/specification/#schema-resolution https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html#wire-format

§OCF example: rename a field and add a default via a reader schema

Below we write an OCF with a writer schema having fields id: long, name: string. We then read it with a reader schema that:

  • renames name to full_name via aliases, and
  • adds is_active: boolean with a default value true.

use std::io::Cursor;
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, StringArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use arrow_avro::writer::AvroWriter;
use arrow_avro::reader::ReaderBuilder;
use arrow_avro::schema::AvroSchema;

// Writer (past version): { id: long, name: string }
let writer_arrow = Schema::new(vec![
    Field::new("id", DataType::Int64, false),
    Field::new("name", DataType::Utf8, false),
]);
let batch = RecordBatch::try_new(
    Arc::new(writer_arrow.clone()),
    vec![
        Arc::new(Int64Array::from(vec![1, 2])) as ArrayRef,
        Arc::new(StringArray::from(vec!["a", "b"])) as ArrayRef,
    ],
)?;

// Write an OCF entirely in memory
let mut w = AvroWriter::new(Vec::<u8>::new(), writer_arrow)?;
w.write(&batch)?;
w.finish()?;
let bytes = w.into_inner();

// Reader (current version):
//  - record name "topLevelRecord" matches the crate's default for OCF
//  - renames `name` -> `full_name` via aliases (now nullable with a null default)
//  - adds `is_active: boolean` with default `true`
let reader_json = r#"
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "full_name", "type": ["null","string"], "aliases": ["name"], "default": null },
    { "name": "is_active", "type": "boolean", "default": true }
  ]
}"#;

let mut reader = ReaderBuilder::new()
  .with_reader_schema(AvroSchema::new(reader_json.to_string()))
  .build(Cursor::new(bytes))?;

let out = reader.next().unwrap()?;
assert_eq!(out.num_rows(), 2);

§Confluent single‑object example: resolve past writer versions to the topic’s current reader schema

In this scenario, the reader schema is the topic's current schema, while the two writer schemas registered under Confluent IDs 0 and 1 represent past versions. The decoder uses the reader schema to resolve data written with either version.

use std::sync::Arc;
use std::collections::HashMap;
use arrow_avro::reader::ReaderBuilder;
use arrow_avro::schema::{
    AvroSchema, Fingerprint, FingerprintAlgorithm, SchemaStore,
    SCHEMA_METADATA_KEY, FingerprintStrategy,
};
use arrow_array::{ArrayRef, Int32Array, Int64Array, StringArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Reader: current topic schema (no reader-added fields)
    //   {"type":"record","name":"User","fields":[
    //     {"name":"id","type":"long"},
    //     {"name":"name","type":"string"}]}
    let reader_schema = AvroSchema::new(
        r#"{"type":"record","name":"User",
            "fields":[{"name":"id","type":"long"},{"name":"name","type":"string"}]}"#
            .to_string(),
    );

    // Register two *writer* schemas under Confluent IDs 0 and 1
    let writer_v0 = AvroSchema::new(
        r#"{"type":"record","name":"User",
            "fields":[{"name":"id","type":"int"},{"name":"name","type":"string"}]}"#
            .to_string(),
    );
    let writer_v1 = AvroSchema::new(
        r#"{"type":"record","name":"User",
            "fields":[{"name":"id","type":"long"},{"name":"name","type":"string"},
                      {"name":"email","type":["null","string"],"default":null}]}"#
            .to_string(),
    );

    let id_v0: u32 = 0;
    let id_v1: u32 = 1;

    let mut store = SchemaStore::new_with_type(FingerprintAlgorithm::None); // integer IDs
    store.set(Fingerprint::Id(id_v0), writer_v0.clone())?;
    store.set(Fingerprint::Id(id_v1), writer_v1.clone())?;

    // Write two Confluent-framed messages using each writer version
    // frame0: writer v0 body {id:1001_i32, name:"v0-alice"}
    let mut md0 = HashMap::new();
    md0.insert(SCHEMA_METADATA_KEY.to_string(), writer_v0.json_string.clone());
    let arrow0 = Schema::new_with_metadata(
        vec![Field::new("id", DataType::Int32, false),
             Field::new("name", DataType::Utf8, false)], md0);
    let batch0 = RecordBatch::try_new(
        Arc::new(arrow0.clone()),
        vec![Arc::new(Int32Array::from(vec![1001])) as ArrayRef,
             Arc::new(StringArray::from(vec!["v0-alice"])) as ArrayRef])?;
    let mut w0 = arrow_avro::writer::WriterBuilder::new(arrow0)
        .with_fingerprint_strategy(FingerprintStrategy::Id(id_v0))
        .build::<_, arrow_avro::writer::format::AvroBinaryFormat>(Vec::new())?;
    w0.write(&batch0)?; w0.finish()?;
    let frame0 = w0.into_inner(); // 0x00 + id_v0 + body

    // frame1: writer v1 body {id:2002_i64, name:"v1-bob", email: Some("bob@example.com")}
    let mut md1 = HashMap::new();
    md1.insert(SCHEMA_METADATA_KEY.to_string(), writer_v1.json_string.clone());
    let arrow1 = Schema::new_with_metadata(
        vec![Field::new("id", DataType::Int64, false),
             Field::new("name", DataType::Utf8, false),
             Field::new("email", DataType::Utf8, true)], md1);
    let batch1 = RecordBatch::try_new(
        Arc::new(arrow1.clone()),
        vec![Arc::new(Int64Array::from(vec![2002])) as ArrayRef,
             Arc::new(StringArray::from(vec!["v1-bob"])) as ArrayRef,
             Arc::new(StringArray::from(vec![Some("bob@example.com")])) as ArrayRef])?;
    let mut w1 = arrow_avro::writer::WriterBuilder::new(arrow1)
        .with_fingerprint_strategy(FingerprintStrategy::Id(id_v1))
        .build::<_, arrow_avro::writer::format::AvroBinaryFormat>(Vec::new())?;
    w1.write(&batch1)?; w1.finish()?;
    let frame1 = w1.into_inner(); // 0x00 + id_v1 + body

    // Build a streaming Decoder that understands Confluent framing
    let mut decoder = ReaderBuilder::new()
        .with_reader_schema(reader_schema)
        .with_writer_schema_store(store)
        .with_batch_size(8) // small demo batches
        .build_decoder()?;

    // Decode each whole frame, then drain completed rows with flush()
    let mut total_rows = 0usize;

    let consumed0 = decoder.decode(&frame0)?;
    assert_eq!(consumed0, frame0.len(), "decoder must consume the whole frame");
    while let Some(batch) = decoder.flush()? { total_rows += batch.num_rows(); }

    let consumed1 = decoder.decode(&frame1)?;
    assert_eq!(consumed1, frame1.len(), "decoder must consume the whole frame");
    while let Some(batch) = decoder.flush()? { total_rows += batch.num_rows(); }

    // We sent 2 records so we should get 2 rows (possibly one per flush)
    assert_eq!(total_rows, 2);
    Ok(())
}

§Schema evolution and batch boundaries

Decoder supports mid‑stream schema changes when the input framing carries a schema fingerprint (single‑object or Confluent). When a new fingerprint is observed:

  • If the current RecordBatch is empty, the decoder switches to the new schema immediately.
  • If not, the decoder finishes the current batch first and only then switches.

Consequently, the schema of batches produced by Decoder::flush may change over time, and Decoder intentionally does not implement RecordBatchReader. In contrast, Reader (OCF) has a single writer schema for the entire file and therefore implements RecordBatchReader.
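
A minimal consumption sketch for a stream whose frames may switch writer schemas mid‑stream. Here frames (an iterator of framed messages) and a populated store are assumed, as in the Confluent examples above; the point is to read each flushed batch's schema rather than assuming it stays constant:

let mut decoder = ReaderBuilder::new()
    .with_writer_schema_store(store)
    .build_decoder()?;
for frame in frames {
    decoder.decode(&frame)?;
    while let Some(batch) = decoder.flush()? {
        // A schema switch only takes effect once the in-progress batch has
        // been emitted, so consecutive batches may carry different schemas.
        println!("rows = {}, fields = {:?}", batch.num_rows(), batch.schema().fields());
    }
}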

§Performance & memory

  • batch_size controls the maximum number of rows per RecordBatch. Larger batches amortize per‑batch overhead; smaller batches reduce peak memory usage and latency (see the configuration sketch after this list).
  • When utf8_view is enabled, string columns use Arrow’s StringViewArray, which can reduce allocations for short strings.
  • For OCF, blocks may be compressed; Reader will decompress using the codec specified in the file header and feed uncompressed bytes to the row Decoder.
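
A small configuration sketch for these options. with_utf8_view is assumed to be the ReaderBuilder toggle behind the utf8_view behavior described above, and bytes is an in‑memory OCF such as the one produced in the basic example:

use std::io::Cursor;
use arrow_avro::reader::ReaderBuilder;

let reader = ReaderBuilder::new()
    .with_batch_size(4096) // at most 4096 rows per RecordBatch
    .with_utf8_view(true)  // decode string columns as StringViewArray
    .build(Cursor::new(bytes))?;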

§Error handling

  • Incomplete inputs return parse errors with “Unexpected EOF”; callers typically provide more bytes and try again.
  • If a fingerprint is unknown to the provided SchemaStore, decoding fails with a descriptive error. Populate the store up front to avoid this (see the sketch after this list).
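
A small error‑handling sketch, assuming dec is a Decoder built as in the examples above and frame is one (possibly truncated) framed message; both an incomplete frame and an unregistered fingerprint surface as an ArrowError from decode:

match dec.decode(&frame) {
    Ok(_consumed) => {
        if let Some(batch) = dec.flush()? {
            println!("decoded {} rows", batch.num_rows());
        }
    }
    // Incomplete input: keep the unconsumed bytes and retry once more arrive.
    Err(e) if e.to_string().contains("Unexpected EOF") => { /* wait for more data */ }
    // Anything else (e.g. a fingerprint missing from the SchemaStore) is reported.
    Err(e) => eprintln!("decode failed: {e}"),
}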

Modules§

block 🔒
Decoder for Block
cursor 🔒
header 🔒
Decoder for Header
record 🔒
vlq 🔒

Structs§

Decoder
A low‑level, push‑based decoder from Avro bytes to Arrow RecordBatch.
Reader
A high‑level Avro Object Container File reader.
ReaderBuilder
A builder that configures and constructs Avro readers and decoders.

Functions§

is_incomplete_data 🔒
read_header 🔒
Read the Avro file header (magic, metadata, sync marker) from reader.