pub struct Decoder {
schema: SchemaRef,
projection: Option<Vec<usize>>,
batch_size: usize,
to_skip: usize,
line_number: usize,
end: usize,
record_decoder: RecordDecoder,
null_regex: NullRegex,
}
Expand description
A push-based interface for decoding CSV data from an arbitrary byte stream
See Reader
for a higher-level interface for interface with Read
The push-based interface facilitates integration with sources that yield arbitrarily
delimited bytes ranges, such as BufRead
, or a chunked byte stream received from
object storage
fn read_from_csv<R: BufRead>(
mut reader: R,
schema: SchemaRef,
batch_size: usize,
) -> Result<impl Iterator<Item = Result<RecordBatch, ArrowError>>, ArrowError> {
let mut decoder = ReaderBuilder::new(schema)
.with_batch_size(batch_size)
.build_decoder();
let mut next = move || {
loop {
let buf = reader.fill_buf()?;
let decoded = decoder.decode(buf)?;
if decoded == 0 {
break;
}
// Consume the number of bytes read
reader.consume(decoded);
}
decoder.flush()
};
Ok(std::iter::from_fn(move || next().transpose()))
}
Fields§
§schema: SchemaRef
Explicit schema for the CSV file
projection: Option<Vec<usize>>
Optional projection for which columns to load (zero-based column indices)
batch_size: usize
Number of records per batch
to_skip: usize
Rows to skip
line_number: usize
Current line number
end: usize
End line number
record_decoder: RecordDecoder
A decoder for StringRecords
null_regex: NullRegex
Check if the string matches this pattern for NULL
.
Implementations§
Source§impl Decoder
impl Decoder
Sourcepub fn decode(&mut self, buf: &[u8]) -> Result<usize, ArrowError>
pub fn decode(&mut self, buf: &[u8]) -> Result<usize, ArrowError>
Decode records from buf
returning the number of bytes read
This method returns once batch_size
objects have been parsed since the
last call to Self::flush
, or buf
is exhausted. Any remaining bytes
should be included in the next call to Self::decode
There is no requirement that buf
contains a whole number of records, facilitating
integration with arbitrary byte streams, such as that yielded by BufRead
or
network sources such as object storage
Sourcepub fn flush(&mut self) -> Result<Option<RecordBatch>, ArrowError>
pub fn flush(&mut self) -> Result<Option<RecordBatch>, ArrowError>
Flushes the currently buffered data to a [RecordBatch
]
This should only be called after Self::decode
has returned Ok(0)
,
otherwise may return an error if part way through decoding a record
Returns Ok(None)
if no buffered data
Sourcepub fn capacity(&self) -> usize
pub fn capacity(&self) -> usize
Returns the number of records that can be read before requiring a call to Self::flush