pub struct ArrowWriter<W: Write> {
writer: SerializedFileWriter<W>,
in_progress: Option<ArrowRowGroupWriter>,
arrow_schema: SchemaRef,
row_group_writer_factory: ArrowRowGroupWriterFactory,
max_row_group_size: usize,
}
Encodes [RecordBatch] to Parquet.
Writes Arrow [RecordBatch]es to a Parquet writer. Multiple [RecordBatch]es will be encoded
into the same row group, up to max_row_group_size rows. Any remaining rows will be
flushed on close, so the final row group in the output file may
contain fewer than max_row_group_size rows.
§Example: Writing RecordBatches
use std::sync::Arc;
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::arrow::arrow_reader::ParquetRecordBatchReader;

let col = Arc::new(Int64Array::from_iter_values([1, 2, 3])) as ArrayRef;
let to_write = RecordBatch::try_from_iter([("col", col)]).unwrap();
let mut buffer = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None).unwrap();
writer.write(&to_write).unwrap();
writer.close().unwrap();
let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
let read = reader.next().unwrap().unwrap();
assert_eq!(to_write, read);
§Memory Usage and Limiting
The nature of Parquet requires buffering of an entire row group before it can be flushed to the underlying writer. Data is mostly buffered in its encoded form, reducing memory usage. However, some data such as dictionary keys, large strings or very nested data may still result in non-trivial memory usage.
See Also:
- ArrowWriter::memory_size: the current memory usage of the writer
- ArrowWriter::in_progress_size: the estimated encoded size of the buffered row group

Call Self::flush to trigger an early flush of a row group based on a
memory threshold and/or global memory pressure. However, smaller row groups
result in higher metadata overheads, and thus may worsen compression ratios
and query performance.
writer.write(&batch).unwrap();
// Trigger an early flush if anticipated size exceeds 1_000_000
if writer.in_progress_size() > 1_000_000 {
writer.flush().unwrap();
}
§Type Support
The writer supports writing all Arrow DataTypes that have a direct mapping to
Parquet types including StructArray and ListArray.
The following are not supported:
- IntervalMonthDayNanoArray: Parquet does not support nanosecond intervals.
§Type Compatibility
The writer can write Arrow [RecordBatch]es whose schemas are logically equivalent to its
own. This means that for a given column, the writer can accept multiple Arrow DataTypes
that contain the same value type.
For example, the following DataTypes are all logically equivalent and can be written
to the same column:
- String, LargeString, StringView
- Binary, LargeBinary, BinaryView
The writer will also accept both native and dictionary-encoded arrays, provided the dictionaries contain compatible values.
use std::sync::Arc;
use arrow_array::{DictionaryArray, LargeStringArray, RecordBatch, StringArray, UInt8Array};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

let record_batch1 = RecordBatch::try_new(
Arc::new(Schema::new(vec![Field::new("col", DataType::LargeUtf8, false)])),
vec![Arc::new(LargeStringArray::from_iter_values(vec!["a", "b"]))]
)
.unwrap();
let mut buffer = Vec::new();
let mut writer = ArrowWriter::try_new(&mut buffer, record_batch1.schema(), None).unwrap();
writer.write(&record_batch1).unwrap();
let record_batch2 = RecordBatch::try_new(
Arc::new(Schema::new(vec![Field::new(
"col",
DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)),
false,
)])),
vec![Arc::new(DictionaryArray::new(
UInt8Array::from_iter_values(vec![0, 1]),
Arc::new(StringArray::from_iter_values(vec!["b", "c"])),
))],
)
.unwrap();
writer.write(&record_batch2).unwrap();
writer.close().unwrap();
Fields§
writer: SerializedFileWriter<W>
Underlying Parquet writer
in_progress: Option<ArrowRowGroupWriter>
The in-progress row group, if any
arrow_schema: SchemaRef
A copy of the Arrow schema.
The schema is used to verify that each record batch written has the correct schema.
row_group_writer_factory: ArrowRowGroupWriterFactory
Creates new ArrowRowGroupWriter instances as required
max_row_group_size: usize
The maximum number of rows to write to each row group
Implementations§
impl<W: Write + Send> ArrowWriter<W>
pub fn try_new(
    writer: W,
    arrow_schema: SchemaRef,
    props: Option<WriterProperties>,
) -> Result<Self>
Try to create a new Arrow writer
The writer will fail if:
- a SerializedFileWriter cannot be created from the ParquetWriter
- the Arrow schema contains unsupported datatypes, such as Unions
pub fn try_new_with_options(
    writer: W,
    arrow_schema: SchemaRef,
    options: ArrowWriterOptions,
) -> Result<Self>
Try to create a new Arrow writer with ArrowWriterOptions.
The writer will fail if:
- a SerializedFileWriter cannot be created from the ParquetWriter
- the Arrow schema contains unsupported datatypes, such as Unions
pub fn flushed_row_groups(&self) -> &[RowGroupMetaData]
Returns metadata for any flushed row groups
pub fn memory_size(&self) -> usize
Estimated memory usage, in bytes, of this ArrowWriter
This estimate is formed by summing the values of
ArrowColumnWriter::memory_size for all in-progress columns.
pub fn in_progress_size(&self) -> usize
Anticipated encoded size of the in-progress row group.
This estimates the size of the row group once completely encoded, and is
formed by summing the values of
ArrowColumnWriter::get_estimated_total_bytes for all in-progress
columns.
pub fn in_progress_rows(&self) -> usize
Returns the number of rows buffered in the in progress row group
pub fn bytes_written(&self) -> usize
Returns the number of bytes written by this instance
pub fn write(&mut self, batch: &RecordBatch) -> Result<()>
Encodes the provided [RecordBatch]
If this would cause the current row group to exceed WriterProperties::max_row_group_size
rows, the contents of batch will be written to one or more row groups such that all but
the final row group in the file contain WriterProperties::max_row_group_size rows.
This will fail if the batch’s schema does not match the writer’s schema.
pub fn write_all(&mut self, buf: &[u8]) -> Result<()>
Writes the given buf bytes to the internal buffer.
It’s safe to use this method to write data to the underlying writer, because it ensures that the buffering and byte-counting layers are used.
pub fn flush(&mut self) -> Result<()>
Flushes all buffered rows into a new row group
Note that the underlying writer is not flushed by this call.
If that behavior is desired, call ArrowWriter::sync.
pub fn append_key_value_metadata(&mut self, kv_metadata: KeyValue)
Appends KeyValue metadata to be written in addition to that from WriterProperties
This method provides a way to append kv_metadata after writing RecordBatches.
pub fn inner_mut(&mut self) -> &mut W
Returns a mutable reference to the underlying writer.
Warning: writing directly to this writer skips
the TrackedWrite buffering and byte-counting layers, causing
the file footer’s recorded offsets and sizes to diverge from reality
and resulting in an unreadable or corrupted Parquet file.
If you want to write safely to the underlying writer, use Self::write_all.
pub fn into_inner(self) -> Result<W>
Flushes any outstanding data and returns the underlying writer.
pub fn finish(&mut self) -> Result<ParquetMetaData>
Close and finalize the underlying Parquet writer
Unlike Self::close this does not consume self
Attempting to write after calling finish will result in an error
pub fn close(self) -> Result<ParquetMetaData>
Close and finalize the underlying Parquet writer
pub fn get_column_writers(&mut self) -> Result<Vec<ArrowColumnWriter>>
👎Deprecated since 56.2.0: Use ArrowRowGroupWriterFactory instead, see ArrowColumnWriter for an example
Create a new row group writer and return its column writers.
pub fn append_row_group(&mut self, chunks: Vec<ArrowColumnChunk>) -> Result<()>
👎Deprecated since 56.2.0: Use SerializedFileWriter directly instead, see ArrowColumnWriter for an example
Append the given column chunks to the file as a new row group.
pub fn into_serialized_writer(
    self,
) -> Result<(SerializedFileWriter<W>, ArrowRowGroupWriterFactory)>
Converts this writer into a lower-level SerializedFileWriter and ArrowRowGroupWriterFactory.
Flushes any outstanding data before returning.
This can be useful to provide more control over how files are written, for example
to write columns in parallel. See the example on ArrowColumnWriter.
Trait Implementations§
Auto Trait Implementations§
impl<W> Freeze for ArrowWriter<W> where W: Freeze
impl<W> !RefUnwindSafe for ArrowWriter<W>
impl<W> Send for ArrowWriter<W> where W: Send
impl<W> !Sync for ArrowWriter<W>
impl<W> Unpin for ArrowWriter<W> where W: Unpin
impl<W> !UnwindSafe for ArrowWriter<W>