Struct ArrowWriterOptions 

pub struct ArrowWriterOptions {
    properties: WriterProperties,
    skip_arrow_metadata: bool,
    schema_root: Option<String>,
    schema_descr: Option<SchemaDescriptor>,
    page_store_factory: Option<Arc<dyn PageStoreFactory>>,
}
Arrow-specific configuration settings for writing parquet files.

See ArrowWriter for how to configure the writer.

Fields

properties: WriterProperties
skip_arrow_metadata: bool
schema_root: Option<String>
schema_descr: Option<SchemaDescriptor>
page_store_factory: Option<Arc<dyn PageStoreFactory>>

Implementations

impl ArrowWriterOptions

pub fn new() -> Self

Creates a new ArrowWriterOptions with the default settings.

pub fn with_properties(self, properties: WriterProperties) -> Self

Sets the WriterProperties for writing parquet files.

pub fn with_page_store_factory(self, page_store_factory: Arc<dyn PageStoreFactory>) -> Self

Sets the PageStoreFactory used to buffer completed pages while a row group is being written.

The default implementation (InMemoryPageStore) buffers all completed pages on the heap until the row group is flushed, so peak write memory grows with the row group size. A custom factory lets pages be spilled to a file or object storage instead, which can substantially reduce peak write memory at the cost of an extra write to and read from secondary storage.

§Example: spilling pages to a temp file

A simple spilling backend uses one temp file per column chunk; put appends the page and take reads it back.

struct TempFilePageStore {
    file: File,
    /// Total size of the file
    end: u64,
    /// Location of pages: (offset, len)
    locs: Vec<(u64, usize)>,
}

impl PageStore for TempFilePageStore {
    fn put(&mut self, value: Bytes) -> Result<PageKey> {
        // Append to the end of the file
        self.file.seek(SeekFrom::Start(self.end))?;
        self.file.write_all(&value)?;
        let key = PageKey::new(self.locs.len() as u64);
        self.locs.push((self.end, value.len()));
        self.end += value.len() as u64;
        Ok(key)
    }

    fn take(&mut self, key: PageKey) -> Result<Bytes> {
        let (offset, len) = self.locs[key.get() as usize];
        let mut buf = vec![0u8; len];
        self.file.seek(SeekFrom::Start(offset))?;
        self.file.read_exact(&mut buf)?;
        Ok(Bytes::from(buf))
    }
}

/// Factory for creating [`TempFilePageStore`]
#[derive(Debug)]
struct TempFilePageStoreFactory;

impl PageStoreFactory for TempFilePageStoreFactory {
    fn create(&self, args: &PageStoreArgs<'_>) -> Result<Box<dyn PageStore>> {
        // `args` exposes the column index and descriptor (physical/logical
        // type, path), so a real backend might choose to spill only large columns.
        let _ = (args.column_index(), args.column_descriptor());
        Ok(Box::new(TempFilePageStore {
            file: tempfile::tempfile()?, // temp file is cleaned on drop
            end: 0,
            locs: Vec::new(),
        }))
    }
}

// write 1000 integers
let col = Arc::new(Int64Array::from_iter_values(0..1000)) as ArrayRef;
let to_write = RecordBatch::try_from_iter([("col", col)]).unwrap();

let options =
    ArrowWriterOptions::new().with_page_store_factory(Arc::new(TempFilePageStoreFactory));
let mut buffer = Vec::new();
let mut writer =
    ArrowWriter::try_new_with_options(&mut buffer, to_write.schema(), options).unwrap();
writer.write(&to_write).unwrap();
writer.close().unwrap();

// buffer now holds valid Parquet data, which can be read as normal:
let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buffer), 1024).unwrap();
assert_eq!(to_write, reader.next().unwrap().unwrap());
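
The (offset, len) bookkeeping that TempFilePageStore relies on can be tried in isolation, without the parquet or tempfile crates. The following is a minimal sketch under stated assumptions: `SpillModel` is a hypothetical stand-in, not the crate's PageStore trait; a `Cursor<Vec<u8>>` plays the role of the temp file; keys are plain `u64` values rather than PageKey; and, unlike the example above, an unknown key yields `None` instead of panicking.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom, Write};

/// Hypothetical stand-in for a spill-backed page store. `Cursor<Vec<u8>>`
/// plays the role of the temp file so the sketch needs no filesystem access.
struct SpillModel {
    backing: Cursor<Vec<u8>>,
    /// Offset at which the next page will be appended
    end: u64,
    /// Location of each page, indexed by key: (offset, len)
    locs: Vec<(u64, usize)>,
}

impl SpillModel {
    fn new() -> Self {
        Self {
            backing: Cursor::new(Vec::new()),
            end: 0,
            locs: Vec::new(),
        }
    }

    /// Append `page` to the backing store and return its key
    /// (the index of its entry in `locs`).
    fn put(&mut self, page: &[u8]) -> std::io::Result<u64> {
        // A previous `take` may have moved the cursor, so seek to the end first.
        self.backing.seek(SeekFrom::Start(self.end))?;
        self.backing.write_all(page)?;
        let key = self.locs.len() as u64;
        self.locs.push((self.end, page.len()));
        self.end += page.len() as u64;
        Ok(key)
    }

    /// Read back the page stored under `key`, or `None` if the key is unknown.
    fn take(&mut self, key: u64) -> std::io::Result<Option<Vec<u8>>> {
        let Some(&(offset, len)) = self.locs.get(key as usize) else {
            return Ok(None);
        };
        let mut buf = vec![0u8; len];
        self.backing.seek(SeekFrom::Start(offset))?;
        self.backing.read_exact(&mut buf)?;
        Ok(Some(buf))
    }
}
```

Because every `put` seeks to `end` before writing, calls to `put` and `take` can be interleaved in any order without one page overwriting another.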
pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self

Sets whether to skip encoding the embedded Arrow metadata (defaults to false).

By default, Parquet files generated by the ArrowWriter contain the embedded Arrow schema.

Set skip_arrow_metadata to true to omit this embedded metadata from the file.

pub fn with_schema_root(self, schema_root: String) -> Self

Sets the name of the root Parquet schema element (defaults to "arrow_schema").

pub fn with_parquet_schema(self, schema_descr: SchemaDescriptor) -> Self

Explicitly specifies the Parquet schema to be used.

If omitted (the default), the ArrowSchemaConverter is used to compute the Parquet SchemaDescriptor. This method may be used when the SchemaDescriptor is already known or must be calculated using custom logic.
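
Every `with_*` method listed above takes `self` by value and returns `Self`, so calls chain and the result must be bound to a variable. The sketch below models only that shape using the standard library. `ModelOptions` and its two accessor methods are hypothetical and are not part of the parquet crate; the `"arrow_schema"` fallback mirrors the default documented for with_schema_root.

```rust
/// Hypothetical stand-in for `ArrowWriterOptions`. Only two of the documented
/// fields are modeled; this is not the parquet crate's type.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ModelOptions {
    skip_arrow_metadata: bool,
    /// `None` means "use the default root name"
    schema_root: Option<String>,
}

impl ModelOptions {
    /// Creates options with the default settings.
    pub fn new() -> Self {
        Self::default()
    }

    /// Consumes `self` and returns it with one field replaced.
    pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self {
        Self { skip_arrow_metadata, ..self }
    }

    /// Consumes `self` and returns it with the root name set.
    pub fn with_schema_root(self, schema_root: String) -> Self {
        Self { schema_root: Some(schema_root), ..self }
    }

    /// Hypothetical accessor: the configured root name, or the documented
    /// default when none was set.
    pub fn schema_root_or_default(&self) -> &str {
        self.schema_root.as_deref().unwrap_or("arrow_schema")
    }

    /// Hypothetical accessor for the skip flag.
    pub fn skips_arrow_metadata(&self) -> bool {
        self.skip_arrow_metadata
    }
}
```

With the real type, the equivalent chain based on the signatures listed above would read `ArrowWriterOptions::new().with_skip_arrow_metadata(true).with_schema_root("root".to_string())`.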

Trait Implementations

impl Clone for ArrowWriterOptions

fn clone(&self) -> ArrowWriterOptions

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for ArrowWriterOptions

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for ArrowWriterOptions

fn default() -> ArrowWriterOptions

Returns the “default value” for a type.

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬 This is a nightly-only experimental API. (clone_to_uninit)

Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> Ungil for T
where T: Send,

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V