Module variant

⚠️ Experimental Support for reading and writing Variants to / from Parquet files ⚠️

This is a 🚧 Work In Progress

Note: Requires the variant_experimental feature of the parquet crate to be enabled.

§Features

  • Representation of Variant, and VariantArray for working with Variant values (see [parquet_variant] for more details)
  • Kernels for working with arrays of Variant values such as conversion between Variant and JSON, and shredding/unshredding (see [parquet_variant_compute] for more details)

§Example: Writing a Parquet file with Variant column

 // Use the VariantArrayBuilder to build a VariantArray
 let mut builder = VariantArrayBuilder::new(3);
 builder.new_object().with_field("name", "Alice").finish(); // row 1: {"name": "Alice"}
 builder.append_value("such wow"); // row 2: "such wow" (a string)
 let array = builder.build();

 // Since VariantArray is an ExtensionType, it needs to be converted
 // to an ArrayRef and Field with the appropriate metadata
 // before it can be written to a Parquet file
 let field = array.field("data");
 let array = ArrayRef::from(array);
 // create a RecordBatch with the VariantArray
 let schema = Schema::new(vec![field]);
 let batch = RecordBatch::try_new(Arc::new(schema), vec![array])?;

 // Now you can write the RecordBatch to the Parquet file, as normal
 let file = std::fs::File::create("variant.parquet")?;
 let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
 writer.write(&batch)?;
 writer.close()?;

§Example: Writing JSON into a Parquet file with Variant column

 // Create an array of JSON strings, simulating a column of JSON data
 let input_array: ArrayRef = Arc::new(StringArray::from(vec![
  Some(r#"{"name": "Alice", "age": 30}"#),
  Some(r#"{"name": "Bob", "age": 25, "address": {"city": "New York"}}"#),
  None,
  Some("{}"),
 ]));

 // Convert the JSON strings to a VariantArray
 let array: VariantArray = json_to_variant(&input_array)?;
 // create a RecordBatch with the VariantArray
 let schema = Schema::new(vec![array.field("data")]);
 let batch = RecordBatch::try_new(Arc::new(schema), vec![ArrayRef::from(array)])?;

 // write the RecordBatch to a Parquet file as normal
 let file = std::fs::File::create("variant-json.parquet")?;
 let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
 writer.write(&batch)?;
 writer.close()?;

§Example: Reading a Parquet file with Variant column

Use the VariantType extension type to find the Variant column:

// Read the Parquet file using standard Arrow Parquet reader.
// Note this file has 2 columns, "id" and "var"; the "var" column holds the Variant
let file = std::fs::File::open(file_path())?;
let mut reader = ArrowReaderBuilder::try_new(file)?.build()?;

// You can check if a column contains a Variant using
// the VariantType extension type
let schema = reader.schema();
let field = schema.field_with_name("var")?;
assert!(field.try_extension_type::<VariantType>().is_ok());

// The reader yields RecordBatches in which the Variant column is a StructArray;
// to convert that column to a VariantArray, use VariantArray::try_new
let batch = reader.next().unwrap().unwrap();

let col = batch.column_by_name("var").unwrap();
let var_array = VariantArray::try_new(col)?;
assert_eq!(var_array.len(), 1);
let var_value: Variant = var_array.value(0);
assert_eq!(var_value, Variant::from("iceberg")); // the value in case-075.parquet

Structs§

BorrowedShreddingState
Similar to ShreddingState except it holds borrowed references of the target arrays. Useful for avoiding clone operations when the caller does not need a self-standing shredding state.
CastOptions
Options for controlling the behavior of cast_to_variant_with_options.
GetOptions
Controls the action of the variant_get kernel.
ListBuilder
A builder for creating Variant::List values.
ListState
Internal state for list building
ObjectBuilder
A builder for creating Variant::Object values.
ObjectFieldBuilder
A VariantBuilderExt that inserts a new field into a variant object.
ObjectState
Internal state for object building
ParentState
Tracks information needed to correctly finalize a nested builder.
ReadOnlyMetadataBuilder
A metadata builder that cannot register new field names, and merely returns the field id associated with a known field name. This is useful for variant unshredding operations, where the metadata column is fixed and – per variant shredding spec – already contains all field names from the typed_value column. It is also useful when projecting a subset of fields from a variant object value, since the bytes can be copied across directly without re-encoding their field ids.
ShortString
A Variant ShortString
ShreddingState
Represents the shredding state of a VariantArray
Uuid
A Universally Unique Identifier (UUID).
ValueBuilder
Wrapper around a Vec<u8> that provides methods for appending primitive values, variant types, and metadata.
VariantArray
An array of Parquet Variant values
VariantArrayBuilder
A builder for VariantArray
VariantBuilder
Top level builder for Variant values
VariantDecimal4
Represents a 4-byte decimal value in the Variant format.
VariantDecimal8
Represents an 8-byte decimal value in the Variant format.
VariantDecimal16
Represents a 16-byte decimal value in the Variant format.
VariantList
A Variant Array (an ordered list of Variant values).
VariantMetadata
Variant Metadata
VariantObject
A Variant Object (struct with named fields).
VariantPath
Represents a qualified path to a potential subfield or index of a variant value.
VariantType
Arrow Variant [ExtensionType].
VariantValueArrayBuilder
A builder for creating only the value column of a VariantArray
WritableMetadataBuilder
Builder for constructing metadata for Variant values.
f16
A 16-bit floating point type implementing the IEEE 754-2008 standard binary16 a.k.a “half” format.

Enums§

Variant
Represents a Parquet Variant
VariantPathElement
Element of a VariantPath that can be a field name or an index.
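
To illustrate how a path made of field names and indices selects a nested value, here is a minimal, standard-library-only sketch. The `Value`, `PathElement`, `get_path`, and `sample` names and the `tags` field are hypothetical stand-ins for illustration; they are not the crate's `Variant`, `VariantPath`, or `variant_get` API.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a variant value (hypothetical, not the crate's `Variant`).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Str(String),
    List(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// One step of a path: either an object field name or a list index.
#[derive(Debug, Clone, PartialEq)]
enum PathElement {
    Field(String),
    Index(usize),
}

/// Follows `path` from `root`. Returns `None` as soon as a step does not
/// apply (missing field, index out of bounds, or wrong kind of value).
fn get_path<'a>(root: &'a Value, path: &[PathElement]) -> Option<&'a Value> {
    let mut current = root;
    for element in path {
        current = match (current, element) {
            (Value::Object(fields), PathElement::Field(name)) => fields.get(name)?,
            (Value::List(items), PathElement::Index(i)) => items.get(*i)?,
            _ => return None,
        };
    }
    Some(current)
}

/// Builds {"name": "Bob", "age": 25, "address": {"city": "New York"}, "tags": [null, 7]}.
fn sample() -> Value {
    let mut address = BTreeMap::new();
    address.insert("city".to_string(), Value::Str("New York".to_string()));
    let mut root = BTreeMap::new();
    root.insert("name".to_string(), Value::Str("Bob".to_string()));
    root.insert("age".to_string(), Value::Int(25));
    root.insert("address".to_string(), Value::Object(address));
    root.insert("tags".to_string(), Value::List(vec![Value::Null, Value::Int(7)]));
    Value::Object(root)
}
```

An empty path selects the root value itself, and a path that does not match the shape of the data yields `None` rather than an error; that choice belongs to this sketch, not to the crate.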

Constants§

EMPTY_VARIANT_METADATA
The empty metadata dictionary.
EMPTY_VARIANT_METADATA_BYTES
The canonical byte slice corresponding to an empty metadata dictionary.
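
As a rough illustration of what a metadata dictionary looks like in bytes, the sketch below encodes a list of field names. It assumes the layout from the Variant encoding specification: a header byte with the version in the low four bits and `offset_size_minus_one` in the top two bits, followed by the dictionary size, `size + 1` offsets, and the concatenated string bytes. The `encode_metadata` and `field_name` helpers are hypothetical and handle only 1-byte offsets; they are not part of the crate.

```rust
/// Encodes `names` as a metadata dictionary: version 1, unsorted, 1-byte offsets.
/// Returns `None` if the dictionary is too large for 1-byte sizes and offsets.
fn encode_metadata(names: &[&str]) -> Option<Vec<u8>> {
    let total: usize = names.iter().map(|n| n.len()).sum();
    if names.len() > u8::MAX as usize || total > u8::MAX as usize {
        return None;
    }
    const VERSION: u8 = 1;
    let offset_size_minus_one: u8 = 0; // 1-byte offsets
    let header = VERSION | (offset_size_minus_one << 6);

    // Header, then dictionary size, then the first offset (always 0).
    let mut out = vec![header, names.len() as u8];
    let mut offset = 0u8;
    out.push(offset);
    // One end offset per name; cannot overflow because total <= 255.
    for name in names {
        offset += name.len() as u8;
        out.push(offset);
    }
    // Finally the string bytes, back to back.
    for name in names {
        out.extend_from_slice(name.as_bytes());
    }
    Some(out)
}

/// Looks up the name for field id `id`, assuming the 1-byte-offset layout above.
fn field_name(metadata: &[u8], id: usize) -> Option<&str> {
    let dict_size = *metadata.get(1)? as usize;
    if id >= dict_size {
        return None;
    }
    let offsets_start = 2;
    let bytes_start = offsets_start + dict_size + 1;
    let start = *metadata.get(offsets_start + id)? as usize;
    let end = *metadata.get(offsets_start + id + 1)? as usize;
    std::str::from_utf8(metadata.get(bytes_start + start..bytes_start + end)?).ok()
}
```

Under this assumed layout an empty dictionary encodes to the three bytes `[1, 0, 0]`. Whether that matches EMPTY_VARIANT_METADATA_BYTES byte for byte should be checked against the crate source.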

Traits§

BuilderSpecificState
A trait for managing state specific to different builder types.
MetadataBuilder
A trait for building variant metadata dictionaries, to be used in conjunction with a ValueBuilder. The trait provides methods for managing field names and their IDs, as well as rolling back a failed builder operation that might have created new field ids.
VariantBuilderExt
Extends VariantBuilder to help build nested Variants
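
The difference between a writable and a read-only metadata builder, and the rollback of field ids after a failed operation, can be sketched with the standard library alone. The `FieldIds` trait, the `Writable` and `ReadOnly` types, and their method names are hypothetical; the real MetadataBuilder trait and its implementors may look quite different.

```rust
use std::collections::HashMap;

/// Hypothetical trait: resolve a field name to its id, or fail.
trait FieldIds {
    fn field_id(&mut self, name: &str) -> Result<u32, String>;
}

/// Writable dictionary: unseen names are registered and get the next free id.
#[derive(Default)]
struct Writable {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl Writable {
    /// Number of names registered so far; usable as a rollback checkpoint.
    fn num_names(&self) -> usize {
        self.names.len()
    }

    /// Rolls back to an earlier checkpoint, forgetting names added since then.
    fn truncate(&mut self, len: usize) {
        let len = len.min(self.names.len());
        for name in self.names.drain(len..) {
            self.ids.remove(&name);
        }
    }
}

impl FieldIds for Writable {
    fn field_id(&mut self, name: &str) -> Result<u32, String> {
        if let Some(&id) = self.ids.get(name) {
            return Ok(id);
        }
        let id = self.names.len() as u32;
        self.ids.insert(name.to_string(), id);
        self.names.push(name.to_string());
        Ok(id)
    }
}

/// Read-only dictionary: the set of names is fixed at construction time.
struct ReadOnly {
    ids: HashMap<String, u32>,
}

impl ReadOnly {
    fn new(names: &[&str]) -> Self {
        let ids = names
            .iter()
            .enumerate()
            .map(|(i, n)| (n.to_string(), i as u32))
            .collect();
        ReadOnly { ids }
    }
}

impl FieldIds for ReadOnly {
    fn field_id(&mut self, name: &str) -> Result<u32, String> {
        // Never registers anything: unknown names are an error.
        self.ids
            .get(name)
            .copied()
            .ok_or_else(|| format!("unknown field name: {name}"))
    }
}
```

In this sketch a caller records `num_names()` before starting a nested build and calls `truncate` with that value if the build fails, so ids created by the failed attempt do not leak into the dictionary.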

Functions§

cast_to_variant
Convert an array to a VariantArray with strict mode enabled (returns errors on conversion failures).
cast_to_variant_with_options
Casts a typed arrow [Array] to a VariantArray. This is useful when you need to convert a specific data type
json_to_variant
Parse a batch of JSON strings into a batch of Variants represented as STRUCT<metadata: BINARY, value: BINARY> where nulls are preserved. The JSON strings in the input must be valid.
shred_variant
Shreds the input binary variant using a target shredding schema derived from the requested data type.
unshred_variant
Removes all (nested) typed_value columns from a VariantArray by converting them back to binary variant and merging the resulting values back into the value column.
variant_get
Returns an array with the specified path extracted from the variant values.
variant_to_json
Transform a batch of Variants represented as STRUCT<metadata: BINARY, value: BINARY> into a batch of JSON strings where nulls are preserved.
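
The idea behind shred_variant and unshred_variant can be illustrated with a toy model: values that match the target type move into a `typed_value` column, and everything else stays in the fallback `value` column. The sketch below uses only the standard library; `Value`, `ShreddedRow`, `shred_as_i64`, and `unshred` are hypothetical names, and real shredding works on binary-encoded variants and nested schemas rather than on an enum.

```rust
/// Toy stand-in for a variant value (hypothetical, not the crate's `Variant`).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Str(String),
}

/// One row after shredding against an i64 target type.
#[derive(Debug, PartialEq)]
struct ShreddedRow {
    /// Fallback column: holds values that do not match the target type.
    value: Option<Value>,
    /// Typed column: holds values that do match the target type.
    typed_value: Option<i64>,
}

/// Moves every integer into `typed_value`; everything else stays in `value`.
fn shred_as_i64(rows: &[Value]) -> Vec<ShreddedRow> {
    rows.iter()
        .map(|v| match v {
            Value::Int(i) => ShreddedRow { value: None, typed_value: Some(*i) },
            other => ShreddedRow { value: Some(other.clone()), typed_value: None },
        })
        .collect()
}

/// Merges the typed column back into plain values.
fn unshred(rows: &[ShreddedRow]) -> Vec<Value> {
    rows.iter()
        .map(|row| match (&row.typed_value, &row.value) {
            (Some(i), _) => Value::Int(*i),
            (None, Some(v)) => v.clone(),
            // Simplification of this sketch: a row with neither column set
            // is treated as a null value.
            (None, None) => Value::Null,
        })
        .collect()
}
```

Shredding followed by unshredding returns the original values in this model, which is the round-trip property one would expect from the real kernels as well.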