<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
API Reference
Arrow.ArrowVector — Type

An abstract type that subtypes `AbstractVector`. Each specific arrow array type subtypes `ArrowVector`. See `BoolVector`, `Primitive`, `List`, `Map`, `FixedSizeList`, `Struct`, `DenseUnion`, `SparseUnion`, and `DictEncoded` for more details.
Arrow.BoolVector — Type

A bit-packed array type, similar to `ValidityBitmap`, but which holds boolean values, `true` or `false`.

Arrow.Compressed — Type

Represents the compressed version of an `ArrowVector`. Holds a reference to the original column. May have `Compressed` children for nested array types.

Arrow.DenseUnion — Type

An `ArrowVector` where the type of each element is one of a fixed set of types, meaning its eltype is like a Julia `Union{type1, type2, ...}`. An `Arrow.DenseUnion`, in comparison to `Arrow.SparseUnion`, stores elements in a set of arrays, one array per possible type, plus an "offsets" array, where each offset element is the index into one of the typed arrays. This allows a sort of "compression", where no extra space is used/allocated to store all the elements.
Arrow.DictEncode — Type

Arrow.DictEncode(::AbstractVector, id::Integer=nothing)

Signals that a column/array should be dictionary encoded when serialized to the arrow streaming/file format. An optional `id` number may be provided to signal that multiple columns should use the same pool when being dictionary encoded.

Arrow.DictEncoded — Type

A dictionary encoded array type (similar to a `PooledArray`). Behaves just like a normal array in most respects; internally, possible values are stored in the `encoding::DictEncoding` field, while the `indices::Vector{<:Integer}` field holds the "codes" of each element for indexing into the encoding pool. Any column/array can be dict encoded when serializing to the arrow format, either by passing the `dictencode=true` keyword argument to `Arrow.write` (which causes all columns to be dict encoded), or by wrapping individual columns/arrays in `Arrow.DictEncode(x)`.
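A minimal sketch of the two dict-encoding paths described above (column names and data here are illustrative, not from the official docs):

```julia
using Arrow

# A column with few unique values is a good candidate for dictionary encoding.
tbl = (color = ["red", "green", "red", "red", "green"], n = [1, 2, 3, 4, 5])

# Option 1: dict encode every column via the keyword argument.
Arrow.write(tempname(), tbl; dictencode=true)

# Option 2: wrap only specific columns in Arrow.DictEncode.
path = tempname()
Arrow.write(path, (color = Arrow.DictEncode(tbl.color), n = tbl.n))

# Reading back, the encoded column still behaves like a normal array.
t = Arrow.Table(path)
@assert t.color == tbl.color
```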
Arrow.DictEncoding — Type

Represents the "pool" of possible values for a `DictEncoded` array type. Whether the order of values is significant can be checked by looking at the `isOrdered` boolean field.

The `S` type parameter, while not tied directly to any field, is the signed integer "index type" of the parent `DictEncoded`. We keep track of this in the `DictEncoding` in order to validate that the length of the pool doesn't exceed the index type limit. The general workflow of writing arrow data means the initial schema will typically be based off the data in the first record batch, and subsequent record batches need to match the same schema exactly. For example, if a non-first record batch's dict encoded column were to cause a `DictEncoding` pool to overflow on unique values, a fatal error should be thrown.
Arrow.FixedSizeList — Type

An `ArrowVector` where each element is a "fixed size" list of some kind, like an `NTuple{N, T}`.

Arrow.List — Type

An `ArrowVector` where each element is a variable sized list of some kind, like an `AbstractVector` or `AbstractString`.

Arrow.Map — Type

An `ArrowVector` where each element is a "map" of some kind, like a `Dict`.

Arrow.Primitive — Type

An `ArrowVector` where each element is a "fixed size" scalar of some kind, like an integer, float, decimal, or time type.

Arrow.SparseUnion — Type

An `ArrowVector` where the type of each element is one of a fixed set of types, meaning its eltype is like a Julia `Union{type1, type2, ...}`. An `Arrow.SparseUnion`, in comparison to `Arrow.DenseUnion`, stores elements in a set of arrays, one array per possible type, and each typed array has the same length as the full array. This ends up with "wasted" space, since only one slot among the typed arrays is valid per full array element, but can allow for certain optimizations when each typed array has the same length.
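As a brief sketch of how union-typed columns interact with the `denseunions` keyword (a simplified, assumed example):

```julia
using Arrow

# A Union-eltype column; by default Arrow.write uses the dense union layout.
tbl = (x = Union{Int64, String}[1, "two", 3, "four"],)

path = tempname()
Arrow.write(path, tbl)                           # dense union layout (default)
Arrow.write(tempname(), tbl; denseunions=false)  # sparse union layout

# Both layouts round-trip the same logical values.
t = Arrow.Table(path)
@assert t.x[2] == "two"
```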
Arrow.Stream — Type

Arrow.Stream(io::IO; convert::Bool=true)
Arrow.Stream(file::String; convert::Bool=true)
Arrow.Stream(bytes::Vector{UInt8}, pos=1, len=nothing; convert::Bool=true)
Arrow.Stream(inputs::Vector; convert::Bool=true)

Start reading an arrow formatted table, from:

- `io`, bytes will be read all at once via `read(io)`
- `file`, bytes will be read via `Mmap.mmap(file)`
- `bytes`, a byte vector directly, optionally allowing specifying the starting byte position `pos` and `len`
- a `Vector` of any of the above, in which each input should be an IPC or arrow file and must match schema

Reads the initial schema message from the arrow stream/file, then returns an `Arrow.Stream` object which will iterate over record batch messages, producing an `Arrow.Table` on each iteration.

By iterating `Arrow.Table`s, `Arrow.Stream` satisfies the `Tables.partitions` interface, and as such can be passed to Tables.jl-compatible sink functions. This allows iterating over extremely large "arrow tables" in chunks represented as record batches.

Supports the `convert` keyword argument, which controls whether certain arrow primitive types will be lazily converted to more friendly Julia defaults; by default, `convert=true`.
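A minimal sketch of iterating record batches with `Arrow.Stream` (the file and column names are illustrative):

```julia
using Arrow

# Write two partitions so the stream contains two record batches.
path = tempname()
open(Arrow.Writer, path) do writer
    Arrow.write(writer, (x = [1, 2],))
    Arrow.write(writer, (x = [3, 4],))
end

# Each iteration of the Stream yields one record batch as an Arrow.Table.
total = 0
for batch in Arrow.Stream(path)
    total += sum(batch.x)
end
@assert total == 10
```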
Arrow.Struct — Type

An `ArrowVector` where each element is a "struct" of some kind with ordered, named fields, like a `NamedTuple{names, types}` or regular Julia `struct`.
Arrow.Table — Type

Arrow.Table(io::IO; convert::Bool=true)
Arrow.Table(file::String; convert::Bool=true)
Arrow.Table(bytes::Vector{UInt8}, pos=1, len=nothing; convert::Bool=true)
Arrow.Table(inputs::Vector; convert::Bool=true)

Read an arrow formatted table, from:

- `io`, bytes will be read all at once via `read(io)`
- `file`, bytes will be read via `Mmap.mmap(file)`
- `bytes`, a byte vector directly, optionally allowing specifying the starting byte position `pos` and `len`
- a `Vector` of any of the above, in which each input should be an IPC or arrow file and must match schema

Returns an `Arrow.Table` object that allows column access via `table.col1`, `table[:col1]`, or `table[1]`.

NOTE: the columns in an `Arrow.Table` are views into the original arrow memory, and hence are not easily modifiable (with e.g. `push!`, `append!`, etc.). To mutate arrow columns, call `copy(x)` to materialize the arrow data as a normal Julia array.

`Arrow.Table` also satisfies the Tables.jl interface, and so can easily be materialized via any supporting sink function: e.g. `DataFrame(Arrow.Table(file))`, `SQLite.load!(db, "table", Arrow.Table(file))`, etc.

Supports the `convert` keyword argument, which controls whether certain arrow primitive types will be lazily converted to more friendly Julia defaults; by default, `convert=true`.
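A short sketch of the access patterns described above (file path and columns are illustrative):

```julia
using Arrow

path = tempname()
Arrow.write(path, (a = [1, 2, 3], b = ["x", "y", "z"]))

t = Arrow.Table(path)
t.a        # column access by property
t[:b]      # ...or by Symbol

# Columns are views into arrow memory; copy before mutating.
a = copy(t.a)
push!(a, 4)
@assert a == [1, 2, 3, 4]
```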
Arrow.ToTimestamp — Type

Arrow.ToTimestamp(x::AbstractVector{ZonedDateTime})

Wrapper array that provides a more efficient encoding of `ZonedDateTime` elements to the arrow format. In the arrow format, timestamp columns with timezone information are encoded as the arrow equivalent of a Julia type parameter, meaning an entire column should have elements all with the same timezone. If a `ZonedDateTime` column is passed to `Arrow.write`, for correctness, it must scan each element to check each timezone. `Arrow.ToTimestamp` provides a "bypass" of this process by encoding the timezone of the first element of the `AbstractVector{ZonedDateTime}`, which in turn allows `Arrow.write` to avoid costly checking/conversion and can encode the `ZonedDateTime` as `Arrow.Timestamp` directly.
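A minimal sketch of the wrapper in use (assumes the TimeZones.jl package is available; the timezone and dates are illustrative):

```julia
using Arrow, TimeZones

# All elements share one timezone, so the per-element timezone scan
# performed by Arrow.write is unnecessary.
zdts = [ZonedDateTime(2021, 1, d, tz"America/New_York") for d in 1:3]

# ToTimestamp records the first element's timezone up front.
Arrow.write(tempname(), (ts = Arrow.ToTimestamp(zdts),))
```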
Arrow.ValidityBitmap — Type

A bit-packed array type where each bit corresponds to an element in an `ArrowVector`, indicating whether that element is "valid" (bit == 1) or not (bit == 0). Used to indicate element missingness (whether it's null).

If the null count of an array is zero, the `ValidityBitmap` will be "empty" and all elements are treated as "valid"/non-null.
Arrow.Writer — Type

Arrow.Writer{T<:IO}

An object that can be used to incrementally write Arrow partitions.

Examples

```julia
julia> writer = open(Arrow.Writer, tempname())

julia> partition1 = (col1 = [1, 2], col2 = ["A", "B"])
(col1 = [1, 2], col2 = ["A", "B"])

julia> Arrow.write(writer, partition1)

julia> partition2 = (col1 = [3, 4], col2 = ["C", "D"])
(col1 = [3, 4], col2 = ["C", "D"])

julia> Arrow.write(writer, partition2)

julia> close(writer)
```

It's also possible to automatically close the `Writer` using a do-block:

```julia
julia> open(Arrow.Writer, tempname()) do writer
           partition1 = (col1 = [1, 2], col2 = ["A", "B"])
           Arrow.write(writer, partition1)
           partition2 = (col1 = [3, 4], col2 = ["C", "D"])
           Arrow.write(writer, partition2)
       end
```
Arrow.append — Function

Arrow.append(io::IO, tbl)
Arrow.append(file::String, tbl)
tbl |> Arrow.append(file)

Append any Tables.jl-compatible `tbl` to an existing arrow formatted file or IO. The existing arrow data must be in IPC stream format. Note that appending to the "feather formatted file" is not allowed, as this file format doesn't support appending. That means files written like `Arrow.write(filename::String, tbl)` cannot be appended to; instead, you should write like `Arrow.write(filename::String, tbl; file=false)`.

When an `IO` object is provided to be written to, it must support seeking. For example, a file opened in `r+` mode or an `IOBuffer` that is readable, writable, and seekable can be appended to, but not a network stream.

Multiple record batches will be written based on the number of `Tables.partitions(tbl)` that are provided; by default, this is just one for a given table, but some table sources support automatic partitioning. Note you can turn multiple table objects into partitions by doing `Tables.partitioner([tbl1, tbl2, ...])`, but note that each table must have the exact same `Tables.Schema`.

By default, `Arrow.append` will use multiple threads to write multiple record batches simultaneously (e.g. if Julia is started with `julia -t 8` or the `JULIA_NUM_THREADS` environment variable is set).

Supported keyword arguments to `Arrow.append` include:

- `alignment::Int=8`: specify the number of bytes to align buffers to when written in messages; strongly recommended to only use alignment values of 8 or 64 for modern memory cache line optimization
- `colmetadata=nothing`: the metadata that should be written as the table's columns' `custom_metadata` fields; must either be `nothing` or an `AbstractDict` of `column_name::Symbol => column_metadata` where `column_metadata` is an iterable of `<:AbstractString` pairs
- `dictencode::Bool=false`: whether all columns should use dictionary encoding when being written; to dict encode specific columns, wrap the column/array in `Arrow.DictEncode(col)`
- `dictencodenested::Bool=false`: whether nested data type columns should also dict encode nested arrays/buffers; other language implementations may not support this
- `denseunions::Bool=true`: whether Julia `Vector{<:Union}` arrays should be written using the dense union layout; passing `false` will result in the sparse union layout
- `largelists::Bool=false`: causes list column types to be written with Int64 offset arrays; mainly for testing purposes; by default, Int64 offsets will be used only if needed
- `maxdepth::Int=6`: deepest allowed nested serialization level; this is provided by default to prevent accidental infinite recursion with mutually recursive data structures
- `metadata=Arrow.getmetadata(tbl)`: the metadata that should be written as the table's schema's `custom_metadata` field; must either be `nothing` or an iterable of `<:AbstractString` pairs
- `ntasks::Int`: number of concurrent threaded tasks to allow while writing input partitions out as arrow record batches; default is no limit; to disable multithreaded writing, pass `ntasks=1`
- `convert::Bool`: whether certain arrow primitive types in the schema of `file` should be converted to Julia defaults for matching them to the schema of `tbl`; by default, `convert=true`
- `file::Bool`: applicable when an `IO` is provided, whether it is a file; by default `file=false`
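A minimal sketch of the write-then-append workflow described above:

```julia
using Arrow

# Appending requires the IPC stream format, so write with file=false.
path = tempname()
Arrow.write(path, (x = [1, 2],); file=false)
Arrow.append(path, (x = [3, 4],))

t = Arrow.Table(path)
@assert length(t.x) == 4
```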
Arrow.arrowtype — Function

Given a `FlatBuffers.Builder` and a Julia column or column eltype, write the field.type flatbuffer definition of the eltype.
Arrow.getmetadata — Method

Arrow.getmetadata(x)

If `x isa Arrow.Table`, return a `Base.ImmutableDict{String,String}` representation of `x`'s `Schema` `custom_metadata`, or `nothing` if no such metadata exists.

If `x isa Arrow.ArrowVector`, return a `Base.ImmutableDict{String,String}` representation of `x`'s `Field` `custom_metadata`, or `nothing` if no such metadata exists.

Otherwise, return `nothing`.

See the official Arrow documentation for more details on custom application metadata.
Arrow.juliaeltype — Function

Given a flatbuffers metadata type definition (a `Field` instance from Schema.fbs), translate to the appropriate Julia storage eltype.
Arrow.write — Function

Arrow.write(io::IO, tbl)
Arrow.write(file::String, tbl)
tbl |> Arrow.write(io_or_file)

Write any Tables.jl-compatible `tbl` out as arrow formatted data. Providing an `io::IO` argument will cause the data to be written to it in the "streaming" format, unless the `file=true` keyword argument is passed. Providing a `file::String` argument will result in the "file" format being written.

Multiple record batches will be written based on the number of `Tables.partitions(tbl)` that are provided; by default, this is just one for a given table, but some table sources support automatic partitioning. Note you can turn multiple table objects into partitions by doing `Tables.partitioner([tbl1, tbl2, ...])`, but note that each table must have the exact same `Tables.Schema`.

By default, `Arrow.write` will use multiple threads to write multiple record batches simultaneously (e.g. if Julia is started with `julia -t 8` or the `JULIA_NUM_THREADS` environment variable is set).

Supported keyword arguments to `Arrow.write` include:

- `colmetadata=nothing`: the metadata that should be written as the table's columns' `custom_metadata` fields; must either be `nothing` or an `AbstractDict` of `column_name::Symbol => column_metadata` where `column_metadata` is an iterable of `<:AbstractString` pairs
- `compress`: possible values include `:lz4`, `:zstd`, or your own initialized `LZ4FrameCompressor` or `ZstdCompressor` objects; will cause all buffers in each record batch to use the respective compression encoding
- `alignment::Int=8`: specify the number of bytes to align buffers to when written in messages; strongly recommended to only use alignment values of 8 or 64 for modern memory cache line optimization
- `dictencode::Bool=false`: whether all columns should use dictionary encoding when being written; to dict encode specific columns, wrap the column/array in `Arrow.DictEncode(col)`
- `dictencodenested::Bool=false`: whether nested data type columns should also dict encode nested arrays/buffers; other language implementations may not support this
- `denseunions::Bool=true`: whether Julia `Vector{<:Union}` arrays should be written using the dense union layout; passing `false` will result in the sparse union layout
- `largelists::Bool=false`: causes list column types to be written with Int64 offset arrays; mainly for testing purposes; by default, Int64 offsets will be used only if needed
- `maxdepth::Int=6`: deepest allowed nested serialization level; this is provided by default to prevent accidental infinite recursion with mutually recursive data structures
- `metadata=Arrow.getmetadata(tbl)`: the metadata that should be written as the table's schema's `custom_metadata` field; must either be `nothing` or an iterable of `<:AbstractString` pairs
- `ntasks::Int`: number of buffered threaded tasks to allow while writing input partitions out as arrow record batches; default is no limit; for unbuffered writing, pass `ntasks=0`
- `file::Bool=false`: if an `io` argument is being written to, passing `file=true` will cause the arrow file format to be written instead of just IPC streaming
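A brief sketch of the main `Arrow.write` variants described above (data and paths are illustrative):

```julia
using Arrow

tbl = (x = collect(1:1000), y = repeat(["a", "b"], 500))

# File format on disk, with zstd-compressed buffers:
Arrow.write(tempname(), tbl; compress=:zstd)

# Streaming format into an in-memory buffer, read back from bytes:
io = IOBuffer()
Arrow.write(io, tbl)
t = Arrow.Table(take!(io))
@assert t.y[2] == "b"
```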
Internals: Arrow.FlatBuffers

The `FlatBuffers` module is not part of Arrow.jl's public API, and these functions may change without notice.
Arrow.FlatBuffers.Scalar — Type

Scalar: a Union of the Julia types `T <: Number` that are allowed in FlatBuffers schema.

Arrow.FlatBuffers.Builder — Type

Builder is a state machine for creating FlatBuffer objects. Use a Builder to construct object(s) starting from leaf nodes.

A Builder constructs byte buffers in a last-first manner for simplicity and performance.
Arrow.FlatBuffers.Table — Type

The object containing the flatbuffer and positional information specific to the table. The `vtable` containing the offsets for specific members precedes `pos`. The actual values in the table follow `pos` offset and size of the vtable.

- `bytes::Vector{UInt8}`: the flatbuffer itself
- `pos::Integer`: the base position in `bytes` of the table
Arrow.FlatBuffers.createstring! — Method

`createstring!` writes a null-terminated string as a vector.

Arrow.FlatBuffers.endobject! — Method

`endobject!` writes data necessary to finish object construction.

Arrow.FlatBuffers.endvector! — Method

`endvector!` writes data necessary to finish vector construction.

Arrow.FlatBuffers.finish! — Method

`finish!` finalizes a buffer, pointing to the given root `Table`.

Arrow.FlatBuffers.finishedbytes — Method

`finishedbytes` returns a pointer to the written data in the byte buffer. Errors if the builder is not in a finished state (a finished state is reached by calling `finish!()`).
Arrow.FlatBuffers.getoffsetslot — Method

`getoffsetslot` retrieves the `VOffsetT` that the given vtable location points to. If the vtable value is zero, the default value `d` will be returned.

Arrow.FlatBuffers.getslot — Method

`getslot` retrieves the `T` that the given vtable location points to. If the vtable value is zero, the default value `d` will be returned.

Arrow.FlatBuffers.indirect — Method

`indirect` retrieves the relative offset stored at `offset`.
Arrow.FlatBuffers.offset — Method

`offset` provides access into the Table's vtable.

Deprecated fields are ignored by checking against the vtable's length.

Arrow.FlatBuffers.place! — Method

`place!` prepends a `T` to the Builder, without checking for space.

Arrow.FlatBuffers.prep! — Method

`prep!` prepares to write an element of `size` after `additionalbytes` have been written, e.g. if you write a string, you need to align such that the int length field is aligned to `sizeof(Int32)`, and the string data follows it directly. If all you need to do is align, `additionalbytes` will be 0.
Arrow.FlatBuffers.prependslot! — Method

`prependslot!` prepends a `T` onto the object at vtable slot `o`. If value `x` equals default `d`, then the slot will be set to zero and no other data will be written.

Arrow.FlatBuffers.prependstructslot! — Method

`prependstructslot!` prepends a struct onto the object at vtable slot `o`. Structs are stored inline, so nothing additional is being added. In generated code, `d` is always 0.

Arrow.FlatBuffers.slot! — Method

`slot!` sets the vtable key `voffset` to the current location in the buffer.

Arrow.FlatBuffers.startvector! — Method

`startvector!` initializes bookkeeping for writing a new vector.

A vector has the following format: <UOffsetT: number of elements in this vector> <T: data>+, where T is the type of the elements of this vector.
Arrow.FlatBuffers.vector — Method

`vector` retrieves the start of data of the vector whose offset is stored at `off` in this object.

Arrow.FlatBuffers.vectorlen — Method

`vectorlen` retrieves the length of the vector whose offset is stored at `off` in this object.

Arrow.FlatBuffers.vtableEqual — Method

`vtableEqual` compares an unwritten vtable to a written vtable.

Arrow.FlatBuffers.writevtable! — Method

`writevtable!` serializes the vtable for the current object, if applicable.

Before writing out the vtable, this checks pre-existing vtables for equality to this one. If an equal vtable is found, point the object to the existing vtable and return.

Because vtable values are sensitive to alignment of object data, not all logically-equal vtables will be deduplicated.

A vtable has the following format: <VOffsetT: size of the vtable in bytes, including this value> <VOffsetT: size of the object in bytes, including the vtable offset> <VOffsetT: offset for a field> * N, where N is the number of fields in the schema for this type. Includes deprecated fields. Thus, a vtable is made of 2 + N elements, each SizeVOffsetT bytes wide.

An object has the following format: <SOffsetT: offset to this object's vtable (may be negative)> <byte: data>+