Python

PyArrow - Apache Arrow Python bindings

This is the documentation of the Python API of Apache Arrow.

Apache Arrow is a universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics. It contains a set of technologies that enable data systems to efficiently store, process, and move data.

See the parent documentation for additional details on the Arrow Project itself, on the Arrow format and the other language bindings.

The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.

This documentation covers the usage of the Python API for Arrow and the leaf libraries that add functionality such as reading Apache Parquet files into Arrow structures.

Installing PyArrow
- System Compatibility
- Python Compatibility
- Using Conda
- Using Pip
- Installing nightly packages or from source
- Dependencies
- Differences between conda-forge packages
Getting Started
- Creating Arrays and Tables
- Saving and Loading Tables
- Performing Computations
- Working with large data
- Continuing from here
Data Types and In-Memory Data Model
- Type Metadata
- Schemas
- Arrays
- Record Batches
- Tables
- Custom Schema and Field Metadata
- Record Batch Readers
- Conversion of RecordBatch to Tensor
Compute Functions
- Standard Compute Functions
- Grouped Aggregations
- Table and Dataset Joins
- Filtering by Expressions
- User-Defined Functions
Memory and IO Interfaces
- Referencing and Allocating Memory
- Input and Output
Streaming, Serialization, and IPC
- Writing and Reading Streams
- Efficiently Writing and Reading Arrow Data
Filesystem Interface
- Usage
- Local FS
- S3
- Google Cloud Storage File System
- Hadoop Distributed File System (HDFS)
- Azure Storage File System
- Using fsspec-compatible filesystems with Arrow
- Using Arrow filesystems with fsspec
NumPy Integration
- NumPy to Arrow
- Arrow to NumPy
Pandas Integration
- DataFrames
- Series
- Handling pandas Indexes
- Type differences
- Nullable types
- Memory Usage and Zero Copy
Dataframe Interchange Protocol
- From PyArrow to other libraries: __dataframe__() method
- From other libraries to PyArrow: from_dataframe()
The DLPack Protocol
- Implementation of DLPack in PyArrow
- Examples
Timestamps
- Arrow/Pandas Timestamps
- Timestamp Conversions
Reading and Writing the Apache ORC Format
- Obtaining pyarrow with ORC Support
- Reading and Writing Single Files
- Finer-grained Reading and Writing
- Compression
- Reading from cloud storage
Reading and Writing CSV files
- Usage
- Customized parsing
- Customized conversion
- Incremental reading
- Character encoding
- Customized writing
- Incremental writing
- Performance
Feather File Format
- Using Compression
- Writing Version 1 (V1) Files
Reading JSON files
- Usage
- Automatic Type Inference
- Customized parsing
- Incremental reading
Reading and Writing the Apache Parquet Format
- Obtaining pyarrow with Parquet Support
- Reading and Writing Single Files
- Finer-grained Reading and Writing
- Inspecting the Parquet File Metadata
- Data Type Handling
- Compression, Encoding, and File Compatibility
- Partitioned Datasets (Multiple Files)
- Writing to Partitioned Datasets
- Reading from Partitioned Datasets
- Using with Spark
- Multithreaded Reads
- Reading from cloud storage
- Parquet Modular Encryption (Columnar Encryption)
Tabular Datasets
- Reading Datasets
- Filtering data
- Projecting columns
- Reading partitioned data
- Reading from cloud storage
- Reading from Minio
- Working with Parquet Datasets
- Manual specification of the Dataset
- Iterative (out of core or streaming) reads
- A note on transactions & ACID guarantees
- Writing Datasets
Arrow Flight RPC
- Writing a Flight Service
- Using the Flight Client
- Cancellation and Timeouts
- Enabling TLS
- Enabling Authentication
- Custom Middleware
- Flight best practices
Extending pyarrow
- Controlling conversion to (Py)Arrow with the PyCapsule Interface
- Controlling conversion to pyarrow.Array with the __arrow_array__ protocol
- Defining extension types (“user-defined types”)
PyArrow Integrations
- Substrait
- Integrating PyArrow with R
- Integrating PyArrow with Java
- Using pyarrow from C++ and Cython Code
- CUDA Integration
Environment Variables
API Reference
- Data Types and Schemas
- Arrays and Scalars
- Buffers and Memory
- Tables and Tensors
- Compute Functions
- Acero - Streaming Execution Engine
- Substrait
- Streams and File Access
- Serialization and IPC
- Arrow Flight
- Tabular File Formats
- Filesystems
- Dataset
- CUDA Integration
- Miscellaneous
Getting Involved
- PyArrow Architecture
Benchmarks
- Running the benchmarks
- Running for arbitrary Git revisions
- Compatibility
Python cookbook