PyArrow - Apache Arrow Python bindings

This is the documentation of the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process and move data fast.

See the parent documentation for additional details on the Arrow Project itself, on the Arrow format and the other language bindings.

The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.

Here we will detail the usage of the Python API for Arrow and the leaf libraries that add functionality such as reading Apache Parquet files into Arrow structures.

Installing PyArrow
- System Compatibility
- Python Compatibility
- Using Conda
- Using Pip
- Installing from source
- Installing Nightly Packages
Getting Started
- Creating Arrays and Tables
- Saving and Loading Tables
- Performing Computations
- Working with large data
- Continuing from here
Data Types and In-Memory Data Model
- Type Metadata
- Schemas
- Arrays
- Record Batches
- Tables
- Custom Schema and Field Metadata
Compute Functions
- Standard Compute Functions
- Grouped Aggregations
Memory and IO Interfaces
- Referencing and Allocating Memory
- Input and Output
Streaming, Serialization, and IPC
- Writing and Reading Streams
- Efficiently Writing and Reading Arrow Data
- Arbitrary Object Serialization
Filesystem Interface
- Usage
- Local FS
- S3
- Hadoop Distributed File System (HDFS)
- Using fsspec-compatible filesystems with Arrow
- Using Arrow filesystems with fsspec
Filesystem Interface (legacy)
- Hadoop File System (HDFS)
The Plasma In-Memory Object Store
- The Plasma API
- Using Arrow and Pandas with Plasma
- Using Plasma with Huge Pages
NumPy Integration
- NumPy to Arrow
- Arrow to NumPy
Pandas Integration
- DataFrames
- Series
- Handling pandas Indexes
- Type differences
- Nullable types
- Memory Usage and Zero Copy
Timestamps
- Arrow/Pandas Timestamps
- Timestamp Conversions
Reading and Writing CSV files
- Usage
- Customized parsing
- Customized conversion
- Incremental reading
- Character encoding
- Customized writing
- Incremental writing
- Performance
Feather File Format
- Using Compression
- Writing Version 1 (V1) Files
Reading JSON files
- Usage
- Automatic Type Inference
- Customized parsing
Reading and Writing the Apache Parquet Format
- Obtaining pyarrow with Parquet Support
- Reading and Writing Single Files
- Finer-grained Reading and Writing
- Inspecting the Parquet File Metadata
- Data Type Handling
- Compression, Encoding, and File Compatibility
- Partitioned Datasets (Multiple Files)
- Writing to Partitioned Datasets
- Reading from Partitioned Datasets
- Using with Spark
- Multithreaded Reads
- Reading from cloud storage
Tabular Datasets
- Reading Datasets
- Filtering data
- Projecting columns
- Reading partitioned data
- Reading from cloud storage
- Reading from Minio
- Working with Parquet Datasets
- Manual specification of the Dataset
- Iterative (out of core or streaming) reads
- A note on transactions & ACID guarantees
- Writing Datasets
Extending pyarrow
- Controlling conversion to pyarrow.Array with the __arrow_array__ protocol
- Defining extension types (“user-defined types”)
PyArrow Integrations
- Integrating PyArrow with R
- Using pyarrow from C++ and Cython Code
- CUDA Integration
API Reference
- Data Types and Schemas
- Arrays and Scalars
- Buffers and Memory
- Compute Functions
- Streams and File Access
- Tables and Tensors
- Serialization and IPC
- Arrow Flight
- Tabular File Formats
- Filesystems
- Dataset
- Plasma In-Memory Object Store
- CUDA Integration
- Miscellaneous
Getting Involved
Benchmarks
- Running the benchmarks
- Running for arbitrary Git revisions
- Compatibility
