Apache Arrow
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
The project is developing a multi-language collection of libraries for solving systems problems related to in-memory analytical data processing. This includes such topics as:
- Zero-copy shared memory and RPC-based data movement
- Reading and writing file formats (like CSV, Apache ORC, and Apache Parquet)
- In-memory analytics and query processing
- Implementation Status
- C/GLib
- C++
- C#
- Go
- Java
- JavaScript
- MATLAB
- Python
  - Installing PyArrow
  - Memory and IO Interfaces
  - Data Types and In-Memory Data Model
  - Compute Functions
  - Streaming, Serialization, and IPC
  - Filesystem Interface
  - Filesystem Interface (legacy)
  - The Plasma In-Memory Object Store
  - NumPy Integration
  - Pandas Integration
  - Timestamps
  - Reading CSV files
  - Feather File Format
  - Reading JSON files
  - Reading and Writing the Apache Parquet Format
  - Tabular Datasets
  - CUDA Integration
  - Extending pyarrow
  - Using pyarrow from C++ and Cython Code
  - API Reference
  - Getting Involved
  - Benchmarks
- R
- Ruby
- Rust