Project and Product names using “Apache Arrow”
Organizations creating products and projects for use with Apache Arrow, along with associated marketing materials, should take care to respect the trademark in “Apache Arrow” and its logo. Please refer to ASF Trademarks Guidance and associated FAQ for comprehensive and authoritative guidance on proper usage of ASF trademarks.
Names that do not include “Apache Arrow” at all have no potential trademark issue with the Apache Arrow project. This is recommended.
Names like “Apache Arrow BigCoProduct” are not OK, nor are names that include “Apache Arrow” in general. The ASF guidance referenced above, however, describes some exceptions, such as “BigCoProduct, powered by Apache Arrow” or “BigCoProduct for Apache Arrow”.
It is common practice to create software identifiers (Maven coordinates, module names, etc.) like “arrow-foo”. These are permitted. Nominative use of trademarks in descriptions is also always allowed, as in “BigCoProduct is a widget for Apache Arrow”.
To add your organization to the list below, please open a pull request adding your organization name, URL, a list of the Arrow components you are using, and a short description of your use case. The existing entries serve as examples.
- Apache Parquet: A columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. The C++ and Java implementations provide vectorized reads and writes to/from Arrow data structures.
- Apache Spark: Apache Spark™ is a fast and general engine for large-scale data processing. Spark uses Apache Arrow to
  - improve performance of conversion between Spark DataFrame and pandas DataFrame
  - enable a set of vectorized user-defined functions (pandas_udf) in PySpark.
- Dask: Python library for parallel and distributed execution of dynamic task graphs. Dask supports using pyarrow for accessing Parquet files.
- Dremio: A self-service data platform. Dremio makes it easy for users to discover, curate, accelerate, and share data from any source. It includes a distributed SQL execution engine based on Apache Arrow. Dremio reads data from any source (RDBMS, HDFS, S3, NoSQL) into Arrow buffers, and provides fast SQL access via ODBC, JDBC, and REST for BI, Python, R, and more (all backed by Apache Arrow).
- Fletcher: Fletcher is an FPGA acceleration framework that can convert an Arrow schema into an easy-to-use hardware interface. The accelerator can request data from Arrow tables by supplying row indices. In turn, the interface provides streams of data of the types defined through the schema. Furthermore, Arrow alleviates serialization bottlenecks.
- GeoMesa: A suite of tools that enables large-scale geospatial query and analytics on distributed computing systems. GeoMesa supports query results in the Arrow IPC format, which can then be used for in-browser visualizations and/or further analytics.
- GOAI: Open GPU-Accelerated Analytics Initiative for Arrow-powered analytics across GPU tools and vendors.
- Graphistry: Supercharged Visual Investigation Platform used by teams for security, anti-fraud, and related investigations. The Graphistry team uses Arrow in its NodeJS GPU backend and client libraries, and is an early contributing member to GOAI and Arrow[JS] focused on bringing these technologies to the enterprise.
- InAccel: A machine learning acceleration framework which leverages FPGAs-as-a-service. InAccel supports dataframes backed by Apache Arrow as input for its ML algorithms. The FPGAs can access those dataframes with a single DMA operation through a shared-memory communication scheme.
- libgdf: A C library of CUDA-based analytics functions and GPU IPC support for structured data. Uses the Arrow IPC format and targets the Arrow memory layout in its analytic functions. This work is part of the GPU Open Analytics Initiative.
- OmniSci (formerly MapD): In-memory columnar SQL engine designed to run on both GPUs and CPUs. OmniSci supports Arrow for data ingest and data interchange via CUDA IPC handles. This work is part of the GPU Open Analytics Initiative.
- pandas: data analysis toolkit for Python programmers. pandas supports reading and writing Parquet files using pyarrow. Several pandas core developers are also contributors to Apache Arrow.
- Petastorm: Petastorm enables single machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. It can also be used from pure Python code.
- Quilt Data: Quilt is a data package manager, designed to make managing data as easy as managing code. It supports Parquet format via pyarrow for data access.
- Ray: A flexible, high-performance distributed execution framework with a focus on machine learning and AI applications. Uses Arrow to efficiently store Python data structures containing large arrays of numerical data. Multiple processes can access the data without copying it through the Plasma shared-memory object store, which originated in Ray and is now part of Arrow.
- Red Data Tools: A project that provides data processing tools for Ruby. Its core library is Red Arrow, the Ruby bindings for Apache Arrow, which are based on Apache Arrow GLib. The project also provides many Ruby libraries, all built on Red Arrow, that integrate existing Ruby libraries with Apache Arrow.
- SciDB: Paradigm4’s SciDB is a scalable, scientific database management system that helps researchers integrate and analyze diverse, multi-dimensional, high resolution data - like genomic, clinical, images, sensor, environmental, and IoT data - all in one analytical platform. SciDB streaming and accelerated_io_tools are powered by Apache Arrow.
- Turbodbc: Python module to access relational databases via the Open Database Connectivity (ODBC) interface. It provides the ability to return Arrow Tables and RecordBatches in addition to the Python Database API Specification 2.0.
- FASTDATA.io: Plasma Engine (unrelated to Arrow’s Plasma In-Memory Object Store) exploits the massive parallel processing power of GPUs for stream and batch processing. It supports Arrow as input and output, uses Arrow internally to maximize performance, and can be used with existing Apache Spark™ APIs.