Reading and Writing the Apache Parquet Format

The Apache Parquet project provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop with systems like Apache Drill, Apache Hive, Apache Impala (incubating), and Apache Spark adopting it as a shared standard for high performance data IO.

Apache Arrow is an ideal in-memory transport layer for data that is being read or written with Parquet files. We have been concurrently developing the C++ implementation of Apache Parquet, which includes a native, multithreaded C++ adapter to and from in-memory Arrow data. PyArrow includes Python bindings to this code, which enables reading and writing Parquet files with pandas as well.

Obtaining PyArrow with Parquet Support

If you installed pyarrow with pip or conda, it should be built with Parquet support bundled:

In [1]: import pyarrow.parquet as pq

If you are building pyarrow from source, you must also build parquet-cpp and enable the Parquet extensions when building pyarrow. See the Development page for more details.

Reading and Writing Single Files

The functions read_table() and write_table() read and write pyarrow.Table objects, respectively.

Let’s look at a simple table:

In [2]: import numpy as np

In [3]: import pandas as pd

In [4]: import pyarrow as pa

In [5]: df = pd.DataFrame({'one': [-1, np.nan, 2.5],
   ...:                    'two': ['foo', 'bar', 'baz'],
   ...:                    'three': [True, False, True]})
   ...: 

In [6]: table = pa.Table.from_pandas(df)

We write this to Parquet format with write_table:

In [7]: import pyarrow.parquet as pq

In [8]: pq.write_table(table, 'example.parquet')

This creates a single Parquet file. In practice, a Parquet dataset may consist of many files in many directories. We can read a single file back with read_table:

In [9]: table2 = pq.read_table('example.parquet')

In [10]: table2.to_pandas()
Out[10]: 
   one  three  two
0 -1.0   True  foo
1  NaN  False  bar
2  2.5   True  baz

You can pass a subset of columns to read, which can be much faster than reading the whole file (due to the columnar layout):

In [11]: pq.read_table('example.parquet', columns=['one', 'three'])
Out[11]: 
pyarrow.Table
one: double
three: bool
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "pandas_type": "string", "numpy_type": "object", "met'
            b'adata": null}], "columns": [{"name": "one", "pandas_type": "floa'
            b't64", "numpy_type": "float64", "metadata": null}, {"name": "thre'
            b'e", "pandas_type": "bool", "numpy_type": "bool", "metadata": nul'
            b'l}, {"name": "two", "pandas_type": "unicode", "numpy_type": "obj'
            b'ect", "metadata": null}, {"name": null, "pandas_type": "int64", '
            b'"numpy_type": "int64", "metadata": null}], "pandas_version": "0.'
            b'21.0"}'}

We need not use a string to specify the origin of the file. It can be any of:

  • A file path as a string
  • A NativeFile from PyArrow
  • A Python file object

In general, a Python file object will have the worst read performance, while a string file path or an instance of NativeFile (especially memory maps) will perform the best.
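
For illustration, here is a rough sketch of reading the same file through a memory-mapped NativeFile and through a regular Python file object (the variable names are arbitrary):

# These snippets assume the imports and the example.parquet file from above
import pyarrow as pa
import pyarrow.parquet as pq

# A memory-mapped NativeFile avoids extra copies and Python-level I/O overhead
source = pa.memory_map('example.parquet', 'r')
table_from_mmap = pq.read_table(source)

# A regular Python file object also works, but is generally the slowest option
with open('example.parquet', 'rb') as f:
    table_from_file_obj = pq.read_table(f)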

Finer-grained Reading and Writing

read_table uses the ParquetFile class, which has other features:

In [12]: parquet_file = pq.ParquetFile('example.parquet')

In [13]: parquet_file.metadata
Out[13]: 
<pyarrow._parquet.FileMetaData object at 0x7f7f6480cef8>
  created_by: parquet-cpp version 1.3.2-SNAPSHOT
  num_columns: 4
  num_rows: 3
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 938

In [14]: parquet_file.schema
Out[14]: 
<pyarrow._parquet.ParquetSchema object at 0x7f7f650b74c8>
one: DOUBLE
three: BOOLEAN
two: BYTE_ARRAY UTF8
__index_level_0__: INT64

As described in the Apache Parquet format specification, a Parquet file consists of multiple row groups. read_table will read all of the row groups and concatenate them into a single table. You can read individual row groups with read_row_group:

In [15]: parquet_file.num_row_groups
Out[15]: 1

In [16]: parquet_file.read_row_group(0)
Out[16]: 
pyarrow.Table
one: double
three: bool
two: string
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "pandas_type": "string", "numpy_type": "object", "met'
            b'adata": null}], "columns": [{"name": "one", "pandas_type": "floa'
            b't64", "numpy_type": "float64", "metadata": null}, {"name": "thre'
            b'e", "pandas_type": "bool", "numpy_type": "bool", "metadata": nul'
            b'l}, {"name": "two", "pandas_type": "unicode", "numpy_type": "obj'
            b'ect", "metadata": null}, {"name": null, "pandas_type": "int64", '
            b'"numpy_type": "int64", "metadata": null}], "pandas_version": "0.'
            b'21.0"}'}

We can similarly write a Parquet file with multiple row groups by using ParquetWriter:

In [17]: writer = pq.ParquetWriter('example2.parquet', table.schema)

In [18]: for i in range(3):
   ....:     writer.write_table(table)
   ....: 

In [19]: writer.close()

In [20]: pf2 = pq.ParquetFile('example2.parquet')

In [21]: pf2.num_row_groups
Out[21]: 3

Compression, Encoding, and File Compatibility

The most commonly used Parquet implementations use dictionary encoding when writing files; if the dictionaries grow too large, then they “fall back” to plain encoding. Whether dictionary encoding is used can be toggled using the use_dictionary option:

pq.write_table(table, where, use_dictionary=False)

The data pages within a column in a row group can be compressed after the encoding passes (dictionary, RLE encoding). In PyArrow we use Snappy compression by default, but Brotli, Gzip, and uncompressed are also supported:

pq.write_table(table, where, compression='snappy')
pq.write_table(table, where, compression='gzip')
pq.write_table(table, where, compression='brotli')
pq.write_table(table, where, compression='none')

Snappy generally results in better performance, while Gzip may yield smaller files.
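
As a quick, informal illustration (actual sizes depend entirely on the data), one could compare the file sizes produced by each codec for the small table above; the example_* file names here are only for this sketch:

import os

for codec in ['snappy', 'gzip', 'brotli', 'none']:
    path = 'example_{0}.parquet'.format(codec)
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path))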

These settings can also be set on a per-column basis:

pq.write_table(table, where, compression={'foo': 'snappy', 'bar': 'gzip'},
               use_dictionary=['foo', 'bar'])

Reading Multiple Files and Partitioned Datasets

Multiple Parquet files together constitute a Parquet dataset. These may be presented in a number of ways:

  • A list of absolute Parquet file paths
  • A directory name containing nested directories defining a partitioned dataset

A dataset partitioned by year and month may look like this on disk:

dataset_name/
  year=2007/
    month=01/
       0.parq
       1.parq
       ...
    month=02/
       0.parq
       1.parq
       ...
    month=03/
    ...
  year=2008/
    month=01/
    ...
  ...

The ParquetDataset class accepts either a directory name or a list of file paths, and can discover and infer some common partition structures, such as those produced by Hive:

dataset = pq.ParquetDataset('dataset_name/')
table = dataset.read()
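
A dataset can also be constructed from an explicit list of file paths, for example two of the files from the layout sketched above:

# Reading an explicit list of files instead of a directory
dataset = pq.ParquetDataset(['dataset_name/year=2007/month=01/0.parq',
                             'dataset_name/year=2007/month=01/1.parq'])
table = dataset.read()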

Using with Spark

Spark places some constraints on the types of Parquet files it will read. The option flavor='spark' will set these options automatically and also sanitize field characters unsupported by Spark SQL.
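
For example, writing the table from earlier in a Spark-compatible way could look like this (the output file name is only illustrative):

# flavor='spark' sanitizes field names and applies Spark-compatible settings
pq.write_table(table, 'example_spark.parquet', flavor='spark')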

Multithreaded Reads

Each of the reading functions has an nthreads argument which reads columns with the indicated level of parallelism. Depending on the speed of IO and how expensive it is to decode the columns in a particular file (particularly with GZIP compression), this can yield significantly higher data throughput:

pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)