pyarrow.record_batch

pyarrow.record_batch(data, names=None, schema=None, metadata=None)

Create a pyarrow.RecordBatch from another Python data structure or sequence of arrays.

Parameters:
data : dict, list, pandas.DataFrame, Arrow-compatible table

A mapping of strings to Arrays or Python lists, a list of Arrays, a pandas DataFrame, or any tabular object that implements the Arrow PyCapsule Protocol (that is, has an __arrow_c_array__ method).

names : list, default None

Column names, used when a list of arrays is passed as data. Mutually exclusive with the ‘schema’ argument.

schema : Schema, default None

The expected schema of the RecordBatch. If not passed, it is inferred from the data. Mutually exclusive with the ‘names’ argument.

metadata : dict or Mapping, default None

Optional metadata for the schema (used only when schema is not passed).

Returns:
RecordBatch

Examples

>>> import pyarrow as pa
>>> n_legs = pa.array([2, 2, 4, 4, 5, 100])
>>> animals = pa.array(["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"])
>>> names = ["n_legs", "animals"]

Creating a RecordBatch from a Python dictionary:

>>> pa.record_batch({"n_legs": n_legs, "animals": animals})
pyarrow.RecordBatch
n_legs: int64
animals: string
----
n_legs: [2,2,4,4,5,100]
animals: ["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]
>>> pa.record_batch({"n_legs": n_legs, "animals": animals}).to_pandas()
   n_legs        animals
0       2       Flamingo
1       2         Parrot
2       4            Dog
3       4          Horse
4       5  Brittle stars
5     100      Centipede

Creating a RecordBatch from a list of arrays with names:

>>> pa.record_batch([n_legs, animals], names=names)
pyarrow.RecordBatch
n_legs: int64
animals: string
----
n_legs: [2,2,4,4,5,100]
animals: ["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]

Creating a RecordBatch from a list of arrays with names and metadata:

>>> my_metadata = {"n_legs": "How many legs does an animal have?"}
>>> pa.record_batch([n_legs, animals],
...                 names=names,
...                 metadata=my_metadata)
pyarrow.RecordBatch
n_legs: int64
animals: string
----
n_legs: [2,2,4,4,5,100]
animals: ["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]
>>> pa.record_batch([n_legs, animals],
...                 names=names,
...                 metadata=my_metadata).schema
n_legs: int64
animals: string
-- schema metadata --
n_legs: 'How many legs does an animal have?'

Creating a RecordBatch from a pandas DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022],
...                    'month': [3, 5, 7, 9],
...                    'day': [1, 5, 9, 13],
...                    'n_legs': [2, 4, 5, 100],
...                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
>>> pa.record_batch(df)
pyarrow.RecordBatch
year: int64
month: int64
day: int64
n_legs: int64
animals: string
----
year: [2020,2022,2021,2022]
month: [3,5,7,9]
day: [1,5,9,13]
n_legs: [2,4,5,100]
animals: ["Flamingo","Horse","Brittle stars","Centipede"]
>>> pa.record_batch(df).to_pandas()
   year  month  day  n_legs        animals
0  2020      3    1       2       Flamingo
1  2022      5    5       4          Horse
2  2021      7    9       5  Brittle stars
3  2022      9   13     100      Centipede

Creating a RecordBatch from a pandas DataFrame with schema:

>>> my_schema = pa.schema([
...     pa.field('n_legs', pa.int64()),
...     pa.field('animals', pa.string())],
...     metadata={"n_legs": "Number of legs per animal"})
>>> pa.record_batch(df, schema=my_schema).schema
n_legs: int64
animals: string
-- schema metadata --
n_legs: 'Number of legs per animal'
pandas: ...
>>> pa.record_batch(df, schema=my_schema).to_pandas()
   n_legs        animals
0       2       Flamingo
1       4          Horse
2       5  Brittle stars
3     100      Centipede