datafusion.DataFrame

class datafusion.DataFrame

Bases: object

A PyDataFrame is a representation of a logical plan and an API for composing statements. Use it to build up a plan, then call .collect() to execute the plan and collect the results. The actual execution of a plan runs natively in Rust with Arrow, in a multi-threaded environment.

__init__()

Methods

__init__()

aggregate(group_by, aggs)

cache()

Cache DataFrame.

collect()

Executes the plan, returning a list of `RecordBatch`es. Unless some order is specified in the plan, there is no guarantee of the order of the result.

collect_partitioned()

Executes this DataFrame and collects all results into a list of lists of RecordBatch, maintaining the input partitioning.

count()

describe()

Calculate summary statistics for a DataFrame

distinct()

Filter out duplicate rows

except_all(py_df)

Calculate the set difference (EXCEPT ALL) of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.

execution_plan()

Get the execution plan for this DataFrame

explain([verbose, analyze])

Print the query plan

filter(predicate)

intersect(py_df)

Calculate the intersection of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.

join(right, join_keys, how)

limit(count[, offset])

logical_plan()

Get the logical plan for this DataFrame

optimized_logical_plan()

Get the optimized logical plan for this DataFrame

repartition(num)

Repartition a DataFrame based on a logical partitioning scheme.

repartition_by_hash(*args, num)

Repartition a DataFrame based on a hash partitioning scheme over the given expressions.

schema()

Returns the schema from the logical plan

select(*args)

select_columns(*args)

show([num])

Print the result, 20 lines by default

sort(*exprs)

to_arrow_table()

Convert to an Arrow Table: collect the batches and combine them into a pyarrow `Table`.

to_pandas()

Convert to a pandas DataFrame using pyarrow: collect the batches, build an Arrow Table, then convert it to a pandas DataFrame.

to_polars()

Convert to a polars DataFrame using pyarrow: collect the batches, build an Arrow Table, then convert it to a polars DataFrame.

to_pydict()

Convert to a Python dictionary using pyarrow. Each dictionary key is a column name, and each value is a list of that column's values.

to_pylist()

Convert to a Python list using pyarrow. Each list item represents one row, encoded as a dictionary.

union(py_df[, distinct])

Calculate the union of two `DataFrame`s, preserving duplicate rows. The two `DataFrame`s must have exactly the same schema.

union_distinct(py_df)

Calculate the distinct union of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.

with_column(name, expr)

with_column_renamed(old_name, new_name)

Rename one column by applying a new projection.

write_csv(path)

Write a DataFrame to a CSV file.

write_json(path)

Executes a query and writes the results to a partitioned JSON file.

write_parquet(path[, compression, ...])

Write a DataFrame to a Parquet file.

aggregate(group_by, aggs)
cache()

Cache DataFrame.

collect()

Executes the plan, returning a list of `RecordBatch`es. Unless some order is specified in the plan, there is no guarantee of the order of the result.

collect_partitioned()

Executes this DataFrame and collects all results into a list of lists of RecordBatch, maintaining the input partitioning.

count()
describe()

Calculate summary statistics for a DataFrame

distinct()

Filter out duplicate rows

except_all(py_df)

Calculate the set difference (EXCEPT ALL) of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.

execution_plan()

Get the execution plan for this DataFrame

explain(verbose=False, analyze=False)

Print the query plan

filter(predicate)
intersect(py_df)

Calculate the intersection of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema

join(right, join_keys, how)
limit(count, offset=0)
logical_plan()

Get the logical plan for this DataFrame

optimized_logical_plan()

Get the optimized logical plan for this DataFrame

repartition(num)

Repartition a DataFrame based on a logical partitioning scheme.

repartition_by_hash(*args, num)

Repartition a DataFrame based on a hash partitioning scheme over the given expressions.

schema()

Returns the schema from the logical plan

select(*args)
select_columns(*args)
show(num=20)

Print the result, 20 lines by default

sort(*exprs)
to_arrow_table()

Convert to an Arrow Table: collect the batches and combine them into a pyarrow `Table`.

to_pandas()

Convert to a pandas DataFrame using pyarrow: collect the batches, build an Arrow Table, then convert it to a pandas DataFrame.

to_polars()

Convert to a polars DataFrame using pyarrow: collect the batches, build an Arrow Table, then convert it to a polars DataFrame.

to_pydict()

Convert to a Python dictionary using pyarrow. Each dictionary key is a column name, and each value is a list of that column's values.

to_pylist()

Convert to a Python list using pyarrow. Each list item represents one row, encoded as a dictionary.

union(py_df, distinct=False)

Calculate the union of two `DataFrame`s, preserving duplicate rows. The two `DataFrame`s must have exactly the same schema.

union_distinct(py_df)

Calculate the distinct union of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema

with_column(name, expr)
with_column_renamed(old_name, new_name)

Rename one column by applying a new projection. This is a no-op if the column to be renamed does not exist.

write_csv(path)

Write a DataFrame to a CSV file.

write_json(path)

Executes a query and writes the results to a partitioned JSON file.

write_parquet(path, compression='uncompressed', compression_level=None)

Write a DataFrame to a Parquet file.