datafusion.DataFrame

class datafusion.DataFrame

Bases: object
A DataFrame is a representation of a logical plan together with an API for composing statements. Use it to build up a plan, then call .collect() to execute the plan and collect the result. The actual execution of a plan runs natively in Rust on Arrow, in a multi-threaded environment.
- __init__()
Methods

__init__()
aggregate(group_by, aggs)
cache(): Cache DataFrame.
collect(): Executes the plan, returning a list of `RecordBatch`es. Unless some order is specified in the plan, there is no guarantee of the order of the result.
collect_partitioned(): Executes this DataFrame and collects all results into a vector of vectors of `RecordBatch`, maintaining the input partitioning.
count()
describe(): Calculate summary statistics for a DataFrame.
distinct(): Filter out duplicate rows.
except_all(py_df): Calculate the exception of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.
execution_plan(): Get the execution plan for this DataFrame.
explain([verbose, analyze]): Print the query plan.
filter(predicate)
intersect(py_df): Calculate the intersection of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.
join(right, join_keys, how)
limit(count[, offset])
logical_plan(): Get the logical plan for this DataFrame.
optimized_logical_plan(): Get the optimized logical plan for this DataFrame.
repartition(num): Repartition a DataFrame based on a logical partitioning scheme.
repartition_by_hash(*args, num): Repartition a DataFrame based on the hash of the given expressions.
schema(): Returns the schema from the logical plan.
select(*args)
select_columns(*args)
show([num]): Print the result, 20 lines by default.
sort(*exprs)
to_arrow_table(): Convert to an Arrow Table.
to_pandas(): Convert to a pandas DataFrame with pyarrow.
to_polars(): Convert to a polars DataFrame with pyarrow.
to_pydict(): Convert to a Python dictionary using pyarrow.
to_pylist(): Convert to a Python list using pyarrow.
union(py_df[, distinct]): Calculate the union of two `DataFrame`s, preserving duplicate rows. The two `DataFrame`s must have exactly the same schema.
union_distinct(py_df): Calculate the distinct union of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.
with_column(name, expr)
with_column_renamed(old_name, new_name): Rename one column by applying a new projection.
write_csv(path): Write a DataFrame to a CSV file.
write_json(path): Executes a query and writes the results to a partitioned JSON file.
write_parquet(path[, compression, ...]): Write a DataFrame to a Parquet file.
- aggregate(group_by, aggs)
Aggregate this DataFrame on the group_by expressions, computing the aggregate expressions in aggs.
- cache()
Cache the DataFrame.
- collect()
Executes the plan, returning a list of `RecordBatch`es. Unless some order is specified in the plan, there is no guarantee of the order of the result.
- collect_partitioned()
Executes this DataFrame and collects all results into a vector of vectors of `RecordBatch`, maintaining the input partitioning.
- count()
Return the number of rows in this DataFrame.
- describe()
Calculate summary statistics for a DataFrame.
- distinct()
Filter out duplicate rows.
- except_all(py_df)
Calculate the exception (set difference, keeping duplicates) of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.
- execution_plan()
Get the execution plan for this DataFrame.
- explain(verbose=False, analyze=False)
Print the query plan.
- filter(predicate)
Filter the DataFrame, keeping only the rows for which the predicate expression evaluates to true.
- intersect(py_df)
Calculate the intersection of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.
- join(right, join_keys, how)
Join this DataFrame with right on the given join keys, using the join type given by how.
- limit(count, offset=0)
Limit the DataFrame to the first count rows, skipping offset rows.
- logical_plan()
Get the logical plan for this DataFrame.
- optimized_logical_plan()
Get the optimized logical plan for this DataFrame.
- repartition(num)
Repartition a DataFrame into num partitions using a round-robin partitioning scheme.
- repartition_by_hash(*args, num)
Repartition a DataFrame into num partitions based on the hash of the given expressions.
- schema()
Returns the schema from the logical plan.
- select(*args)
Project the DataFrame onto the given expressions.
- select_columns(*args)
Project the DataFrame onto the named columns.
- show(num=20)
Print the result, 20 lines by default.
- sort(*exprs)
Sort the DataFrame by the given sort expressions.
- to_arrow_table()
Convert to an Arrow Table: collect the batches and pass them to an Arrow Table.
- to_pandas()
Convert to a pandas DataFrame with pyarrow: collect the batches, pass them to an Arrow Table, and then convert to a pandas DataFrame.
- to_polars()
Convert to a polars DataFrame with pyarrow: collect the batches, pass them to an Arrow Table, and then convert to a polars DataFrame.
- to_pydict()
Convert to a Python dictionary using pyarrow. Each dictionary key is a column name and each value is the list of that column's values.
- to_pylist()
Convert to a Python list using pyarrow. Each list item represents one row, encoded as a dictionary.
- union(py_df, distinct=False)
Calculate the union of two `DataFrame`s, preserving duplicate rows. The two `DataFrame`s must have exactly the same schema.
- union_distinct(py_df)
Calculate the distinct union of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema.
- with_column(name, expr)
Add a column to the DataFrame by evaluating expr and naming the result name.
- with_column_renamed(old_name, new_name)
Rename one column by applying a new projection. This is a no-op if the column to be renamed does not exist.
- write_csv(path)
Write a DataFrame to a CSV file.
- write_json(path)
Executes a query and writes the results to a partitioned JSON file.
- write_parquet(path, compression='uncompressed', compression_level=None)
Write a DataFrame to a Parquet file.