Basic Operations

In this section, you will learn how to inspect the essential details of a DataFrame: its rows, its schema, and a statistical summary of its columns.

In [1]: from datafusion import SessionContext

In [2]: import random

In [3]: ctx = SessionContext()

In [4]: df = ctx.from_pydict({
   ...:     "nrs": [1, 2, 3, 4, 5],
   ...:     "names": ["python", "ruby", "java", "haskell", "go"],
   ...:     "random": random.sample(range(1000), 5),
   ...:     "groups": ["A", "A", "B", "C", "B"],
   ...: })
   ...: 

In [5]: df
Out[5]: 
DataFrame()
+-----+---------+--------+--------+
| nrs | names   | random | groups |
+-----+---------+--------+--------+
| 1   | python  | 352    | A      |
| 2   | ruby    | 463    | A      |
| 3   | java    | 395    | B      |
| 4   | haskell | 135    | C      |
| 5   | go      | 849    | B      |
+-----+---------+--------+--------+

Use DataFrame.limit() to view the first rows of the DataFrame:

In [6]: df.limit(2)
Out[6]: 
DataFrame()
+-----+--------+--------+--------+
| nrs | names  | random | groups |
+-----+--------+--------+--------+
| 1   | python | 352    | A      |
| 2   | ruby   | 463    | A      |
+-----+--------+--------+--------+

Display the column names and data types of the DataFrame using DataFrame.schema():

In [7]: df.schema()
Out[7]: 
nrs: int64
names: string
random: int64
groups: string

The method DataFrame.to_pandas() uses pyarrow to convert the result to a pandas DataFrame: it collects the record batches, passes them to an Arrow table, and then converts that table to a pandas DataFrame.

In [8]: df.to_pandas()
Out[8]: 
   nrs    names  random groups
0    1   python     352      A
1    2     ruby     463      A
2    3     java     395      B
3    4  haskell     135      C
4    5       go     849      B

DataFrame.describe() shows a quick statistical summary of your data:

In [9]: df.describe()
Out[9]: 
DataFrame()
+------------+--------------------+-------+--------------------+--------+
| describe   | nrs                | names | random             | groups |
+------------+--------------------+-------+--------------------+--------+
| count      | 5.0                | 5     | 5.0                | 5      |
| null_count | 5.0                | 5     | 5.0                | 5      |
| mean       | 3.0                | null  | 438.8              | null   |
| std        | 1.5811388300841898 | null  | 260.09459817535617 | null   |
| min        | 1.0                | go    | 135.0              | A      |
| max        | 5.0                | ruby  | 849.0              | C      |
| median     | 3.0                | null  | 395.0              | null   |
+------------+--------------------+-------+--------------------+--------+
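
To make the numeric rows easier to interpret, the values for the nrs column can be recomputed with Python's standard library. This is only a cross-check, not how DataFusion computes them; it suggests that the std row matches the sample standard deviation (dividing by n - 1).

```python
import statistics

# Values of the "nrs" column from the example above.
nrs = [1, 2, 3, 4, 5]

summary = {
    "count": float(len(nrs)),
    "mean": statistics.mean(nrs),
    # Sample standard deviation: divides by n - 1, not n.
    "std": statistics.stdev(nrs),
    "min": float(min(nrs)),
    "max": float(max(nrs)),
    "median": statistics.median(nrs),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For the string columns, mean, std, and median are reported as null, while min and max follow lexicographic ordering ("go" sorts before "ruby").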