Basic Operations

This section shows how to inspect a DataFrame: preview its rows, view its schema, convert it to pandas, and summarize its columns.

In [1]: from datafusion import SessionContext

In [2]: import random

In [3]: ctx = SessionContext()

In [4]: df = ctx.from_pydict({
   ...:     "nrs": [1, 2, 3, 4, 5],
   ...:     "names": ["python", "ruby", "java", "haskell", "go"],
   ...:     "random": random.sample(range(1000), 5),
   ...:     "groups": ["A", "A", "B", "C", "B"],
   ...: })
   ...: 

In [5]: df
Out[5]: 
DataFrame()
+-----+---------+--------+--------+
| nrs | names   | random | groups |
+-----+---------+--------+--------+
| 1   | python  | 390    | A      |
| 2   | ruby    | 217    | A      |
| 3   | java    | 689    | B      |
| 4   | haskell | 183    | C      |
| 5   | go      | 1      | B      |
+-----+---------+--------+--------+

Use DataFrame.limit() to view the first rows of the DataFrame:

In [6]: df.limit(2)
Out[6]: 
DataFrame()
+-----+--------+--------+--------+
| nrs | names  | random | groups |
+-----+--------+--------+--------+
| 1   | python | 390    | A      |
| 2   | ruby   | 217    | A      |
+-----+--------+--------+--------+

Display the column names and data types of the DataFrame using DataFrame.schema():

In [7]: df.schema()
Out[7]: 
nrs: int64
names: string
random: int64
groups: string

DataFrame.to_pandas() converts the result to a pandas DataFrame via pyarrow: it collects the record batches, assembles them into an Arrow table, and then converts that table to a pandas DataFrame.

In [8]: df.to_pandas()
Out[8]: 
   nrs    names  random groups
0    1   python     390      A
1    2     ruby     217      A
2    3     java     689      B
3    4  haskell     183      C
4    5       go       1      B

DataFrame.describe() shows a quick statistical summary of your data:

In [9]: df.describe()
Out[9]: 
DataFrame()
+------------+--------------------+-------+-------------------+--------+
| describe   | nrs                | names | random            | groups |
+------------+--------------------+-------+-------------------+--------+
| count      | 5.0                | 5     | 5.0               | 5      |
| null_count | 5.0                | 5     | 5.0               | 5      |
| mean       | 3.0                | null  | 296.0             | null   |
| std        | 1.5811388300841898 | null  | 259.4802497301095 | null   |
| min        | 1.0                | go    | 1.0               | A      |
| max        | 5.0                | ruby  | 689.0             | C      |
| median     | 3.0                | null  | 217.0             | null   |
+------------+--------------------+-------+-------------------+--------+
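To make the numeric rows of the summary concrete, the sketch below recomputes them with Python's standard statistics module from the values shown in the table at the top of this section. The std figures match the sample standard deviation (dividing by n - 1), which is what statistics.stdev computes. The dictionary name summary_by_column is a hypothetical helper for this illustration only.

```python
import statistics

# Values copied from the example DataFrame shown earlier in this section.
columns = {
    "nrs": [1, 2, 3, 4, 5],
    "random": [390, 217, 689, 183, 1],
}

summary_by_column = {}
for name, values in columns.items():
    summary_by_column[name] = {
        "count": float(len(values)),
        "mean": float(statistics.mean(values)),
        # Sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "median": float(statistics.median(values)),
    }

for name, stats in summary_by_column.items():
    print(name, stats)
```

For the non-numeric columns, describe() reports null for mean, std, and median, while min and max follow string ordering.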