Functions

DataFusion provides a large number of built-in functions for performing complex queries without requiring user-defined functions. In here we will cover some of the more popular use cases. If you want to view all the functions go to the Functions API Reference.

We’ll use the pokemon dataset in the following examples.

In [1]: import urllib.request

In [2]: from datafusion import SessionContext

In [3]: urllib.request.urlretrieve(
   ...: "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
   ...: "pokemon.csv",
   ...: )
   ...: 
Out[3]: ('pokemon.csv', <http.client.HTTPMessage at 0x7f3f37a00eb0>)

In [4]: ctx = SessionContext()

In [5]: ctx.register_csv("pokemon", "pokemon.csv")

In [6]: df = ctx.table("pokemon")

Mathematical

DataFusion offers mathematical functions such as pow() or log()

In [7]: from datafusion import col, literal

In [8]: from datafusion import functions as f

In [9]: df.select(
   ...:     f.pow(col('"Attack"'), literal(2)) - f.pow(col('"Defense"'), literal(2))
   ...: ).limit(10)
   ...: 
Out[9]: 
DataFrame()
+------------------------------------------------------------------+
| power(pokemon.Attack,Int64(2)) - power(pokemon.Defense,Int64(2)) |
+------------------------------------------------------------------+
| 0                                                                |
| -125                                                             |
| -165                                                             |
| -5129                                                            |
| 855                                                              |
| 732                                                              |
| 972                                                              |
| 4579                                                             |
| 4732                                                             |
| -1921                                                            |
+------------------------------------------------------------------+

Conditional

There 3 conditional functions in DataFusion coalesce(), nullif() and case() (not available in Python)

In [10]: df.select(
   ....:     f.coalesce(col('"Type 1"'), col('"Type 2"')).alias("dominant_type")
   ....: ).limit(10)
   ....: 
Out[10]: 
DataFrame()
+---------------+
| dominant_type |
+---------------+
| Grass         |
| Grass         |
| Grass         |
| Grass         |
| Fire          |
| Fire          |
| Fire          |
| Fire          |
| Fire          |
| Water         |
+---------------+

Temporal

For selecting the current time use now()

In [11]: df.select(f.now())
Out[11]: 
DataFrame()
+--------------------------------+
| now()                          |
+--------------------------------+
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
| 2024-03-10T23:18:38.153563379Z |
+--------------------------------+

Convert to timestamps using to_timestamp()

In [12]: df.select(f.to_timestamp(col('"Total"')).alias("timestamp"))
Out[12]: 
DataFrame()
+---------------------+
| timestamp           |
+---------------------+
| 1970-01-01T00:05:18 |
| 1970-01-01T00:06:45 |
| 1970-01-01T00:08:45 |
| 1970-01-01T00:10:25 |
| 1970-01-01T00:05:09 |
| 1970-01-01T00:06:45 |
| 1970-01-01T00:08:54 |
| 1970-01-01T00:10:34 |
| 1970-01-01T00:10:34 |
| 1970-01-01T00:05:14 |
+---------------------+

String

In the field of data science, working with textual data is a common task. To make string manipulation easier, DataFusion offers a range of helpful options.

In [13]: df.select(
   ....:     f.char_length(col('"Name"')).alias("len"),
   ....:     f.lower(col('"Name"')).alias("lower"),
   ....:     f.left(col('"Name"'), literal(4)).alias("code")
   ....: )
   ....: 
Out[13]: 
DataFrame()
+-----+---------------------------+------+
| len | lower                     | code |
+-----+---------------------------+------+
| 9   | bulbasaur                 | Bulb |
| 7   | ivysaur                   | Ivys |
| 8   | venusaur                  | Venu |
| 21  | venusaurmega venusaur     | Venu |
| 10  | charmander                | Char |
| 10  | charmeleon                | Char |
| 9   | charizard                 | Char |
| 25  | charizardmega charizard x | Char |
| 25  | charizardmega charizard y | Char |
| 8   | squirtle                  | Squi |
+-----+---------------------------+------+

This also includes the functions for regular expressions regexp_replace() and regexp_match()

In [14]: df.select(
   ....:     f.regexp_match(col('"Name"'), literal("Char")).alias("dragons"),
   ....:     f.regexp_replace(col('"Name"'), literal("saur"), literal("fleur")).alias("flowers")
   ....: )
   ....: 
Out[14]: 
DataFrame()
+---------+---------------------------+
| dragons | flowers                   |
+---------+---------------------------+
|         | Bulbafleur                |
|         | Ivyfleur                  |
|         | Venufleur                 |
|         | VenufleurMega Venusaur    |
| [Char]  | Charmander                |
| [Char]  | Charmeleon                |
| [Char]  | Charizard                 |
| [Char]  | CharizardMega Charizard X |
| [Char]  | CharizardMega Charizard Y |
|         | Squirtle                  |
+---------+---------------------------+

Other

The function in_list() allows to check a column for the presence of multiple values:

In [15]: types = [literal("Grass"), literal("Fire"), literal("Water")]

In [16]: (
   ....:     df.select(f.in_list(col('"Type 1"'), types, negated=False).alias("basic_types"))
   ....:       .limit(20)
   ....:       .to_pandas()
   ....: )
   ....: 
Out[16]: 
    basic_types
0          True
1          True
2          True
3          True
4          True
5          True
6          True
7          True
8          True
9          True
10         True
11         True
12         True
13        False
14        False
15        False
16        False
17        False
18        False
19        False