2 Reading and Writing Data - Single Files

2.1 Introduction

When reading files into R using Apache Arrow, you can read:

  • a single file into memory as a data frame or an Arrow Table
  • a single file that is too large to fit in memory as an Arrow Dataset
  • multiple and partitioned files as an Arrow Dataset

This chapter contains recipes for using Apache Arrow to read and write single files, including reading single-file data into memory as an Arrow Table. There are a number of circumstances in which you may want to read in single-file data as an Arrow Table:

  • your data file is large and you are experiencing performance issues
  • you want faster performance from your dplyr queries
  • you want to be able to take advantage of Arrow’s compute functions

If a single data file is too large to load into memory, you can use the Arrow Dataset API. Recipes for using open_dataset() and write_dataset() are in the Reading and Writing Data - Multiple Files chapter.

2.2 Convert data from a data frame to an Arrow Table

You want to convert an existing data.frame or tibble object into an Arrow Table.

2.2.1 Solution

air_table <- arrow_table(airquality)
air_table
## Table
## 153 rows x 6 columns
## $Ozone <int32>
## $Solar.R <int32>
## $Wind <double>
## $Temp <int32>
## $Month <int32>
## $Day <int32>
## 
## See $metadata for additional Schema metadata
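
2.2.2 Discussion

If you want to inspect the column names and Arrow types of the resulting Table, you can look at its schema attribute:

# Inspect the schema of the Table
air_table$schema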

2.3 Convert data from an Arrow Table to a data frame

You want to convert an Arrow Table to a data frame to view the data or work with it in your usual analytics pipeline.

2.3.1 Solution

air_df <- as.data.frame(air_table)
air_df
##     Ozone Solar.R Wind Temp Month Day
## 1      41     190  7.4   67     5   1
## 2      36     118  8.0   72     5   2
## 3      12     149 12.6   74     5   3
## 4      18     313 11.5   62     5   4
## 5      NA      NA 14.3   56     5   5
## 6      28      NA 14.9   66     5   6
## 7      23     299  8.6   65     5   7
## 8      19      99 13.8   59     5   8
## 9       8      19 20.1   61     5   9
## 10     NA     194  8.6   69     5  10
## 11      7      NA  6.9   74     5  11
## 12     16     256  9.7   69     5  12
## 13     11     290  9.2   66     5  13
## 14     14     274 10.9   68     5  14
## 15     18      65 13.2   58     5  15
## 16     14     334 11.5   64     5  16
## 17     34     307 12.0   66     5  17
## 18      6      78 18.4   57     5  18
## 19     30     322 11.5   68     5  19
## 20     11      44  9.7   62     5  20
## 21      1       8  9.7   59     5  21
## 22     11     320 16.6   73     5  22
## 23      4      25  9.7   61     5  23
## 24     32      92 12.0   61     5  24
## 25     NA      66 16.6   57     5  25
## 26     NA     266 14.9   58     5  26
## 27     NA      NA  8.0   57     5  27
## 28     23      13 12.0   67     5  28
## 29     45     252 14.9   81     5  29
## 30    115     223  5.7   79     5  30
## 31     37     279  7.4   76     5  31
## 32     NA     286  8.6   78     6   1
## 33     NA     287  9.7   74     6   2
## 34     NA     242 16.1   67     6   3
## 35     NA     186  9.2   84     6   4
## 36     NA     220  8.6   85     6   5
## 37     NA     264 14.3   79     6   6
## 38     29     127  9.7   82     6   7
## 39     NA     273  6.9   87     6   8
## 40     71     291 13.8   90     6   9
## 41     39     323 11.5   87     6  10
## 42     NA     259 10.9   93     6  11
## 43     NA     250  9.2   92     6  12
## 44     23     148  8.0   82     6  13
## 45     NA     332 13.8   80     6  14
## 46     NA     322 11.5   79     6  15
## 47     21     191 14.9   77     6  16
## 48     37     284 20.7   72     6  17
## 49     20      37  9.2   65     6  18
## 50     12     120 11.5   73     6  19
## 51     13     137 10.3   76     6  20
## 52     NA     150  6.3   77     6  21
## 53     NA      59  1.7   76     6  22
## 54     NA      91  4.6   76     6  23
## 55     NA     250  6.3   76     6  24
## 56     NA     135  8.0   75     6  25
## 57     NA     127  8.0   78     6  26
## 58     NA      47 10.3   73     6  27
## 59     NA      98 11.5   80     6  28
## 60     NA      31 14.9   77     6  29
## 61     NA     138  8.0   83     6  30
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 64     32     236  9.2   81     7   3
## 65     NA     101 10.9   84     7   4
## 66     64     175  4.6   83     7   5
## 67     40     314 10.9   83     7   6
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 72     NA     139  8.6   82     7  11
## 73     10     264 14.3   73     7  12
## 74     27     175 14.9   81     7  13
## 75     NA     291 14.9   91     7  14
## 76      7      48 14.3   80     7  15
## 77     48     260  6.9   81     7  16
## 78     35     274 10.3   82     7  17
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 82     16       7  6.9   74     7  21
## 83     NA     258  9.7   81     7  22
## 84     NA     295 11.5   82     7  23
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 87     20      81  8.6   82     7  26
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 93     39      83  6.9   81     8   1
## 94      9      24 13.8   81     8   2
## 95     16      77  7.4   82     8   3
## 96     78      NA  6.9   86     8   4
## 97     35      NA  7.4   85     8   5
## 98     66      NA  4.6   87     8   6
## 99    122     255  4.0   89     8   7
## 100    89     229 10.3   90     8   8
## 101   110     207  8.0   90     8   9
## 102    NA     222  8.6   92     8  10
## 103    NA     137 11.5   86     8  11
## 104    44     192 11.5   86     8  12
## 105    28     273 11.5   82     8  13
## 106    65     157  9.7   80     8  14
## 107    NA      64 11.5   79     8  15
## 108    22      71 10.3   77     8  16
## 109    59      51  6.3   79     8  17
## 110    23     115  7.4   76     8  18
## 111    31     244 10.9   78     8  19
## 112    44     190 10.3   78     8  20
## 113    21     259 15.5   77     8  21
## 114     9      36 14.3   72     8  22
## 115    NA     255 12.6   75     8  23
## 116    45     212  9.7   79     8  24
## 117   168     238  3.4   81     8  25
## 118    73     215  8.0   86     8  26
## 119    NA     153  5.7   88     8  27
## 120    76     203  9.7   97     8  28
## 121   118     225  2.3   94     8  29
## 122    84     237  6.3   96     8  30
## 123    85     188  6.3   94     8  31
## 124    96     167  6.9   91     9   1
## 125    78     197  5.1   92     9   2
## 126    73     183  2.8   93     9   3
## 127    91     189  4.6   93     9   4
## 128    47      95  7.4   87     9   5
## 129    32      92 15.5   84     9   6
## 130    20     252 10.9   80     9   7
## 131    23     220 10.3   78     9   8
## 132    21     230 10.9   75     9   9
## 133    24     259  9.7   73     9  10
## 134    44     236 14.9   81     9  11
## 135    21     259 15.5   76     9  12
## 136    28     238  6.3   77     9  13
## 137     9      24 10.9   71     9  14
## 138    13     112 11.5   71     9  15
## 139    46     237  6.9   78     9  16
## 140    18     224 13.8   67     9  17
## 141    13      27 10.3   76     9  18
## 142    24     238 10.3   68     9  19
## 143    16     201  8.0   82     9  20
## 144    13     238 12.6   64     9  21
## 145    23      14  9.2   71     9  22
## 146    36     139 10.3   81     9  23
## 147     7      49 10.3   69     9  24
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 150    NA     145 13.2   77     9  27
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30

2.3.2 Discussion

You can use dplyr::collect() to return a tibble or as.data.frame() to return a data.frame.
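
For example, to return a tibble:

library(dplyr)
air_tibble <- air_table %>% collect()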

2.4 Write a Parquet file

You want to write a single Parquet file to disk.

2.4.1 Solution

# Create table
my_table <- arrow_table(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
# Write to Parquet
write_parquet(my_table, "my_table.parquet")
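
2.4.2 Discussion

Note that write_parquet() also accepts a data.frame or tibble directly, so creating an Arrow Table first is optional:

# Write a data frame straight to Parquet
write_parquet(airquality, "airquality.parquet")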

2.5 Read a Parquet file

You want to read a single Parquet file into memory.

2.5.1 Solution

parquet_tbl <- read_parquet("my_table.parquet")
parquet_tbl
## # A tibble: 3 × 2
##   group score
##   <chr> <dbl>
## 1 A        99
## 2 B        97
## 3 C        99

Because the argument as_data_frame was left at its default value of TRUE, the file was read in as a tibble.

class(parquet_tbl)
## [1] "tbl_df"     "tbl"        "data.frame"

2.5.2 Discussion

If you set as_data_frame to FALSE, the file will be read in as an Arrow Table.

my_table_arrow <- read_parquet("my_table.parquet", as_data_frame = FALSE)
my_table_arrow
## Table
## 3 rows x 2 columns
## $group <string>
## $score <double>
class(my_table_arrow)
## [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"

2.6 Read a Parquet file from S3

You want to read a single Parquet file from S3 into memory.

2.6.1 Solution

df <- read_parquet(file = "s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet")

2.6.2 See also

For more in-depth instructions, including how to work with S3 buckets which require authentication, you can find a guide to reading and writing to/from S3 buckets here: https://arrow.apache.org/docs/r/articles/fs.html.
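
As a brief sketch, another way to reach the same file is via a FileSystem object created with s3_bucket(), which also accepts credentials (for example, access_key and secret_key) for buckets that require authentication; anonymous access works for this public bucket:

# Create a FileSystem object pointing at the bucket
bucket <- s3_bucket("voltrondata-labs-datasets")
# Read the Parquet file via the bucket's path
df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/part-0.parquet"))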

2.7 Filter columns while reading a Parquet file

You want to specify which columns to include when reading a single Parquet file into memory.

2.7.1 Solution

# Create table to read back in
dist_time <- arrow_table(data.frame(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40)))
# Write to Parquet
write_parquet(dist_time, "dist_time.parquet")

# Read in only the "time" column
time_only <- read_parquet("dist_time.parquet", col_select = "time")
time_only
## # A tibble: 3 × 1
##    time
##   <dbl>
## 1    43
## 2    44
## 3    40
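
2.7.2 Discussion

The col_select argument also accepts tidy selection helpers, so, for example, you could select the distance column by its name prefix:

# Read in only columns whose names start with "dist"
dist_only <- read_parquet("dist_time.parquet", col_select = starts_with("dist"))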

2.8 Write a Feather V2/Arrow IPC file

You want to write a single Feather V2 file (also called an Arrow IPC file).

2.8.1 Solution

my_table <- arrow_table(data.frame(group = c("A", "B", "C"), score = c(99, 97, 99)))
write_feather(my_table, "my_table.arrow")

2.8.2 Discussion

For legacy support, you can write data in the original Feather format by setting the version parameter to 1.

# Create table
my_table <- arrow_table(data.frame(group = c("A", "B", "C"), score = c(99, 97, 99)))
# Write to Feather format V1
write_feather(my_table, "my_table.feather", version = 1)
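
Feather V2 files also support compression. For instance, to use zstd instead of the default (codec availability can depend on how your copy of Arrow was built):

# Write a Feather V2 file compressed with zstd
write_feather(my_table, "my_table_zstd.arrow", compression = "zstd")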

2.9 Read a Feather/Arrow IPC file

You want to read a single Feather V1 or V2 file (also called an Arrow IPC file) into memory.

2.9.1 Solution

my_feather_tbl <- read_feather("my_table.arrow")
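
2.9.2 Discussion

As with read_parquet(), setting as_data_frame to FALSE reads the file in as an Arrow Table instead of a tibble:

my_feather_table <- read_feather("my_table.arrow", as_data_frame = FALSE)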

2.10 Write streaming Arrow IPC files

You want to write to the Arrow IPC stream format.

2.10.1 Solution

# Create table
my_table <- arrow_table(
  data.frame(
    group = c("A", "B", "C"),
    score = c(99, 97, 99)
    )
)
# Write to IPC stream format
write_ipc_stream(my_table, "my_table.arrows")

2.11 Read streaming Arrow IPC files

You want to read from the Arrow IPC stream format.

2.11.1 Solution

my_ipc_stream <- arrow::read_ipc_stream("my_table.arrows")

2.12 Write a CSV file

You want to write Arrow data to a single CSV file.

2.12.1 Solution

write_csv_arrow(cars, "cars.csv")
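
2.12.2 Discussion

write_csv_arrow() also lets you control whether the header row is written. For example:

# Write the CSV without a header row
write_csv_arrow(cars, "cars_no_header.csv", include_header = FALSE)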

2.13 Read a CSV file

You want to read a single CSV file into memory.

2.13.1 Solution

my_csv <- read_csv_arrow("cars.csv", as_data_frame = FALSE)
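
2.13.2 Discussion

Setting as_data_frame to FALSE returns an Arrow Table rather than a tibble; omit it (or set it to TRUE) to read the file directly into a tibble:

my_csv_df <- read_csv_arrow("cars.csv")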

2.14 Read a JSON file

You want to read a JSON file into memory.

2.14.1 Solution

# Create a file to read back in
tf <- tempfile()
writeLines('
    {"country": "United Kingdom", "code": "GB", "long": -3.44, "lat": 55.38}
    {"country": "France", "code": "FR", "long": 2.21, "lat": 46.23}
    {"country": "Germany", "code": "DE", "long": 10.45, "lat": 51.17}
  ', tf, useBytes = TRUE)

# Read in the data
countries <- read_json_arrow(tf, col_select = c("country", "long", "lat"))
countries
## # A tibble: 3 × 3
##   country         long   lat
##   <chr>          <dbl> <dbl>
## 1 United Kingdom -3.44  55.4
## 2 France          2.21  46.2
## 3 Germany        10.4   51.2
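
2.14.2 Discussion

Note that read_json_arrow() expects newline-delimited JSON (one object per line), as in the file created above, rather than a single top-level JSON array.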

2.15 Write a compressed single data file

You want to save a single file, compressed with a specified compression algorithm.

2.15.1 Solution

# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write data compressed with the gzip algorithm instead of the default
write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")
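
If the chosen algorithm supports it, you can also set an explicit compression level; higher gzip levels trade write speed for smaller files:

# Write gzip-compressed data with an explicit compression level
write_parquet(iris, file.path(td, "iris_gz9.parquet"), compression = "gzip", compression_level = 9)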

2.15.2 See also

Some formats write compressed data by default. For more information on the supported compression algorithms and default settings, see:

  • ?write_parquet()
  • ?write_feather()

2.16 Read compressed data

You want to read a single data file that has been compressed.

2.16.1 Solution

# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write data which is to be read back in
write_parquet(iris, file.path(td, "iris.parquet"), compression = "gzip")

# Read in data
ds <- read_parquet(file.path(td, "iris.parquet"))
ds
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

2.16.2 Discussion

Note that Arrow automatically detects the compression, so you do not need to specify it in the call to the read_*() or open_dataset() functions.

Although the CSV format itself does not support compression, Arrow can read CSV data that has been compressed if the file extension is .gz.

# Create a temporary directory
td <- tempfile()
dir.create(td)

# Write data which is to be read back in
write.csv(iris, gzfile(file.path(td, "iris.csv.gz")), row.names = FALSE, quote = FALSE)

# Read in data
ds <- read_csv_arrow(file.path(td, "iris.csv.gz"))
ds
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows
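
To write compressed CSV data, one approach is to write through a compressed stream; a minimal sketch, assuming a gzip-capable build of Arrow and relying on CompressedOutputStream's default gzip codec:

# Write gzip-compressed CSV by passing a compressed stream as the sink
write_csv_arrow(iris, sink = CompressedOutputStream$create(file.path(td, "iris_out.csv.gz")))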