Working with data stored in cloud storage systems like Amazon Simple Storage Service (S3) and Google Cloud Storage (GCS) is a very common task. Because of this, the Arrow C++ library provides a toolkit aimed at making it as simple to work with cloud storage as it is to work with the local filesystem.
To make this work, the Arrow C++ library contains a general-purpose interface for file systems, and the arrow package exposes this interface to R users. For instance, if you want to, you can create a LocalFileSystem object that allows you to interact with the local file system in the usual ways: copying, moving, and deleting files, obtaining information about files and folders, and so on (see help("FileSystem", package = "arrow") for details). In general you probably don’t need this functionality because you already have tools for working with your local file system, but this interface becomes much more useful in the context of remote file systems. Currently there is a specific implementation for Amazon S3 provided by the S3FileSystem class, and another one for Google Cloud Storage provided by GcsFileSystem.
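As a brief illustration, here is a minimal sketch of that local interface; the paths below are hypothetical placeholders, and only a couple of the available methods are shown (see the help page above for the full list):

fs <- LocalFileSystem$create()          # a filesystem object for the local machine
fs$ls("some/local/directory")           # list files under a directory (hypothetical path)
fs$GetFileInfo("some/local/file.txt")   # metadata such as size and type (hypothetical path)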
This article provides an overview of working with both S3 and GCS data using the Arrow toolkit.
S3 and GCS support on Linux
Before you start, make sure that your arrow install has support for S3 and/or GCS enabled. For most users this will be true by default, because the Windows and macOS binary packages hosted on CRAN include S3 and GCS support. You can check whether support is enabled via helper functions:
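For example, assuming a reasonably recent arrow release, which provides the arrow_with_s3() and arrow_with_gcs() helpers:

arrow_with_s3()
arrow_with_gcs()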
If these return TRUE then the relevant support is enabled.
In some cases you may find that your system does not have support enabled. The most common case for this occurs on Linux when installing arrow from source. In this situation S3 and GCS support is not always enabled by default, and there are additional system requirements involved. See the installation article for details on how to resolve this.
Connecting to cloud storage
One way of working with filesystems is to create FileSystem objects. S3FileSystem objects can be created with the s3_bucket() function, which automatically detects the bucket’s AWS region. Similarly, GcsFileSystem objects can be created with the gs_bucket() function. The resulting FileSystem will consider paths relative to the bucket’s path (so for example you don’t need to prefix the bucket path when listing a directory).
With a FileSystem object, you can point to specific files in it with the $path() method and pass the result to file readers and writers (read_parquet(), write_feather(), et al.).
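As a quick sketch of that pattern (the bucket and file names below are hypothetical placeholders):

fs <- s3_bucket("some-bucket")           # hypothetical bucket name
path <- fs$path("some/file.parquet")     # hypothetical file within the bucket
df <- read_parquet(path)                 # read directly from cloud storage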
Often the reason users work with cloud storage in real world analysis is to access large data sets. An example of this is discussed in the datasets article, but new users may prefer to work with a much smaller data set while learning how the arrow cloud storage interface works. To that end, the examples in this article rely on a multi-file Parquet dataset that stores a copy of the diamonds data made available through the ggplot2 package, documented in help("diamonds", package = "ggplot2"). The cloud storage version of this data set consists of 5 Parquet files totaling less than 1MB in size.
The diamonds data set is hosted on both S3 and GCS, in a bucket named voltrondata-labs-datasets. To create an S3FileSystem object that refers to that bucket, use the following command:
bucket <- s3_bucket("voltrondata-labs-datasets")
To do this for the GCS version of the data, the command is as follows:
bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
Note that anonymous = TRUE is required for GCS if credentials have not been configured.
Within this bucket there is a folder called diamonds. We can call bucket$ls("diamonds") to list the files stored in this folder, or bucket$ls("diamonds", recursive = TRUE) to recursively search subfolders. Note that on GCS, you should always set recursive = TRUE because directories often don’t appear in the results.
Here’s what we get when we list the files stored in the GCS bucket:
bucket$ls("diamonds", recursive = TRUE)
## [1] "diamonds/cut=Fair/part-0.parquet"
## [2] "diamonds/cut=Good/part-0.parquet"
## [3] "diamonds/cut=Ideal/part-0.parquet"
## [4] "diamonds/cut=Premium/part-0.parquet"
## [5] "diamonds/cut=Very Good/part-0.parquet"
There are 5 Parquet files here, one corresponding to each of the “cut” categories in the diamonds data set. We can specify the path to a specific file by calling bucket$path():
parquet_good <- bucket$path("diamonds/cut=Good/part-0.parquet")
We can use read_parquet() to read from this path directly into R:
diamonds_good <- read_parquet(parquet_good)
diamonds_good
## # A tibble: 4,906 × 9
## carat color clarity depth table price x y z
## <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 E VS1 56.9 65 327 4.05 4.07 2.31
## 2 0.31 J SI2 63.3 58 335 4.34 4.35 2.75
## 3 0.3 J SI1 64 55 339 4.25 4.28 2.73
## 4 0.3 J SI1 63.4 54 351 4.23 4.29 2.7
## 5 0.3 J SI1 63.8 56 351 4.23 4.26 2.71
## 6 0.3 I SI2 63.3 56 351 4.26 4.3 2.71
## 7 0.23 F VS1 58.2 59 402 4.06 4.08 2.37
## 8 0.23 E VS1 64.1 59 402 3.83 3.85 2.46
## 9 0.31 H SI1 64 54 402 4.29 4.31 2.75
## 10 0.26 D VS2 65.2 56 403 3.99 4.02 2.61
## # … with 4,896 more rows
## # ℹ Use `print(n = ...)` to see more rows
Note that this will be slower to read than if the file were local.
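If you plan to read the same remote files more than once, one option is to copy them to local storage first. Below is a minimal sketch using arrow’s copy_files(); the local directory name is a hypothetical placeholder:

copy_files(bucket$path("diamonds"), "local_diamonds")    # copy the remote folder to a local directory
diamonds_good <- read_parquet("local_diamonds/cut=Good/part-0.parquet")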
Connecting directly with a URI
In most use cases, the easiest and most natural way to connect to cloud storage in arrow is to use the FileSystem objects returned by s3_bucket() and gs_bucket(), especially when multiple file operations are required. However, in some cases you may want to download a file directly by specifying the URI. This is permitted by arrow, and functions like read_parquet(), write_feather(), open_dataset(), etc. will all accept URIs to cloud resources hosted on S3 or GCS. The format of an S3 URI is as follows:
s3://[access_key:secret_key@]bucket/path[?region=]
For GCS, the URI format looks like this:
gs://[access_key:secret_key@]bucket/path
gs://anonymous@bucket/path
For example, the Parquet file storing the “good cut” diamonds that we read earlier in the article is available on both S3 and GCS. The relevant URIs are as follows:
uri <- "s3://voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"
uri <- "gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"
Note that “anonymous” is required on GCS for public buckets.
Regardless of which version you use, you can pass this URI to read_parquet() as if the file were stored locally:
df <- read_parquet(uri)
URIs accept additional options in the query parameters (the part after the ?) that are passed down to configure the underlying file system. They are separated by &. For example,
s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
is equivalent to:
bucket <- S3FileSystem$create(
  endpoint_override = "https://storage.googleapis.com",
  allow_bucket_creation = TRUE
)
bucket$path("voltrondata-labs-datasets/")
Both tell the S3FileSystem object that it should allow the creation of new buckets and that it should talk to Google Cloud Storage instead of S3. The latter works because GCS implements an S3-compatible API – see File systems that emulate S3 below – but if you want better support for GCS you should instead use a GcsFileSystem, via a URI that starts with gs://.
Also note that parameters in the URI need to be percent-encoded, which is why :// is written as %3A%2F%2F.
For S3, only the following options can be included in the URI as query parameters: region, scheme, endpoint_override, access_key, secret_key, allow_bucket_creation, and allow_bucket_deletion.
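For instance, a hypothetical S3 URI that pins the region explicitly might look like the following (the region value is purely illustrative, not necessarily this bucket’s actual region):

s3://voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet?region=us-east-2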
For GCS, the supported parameters are scheme, endpoint_override, and retry_limit_seconds. Of these, retry_limit_seconds is particularly useful: it sets the number of seconds a request may spend retrying before returning an error. The current default is 15 minutes, so in many interactive contexts it’s nice to set a lower value:
gs://anonymous@voltrondata-labs-datasets/diamonds/?retry_limit_seconds=10
Authentication
S3 Authentication
To access private S3 buckets, you typically need two secret parameters: an access_key, which is like a user id, and a secret_key, which is like a token or password. There are a few options for passing these credentials:
- Include them in the URI, like s3://access_key:secret_key@bucket-name/path/to/file. Be sure to URL-encode your secrets if they contain special characters like “/” (e.g., URLencode("123/456", reserved = TRUE)).
- Pass them as access_key and secret_key to S3FileSystem$create() or s3_bucket(), as shown in the sketch after this list.
- Set them as environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, respectively.
- Define them in a ~/.aws/credentials file, according to the AWS documentation.
- Use an AccessRole for temporary access by passing the role_arn identifier to S3FileSystem$create() or s3_bucket().
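As a minimal sketch of the second option (the bucket name is a hypothetical placeholder, and reading the keys from environment variables here is purely illustrative):

bucket <- s3_bucket(
  "my-private-bucket",                              # hypothetical bucket name
  access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),     # illustrative source; any string works
  secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY")
)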
GCS Authentication
The simplest way to authenticate with GCS is to run the gcloud command to setup application default credentials:
gcloud auth application-default login
To manually configure credentials, you can pass either access_token and expiration, for using temporary tokens generated elsewhere, or json_credentials, to reference a downloaded credentials file.
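As a hedged sketch of the json_credentials route (the file path is a hypothetical placeholder; depending on your arrow version, json_credentials may expect the file contents rather than a path, so check help("FileSystem", package = "arrow")):

# Read a downloaded service-account key (hypothetical path) and pass its contents
creds <- paste(readLines("service-account-key.json"), collapse = "\n")
fs <- GcsFileSystem$create(json_credentials = creds)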
If you haven’t configured credentials, then to access public buckets, you must pass anonymous = TRUE or anonymous as the user in a URI:
bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
fs <- GcsFileSystem$create(anonymous = TRUE)
df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")
Using a proxy server
If you need to use a proxy server to connect to an S3 bucket, you can provide a URI in the form http://user:password@host:port to proxy_options. For example, a local proxy server running on port 1316 can be used like this:
bucket <- s3_bucket(
  bucket = "voltrondata-labs-datasets",
  proxy_options = "http://localhost:1316"
)
File systems that emulate S3
The S3FileSystem machinery enables you to work with any file system that provides an S3-compatible interface. For example, MinIO is an object-storage server that emulates the S3 API. If you were to run minio server locally with its default settings, you could connect to it with arrow using S3FileSystem like this:
minio <- S3FileSystem$create(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)
or, as a URI, it would be
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
(Note the URL escaping of the : in endpoint_override.)
Among other applications, this can be useful for testing out code locally before running on a remote S3 bucket.
Disabling environment variables
As mentioned above, it is possible to make use of environment variables to configure access. However, if you wish to pass in connection details via a URI or alternative methods but also have existing AWS environment variables defined, these may interfere with your session. For example, you may see an error message like:
Error: IOError: When resolving region for bucket 'analysis': AWS Error [code 99]: curlCode: 6, Couldn't resolve host name
You can unset these environment variables using Sys.unsetenv(), for example:
Sys.unsetenv("AWS_DEFAULT_REGION")
Sys.unsetenv("AWS_S3_ENDPOINT")
By default, the AWS SDK tries to retrieve metadata about user configuration, which can cause conflicts when passing in connection details via URI (for example when accessing a MinIO bucket). To disable the use of AWS environment variables, you can set the environment variable AWS_EC2_METADATA_DISABLED to TRUE.
Sys.setenv(AWS_EC2_METADATA_DISABLED = TRUE)
Further reading
- To learn more about FileSystem classes, including S3FileSystem and GcsFileSystem, see help("FileSystem", package = "arrow").
- To see a data analysis example that relies on data hosted on cloud storage, see the dataset article.