Arrow Skyhook example
The file cpp/examples/arrow/dataset_skyhook_scan_example.cc
located inside the source tree contains an example of using Skyhook to
offload filters and projections to a Ceph cluster.
Instructions
Note: The instructions below are for Ubuntu 20.04 or above.
Install Ceph and Skyhook dependencies.
apt update
apt install -y cmake \
  libradospp-dev \
  rados-objclass-dev \
  ceph \
  ceph-common \
  ceph-osd \
  ceph-mon \
  ceph-mgr \
  ceph-mds \
  rbd-mirror \
  ceph-fuse \
  rapidjson-dev \
  libboost-all-dev \
  python3-pip
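Before moving on, it can help to confirm that the tools the later steps rely on are on the PATH. The following is a minimal sketch, not part of the Arrow or Ceph tooling: the helper name `missing_tools` is hypothetical, and the list of commands checked (cmake, ceph, rados, python3, pip3) is an assumption about what the packages above provide.

```shell
# Print each command from the argument list that is not found on PATH.
# missing_tools is a hypothetical helper, not part of Arrow or Ceph.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed tool names; adjust to match your installation.
missing=$(missing_tools cmake ceph rados python3 pip3)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All expected tools found."
fi
```

If anything is reported as not found, re-running the apt install step is a reasonable first thing to try.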
Build and install Skyhook.
git clone https://github.com/apache/arrow
cd arrow/
mkdir -p cpp/release
cd cpp/release
cmake -DARROW_SKYHOOK=ON \
  -DARROW_PARQUET=ON \
  -DARROW_WITH_SNAPPY=ON \
  -DARROW_BUILD_EXAMPLES=ON \
  -DARROW_DATASET=ON \
  -DARROW_CSV=ON \
  -DARROW_WITH_LZ4=ON \
  ..
make -j install
cp release/libcls_skyhook.so /usr/lib/x86_64-linux-gnu/rados-classes/
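The last command of the build step copies the Skyhook object class library into the Ceph rados-classes directory. A quick way to confirm the copy succeeded might look like the sketch below. The helper name `has_skyhook_cls` is hypothetical; the directory is the one used in the cp command.

```shell
# Return success if libcls_skyhook.so exists in the given directory.
# has_skyhook_cls is a hypothetical helper for illustration.
has_skyhook_cls() {
  [ -f "$1/libcls_skyhook.so" ]
}

# Directory taken from the cp command in the build step.
cls_dir=/usr/lib/x86_64-linux-gnu/rados-classes
if has_skyhook_cls "$cls_dir"; then
  echo "libcls_skyhook.so found in $cls_dir"
else
  echo "libcls_skyhook.so missing from $cls_dir"
fi
```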
Deploy a Ceph cluster with a single in-memory OSD using the micro-osd.sh script.
./micro-osd.sh /tmp/skyhook
Generate the example dataset.
pip install pandas pyarrow
python3 ../../ci/scripts/generate_dataset.py
cp -r nyc /mnt/cephfs/
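Since the example reads from /mnt/cephfs/nyc, it may be worth checking that the dataset actually landed there before running it. The sketch below simply counts regular files under that path; `count_files` is a hypothetical helper, and no assumption is made about how many files the generator script produces.

```shell
# Count regular files under a directory tree; prints 0 if the directory is absent.
# count_files is a hypothetical helper for illustration.
count_files() {
  if [ -d "$1" ]; then
    find "$1" -type f | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Path taken from the cp command in the dataset step.
echo "Files under /mnt/cephfs/nyc: $(count_files /mnt/cephfs/nyc)"
```

A count of 0 suggests that either the dataset was not generated or CephFS is not mounted at /mnt/cephfs.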
Execute the example.
LD_LIBRARY_PATH=/usr/local/lib release/dataset-skyhook-scan-example file:///mnt/cephfs/nyc