Arrow Skyhook example

The file cpp/examples/arrow/ located inside the source tree contains an example of using Skyhook to offload filters and projections to a Ceph cluster.



The instructions below are for Ubuntu 20.04 or above.
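
Before starting, it can help to confirm that the host matches this requirement. The snippet below is a minimal sketch, not part of the Arrow source tree: the function name is_supported_ubuntu is hypothetical, and it assumes the standard /etc/os-release file and GNU sort (for the -V version ordering) are available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the os-release file describes Ubuntu >= 20.04.
# The file path is a parameter so the check can be run against a sample file.
is_supported_ubuntu() {
    local os_release="${1:-/etc/os-release}"
    [ -r "$os_release" ] || return 1
    # Source the file in a subshell so ID and VERSION_ID do not leak into the caller.
    (
        . "$os_release"
        [ "${ID:-}" = "ubuntu" ] || exit 1
        # sort -V orders version strings; 20.04 must be the lowest of the pair (or equal).
        lowest=$(printf '%s\n' "20.04" "${VERSION_ID:-0}" | sort -V | head -n 1)
        [ "$lowest" = "20.04" ]
    )
}

if is_supported_ubuntu; then
    echo "Ubuntu 20.04 or above detected"
else
    echo "Warning: these instructions target Ubuntu 20.04 or above" >&2
fi
```

On other distributions the package names in step 1 and the rados-classes path in step 2 may differ, so the warning is only a hint to adapt those steps.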

  1. Install Ceph and Skyhook dependencies.

    apt update
    apt install -y cmake \
                   libradospp-dev \
                   rados-objclass-dev \
                   ceph \
                   ceph-common \
                   ceph-osd \
                   ceph-mon \
                   ceph-mgr \
                   ceph-mds \
                   rbd-mirror \
                   ceph-fuse \
                   rapidjson-dev \
                   libboost-all-dev
  2. Build and install Skyhook.

    git clone
    cd arrow/
    mkdir -p cpp/release
    cd cpp/release
    cmake -DARROW_SKYHOOK=ON \
          -DARROW_PARQUET=ON \
          -DARROW_DATASET=ON \
          -DARROW_CSV=ON \
          -DARROW_WITH_LZ4=ON \
          ..
    make -j install
    cp release/ /usr/lib/x86_64-linux-gnu/rados-classes/
  3. Deploy a Ceph cluster with a single in-memory OSD using this script.

    ./ /tmp/skyhook
  4. Generate the example dataset.

    pip install pandas pyarrow
    python3 ../../ci/scripts/
    cp -r nyc /mnt/cephfs/
  5. Execute the example.

    LD_LIBRARY_PATH=/usr/local/lib release/dataset-skyhook-scan-example file:///mnt/cephfs/nyc
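
If the example fails to start, a quick check that the artifacts from the earlier steps are in place can narrow down which step went wrong. The following is a rough sketch under stated assumptions: the function name check_skyhook_setup is hypothetical, the default paths are the ones used in the steps above, and it only tests that the paths exist, not that the Ceph cluster is healthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which artifacts from the steps above are missing.
# Defaults mirror the paths in the instructions; arguments override them.
check_skyhook_setup() {
    local classes_dir="${1:-/usr/lib/x86_64-linux-gnu/rados-classes}"
    local dataset_dir="${2:-/mnt/cephfs/nyc}"
    local example_bin="${3:-release/dataset-skyhook-scan-example}"
    local missing=0

    # Step 2 copies the Skyhook object class into this directory.
    if [ ! -d "$classes_dir" ]; then
        echo "missing RADOS class directory: $classes_dir" >&2
        missing=1
    fi
    # Step 4 copies the generated dataset onto the CephFS mount.
    if [ ! -d "$dataset_dir" ]; then
        echo "missing dataset directory: $dataset_dir" >&2
        missing=1
    fi
    # Step 5 runs this binary, relative to the build directory.
    if [ ! -x "$example_bin" ]; then
        echo "missing example binary: $example_bin" >&2
        missing=1
    fi
    return "$missing"
}

if check_skyhook_setup; then
    echo "all expected paths are present"
else
    echo "one or more paths are missing; revisit the steps above" >&2
fi
```

Run it from the cpp/release build directory so that the relative path to the example binary resolves the same way as in step 5.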