Arrow Skyhook example

The file cpp/examples/arrow/dataset_skyhook_scan_example.cc in the Arrow source tree shows how to use Skyhook to offload filters and projections to a Ceph cluster.

Instructions

Note

The instructions below are for Ubuntu 20.04 or above.
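
To confirm the host meets this requirement before starting, a small check along the following lines may help. It is a minimal sketch that reads the ID and VERSION_ID fields from /etc/os-release; the function name is_supported_ubuntu is illustrative and not part of Arrow or Skyhook, and sort -V assumes GNU coreutils.

```shell
# Sketch: succeed only on Ubuntu 20.04 or newer.
# is_supported_ubuntu is a hypothetical helper, not an Arrow or Skyhook tool.
is_supported_ubuntu() {
  # $1: path to an os-release file (defaults to /etc/os-release)
  local file="${1:-/etc/os-release}"
  local id version
  [ -r "$file" ] || return 1
  # Source the file in a subshell so its variables do not leak into this shell.
  id=$(. "$file" && printf '%s' "$ID")
  version=$(. "$file" && printf '%s' "$VERSION_ID")
  [ "$id" = "ubuntu" ] || return 1
  # sort -V orders version strings; 20.04 must be the smaller of the two.
  [ "$(printf '%s\n' "20.04" "$version" | sort -V | head -n 1)" = "20.04" ]
}

if ! is_supported_ubuntu; then
  echo "warning: these instructions target Ubuntu 20.04 or above" >&2
fi
```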

  1. Install Ceph and Skyhook dependencies.

    apt update
    apt install -y cmake \
                   libradospp-dev \
                   rados-objclass-dev \
                   ceph \
                   ceph-common \
                   ceph-osd \
                   ceph-mon \
                   ceph-mgr \
                   ceph-mds \
                   rbd-mirror \
                   ceph-fuse \
                   rapidjson-dev \
                   libboost-all-dev \
                   python3-pip
    
  2. Build and install Skyhook.

    git clone https://github.com/apache/arrow
    cd arrow/
    mkdir -p cpp/release
    cd cpp/release
    cmake -DARROW_SKYHOOK=ON \
          -DARROW_PARQUET=ON \
          -DARROW_WITH_SNAPPY=ON \
          -DARROW_BUILD_EXAMPLES=ON \
          -DARROW_DATASET=ON \
          -DARROW_CSV=ON \
          -DARROW_WITH_LZ4=ON \
          ..
    
    make -j install
    cp release/libcls_skyhook.so /usr/lib/x86_64-linux-gnu/rados-classes/
    
  3. Deploy a Ceph cluster with a single in-memory OSD using the micro-osd.sh script, run here from the current directory.

    ./micro-osd.sh /tmp/skyhook
    
  4. Generate the example dataset.

    pip install pandas pyarrow
    python3 ../../ci/scripts/generate_dataset.py
    cp -r nyc /mnt/cephfs/
    
  5. Execute the example.

    LD_LIBRARY_PATH=/usr/local/lib release/dataset-skyhook-scan-example file:///mnt/cephfs/nyc
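
If the example fails to start, it can help to check that the artifacts produced by steps 2 and 4 are where the commands above put them. The following is a minimal sketch under that assumption: skyhook_preflight is a hypothetical helper, its default paths are copied from the steps above, and it does not inspect the health of the Ceph cluster itself.

```shell
# Sketch: report which prerequisites from steps 2 and 4 are missing.
# skyhook_preflight is a hypothetical helper, not an Arrow or Skyhook tool.
skyhook_preflight() {
  # $1: RADOS object class directory, $2: dataset directory on CephFS
  local class_dir="${1:-/usr/lib/x86_64-linux-gnu/rados-classes}"
  local dataset_dir="${2:-/mnt/cephfs/nyc}"
  local missing=0

  # Step 2 copies the Skyhook object class into the RADOS class directory.
  if [ ! -f "$class_dir/libcls_skyhook.so" ]; then
    echo "missing: $class_dir/libcls_skyhook.so (step 2)"
    missing=1
  fi
  # Step 4 copies the generated dataset onto the CephFS mount.
  if [ ! -d "$dataset_dir" ] || [ -z "$(ls -A "$dataset_dir" 2>/dev/null)" ]; then
    echo "missing: dataset under $dataset_dir (step 4)"
    missing=1
  fi
  return "$missing"
}
```

Running skyhook_preflight with no arguments checks the default locations; it prints one line per missing item and returns a non-zero status if anything is absent, so it can guard the command in step 5.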