Apache Arrow 0.2.0 (18 February 2017)
Download
Changelog
Contributors
$ git shortlog -sn apache-arrow-0.1.0..apache-arrow-0.2.0
73 Wes McKinney
55 Uwe L. Korn
16 Julien Le Dem
4 Bryan Cutler
4 Nong Li
2 Christopher C. Aycock
2 Jingyuan Wang
2 Kouhei Sutou
2 Laurent Goujon
2 Leif Walsh
1 Emilio Lahr-Vivaz
1 Holden Karau
1 Li Jin
1 Mohamed Zenadi
1 Peter Hoffmann
1 Steven Phillips
1 adeneche
1 ahnj
1 vkorukanti
New Features and Improvements
- ARROW-108 - [C++] Add IPC round trip for union types
- ARROW-189 - C++: Use ExternalProject to build thirdparty dependencies
- ARROW-191 - Python: Provide infrastructure for manylinux1 wheels
- ARROW-221 - Add switch for writing Parquet 1.0 compatible logical types
- ARROW-227 - [C++/Python] Hook arrow_io generic reader / writer interface into arrow_parquet
- ARROW-228 - [Python] Create an Arrow-cpp-compatible interface for reading bytes from Python file-like objects
- ARROW-243 - [C++] Add “driver” option to HdfsClient to choose between libhdfs and libhdfs3 at runtime
- ARROW-268 - [C++] Flesh out union implementation to have all required methods for IPC
- ARROW-303 - [C++] Also build static libraries for leaf libraries
- ARROW-312 - [Python] Provide Python API to read/write the Arrow IPC file format
- ARROW-317 - [C++] Implement zero-copy Slice method on arrow::Buffer that retains reference to parent
- ARROW-327 - [Python] Remove conda builds from Travis CI processes
- ARROW-328 - [C++] Return shared_ptr by value instead of const-ref?
- ARROW-33 - C++: Implement zero-copy array slicing
- ARROW-330 - [C++] CMake functions to simplify shared / static library configuration
- ARROW-332 - [Python] Add helper function to convert RecordBatch to pandas.DataFrame
- ARROW-333 - Make writers update their internal schema even when no data is written.
- ARROW-335 - Improve Type apis and toString() by encapsulating flatbuffers better
- ARROW-336 - Run Apache Rat in Travis builds
- ARROW-338 - [C++] Refactor IPC vector “loading” and “unloading” to be based on cleaner visitor pattern
- ARROW-350 - Add Kerberos support to HDFS shim
- ARROW-353 - Arrow release 0.2
- ARROW-355 - Add tests for serialising arrays of empty strings to Parquet
- ARROW-356 - Add documentation about reading Parquet
- ARROW-359 - Need to document ARROW_LIBHDFS_DIR
- ARROW-360 - C++: Add method to shrink PoolBuffer using realloc
- ARROW-361 - Python: Support reading a column-selection from Parquet files
- ARROW-363 - Set up Java/C++ integration test harness
- ARROW-365 - Python: Provide Array.to_pandas()
- ARROW-366 - [java] implement Dictionary vector
- ARROW-367 - [java] converter csv/json <=> Arrow file format for Integration tests
- ARROW-368 - Document use of LD_LIBRARY_PATH when using Python
- ARROW-369 - [Python] Add ability to convert multiple record batches at once to pandas
- ARROW-372 - Create JSON arrow file format for integration tests
- ARROW-373 - [C++] Implement C++ version of JSON file format for testing
- ARROW-374 - Python: clarify unicode vs. binary in API
- ARROW-377 - Python: Add support for conversion of Pandas.Categorical
- ARROW-379 - Python: Use setuptools_scm/setuptools_scm_git_archive to provide the version number
- ARROW-380 - [Java] optimize null count when serializing vectors.
- ARROW-381 - [C++] Simplify primitive array type builders to use a default type singleton
- ARROW-382 - Python: Extend API documentation
- ARROW-383 - [C++] Implement C++ version of ARROW-367 integration test validator
- ARROW-389 - Python: Write Parquet files to pyarrow.io.NativeFile objects
- ARROW-394 - Add integration tests for boolean, list, struct, and other basic types
- ARROW-396 - Python: Add pyarrow.schema.Schema.equals
- ARROW-409 - Python: Change pyarrow.Table.dataframe_from_batches API to create Table instead
- ARROW-410 - [C++] Add Flush method to arrow::io::OutputStream
- ARROW-411 - [Java] Move Intergration.compare and Intergration.compareSchemas to a public utils class
- ARROW-415 - C++: Add Equals implementation to compare Tables
- ARROW-416 - C++: Add Equals implementation to compare Columns
- ARROW-417 - C++: Add Equals implementation to compare ChunkedArrays
- ARROW-418 - [C++] Consolidate array container and builder code, remove arrow/types
- ARROW-419 - [C++] Promote util/{status.h, buffer.h, memory-pool.h} to top level of arrow/ source directory
- ARROW-423 - C++: Define BUILD_BYPRODUCTS in external project to support non-make CMake generators
- ARROW-425 - Python: Expose a C function to convert arrow::Table to pyarrow.Table
- ARROW-426 - Python: Conversion from pyarrow.Array to a Python list
- ARROW-427 - [C++] Implement dictionary-encoded array container
- ARROW-428 - [Python] Deserialize from Arrow record batches to pandas in parallel using a thread pool
- ARROW-430 - Python: Better version handling
- ARROW-432 - [Python] Avoid unnecessary memory copy in to_pandas conversion by using low-level pandas internals APIs
- ARROW-438 - [Python] Concatenate Table instances with equal schemas
- ARROW-440 - [C++] Support pkg-config
- ARROW-441 - [Python] Expose Arrow’s file and memory map classes as NativeFile subclasses
- ARROW-442 - [Python] Add public Python API to inspect Parquet file metadata
- ARROW-444 - [Python] Avoid unnecessary memory copies from use of PyBytes_* C APIs
- ARROW-449 - Python: Conversion from pyarrow.{Table,RecordBatch} to a Python dict
- ARROW-450 - Python: Fixes for PARQUET-818
- ARROW-456 - C++: Add jemalloc based MemoryPool
- ARROW-457 - Python: Better control over memory pool
- ARROW-458 - Python: Expose jemalloc MemoryPool
- ARROW-461 - [Python] Implement conversion between arrow::DictionaryArray and pandas.Categorical
- ARROW-463 - C++: Support jemalloc 4.x
- ARROW-466 - C++: ExternalProject for jemalloc
- ARROW-467 - [Python] Run parquet-cpp unit tests in Travis CI
- ARROW-468 - Python: Conversion of nested data in pd.DataFrames to/from Arrow structures
- ARROW-470 - [Python] Add “FileSystem” abstraction to access directories of files in a uniform way
- ARROW-471 - [Python] Enable ParquetFile to pass down separately-obtained file metadata
- ARROW-472 - [Python] Expose parquet::{SchemaDescriptor, ColumnDescriptor}::Equals
- ARROW-474 - Create an Arrow streaming file fomat
- ARROW-475 - [Python] High level support for reading directories of Parquet files (as a single Arrow table) from supported file system interfaces
- ARROW-476 - [Integration] Add integration tests for Binary / Varbytes type
- ARROW-477 - [Java] Add support for second/microsecond/nanosecond timestamps in-memory and in IPC/JSON layer
- ARROW-478 - [Python] Accept a PyBytes object in the pyarrow.io.BufferReader ctor
- ARROW-479 - Python: Test for expected schema in Pandas conversion
- ARROW-484 - Add more detail about what of technology can be found in the Arrow implementations to README
- ARROW-485 - [Java] Users are required to initialize VariableLengthVectors.offsetVector before calling VariableLengthVectors.mutator.getSafe
- ARROW-490 - Python: Update manylinux1 build scripts
- ARROW-495 - [C++] Add C++ implementation of streaming serialized format
- ARROW-497 - [Java] Integration test harness for streaming format
- ARROW-498 - [C++] Integration test harness for streaming format
- ARROW-503 - [Python] Interface to streaming binary format
- ARROW-506 - Implement Arrow Echo server for integration testing
- ARROW-508 - [C++] Make file/memory-mapped file interfaces threadsafe
- ARROW-509 - [Python] Add support for PARQUET-835 (parallel column reads)
- ARROW-512 - C++: Add method to check for primitive types
- ARROW-514 - [Python] Accept pyarrow.io.Buffer as input to StreamReader, FileReader classes
- ARROW-515 - [Python] Add StreamReader/FileReader methods that read all record batches as a Table
- ARROW-521 - [C++/Python] Track peak memory use in default MemoryPool
- ARROW-524 - [java] provide apis to access nested vectors and buffers
- ARROW-525 - Python: Add more documentation to the package
- ARROW-527 - clean drill-module.conf file
- ARROW-529 - Python: Add jemalloc and Python 3.6 to manylinux1 build
- ARROW-531 - Python: Document jemalloc, extend Pandas section, add Getting Involved
- ARROW-538 - [C++] Set up AddressSanitizer (ASAN) builds
- ARROW-546 - Python: Account for changes in PARQUET-867
- ARROW-547 - [Python] Expose Array::Slice and RecordBatch::Slice
- ARROW-553 - C++: Faster valid bitmap building
- ARROW-558 - Add KEYS files
- ARROW-81 - [Format] Add a Category logical type (distinct from dictionary-encoding)
- ARROW-96 - C++: API documentation using Doxygen
- ARROW-97 - Python: API documentation via sphinx-apidoc
Bug Fixes
- ARROW-112 - [C++] Style fix for constants/enums
- ARROW-202 - [C++] Integrate with appveyor ci for windows support and get arrow building on windows
- ARROW-220 - [C++] Build conda artifacts in a build environment with better cross-linux ABI compatibility
- ARROW-224 - [C++] Address static linking of boost dependencies
- ARROW-230 - Python: Do not name modules like native ones (i.e. rename pyarrow.io)
- ARROW-239 - [Python] HdfsFile.read called with no arguments should read remainder of file
- ARROW-261 - [C++] Refactor BinaryArray/StringArray classes to not inherit from ListArray
- ARROW-275 - Add tests for UnionVector in Arrow File
- ARROW-294 - [C++] Do not use fopen / fclose / etc. methods for memory mapped file implementation
- ARROW-322 - [C++] Do not build HDFS IO interface optionally
- ARROW-323 - [Python] Opt-in to PyArrow parquet build rather than skipping silently on failure
- ARROW-334 - [Python] OS X rpath issues on some configurations
- ARROW-337 - UnionListWriter.list() is doing more than it should, this can cause data corruption
- ARROW-339 - Make merge_arrow_pr script work with Python 3
- ARROW-340 - [C++] Opening a writeable file on disk that already exists does not truncate to zero
- ARROW-342 - Set Python version on release
- ARROW-345 - libhdfs integration doesn’t work for Mac
- ARROW-346 - Python API Documentation
- ARROW-348 - [Python] CMake build type should be configurable on the command line
- ARROW-349 - Six is missing as a requirement in the python setup.py
- ARROW-351 - Time type has no unit
- ARROW-354 - Connot compare an array of empty strings to another
- ARROW-357 - Default Parquet chunk_size of 64k is too small
- ARROW-358 - [C++] libhdfs can be in non-standard locations in some Hadoop distributions
- ARROW-362 - Python: Calling to_pandas on a table read from Parquet leaks memory
- ARROW-371 - Python: Table with null timestamp becomes float in pandas
- ARROW-375 - columns parameter in parquet.read_table() raises KeyError for valid column
- ARROW-384 - Align Java and C++ RecordBatch data and metadata layout
- ARROW-386 - [Java] Respect case of struct / map field names
- ARROW-387 - [C++] arrow::io::BufferReader does not permit shared memory ownership in zero-copy reads
- ARROW-390 - C++: CMake fails on json-integration-test with ARROW_BUILD_TESTS=OFF
- ARROW-392 - Fix string/binary integration tests
- ARROW-393 - [JAVA] JSON file reader fails to set the buffer size on String data vector
- ARROW-395 - Arrow file format writes record batches in reverse order.
- ARROW-398 - [Java] Java file format requires bitmaps of all 1’s to be written when there are no nulls
- ARROW-399 - [Java] ListVector.loadFieldBuffers ignores the ArrowFieldNode length metadata
- ARROW-400 - [Java] ArrowWriter writes length 0 for Struct types
- ARROW-401 - [Java] Floating point vectors should do an approximate comparison in integration tests
- ARROW-402 - [Java] “refCnt gone negative” error in integration tests
- ARROW-403 - [JAVA] UnionVector: Creating a transfer pair doesn’t transfer the schema to destination vector
- ARROW-404 - [Python] Closing an HdfsClient while there are still open file handles results in a crash
- ARROW-405 - [C++] Be less stringent about finding include/hdfs.h in HADOOP_HOME
- ARROW-406 - [C++] Large HDFS reads must utilize the set file buffer size when making RPCs
- ARROW-408 - [C++/Python] Remove defunct conda recipes
- ARROW-414 - [Java] “Buffer too large to resize to …” error
- ARROW-420 - Align Date implementation between Java and C++
- ARROW-421 - [Python] Zero-copy buffers read by pyarrow::PyBytesReader must retain a reference to the parent PyBytes to avoid premature garbage collection issues
- ARROW-422 - C++: IPC should depend on rapidjson_ep if RapidJSON is vendored
- ARROW-429 - git-archive SHA-256 checksums are changing
- ARROW-433 - [Python] Date conversion is locale-dependent
- ARROW-434 - Segfaults and encoding issues in Python Parquet reads
- ARROW-435 - C++: Spelling mistake in if(RAPIDJSON_VENDORED)
- ARROW-437 - [C++] clang compiler warnings from overridden virtual functions
- ARROW-445 - C++: arrow_ipc is built before arrow/ipc/Message_generated.h was generated
- ARROW-447 - Python: Align scalar/pylist string encoding with pandas’ one.
- ARROW-455 - [C++] BufferOutputStream dtor does not call Close()
- ARROW-469 - C++: Add option so that resize doesn’t decrease the capacity
- ARROW-481 - [Python] Fix Python 2.7 regression in patch for PARQUET-472
- ARROW-486 - [C++] arrow::io::MemoryMappedFile can’t be casted to arrow::io::FileInterface
- ARROW-487 - Python: ConvertTableToPandas segfaults if ObjectBlock::Write fails
- ARROW-494 - [C++] When MemoryMappedFile is destructed, memory is unmapped even if buffer referecnes still exist
- ARROW-499 - Update file serialization to use streaming serialization format
- ARROW-505 - [C++] Fix compiler warnings in release mode
- ARROW-511 - [Python] List[T] conversions not implemented for single arrays
- ARROW-513 - [C++] Fix Appveyor build
- ARROW-519 - [C++] Missing vtable in libarrow.dylib on Xcode 6.4
- ARROW-523 - Python: Account for changes in PARQUET-834
- ARROW-533 - [C++] arrow::TimestampArray / TimeArray has a broken constructor
- ARROW-535 - [Python] Add type mapping for NPY_LONGLONG
- ARROW-537 - [C++] StringArray/BinaryArray comparisons may be incorrect when values with non-zero length are null
- ARROW-540 - [C++] Fix build in aftermath of ARROW-33
- ARROW-543 - C++: Lazily computed null_counts counts number of non-null entries
- ARROW-544 - [C++] ArrayLoader::LoadBinary fails for length-0 arrays
- ARROW-545 - [Python] Ignore files without .parq or .parquet prefix when reading directory of files
- ARROW-548 - [Python] Add nthreads option to pyarrow.Filesystem.read_parquet
- ARROW-551 - C++: Construction of Column with nullptr Array segfaults
- ARROW-556 - [Integration] Can not run Integration tests if different cpp build path
- ARROW-561 - Update java & python dependencies to improve downstream packaging experience