Apache Arrow 0.8.0 (18 December 2017)
This is a major release.
Download
Contributors
$ git shortlog -sn apache-arrow-0.7.1..apache-arrow-0.8.0
90 Wes McKinney
23 Phillip Cloud
21 Kouhei Sutou
13 Licht-T
12 Korn, Uwe
12 Philipp Moritz
12 Uwe L. Korn
10 Bryan Cutler
5 Li Jin
5 Robert Nishihara
4 Paul Taylor
4 siddharth
3 Max Risuhin
3 Stephanie
2 Rene Sugar
2 Heimir Sverrisson
2 Brian Hulette
2 Yuliya Feldman
2 dhirschf
2 Matthias Vallentin
1 vkorukanti
1 Andrew Andrade
1 Benjamin Goldberg
1 Ivan Sadikov
1 John Jenkins
1 Joris Van den Bossche
1 Lewis John McGibbney
1 Lu Qi
1 Manuel
1 Nick White
1 Ofek Lev
1 Shixiong Zhu
1 Siddharth Teotia
1 Stephen G
1 Victor Uriarte
1 Wataru Shimizu
1 ksdevlife
1 lmeyerov
1 rvernica
1 Amir Malekpour
Patch Committers
The following Apache committers merged contributed patches to the repository.
$ git shortlog -csn apache-arrow-0.7.0..apache-arrow-0.8.0
236 Wes McKinney
35 Uwe L. Korn
10 Philipp Moritz
5 Kouhei Sutou
1 Steven Phillips
Changelog
New Features and Improvements
- ARROW-1032 - [JS] Support custom_metadata
- ARROW-1047 - [Java] Add generalized stream writer and reader interfaces that are decoupled from IO / message framing
- ARROW-1087 - [Python] add get_include to expose directory containing header files
- ARROW-1114 - [C++] Create Record Batch Builder class as a reusable and efficient way to transpose row-by-row data to columns
- ARROW-1134 - [C++] Allow C++/CLI projects to build with Arrow
- ARROW-1178 - [Python] Create alternative to Table.from_pandas that yields a list of RecordBatch objects with a given chunk size
- ARROW-1226 - [C++] Improve / correct doxygen function documentation in arrow::ipc
- ARROW-1250 - [Python] Define API for user type checking of array types
- ARROW-1369 - Support boolean types in the javascript arrow reader library
- ARROW-1371 - [Website] Add “Powered By” page to the website
- ARROW-1455 - [Python] Add Dockerfile for validating Dask integration outside of usual CI
- ARROW-1471 - [JAVA] Document requirements and non-requirements for ValueVector updates
- ARROW-1472 - [JAVA] Design updated ValueVector Object Hierarchy
- ARROW-1473 - [JAVA] Create Prototype Code Hierarchy (Implementation Phase 1)
- ARROW-1474 - [JAVA] ValueVector hierarchy (Implementation Phase 2)
- ARROW-1476 - [JAVA] Implement final ValueVector updates
- ARROW-1482 - [C++] Implement casts between date32 and date64
- ARROW-1483 - [C++] Implement casts between time32 and time64
- ARROW-1484 - [C++] Implement (safe and unsafe) casts between timestamps and times of different units
- ARROW-1486 - [C++] Decide if arrow::RecordBatch needs to be copyable
- ARROW-1487 - [C++] Implement casts from List<A> to List<B>, where a cast function is defined from any A to B
- ARROW-1488 - [C++] Implement ArrayBuilder::Finish in terms of internal::ArrayData
- ARROW-1498 - [GitHub] Add CONTRIBUTING.md and ISSUE_TEMPLATE.md
- ARROW-1503 - [Python] Add serialization callbacks for pandas objects in pyarrow.serialize
- ARROW-1522 - [C++] Support pyarrow.Buffer as built-in type in pyarrow.serialize
- ARROW-1523 - [C++] Add helper data struct with methods for reading a validity bitmap possibly having a non-zero offset
- ARROW-1524 - [C++] More graceful solution for handling non-zero offsets on inputs and outputs in compute library
- ARROW-1525 - [C++] Change functions in arrow/compare.h to not return Status
- ARROW-1526 - [Python] Unit tests to exercise code path in PARQUET-1100
- ARROW-1535 - [Python] Enable sdist source tarballs to build assuming that Arrow C++ libraries are available on the host system
- ARROW-1538 - [C++] Support Ubuntu 14.04 in .deb packaging automation
- ARROW-1539 - [C++] Remove functions deprecated as of 0.7.0 and prior releases
- ARROW-1556 - [C++] Incorporate AssertArraysEqual function from PARQUET-1100 patch
- ARROW-1559 - [C++] Kernel implementations for “unique” (compute distinct elements of array)
- ARROW-1573 - [C++] Implement stateful kernel function that uses DictionaryBuilder to compute dictionary indices
- ARROW-1575 - [Python] Add pyarrow.column factory function
- ARROW-1577 - [JS] Package release script for NPM modules
- ARROW-1588 - [C++/Format] Harden Decimal Format
- ARROW-1593 - [PYTHON] serialize_pandas should pass through the preserve_index keyword
- ARROW-1594 - [Python] Enable multi-threaded conversions in Table.from_pandas
- ARROW-1600 - [C++] Zero-copy Buffer constructor from std::string
- ARROW-1602 - [C++] Add IsValid/IsNotNull method to arrow::Array
- ARROW-1603 - [C++] Add BinaryArray method to get a value as a std::string
- ARROW-1604 - [Python] Support common type aliases in cast(…) and various type= arguments
- ARROW-1605 - [Python] pyarrow.array should be able to yield smaller integer types without an explicit cast
- ARROW-1607 - [C++] Implement DictionaryBuilder for Decimals
- ARROW-1613 - [Java] ArrowReader should not close the input ReadChannel
- ARROW-1616 - [Python] Add “write” method to RecordBatchStreamWriter that dispatches to write_table/write_batch as appropriate
- ARROW-1626 - Add make targets to run the inter-procedural static analysis tool called “infer”.
- ARROW-1627 - [JAVA] Reduce heap usage (Phase 2) - memory footprint in AllocationManager.BufferLedger
- ARROW-1630 - [Serialization] Support Python datetime objects
- ARROW-1631 - [C++] Add GRPC to ThirdpartyToolchain.cmake
- ARROW-1635 - Add release management guide for PMCs
- ARROW-1637 - [C++] IPC round-trip for null type
- ARROW-1641 - [C++] Do not include <mutex> in public headers
- ARROW-1648 - C++: Add cast from Dictionary[NullType] to NullType
- ARROW-1649 - C++: Print number of nulls in PrettyPrint for NullArray
- ARROW-1651 - [JS] Lazy row accessor in Table
- ARROW-1652 - [JS] Separate Vector into BatchVector and CompositeVector
- ARROW-1654 - [Python] pa.DataType cannot be pickled
- ARROW-1662 - Move OSX Dependency management into brew bundle Brewfiles
- ARROW-1665 - [Serialization] Support more custom datatypes in the default serialization context
- ARROW-1666 - [GLib] Enable gtk-doc on Travis CI Mac environment
- ARROW-1667 - [GLib] Support Meson
- ARROW-1671 - [C++] Change arrow::MakeArray to not return Status
- ARROW-1675 - [Python] Use RecordBatch.from_pandas in FeatherWriter.write
- ARROW-1677 - [Blog] Add blog post on Ray and Arrow Python serialization
- ARROW-1679 - [GLib] Add garrow_record_batch_reader_read_next()
- ARROW-1683 - [Python] Restore “TimestampType” to pyarrow namespace
- ARROW-1684 - [Python] Simplify user API for reading nested Parquet columns
- ARROW-1685 - [GLib] Add GArrowTableReader
- ARROW-1689 - [Python] Categorical Indices Should Be Zero-Copy
- ARROW-1690 - [GLib] Add garrow_array_is_valid()
- ARROW-1691 - [Java] Conform Java Decimal type implementation to format decisions in ARROW-1588
- ARROW-1697 - [GitHub] Add ISSUE_TEMPLATE.md
- ARROW-1701 - [Serialization] Support zero copy PyTorch Tensor serialization
- ARROW-1702 - Update jemalloc in manylinux1 build
- ARROW-1703 - [C++] Vendor exact version of jemalloc we depend on
- ARROW-1707 - Update dev README after movement to GitBox
- ARROW-1710 - [Java] Remove non-nullable vectors in new vector class hierarchy
- ARROW-1716 - [Format/JSON] Use string integer value for Decimals in JSON
- ARROW-1717 - [Java] Remove public static helper method in vector classes for JSONReader/Writer
- ARROW-1718 - [Python] Implement casts from timestamp to date32/date64 and support in Array.from_pandas
- ARROW-1719 - [Java] Remove accessor/mutator
- ARROW-1721 - [Python] Support null mask in places where it isn’t supported in numpy_to_arrow.cc
- ARROW-1724 - [Packaging] Support Ubuntu 17.10
- ARROW-1725 - [Packaging] Upload .deb for Ubuntu 17.10
- ARROW-1726 - [GLib] Add setup description to verify C GLib build
- ARROW-1727 - [Format] Expand Arrow streaming format to permit new dictionaries and deltas / additions to existing dictionaries
- ARROW-1728 - [C++] Run clang-format checks in Travis CI
- ARROW-1734 - C++/Python: Add cast function on Column-level
- ARROW-1736 - [GLib] Add GArrowCastOptions:allow-time-truncate
- ARROW-1737 - [GLib] Use G_DECLARE_DERIVABLE_TYPE
- ARROW-1746 - [Python] Add build dependencies for Arch Linux
- ARROW-1747 - [C++] Don’t export symbols of statically linked libraries
- ARROW-1748 - [GLib] Add GArrowRecordBatchBuilder
- ARROW-1750 - [C++] Remove the need for arrow/util/random.h
- ARROW-1752 - [Packaging] Add GPU packages for Debian and Ubuntu
- ARROW-1753 - [Python] Provide for matching subclasses with register_type in serialization context
- ARROW-1755 - [C++] Add build options for MSVC to use static runtime libraries
- ARROW-1758 - [Python] Remove pickle=True option for object serialization
- ARROW-1763 - [Python] DataType should be hashable
- ARROW-1765 - [Doc] Use dependencies from conda in C++ docker build
- ARROW-1767 - [C++] Support file reads and writes over 2GB on Windows
- ARROW-1772 - [C++] Add public-api-test module in style of parquet-cpp
- ARROW-1773 - [C++] Add casts from date/time types to compatible signed integers
- ARROW-1775 - Ability to abort created but unsealed Plasma objects
- ARROW-1777 - [C++] Add static ctor ArrayData::Make for nicer syntax in places
- ARROW-1779 - [Java] Integration test breaks without zeroing out validity vectors
- ARROW-1782 - [Python] Expose compressors as pyarrow.compress, pyarrow.decompress
- ARROW-1783 - [Python] Convert SerializedPyObject to/from sequence of component buffers with minimal memory allocation / copying
- ARROW-1784 - [Python] Read and write pandas.DataFrame in pyarrow.serialize by decomposing the BlockManager rather than coercing to Arrow format
- ARROW-1785 - [Format/C++/Java] Remove VectorLayout metadata from Flatbuffers metadata
- ARROW-1787 - [Python] Support reading parquet files into DataFrames in a backward compatible way
- ARROW-1794 - [C++/Python] Rename DecimalArray to Decimal128Array
- ARROW-1801 - [Docs] Update install instructions to use red-data-tools repos
- ARROW-1802 - [GLib] Add Arrow GPU support
- ARROW-1806 - [GLib] Add garrow_record_batch_writer_write_table()
- ARROW-1808 - [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns
- ARROW-1809 - [GLib] Use .xml instead of .sgml for GTK-Doc main file
- ARROW-1810 - [Plasma] Remove test shell scripts
- ARROW-1817 - Configure JsonFileReader to read NaN for floats
- ARROW-1818 - Examine Java Dependencies
- ARROW-1819 - [Java] Remove legacy vector classes
- ARROW-1826 - [JAVA] Avoid branching at cell level (copyFrom)
- ARROW-1827 - [Java] Add checkstyle config file and header file
- ARROW-1828 - [C++] Implement hash kernel specialization for BooleanType
- ARROW-1834 - [Doc] Build documentation in separate build folders
- ARROW-1838 - [C++] Use compute::Datum uniformly for input argument to kernels
- ARROW-1841 - [JS] Update text-encoding-utf-8 and tslib for node ESModules support
- ARROW-1844 - [C++] Basic benchmark suite for hash kernels
- ARROW-1849 - [GLib] Add input checks to GArrowRecordBatch
- ARROW-1850 - [C++] Use const void* in Writable::Write instead of const uint8_t*
- ARROW-1854 - [Python] Improve performance of serializing object dtype ndarrays
- ARROW-1855 - [GLib] Add workaround for build failure on macOS
- ARROW-1857 - [Python] Add switch for boost linkage with static parquet in wheels
- ARROW-1859 - [GLib] Add GArrowDictionaryDataType
- ARROW-1862 - [GLib] Add GArrowDictionaryArray
- ARROW-1864 - [Java] Upgrade Netty to 4.1.x
- ARROW-1867 - [Java] Add BitVector APIs from old vector class
- ARROW-1874 - [GLib] Add garrow_array_unique()
- ARROW-1878 - [GLib] Add garrow_array_dictionary_encode()
- ARROW-1884 - [C++] Make JsonReader/JsonWriter classes internal APIs
- ARROW-1885 - [Java] Restore previous MapVector class names
- ARROW-1901 - [Python] Support recursive mkdir for DaskFilesystem
- ARROW-1902 - [Python] Remove mkdir race condition from write_to_dataset
- ARROW-1905 - [Python] Add more functions for checking exact types in pyarrow.types
- ARROW-1911 - Add Graphistry to Arrow JS proof points
- ARROW-480 - [Python] Add accessors for Parquet column statistics
- ARROW-504 - [Python] Add adapter to write pandas.DataFrame in user-selected chunk size to streaming format
- ARROW-507 - [C++/Python] Construct List container from offsets and values subarrays
- ARROW-541 - [JS] Implement JavaScript-compatible implementation
- ARROW-571 - [Python] Add APIs to build Parquet files incrementally from Arrow tables
- ARROW-587 - Add JIRA fix version to merge tool
- ARROW-609 - [C++] Function for casting from days since UNIX epoch to int64 date
- ARROW-838 - [Python] Efficient construction of arrays from non-pandas 1D NumPy arrays
- ARROW-905 - [Docs] Add Dockerfile for reproducible documentation generation
- ARROW-942 - Support integration testing on Python 2.7
- ARROW-950 - [Site] Add Google Analytics tag
- ARROW-972 - [Python] Add test cases and basic APIs for UnionArray
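Several of the cast entries above (ARROW-1482, ARROW-609) come down to unit arithmetic between Arrow's two date representations: date32 stores days since the UNIX epoch, and date64 stores milliseconds since the UNIX epoch. The following is a minimal standard-library sketch of that arithmetic, not Arrow's actual kernel code. The function names and the `safe` flag are hypothetical, chosen only to mirror the "safe and unsafe" wording used in the list.

```python
from datetime import date

MS_PER_DAY = 86_400_000  # milliseconds in one day


def date32_to_date64(days: int) -> int:
    """Widen days-since-epoch to milliseconds-since-epoch (never loses data)."""
    return days * MS_PER_DAY


def date64_to_date32(ms: int, safe: bool = True) -> int:
    """Narrow milliseconds-since-epoch to days-since-epoch.

    Hypothetical 'safe' mode: refuse values that are not a whole number of
    days. With safe=False the time-of-day part is dropped by floor division.
    """
    days, remainder = divmod(ms, MS_PER_DAY)
    if safe and remainder:
        raise ValueError(f"{ms} ms is not a whole number of days")
    return days


# Days between the UNIX epoch and the 0.8.0 release date.
release_day = (date(2017, 12, 18) - date(1970, 1, 1)).days
release_ms = date32_to_date64(release_day)

# The widening cast round-trips exactly.
assert date64_to_date32(release_ms) == release_day

# One extra millisecond makes the safe narrowing cast fail.
try:
    date64_to_date32(release_ms + 1)
except ValueError as exc:
    print("safe cast rejected:", exc)

# The unsafe variant simply drops the time-of-day part.
print(date64_to_date32(release_ms + 1, safe=False) == release_day)
```

Widening is always exact, while narrowing can discard information, which is why a cast API benefits from letting the caller choose between raising an error and truncating.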
Bug Fixes
- ARROW-1282 - Large memory reallocation by Arrow causes hang in jemalloc
- ARROW-1341 - [C++] Deprecate arrow::MakeTable in favor of new ctor from ARROW-1334
- ARROW-1347 - [JAVA] List null type should use consistent name for inner field
- ARROW-1398 - [Python] No support reading columns of type decimal(19,4)
- ARROW-1409 - [Format] Use for “page” attribute in Buffer in metadata
- ARROW-1540 - [C++] Fix valgrind warnings in cuda-test if possible
- ARROW-1541 - [C++] Race condition with arrow_gpu
- ARROW-1543 - [C++] row_wise_conversion example doesn’t correspond to ListBuilder constructor arguments
- ARROW-1549 - [JS] Integrate auto-generated Arrow test files
- ARROW-1555 - [Python] write_to_dataset on s3
- ARROW-1584 - [PYTHON] serialize_pandas on empty dataframe
- ARROW-1585 - serialize_pandas round trip fails on integer columns
- ARROW-1586 - [PYTHON] serialize_pandas roundtrip loses columns name
- ARROW-1609 - Plasma: Build fails with Xcode 9.0
- ARROW-1615 - CXX flags for development more permissive than Travis CI builds
- ARROW-1617 - [Python] Do not use symlinks in python/cmake_modules
- ARROW-1620 - Python: Download Boost in manylinux1 build from bintray
- ARROW-1624 - [C++] Follow up fixes / tweaks to compiler warnings for Plasma / LLVM 4.0, add to readme
- ARROW-1625 - [Serialization] Support OrderedDict properly
- ARROW-1629 - [C++] Fix problematic code paths identified by infer tool
- ARROW-1633 - [Python] numpy “unicode” arrays not understood
- ARROW-1640 - Resolve OpenSSL issues in Travis CI
- ARROW-1647 - [Plasma] Potential bug when reading/writing messages.
- ARROW-1653 - [Plasma] Use static cast to avoid compiler warning.
- ARROW-1656 - [C++] Endianness Macro is Incorrect on Windows And Mac
- ARROW-1657 - [C++] Multithreaded Read Test Failing on Arch Linux
- ARROW-1658 - [Python] Out of bounds dictionary indices causes segfault after converting to pandas
- ARROW-1663 - [Java] Follow up on ARROW-1347 and make schema backward compatible
- ARROW-1670 - [Python] Speed up deserialization code path
- ARROW-1672 - [Python] Failure to write Feather bytes column
- ARROW-1673 - [Python] NumPy boolean arrays get converted to uint8 arrays on NdarrayToTensor roundtrip
- ARROW-1676 - [C++] Correctly truncate oversized validity bitmaps when writing Feather format
- ARROW-1678 - [Python] Incorrect serialization of numpy.float16
- ARROW-1680 - [Python] Timestamp unit change not done in from_pandas() conversion
- ARROW-1686 - Documentation generation script creates “apidocs” directory under site/java
- ARROW-1693 - [JS] Error reading dictionary-encoded integration test files
- ARROW-1695 - [Serialization] Fix reference counting of numpy arrays created in custom serializer
- ARROW-1698 - [JS] File reader attempts to load the same dictionary batch more than once
- ARROW-1704 - [GLib] Go example in test suite is broken
- ARROW-1708 - [JS] Linter problem breaks master build
- ARROW-1709 - [C++] Decimal.ToString is incorrect for negative scale
- ARROW-1711 - [Python] flake8 checks still not failing builds
- ARROW-1714 - [Python] No named pd.Series name serialized as u'None'
- ARROW-1720 - [Python] Segmentation fault while trying to access an out-of-bound chunk
- ARROW-1723 - Windows: __declspec(dllexport) specified when building arrow static library
- ARROW-1730 - [Python] Incorrect result from pyarrow.array when passing timestamp type
- ARROW-1732 - [Python] RecordBatch.from_pandas fails on DataFrame with no columns when preserve_index=False
- ARROW-1735 - [C++] Cast kernels cannot write into sliced output array
- ARROW-1738 - [Python] Wrong datetime conversion when pa.array with unit
- ARROW-1739 - [Python] Fix usages of assertRaises causing broken build
- ARROW-1742 - C++: clang-format is not detected correctly on OSX anymore
- ARROW-1743 - [Python] Table to_pandas fails when index contains categorical column
- ARROW-1745 - Compilation failure on Mac OS in plasma tests
- ARROW-1749 - [C++] Handle range of Decimal128 values that require 39 digits to be displayed
- ARROW-1751 - [Python] Pandas 0.21.0 introduces a breaking API change for MultiIndex construction
- ARROW-1754 - [Python] Fix buggy Parquet roundtrip when an index name is the same as a column name
- ARROW-1756 - [Python] Observed int32 overflow in Feather write/read path
- ARROW-1762 - [C++] unittest failure for language environment
- ARROW-1764 - [Python] Add -c conda-forge for Windows dev installation instructions
- ARROW-1766 - [GLib] Fix failing builds on OSX
- ARROW-1768 - [Python] Fix suppressed exception in ParquetWriter.__del__
- ARROW-1770 - [GLib] Fix GLib compiler warning
- ARROW-1771 - [C++] ARROW-1749 Breaks Public API test in parquet-cpp
- ARROW-1776 - [C++] arrow::gpu::CudaContext::bytes_allocated() isn’t defined
- ARROW-1778 - [Python] Link parquet-cpp statically, privately in manylinux1 wheels
- ARROW-1781 - [CI] OSX Builds on Travis-CI time out often
- ARROW-1788 - Plasma store crashes when trying to abort objects for disconnected client
- ARROW-1791 - Integration tests generate date[DAY] values outside of reasonable range
- ARROW-1793 - [Integration] fix a typo for README.md
- ARROW-1800 - [C++] Fix and simplify random_decimals
- ARROW-1805 - [Python] ignore non-parquet files when exploring dataset
- ARROW-1811 - [C++/Python] Rename all Decimal based APIs to Decimal128
- ARROW-1812 - Plasma store modifies hash table while iterating during client disconnect
- ARROW-1821 - Add integration test case to explicitly check for optional validity buffer
- ARROW-1829 - [Plasma] Clean up eviction policy bookkeeping
- ARROW-1830 - [Python] Error when loading all the files in a dictionary
- ARROW-1836 - [C++] Fix C4996 warning from arrow/util/variant.h on MSVC builds
- ARROW-1839 - [C++/Python] Add Decimal Parquet Read/Write Tests
- ARROW-1840 - [Website] The installation command failed on Windows10 anaconda environment.
- ARROW-1845 - [Python] Expose Decimal128Type
- ARROW-1852 - [Plasma] Make retrieving manager file descriptor const
- ARROW-1853 - [Plasma] Fix off-by-one error in retry processing
- ARROW-1863 - [Python] PyObjectStringify could render bytes-like output for more types of objects
- ARROW-1865 - [C++] Adding a column to an empty Table fails
- ARROW-1869 - Fix typo in LowCostIdentityHashMap
- ARROW-1871 - [Python/C++] Appending Python Decimals with different scales requires rescaling
- ARROW-1873 - [Python] Segmentation fault when loading total 2GB of parquet files
- ARROW-1877 - Incorrect comparison in JsonStringArrayList.equals
- ARROW-1879 - [Python] Dask integration tests are not skipped if dask is not installed
- ARROW-1881 - [Python] setuptools_scm picks up JS version tags
- ARROW-1882 - [C++] Reintroduce DictionaryBuilder
- ARROW-1883 - [Python] BUG: Table.to_pandas metadata checking fails if columns are not present
- ARROW-1889 - [Python] --exclude is not available in older git versions
- ARROW-1890 - [Python] Masking for date32 arrays not working
- ARROW-1891 - [Python] NaT date32 values are only converted to nulls if from_pandas is used
- ARROW-1892 - [Python] Unknown list item type: binary
- ARROW-1893 - [Python] test_primitive_serialization fails on Python 2.7.3
- ARROW-1895 - [Python] Add field_name to pandas index metadata
- ARROW-1897 - [Python] Incorrect numpy_type for pandas metadata of Categoricals
- ARROW-1904 - [C++] Deprecate PrimitiveArray::raw_values
- ARROW-1906 - [Python] Creating a pyarrow.Array with timestamp of different unit is not casted
- ARROW-1908 - [Python] Construction of arrow table from pandas DataFrame with duplicate column names crashes
- ARROW-1910 - CPP README Brewfile link incorrect
- ARROW-1914 - [C++] make -j may fail to build with -DARROW_GPU=on
- ARROW-1915 - [Python] Parquet tests should be optional
- ARROW-1916 - [Java] Do not exclude java/dev/checkstyle from source releases
- ARROW-1917 - [GLib] Must set GI_TYPELIB_PATH in verify-release-candidate.sh
- ARROW-226 - [C++] libhdfs: feedback to help determining cause of failure in opening file path
- ARROW-641 - [C++] Do not build/run io-hdfs-test if ARROW_HDFS=off
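The "39 digits" in ARROW-1749 follows from the storage type: a Decimal128 value is held as a 128-bit two's-complement integer, and the extremes of that range need one more decimal digit than the 38 digits of precision the type advertises. The short check below uses only Python's arbitrary-precision integers. It is an illustrative sketch of the arithmetic, and the constant names are hypothetical rather than Arrow identifiers.

```python
# Bounds of a signed 128-bit two's-complement integer.
INT128_MAX = 2**127 - 1
INT128_MIN = -(2**127)

# Largest magnitude that fits in 38 decimal digits.
MAX_38_DIGITS = 10**38 - 1

# Number of decimal digits needed to print each extreme (sign excluded).
digits_max = len(str(INT128_MAX))
digits_min = len(str(abs(INT128_MIN)))

print(digits_max, digits_min)

# Every 38-digit value fits in 128 bits, but the 128-bit range extends past
# it, so a formatter must allow for 39 digits.
print(MAX_38_DIGITS < INT128_MAX)
```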