Bug reports and feature requests

Arrow relies upon user feedback to identify defects and improvement opportunities. All users are encouraged to participate by creating bug reports and feature requests or commenting on existing issues. Even if you cannot contribute solutions to the issues yourself, your feedback helps us understand problems and prioritize work to improve the libraries.

GitHub issues

The Arrow project uses GitHub issues to track both bug reports and feature requests.

Creating issues

Apache Arrow relies upon community contributions to address reported bugs and feature requests. As with most software projects, contributor time and resources are finite. The following guidelines aim to produce high-quality bug reports and feature requests, enabling community contributors to respond to more issues, faster:

Check existing issues

Before you create a new issue, we recommend you first search for unresolved existing issues identifying the same problem or feature request.
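
As a rough illustration, a search for open issues can be expressed as a URL. The sketch below assumes the issue tracker is the apache/arrow repository on GitHub and uses GitHub's `is:issue` and `is:open` search qualifiers; the helper name `open_issue_search_url` is hypothetical and only exists for this example.

```python
from urllib.parse import urlencode

# Assumption: the Arrow issue tracker lives in the apache/arrow GitHub repository.
ISSUES_URL = "https://github.com/apache/arrow/issues"


def open_issue_search_url(*keywords: str) -> str:
    """Build a GitHub search URL restricted to open issues matching the keywords."""
    # "is:issue" and "is:open" are GitHub search qualifiers; keywords are appended as-is.
    query = " ".join(["is:issue", "is:open", *keywords])
    return f"{ISSUES_URL}?{urlencode({'q': query})}"


# Paste the printed URL into a browser to look for similar reports.
print(open_issue_search_url("timestamp", "timezone"))
```

Trying a few different keyword combinations (an error message fragment, a function name) tends to surface duplicates that a single query would miss.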

Issue description

A clear description of the problem or requested feature is the most important element of any issue. An effective description helps developers understand and efficiently engage on reported issues, and may include the following:

  • Clear, minimal steps to reproduce the issue, with as few non-Arrow dependencies as possible. If there’s a problem reading a file, try to provide as small an example file as possible, or code to create one. If your bug report says “it crashes trying to read my file, but I can’t share it with you,” it’s really hard for us to debug.

  • Any relevant operating system, language, and library version information.

  • If it isn’t obvious, clearly state the expected behavior and what actually happened.

  • Avoid overloading a single issue with multiple problems or feature requests. Each issue should deal with a single bug or feature.
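
One way to collect the version details mentioned in the list above is a short script whose output can be pasted into the issue. This is a minimal sketch using only the Python standard library; the function name `environment_report` and the default package list are assumptions made for illustration, so adjust them to whatever libraries your reproduction actually uses.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("pyarrow", "numpy", "pandas")):
    """Return lines describing the OS, Python, and installed package versions."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of failing.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return lines


print("\n".join(environment_report()))
```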

If a developer can’t reproduce the problem in a failing unit test, they can’t confirm that the issue has been identified, and they won’t know when it has been fixed. Try to anticipate the questions you might be asked by someone working to understand the issue and provide those supporting details up front.

Examples of good bug reports are found below:

The print method of a timestamp with timezone errors:

import pyarrow as pa

a = pa.array([0], pa.timestamp('s', tz='+02:00'))

print(a) # representation not correct?
# <pyarrow.lib.TimestampArray object at 0x7f834c7cb9a8>
# [
#  1970-01-01 00:00:00
# ]

print(a[0])
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "pyarrow/scalar.pxi", line 80, in pyarrow.lib.Scalar.__repr__
#  File "pyarrow/scalar.pxi", line 463, in pyarrow.lib.TimestampScalar.as_py
#  File "pyarrow/scalar.pxi", line 393, in pyarrow.lib._datetime_from_int
#ValueError: fromutc: dt.tzinfo is not self

Error when reading a CSV file with col_types option "T" or "t" when source data is in millisecond precision:

library(arrow, warn.conflicts = FALSE)
tf <- tempfile()
write.csv(data.frame(x = '2018-10-07 19:04:05.005'), tf, row.names = FALSE)

# successfully read in file
read_csv_arrow(tf, as_data_frame = TRUE)
#> # A tibble: 1 × 1
#>   x
#>   <dttm>
#> 1 2018-10-07 20:04:05

# the unit here is seconds - doesn't work
read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "T",
  skip = 1
)
#> Error in `handle_csv_read_error()`:
#> ! Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'

# the unit here is ms - doesn't work
read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "t",
  skip = 1
)
#> Error in `handle_csv_read_error()`:
#> ! Invalid: In CSV column #0: CSV conversion error to time32[ms]: invalid value '2018-10-07 19:04:05.005'

# the unit here is inferred as ns - does work!
read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "?",
  skip = 1,
  as_data_frame = FALSE
)
#> Table
#> 1 rows x 1 columns
#> $x <timestamp[ns]>

Other resources for producing useful bug reports:

Identify Arrow component

Arrow is an expansive project supporting many languages and organized into a number of components. Identifying the affected component(s) helps new issues get attention from appropriate contributors.

  • A component label, which a committer of the Apache Arrow project can add, indicates the area of the project that your issue pertains to (for example “Component: Python” or “Component: C++”).

  • Prefix the issue title with the component name in brackets, for example [Python] issue summary. This helps when navigating lists of open issues and makes our changelogs more readable. Most prefixes are exactly the same as the component name, with the following exceptions:

    • Component: Continuous Integration — Summary prefix: [CI]

    • Component: Developer Tools — Summary prefix: [Dev]

    • Component: Documentation — Summary prefix: [Docs]
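
The prefix rule and its three exceptions can be summarized in a few lines of code. The helper below is purely illustrative: `issue_title` and `PREFIX_EXCEPTIONS` are hypothetical names and not part of any Arrow tooling.

```python
# Components whose title prefix differs from the component name (from the list above).
PREFIX_EXCEPTIONS = {
    "Continuous Integration": "CI",
    "Developer Tools": "Dev",
    "Documentation": "Docs",
}


def issue_title(component: str, summary: str) -> str:
    """Prefix an issue summary with its bracketed component name."""
    # Fall back to the component name itself when no exception applies.
    prefix = PREFIX_EXCEPTIONS.get(component, component)
    return f"[{prefix}] {summary}"


print(issue_title("Python", "Printing a timestamp scalar with a timezone raises ValueError"))
print(issue_title("Documentation", "Fix typo in the bug report guidelines"))
```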

Issue lifecycle

Both bug reports and feature requests follow a defined lifecycle. If an issue is currently being worked on, it should have a developer assigned. When an issue has reached a terminal status, it is closed with one of two outcomes:

  • Closed as completed - indicates the issue is complete; the PR that resolved the issue should have been automatically linked by GitHub (assuming the PR correctly mentioned the issue number).

    If you are merging a PR, it is good practice to add a comment to the linked issue stating which PR resolves it. This way GitHub creates a notification for anybody who collaborated on the issue.

  • Closed as not planned - indicates the issue was closed without any action being taken and should not receive any further updates.

Issue assignment

Assignment signals commitment to work on an issue, and contributors should self-assign issues when that work starts. Anyone can now self-assign an issue by commenting “take” on it.