Arrow is a foundational project that will need to evolve over many years or even decades, while serving potentially millions of users. We believe that being meticulous when reviewing brings greater rewards to the project than being lenient and aiming for quick merges.
Code reviews like this lead to better quality code, more people who are engaged with and understand the code being changed, and a generally healthier project with more room to grow and accommodate emerging needs.
Use your own judgement. These guidelines are not hard rules; committers are expected to have sufficient expertise in their areas of work to adjust their approach based on the concerns at hand.
These guidelines are not listed in a particular order and are not intended to be used as a checklist.
Finally, these guidelines are not exhaustive.
Scope and completeness¶
Our general policy is not to introduce regressions or to merge PRs that require follow-ups to function correctly (though exceptions to this can be made). Making necessary changes after a merge is more costly not only for the contributor and the reviewer, but also for other developers, who may be confused if they run into problems introduced by a merged PR.
What changes are in scope for a PR, and which should be pushed out of scope with a follow-up JIRA created for them, should be determined in collaboration between the authors and the reviewers.
When a large piece of functionality is being contributed and it seems desirable to integrate it piecewise, favour functional cohesion when deciding how to divide changes (for example, if a filesystem implementation is being contributed, a first PR may contribute directory metadata operations, a second PR file reading facilities and a third PR file writing facilities).
Public API design¶
Public APIs should nudge users towards the most desirable constructs. In other words, if there is a “best” way to do something, it should ideally also be the most easily discoverable and the most concise to type. For example, safe APIs should be featured more prominently than unsafe APIs that may crash or silently produce erroneous results on invalid input.
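As a minimal sketch of this principle (the function names here are hypothetical, not part of Arrow), the safe, bounds-checked accessor gets the short, discoverable name, while the unchecked variant is clearly marked as unsafe:

```python
# Hypothetical sketch: the safe API is the most prominent and concise one;
# the unchecked variant carries a warning label in its very name.

def element(values, i):
    """Return values[i], raising IndexError on out-of-bounds access."""
    if not 0 <= i < len(values):
        raise IndexError(f"index {i} out of bounds for length {len(values)}")
    return values[i]

def unsafe_element(values, i):
    """Return values[i] without bounds checking.

    On invalid input this may silently produce a wrong result (here, a
    negative index wraps around); in a low-level language it could crash.
    """
    return values[i]
```

Because `element` is the shortest name, it is also the call users reach for first, which is exactly the nudge the guideline asks for.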
Public APIs should lend themselves to readable calling code. One example is when multiple options are expected to be added over time: it is better to organize the options logically than to juxtapose them all in a function’s signature (see for example the CSV reading APIs in C++ and Python).
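A rough sketch of that pattern, in the spirit of the `ReadOptions`/`ParseOptions`/`ConvertOptions` split used by `pyarrow.csv` (the classes and the `read_table` stub below are hypothetical illustrations, not real Arrow APIs):

```python
from dataclasses import dataclass

# Hypothetical sketch: grouping related options into small option structs
# keeps the entry point's signature stable as options accrue over time.

@dataclass
class ParseOptions:
    delimiter: str = ","
    quote_char: str = '"'

@dataclass
class ConvertOptions:
    strings_can_be_null: bool = False
    null_values: tuple = ("", "NULL")

def read_table(path, parse_options=None, convert_options=None):
    """Read a delimited file (illustrative stub; does no actual I/O)."""
    parse_options = parse_options or ParseOptions()
    convert_options = convert_options or ConvertOptions()
    ...
```

Calling code then reads as `read_table("data.csv", parse_options=ParseOptions(delimiter=";"))` rather than a long row of anonymous positional flags.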
Naming is important. Try to ask yourself if code calling the new API would be understandable without having to read the API docs. Vague naming should be avoided; inaccurate naming is even worse as it can mislead the reader and lead to buggy user code.
Be mindful of terminology. Every project has (explicitly or tacitly) set conventions about how to name important concepts; steering away from those conventions increases the cognitive workload both for contributors and users of the project. Conversely, reusing a well-known term for a different purpose than usual can also increase the cognitive workload and make developers’ lives more difficult.
If you are unsure whether an API is the right one for the task at hand, it is advisable to mark it experimental, such that users know that it may be changed over time, while contributors are less wary of bringing code-breaking improvements. However, experimental APIs should not be used as an excuse for eschewing basic API design principles.
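One lightweight way to mark an API as experimental, sketched here in Python (the `experimental` decorator is a hypothetical helper, not an Arrow utility; projects may equally use a documented “Experimental” note):

```python
import functools
import warnings

# Hypothetical helper: wrap an experimental function so callers are warned
# that the API may change in a future release.

def experimental(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{func.__name__} is experimental and may change in a future release",
            FutureWarning,
            stacklevel=2,
        )
        return func(*args, **kwargs)
    return wrapper

@experimental
def new_algorithm(x):
    """An illustrative API that is still settling down."""
    return x * 2
```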
Robustness¶

Arrow is a set of open source libraries that will be used in a very wide array of contexts (including fiddling with deliberately artificial data at a Jupyter interpreter prompt). If you are writing a public API, make sure that it won’t crash or produce undefined behaviour on unusual (but valid) inputs.
When a non-trivial algorithm is implemented, defensive coding can be useful to catch potential problems (such as debug-only assertions, if the language allows them).
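In Python, the built-in `assert` statement plays this role: it is removed when running with `-O`, much like a debug-only `DCHECK` in C++, so it must document invariants rather than guard behaviour that release runs rely on. A sketch on a hypothetical merge routine:

```python
# Sketch: debug-only assertions document the invariants of a non-trivial
# algorithm without costing anything in optimized runs (python -O strips them).

def merge_sorted(left, right):
    """Merge two sorted lists into one sorted list."""
    assert left == sorted(left), "precondition: left must be sorted"
    assert right == sorted(right), "precondition: right must be sorted"
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    assert len(out) == len(left) + len(right)  # postcondition
    return out
```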
APIs ingesting potentially untrusted data, such as on-disk file formats, should try to avoid crashing or producing silent bugs when invalid or corrupt data is fed to them. This can require a lot of care that is out of the scope of regular code reviews (such as setting up fuzz testing), but basic checks can still be suggested at the code review stage.
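The flavour of basic check a reviewer can ask for, sketched on a hypothetical length-prefixed binary record (the format and `MAX_FIELD_SIZE` limit are invented for illustration): every length read from untrusted input is validated before use, so corrupt data produces a clean error instead of a crash or an oversized allocation.

```python
import struct

# Hypothetical sketch: defensive parsing of untrusted, length-prefixed input.

MAX_FIELD_SIZE = 1 << 20  # reject absurd sizes before allocating anything

def parse_record(buf: bytes) -> bytes:
    """Return the payload of a record laid out as <uint32 length><payload>."""
    if len(buf) < 4:
        raise ValueError("truncated record: missing length prefix")
    (length,) = struct.unpack_from("<I", buf, 0)
    if length > MAX_FIELD_SIZE:
        raise ValueError(f"corrupt record: implausible length {length}")
    if 4 + length > len(buf):
        raise ValueError("truncated record: payload shorter than declared")
    return buf[4:4 + length]
```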
When calling foreign APIs, especially system functions or APIs dealing with input/output, do check for errors and propagate them explicitly if the language does not propagate them automatically (as in C++ code using status-style return values).
Performance¶

Think about performance, but do not obsess over it. Algorithmic complexity is important if input size may be “large” (the meaning of large depends on the context: use your own expertise to decide!). Micro-optimizations improving performance by 20% or more on performance-sensitive functionality may be useful as well; lesser micro-optimizations may not be worth the time spent on them, especially if they lead to more complicated code.
If performance is important, measure it. Do not satisfy yourself with guesses and intuitions (which may be founded on incorrect assumptions about the compiler or the hardware).
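A minimal example of letting measurements rather than intuition decide, using the standard-library `timeit` module (the two functions being compared are arbitrary illustrations):

```python
import timeit

# Sketch: measure before micro-optimizing; the numbers, not guesses about
# the interpreter, decide whether a rewrite is worthwhile.

def build_loop(n):
    out = []
    for i in range(n):
        out.append(i * i)
    return out

def build_comprehension(n):
    return [i * i for i in range(n)]

t_loop = timeit.timeit(lambda: build_loop(10_000), number=100)
t_comp = timeit.timeit(lambda: build_comprehension(10_000), number=100)
print(f"loop: {t_loop:.4f}s  comprehension: {t_comp:.4f}s")
```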
Avoid trying to trick the compiler/interpreter/runtime by writing the code in a certain way, unless it is really important. Such tricks are generally brittle and dependent on platform details that may become obsolete, and they can make code harder to maintain (a common question that can block contributors is “what should I do about this weird hack that my changes would like to remove?”).
Avoiding rough edges or degenerate behaviour (such as memory blowups when a size estimate is inaccurately large) may be more important than trying to improve the common case by a small amount.
Documentation¶

These guidelines apply to both prose documentation and in-code docstrings.
Look for ambiguous or poorly informative wording. For example, “it is an error if …” is less informative than either “An error is raised if …” or “Behaviour is undefined if …” (the first phrasing doesn’t tell the reader what actually happens on such an error).
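In a docstring, the precise phrasing looks like this (the `checked_sqrt` function is a hypothetical example): the reader learns exactly which exception occurs, not merely that negative input “is an error”.

```python
# Sketch: state what actually happens on invalid input, rather than the
# vague "it is an error if n is negative".

def checked_sqrt(n: float) -> float:
    """Return the square root of ``n``.

    Raises
    ------
    ValueError
        If ``n`` is negative.
    """
    if n < 0:
        raise ValueError(f"checked_sqrt requires n >= 0, got {n}")
    return n ** 0.5
```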
When reviewing documentation changes (or prose snippets, in general), be mindful about spelling, grammar, expression, and concision. Clear communication is essential for effective collaboration with people from a wide range of backgrounds, and contributes to better documentation.
Some contributors do not have English as a native language (and perhaps neither do you). It is advised to help them and/or ask for external help if needed.
Cross-linking increases the global value of documentation. Sphinx especially has great cross-linking capabilities (including topic references, glossary terms, API references), so be sure to make use of them!
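An illustrative reStructuredText snippet using Sphinx cross-referencing roles (the target names below are hypothetical examples of the pattern, not guaranteed labels in the Arrow docs):

```rst
A :class:`pyarrow.Table` can be written out with :func:`pyarrow.csv.write_csv`;
see :ref:`csv-reading` for the narrative guide, and :term:`chunked array`
in the glossary for the related concept.
```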
Testing¶

When adding an API, all nominal cases should have test cases. Does a function allow null values? Then null values should be tested (alongside non-null values, of course). Does a function accept different input types? Then each supported type should be exercised, and so on.
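A small sketch of what “test the nominal cases” means for a null-accepting function (`total` is a hypothetical example API, with `None` standing in for nulls):

```python
# Sketch: a function that accepts nulls should be tested with and without
# them, including the all-null and empty cases.

def total(values):
    """Sum numeric values, skipping nulls; all-null or empty input yields None."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None

def test_total_no_nulls():
    assert total([1, 2, 3]) == 6

def test_total_with_nulls():
    assert total([1, None, 3]) == 4

def test_total_all_nulls():
    assert total([None, None]) is None
```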
If some aspect of a functionality is delicate (either by definition or as an implementation detail), it should be tested.
Corner cases should be exercised, especially in low-level implementation languages such as C++. Examples: empty arrays, zero-chunk arrays, arrays with only nulls, etc.
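The same habit applies in any language: corner cases are where off-by-one and empty-input bugs hide. A sketch on a hypothetical run-length helper:

```python
# Sketch: exercising the corner cases (empty input, a single element,
# all-identical values) alongside the nominal case.

def run_lengths(values):
    """Return (value, count) pairs for consecutive runs in ``values``."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

assert run_lengths([]) == []                       # empty input
assert run_lengths([7]) == [(7, 1)]                # single element
assert run_lengths([0, 0, 0]) == [(0, 3)]          # all-identical values
assert run_lengths([1, 1, 2]) == [(1, 2), (2, 1)]  # nominal mixed runs
```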
Stress tests can be useful, for example to uncover synchronization bugs when non-trivial parallelization is being added, or to validate an algorithm against a slow but straightforward reference implementation.
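The reference-implementation approach can be sketched as follows (both `prefix_sums_*` functions are hypothetical illustrations; a fixed random seed keeps the test deterministic):

```python
import random

# Sketch: validate an optimized routine against a slow but obviously
# correct reference on many randomized inputs.

def prefix_sums_reference(values):
    """O(n^2), but trivially correct: recompute each prefix from scratch."""
    return [sum(values[:i + 1]) for i in range(len(values))]

def prefix_sums_fast(values):
    """O(n) implementation under test."""
    out, acc = [], 0
    for v in values:
        acc += v
        out.append(acc)
    return out

rng = random.Random(42)  # fixed seed: the stress test stays reproducible
for _ in range(100):
    data = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
    assert prefix_sums_fast(data) == prefix_sums_reference(data)
```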
A mitigating concern, however, is the overall cost of running the test suite. Continuous Integration (CI) runtimes can be painfully long and we should be wary of increasing them too much. Sometimes it is worthwhile to fine-tune testing parameters to balance the usefulness of tests against the cost of running them (especially where stress tests are involved, since they tend to imply execution over large datasets).