Environment Variables#
The following environment variables can be used to affect the behavior of Arrow C++ at runtime. Many of these variables are inspected only once per process (for example, when the Arrow C++ DLL is loaded), so you cannot assume that changing their value later will have an effect.
- ACERO_ALIGNMENT_HANDLING#
Arrow C++’s Acero module performs computation on streams of data. This computation may involve a form of “type punning” that is technically undefined behavior if the underlying array is not properly aligned. On most modern CPUs this is not an issue, but some older CPUs may crash or suffer poor performance. For this reason it is recommended that all incoming array buffers are properly aligned, but some data sources such as Flight may produce unaligned buffers.
The value of this environment variable controls what will happen when Acero detects an unaligned buffer:
warn
: a warning is emittedignore
: nothing, alignment checking is disabledreallocate
: the buffer is reallocated to a properly aligned addresserror
: the operation fails with an error
The default behavior is
warn
. On modern hardware it is usually safe to change this toignore
. Changing toreallocate
is the safest option but this will have a significant performance impact as the buffer will need to be copied.
- ARROW_DEBUG_MEMORY_POOL#
Enable rudimentary memory checks to guard against buffer overflows. The value of this environment variable selects the behavior when a buffer overflow is detected:
abort
exits the processus with a non-zero return value;trap
issues a platform-specific debugger breakpoint / trap instruction;warn
prints a warning on stderr and continues execution;none
disables memory checks;
If this variable is not set, or has an empty value, it has the same effect as the value
none
- memory checks are disabled.Note
While this functionality can be useful and has little overhead, it is not a replacement for more sophisticated memory checking utilities such as Valgrind or Address Sanitizer.
- ARROW_DEFAULT_MEMORY_POOL#
Override the backend to be used for the default memory pool. Possible values are among
jemalloc
,mimalloc
andsystem
, depending on which backends were enabled when building Arrow C++.
- ARROW_IO_THREADS#
Override the default number of threads for the global IO thread pool. The value of this environment variable should be a positive integer.
- ARROW_LIBHDFS_DIR#
The directory containing the C HDFS library (
hdfs.dll
on Windows,libhdfs.dylib
on macOS,libhdfs.so
on other platforms). Alternatively, one can setHADOOP_HOME
.
- ARROW_S3_LOG_LEVEL#
Controls the verbosity of logging produced by S3 calls. Defaults to
FATAL
which only produces output in the case of fatal errors.DEBUG
is recommended when you’re trying to troubleshoot issues.Possible values include:
FATAL
(the default)ERROR
WARN
INFO
DEBUG
TRACE
OFF
See also
- ARROW_TRACING_BACKEND#
The backend where to export OpenTelemetry-based execution traces. Possible values are:
ostream
: emit textual log messages to stdout;otlp_http
: emit OTLP JSON encoded traces to a HTTP server (by default, the endpoint URL is “http://localhost:4318/v1/traces”);arrow_otlp_stdout
: emit JSON traces to stdout;arrow_otlp_stderr
: emit JSON traces to stderr.
If this variable is not set, no traces are exported.
This environment variable has no effect if Arrow C++ was not built with tracing enabled.
- ARROW_USER_SIMD_LEVEL#
The maximum SIMD optimization level selectable at runtime. Useful for comparing the performance impact of enabling or disabling respective code paths or working around situations where instructions are supported but are not performant or cause other issues.
By default, Arrow C++ detects the capabilities of the current CPU at runtime and chooses the best execution paths based on that information. This behavior can be overriden by setting this environment variable to a well-defined value. Supported values are:
NONE
disables any runtime-selected SIMD optimization;SSE4_2
enables any SSE2-based optimizations until SSE4.2 (included);AVX
enables any AVX-based optimizations and earlier;AVX2
enables any AVX2-based optimizations and earlier;AVX512
enables any AVX512-based optimizations and earlier.
This environment variable only has an effect on x86 platforms. Other platforms currently do not implement any form of runtime dispatch.
Note
In addition to runtime-selected SIMD optimizations dispatch, Arrow C++ can also be compiled with SIMD optimizations that cannot be disabled at runtime. For example, by default, SSE4.2 optimizations are enabled on x86 builds: therefore, with this default setting, Arrow C++ does not work at all on a CPU without support for SSE4.2. This setting can be changed using the
ARROW_SIMD_LEVEL
CMake variable so as to either raise or lower the optimization level.Finally, the
ARROW_RUNTIME_SIMD_LEVEL
CMake variable sets a compile-time upper bound to runtime-selected SIMD optimizations. This is useful in cases where a compiler reports support for an instruction set but does not actually support it in full.
- AWS_ENDPOINT_URL#
Endpoint URL used for S3-like storage, for example Minio or s3.scality. Alternatively, one can set
AWS_ENDPOINT_URL_S3
.
- AWS_ENDPOINT_URL_S3#
Endpoint URL used for S3-like storage, for example Minio or s3.scality. This takes precedence over
AWS_ENDPOINT_URL
if both variables are set.
- GANDIVA_CACHE_SIZE#
The number of entries to keep in the Gandiva JIT compilation cache. The cache is in-memory and does not persist across processes.
The default cache size is 5000. The value of this environment variable should be a positive integer and should not exceed the maximum value of int32. Otherwise the default value is used.
- HADOOP_HOME#
The path to the Hadoop installation.
- JAVA_HOME#
Set the path to the Java Runtime Environment installation. This may be required for HDFS support if Java is installed in a non-standard location.
- OMP_NUM_THREADS#
The number of worker threads in the global (process-wide) CPU thread pool. If this environment variable is not defined, the available hardware concurrency is determined using a platform-specific routine.
- OMP_THREAD_LIMIT#
An upper bound for the number of worker threads in the global (process-wide) CPU thread pool.
For example, if the current machine has 4 hardware threads and
OMP_THREAD_LIMIT
is 8, the global CPU thread pool will have 4 worker threads. But ifOMP_THREAD_LIMIT
is 2, the global CPU thread pool will have 2 worker threads.