pyarrow.fs.HadoopFileSystem
- class pyarrow.fs.HadoopFileSystem(unicode host, int port=8020, unicode user=None, *, int replication=3, int buffer_size=0, default_block_size=None, kerb_ticket=None, extra_conf=None)
Bases: FileSystem
HDFS-backed FileSystem implementation.
- Parameters:
- host : str
HDFS host to connect to. Set to “default” for fs.defaultFS from core-site.xml.
- port : int, default 8020
HDFS port to connect to. Set to 0 for default or logical (HA) nodes.
- user : str, default None
Username when connecting to HDFS; None implies login user.
- replication : int, default 3
Number of copies each block will have.
- buffer_size : int, default 0
If 0, no buffering will happen. Otherwise, the size of the temporary read and write buffer.
- default_block_size : int, default None
None means the default configuration for HDFS; a typical block size is 128 MB.
- kerb_ticket : str or path, default None
If not None, the path to the Kerberos ticket cache.
- extra_conf : dict, default None
Extra key/value pairs for configuration; will override any hdfs-site.xml properties.
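As a rough illustration of how these parameters fit together, the sketch below collects them into keyword arguments for a hypothetical HA cluster. The nameservice name mycluster, the user analyst, the dfs.replication override, and the helper names hdfs_connection_kwargs and connect are assumptions made for this example and are not part of pyarrow. The connect helper is defined but not called here, since it needs libhdfs and a reachable cluster.

```python
def hdfs_connection_kwargs(host="default", port=0, user=None,
                           kerb_ticket=None, extra_conf=None):
    """Collect HadoopFileSystem constructor arguments in one dict (hypothetical helper)."""
    kwargs = {"host": host, "port": port}
    if user is not None:
        kwargs["user"] = user
    if kerb_ticket is not None:
        kwargs["kerb_ticket"] = kerb_ticket
    if extra_conf:
        # Copy so later changes to the caller's dict do not leak in.
        kwargs["extra_conf"] = dict(extra_conf)
    return kwargs


def connect(kwargs):
    """Open the connection. Requires libhdfs and a reachable cluster."""
    from pyarrow import fs  # imported lazily so the helper above works without pyarrow
    params = dict(kwargs)
    host = params.pop("host")
    return fs.HadoopFileSystem(host, **params)


# Assumed setup: logical nameservice "mycluster", so port is 0 as described above.
ha_kwargs = hdfs_connection_kwargs(
    host="mycluster",
    port=0,
    user="analyst",
    extra_conf={"dfs.replication": "2"},  # illustrative override only
)
print(ha_kwargs)
```

Keeping the arguments in a plain dict makes it easy to log or test the configuration separately from the actual connection attempt.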
Examples
>>> from pyarrow import fs
>>> hdfs = fs.HadoopFileSystem(host, port, user=user, kerb_ticket=ticket_cache_path)
For usage of the methods, see the examples for LocalFileSystem().
- __init__(*args, **kwargs)
Methods
__init__(*args, **kwargs)
copy_file(self, src, dest): Copy a file.
create_dir(self, path, *, bool recursive=True): Create a directory and subdirectories.
delete_dir(self, path): Delete a directory and its contents, recursively.
delete_dir_contents(self, path, *, ...): Delete a directory's contents, recursively.
delete_file(self, path): Delete a file.
equals(self, FileSystem other)
from_uri(uri): Instantiate HadoopFileSystem object from a URI string.
get_file_info(self, paths_or_selector): Get info for the given files.
move(self, src, dest): Move / rename a file or directory.
normalize_path(self, path): Normalize filesystem path.
open_append_stream(self, path[, ...]): Open an output stream for appending.
open_input_file(self, path): Open an input file for random access reading.
open_input_stream(self, path[, compression, ...]): Open an input stream for sequential reading.
open_output_stream(self, path[, ...]): Open an output stream for sequential writing.
Attributes
type_name: The filesystem's type name.
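- copy_file(self, src, dest)
Copy a file.
If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.
- Parameters:
- src : str
The path of the file to be copied from.
- dest : str
The destination path where the file is copied to.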
Examples
>>> local.copy_file(path,
...                 local_path + '/pyarrow-fs-example_copy.dat')
Inspect the file info:
>>> local.get_file_info(local_path + '/pyarrow-fs-example_copy.dat')
<FileInfo for '/.../pyarrow-fs-example_copy.dat': type=FileType.File, size=4>
>>> local.get_file_info(path)
<FileInfo for '/.../pyarrow-fs-example.dat': type=FileType.File, size=4>
- create_dir(self, path, *, bool recursive=True)
Create a directory and subdirectories.
This function succeeds if the directory already exists.
- delete_dir(self, path)
Delete a directory and its contents, recursively.
- Parameters:
- path : str
The path of the directory to be deleted.
- delete_dir_contents(self, path, *, bool accept_root_dir=False, bool missing_dir_ok=False)
Delete a directory’s contents, recursively.
Like delete_dir, but doesn’t delete the directory itself.
- equals(self, FileSystem other)
- Parameters:
- other : FileSystem
- Returns:
bool
- static from_uri(uri)
Instantiate HadoopFileSystem object from a URI string.
The following two calls are equivalent:
HadoopFileSystem.from_uri('hdfs://localhost:8020/?user=test&replication=1')
HadoopFileSystem('localhost', port=8020, user='test', replication=1)
- Parameters:
- uri : str
A string URI describing the connection to HDFS. To change the user, replication, buffer_size or default_block_size, pass the values as query parts.
- Returns:
HadoopFileSystem
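The correspondence between the URI and the constructor arguments can be sketched with the standard library alone. This is only an illustration of the mapping, not how pyarrow parses the URI internally; the helper name hdfs_uri_to_kwargs and the INT_OPTIONS set are assumptions made for the example, and only the hdfs scheme is handled.

```python
from urllib.parse import parse_qs, urlsplit

# Query parts that correspond to integer constructor arguments.
INT_OPTIONS = {"replication", "buffer_size", "default_block_size"}


def hdfs_uri_to_kwargs(uri):
    """Map an hdfs:// URI to constructor-style keyword arguments (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got scheme {parts.scheme!r}")

    kwargs = {"host": parts.hostname}
    if parts.port is not None:
        kwargs["port"] = parts.port

    # Each query part becomes one keyword argument; the last value wins.
    for key, values in parse_qs(parts.query).items():
        value = values[-1]
        kwargs[key] = int(value) if key in INT_OPTIONS else value
    return kwargs


uri_kwargs = hdfs_uri_to_kwargs("hdfs://localhost:8020/?user=test&replication=1")
print(uri_kwargs)
```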
- get_file_info(self, paths_or_selector)
Get info for the given files.
Any symlink is automatically dereferenced, recursively. A non-existing or unreachable file returns a FileInfo object with a FileType of value NotFound. An exception indicates a truly exceptional condition (low-level I/O error, etc.).
- Parameters:
- paths_or_selector : FileSelector, path-like or list of path-likes
Either a selector object, a path-like object or a list of path-like objects. The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, use allow_not_found.
- Returns:
FileInfo or list of FileInfo
Examples
>>> local
<pyarrow._fs.LocalFileSystem object at ...>
>>> local.get_file_info("/{}/pyarrow-fs-example.dat".format(local_path))
<FileInfo for '/.../pyarrow-fs-example.dat': type=FileType.File, size=4>
- move(self, src, dest)
Move / rename a file or directory.
If the destination exists:
- if it is a non-empty directory, an error is returned
- otherwise, if it has the same type as the source, it is replaced
- otherwise, behavior is unspecified (implementation-dependent).
- Parameters:
- src : str
The path of the file or the directory to be moved.
- dest : str
The destination path where the file or directory is moved to.
Examples
Create a new folder with a file:
>>> local.create_dir('/tmp/other_dir')
>>> local.copy_file(path,'/tmp/move_example.dat')
Move the file:
>>> local.move('/tmp/move_example.dat',
...            '/tmp/other_dir/move_example_2.dat')
Inspect the file info:
>>> local.get_file_info('/tmp/other_dir/move_example_2.dat')
<FileInfo for '/tmp/other_dir/move_example_2.dat': type=FileType.File, size=4>
>>> local.get_file_info('/tmp/move_example.dat')
<FileInfo for '/tmp/move_example.dat': type=FileType.NotFound>
Delete the folder:
>>> local.delete_dir('/tmp/other_dir')
- normalize_path(self, path)
Normalize filesystem path.
- open_append_stream(self, path, compression='detect', buffer_size=None, metadata=None)
Open an output stream for appending.
If the target doesn’t exist, a new empty file is created.
Note
Some filesystem implementations do not support efficient appending to an existing file, in which case this method will raise NotImplementedError. Consider writing to multiple files (using e.g. the dataset layer) instead.
- Parameters:
- path : str
The source to open for writing.
- compression : str, optional, default 'detect'
The compression algorithm to use for on-the-fly compression. If “detect” and source is a file path, then compression will be chosen based on the file extension. If None, no compression will be applied. Otherwise, a well-known algorithm name must be supplied (e.g. “gzip”).
- buffer_size : int, optional, default None
If None or 0, no buffering will happen. Otherwise the size of the temporary write buffer.
- metadata : dict, optional, default None
If not None, a mapping of string keys to string values. Some filesystems support storing metadata along the file (such as “Content-Type”). Unsupported metadata keys will be ignored.
- Returns:
- stream : NativeFile
Examples
Append new data to a non-empty file:
>>> with local.open_append_stream(path) as f:
...     f.write(b'+newly added')
12
Print out the content of the file:
>>> with local.open_input_file(path) as f:
...     print(f.readall())
b'data+newly added'
- open_input_file(self, path)
Open an input file for random access reading.
- Parameters:
- path : str
The source to open for reading.
- Returns:
- stream : NativeFile
Examples
Print the data from the file with open_input_file():
>>> with local.open_input_file(path) as f:
...     print(f.readall())
b'data'
- open_input_stream(self, path, compression='detect', buffer_size=None)
Open an input stream for sequential reading.
- Parameters:
- path : str
The source to open for reading.
- compression : str, optional, default 'detect'
The compression algorithm to use for on-the-fly decompression. If “detect” and source is a file path, then compression will be chosen based on the file extension. If None, no compression will be applied. Otherwise, a well-known algorithm name must be supplied (e.g. “gzip”).
- buffer_size : int, optional, default None
If None or 0, no buffering will happen. Otherwise the size of the temporary read buffer.
- Returns:
- stream : NativeFile
Examples
Print the data from the file with open_input_stream():
>>> with local.open_input_stream(path) as f:
...     print(f.readall())
b'data'
- open_output_stream(self, path, compression='detect', buffer_size=None, metadata=None)
Open an output stream for sequential writing.
If the target already exists, existing data is truncated.
- Parameters:
- path : str
The source to open for writing.
- compression : str, optional, default 'detect'
The compression algorithm to use for on-the-fly compression. If “detect” and source is a file path, then compression will be chosen based on the file extension. If None, no compression will be applied. Otherwise, a well-known algorithm name must be supplied (e.g. “gzip”).
- buffer_size : int, optional, default None
If None or 0, no buffering will happen. Otherwise the size of the temporary write buffer.
- metadata : dict, optional, default None
If not None, a mapping of string keys to string values. Some filesystems support storing metadata along the file (such as “Content-Type”). Unsupported metadata keys will be ignored.
- Returns:
- stream : NativeFile
Examples
>>> local = fs.LocalFileSystem()
>>> with local.open_output_stream(path) as stream:
...     stream.write(b'data')
4
- type_name
The filesystem’s type name.