pyarrow.csv.ConvertOptions
- class pyarrow.csv.ConvertOptions(check_utf8=None, *, column_types=None, null_values=None, true_values=None, false_values=None, decimal_point=None, strings_can_be_null=None, quoted_strings_can_be_null=None, include_columns=None, include_missing_columns=None, auto_dict_encode=None, auto_dict_max_cardinality=None, timestamp_parsers=None)
Bases: pyarrow.lib._Weakrefable
Options for converting CSV data.
- Parameters
- check_utf8 : bool, optional (default True)
  Whether to check UTF8 validity of string columns.
- column_types : pyarrow.Schema or dict, optional
  Explicitly map column names to column types. Passing this argument disables type inference on the defined columns.
- null_values : list, optional
  A sequence of strings that denote nulls in the data (defaults are appropriate in most cases). Note that by default, string columns are not checked for null values. To enable null checking for those, specify strings_can_be_null=True.
- true_values : list, optional
  A sequence of strings that denote true booleans in the data (defaults are appropriate in most cases).
- false_values : list, optional
  A sequence of strings that denote false booleans in the data (defaults are appropriate in most cases).
- decimal_point : 1-character str, optional (default '.')
  The character used as decimal point in floating-point and decimal data.
- timestamp_parsers : list, optional
  A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value pyarrow.csv.ISO8601 can also be given). By default, a fast built-in ISO-8601 parser is used.
- strings_can_be_null : bool, optional (default False)
  Whether string / binary columns can have null values. If true, then strings in null_values are considered null for string columns. If false, then all strings are valid string values.
- quoted_strings_can_be_null : bool, optional (default True)
  Whether quoted values can be null. If true, then strings in null_values are also considered null when they appear quoted in the CSV file. Otherwise, quoted values are never considered null.
- auto_dict_encode : bool, optional (default False)
  Whether to try to automatically dict-encode string / binary data. If true, then when type inference detects a string or binary column, it is dict-encoded up to auto_dict_max_cardinality distinct values (per chunk), after which it switches to regular encoding. This setting is ignored for non-inferred columns (those in column_types).
- auto_dict_max_cardinality : int, optional
  The maximum dictionary cardinality for auto_dict_encode. This value is per chunk.
- include_columns : list, optional
  The names of columns to include in the Table. If empty, the Table will include all columns from the CSV file. If not empty, only these columns will be included, in this order.
- include_missing_columns : bool, optional (default False)
  If false, a column listed in include_columns but absent from the CSV file raises an error. If true, such a column is produced as a column of nulls (whose type is selected using column_types, or null by default). This option is ignored if include_columns is empty.
Examples
Define example data:
>>> import io
>>> s = "animals,n_legs,entry,fast\nFlamingo,2,01/03/2022,Yes\nHorse,4,02/03/2022,Yes\nBrittle stars,5,03/03/2022,No\nCentipede,100,04/03/2022,No\n,6,05/03/2022,"
>>> print(s)
animals,n_legs,entry,fast
Flamingo,2,01/03/2022,Yes
Horse,4,02/03/2022,Yes
Brittle stars,5,03/03/2022,No
Centipede,100,04/03/2022,No
,6,05/03/2022,
Change the type of a column:
>>> import pyarrow as pa
>>> from pyarrow import csv
>>> convert_options = csv.ConvertOptions(column_types={"n_legs": pa.float64()})
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: double
entry: string
fast: string
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
n_legs: [[2,4,5,100,6]]
entry: [["01/03/2022","02/03/2022","03/03/2022","04/03/2022","05/03/2022"]]
fast: [["Yes","Yes","No","No",""]]
Define a date parsing format to get a timestamp type column (in case dates are not in ISO format and not converted by default):
>>> convert_options = csv.ConvertOptions(
...     timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: int64
entry: timestamp[s]
fast: string
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
n_legs: [[2,4,5,100,6]]
entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,2022-04-03 00:00:00,2022-05-03 00:00:00]]
fast: [["Yes","Yes","No","No",""]]
Specify a subset of columns to be read:
>>> convert_options = csv.ConvertOptions(
...     include_columns=["animals", "n_legs"])
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: int64
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
n_legs: [[2,4,5,100,6]]
List an additional column, absent from the CSV file, to be included as a null-typed column:
>>> convert_options = csv.ConvertOptions(
...     include_columns=["animals", "n_legs", "location"],
...     include_missing_columns=True)
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: int64
location: null
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
n_legs: [[2,4,5,100,6]]
location: [5 nulls]
Define columns as dictionary type (by default only the string/binary columns are dictionary encoded):
>>> convert_options = csv.ConvertOptions(
...     timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"],
...     auto_dict_encode=True)
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: dictionary<values=string, indices=int32, ordered=0>
n_legs: int64
entry: timestamp[s]
fast: dictionary<values=string, indices=int32, ordered=0>
----
animals: [  -- dictionary:
["Flamingo","Horse","Brittle stars","Centipede",""]  -- indices:
[0,1,2,3,4]]
n_legs: [[2,4,5,100,6]]
entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,2022-04-03 00:00:00,2022-05-03 00:00:00]]
fast: [  -- dictionary:
["Yes","No",""]  -- indices:
[0,0,1,1,2]]
Set an upper limit for the number of categories. If the number of categories exceeds the limit, the conversion to dictionary will not happen:
>>> convert_options = csv.ConvertOptions(
...     include_columns=["animals"],
...     auto_dict_encode=True,
...     auto_dict_max_cardinality=2)
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
Set empty strings to missing values:
>>> convert_options = csv.ConvertOptions(include_columns=["animals", "n_legs"],
...                                      strings_can_be_null=True)
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: int64
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",null]]
n_legs: [[2,4,5,100,6]]
Define values to be True and False when converting a column into a bool type:
>>> convert_options = csv.ConvertOptions(
...     include_columns=["fast"],
...     false_values=["No"],
...     true_values=["Yes"])
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
fast: bool
----
fast: [[true,true,false,false,null]]
- __init__(*args, **kwargs)
Methods
- __init__(*args, **kwargs)
- equals(self, ConvertOptions other)
- validate(self)
Attributes
- auto_dict_encode: Whether to try to automatically dict-encode string / binary data.
- auto_dict_max_cardinality: The maximum dictionary cardinality for auto_dict_encode.
- check_utf8: Whether to check UTF8 validity of string columns.
- column_types: Explicitly map column names to column types.
- decimal_point: The character used as decimal point in floating-point and decimal data.
- false_values: A sequence of strings that denote false booleans in the data.
- include_columns: The names of columns to include in the Table.
- include_missing_columns: If false, columns in include_columns but not in the CSV file will error out.
- null_values: A sequence of strings that denote nulls in the data.
- quoted_strings_can_be_null: Whether quoted values can be null.
- strings_can_be_null: Whether string / binary columns can have null values.
- timestamp_parsers: A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value pyarrow.csv.ISO8601 can also be given).
- true_values: A sequence of strings that denote true booleans in the data.
- auto_dict_encode
Whether to try to automatically dict-encode string / binary data.
- auto_dict_max_cardinality
The maximum dictionary cardinality for auto_dict_encode.
This value is per chunk.
- check_utf8
Whether to check UTF8 validity of string columns.
- column_types
Explicitly map column names to column types.
- decimal_point
The character used as decimal point in floating-point and decimal data.
- equals(self, ConvertOptions other)
- false_values
A sequence of strings that denote false booleans in the data.
- include_columns
The names of columns to include in the Table.
If empty, the Table will include all columns from the CSV file. If not empty, only these columns will be included, in this order.
- include_missing_columns
If false, columns in include_columns but not in the CSV file will error out. If true, columns in include_columns but not in the CSV file will produce a null column (whose type is selected using column_types, or null by default). This option is ignored if include_columns is empty.
- null_values
A sequence of strings that denote nulls in the data.
- quoted_strings_can_be_null
Whether quoted values can be null.
- strings_can_be_null
Whether string / binary columns can have null values.
- timestamp_parsers
A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value pyarrow.csv.ISO8601 can also be given). By default, a fast built-in ISO-8601 parser is used.
- true_values
A sequence of strings that denote true booleans in the data.
- validate(self)