Package org.apache.arrow.vector.util
Class Text
java.lang.Object
org.apache.arrow.vector.util.ReusableByteArray
org.apache.arrow.vector.util.Text
- All Implemented Interfaces:
ReusableBuffer<byte[]>
A simplified byte wrapper similar to Hadoop's Text class, without all the dependencies. Lifted from Hadoop 2.7.1.
-
Nested Class Summary
-
Field Summary
Fields inherited from class org.apache.arrow.vector.util.ReusableByteArray
bytes, EMPTY_BYTES, length
-
Constructor Summary
-
Method Summary
Modifier and Type  Method  Description
void  append(byte[] utf8, int start, int len)  Append a range of bytes to the end of the given text.
static int  bytesToCodePoint(ByteBuffer bytes)  Returns the next code point at the current position in the buffer.
int  charAt(int position)  Returns the Unicode Scalar Value (32-bit integer value) for the character at position.
void  clear()  Clear the string to empty.
byte[]  copyBytes()  Get a copy of the bytes that is exactly the length of the data.
static String  decode(byte[] utf8)  Converts the provided byte array to a String using the UTF-8 encoding.
static String  decode(byte[] utf8, int start, int length)
static String  decode(byte[] utf8, int start, int length, boolean replace)  Converts the provided byte array to a String using the UTF-8 encoding.
static ByteBuffer  encode(String string)  Converts the provided String to bytes using the UTF-8 encoding.
static ByteBuffer  encode(String string, boolean replace)  Converts the provided String to bytes using the UTF-8 encoding.
boolean  equals(Object o)
int  find(String what)
int  find(String what, int start)  Finds any occurrence of what in the backing buffer, starting at position start.
byte[]  getBytes()  Returns the raw bytes; however, only data up to ReusableByteArray.getLength() is valid.
void  readWithKnownLength(DataInput in, int len)  Read a Text object whose length is already known.
void  set(byte[] utf8)  Set to a UTF-8 byte array.
void  set(byte[] utf8, int start, int len)  Set the Text to a range of bytes.
void  set(String string)  Set to contain the contents of a string.
void  set(Text other)  Copy a text.
String  toString()
static int  utf8Length(String string)  For the given string, returns the number of UTF-8 bytes required to encode the string.
static void  validateUTF8(byte[] utf8)  Check if a byte array contains valid UTF-8.
static void  validateUTF8(byte[] utf8, int start, int len)  Check to see if a byte array is valid UTF-8.
static boolean  validateUTF8NoThrow(byte[] utf8)  Check if a byte array contains valid UTF-8.
Methods inherited from class org.apache.arrow.vector.util.ReusableByteArray
getBuffer, getLength, hashCode, set, set, setCapacity
-
Field Details
-
DEFAULT_MAX_LEN
public static final int DEFAULT_MAX_LEN
-
-
Constructor Details
-
Text
public Text()
-
Text
public Text(String string)
Construct from a string.
- Parameters:
string - initialize from that string
-
Text
public Text(Text utf8)
Construct from another text.
- Parameters:
utf8 - initialize from that Text
-
Text
public Text(byte[] utf8)
Construct from a byte array.
- Parameters:
utf8 - initialize from that byte array
-
-
Method Details
-
copyBytes
public byte[] copyBytes()
Get a copy of the bytes that is exactly the length of the data. See getBytes() for faster access to the underlying array.
- Returns:
a copy of the underlying array
-
getBytes
public byte[] getBytes()
Returns the raw bytes; however, only data up to ReusableByteArray.getLength() is valid. Use copyBytes() if you need the returned array to be precisely the length of the data.
- Returns:
the underlying array
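The difference between getBytes() and copyBytes() can be illustrated with a minimal, JDK-only sketch. The class ReusableBytesSketch below is hypothetical and is not Arrow's implementation; it assumes the backing array is reused and only grown when a new value does not fit, which is why the raw array can be longer than the valid data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a reusable byte holder; not Arrow's implementation. */
public class ReusableBytesSketch {
    private byte[] bytes = new byte[0];
    private int length = 0;

    /** Copies src into the backing array, growing it only when it is too small. */
    public void set(byte[] src) {
        if (bytes.length < src.length) {
            bytes = new byte[src.length];
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;
    }

    /** Raw backing array; only the first getLength() bytes are valid. */
    public byte[] getBytes() {
        return bytes;
    }

    public int getLength() {
        return length;
    }

    /** Fresh array holding exactly the valid bytes. */
    public byte[] copyBytes() {
        return Arrays.copyOf(bytes, length);
    }

    public static void main(String[] args) {
        ReusableBytesSketch t = new ReusableBytesSketch();
        t.set("hello world".getBytes(StandardCharsets.UTF_8)); // 11 bytes
        t.set("hi".getBytes(StandardCharsets.UTF_8));          // 2 bytes, array reused
        System.out.println(t.getBytes().length);  // 11
        System.out.println(t.getLength());        // 2
        System.out.println(t.copyBytes().length); // 2
    }
}
```

Under that assumption, callers reading the raw array should always bound their reads by getLength().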
-
charAt
public int charAt(int position)
Returns the Unicode Scalar Value (32-bit integer value) for the character at position. Note that this method avoids using the converter or doing String instantiation.
- Parameters:
position - the index of the char we want to retrieve
- Returns:
the Unicode scalar value at position, or -1 if the position is invalid or points to a trailing byte
-
find
public int find(String what)
-
find
public int find(String what, int start)
Finds any occurrence of what in the backing buffer, starting at position start. The starting position is measured in bytes, and the return value is a byte position in the buffer. The backing buffer is not converted to a string for this operation.
- Parameters:
what - the string to search for
start - where to start from
- Returns:
byte position of the first occurrence of the search string in the UTF-8 buffer, or -1 if not found
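Because positions are byte offsets rather than char indices, results can differ from String.indexOf when the text contains multi-byte characters. The following JDK-only sketch shows the idea with a naive byte search; ByteFindSketch and its find helper are hypothetical names, and the search strategy is an assumption, not Arrow's implementation.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical byte-offset search over a UTF-8 buffer; not Arrow's implementation. */
public class ByteFindSketch {

    /**
     * Returns the byte offset of the first occurrence of what at or after start,
     * looking only at the first length bytes of haystack, or -1 if not found.
     */
    public static int find(byte[] haystack, int length, String what, int start) {
        if (start < 0) {
            return -1;
        }
        byte[] needle = what.getBytes(StandardCharsets.UTF_8);
        for (int i = start; i + needle.length <= length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "na\u00efve caf\u00e9"; // U+00EF takes two bytes in UTF-8
        byte[] data = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(s.indexOf("caf\u00e9"));                  // 6 (char index)
        System.out.println(find(data, data.length, "caf\u00e9", 0)); // 7 (byte offset)
    }
}
```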
-
set
public void set(String string)
Set to contain the contents of a string.
- Parameters:
string - the string to initialize from
-
set
public void set(byte[] utf8)
Set to a UTF-8 byte array.
- Parameters:
utf8 - the byte array to initialize from
-
set
public void set(Text other)
Copy a text.
- Parameters:
other - the text to initialize from
-
set
public void set(byte[] utf8, int start, int len)
Set the Text to a range of bytes.
- Parameters:
utf8 - the data to copy from
start - the first position of the new string
len - the number of bytes of the new string
-
append
public void append(byte[] utf8, int start, int len)
Append a range of bytes to the end of the given text.
- Parameters:
utf8 - the data to copy from
start - the first position to append from utf8
len - the number of bytes to append
-
clear
public void clear()
Clear the string to empty.
Note: For performance reasons, this call does not clear the underlying byte array that is retrievable via getBytes(). To free the byte-array memory, call set(byte[]) with an empty byte array (for example, new byte[0]).
-
toString
public String toString()
- Overrides:
toString in class ReusableByteArray
-
readWithKnownLength
public void readWithKnownLength(DataInput in, int len) throws IOException
Read a Text object whose length is already known. This allows creating Text from a stream which uses a different serialization format.
- Parameters:
in - the input to initialize from
len - how many bytes to read from in
- Throws:
IOException - if the input cannot be read
-
equals
public boolean equals(Object o)
- Overrides:
equals in class ReusableByteArray
-
decode
public static String decode(byte[] utf8) throws CharacterCodingException
Converts the provided byte array to a String using the UTF-8 encoding. Malformed input is replaced by a default value.
- Parameters:
utf8 - bytes to decode
- Returns:
the decoded string
- Throws:
CharacterCodingException - if this is not valid UTF-8
-
decode
public static String decode(byte[] utf8, int start, int length) throws CharacterCodingException
- Throws:
CharacterCodingException
-
decode
public static String decode(byte[] utf8, int start, int length, boolean replace) throws CharacterCodingException
Converts the provided byte array to a String using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.
- Parameters:
utf8 - the bytes to decode
start - where to start from
length - length of the bytes to decode
replace - whether to replace malformed characters with U+FFFD
- Returns:
the decoded string
- Throws:
CharacterCodingException - if the input could not be decoded
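The replace flag described above corresponds to a choice that the JDK exposes through CodingErrorAction. The sketch below reproduces the documented behavior using only java.nio.charset classes; DecodeSketch is a hypothetical name, and this is an illustration of the contract under that assumption, not necessarily how Text.decode is implemented.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical JDK-only illustration of replace-or-throw decoding. */
public class DecodeSketch {

    public static String decode(byte[] utf8, int start, int length, boolean replace)
            throws CharacterCodingException {
        // REPLACE substitutes U+FFFD for malformed input; REPORT throws instead.
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(utf8, start, length)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF never appears in valid UTF-8
        System.out.println(decode(bad, 0, bad.length, true).equals("a\uFFFDb")); // true
        try {
            decode(bad, 0, bad.length, false);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```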
-
encode
public static ByteBuffer encode(String string) throws CharacterCodingException
Converts the provided String to bytes using the UTF-8 encoding. If the input is malformed, invalid chars are replaced by a default value.
- Parameters:
string - the string to encode
- Returns:
ByteBuffer: the bytes are stored in ByteBuffer.array() and the length is ByteBuffer.limit()
- Throws:
CharacterCodingException - if the string could not be encoded
-
encode
public static ByteBuffer encode(String string, boolean replace) throws CharacterCodingException
Converts the provided String to bytes using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.
- Parameters:
string - the string to encode
replace - whether to replace malformed characters with U+FFFD
- Returns:
ByteBuffer: the bytes are stored in ByteBuffer.array() and the length is ByteBuffer.limit()
- Throws:
CharacterCodingException - if the string could not be encoded
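The return contract (bytes in array(), valid length in limit()) matters because the backing array may be larger than the encoded data. The sketch below uses the JDK's CharsetEncoder to show how a caller might extract exactly the encoded bytes; EncodeSketch and toExactBytes are hypothetical names, and the encoder setup is an assumption rather than Arrow's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical JDK-only illustration of encoding to a ByteBuffer. */
public class EncodeSketch {

    public static ByteBuffer encode(String string, boolean replace)
            throws CharacterCodingException {
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return encoder.encode(CharBuffer.wrap(string));
    }

    /** Copies only the valid region; array() alone may include unused capacity. */
    public static byte[] toExactBytes(ByteBuffer buf) {
        int from = buf.arrayOffset() + buf.position();
        int to = buf.arrayOffset() + buf.limit();
        return Arrays.copyOfRange(buf.array(), from, to);
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer buf = encode("h\u00e9llo", false);
        System.out.println(buf.limit());               // 6: U+00E9 needs two bytes
        System.out.println(toExactBytes(buf).length);  // 6
        System.out.println(buf.array().length >= buf.limit()); // true
    }
}
```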
-
validateUTF8NoThrow
public static boolean validateUTF8NoThrow(byte[] utf8)
Check if a byte array contains valid UTF-8.
- Parameters:
utf8 - byte array
- Returns:
true if the input is valid UTF-8, false otherwise
-
validateUTF8
public static void validateUTF8(byte[] utf8) throws MalformedInputException
Check if a byte array contains valid UTF-8.
- Parameters:
utf8 - byte array
- Throws:
MalformedInputException - if the byte array contains invalid UTF-8
-
validateUTF8
public static void validateUTF8(byte[] utf8, int start, int len) throws MalformedInputException
Check to see if a byte array is valid UTF-8.
- Parameters:
utf8 - the array of bytes
start - the offset of the first byte in the array
len - the length of the byte sequence
- Throws:
MalformedInputException - if the byte array contains invalid bytes
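The throwing and non-throwing validation variants above differ only in how a failure is reported. As a rough illustration, the sketch below wraps a strict JDK decoder to obtain a boolean result. Note the substitution: it validates by attempting a decode, which is a different technique from a dedicated byte-level validator and is not claimed to be what Text does. Utf8ValidationSketch and isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical JDK-only validity check built on a strict decoder. */
public class Utf8ValidationSketch {

    /** Returns true if the given range decodes as UTF-8 without errors. */
    public static boolean isValidUtf8(byte[] utf8, int start, int len) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(utf8, start, len));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = {(byte) 0xC3}; // lead byte with no continuation byte
        System.out.println(isValidUtf8(good, 0, good.length));           // true
        System.out.println(isValidUtf8(truncated, 0, truncated.length)); // false
    }
}
```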
-
bytesToCodePoint
public static int bytesToCodePoint(ByteBuffer bytes)
Returns the next code point at the current position in the buffer. The buffer's position will be incremented. Any mark set on this buffer will be changed by this method!
- Parameters:
bytes - the incoming bytes
- Returns:
the corresponding Unicode code point
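Reading one code point at a time advances the buffer by one to four bytes, depending on the lead byte. The sketch below shows that decoding step for well-formed UTF-8 only. CodePointSketch and nextCodePoint are hypothetical names; unlike the method documented above, this sketch does not touch the buffer's mark, and it performs no validation of continuation bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical decoder for the next UTF-8 code point; assumes well-formed input. */
public class CodePointSketch {

    /** Reads one code point and advances the buffer, or returns -1 for a bad lead byte. */
    public static int nextCodePoint(ByteBuffer bytes) {
        int lead = bytes.get() & 0xFF;
        int extra; // number of continuation bytes that follow the lead byte
        int cp;    // code point bits accumulated so far
        if (lead < 0x80) {
            return lead; // single-byte (ASCII) sequence
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1;
            cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;
            cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;
            cp = lead & 0x07;
        } else {
            return -1; // continuation byte or invalid lead byte
        }
        for (int i = 0; i < extra; i++) {
            cp = (cp << 6) | (bytes.get() & 0x3F); // each continuation byte adds 6 bits
        }
        return cp;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap("a\u20ac".getBytes(StandardCharsets.UTF_8));
        System.out.println(Integer.toHexString(nextCodePoint(buf))); // 61
        System.out.println(buf.position());                          // 1
        System.out.println(Integer.toHexString(nextCodePoint(buf))); // 20ac
        System.out.println(buf.position());                          // 4
    }
}
```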
-
utf8Length
public static int utf8Length(String string)
For the given string, returns the number of UTF-8 bytes required to encode the string.
- Parameters:
string - text to encode
- Returns:
number of UTF-8 bytes required to encode
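The byte count can be computed without encoding by classifying each code point into its UTF-8 width. The sketch below is one way to do that for well-formed strings; Utf8LengthSketch is a hypothetical name, and the handling of unpaired surrogates here (counted as three bytes) is an assumption that may differ from Text.utf8Length.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical UTF-8 length calculation; assumes a well-formed UTF-16 string. */
public class Utf8LengthSketch {

    public static int utf8Length(String string) {
        int total = 0;
        for (int i = 0; i < string.length(); ) {
            int cp = string.codePointAt(i);
            if (cp < 0x80) {
                total += 1;      // U+0000..U+007F
            } else if (cp < 0x800) {
                total += 2;      // U+0080..U+07FF
            } else if (cp < 0x10000) {
                total += 3;      // U+0800..U+FFFF
            } else {
                total += 4;      // supplementary planes
            }
            i += Character.charCount(cp); // skip both halves of a surrogate pair
        }
        return total;
    }

    public static void main(String[] args) {
        String s = "a\u00e9\u20ac\uD83D\uDE00"; // 1 + 2 + 3 + 4 bytes
        System.out.println(utf8Length(s));                              // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 10
    }
}
```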
-