java.lang.Object

org.apache.arrow.vector.util.ReusableByteArray

org.apache.arrow.vector.util.Text

All Implemented Interfaces: ReusableBuffer<byte[]>

public class Text extends ReusableByteArray

A simplified byte wrapper similar to Hadoop's Text class, without all the dependencies. Lifted from Hadoop 2.7.1.

Nested Class Summary

- static class Text.TextSerializer
  JSON serializer for Text.

Field Summary

- static final int DEFAULT_MAX_LEN

Fields inherited from class org.apache.arrow.vector.util.ReusableByteArray
bytes, EMPTY_BYTES, length

Constructor Summary

- Text()
- Text(byte[] utf8)
  Construct from a byte array.
- Text(String string)
  Construct from a string.
- Text(Text utf8)
  Construct from another text.

Method Summary

- void append(byte[] utf8, int start, int len)
  Append a range of bytes to the end of the given text.
- static int bytesToCodePoint(ByteBuffer bytes)
  Returns the next code point at the current position in the buffer.
- int charAt(int position)
  Returns the Unicode Scalar Value (32-bit integer value) for the character at position.
- void clear()
  Clear the string to empty.
- byte[] copyBytes()
  Get a copy of the bytes that is exactly the length of the data.
- static String decode(byte[] utf8)
  Converts the provided byte array to a String using the UTF-8 encoding.
- static String decode(byte[] utf8, int start, int length)
- static String decode(byte[] utf8, int start, int length, boolean replace)
  Converts the provided byte array to a String using the UTF-8 encoding.
- static ByteBuffer encode(String string)
  Converts the provided String to bytes using the UTF-8 encoding.
- static ByteBuffer encode(String string, boolean replace)
  Converts the provided String to bytes using the UTF-8 encoding.
- boolean equals(Object o)
- int find(String what)
- int find(String what, int start)
  Finds any occurrence of what in the backing buffer, starting at position start.
- byte[] getBytes()
  Returns the raw bytes; however, only data up to ReusableByteArray.getLength() is valid.
- void readWithKnownLength(DataInput in, int len)
  Read a Text object whose length is already known.
- void set(byte[] utf8)
  Set to a UTF-8 byte array.
- void set(byte[] utf8, int start, int len)
  Set the Text to a range of bytes.
- void set(String string)
  Set to contain the contents of a string.
- void set(Text other)
  Copy a text.
- String toString()
- static int utf8Length(String string)
  For the given string, returns the number of UTF-8 bytes required to encode the string.
- static void validateUTF8(byte[] utf8)
  Check if a byte array contains valid UTF-8.
- static void validateUTF8(byte[] utf8, int start, int len)
  Check to see if a byte array is valid UTF-8.
- static boolean validateUTF8NoThrow(byte[] utf8)
  Check if a byte array contains valid UTF-8.

Methods inherited from class org.apache.arrow.vector.util.ReusableByteArray
getBuffer, getLength, hashCode, set, set, setCapacity

Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Field Details
- DEFAULT_MAX_LEN
  
  public static final int DEFAULT_MAX_LEN
  See Also:
  
  Constant Field Values
Constructor Details
- Text
  
  public Text()
- Text
  
  public Text(String string)
  
  Construct from a string.
  
  Parameters:
  
  string - initialize from that string
- Text
  
  public Text(Text utf8)
  
  Construct from another text.
  
  Parameters:
  
  utf8 - initialize from that Text
- Text
  
  public Text(byte[] utf8)
  
  Construct from a byte array.
  
  Parameters:
  
  utf8 - initialize from that byte array
Method Details
- copyBytes
  
  public byte[] copyBytes()
  
  Get a copy of the bytes that is exactly the length of the data. See getBytes() for faster access to the underlying array.
  
  Returns:
  
  a copy of the underlying array
- getBytes
  
  public byte[] getBytes()
  
  Returns the raw bytes; however, only data up to ReusableByteArray.getLength() is valid. Please use copyBytes() if you need the returned array to be precisely the length of the data.
  
  Returns:
  
  the underlying array
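
The difference between getBytes() and copyBytes() is easiest to see once a buffer has been reused. The following is a minimal JDK-only sketch, not Arrow's implementation: BackingArraySketch and all of its members are hypothetical names, and it assumes a backing array that grows when needed but is never shrunk.

```java
import java.util.Arrays;

// Hypothetical stand-in for a reusable byte buffer (not the Arrow class).
final class BackingArraySketch {
    private byte[] bytes = new byte[0]; // backing storage, may exceed `length`
    private int length = 0;             // number of valid bytes

    void set(byte[] src) {
        if (bytes.length < src.length) {
            bytes = new byte[src.length]; // grow only when the data does not fit
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;
    }

    // Analogous to getBytes(): no copy, but only [0, length) is valid.
    byte[] rawBytes() {
        return bytes;
    }

    int length() {
        return length;
    }

    // Analogous to copyBytes(): a fresh array of exactly `length` bytes.
    byte[] exactCopy() {
        return Arrays.copyOf(bytes, length);
    }
}
```

In this sketch, setting an 11-byte value and then a 2-byte value leaves rawBytes() at 11 bytes while exactCopy() returns 2, which is why callers of the raw accessor must also consult the length.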
- charAt
  
  public int charAt(int position)
  
  Returns the Unicode Scalar Value (32-bit integer value) for the character at position. Note that this method avoids using the converter or doing String instantiation.
  
  Parameters:
  
  position - the index of the char we want to retrieve
  
  Returns:
  
  the Unicode scalar value at position or -1 if the position is invalid or points to a trailing byte
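
To illustrate what reading a scalar value at a byte position involves, here is a hedged JDK-only sketch. Utf8CharAtSketch.codePointAt is a hypothetical helper, not the Arrow method, and it assumes the input is otherwise well-formed UTF-8 (it does not reject overlong forms or encoded surrogates).

```java
// Hypothetical helper illustrating position-based decoding of UTF-8 bytes.
final class Utf8CharAtSketch {
    // Returns the code point starting at byte `position`, or -1 if the
    // position is out of range, points at a trailing byte, or is truncated.
    static int codePointAt(byte[] utf8, int length, int position) {
        if (position < 0 || position >= length) {
            return -1;
        }
        int b0 = utf8[position] & 0xFF;
        if (b0 < 0x80) {
            return b0;                       // single-byte (ASCII) character
        }
        if ((b0 & 0xC0) == 0x80) {
            return -1;                       // trailing (continuation) byte
        }
        int extra;                           // number of continuation bytes
        int cp;                              // payload bits of the lead byte
        if ((b0 & 0xE0) == 0xC0) {
            extra = 1; cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) {
            extra = 2; cp = b0 & 0x0F;
        } else if ((b0 & 0xF8) == 0xF0) {
            extra = 3; cp = b0 & 0x07;
        } else {
            return -1;                       // not a valid lead byte
        }
        if (position + extra >= length) {
            return -1;                       // sequence runs past the valid data
        }
        for (int i = 1; i <= extra; i++) {
            int b = utf8[position + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                return -1;                   // expected a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }
}
```

Because the position is a byte offset, an index that falls in the middle of a multi-byte character yields -1 rather than a partial value.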
- find
  
  public int find(String what)
- find
  
  public int find(String what, int start)
  
  Finds any occurrence of what in the backing buffer, starting at position start. The starting position is measured in bytes and the return value is in terms of byte position in the buffer. The backing buffer is not converted to a string for this operation.
  
  Parameters:
  
  what - the string to search for
  
  start - where to start from
  
  Returns:
  
  byte position of the first occurrence of the search string in the UTF-8 buffer or -1 if not found
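
The byte-based semantics of find can be sketched with plain JDK code. ByteFindSketch.find below is a hypothetical helper using a naive scan, not Arrow's implementation; treating a negative start as 0 is an assumption made for the sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: search UTF-8 bytes without building a String.
final class ByteFindSketch {
    // Returns the byte offset of the first match at or after `start`, or -1.
    static int find(byte[] haystack, int length, String what, int start) {
        byte[] needle = what.getBytes(StandardCharsets.UTF_8);
        for (int i = Math.max(start, 0); i + needle.length <= length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i; // byte position, not a char index
            }
        }
        return -1;
    }
}
```

For the bytes of "é=1", this sketch reports "=" at byte offset 2, whereas String.indexOf on the same text reports char index 1, because "é" occupies two bytes in UTF-8.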
- set
  
  public void set(String string)
  
  Set to contain the contents of a string.
  
  Parameters:
  
  string - the string to initialize from
- set
  
  public void set(byte[] utf8)
  
  Set to a UTF-8 byte array.
  
  Parameters:
  
  utf8 - the byte array to initialize from
- set
  
  public void set(Text other)
  
  Copy a text.
  
  Parameters:
  
  other - the text to initialize from
- set
  
  public void set(byte[] utf8, int start, int len)
  
  Set the Text to a range of bytes.
  
  Parameters:
  
  utf8 - the data to copy from
  
  start - the first position of the new string
  
  len - the number of bytes of the new string
- append
  
  public void append(byte[] utf8, int start, int len)
  
  Append a range of bytes to the end of the given text.
  
  Parameters:
  
  utf8 - the data to copy from
  
  start - the first position to append from utf8
  
  len - the number of bytes to append
- clear
  
  public void clear()
  
  Clear the string to empty.
  Note: For performance reasons, this call does not clear the underlying byte array that is retrievable via getBytes(). To free the byte-array memory, call set(byte[]) with an empty byte array (for example, new byte[0]).
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class ReusableByteArray
- readWithKnownLength
  
  public void readWithKnownLength(DataInput in, int len) throws IOException
  
  Read a Text object whose length is already known. This allows creating Text from a stream which uses a different serialization format.
  
  Parameters:
  
  in - the input to initialize from
  
  len - how many bytes to read from in
  
  Throws:
  
  IOException - if an I/O error occurs while reading from in
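
Reading a value whose length is supplied by an outer format can be sketched with the JDK's DataInput alone. The names KnownLengthReadSketch, frame, and readKnownLength are hypothetical, and the 4-byte length prefix is an assumed example format, not one defined by Arrow.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the caller's format carries the length, the reader
// only consumes exactly that many bytes.
final class KnownLengthReadSketch {
    // Writes a 4-byte length followed by the UTF-8 bytes (assumed format).
    static byte[] frame(String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeInt(payload.length);
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Reads exactly `len` bytes; readFully throws EOFException if fewer exist.
    static String readKnownLength(DataInput in, int len) throws IOException {
        byte[] buf = new byte[len];
        in.readFully(buf, 0, len);
        return new String(buf, StandardCharsets.UTF_8);
    }
}
```

A caller would first read the length with in.readInt() and then pass it along, so the payload reader never has to know how the length was encoded.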
- equals
  
  public boolean equals(Object o)
  
  Overrides:
  
  equals in class ReusableByteArray
- decode
  
  public static String decode(byte[] utf8) throws CharacterCodingException
  
  Converts the provided byte array to a String using the UTF-8 encoding. Malformed input is replaced by a default value.
  
  Parameters:
  
  utf8 - bytes to decode
  
  Returns:
  
  the decoded string
  
  Throws:
  
  CharacterCodingException - if this is not valid UTF-8
- decode
  
  public static String decode(byte[] utf8, int start, int length) throws CharacterCodingException
  
  Throws:
  
  CharacterCodingException
- decode
  
  public static String decode(byte[] utf8, int start, int length, boolean replace) throws CharacterCodingException
  
  Converts the provided byte array to a String using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.
  
  Parameters:
  
  utf8 - the bytes to decode
  
  start - where to start from
  
  length - length of the bytes to decode
  
  replace - whether to replace malformed characters with U+FFFD
  
  Returns:
  
  the decoded string
  
  Throws:
  
  CharacterCodingException - if the input could not be decoded
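
The replace flag corresponds to a choice the JDK exposes through CodingErrorAction. The sketch below shows that mechanism with a standard CharsetDecoder; DecodeReplaceSketch is a hypothetical name, and this is an illustration of the documented behavior rather than Arrow's actual code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of replace-or-report decoding with the JDK decoder.
final class DecodeReplaceSketch {
    static String decode(byte[] utf8, int start, int length, boolean replace)
            throws CharacterCodingException {
        // REPLACE substitutes the decoder's replacement string (U+FFFD);
        // REPORT raises a CharacterCodingException instead.
        CodingErrorAction action =
                replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(utf8, start, length)).toString();
    }
}
```

With the bytes 'a', 0xFF, 'b', the lenient call in this sketch yields "a\uFFFDb", while the strict call throws.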
- encode
  
  public static ByteBuffer encode(String string) throws CharacterCodingException
  
  Converts the provided String to bytes using the UTF-8 encoding. If the input is malformed, invalid chars are replaced by a default value.
  
  Parameters:
  
  string - the string to encode
  
  Returns:
  
  ByteBuffer: the bytes are stored at ByteBuffer.array() and the length is ByteBuffer.limit()
  
  Throws:
  
  CharacterCodingException - if the string could not be encoded
- encode
  
  public static ByteBuffer encode(String string, boolean replace) throws CharacterCodingException
  
  Converts the provided String to bytes using the UTF-8 encoding. If replace is true, then malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a MalformedInputException.
  
  Parameters:
  
  string - the string to encode
  
  replace - whether to replace malformed characters with U+FFFD
  
  Returns:
  
  ByteBuffer: the bytes are stored at ByteBuffer.array() and the length is ByteBuffer.limit()
  
  Throws:
  
  CharacterCodingException - if the string could not be encoded
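
Since the encoded result is described in terms of ByteBuffer.array() and ByteBuffer.limit(), it helps to see how the valid bytes are extracted from such a buffer. This is a JDK-only sketch using a standard CharsetEncoder; EncodeBufferSketch and validBytes are hypothetical names, and the sketch assumes an array-backed (heap) buffer.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: encode to a ByteBuffer, then copy only the valid bytes.
final class EncodeBufferSketch {
    // A new encoder reports malformed input (such as an unpaired surrogate)
    // by throwing, which is the JDK default action.
    static ByteBuffer encode(String s) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
    }

    // The backing array may be larger than the encoded data, so copy only
    // the region between position and limit.
    static byte[] validBytes(ByteBuffer buf) {
        int from = buf.arrayOffset() + buf.position();
        int to = buf.arrayOffset() + buf.limit();
        return Arrays.copyOfRange(buf.array(), from, to);
    }
}
```

Reading array() without bounding it by limit() can pick up unused trailing capacity, which is the same caveat that applies to getBytes().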
- validateUTF8NoThrow
  
  public static boolean validateUTF8NoThrow(byte[] utf8)
  
  Check if a byte array contains valid UTF-8.
  
  Parameters:
  
  utf8 - byte array
  
  Returns:
  
  true if the input is valid UTF-8, false otherwise
- validateUTF8
  
  public static void validateUTF8(byte[] utf8) throws MalformedInputException
  
  Check if a byte array contains valid UTF-8.
  
  Parameters:
  
  utf8 - byte array
  
  Throws:
  
  MalformedInputException - if the byte array contains invalid UTF-8
- validateUTF8
  
  public static void validateUTF8(byte[] utf8, int start, int len) throws MalformedInputException
  
  Check to see if a byte array is valid UTF-8.
  
  Parameters:
  
  utf8 - the array of bytes
  
  start - the offset of the first byte in the array
  
  len - the length of the byte sequence
  
  Throws:
  
  MalformedInputException - if the byte array contains invalid bytes
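
The throwing and non-throwing validation variants above differ only in how failure is reported. A hedged JDK-only sketch of that pairing follows; Utf8ValidateSketch, validate, and isValid are hypothetical names, and the check is delegated to a strict CharsetDecoder rather than reproducing Arrow's own validation logic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: one strict check, exposed in two reporting styles.
final class Utf8ValidateSketch {
    // Throws if the range [start, start + len) is not valid UTF-8.
    static void validate(byte[] utf8, int start, int len)
            throws CharacterCodingException {
        StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(utf8, start, len));
    }

    // Boolean wrapper for callers that prefer a result over an exception.
    static boolean isValid(byte[] utf8) {
        try {
            validate(utf8, 0, utf8.length);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The boolean form suits hot paths that expect occasional bad input, while the throwing form keeps the failure cause available to the caller.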
- bytesToCodePoint
  
  public static int bytesToCodePoint(ByteBuffer bytes)
  
  Returns the next code point at the current position in the buffer. The buffer's position will be incremented. Any mark set on this buffer will be changed by this method!
  
  Parameters:
  
  bytes - the incoming bytes
  
  Returns:
  
  the corresponding Unicode code point
- utf8Length
  
  public static int utf8Length(String string)
  
  For the given string, returns the number of UTF-8 bytes required to encode the string.
  
  Parameters:
  
  string - text to encode
  
  Returns:
  
  number of UTF-8 bytes required to encode
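
Computing the encoded size without performing the encoding follows directly from the UTF-8 byte-length ranges. The sketch below is a JDK-only illustration; Utf8LengthSketch is a hypothetical name, and the sketch assumes the string contains no unpaired surrogates.

```java
// Hypothetical sketch: count UTF-8 bytes per code point without encoding.
final class Utf8LengthSketch {
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // skip both halves of a surrogate pair
            if (cp < 0x80) {
                bytes += 1;               // U+0000..U+007F
            } else if (cp < 0x800) {
                bytes += 2;               // U+0080..U+07FF
            } else if (cp < 0x10000) {
                bytes += 3;               // U+0800..U+FFFF
            } else {
                bytes += 4;               // U+10000..U+10FFFF
            }
        }
        return bytes;
    }
}
```

For well-formed input the result matches string.getBytes(StandardCharsets.UTF_8).length, but no intermediate byte array is allocated.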
