Apache Parquet Data Type Mappings

MATLAB® represents column-oriented data with tables and timetables. Each variable in a table or timetable can have a different data type and any number of columns. Column vectors are the most common shape of table and timetable variables.

The Apache™ Parquet file format is used for column-oriented heterogeneous data. As in MATLAB tables and timetables, each column in a Parquet file can have a different data type.

Despite their similarity, the permitted data types in MATLAB tables and timetables do not always map perfectly to the permitted data types in Parquet files. In some cases, MATLAB must convert data types to retain information in the data (such as missing values). These conversions can result in a loss of precision.

In general, MATLAB tables and timetables have these behaviors when they are converted to Parquet files:

  • Table properties set on the original table are not saved.

  • Table row names or timetable row times are converted into a new table variable before being written.

  • When a Parquet file is read, column names that are not valid table variable names are converted to valid table variable names.
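The behaviors listed above can be observed with a round trip through parquetwrite and parquetread. The following is a minimal sketch: the table contents and variable names are made up for illustration, and no assumption is made about the name MATLAB gives to the variable created from the row names.

```matlab
% Build a small table with row names and a table property (illustrative data).
T = table([1.5; 2.5; 3.5], [10; 20; 30], ...
    'VariableNames', {'Weight', 'Count'}, ...
    'RowNames', {'a', 'b', 'c'});
T.Properties.Description = 'Example table';

% Write the table to a temporary Parquet file and read it back.
filename = [tempname '.parquet'];
parquetwrite(filename, T);
T2 = parquetread(filename);

% Per the list above, the row names are expected to come back as an
% additional variable, and the Description property is expected to be empty.
disp(T2.Properties.VariableNames)
disp(isempty(T2.Properties.Description))

delete(filename);
```

Inspecting T2.Properties.VariableNames after the round trip is a simple way to see which variable was created from the row names.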

The following tables summarize the data types that can be represented in MATLAB tables and timetables, and how variables of those types are represented in Parquet files. The mappings apply in both directions (MATLAB → Parquet and Parquet → MATLAB) unless otherwise noted.

Parquet files use a small number of primitive (or physical) data types. Logical types extend the physical types by specifying how the stored values should be interpreted. Parquet data types that are not covered here (JSON, BSON, binary, and so on) are not supported for reading or writing.

Numeric Data Types

MATLAB Table or Timetable    Apache Parquet Data Type
Variable Type                Physical Type    Logical Type
-------------------------    -------------    ------------
double                       DOUBLE           NONE
single                       FLOAT            NONE
int8                         INT32            INT_8
uint8                        INT32            UINT_8
int16                        INT32            INT_16
uint16                       INT32            UINT_16
int32                        INT32            NONE
uint32                       INT32            UINT_32
int64                        INT64            NONE
uint64                       INT64            UINT_64
logical                      BOOLEAN          NONE

Notes

  • Floating-point types (double, single): MATLAB converts any missing floating-point numbers in a Parquet file into NaN values.

  • Integer types (int8 through uint64): When reading a Parquet file, if an array with integral type contains missing values, then the array is converted into the MATLAB double data type instead of an integer data type. The missing values are set to NaN. For 64-bit integers, this conversion can result in truncation of values that are larger in magnitude than flintmax.

  • logical: When reading a Parquet file, if an array with BOOLEAN type contains missing values, then the array is converted into the MATLAB double data type instead of the logical data type. The missing values are set to NaN.
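The precision loss mentioned for 64-bit integers can be reproduced without a Parquet file, because it comes from the conversion to double itself. The sketch below uses an illustrative value one past flintmax (2^53); the variable names are arbitrary.

```matlab
% A 64-bit integer one past flintmax (2^53), built with integer arithmetic
% so that the value is exact: 9007199254740993.
x = int64(flintmax) + int64(1);

% Convert to double, as happens when an integer column contains missing
% values. The result is the nearest representable double.
d = double(x);

% Converting back does not recover the original value.
roundTrip = int64(d);
disp(roundTrip == x)    % displays 0 (false)
disp(x - roundTrip)     % displays 1
```

Values with magnitude at or below flintmax survive the same conversion unchanged, so the issue only affects int64 and uint64 columns that hold very large values and also contain missing entries.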

Text Data Types

MATLAB Table or Timetable                  Apache Parquet Data Type
Variable Type                              Physical Type    Logical Type
---------------------------------------    -------------    ------------
categorical                                BYTE_ARRAY       UTF8
string                                     BYTE_ARRAY       UTF8
char                                       BYTE_ARRAY       UTF8
cellstr (cell array of character           BYTE_ARRAY       UTF8
vectors)

Notes

  • categorical: Categorical arrays are converted into string arrays when written to Parquet files. Any <undefined> categorical values are converted to <missing> strings before being written.

  • string, char, and cellstr: These types are all mapped to the same Parquet data type, and that data type is always read into MATLAB as a string array.
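The text mappings above imply that text variables come back as string arrays after a round trip, regardless of how they were stored in MATLAB. The following is a minimal sketch with made-up variable names and values that checks the class of each variable after reading.

```matlab
% Three text-like variables: categorical, cellstr, and string (illustrative).
T = table(categorical({'red'; 'blue'; 'red'}), {'x'; 'y'; 'z'}, ["p"; "q"; "r"], ...
    'VariableNames', {'Color', 'Label', 'Code'});

% Write the table to a temporary Parquet file and read it back.
filename = [tempname '.parquet'];
parquetwrite(filename, T);
T2 = parquetread(filename);

% Per the table above, each variable is expected to be read as a string array.
disp(class(T2.Color))
disp(class(T2.Label))
disp(class(T2.Code))

delete(filename);
```

If the categorical type matters downstream, the variable can be converted back after reading, for example with categorical(T2.Color).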

Date and Time Data Types

MATLAB Table or Timetable    Apache Parquet Data Type
Variable Type                Physical Type    Logical Type
-------------------------    -------------    ----------------
datetime                     INT32            DATE
datetime                     INT64            TIMESTAMP_MILLIS
datetime                     INT64            TIMESTAMP_MICROS
duration                     INT32            TIME_MILLIS
duration                     INT64            TIME_MICROS

Notes

  • datetime: MATLAB datetime arrays written to a Parquet file use TIMESTAMP_MICROS format and have precision truncated to 1 microsecond. Display format settings are not saved.

  • duration: MATLAB duration arrays written to a Parquet file use TIME_MICROS format and have precision truncated to 1 microsecond. Display format settings are not saved.
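The microsecond limit on durations can be checked with a round trip. The sketch below is illustrative only: the variable name Elapsed, the values, and the display format are arbitrary choices, and the exact display format that MATLAB applies on reading is not assumed.

```matlab
% A duration with sub-microsecond detail and a non-default display format.
d = seconds([1.2345678; 2.5]);
d.Format = 'm';
T = table(d, 'VariableNames', {'Elapsed'});

% Write the table to a temporary Parquet file and read it back.
filename = [tempname '.parquet'];
parquetwrite(filename, T);
T2 = parquetread(filename);

% Any difference from the original values is expected to be smaller than
% one microsecond, and the display format may differ from the one set above.
disp(seconds(T2.Elapsed - T.Elapsed))
disp(T2.Elapsed.Format)

delete(filename);
```

Data that needs finer than microsecond resolution could instead be stored as a numeric variable, such as a double count of seconds, which maps to the Parquet DOUBLE type.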
