parquetwrite

Write columnar data to Parquet file

Description

parquetwrite(filename,T) writes a table or timetable T to a Parquet 2.0 file with the filename specified in filename.

parquetwrite(filename,T,'VariableCompression',VariableCompression) specifies the compression schemes to use when writing variables to the output file. The Parquet file format lets you specify the compression scheme for each variable (column) individually, which allows efficient compression and encoding of data.

Examples

Write tabular data into a Parquet file and compare the size of the same tabular data in .csv and .parquet file formats.

Read the tabular data from the file outages.csv into a table.

T = readtable('outages.csv');

Write the data to Parquet file format. By default, the parquetwrite function uses the Snappy compression scheme. To specify other compression schemes, see the 'VariableCompression' name-value pair.

parquetwrite('outagesDefault.parquet',T)

Get the file sizes and compute the size of the data in the .parquet format as a percentage of the size of the same data in the .csv format.

Get size of .csv file.

fcsv = dir(which('outages.csv'));
size_csv = fcsv.bytes
size_csv = 101040

Get size of .parquet file.

fparquet  = dir('outagesDefault.parquet');
size_parquet = fparquet.bytes
size_parquet = 44881

Compute the ratio.

sizeRatio = ( size_parquet/size_csv )*100 ;
disp(['Size Ratio = ', num2str(sizeRatio) '% of original size'])
Size Ratio = 44.419% of original size
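
The same comparison can be extended to the other compression schemes. The following is a minimal sketch, assuming outages.csv is on the MATLAB path; the output file names and the use of tempdir are illustrative choices, not part of the original example. The resulting sizes depend on the data and the MATLAB release, so none are shown here.

```matlab
% Read the sample data (assumes outages.csv is on the MATLAB path).
T = readtable('outages.csv');

% Compression schemes accepted by the 'VariableCompression' name-value pair.
schemes = {'Snappy','Brotli','Gzip','Uncompressed'};
sizes = zeros(size(schemes));

for k = 1:numel(schemes)
    % Hypothetical output name: one file per scheme in the temporary folder.
    fname = fullfile(tempdir, ['outages' schemes{k} '.parquet']);
    parquetwrite(fname, T, 'VariableCompression', schemes{k})
    info = dir(fname);
    sizes(k) = info.bytes;   % file size in bytes
end

% Collect the results in a table for easy comparison.
results = table(schemes', sizes', 'VariableNames', {'Scheme','Bytes'})
```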

Input Arguments

filename: Name of output Parquet file, specified as a character vector or string scalar.

Depending on the location you are writing to, filename can take on one of these forms.

Current folder: To write to the current folder, specify the name of the file in filename.

Example: 'myData.parquet'

Other folders: To write to a folder different from the current folder, specify the full or relative path name in filename.

Example: 'C:\myFolder\myData.parquet'

Example: 'dataDir\myData.parquet'

Remote location: To write to a remote location, filename must contain the full path of the file, specified as a uniform resource locator (URL) of the form:

scheme_name://path_to_file/myData.parquet

Based on your remote location, scheme_name can be one of the values in this table.

Remote Location                 scheme_name
Amazon S3™                      s3
Windows Azure® Blob Storage     wasb, wasbs
HDFS™                           hdfs

For more information, see Work with Remote Data.

Example: 's3://bucketname/path_to_file/myData.parquet'

Data Types: char | string
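
As an illustration of the two accepted data types, the sketch below writes the same small table twice, once with a character vector file name and once with a string scalar. The folder name parquetDemo, the file names, and the table contents are hypothetical; fullfile builds the path so the code does not depend on the platform's file separator.

```matlab
% Small illustrative table (hypothetical data).
T = table((1:3)', 'VariableNames', {'Value'});

% Hypothetical output folder inside the system temporary folder.
outDir = fullfile(tempdir, 'parquetDemo');
if ~exist(outDir, 'dir')
    mkdir(outDir);
end

% File name as a character vector.
charName = fullfile(outDir, 'myDataChar.parquet');

% File name as a string scalar.
strName = string(fullfile(outDir, 'myDataString.parquet'));

parquetwrite(charName, T)
parquetwrite(strName, T)
```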

T: Input data, specified as a table or timetable.

VariableCompression: Compression scheme names, specified as one of these values:

  • 'Snappy', 'Brotli', 'Gzip', or 'Uncompressed'. If you specify one compression algorithm, then parquetwrite compresses all variables using the same algorithm.

  • Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.

In general, 'Snappy' has better performance for reading and writing, 'Gzip' has a higher compression ratio at the cost of more CPU processing time, and 'Brotli' typically produces the smallest file size at the cost of compression speed.

parquetwrite writes Parquet 2.0 files using the Parquet dictionary encoding scheme. This encoding scheme is most efficient when the number of unique values in a variable is small. If the dictionary size or the number of unique values grows too large, then the encoding automatically falls back to plain encoding.
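
To get a rough feel for how the number of unique values affects file size, the sketch below writes two single-variable tables of the same length, one with only five distinct values and one with random values that are almost all distinct. The variable and file names are hypothetical. Both files use the default Snappy compression, so the size difference reflects the combined effect of encoding and compression rather than dictionary encoding alone.

```matlab
rng(0)      % make the random data reproducible
n = 1e5;    % number of rows

% Column with only 5 distinct values.
fewUnique = table(randi(5, n, 1), 'VariableNames', {'X'});

% Column in which nearly every value is distinct.
manyUnique = table(rand(n, 1), 'VariableNames', {'X'});

% Hypothetical output files in the temporary folder.
fFew  = fullfile(tempdir, 'fewUnique.parquet');
fMany = fullfile(tempdir, 'manyUnique.parquet');

parquetwrite(fFew, fewUnique)
parquetwrite(fMany, manyUnique)

% Compare the resulting file sizes.
dFew  = dir(fFew);
dMany = dir(fMany);
fprintf('Few unique values:  %d bytes\n', dFew.bytes)
fprintf('Many unique values: %d bytes\n', dMany.bytes)
```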

Example: parquetwrite('myData.parquet', T, 'VariableCompression', 'Brotli')

Example: parquetwrite('myData.parquet', T, 'VariableCompression', {'Brotli' 'Snappy' 'Gzip'})
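
When you pass one scheme per variable, the list is matched to the variables of T in order. The following sketch assumes a hypothetical three-variable table, so the three-element cell array lines up with it; the variable names and the file name are made up for illustration.

```matlab
% Hypothetical table with three variables, one per compression scheme.
T = table([1.5; 2.5; 3.5], ["north"; "south"; "east"], [true; false; true], ...
    'VariableNames', {'Loss','Region','Restored'});

fname = fullfile(tempdir, 'perVariable.parquet');

% Loss uses Brotli, Region uses Snappy, Restored uses Gzip.
parquetwrite(fname, T, 'VariableCompression', {'Brotli','Snappy','Gzip'})

% Read the file back to confirm that it was written.
R = parquetread(fname);
```

The compression scheme affects only how the data is stored on disk, so the values read back do not depend on which scheme was chosen.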

Limitations

In some cases, parquetwrite creates files that do not represent the original table or timetable T exactly. If you use parquetread or datastore to read the files, then the result might not have the same format or contents as the original data. For more information, see Apache Parquet Data Type Mappings.
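
One way to check for such differences is to read the file back and compare the class of each variable before and after the round trip. The sketch below uses a hypothetical two-variable table and file name. It only reports what it finds and makes no assumption about which types change, because that depends on the data type mappings of your MATLAB release.

```matlab
% Hypothetical table with a numeric variable and a cell array of character vectors.
T = table([1; 2; 3], {'a'; 'b'; 'c'}, 'VariableNames', {'Num','Name'});

fname = fullfile(tempdir, 'roundtrip.parquet');
parquetwrite(fname, T)
R = parquetread(fname);

% Class of each variable before writing and after reading.
classesIn  = varfun(@class, T, 'OutputFormat', 'cell');
classesOut = varfun(@class, R, 'OutputFormat', 'cell');

% Report the comparison, one line per variable.
names = T.Properties.VariableNames;
for k = 1:numel(names)
    fprintf('%s: written as %s, read back as %s\n', ...
        names{k}, classesIn{k}, classesOut{k});
end
```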

Introduced in R2019a