ChunkSize for save with '-v7.3' is completely off, resulting in large files

Hello everyone,
Using MATLAB R2022a.
I was using my own HDF5 function and was trying to figure out why save with '-v7.3' and no compression always gave me larger output files. After a while I noticed that, most of the time, the "ChunkSize" of the MATFILE does not fit the size of the input data well (the input size divided by the ChunkSize is usually not an integer). As a result, the MATFILEs are typically 10% to 25% larger than my own HDF5 files.
Here is a simplified example compared to HDF5 (the first dimension of the input data is a multiple of 128, so the chunk size is easy to compute for HDF5; I did that here just to keep the example easy to follow):
First, create an int16 array as example data:
%% VARIABLES
targetChunk = [128 1 256]; % For HDF5
deflateValue = 0;
data = int16((2^16-1)*(rand(6144, 256, 800)-0.5)); % Large int16 array spanning the int16 range (the original factor of 2 saturated half the values)
Creating the HDF5 and MATFILE:
%% CREATING THE HDF5 FOR COMPARISON
tic
dataSize = size(data);
chunkSize = dataSize./(ceil(dataSize./targetChunk));
h5create('.\test.h5', ...
'/data', size(data), ...
'Datatype', 'int16', ...
'Deflate', deflateValue, ...
'ChunkSize', chunkSize);
h5write('test.h5', '/data', data);
toc;
%% MATFILE
tic;
save('test.mat', 'data', '-v7.3', '-nocompression');
toc;
On my computer, creating the HDF5 is about two times faster; but this does not matter much to me.
Comparing the size of the two output files:
%% SIZE OF THE OUTPUT FILES
fileHDF5 = dir('.\test.h5');
fileMAT = dir('.\test.mat');
fprintf('HDF5: Total size %.2f MB - %0.3f bytes/sample\n', ...
(fileHDF5.bytes./(1024*1024)), fileHDF5.bytes./(numel(data)));
fprintf('MATFILE: Total size %.2f MB - %0.3f bytes/sample\n', ...
(fileMAT.bytes./(1024*1024)), fileMAT.bytes./(numel(data)));
For that example I am getting:
HDF5: Total size 2403.15 MB - 2.003 bytes/sample
MATFILE: Total size 2739.15 MB - 2.283 bytes/sample
The HDF5 size is correct: there is of course some header/overhead (almost empty here, since each file holds only a single dataset). The MATFILE, however, is far too big: 336 MB more, which is about 14% larger than the HDF5 file (about 12% of the MATFILE itself).
If we check the metadata for both files (hdf5 and matfile):
%% METADATA
hHDF5 = h5info('.\test.h5');
hMATFILE = h5info('.\test.mat');
hHDF5.Datasets.ChunkSize
hMATFILE.Datasets.ChunkSize
hHDF5.Datasets.ChunkSize is [128 1 200], which is expected.
hMATFILE.Datasets.ChunkSize is [1 256 114], which looks odd to me.
Since size(data, 3) is 800, which is not a multiple of 114, I think this is why I am getting extra bytes with the MATFILE. In my opinion, this should not happen.
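The padding implied by a given chunk size can be estimated with a few lines of arithmetic. The sketch below assumes that an uncompressed HDF5 dataset stores every chunk at its full size, including the partially filled chunks at the edges; the helper name `paddedBytes` is made up for illustration.

```matlab
% Estimate the stored size of an uncompressed chunked dataset,
% ASSUMING edge chunks are written at full chunk size.
dataSize      = [6144 256 800];   % size(data) in the example
bytesPerValue = 2;                % int16

% Hypothetical helper: bytes needed when every chunk is stored in full.
paddedBytes = @(chunk) prod(ceil(dataSize ./ chunk) .* chunk) * bytesPerValue;

rawMB = prod(dataSize) * bytesPerValue / 2^20;   % payload without padding
h5MB  = paddedBytes([128 1 200]) / 2^20;         % chunk size reported for test.h5
matMB = paddedBytes([1 256 114]) / 2^20;         % chunk size reported for test.mat

fprintf('Raw data:         %.2f MB\n', rawMB);
fprintf('Chunks 128x1x200: %.2f MB (%.1f%% padding)\n', h5MB,  100*(h5MB/rawMB  - 1));
fprintf('Chunks 1x256x114: %.2f MB (%.1f%% padding)\n', matMB, 100*(matMB/rawMB - 1));
```

Under that assumption, chunks of 114 elements need 8 chunks (912 elements) to cover the 800 elements along the third dimension, which is 14% extra. That is consistent with the gap between the two reported file sizes (2739.15 MB vs 2403.15 MB).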
I simplified the input data size here. In general, finding a well-fitting chunk size for the HDF5 is not always easy, but almost every time 'save' picks a chunk size that is off.
As a result, my HDF5 function produces smaller files about 95% of the time, and files of the same size as 'save' the remaining 5% of the time (including headers/overhead).
Any idea why this is happening?
Thank you.

Answers (1)

Piyush Dubey on 15 Sep 2023
Hi Vincent,
I understand that you are creating an HDF5 file using your own functions. The HDF5 file is being created correctly, but the MATFILE is significantly larger than expected.
Please note that HDF5 files write new blocks on each update, and these blocks are appended to MATFILEs as well. Unfortunately, there is no defined garbage collection process for this. Space can be optimized by deleting the old chunk whenever an update is made and reclaiming that space afterwards. This repacking helps to optimize space and avoids MATFILEs that are unexpectedly large.
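To see how much space a dataset actually occupies inside a file, MATLAB's low-level HDF5 interface can report the storage allocated for it. The sketch below is self-contained and uses a small throwaway file rather than test.mat; the file name, the 10x10 size and the 4x4 chunk size are illustrative assumptions. Comparing the allocated bytes with the raw data size and with the file size on disk may help to tell chunk padding apart from unreclaimed space.

```matlab
% Write a small int16 dataset whose size is not a multiple of the chunk size.
fileName = [tempname '.h5'];            % throwaway file (illustrative)
small    = int16(magic(10));            % 10x10 values, 200 bytes of payload
h5create(fileName, '/data', size(small), ...
    'Datatype', 'int16', 'ChunkSize', [4 4]);
h5write(fileName, '/data', small);

% Query the storage allocated for the dataset via the low-level API.
fid  = H5F.open(fileName, 'H5F_ACC_RDONLY', 'H5P_DEFAULT');
dset = H5D.open(fid, '/data');
allocatedBytes = double(H5D.get_storage_size(dset));
H5D.close(dset);
H5F.close(fid);

rawBytes = numel(small) * 2;            % int16 = 2 bytes per value
fileInfo = dir(fileName);               % total size on disk, incl. metadata
fprintf('Raw: %d B, allocated: %d B, file: %d B\n', ...
    rawBytes, allocatedBytes, fileInfo.bytes);
delete(fileName);
```

If the allocated size is already well above the raw size right after a single write, the extra bytes more likely come from the chunk layout than from old blocks left behind by updates.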
Please refer to the following MATLAB Answers thread where a similar issue has been addressed:
Please refer to the following links for more information on HDF5 files and the “h5repack” command:
  1. https://www.mathworks.com/help/matlab/hdf5-files.html
  2. https://www.systutorials.com/docs/linux/man/1-h5repack/
Hope this helps.
