
Unreasonably Large MAT File

6 views (last 30 days)
AdiKaba on 21 Aug 2020
Commented: Walter Roberson on 27 Aug 2020
I am applying a custom-developed, transform-domain-based compression algorithm to compress a data file. The algorithm performs as expected: an acceptable compression ratio is achieved, and the original data is reconstructed with small error. I attached plots of the original and reconstructed data. There are no issues with the algorithm's performance. However, I am having issues when I save the reconstructed data to disk as a MAT file. The input data is about 7 MB on disk, while the reconstructed data takes more than 30 MB on the hard disk.
The attached plots show two different input data sets along with the corresponding reconstructed data sets. To save the reconstructed data to a MAT file, I used MATLAB's "save" command.
save test1.mat reconstructedData;
save test2.mat inputData; % I did this just to verify that the MAT file has the same size as the input MAT file.
Why is the reconstructedData much larger on disk than the input data even though the plots tell a different story?
  21 Comments
Walter Roberson on 26 Aug 2020
" IDWT synthesizes the decomposed input wavelet coefficients into the time domain giving a reconstructed signal that is similar to the input signal."
Yes.
"This gives a more compact representation of the signal that can be represented with less number of bits compared to the original signal."
The DWT often has that property, but the IDWT does not.
"That is why I am expecting the file size to decrease (not the data type) by a factor of the compression ratio."
Is your file size 4186913 the original signal, or is it the DWT version of the signal, or is it the reconstructed version of the signal?
"That is why I am expecting the file size to decrease"
File size is determined by how much compression zlib can find for the data, which is a different matter than the "information content" (entropy) of the data.
Consider, for example, a 17 Hz sine wave with no phase delay, sampled at 5 megahertz: the "information content" is the fundamental frequency, the sample rate, and the number of samples. If the only permitted fundamentals were the integers 0 to 31, the only permitted sampling rates were integer megahertz 0 to 7, and the only permitted lengths were "full cycles" 0 to 255, then the "information content" would be only 16 bits (5 bits for the fundamental, 3 for the sampling rate, 8 bits for the number of cycles).
The compression available through a dictionary technique such as the one zlib uses would be at most two copies of each y value (one for the rise, one for the fall) per full cycle -- not very good. zlib does not even attempt mathematical calculations to predict values.
A discrete Fourier transform (fft) of such a signal would, to within round-off, show a single non-zero coefficient at 17 Hz and (for the two-sided transform) at -17 Hz, and if you used find() to locate them you could arrive at a fairly compact representation.
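A rough sketch of that idea, with smaller made-up numbers than the 5 MHz case above so the record stays short (the exact values are illustrative only):
Fs = 1000;                               % sample rate, Hz (illustrative)
f0 = 17;                                 % fundamental, Hz
t  = (0:Fs-1)/Fs;                        % exactly one second -> 17 full cycles
x  = sin(2*pi*f0*t);
X  = fft(x);
idx = find(abs(X) > 1e-6*max(abs(X)));   % bins that are not just round-off
% idx is [18 984]: the +17 Hz and -17 Hz bins. Storing only those two complex
% coefficients plus Fs and the length describes the whole 1000-sample signal.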
A wavelet transform of the same signal... it depends on which wavelet you choose. The tests I did just now found some that could do a 2:1 compression (cd was small enough to potentially be all zero), but I did not encounter any that could do better.
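Something along these lines shows the kind of check involved (the wavelet name is just an example, results vary with the wavelet, and the Wavelet Toolbox is required):
Fs = 1000;  f0 = 17;
x  = sin(2*pi*f0*(0:Fs-1)/Fs);      % one second of a 17 Hz sine
[cA, cD] = dwt(x, 'db4');           % single-level DWT
fprintf('max|cA| = %.3g   max|cD| = %.3g\n', max(abs(cA)), max(abs(cD)));
% If cD is negligible it can be dropped, leaving only cA (half as many samples
% as x) -- roughly the 2:1 figure mentioned above.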
You are confusing different representations of the data in your signal with the information content of the data.
And you are also confused in thinking that a 5:1 amplitude reduction makes a difference in the information content. There is as much information in the line segment between 1 and 2 as there is between 1/5 and 2/5 (infinite information if you are talking about real numbers). IEEE 754 floating-point representation does not use fewer bits for a value that is 1/5th of the original.
Walter Roberson on 26 Aug 2020
"I didn't accept your explanation because your reasoning regarding the wavelet transform is not mathematically correct."
What I wrote about idwt is,
"When you do your idwt you are spreading information across your data in a way that does not happen to align nicely with dictionary compression."
I am distinguishing between information and data.
Consider, for example, the wavelet that is a square wave. If your data happens to be square waves with duty cycle 1/2, then the wavelet can compact the information into a small number of coefficients -- just enough to encode the width and length in a structured way. And, similar to the discrete Fourier transform I described above, a lot of the coefficients might be zero, which would compress well with the dictionary-based compression scheme used by zlib (and so used by MATLAB) to store .mat files.
Then when you idwt(), the information ("square wave, amplitude, duty cycle, frequency, cycle count") gets spread out over the data that is the reconstructed square wave. And that data might not happen to compress nearly as well with the dictionary compression scheme as the wavelet-transformed version did.
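As a hedged illustration of that point (the signal parameters are made up, the Haar wavelet plays the role of the square-wave wavelet, and the Wavelet Toolbox is required):
x = repmat([ones(1,16) -ones(1,16)], 1, 100);   % +/-1 square wave, period 32
[C, L] = wavedec(x, 5, 'haar');                 % 5-level Haar decomposition
fprintf('%d of %d coefficients are (near) zero\n', nnz(abs(C) < 1e-10), numel(C));
% Long runs of zeros in C are exactly what zlib's dictionary scheme handles well.
xr = waverec(C, L, 'haar');                     % reconstruct
% xr is again a dense 3200-sample vector: the compact description has been
% spread back out over the samples, so zlib sees raw data again, not the idea
% "square wave of period 32".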
"Please note that I don't MATLAB to compress the variable while saving it."
Note that compression of .mat files is on automatically for -v7 and -v7.3 files unless you specifically ask for -nocompression. The 4186913-byte file size you are seeing is after MATLAB's zlib compression has been applied.
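A quick way to see that for yourself (the variable name and contents here are placeholders for your data; -nocompression needs R2017a or later):
reconstructedData = repmat(sin(2*pi*(0:999)/1000), 1, 4000);     % 4e6 doubles, fairly repetitive
save('compressed.mat',   'reconstructedData');                   % default: zlib compression on
save('uncompressed.mat', 'reconstructedData', '-nocompression');
d1 = dir('compressed.mat');  d2 = dir('uncompressed.mat');
fprintf('compressed: %d bytes   uncompressed: %d bytes\n', d1.bytes, d2.bytes);
% The uncompressed file is essentially 8 bytes per double plus a small header;
% the gap between the two files is whatever zlib managed to find.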

Answers (1)

AdiKaba on 26 Aug 2020
Edited: AdiKaba on 26 Aug 2020
I understand you have an MVP status to protect, as there are some mathematical inaccuracies in your responses. You are referring to my comments as "confused", which I think is a bad choice of words. Again, I disagree with your comments; just because you wrote a long response doesn't mean it is correct. No confusion here. Good luck.
  1 Comment
Walter Roberson on 27 Aug 2020
What result do you get when you save your inputSignal with -nocompression?
I firmly recommend the book Text Compression by Bell, Cleary, and Witten (Prentice Hall, 1990), https://books.google.ca/books/about/Text_Compression.html , for making clear the difference between information content and representation.
Reducing the amplitude of a signal does not reduce its entropy (disorder, difficulty of predicting). Filtering can reduce entropy (but does not necessarily do so.)
You have 32000000 bytes of data that under LZSS+Huffman encoding (zlib) compresses to 4186913 bytes. You process the decompressed signal, and you expect the stored file to be at most 4186913 bytes with a 5:1 compression on top of that, so you are hoping for a file on the order of 837383 bytes. But there is no certainty that the processing you do will happen to end up with something that compresses nicely under the LZSS+Huffman compression scheme.
Let me give another example drawn from the Fourier transform (which, as I showed above, can in some cases yield significant compression for some signals). Consider a 50% duty-cycle square wave. That is potentially just bi-level: a number of zeros followed by the same number of ones, with the pattern repeated many times. The (non-discrete) Fourier transform of a square wave is an infinite series. Suppose we take the dft and then process it, filtering out the 4/5 of coefficients that have the least absolute value. That would be compression under that model. Now ifft(). The result is not going to be a square wave: it is going to be a waveform with a lot of ringing on it, which does not lend itself nearly as well to LZSS+Huffman dictionary compression. The inaccuracies caused by the approximation get smeared out over all of the data when you ifft() to reconstruct.
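A hedged sketch of that effect (the lengths are made up, and only a handful of the largest bins are kept so the ringing is obvious on a short discrete signal):
x = repmat([ones(1,50) zeros(1,50)], 1, 64);   % 50% duty-cycle square wave
X = fft(x);
[~, order] = sort(abs(X), 'descend');
X(order(11:end)) = 0;                          % keep only the 10 largest bins
xr = real(ifft(X));                            % reconstruction: Gibbs ringing
% x is long runs of identical values and compresses extremely well; xr is a
% dense vector of distinct doubles with ringing everywhere, so a dictionary
% compressor such as zlib finds far less to reuse.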
Likewise, wavelets are based upon repeated shapes at different amplitudes and frequencies. Wavelets do not describe individual samples in the signal: they calculate amplitudes at different frequencies that, when reconstructed, approximate the signal well, and any change in the coefficients (such as zeroing them for compression) gets propagated as a subtle change across the entire signal. But "well" for reconstruction is not measured by exact reconstruction: it is measured by the error of the reconstruction. And although the SSE of the reconstruction may be small, and the wavelet coefficients may be an excellent representation of the "interesting" information in the signal, that does not mean the reconstructed signal is going to happen to be a good match for the LZSS+Huffman compression scheme that MATLAB automatically applies when you save files, unless you say to use -nocompression.
The processing you do might well have reduced the information in the signal in a way that is useful for your purpose. But that does not mean that the automatic compression MATLAB uses (unless told not to) is a good match for the processed result. What it does mean is that you have the potential to write your own compression routine that does a good job on the signal.
For example, you might want to experiment with fwrite() of the processed data (producing a 32000000-byte file), and then running gzip -9 on the binary file.
MATLAB is not adding overhead to the saved file: you just happen to be using an output signal that does not compress especially well with its built-in compression. And you can demonstrate whether MATLAB's compression is faulty by writing out the binary 32000000 bytes and putting it through some compression tools.
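A sketch of that experiment (the file name is a placeholder, and reconstructedData is assumed to be your processed vector in the workspace):
fid = fopen('reconstructed.bin', 'w');
fwrite(fid, reconstructedData, 'double');      % raw 8 bytes per sample, no header
fclose(fid);
% then, from a shell (or via system()):
%   gzip -9 reconstructed.bin
% Comparing reconstructed.bin.gz against the .mat file shows whether the size
% comes from MATLAB overhead or simply from how compressible the data is.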
