Level 5 .mat file with UTF-8 encoded character array fails to load on R2020b

I'm having some trouble with .mat files that loaded in R2020a but no longer load in R2020b. This appears to be due to a UTF-8 encoded string; a small example file is attached. If possible, I would like to get UTF-8 encoded strings in .mat files to load correctly in R2020b.
These files come from software we have written in-house that outputs .mat files for later analysis, in accordance with the .mat file specification published by MathWorks. The example file contains the string 'test°test', i.e., 'test' + the degree symbol (U+00B0) + 'test', in a variable 'x'. All of this is being done on Windows 10 64-bit version 1909 (build 18363.1379).
In R2020a (version = '9.8.0.1451342 (R2020a) Update 5') load('test.mat') gives:
x = 'test°test□'
That last character is the 2-byte sequence E9FF (inspected with double(x(end))). In R2020b (version = '9.9.0.1467703 (R2020b)') load('test.mat') gives:
Error using load
Cannot read file D:\temp\mexload_unicode\bin\test.mat.
Obviously R2020a is not loading the string correctly either - I'm not sure why there are random bytes on the end - but it does load, which has been good enough for us so far (we almost never have non-ASCII data).
The hex dump of the bytes encoding the variable 'x' in the .mat file is:
10 00 00 00 0A 00 00 00 74 65 73 74 C2 B0 74 65 73 74 00 00 00 00 00 00
This breaks down as follows (per pages 1-5 and 1-6 of the .mat file specification):
  • 10 00 00 00 = (16 decimal) = miUTF8
  • 0A 00 00 00 = (10 decimal) bytes
  • 74 65 73 74 C2 B0 74 65 73 74 ... = UTF-8 encoded 'test°test' (C2 B0 = degree in UTF-8) + padding to a 64-bit boundary as required by the mat file format
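The breakdown above can be checked mechanically. Below is a minimal Python sketch, assuming little-endian byte order and the plain (non-compressed) 8-byte tag layout that the dump suggests. The helper name `decode_element` is hypothetical and not part of any MathWorks tooling.

```python
import struct

# Bytes of the data element exactly as shown in the hex dump above.
raw = bytes.fromhex(
    "10 00 00 00 0A 00 00 00 74 65 73 74 C2 B0 74 65 73 74 00 00 00 00 00 00"
)

def decode_element(buf):
    """Split an 8-byte tag from its payload (assumes little-endian layout)."""
    data_type, n_bytes = struct.unpack_from("<II", buf, 0)
    payload = buf[8:8 + n_bytes]  # padding after the payload is ignored
    return data_type, n_bytes, payload

data_type, n_bytes, payload = decode_element(raw)
text = payload.decode("utf-8")

# Type code, declared byte count, decoded string, and its character count.
print(data_type, n_bytes, repr(text), len(text))
```

The declared size counts bytes, so it is 10 here even though the decoded string has 9 characters: the degree sign occupies two bytes in UTF-8.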
We've been using the software that produces these files for a long time (since ~R2012) and it's only with R2020b that we've seen failures to load. I've seen some references to UTF-8 in the R2020b release notes but nothing detailed enough to be useful or even specifically related to mat files. Usually Google has all the answers but in this case I can't find anyone with a related issue.
Apart from distilling the problem down to the example above, I've tried:
  • Enabling the "Beta: Use Unicode UTF-8 for worldwide language support" option in the "Region" settings of Windows 10 (and restarting); this made no difference
  • Inspecting .mat files made from within MATLAB; these all seem to be UTF-16 encoded, even when the above option was checked, and I can't find a way to force UTF-8 encoding
  • Tweaking the byte count for the field in case MATLAB doesn't count the "C2" of "C2B0"; this only corrupted the string further
Using UTF-16 encoding loads completely correctly (no spurious bytes, in both R2020a and R2020b), however this takes up twice as much space - and some of our files are large enough / have enough strings for this to matter (when being processed in RAM, doesn't matter so much once compressed on the disk). So I would like to get the UTF-8 encoding working.
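To put a number on the size difference for the example string, here is a rough Python sketch. It compares payload sizes only, ignoring tags and padding, and assumes UTF-16 is written without a byte-order mark.

```python
text = "test\u00b0test"  # 'test' + degree sign + 'test'

utf8_size = len(text.encode("utf-8"))       # 1 byte per ASCII char, 2 for U+00B0
utf16_size = len(text.encode("utf-16-le"))  # 2 bytes per char, no BOM

print(utf8_size, utf16_size)
```

For mostly-ASCII strings the UTF-16 payload approaches twice the UTF-8 size, which is the overhead described above.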
Is there anything wrong with the UTF-8 encoding above or the mat file it's in? Or is there any detailed information about the changes between R2020a and R2020b with regards to UTF-8 encoding and mat file loading?

Accepted Answer

Russel Burgess on 9 Mar 2021
I found the issue: it appears that MATLAB counts UTF-8 continuation bytes in the data element size but not in the array dimension size (which makes sense, even if it is not explicitly pointed out anywhere).
Going further back in the hex dump of test.mat, the breakdown is:
(dimensions array subelement)
05 00 00 00 (miINT32)
08 00 00 00 (8 bytes)
01 00 00 00 (1 row)
0A 00 00 00 (10 columns)
(array name subelement)
01 00 01 00 78 00 00 00 (miINT8, 1 byte, 'x')
(data element)
10 00 00 00 (miUTF8)
0A 00 00 00 (10 bytes)
74 65 73 74 C2 B0 74 65 73 74 00 00 00 00 00 00 ('test°test')
By changing the column count in the dimensions array subelement from 0A to 09 (the number of complete UTF-8 characters in 'test°test') the file loads correctly. Presumably old versions of matlab ignored this discrepancy and a check was added in R2020b.
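For a writer that produces such files, the fix amounts to deriving the two counts separately. Below is a minimal Python sketch, assuming little-endian output and a 1-by-N character row. The names `pad8` and `char_row_subelements` are hypothetical. It also assumes every character is in the Basic Multilingual Plane; how the column count should be computed for characters outside it was not tested here.

```python
import struct

MI_INT32 = 5   # type code from the dimensions subelement above
MI_UTF8 = 16   # type code from the data element above

def pad8(buf):
    """Zero-pad to the next 64-bit boundary."""
    return buf + b"\x00" * (-len(buf) % 8)

def char_row_subelements(text):
    """Build the dimensions and data subelements for a 1-by-N char row."""
    payload = text.encode("utf-8")
    n_cols = len(text)  # characters, NOT bytes (assumes BMP-only text)
    dims = struct.pack("<IIii", MI_INT32, 8, 1, n_cols)
    # The data element size is the UTF-8 byte count, continuation bytes included.
    data = pad8(struct.pack("<II", MI_UTF8, len(payload)) + payload)
    return dims, data

dims, data = char_row_subelements("test\u00b0test")
print(" ".join(f"{b:02X}" for b in dims))
print(" ".join(f"{b:02X}" for b in data))
```

For 'test°test' this yields a column count of 09 in the dimensions subelement and a byte count of 0A in the data element, matching the corrected file described above.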
