Time and Space Efficiency Problem When Splicing and Saving Multi-Format .csv Files

Hi,
I've recently been dealing with a bunch of .csv files stored in different folders named by date sequence. The aim is to splice the data with the same .csv filename (across all folders) together and save it for further use.
First, I go over all folders and use the unique function to derive a list of .csv file names. Next, I splice the data with identical filenames according to the date sequence.
data = [];
for i = 1:num_folder
    data = [data; readtable(tmp_file)]; % splice: append this folder's copy of the file
end
save(strcat(pwd,'\',filename,'.mat'),'data') % save the spliced table
The save is -v7.3 by default (file sizes range from hundreds of MB to several GB).
The whole process runs quite slow. I checked the Profile Viewer, and it turns out the save function consumed 70% of the time and the splicing the other 30%. I would love to know how to improve the time and space efficiency of my code.
[P.S.: The original .csv files have 60 columns. The majority of them are numbers (double). One is filled with 'true' or 'false'; one with 'HH:MM:SS.TTT' and another with 'HH:MM:SS.TTTTTT'. (Since I read them with the readtable function, they are saved in table format in a cell manner.) I tried to 'simplify' the table by converting all elements into double (I thought it could save some space?) but failed to correctly retain all the microseconds (TTTTTT). textscan only gives me HH, MM and SS.TTTT.]

4 Comments

dpb
dpb on 31 Jan 2023
Edited: dpb on 31 Jan 2023
Need to see a short section of one of the text files to know the specifics of the time format. datetime can't store times alone, so you would need to convert to the duration class... durations can store up to 9 digits for the fractional seconds.
For SAVE, heterogeneous data types such as the table create additional overhead, and the default compression also takes time.
IF the numeric data aren't needed with a full 15(+/-) digits of precision, you can save half the storage space for them by using single precision.
You can also save a sizable amount of storage by converting the T/F variables to categorical; just checking, a table with 40 T/F values stored as cellstr() occupied 5498 bytes in memory; the same table content as a categorical variable was only 1246 bytes. There's a little overhead associated with a categorical array, but once the size gets to be more than a handful, the storage savings grow very rapidly.
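The cellstr-vs-categorical size comparison above can be reproduced with a quick sketch (the variable names and element count here are invented for illustration; exact byte counts vary by release):

```matlab
% Compare memory footprint of T/F values stored as cellstr vs categorical.
tf = repmat({'true'; 'false'}, 500, 1);   % 1000 T/F entries as cellstr
c  = categorical(tf);                     % same content as a categorical array
whos tf c                                 % categorical occupies far fewer bytes
```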
The most concise (and fastest) storage, although less convenient, would be to write the data as native data types to a stream (aka 'binary') file; compression then would probably not gain much unless there are long sections of identical values in the data.
The more efficient way to approach this would probably be to concatenate the text files first, alone, and then read the result into the table. If they're so large they won't fit, then datastore is probably a better route to go than creating the single large file anyway.
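A minimal datastore sketch along those lines (the folder layout and options here are assumptions, not the OP's actual paths):

```matlab
% Point a datastore at all the CSVs under the dated folders; MATLAB then
% reads them in manageable chunks instead of one huge in-memory table.
ds = tabularTextDatastore(pwd, 'FileExtensions', '.csv', 'IncludeSubfolders', true);
data = readall(ds);   % or read(ds) chunk-wise / tall(ds) if it won't fit in memory
```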
Thanks! I tried saving the spliced data in .csv format instead of .mat and saved ~85% of the space!
I am now doing the whole data processing in a two-step procedure. The first step is what I'm doing now: read the original data and splice them by their specific filename (as 'original database re-storage'; my aim was to minimize repeated loading when conducting the subsequent data analysis). After 'archiving' the original data, the next step is to standardize/customize the needed columns in preparation for analysis. What troubles me is how to smoothly convert 'cell'-type 'HH:MM:SS.TTT' table elements into datetime/duration format. (I am aware that the pure time comes with 'today' attached when performing datestr. Is there any other way to derive and keep just the decimal parts of the pure time alone?)
The time structure example is 'HH:MM:SS.TTT' (e.g. '19:14:23.000') and 'HH:MM:SS.TTTTTT' (e.g. '08:46:51.429869').
My solution now is using:
datetime(data.Time,'InputFormat','HH:mm:ss.SSS')
but I have no idea how to deal with the 'HH:MM:SS.TTTTTT' format without losing precision (please show me an example if possible).
One example of my spliced data looks like this (attachment). I'm considering turning the whole table into a double matrix (true/false -> 1/0, time -> decimals) since it seems easier to use than the table format (cell style). Any suggestions?


Accepted Answer

Steven Lord
Steven Lord on 1 Feb 2023
Assuming your data files don't have any header rows, I'd consider avoiding the read/save process and just using operating system commands to staple the files together. If you then want to process the data in MATLAB, you'd only need to read it once (or you could assemble a datastore and create a tall array to work with the data).
For Windows, take a look at the last couple of examples on the Microsoft documentation page for the copy command. You could assemble the command in MATLAB from the file names, then call system to execute it.
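A hedged sketch of that assembly ('folders' and the output name are assumptions; copy /b concatenates byte-for-byte, so this only works when the files have no header rows):

```matlab
% Build a Windows copy command that staples same-named CSVs from each folder.
files = fullfile(folders, 'OI401.csv');               % cellstr of per-folder paths
cmd = sprintf('copy /b "%s" "%s"', strjoin(files, '"+"'), 'OI401_all.csv');
status = system(cmd);                                 % status == 0 on success
```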

2 Comments

Great! I'll try comparing matrix splicing, the fwrite function, and OS commands on the same set of files and see which is fastest.
Thanks! I used the copy command via MATLAB's system function. It is really way faster! It is the perfect way to do the stapling.
Extra thanks for the datastore and tall array pointers. They provide me with solutions and ideas for some other problems I've encountered before in extra-large data analysis.


More Answers (1)

dpb
dpb on 1 Feb 2023
Edited: dpb on 1 Feb 2023
As usual, when things are not totally plain vanilla in some fashion, you need to make use of the extended features supplied in MATLAB. We needed only a half-dozen lines to see the file format; I made a shorter version of the original with the following code (which I post as it may give you some ideas regarding the "splicing" operations you spoke of):
fid=fopen('OI401.csv','r');
fod=fopen('OI401_1.csv','w');
for i=1:100, fwrite(fod,fgets(fid)); end  % copy the first 100 lines verbatim
fclose('all')
to leave me with a big enough file to do something with but still not be huge...
Anyways, that done,
opt=detectImportOptions('OI401.csv'); % create basic import object
opt=setvartype(opt,{'tradingday','logDay'},"datetime"); % set datetime data type
opt=setvaropts(opt,{'tradingday','logDay'},"InputFormat",'yyyyDDDD'); % and the input format
opt=setvartype(opt,'isTrading','logical');
opt.SelectedVariableNames=opt.VariableNames([1:6 end-4:end]); % just pick a small subset for demo
tOI=readtable('OI401_1.csv',opt); % and read in with this help
The above returned
>> head(tOI)
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ __________ _________ ____ ____ ____ _______________ _______________ ________ ______ _________
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:21.430146 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:31.430749 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:41.433568 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:46:51.431701 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:01.431491 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:11.436309 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:21.435230 20230117 0 true
22:59:58 20230117 6 9802 9816 9708 08:45:11.567186 08:47:31.438434 20230117 0 true
>>
And, since I didn't think to do it on the import options object, to prove what we got:
>> ans.tradingday.Format='default'
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ ___________ _________ ____ ____ ____ _______________ _______________ ________ ______ _________
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:21.430146 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:31.430749 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:41.433568 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:46:51.431701 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:01.431491 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:11.436309 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:21.435230 20230117 0 true
22:59:58 27-Apr-2023 6 9802 9816 9708 08:45:11.567186 08:47:31.438434 20230117 0 true
>>
shows the dates were imported as datetime.
Now, as far as what to use for analyses, that'll be your call and will depend heavily upon what you intend to do with the data once you have it. There are all kinds of built-in analysis tools for tables that let you select variables by variable name and are quite convenient; whether they'll be of any direct use for your purposes will depend entirely upon what those purposes are...

14 Comments

Back to the previous comments on memory: given that the data are financial, there would appear to be no need for double precision, so using single precision for the numeric data seems worthwhile (as long as any toolbox you're planning on using has been extended to handle it; singles were red-haired stepchildren for a long time, and I'm not sure all toolboxes are fully supportive even yet).
I will be comparing the fwrite efficiency against the matrix splicing once this batch of 200 CSVs is finished.
Your example of using opt in the readtable function is illuminating! Especially for the logical values. However, my datetime problem is actually not tradingday; the aim is dealing with updatetime ('HH:MM:SS.TTT') and recvtime & logTime ('HH:MM:SS.TTTTTT'), which I would like to convert into a decimal format without a date (e.g. '08:45:12.123456' = 8/24 + 45/1440 + 12/86400 + 123456/86400000000, which will always be < 1) or any other format easy for comparison (I naturally assume using some time-comparison toolbox will be slower than double comparison).
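For what it's worth, a sketch of that decimal conversion via the duration class (the sample value is taken from the formula above; assumes a release where duration accepts text input):

```matlab
% Convert 'HH:mm:ss.SSSSSS' text to a fraction-of-day double (always < 1).
d = duration('08:45:12.123456', 'InputFormat', 'hh:mm:ss.SSSSSS');
frac = seconds(d) / 86400;   % = 8/24 + 45/1440 + 12/86400 + 123456/86400e6
```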
By the way, something weird happens when I adopt the 'yyyymmdd' format:
opt=setvaropts(opt,{'tradingday','logDay'},'InputFormat','yyyymmdd'); % and the input format
which gives out:
>> ans.tradingday.Format = 'default'
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
______________ ___________________ _________ ____ ____ ____ _________________ _________________ ________ ______ _________
'22:59:58.000' 2023-01-17 00:01:00 6 9802 9816 9708 '08:45:11.567186' '08:46:21.430147' 20230117 0 true
'22:59:58.000' 2023-01-17 00:01:00 6 9802 9816 9708 '08:45:11.567186' '08:46:31.430749' 20230117 0 true
'22:59:58.000' 2023-01-17 00:01:00 6 9802 9816 9708 '08:45:11.567186' '08:46:41.433569' 20230117 0 true
'22:59:58.000' 2023-01-17 00:01:00 6 9802 9816 9708 '08:45:11.567186' '08:46:51.431702' 20230117 0 true
'22:59:58.000' 2023-01-17 00:01:00 6 9802 9816 9708 '08:45:11.567186' '08:47:01.431492' 20230117 0 true
'22:59:58.000' 2023-01-17 00:01:00 6 9802 9816 9708 '08:45:11.567186' '08:47:11.436310' 20230117 0 true
'22:59:58.000' 2023-01-17 00:01:00 6 9802 9816 9708 '08:45:11.567186' '08:47:21.435230' 20230117 0 true
'22:59:58.000' 2023-01-17 00:01:00 6 9802 9816 9708 '08:45:11.567186' '08:47:31.438435' 20230117 0 true
I'm not sure where the '00:01:00' comes from?
I've tried this and it seems to be working:
opt=setvartype(opt,{'updatetime'},"datetime");
opt=setvaropts(opt,{'updatetime'},'InputFormat','HH:mm:ss.SSS'); % and the input format
opt=setvartype(opt,{'logTime'},"datetime");
opt=setvaropts(opt,{'logTime'},'InputFormat','HH:mm:ss.SSSSSS'); % and the input format
It successfully reads. But the resulting value is actually the wanted precise time + today.
Another thing: ideally, I may want to attain an accurate date+time number. For example, the idea of logDay + logTime would be nice. But first, logTime - today is not what I wanted:
datestr(tOI.logTime(1)-today)
ans = '27-Nov-7976 08:46:21'
And this is also not applicable:
datestr(tOI.logTime(1)-today+tOI.logDay(1))
"Addition is not defined between datetime arrays." % BUG report
It seems I'll have to use:
tOI.logTime(1)-datetime(0,1,0,0,0,0)+tOI.logDay(1)
ans =
datetime
20234717
>> ans.Format = 'yyyyMMdd HH:mm:ss.SSS'
ans =
datetime
20230117 08:47:21.430
which seems quite weird? Especially the 20234717 value; and is the time zero point datetime(0,1,0,0,0,0)?
Look at the posted section again; the two time-of-day fields are already converted to durations and include the precision of the inputs to six decimal places. The datenum data type, as a double, has insufficient precision past milliseconds to store that resolution.
If the date format is actually year/month/day, then you need to read the doc for the datetime input format: the abbreviations for it are NOT the same as those for datenum; the lowercase "m" is a minutes field, NOT month. Months are capital "M". That's where the minutes are coming from; you should have gotten a warning about an apparent time field in the date portion when you set that input format.
Is the input format for the date actually in yyyyMMdd and not the day of year? If so, then use
opt=setvaropts(opt,'tradingday',"InputFormat",'yyyyMMdd');
instead and you can compute the actual date/time of the observation as
tOI.Update=tOI.tradingday+tOI.updatetime;
Read the docs carefully...
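The lowercase/uppercase distinction is easy to demo on the sample date from this thread:

```matlab
% 'M' means months, 'm' means minutes in datetime input formats.
datetime('20230117', 'InputFormat', 'yyyyMMdd')   % parsed as 17-Jan-2023
datetime('20230117', 'InputFormat', 'yyyymmdd')   % 'mm' taken as a minutes field
```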
I'm not sure how you attained the duration with the posted section? I just copied and pasted the code above again, and it is still datetime format on my side (causing the addition error: the addition operator doesn't apply to datetime).
I looked into setvartype in ImportOptions.m; it says it doesn't support the duration TYPE (R2017a). Thus, I have to derive the duration via subtraction between datetimes.
The input format of the days is actually [year,month,day], so it is 'yyyyMMdd'. The 'yyyymmdd' was my typo, but I had already fixed it when replying last time. And since,
datetime(0,0,0,0,0,0)
ans =
datetime
-0001-11-30 00:00:00
>> ans.Format = 'yyyyMMdd HH:mm:ss.SSSSSS'
ans =
datetime
00021130 00:00:00.000000
I've been through help datetime and the datetime properties documentation and didn't see any explanation for the phenomenon above.
>> opt
opt =
DelimitedTextImportOptions with properties:
Format Properties:
Delimiter: {','}
...
Variable Import Properties: Set types by name using setvartype
VariableNames: {'updatetime', 'tradingday', 'mktstatus' ... and 62 more}
VariableTypes: {'duration', 'datetime', 'double' ... and 62 more}
SelectedVariableNames: {'updatetime', 'tradingday', 'mktstatus' ... and 62 more}
...
>>
Dumping the import object shows that R2020b recognized them as durations automagically... I'd recommend upgrading if R2017x doesn't; but you can work around it by importing as string instead and then manually converting to duration... There are quite a number of other niceties introduced since then that could be helpful...
>> opt=setvartype(opt,'updatetime','string');
>> tOI=readtable('OI401.csv',opt);
>> head(tOI)
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
______________ __________ _________ _______ _______ _______ _______________ _______________ ________ ______ _________
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:21.430146 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:31.430749 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:41.433568 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:51.431701 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:01.431491 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:11.436309 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:21.435230 20230117 0.00 true
"22:59:58.000" 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:31.438434 20230117 0.00 true
>> ans.updatetime=duration(ans.updatetime)
ans =
8×11 table
updatetime tradingday mktstatus last high low recvtime logTime logDay status isTrading
__________ __________ _________ _______ _______ _______ _______________ _______________ ________ ______ _________
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:21.430146 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:31.430749 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:41.433568 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:46:51.431701 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:01.431491 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:11.436309 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:21.435230 20230117 0.00 true
22:59:58 20230117 6.00 9802.00 9816.00 9708.00 08:45:11.567186 08:47:31.438434 20230117 0.00 true
>> ans.tradingday+ans.updatetime
ans =
8×1 datetime array
20230117
20230117
20230117
20230117
20230117
20230117
20230117
20230117
>> ans.Format='default';
>> ans.Format=[ans.Format '.SSS']
ans =
8×1 datetime array
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
>>
That gets you the absolute time vector in all its glory.
There's no point in trying to take datetime back to the big bang; there's no practical use for datetime(0,0,0) and certainly no need for it here.
And I'd not worry about time comparisons with datetime unless (and until) profiling has proven it to be a performance bottleneck. I'll venture it won't be; you can do direct comparisons of equality or greater/less than with built-in operator support, and there's the specific isbetween as well, which can be helpful.
Use the MATLAB facilities as intended/supplied first; THEN, if the application is too slow, you can look into optimizing it. But premature optimization, before you even know there is a problem, is likely to be misguided: it provides little actual improvement at best and makes coding/debugging/performance worse than the straightforward implementation at worst.
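A quick illustration of those built-in comparisons (the timestamps are invented for the example):

```matlab
% datetime supports relational operators and isbetween directly.
t = datetime(2023, 1, 17, 8, 46, 0) + seconds(0:10:30);  % four sample times
later = t > t(2);                     % direct greater-than comparison
inRange = isbetween(t, t(1), t(3));   % inclusive range test
```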
Actually, I wasn't thinking: the way, if R2017x readtable still doesn't know about importing the duration class directly, is to go ahead and let it use datetime, then timeofday to get the time portion...
Picking up from where we left off with the import object above but fixing up the update and log time variables, we do the following --
opt=setvartype(opt,{'updatetime','logTime'},'datetime');
opt=setvaropts(opt,{'updatetime','logTime'},"InputFormat",'HH:mm:ss.SSSSSS');
tOI=readtable('OI401.csv',opt);
tOI.updatetime=timeofday(tOI.updatetime);
tOI.logTime=timeofday(tOI.logTime);
I'm not sure there is a way to do the conversion in place on the two datetime variables together with higher-level addressing; all the other forms I could think of refused to change the class of the datetime to a duration.
Either way, there are at least two solutions, although the easiest, if possible, is to upgrade to a later release that understands the duration type on import/reading of the table to begin with.
I've read some help documents and found there have been quite a few updates between 2017a and the latest version. The download of MATLAB 2022b took longer than estimated. I will try both ways after the installation.
By the way, the method of calling the Windows command via MATLAB's system function (the answer by Steven Lord below) is really fast! It saves 80-90% of the time compared to our readtable/writetable method (if it is just for stapling files, of course).
Well, of course simply moving bytes from one location to another is far faster than doing all the loading, the data manipulation behind the scenes, and then rewriting it all back out again... that's obviously the "'round Robin Hood's barn" way to go at it for this purpose.
I've updated my MATLAB to 2022b, and all the datetime issues that once haunted me simply disappeared.
Being unable to find the thread is frustrating... Any suggestions on how to gain a sense for solving problems never encountered before?
Sorry, pal. I accepted Steven's answer as it might be more helpful when someone like me has similar problems. But thank you very, very much for all the time and effort! Your example of detectImportOptions is really detailed. Without you, I would have been converting to and using a primitive double matrix.
I think you can accept more than one answer if more than one was useful??? Not positive it will keep both; I haven't explored that much; I do know that reputation points aren't taken away from an earlier respondent if a second is selected...
It's not like we get anything other than personal satisfaction out of it anyway... :) My penchant is to try to teach as well as "just answer"; hence the examples of some less-often-used facilities I notice newcomers tend to overlook or not use...
There's no 'Accept' button anymore after accepting one, only a 'Vote' button. I'd be happy to accept both answers, as they're both very helpful to me.
And yes, exploring useful but less-often-used facilities is fun. Thanks a lot again.



Version

R2017a

Asked:

on 31 Jan 2023

Commented:

on 9 Feb 2023
