Time and Space Efficiency Problem in Multi-Format .csv file Splicing and Saving

Question

0 Stimmen

Hi,

I'm recently dealing with a bunch of csv files which are stored in different folders named by date sequence. The aim is to splice the data with the same csv filename (among all folders) together and save for further use.

First, I go over all folders and use unique function to derive a list of .csv file names. Next, I splice the data with identical filename according to date sequence.

data=[];
for i = 1:num_folder
    data = [data; readtable(tmp_file)];     % Splicing
end
save(strcat(pwd,'\',filename,'.mat'),'data')    % Saving

The saving is -v7.3 by default setting. (the file size range from hundreds of Mbs to several Gbs)

The whole process is running quite slow. I checked the profile viewer and it turns out save function consumed 70% of time and 'splicing' the other 30%. I would love to know how to elevate the time&space efficiency of my code.

[P.S.: The original .csv file have 60 columns. The Majority of them are numbers (double). One is filled with 'true' or 'false'; one filled with 'HH:MM:SS.TTT' and the other filled with 'HH:MM:SS.TTTTTT'. (Since I read them with readtable function, they are save in the table format in a cell manner.) I tried to 'simplify' the table by converting all elemets into double (I thought it could save some space?) but failed in correctly attaining all microseconds (TTTTTT). The textscan can only gives me HH, MM and SS.TTTT.]

4 Kommentare
2 ältere Kommentare anzeigen 2 ältere Kommentare ausblenden

dpb am 31 Jan. 2023

Bearbeitet: dpb am 31 Jan. 2023

Need to see a short section of one of the text files to know about the specifics of the time format. datetime can't store times alone so would need to convert to a duration class...they can store up to 9 digits for the fractional seconds.

For SAVE, heterogeneous data types as the table create additional overhead and the default compression also takes time.

IF the numeric data aren't needed with a full 15(+/-) digits of precision, you can save half the storage space for them by using single precision.

You can also save a sizable amount of storage by converting the T/F variables to categorical; just checking a table with 40 T/F stored as cellstr() occupied 5498 bytes in memory; the same table content as a categorical variable was only 1246 bytes; there's a littlel overhead associated with a categorical array, but once the size gets to be more than a handful, the storage savings grow very rapidly.

The most concise (and fastest) storage although less convenient would be to write the data as native data types to a stream (aka 'binary') file; compression then would probably not gain too much unless there are long sections of identical values in the data.

Frank am 1 Feb. 2023

In MATLAB Online öffnen

Thanks! I tried to saved the spliced data into .csv format instead of .mat and saved some ~85% of space!

I am now doing the whole data-processing in a two-step procedure. First is what I'm doing now, read the original data and splice them by their specific filename (as 'Original Database re-Storage'? my aim was to minimize multiple times of loading when conducting the following data analysis). After 'Archiving' the Original Data, the next step is to standardlize/customize the needed columns of data in preparation for data analysis. It troubles me of how to smoothly convert 'cell' type 'HH:MM:SS.TTT' table elements into datetime/duration format (I am aware of the pure time is comes with 'today' when performing datestr. Is there any other way to derive and keep the simple decimal parts of the pure time alone?)

The time structure example is 'HH:MM:SS.TTT' (e.g. '19:14:23.000') and 'HH:MM:SS.TTTTTT' (e.g. '08:46:51.429869').

My solution now is using:

sscanf(data.Time,'%{HH:mm.ss.SSS}D')

but have no idea how to deal with 'HH:MM:SS.TTTTTT' format without losing precision. (pls show me an example if possible)

Frank am 1 Feb. 2023

OI401.csv

One example of my spliced data look like this (attachment). I'm considering turning the whole table into double matrix (true/false -> 1/0, time -> decimals) since it seems easier to use than table format (cell style). Any suggestions?

Melden Sie sich an, um zu kommentieren.

Melden Sie sich an, um diese Frage zu beantworten.

Follow Question

Answer 1

Steven Lord am 1 Feb. 2023

0 Stimmen

Assuming your data files don't have any header rows, I'd consider avoiding the read/save process and just use operating system commands to staple together the files. If you then want to process the data in MATLAB you'd only need to read it once (or you could assemble a datastore and create a tall array to work with the data.)

For Windows take a look at the last couple examples on the Microsoft documentation page for the copy command. You could assemble the command in MATLAB from the file names then call system to execute it.

2 Kommentare
Keine anzeigen Keine ausblenden

Frank am 2 Feb. 2023

Great! I'll try compare matrix splicing, fwrite function and OS command on the same set of files and see who's the fastest one.

Frank am 3 Feb. 2023

Thanks! I used the copy command via MATLAB system. It is really way faster! It is the perfect way for the stapling.

Extra thanks for the datestore and tall array format. It provide me solution and ideas to some other problem I've encountered before on extra-large data analysis.

Melden Sie sich an, um zu kommentieren.

Answer 2

dpb am 1 Feb. 2023

Bearbeitet: dpb am 1 Feb. 2023

In MATLAB Online öffnen

2 Stimmen

Per usual when things are not totally plain vanilla in some fashion, you need to make use of the extended features supplied in MATLAB. We didn't need but a half-dozen lines to see the file format; I made a shorter version of the original by the following code (which post as may give you some ideas regarding the "splicing" operations you spoke of)

fid=fopen('OI401.csv','r');
fod=fopen('OI401_1.csv','w');
for i=1:100, fwrite(fod,fgets(fid)); end
fclose('all')

to leave me with a big enough file to do something with but still not be huge...

Anyways, that done,

opt=detectImportOptions('OI401.csv');                                   % create basic import object
opt=setvartype(opt,{'tradingday','logDay'},"datetime");                 % set datetime data type
opt=setvaropts(opt,'tradingday','logDay'},"InputFormat",'yyyyDDDD');    % and the input format
opt=setvartype(opt,'isTrading','logical');
opt.SelectedVariableNames=opt.VariableNames([1:6 end-4:end]);           % just pick a small subset for demo
tOI=readtable('OI401_1.csv',opt);                                       % and read in with this help

The above returned

>> head(tOI)
ans =
  8×11 table
    updatetime    tradingday    mktstatus    last    high    low        recvtime            logTime         logDay     status    isTrading
    __________    __________    _________    ____    ____    ____    _______________    _______________    ________    ______    _________
     22:59:58      20230117         6        9802    9816    9708    08:45:11.567186    08:46:21.430146    20230117      0         true   
     22:59:58      20230117         6        9802    9816    9708    08:45:11.567186    08:46:31.430749    20230117      0         true   
     22:59:58      20230117         6        9802    9816    9708    08:45:11.567186    08:46:41.433568    20230117      0         true   
     22:59:58      20230117         6        9802    9816    9708    08:45:11.567186    08:46:51.431701    20230117      0         true   
     22:59:58      20230117         6        9802    9816    9708    08:45:11.567186    08:47:01.431491    20230117      0         true   
     22:59:58      20230117         6        9802    9816    9708    08:45:11.567186    08:47:11.436309    20230117      0         true   
     22:59:58      20230117         6        9802    9816    9708    08:45:11.567186    08:47:21.435230    20230117      0         true   
     22:59:58      20230117         6        9802    9816    9708    08:45:11.567186    08:47:31.438434    20230117      0         true   
>> 

And, I didn't think to do it on the import option object, so to prove what we got,

>> ans.tradingday.Format='default'
ans =
  8×11 table
    updatetime    tradingday     mktstatus    last    high    low        recvtime            logTime         logDay     status    isTrading
    __________    ___________    _________    ____    ____    ____    _______________    _______________    ________    ______    _________
     22:59:58     27-Apr-2023        6        9802    9816    9708    08:45:11.567186    08:46:21.430146    20230117      0         true   
     22:59:58     27-Apr-2023        6        9802    9816    9708    08:45:11.567186    08:46:31.430749    20230117      0         true   
     22:59:58     27-Apr-2023        6        9802    9816    9708    08:45:11.567186    08:46:41.433568    20230117      0         true   
     22:59:58     27-Apr-2023        6        9802    9816    9708    08:45:11.567186    08:46:51.431701    20230117      0         true   
     22:59:58     27-Apr-2023        6        9802    9816    9708    08:45:11.567186    08:47:01.431491    20230117      0         true   
     22:59:58     27-Apr-2023        6        9802    9816    9708    08:45:11.567186    08:47:11.436309    20230117      0         true   
     22:59:58     27-Apr-2023        6        9802    9816    9708    08:45:11.567186    08:47:21.435230    20230117      0         true   
     22:59:58     27-Apr-2023        6        9802    9816    9708    08:45:11.567186    08:47:31.438434    20230117      0         true   
>> 

shows the dates were imported as datetime.

Now, as far as what to use for analyses, that'll be your call and will depend heavily upon what you intend to do with the data once you have it. There are all kinds of builtin analysis tools for tables that let you select variables by variable name that are quite convenient; whether they'll be of any direct use for your purposes will all depend upon what those purposes are...

14 Kommentare
12 ältere Kommentare anzeigen 12 ältere Kommentare ausblenden

Frank am 2 Feb. 2023

In MATLAB Online öffnen

I will be comparing the fwrite efficiency against the matrix splicing once this batch of 200 csv is finished.

Your example of using the opt in readtable function is illuminating! especially on the logical values. However, my problem on datetime is actually not tradingday, the aim is on dealing with updatetime ('HH:MM:SS.TTT') and recvtime & logTime ('HH:MM:SS.TTTTTT) which I would like to convert them into a decimal format without date (e.g. '08:45:12.123456 = 8/24+45/1440+12/86400+123456/86400000000, which will be always<1) or any other format easy for comparison (I naturally assume using some time-comparison toolbox will be slower than double comparasion)

By the way, it is weird when I adopt 'yyyymmdd' format:

opt=setvaropts(opt,{'tradingday','logDay'},'InputFormat','yyyymmdd');    % and the input format

which gives out:

>> ans.tradingday.Format = 'default'
ans =
  8×11 table
      updatetime          tradingday         mktstatus    last    high    low         recvtime              logTime          logDay     status    isTrading
    ______________    ___________________    _________    ____    ____    ____    _________________    _________________    ________    ______    _________
    '22:59:58.000'    2023-01-17 00:01:00    6            9802    9816    9708    '08:45:11.567186'    '08:46:21.430147'    20230117    0         true     
    '22:59:58.000'    2023-01-17 00:01:00    6            9802    9816    9708    '08:45:11.567186'    '08:46:31.430749'    20230117    0         true     
    '22:59:58.000'    2023-01-17 00:01:00    6            9802    9816    9708    '08:45:11.567186'    '08:46:41.433569'    20230117    0         true     
    '22:59:58.000'    2023-01-17 00:01:00    6            9802    9816    9708    '08:45:11.567186'    '08:46:51.431702'    20230117    0         true     
    '22:59:58.000'    2023-01-17 00:01:00    6            9802    9816    9708    '08:45:11.567186'    '08:47:01.431492'    20230117    0         true     
    '22:59:58.000'    2023-01-17 00:01:00    6            9802    9816    9708    '08:45:11.567186'    '08:47:11.436310'    20230117    0         true     
    '22:59:58.000'    2023-01-17 00:01:00    6            9802    9816    9708    '08:45:11.567186'    '08:47:21.435230'    20230117    0         true     
    '22:59:58.000'    2023-01-17 00:01:00    6            9802    9816    9708    '08:45:11.567186'    '08:47:31.438435'    20230117    0         true     

I'm not sure where the '00:01:00' comes from?

Frank am 2 Feb. 2023

In MATLAB Online öffnen

I've tried this and it seems working,

opt=setvartype(opt,{'updatetime'},"datetime");
opt=setvaropts(opt,{'updatetime'},'InputFormat','HH:mm:ss.SSS');    % and the input format
opt=setvartype(opt,{'logTime'},"datetime");
opt=setvaropts(opt,{'logTime'},'InputFormat','HH:mm:ss.SSSSSS');    % and the input format

It successfully reads. But the true value is actually the wanted precise time + today.

Another thing is ideally I may want to attain a accrate date+time number: For example the idea of logDay + logTime would be nice. But first logTime-today is not what I wanted.

datestr(tOI.logTime(1)-today)
ans = '27-Nov-7976 08:46:21'

And this is also not applicable:

datestr(tOI.logTime(1)-today+tOI.logDay(1))
"Addition is not defined between datetime arrays."          % BUG report

It seems I'll have to use:

tOI.logTime(1)-datetime(0,1,0,0,0,0)+tOI.logDay(1)
ans = 
  datetime
   20234717
>> ans.Format = 'yyyyMMdd HH:mm:ss.SSS'
ans = 
  datetime
   20230117 08:47:21.430

which seems quite weird? especially the 20234717 value, and the time 0 point is datetime(0,1,0,0,0,0)?

dpb am 2 Feb. 2023

Bearbeitet: dpb am 2 Feb. 2023

In MATLAB Online öffnen

>> opt
opt = 
  DelimitedTextImportOptions with properties:
   Format Properties:
                    Delimiter: {','}
...
   Variable Import Properties: Set types by name using setvartype
                VariableNames: {'updatetime', 'tradingday', 'mktstatus' ... and 62 more}
                VariableTypes: {'duration', 'datetime', 'double' ... and 62 more}
        SelectedVariableNames: {'updatetime', 'tradingday', 'mktstatus' ... and 62 more}
...
>> 

Dumping the import object shows that R2020b recognized them as durations automagically...I'd recommend upgrading if R2017x doesn't; but you can work around it by importing as string instead and then manually converting to duration...there are quite a number of other niceties that have been introduced since then that could be helpful...

>> opt=setvartype(opt,'updatetime','string');
>> tOI=readtable('OI401.csv',opt);
>> head(tOI)
ans =
8×11 table
updatetime      tradingday    mktstatus     last       high        low         recvtime            logTime         logDay     status    isTrading
______________    __________    _________    _______    _______    _______    _______________    _______________    ________    ______    _________
"22:59:58.000"     20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:46:21.430146    20230117     0.00       true   
"22:59:58.000"     20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:46:31.430749    20230117     0.00       true   
"22:59:58.000"     20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:46:41.433568    20230117     0.00       true   
"22:59:58.000"     20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:46:51.431701    20230117     0.00       true   
"22:59:58.000"     20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:47:01.431491    20230117     0.00       true   
"22:59:58.000"     20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:47:11.436309    20230117     0.00       true   
"22:59:58.000"     20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:47:21.435230    20230117     0.00       true   
"22:59:58.000"     20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:47:31.438434    20230117     0.00       true   
>> ans.updatetime=duration(ans.updatetime)
ans =
8×11 table
updatetime    tradingday    mktstatus     last       high        low         recvtime            logTime         logDay     status    isTrading
__________    __________    _________    _______    _______    _______    _______________    _______________    ________    ______    _________
22:59:58      20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:46:21.430146    20230117     0.00       true   
22:59:58      20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:46:31.430749    20230117     0.00       true   
22:59:58      20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:46:41.433568    20230117     0.00       true   
22:59:58      20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:46:51.431701    20230117     0.00       true   
22:59:58      20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:47:01.431491    20230117     0.00       true   
22:59:58      20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:47:11.436309    20230117     0.00       true   
22:59:58      20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:47:21.435230    20230117     0.00       true   
22:59:58      20230117       6.00       9802.00    9816.00    9708.00    08:45:11.567186    08:47:31.438434    20230117     0.00       true   
>> ans.tradingday+ans.updatetime
ans = 
8×1 datetime array
20230117
20230117
20230117
20230117
20230117
20230117
20230117
20230117
>> ans.Format='default';
>> ans.Format=[ans.Format '.SSS']
ans = 
8×1 datetime array
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
17-Jan-2023 22:59:58.000
>> 

Gets you the absolute time vector in all its glory.

There's no point in trying to take datetime back to the big bang; there's no practical use for datetime(0,0,0) and certainly no need/use for it here.

dpb am 2 Feb. 2023

Bearbeitet: dpb am 3 Feb. 2023

In MATLAB Online öffnen

Actually, wasn't thinking -- the way if R2017x readtable still doesn't know about importing duration class directly, is to go ahead and let it use the datetime, then timeofday to get the time portion...

Picking up from where we left off with the import object above but fixing up the update and log time variables, we do the following --

opt=setvartype(opt,{'updatetime','logTime'},'datetime');
opt=setvaropts(opt,{'updatetime','logTime'},"InputFormat",'HH:mm:ss.SSSSSS');
tOI=readtable('OI401.csv',opt);
tOI.updatetime=timeofday(tOI.updatetime);
tOI.logTime=timeofday(tOI.logTime);

I'm not sure there is a way to do the conversion in place on the two datetime variables together with higher-level addressing; all the other forms I could think of didn't like changing the class of the datetime to a duration.

Either way, there are at least two solutions although the easiest if possible is to upgrade to a later release that understands the duration type on import/reading of the table to begin with.

dpb am 6 Feb. 2023

I think you can accept more than one Answer if more than one were useful??? Not positive it will keep both; haven't explored that much; I do know that reputation points aren't taken away from an earlier respondent if a second is selected...

It's not like we get anything other than personal satisfaction out of it, anyways... :) My penchant is to try to teach as well as "just answer"; hence the examples of some less-oftened facilities I notice newcomers tend to overlook/not use...

Frank am 9 Feb. 2023

There's no 'Accept' button anymore after accepting one. Only a 'vote' button. I'd be happy to accept both of the answers as they're all very helpful to me.

And yes exploring useful but less-oftened facilities are fun. Thanks a lot again.

Melden Sie sich an, um zu kommentieren.

Time and Space Efficiency Problem in Multi-Format .csv file Splicing and Saving

4 Kommentare
2 ältere Kommentare anzeigen 2 ältere Kommentare ausblenden

Akzeptierte Antwort

2 Kommentare
Keine anzeigen Keine ausblenden

Weitere Antworten (1)

14 Kommentare
12 ältere Kommentare anzeigen 12 ältere Kommentare ausblenden

Kategorien

Produkte

Version

Tags

Community Treasure Hunt

Time and Space Efficiency Problem in Multi-Format .csv file Splicing and Saving

4 Kommentare 2 ältere Kommentare anzeigen 2 ältere Kommentare ausblenden

Akzeptierte Antwort

2 Kommentare Keine anzeigen Keine ausblenden

Weitere Antworten (1)

14 Kommentare 12 ältere Kommentare anzeigen 12 ältere Kommentare ausblenden

Kategorien

Produkte

Version

Tags

Siehe auch

Community Treasure Hunt

4 Kommentare
2 ältere Kommentare anzeigen 2 ältere Kommentare ausblenden

2 Kommentare
Keine anzeigen Keine ausblenden

14 Kommentare
12 ältere Kommentare anzeigen 12 ältere Kommentare ausblenden