MATLAB Answers

How to read large text data into matlab

Asked by Fan Li
on 5 Jan 2018
Latest activity Commented on by Abdullahi Samantar on 12 Dec 2018
Hi everyone, I have a text file of up to 10 GB that has to be read into MATLAB. Part of the data is listed below:
ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935
2242 -25.5898 -24.3081 -29.3729 0.00671099 0.00205397 -0.0108453
2243 9.2867 27.1493 -37.9274 -0.00115821 0.00912371 -0.00178601
2244 3.89714 -48.5019 70.5903 0.0041159 -0.00255481 -0.0029498
2245 49.8803 -40.1819 -5.30361 -0.0106695 0.0224494 0.00918698
2246 0.22115 -19.9758 -2.30173 0.0190817 0.0262146 -0.0153229
2247 -53.6289 50.5517 -23.5032 0.00388499 -0.00559089 0.000787281
...
ITEM: TIMESTEP
10
ITEM: NUMBER OF ATOMS
4323
ITEM: BOX BOUNDS pp pp ff
3.6821000000000000e-02 3.6996820000000000e+01
8.5320999999999994e-02 3.4761423000000001e+01
9.0000000000000002e-06 6.8636712000000003e+01
ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] c_water_force[4] c_water_force[5] c_water_force[6]
2241 -50.0606 -93.6118 -70.4534 0.000504085 -0.00684199 -0.00394166
2242 -14.4928 20.0993 3.55963 0.00244236 0.00203074 -0.0162865
2243 -2.64823 8.26566 23.6457 -0.000503352 0.0140246 -0.00909782
2244 -153.189 40.6383 -12.0141 0.00192712 -0.00177534 -0.00194966
2245 35.0712 -14.4107 6.31868 0.00668828 0.012556 0.00468532
2246 22.0675 -14.7867 61.4774 0.0182799 0.0194239 -0.00942033
2247 -3.80959 -88.6786 1.61222 0.00459477 -0.00577238 0.000324204
2248 -18.4777 -9.35017 -1.12766 0.0146401 0.00924069 -0.00730373
2249 16.2354 -7.34658 -25.1694 -0.0169203 0.0249397 0.0085598
2250 110.508 19.9749 -4.95758 -0.00500049 0.000961677 0.00667405
2251 -7.46059 3.35324 -41.665 0.0175383 -0.00791068 -0.00702065
Basically, the file consists of many parts, each starting with "ITEM: TIMESTEP". I have to skip the first 9 lines of each part and then read the remaining lines.
I tried the textscan function (maybe I misused it), but it is very slow. Is there a faster way to do this in MATLAB?

  1 Comment

How much RAM do you have available?


3 Answers

Answer by Cedric Wannaz
on 6 Jan 2018
Edited by Cedric Wannaz
on 6 Jan 2018
 Accepted Answer

If you have enough RAM for this, the following could run a little faster. It is way less versatile than Per's solution though, and exploits specific characters present in the header. You may have to adapt it a bit if there are e.g. other types of header content:
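content = fileread( 'data.txt' ) ;   % whole file as one char vector
% Each block ends just before the next 'ITEM: TIMESTEP'; the last one ends at EOF.
blockEnds = strfind( content, 'ITEM: T' ) - 1 ;
blockEnds = [blockEnds(2:end), numel( content )] ;
% Numeric data starts after the '6]' that closes the 'ITEM: ATOMS' header line.
blockStarts = strfind( content, '6]' ) + 3 ;
nBlocks = numel( blockStarts ) ;
data = cell( nBlocks, 1 ) ;
fprintf( '%d blocks found.\n', nBlocks ) ;
for bId = 1 : nBlocks
    % Parse the block as one stream of numbers, then fold it into 7 columns.
    data{bId} = reshape( sscanf( content(blockStarts(bId):blockEnds(bId)), '%f' ), 7, [] ).' ;
end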
PS: this takes < 20s for a 1GB data file on a small laptop (with 32GB RAM though).
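For anyone adapting this to their own dump file, here is a minimal self-contained sketch of the same approach. It is not part of the original answer: the sample file it writes, the timestep extraction with regexp, and the names hdr, fname, timesteps and stacked are illustrative assumptions. It assumes every block has the same number of atoms, so the blocks can be stacked along a third dimension.

```matlab
% Write a tiny two-block sample file so the sketch runs on its own.
% (Hypothetical data; point fname at the real dump file instead.)
hdr = [ 'ITEM: TIMESTEP\n%d\nITEM: NUMBER OF ATOMS\n2\n' ...
        'ITEM: BOX BOUNDS pp pp ff\n0 1\n0 1\n0 1\n' ...
        'ITEM: ATOMS id c_water_force[1] c_water_force[2] c_water_force[3] ' ...
        'c_water_force[4] c_water_force[5] c_water_force[6]\n' ] ;
fname = [tempname, '.txt'] ;
fid = fopen( fname, 'w' ) ;
fprintf( fid, [hdr, '1 1 2 3 4 5 6\n2 7 8 9 10 11 12\n'], 0 ) ;
fprintf( fid, [hdr, '1 6 5 4 3 2 1\n2 12 11 10 9 8 7\n'], 10 ) ;
fclose( fid ) ;

% Same parsing idea as above: locate block boundaries in the raw text.
content     = fileread( fname ) ;
blockEnds   = strfind( content, 'ITEM: T' ) - 1 ;
blockEnds   = [blockEnds(2:end), numel( content )] ;
blockStarts = strfind( content, '6]' ) + 3 ;
nBlocks     = numel( blockStarts ) ;
data        = cell( nBlocks, 1 ) ;
for bId = 1 : nBlocks
    data{bId} = reshape( sscanf( content(blockStarts(bId):blockEnds(bId)), '%f' ), 7, [] ).' ;
end

% Addition (assumption): the timestep value is the integer on the line
% following 'ITEM: TIMESTEP'.
tok       = regexp( content, 'ITEM: TIMESTEP\s+(\d+)', 'tokens' ) ;
timesteps = cellfun( @(c) str2double( c{1} ), tok ) ;

% Addition (assumption): all blocks have equal size, so they can be
% stacked into an nAtoms-by-7-by-nBlocks array.
stacked = cat( 3, data{:} ) ;
```

If your header differs (for example, a different last column name), the '6]' marker has to be replaced by whatever string ends your 'ITEM: ATOMS' line, and the column count 7 by your own.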

  2 Comments

Thanks. This way is faster.
Hi Fan Li,
How did you adapt Cedric's code to get your large text file (LAMMPS) to run?
I couldn't figure it out.
Thank you



Answer by per isakson
on 5 Jan 2018
Edited by per isakson
on 6 Jan 2018

Given:
  • All headers consist of 9 lines
  • All data blocks consist of 7 columns of numerical data
  • The blocks of numerical data should be converted to double. (Added later.)
  • The columns of the data are separated by space, char(32)
  • There is RAM enough to store the parsed data. Nearly 10GB is needed to store in double. Single would introduce a rounding error.
Try:
>> cac = cssm( 'cssm.txt' );
>> whos cac
  Name      Size            Bytes  Class    Attributes
  cac       1x6              3696  cell
>> cac
cac =
[7x7 double] [11x7 double] [7x7 double] [11x7 double] [7x7 double] [11x7 double]
>>
>> cac{1}
ans =
1.0e+03 *
2.2410 0.0515 -0.0480 -0.0556 0.0000 -0.0000 -0.0000
2.2420 -0.0256 -0.0243 -0.0294 0.0000 0.0000 -0.0000
2.2430 0.0093 0.0271 -0.0379 -0.0000 0.0000 -0.0000
2.2440 0.0039 -0.0485 0.0706 0.0000 -0.0000 -0.0000
2.2450 0.0499 -0.0402 -0.0053 -0.0000 0.0000 0.0000
2.2460 0.0002 -0.0200 -0.0023 0.0000 0.0000 -0.0000
2.2470 -0.0536 0.0506 -0.0235 0.0000 -0.0000 0.0000
>>
where
function cac = cssm( ffs )
    fid = fopen( ffs );
    cac = cell(1,0);
    while not( feof( fid ) )
        cac(1,end+1) = textscan( fid, '%f%f%f%f%f%f%f' ...
            , 'Headerlines',9, 'CollectOutput',true );
    end
    fclose( fid );
end
and cssm.txt contains three copies of the data from the question.
"textscan [...] is very slow. Is there a faster way [...] Matlab?" AFAIK: No, not significantly faster. However, I don't agree that it's very slow.
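As a rough check of the RAM figure in the assumptions above, the arithmetic can be sketched as follows. This is only an estimate: it takes the first data row of the question as a typical row, ignores the header lines, and uses the 10 GB file size stated in the question. The variable names are illustrative.

```matlab
% Rough RAM estimate for the parsed data (all figures are approximations).
fileBytes = 10e9 ;   % file size stated in the question
sampleRow = '2241 51.4573 -48.0145 -55.5854 0.00121546 -0.00693737 -0.00454935' ;
bytesPerRowText   = numel( sampleRow ) + 1 ;   % +1 for the newline character
bytesPerRowDouble = 7 * 8 ;                    % 7 columns, 8 bytes per double
nRowsApprox = fileBytes / bytesPerRowText ;    % headers ignored
estimateGB  = nRowsApprox * bytesPerRowDouble / 1e9 ;
fprintf( 'About %.1f GB as double, %.1f GB as single.\n', estimateGB, estimateGB / 2 ) ;
```

Rows with more digits per value shrink the ratio of parsed size to file size, and rows with fewer digits raise it, so the parsed data is of the same order of magnitude as the file itself.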

  2 Comments

Hi per isakson,
I am using the fgetl function, which is faster.
That's comparing apples to oranges. I assumed without stating it that the "numerical blocks" should be parsed.



Answer by Steven Lord
on 8 Jan 2018

If your data is too large to fit in memory all at once, consider using a datastore. Since you have data in the headers that I assume you want to access, using a TabularTextDatastore probably won't suit your needs. You may need to use a general FileDatastore or develop your own custom datastore using your knowledge of the way your data is formatted.
Once you have a datastore you could use it to create a tall array.

  1 Comment

Hi Steven Lord,
I have read the documentation on tall arrays and datastores, and it is useful to me. However, I do not know how to skip the headers, which I do not need. The function I am using now and the other methods provided here do not work with a datastore, and there is little material on skipping headers with a datastore. Can you tell me how to skip the headers with a datastore? The format of the data is given above.
Thanks
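One possible way to skip the headers with a general FileDatastore is a custom read function that returns one timestep block per call. The following is a minimal sketch under stated assumptions, not a tested solution for the 10 GB file: it assumes a MATLAB release in which fileDatastore supports 'ReadMode','partialfile' (R2020a or later, as far as I know), that every header has exactly 9 lines, and that every data row has 7 numeric columns. The function name readBlock and the sample file are hypothetical.

```matlab
% Write a tiny two-block sample file so the sketch runs on its own.
% (Hypothetical data; point fname at the real dump file instead.)
hdr = [ 'ITEM: TIMESTEP\n%d\nITEM: NUMBER OF ATOMS\n2\n' ...
        'ITEM: BOX BOUNDS pp pp ff\n0 1\n0 1\n0 1\n' ...
        'ITEM: ATOMS id f[1] f[2] f[3] f[4] f[5] f[6]\n' ] ;
fname = [tempname, '.txt'] ;
fid = fopen( fname, 'w' ) ;
fprintf( fid, [hdr, '1 1 2 3 4 5 6\n2 7 8 9 10 11 12\n'], 0 ) ;
fprintf( fid, [hdr, '1 6 5 4 3 2 1\n2 12 11 10 9 8 7\n'], 10 ) ;
fclose( fid ) ;

% In 'partialfile' mode the datastore calls the read function repeatedly
% on the same file until the function reports that it is done.
fds = fileDatastore( fname, 'ReadFcn', @readBlock, 'ReadMode', 'partialfile' ) ;

blocks = {} ;
while hasdata( fds )
    % One N-by-7 double matrix per timestep. For a file that does not fit
    % in memory, process each block here instead of storing it.
    blocks{end+1} = read( fds ) ;
end
fprintf( '%d blocks read.\n', numel( blocks ) ) ;

function [data, fid, done] = readBlock( filename, fid )
    % The second argument is the user data carried between calls; it is
    % empty on the first call for a file, so the file is opened then.
    if isempty( fid )
        fid = fopen( filename, 'r' ) ;
    end
    % Skip the 9 header lines, then read numeric rows until the next header.
    c    = textscan( fid, '%f%f%f%f%f%f%f', 'HeaderLines', 9, 'CollectOutput', true ) ;
    data = c{1} ;
    % Stop at end of file, or if nothing could be parsed.
    done = feof( fid ) || isempty( data ) ;
    if done
        fclose( fid ) ;
    end
end
```

The header skipping itself is the same textscan call used in per isakson's answer; the datastore only takes over the bookkeeping of when to read the next block.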
