MATLAB Answers

0

How to save a tall array / table to a text or csv file?

Asked by Benjamin Imbach on 5 Sep 2017
Latest activity Edited by Adam Filion on 1 Oct 2018
I'm trying to save the contents of a tall table that doesn't fit into memory to a .txt file.
MATLAB provides the function write. However, it can only write the table contents to .mat files. So far I haven't found an option or another function that could write the data to a text file.
A workaround I'm trying to do is to continously gather a part of the tall table, save it to a text file, gather the next part etc. until the table end is reached. For that to work in tall syntax I suppose I need a vector containing the row numbers of the tall table. That way I could find the index of the rows I want to gather and write with:
idx = (RowNumbers >= lowerLimit) & (RowNumbers <= upperLimit);
With the index vector it is then possible to gather the rows I want to save to the text file:
TableToSave = gather(Table(idx,:));
once the data is gathered the table could be saved with the writetable function. After that, the lowerLimit and upperLimit could be adjusted and a new chunk of the table could be saved.
The point where I'm failing is the construction of this vector containing the row numbers. In theory it's simply
RowNumbers = 1:1:size(Table,1); // Or: RowNumbers = 1:1:gather(size(Table,1));
The first one doesn't work because the 1:X syntax doesn't support 'tall doubles' and the second approach doesn't work I suppose because the resulting RowNumbers and index vector are completely in-memory while the table is not.
So if I try to
idx = tall((RowNumbers >= lowerLimit) & (RowNumbers <= upperLimit));
and use
gather(head(DB(idx,:)));
The following error appears:
Incompatible tall array arguments. The first dimension in each tall array must have the same size, and each tall array must be based on the same datastore.
To sum up:
1. Is there another way to save tall arrays / tables to text files?
2. How to create an "unevaluated" row number array that then could be used for the described workaround?
Thanks a lot!

  0 Comments

Sign in to comment.

2 Answers

Answer by Edric Ellis
on 7 Sep 2017
 Accepted Answer

I think the least inefficient method is probably to combine use of tall/write with writetable. Calling gather repeatedly is going to be inefficient - the approach below takes 2 passes over the data (one of which is in the optimised .mat form, so should be quicker). Here's the sort of thing I mean.
%%Step 1 - create a tall table
varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
ds1 = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
'SelectedVariableNames', varnames);
tt = tall(ds1);
%%Step 2 - operate on tall table
tt.TotalDelay = tt.ArrDelay + tt.DepDelay;
%%Step 3 - use tall/write to emit .mat files
writeDir = tempname
mkdir(writeDir);
write(writeDir, tt);
%%Step 4 - iteratively convert the tall/write output to CSV
ds = datastore(writeDir);
csvDir = tempname
mkdir(csvDir);
idx = 0;
while hasdata(ds)
idx = 1 + idx;
fname = fullfile(csvDir, sprintf('out_%06d.csv', idx));
writetable(read(ds), fname);
end
A refinement of this would be to partition the datastore and operate on it in parallel, using the techniques described here in the documentation.

  2 Comments

Here's the parfor version of Step 4.
%%Step 5 - use parfor to parallelise the writetable loop
ds = datastore(writeDir);
N = numpartitions(ds, gcp);
csvDir2 = tempname
mkdir(csvDir2);
parfor idx1 = 1 : N
idx2 = 0;
subds = partition(ds, idx1, N);
while hasdata(subds)
idx2 = 1 + idx2;
fname = fullfile(csvDir2, sprintf('out_%06d_%06d.csv', idx1, idx2));
writetable(read(subds), fname);
end
end
Thanks! This is what I was looking for. I still wish there was a direct way to write the tall table to a csv but until then this will do!
A small correction:
subds = partition(ds, idx1, N);
should be
subds = partition(ds, N, idx1);
In your case it worked since the datastore is probably so small that N = 1.
Thanks again!

Sign in to comment.


Answer by Adam Filion on 1 Oct 2018
Edited by Adam Filion on 1 Oct 2018

As of R2018b the tall write command now directly supports writing tall arrays to .txt files (also .csv, .xls* and custom formats) in addition to .mat:

  0 Comments

Sign in to comment.