In memory calculations with tall arrays from different databases

Suppose I have two datastores (each holding a table of double values):
ds_1 = tabularTextDatastore('file_1.txt');
ds_2 = tabularTextDatastore('file_2.txt');
Also imagine that I created my tall arrays as
X = tall(ds_1);
Y = tall(ds_2);
Now, let's imagine that I trained a model, mdl, with fitlm and I want to use this model to predict from X and Y as
Answer = predict(mdl, [X,Y]);
The error I receive is this
Error using tall/horzcat (line 23)
Incompatible tall array arguments. The tall arrays must be based on the
same datastore.
How can I solve this problem without gathering the data, using only the in-memory capabilities of tall arrays?

Answers (1)

Guillaume on 16 Jul 2019
Edited: Guillaume on 16 Jul 2019
If you have R2019a, you can combine your two datastores.
ds_1 = tabularTextDatastore('file_1.txt');
ds_2 = tabularTextDatastore('file_2.txt');
ds_combined = combine(ds_1, ds_2);
Answer = predict(mdl, tall(ds_combined));
In previous versions, I'm not sure that there's a way to do it other than creating your own custom datastore that would keep track of both datastores (essentially recreating the R2019a CombinedDatastore).
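To make the combine approach easier to try, here is a minimal sketch under some assumptions: the sample tables, the variable names a1, a2 and b1, and the mapreducer(0) call are illustrative additions, not part of the original question. Both files need the same number of rows and distinct variable names, because each read of the combined datastore horizontally concatenates the reads of the underlying datastores.

```matlab
% Hypothetical sample data: two small files with distinct variable names.
T1 = table((1:5)', (6:10)', 'VariableNames', {'a1', 'a2'});
T2 = table((11:15)', 'VariableNames', {'b1'});
writetable(T1, 'file_1.txt');
writetable(T2, 'file_2.txt');

ds_1 = tabularTextDatastore('file_1.txt');
ds_2 = tabularTextDatastore('file_2.txt');

% Requires R2019a or later. Each read returns the underlying reads
% concatenated side by side.
ds_combined = combine(ds_1, ds_2);
p = preview(ds_combined);
disp(p.Properties.VariableNames)

% Assumption: evaluate tall arrays in the local MATLAB session
% rather than on a parallel pool.
mapreducer(0);

% A single tall table backed by one datastore, so no horzcat of
% tall arrays from different datastores is needed.
XY = tall(ds_combined);
firstRows = gather(head(XY, 3));
```

With a model trained on matching predictor names, predict(mdl, XY) would then take the place of predict(mdl, [X,Y]).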
Guillaume on 18 Jul 2019
"The tall array generation from combined datastores is not compatible with parallel computation"
I would recommend raising a service request with MathWorks then, as they should make it possible to create a combined datastore that has the exact same properties as the source datastores (if they are compatible). I don't have the Parallel Computing Toolbox, so I'm not sure what these properties are. Since you now have access to the source code of CombinedDatastore (in fullfile(matlabroot, 'toolbox\matlab\datastoreio\+matlab\+io\+datastore')), you could also copy it and make the required modifications.
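If you go the copy-and-modify route, a small sketch like the following may help locate the class file without hard-coding the folder. It assumes the class is on the path under its package-qualified name matlab.io.datastore.CombinedDatastore (R2019a or later); MyCombinedDatastore is a hypothetical name for the copy.

```matlab
% Ask MATLAB where the packaged class is defined.
srcFile = which('matlab.io.datastore.CombinedDatastore');
disp(srcFile)

% To experiment, copy the file into a working folder under a new,
% hypothetical class name. The classdef name inside the copied file
% must be changed to match the new file name before it can be used.
% copyfile(srcFile, fullfile(pwd, 'MyCombinedDatastore.m'));
```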
I'm not sure you will be able to concatenate two tall arrays from the same datastore, since by necessity they will have the same variable names, so horizontal concatenation would create duplicate variable names, which is not allowed. The only way this could work is if you are allowed to modify the variable names of the tall array. See if this works:
DS = tabularTextDatastore({'file_1.txt', 'file_2.txt'});
X1 = tall(datastore(DS.Files{1}));
X2 = tall(datastore(DS.Files{2}));
X2.Properties.VariableNames = compose('X2Var%d', 1:width(X2));
