Building tall table from tall arrays generates error

2 Ansichten (letzte 30 Tage)
Harry Cho
Harry Cho am 14 Mär. 2023
Kommentiert: Harry Cho am 15 Mär. 2023
clear
dataFile = 'data.csv';
ds = tabularTextDatastore(dataFile, FileExtensions='.csv');
ds.ReadVariableNames = true;
ds.Delimiter = ',';
ds.SelectedVariableNames = ["hash", "count"];
ds.SelectedFormats = {'%s', '%f'};
data = tall(ds);
Starting parallel pool (parpool) using the 'Processes' profile ... Connected to the parallel pool (number of workers: 2).
[g, THash] = findgroups(data.hash);
TCount = splitapply(@(x) {x}, data.count, g);
%% This works but cannot use it because actual data file is far larger than memory
hash = gather(THash);
Evaluating tall expression using the Parallel Pool 'Processes': - Pass 1 of 1: 0% complete - Pass 1 of 1: 100% complete - Pass 1 of 1: Completed in 1.9 sec Evaluation completed in 2.8 sec
count = gather(TCount);
Evaluating tall expression using the Parallel Pool 'Processes': - Pass 1 of 3: 0% complete - Pass 1 of 3: 100% complete - Pass 1 of 3: Completed in 0.54 sec - Pass 2 of 3: 0% complete - Pass 2 of 3: 100% complete - Pass 2 of 3: Completed in 0.46 sec - Pass 3 of 3: 0% complete - Pass 3 of 3: 100% complete - Pass 3 of 3: Completed in 0.58 sec Evaluation completed in 2.3 sec
T1 = table(hash, count);
%% This is the intended code but doesn't work
TT = table(THash,TCount);
Error using tall/table
Incompatible non-scalar tall array arguments. Each of the tall arrays must be the same size in the first dimension, must be derived from a single tall array, and must not have been indexed
differently in the first dimension (indexing operations include functions such as VERTCAT, SPLITAPPLY, SORT, CELL2MAT, SYNCHRONIZE, RETIME and so on).
write(fullfile(pwd,'data'),TT,FileType="parquet");

Antworten (1)

Oguz Kaan Hancioglu
Oguz Kaan Hancioglu am 15 Mär. 2023
Your code wasn't work because "gather(TCount)" returns cell array for each element. Therefore you are trying to write double array in to one single cell. You can find the length of each array into the cell. I hope this solves your problem.
%% This works but cannot use it because actual data file is far larger than memory
hash = gather(THash);
count = gather(TCount);
cellsz = cellfun(@size,count,'uni',false);
newCount = cellfun(@(x) x(1),cellsz,'UniformOutput',false)
T1 = table(hash, newCount);
  1 Kommentar
Harry Cho
Harry Cho am 15 Mär. 2023
Thank you for the reply. Unfortunately I have to collect cell array, in which each cell has different length of double array. My question is why it works in-memory table T1 but not in tall table TT.

Melden Sie sich an, um zu kommentieren.

Kategorien

Mehr zu Analysis of Big Data with Tall Arrays finden Sie in Help Center und File Exchange

Produkte


Version

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by