
How do I compare a cell array containing string arrays against a string array without using loops?

I'm comparing two variables (data attached):
  • tableOfTextByTime.("tweetUniqueMentions"), which is a 500x1 cell array. Each cell holds a string array of 0 or more words, and the number of words can differ from cell to cell.
  • tableOfUsers{:,1}, which is a 334x1 string array
The code below works using a for loop, an anonymous function, and cellfun, but it's slow.
It's OK for a small test dataset, but on a real dataset (20,000 x 1 cell array and 5,000 x 1 string array) it takes far too long.
for i = height(tableOfUsers): -1: 1
% create a wrapped strcmp anon fcn that takes each cell element and
% each string element
wStrcmp = @(anonInp1) any(strcmp(anonInp1, tableOfUsers{i,1}));
% create matrix of indices for the entries that match the criteria (500x334)
idxMat(:,i) = cellfun( wStrcmp, tableOfTextByTime.("tweetUniqueMentions"),'UniformOutput',false);
% grab the relevant text that match the criteria
correspondingText{i,1} = tableOfTextByTime(cell2mat(idxMat(:,i)),:);
end
How can I get an equivalent result while drastically speeding up the code? Is there a way to do this in a vectorized or element-wise manner? bsxfun and arrayfun seem to have limitations when working with strings. The Parallel Computing Toolbox is not an option : )
  3 Comments
Walter Roberson on 21 Dec 2022
Please describe in words what the desired outcome is.
  • for each cell, you need to know whether at least one string in the cell appears anywhere in the string array?
  • for each string in each cell, you need to know if the string appears anywhere in the string array?
  • for each string in the string array, you need to know which cells it appears in?
Ed Marquez on 21 Dec 2022
@the cyclist - uploaded sample data that can be shared (sizes may be slightly different than described in the question).
@Walter Roberson - option 3 - the desired outcome, captured in the matrix of indices (idxMat), is to know:
  • for each string in the string array, what are the cells that contain that string?
  • with that knowledge, extract the table rows that match those criteria (tableOfTextByTime(cell2mat(idxMat(:,i)),:))

Accepted Answer

Stephen23 on 21 Dec 2022
Edited: Stephen23 on 21 Dec 2022
S = load('answersData.mat');
A = S.tableOfTextByTime.("tweetUniqueMentions")
A = 100×1 cell array
{["" ]} {["vids_v" ]} {["ItsRoshni08070" ]} {["" ]} {["shradhaarao" ]} {["" ]} {["GemsOfBollywood"]} {["" ]} {["" ]} {["simmyxchauhan" ]} {["" ]} {["" ]} {["" ]} {["cheerlights" ]} {["Sumanth_077" ]} {2×1 string }
B = S.tableOfUsers{:,1}
B = 88×1 string array
"ArylieSumaan" "BeingSalmanKhan" "ColorsTV" "MaximZiatdinov" "RRejeleene" "ANINewsUP" "Arsalan418296" "Aslam29Munawar" "BOLNETWORK" "BiggBoss" "DramebaazPorgi" "FIFAcom" "FawadAhsanFawad" "GemsOfBollywood" "HarisRauf14" "High_735" "ItsRoshni08070" "Keth_2000" "KhadkaDeepali" "KhudaJaane_" "MahuaMoitra" "MathWorks" "NVIDIAAI" "NaeemRehmanEngr" "OrmaxMedia" "PSushreesangita" "RashamiXmagic" "Rsumaiya" "SUBWAY" "SabirMehmood26"
tic
T = vertcat(A{:});
X = repelem(1:numel(A),cellfun(@numel,A)).';
[Y,Z] = ismember(T,B);
F = @(x) {S.tableOfTextByTime(x,:)};
C = accumarray(Z(Y),X(Y),[],F);
toc
Elapsed time is 0.043696 seconds.
C
C = 88×1 cell array
{2×8 table} {2×8 table} {2×8 table} {2×8 table} {2×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table}
Let's compare against a loop (as the cyclist mentioned, faster due to moving the indexing out of the loop):
tic
for kk = numel(B):-1: 1
% create a wrapped strcmp anon fcn that takes each cell element and
% each string element
wStrcmp = @(anonInp1) any(strcmp(anonInp1, B{kk}));
% create matrix of indices for the entries that match the criteria (500x334)
idxMat(:,kk) = cellfun( wStrcmp, A,'UniformOutput',false);
%idx = cell2mat(idxMat(:,kk))
% grab the relevant text that match the criteria
correspondingText{kk,1} = S.tableOfTextByTime(cell2mat(idxMat(:,kk)),:);
end
toc
Elapsed time is 0.066449 seconds.
correspondingText
correspondingText = 88×1 cell array
{2×8 table} {2×8 table} {2×8 table} {2×8 table} {2×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table} {1×8 table}
isequal(C,correspondingText)
ans = logical
1
  1 Comment
Ed Marquez on 21 Dec 2022
Thank you @the cyclist and @Stephen23!
Using vertcat, repelem, ismember, an anonymous function, and accumarray results in a massive speedup.
For a real dataset of size 14,025 by 4,444 this now completes in 0.501984 seconds. That's unimaginable compared to what I was seeing before. I'll need to read up on accumarray and what it's doing, but it all works as intended. Happy holidays.
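For anyone else reading up on what accumarray is doing here, below is a minimal sketch of the same vertcat / repelem / ismember / accumarray pipeline on toy data. The values of A and B and the name rowsPerUser are made up for illustration (they are not from answersData.mat), and the output size is passed explicitly, which the accepted answer does not do. The idea, as far as I understand it: flatten all mentions into one list, remember which cell each mention came from, look every mention up in the user list once, then group the source-cell numbers by matched user.

```matlab
% Toy data (hypothetical, not taken from answersData.mat)
A = {["a";"b"]; strings(0,1); "c"; ["b";"c"]};   % 4x1 cell, each cell a string column
B = ["b";"c";"z"];                                % 3x1 string array of "users"

% 1) Stack every mention from every cell into one column
T = vertcat(A{:});                                % ["a";"b";"c";"b";"c"]

% 2) For each stacked mention, record which cell (row) it came from
X = repelem((1:numel(A)).', cellfun(@numel, A));  % [1;1;3;4;4]

% 3) One lookup for all mentions: Y = "is in B", Z = position in B (0 if absent)
[Y, Z] = ismember(T, B);

% 4) Group the source rows by matched user. The function returns a cell so
%    that each user can collect a different number of rows. sort() is used
%    because accumarray does not promise the order of values within a group.
rowsPerUser = accumarray(Z(Y), X(Y), [numel(B) 1], @(x) {sort(x)});

% rowsPerUser{k} lists the cells that mention B(k); users that are never
% mentioned are left as empty cells.
```

In the accepted answer the grouping function indexes the table instead of returning the row numbers, so each output cell holds the matching table rows directly.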

More Answers (1)

the cyclist on 21 Dec 2022
I think more can be done, but here are a couple of improvements that make the small test case faster. Hopefully it is an even larger speed-up on your real problem.
load("answersData.mat")
tic
for i = height(tableOfUsers): -1: 1
% create a wrapped strcmp anon fcn that takes each cell element and
% each string element
wStrcmp = @(anonInp1) any(strcmp(anonInp1, tableOfUsers{i,1}));
% create matrix of indices for the entries that match the criteria (500x334)
idxMat(:,i) = cellfun( wStrcmp, tableOfTextByTime.("tweetUniqueMentions"),'UniformOutput',false);
% grab the relevant text that match the criteria
correspondingText{i,1} = tableOfTextByTime(cell2mat(idxMat(:,i)),:);
end
toc
Elapsed time is 0.544876 seconds.
tic
% Preallocate, and pull out desired subset of data (so indexing doesn't need to be done repeatedly)
idxMat2 = false(height(tableOfTextByTime),height(tableOfUsers));
C = tableOfTextByTime.("tweetUniqueMentions");
T = tableOfUsers{:,1};
for i = height(tableOfUsers): -1: 1
% create a wrapped strcmp anon fcn that takes each cell element and
% each string element
wStrcmp2 = @(anonInp1) any(strcmp(anonInp1, T(i)));
% create matrix of indices for the entries that match the criteria (500x334)
idxMat2(:,i) = cellfun( wStrcmp2, C);
% grab the relevant text that match the criteria
correspondingText2{i,1} = tableOfTextByTime((idxMat2(:,i)),:);
end
toc
Elapsed time is 0.062504 seconds.
% Test that the two methods result in the same output
isequal(correspondingText,correspondingText2)
ans = logical
1
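If the full logical index matrix is still wanted (one column per user, as idxMat is in the question), one possible variation is to call ismember once per cell instead of strcmp once per cell-and-user pair. This is a different technique from the loop above, shown only as a sketch on made-up toy data; the variable names are hypothetical and I have not timed it on the real dataset.

```matlab
% Toy data (hypothetical, not taken from answersData.mat)
A = {["a";"b"]; strings(0,1); "c"; ["b";"c"]};   % 4x1 cell, each cell a string column
B = ["b";"c";"z"];                                % 3x1 string array of "users"

% For each cell, test every user against that cell's mentions in one call.
% ismember(B, c) is numel(B)-by-1, so it is transposed into a row.
rowsAsCells = cellfun(@(c) ismember(B, c).', A, 'UniformOutput', false);

% Stack the rows into a numel(A)-by-numel(B) logical matrix
idxMatToy = vertcat(rowsAsCells{:});

% Column k marks the cells that mention B(k)
cellsMentioningC = find(idxMatToy(:, 2));
```

Note that the matrix has numel(A)*numel(B) elements, so for the larger sizes mentioned in the question it may cost noticeably more memory than grouping row numbers with accumarray.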

Release

R2022b
