Finding duplicate strings in a cell array and their index
Jonathan Nastasi
on 11 Apr 2015
Commented: Paul Wintz
on 10 Sep 2021
I have a cell array with more than 100,000 elements that I need to convert to a structure array with four fields. Right now, I have something like:
% cell array = nameData
n = 1;
for j = 2:102
    for i = 2:length(nameData)
        S(n).name = nameData{i,j};
        S(n).frequency = 1;
        n = n+1;
    end
end
However, I need to find duplicate strings in this array and collect information about them. Basically, I am building a database of strings, and if I run across a duplicate, I want to increase the frequency of that string rather than add it to the structure again.
I had been using loops within the previous two loops to achieve this:
for k = 1:n
    if strcmpi(S(k).name, nameData{i,j})
        S(k).frequency = S(k).frequency + 1;
    end
end
However, I always just end up with all 100,000 structure elements. Any other solution I have gotten to work was entirely too slow, and this conversion from cell to structure array must happen in less than 20 seconds.
Thanks!
2 Comments
Paul Wintz
on 10 Sep 2021
The use of i and j as index variables is so ubiquitous in programming that I would say, instead, that you should avoid using i and j as the imaginary unit and use 1i or 1j, which cannot be overwritten.
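As a small illustration of that point (a sketch with made-up values, not taken from the question), a loop variable named i shadows the built-in imaginary unit, while a literal such as 2i is unaffected:

```matlab
for i = 1:3
end                 % after the loop, i is an ordinary variable equal to 3

z1 = 3 + 2*i;       % uses the variable i, so z1 is the real number 9
z2 = 3 + 2i;        % 2i is a numeric literal, so z2 has imaginary part 2
```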
Accepted Answer
Stephen23
on 12 Apr 2015
Edited: Stephen23
on 13 Apr 2015
Learn to write vectorized code to make your code neater, faster and more robust. In MATLAB, loops are not the first choice for solving problems; vectorization is!
This solution takes less than one second on my machine. First we generate an array of fake data, consisting of 100000 two-character strings of random characters:
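N = 100000;
C = cellstr(char(32+randi(94,N,2)));        % N random two-character strings
tic
[D,~,X] = unique(C(:));                     % D: unique strings, X: index into D for each element of C
Y = hist(X,unique(X));                      % number of occurrences of each index
Z = struct('name',D,'freq',num2cell(Y(:))); % one structure element per unique string
toc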
Elapsed time is 0.379057 seconds.
And we can have a look at a random example of the output Z:
>> Z(5).name
ans =
!%
>> Z(5).freq
ans =
12
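To see what the three lines between tic and toc compute, here is the same pipeline on a tiny hand-made list. The five strings are invented for illustration only:

```matlab
C = {'b'; 'a'; 'b'; 'c'; 'b'};   % made-up input with one repeated string
[D,~,X] = unique(C(:));          % D = {'a';'b';'c'}, X = [2;1;2;3;2]
Y = hist(X,unique(X));           % occurrences of each index: 1, 3 and 1
Z = struct('name',D,'freq',num2cell(Y(:)));
% Z(2).name is 'b' and Z(2).freq is 3
```

The third output of unique maps every input element to its position in D, so counting how often each position occurs gives the frequency of each string.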
For newer versions you can use histogram instead. Note that vectorized code scales to larger arrays much better than loops do: even for one million elements in array C, this method took only 4.87 seconds on my machine.
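Adapted to the layout in the question, the same idea might look like the sketch below. It assumes the strings sit in rows 2:end and columns 2:end of nameData (the question uses columns 2:102), and it uses lower to imitate the case-insensitive strcmpi comparison. Counting is done with accumarray, which is a substitution for the hist call used above, and the small nameData here is a hypothetical stand-in for the real data:

```matlab
% Hypothetical stand-in for nameData: first row and first column are headers.
nameData = {'hdr','c1','c2'; 'r1','Ann','bob'; 'r2','BOB','ann'; 'r3','Cy','Ann'};

names = nameData(2:end, 2:end);              % skip headers (2:102 in the question)
[~, pick, idx] = unique(lower(names(:)));    % group strings case-insensitively
counts = accumarray(idx, 1);                 % occurrences per group
S = struct('name', names(pick), 'frequency', num2cell(counts));
% S has one element per distinct string, each keeping one original spelling
```

For this stand-in, S has three elements with frequencies 3, 2 and 1 for 'ann', 'bob' and 'cy' in that order.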