Using strcmpi (or similar) to extract indexes with multiple strings
11 Ansichten (letzte 30 Tage)
Ältere Kommentare anzeigen
Hi team,
I need to be able to locate the row in a spreadsheet which contains a string in column A AND a string in column B
The strings are different but remain the same throughout my large set of files, can I use strcmpi to do this (potentially with an & or something)?
I have ~90k files to go through, some files contain additional data above the specific row I need which is why I need to use an index for it and not use a simple range technique
The code below is for a different set of data that I used strcmpi for, is there anything that can be done to simply modify this?
rowix = [];
file = convertCharsToStrings(filedir(3).name);
[num,txt,raw] = xlsread(file);
for k = 1:length(duid)
p = strcmpi(duid(k,1),raw(:,6));
rownum = find(p==1);
rowix = [rowix; rownum];
end
rowix = sort(rowix);
will the ismember() function be better here?
Thanks
1 Kommentar
Walter Roberson
am 25 Okt. 2022
I have ~90k files to go through
I recommend that you switch from xlsread() to readtable() .
Antworten (1)
Walter Roberson
am 25 Okt. 2022
will the ismember() function be better here?
You specifically use strcmpi() so we will assume that you need a case-insensitive match. In such a situation, to use ismember() you would need to ismember(lower(TERM), lower(TARGET))
ismember() is known to work by sorting the target and then doing binary searches. sorting is an O(n*log(n)) execution time, and then binary search is O(log(n)) so the overall search time for ismember(A,B) would be roughly numel(B)*log(numel(B)) + numel(A) * log(numel(B))... which would work out as roughly O( log(numel(B)) * (numel(A) + numel(B)) )
The method used by strcmpi() is probably linear. Where ismember(A,B) permits looking for each member of A in each member of B, strcmpi(A,B) requires A to be scalar, or else A must be the same size as B and corresponding elements are compared. Thus, strcmpi(A,B) does not have any need to be able to efficiently compare multiple elements in A to multiple elements in B. A linear search that compared each element of A to each element of B would require numel(A) * numel(B) time -- which could be considerably higher than O( log(numel(B)) * (numel(A) + numel(B)) ) . So having both algorithms available is useful.
You are comparing multiple duid to multiple raw(:,6) so potentially ismember() would be faster if you compared everything in one call
Let us try some timing...
letters = ['A' : 'Z', 'a' : 'z', '0' : '9'];
numletters = numel(letters);
N = 1e5;
raw6 = cell(N, 1);
for K = 1 : N
raw6{K} = letters(randi(numletters, 1, randi(20)));
end
M = 25;
duid = letters(randi(numletters, 1, 10));
strcmp_nomatch = zeros(M,1);
member_nomatch = zeros(M,1);
lower_nomatch = zeros(M,1);
for K = 1 : M; start = tic; strcmpi(duid, raw6); stop = toc(start); strcmp_nomatch(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_nomatch(K) = stop; end
for K = 1 : M; start = tic; lower(duid); lower(raw6); stop = toc(start); lower_nomatch(K) = stop; end
duid = upper(raw6{1});
strcmp_first = zeros(M,1);
member_first = zeros(M,1);
lower_first = zeros(M,1);
for K = 1 : M; start = tic; strcmpi(duid, raw6); stop = toc(start); strcmp_first(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_first(K) = stop; end
for K = 1 : M; start = tic; lower(duid); lower(raw6); stop = toc(start); lower_first(K) = stop; end
duid = upper(raw6{floor(end/2)});
strcmp_middle = zeros(M,1);
member_middle = zeros(M,1);
lower_middle = zeros(M,1);
for K = 1 : M; start = tic; strcmpi(duid, raw6); stop = toc(start); strcmp_middle(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_middle(K) = stop; end
for K = 1 : M; start = tic; lower(duid); lower(raw6); stop = toc(start); lower_middle(K) = stop; end
duid = upper(raw6{end});
strcmp_last = zeros(M,1);
member_last = zeros(M,1);
lower_last = zeros(M,1);
for K = 1 : M; start = tic; strcmpi(duid, raw6); stop = toc(start); strcmp_last(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_last(K) = stop; end
for K = 1 : M; start = tic; lower(duid); lower(raw6); stop = toc(start); lower_last(K) = stop; end
strcmp_means = mean( [strcmp_nomatch, strcmp_first, strcmp_middle, strcmp_last] );
member_means = mean( [member_nomatch, member_first, member_middle, member_last] );
lower_means = mean( [lower_nomatch, lower_first, lower_middle, lower_last] );
T = array2table([strcmp_means; member_means; lower_means], ...
'VariableNames', {'no match', 'first', 'middle', 'last'}, ...
'RowNames', {'strcmpi', 'ismember', 'lower() only'} );
disp(T)
plot([strcmp_nomatch, strcmp_first, strcmp_middle, strcmp_last, ...
member_nomatch, member_first, member_middle, member_last, ...
lower_nomatch, lower_first, lower_middle, lower_last] );
legend( {'strcmpi no match', 'strcmpi first', 'strcmpi middle', 'strcmpi last', ...
'ismember no match', 'ismember first', 'ismember middle', 'ismember last', ...
'lower() no match', 'lower() first', 'lower() middle', 'lower() last' }, ...
'location', 'eastoutside');
What does this tell us? Well, it tells us that the great majority of the time difference between the strcmpi() approach and the ismember() approach is mostly due to the calls to lower() that are needed for the case where the cases of the values do not match. If it happened to be the case that the stored data in raw(:,6) was already all lower-case or all upper-case, then most of the time could be avoided.
(But if you look at the difference between the lower() only version and the ismember() time, you can see that the ismember() version must still have been slower than the strcmp()
1 Kommentar
Walter Roberson
am 25 Okt. 2022
And what if case sensitivity is used so we do not need to take into account converting case?
letters = ['A' : 'Z', 'a' : 'z', '0' : '9'];
numletters = numel(letters);
N = 1e5;
raw6 = cell(N, 1);
for K = 1 : N
raw6{K} = letters(randi(numletters, 1, randi(20)));
end
M = 25;
duid = letters(randi(numletters, 1, 10));
strcmp_nomatch = zeros(M,1);
member_nomatch = zeros(M,1);
for K = 1 : M; start = tic; strcmp(duid, raw6); stop = toc(start); strcmp_nomatch(K) = stop; end
for K = 1 : M; start = tic; ismember(duid, raw6); stop = toc(start); member_nomatch(K) = stop; end
duid = raw6{1};
strcmp_first = zeros(M,1);
member_first = zeros(M,1);
for K = 1 : M; start = tic; strcmp(duid, raw6); stop = toc(start); strcmp_first(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_first(K) = stop; end
duid = raw6{floor(end/2)};
strcmp_middle = zeros(M,1);
member_middle = zeros(M,1);
for K = 1 : M; start = tic; strcmp(duid, raw6); stop = toc(start); strcmp_middle(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_middle(K) = stop; end
duid = raw6{end};
strcmp_last = zeros(M,1);
member_last = zeros(M,1);
for K = 1 : M; start = tic; strcmp(duid, raw6); stop = toc(start); strcmp_last(K) = stop; end
for K = 1 : M; start = tic; ismember(lower(duid), lower(raw6)); stop = toc(start); member_last(K) = stop; end
strcmp_means = mean( [strcmp_nomatch, strcmp_first, strcmp_middle, strcmp_last] );
member_means = mean( [member_nomatch, member_first, member_middle, member_last] );
T = array2table([strcmp_means; member_means], ...
'VariableNames', {'no match', 'first', 'middle', 'last'}, ...
'RowNames', {'strcmpi', 'ismember'} );
disp(T)
plot([strcmp_nomatch, strcmp_first, strcmp_middle, strcmp_last, ...
member_nomatch, member_first, member_middle, member_last] );
legend( {'strcmp no match', 'strcmp first', 'strcmp middle', 'strcmp last', ...
'ismember no match', 'ismember first', 'ismember middle', 'ismember last'}, ...
'location', 'eastoutside');
You can see that strcmp() slows down as it has to search further to find the match, but that ismember() does not. But strcmp() is still more than twice as fast.
Siehe auch
Kategorien
Mehr zu Large Files and Big Data finden Sie in Help Center und File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!