Fastest way to replace multipe substrings with a single new string?

7 Ansichten (letzte 30 Tage)
Omar Salah
Omar Salah am 6 Jun. 2020
Kommentiert: Omar Salah am 18 Jun. 2020
Hello Everyone,
I'm trying to replace 7k different substrings with the same Tag in a 50 milllion words dataset (cell array of size 1 million of strings of average size 50 words). and as you can see, using replace or regexprep takes a long time. I tried using strrep the same way as replace but it gives me this error.
Error using strrep
All nonscalar inputs must be the same size.
I want to ask, what is the fastest and less memory consuming way to do it?
Here is the code:
%using replace
Tag='IMPORTANT'
substr={'very','much'} % a cell array of +7k words
reptag=cell(1,size(substr,2));
tagcell=cellfun(@(x) Tag,reptag,'Uniformoutput',false);
maintext=replace(maintext,substr,tagcell);
% using regexprep
ev='(';
for evi=1:size(substr,2)
ev=[ev substr '|'];
end
ev=[ev(1:end-1) ')'];
maintext=regexprep(maintext,ev,Tag);
  4 Kommentare
Omar Salah
Omar Salah am 10 Jun. 2020
@james I can actually work with both. Either a cella rray of character vectors or a cell of strings. I move between them easily. Is one type faster than the other?
Omar Salah
Omar Salah am 10 Jun. 2020
@stephen I never worked with C++ but I'm wondering, why would they be faster? Is it because they are compiled or because C++ functions are generally faster?

Melden Sie sich an, um zu kommentieren.

Antworten (1)

Mohammad Sami
Mohammad Sami am 11 Jun. 2020
After some experimentations I think that if you tokenize your sentences, you can use a hashmap to lookup the words to replace.
An example code is as follows. If you want case insensitive matching, use function lower on both the words and sentences.
substr = cellstr(substr);
w = containers.Map(substr,substr); %create a hashmap of substring you want to replace
m2 = cellstr(sentences);
m5 = cell(length(m2),1);
for i = 1:length(m2)
m3 = split(m2{i},' '); % tokenize the sentence
m4 = w.isKey(m3); % lookup which words to replace
m3(m4) = {'IMPORTANT'}; % replace the words
m5(i) = join(m3,' '); % store the updated sentence
end
  1 Kommentar
Omar Salah
Omar Salah am 18 Jun. 2020
Wow! thanks. that's definitely something to try. I will try it tonight ang get back to you :)

Melden Sie sich an, um zu kommentieren.

Kategorien

Mehr zu Characters and Strings finden Sie in Help Center und File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by