Code to search for a word in a file taking too long to execute.

Question

Samyukta Ramnath on 16 Aug 2013

Direct link to this question

https://de.mathworks.com/matlabcentral/answers/84883-code-to-search-for-a-word-in-a-file-taking-too-long-to-execute

Closed: MATLAB Answer Bot on 20 Aug 2021

I have code that has to process a very large amount of text. File A has around 0.4 million sentences, and file B has around 15000 words. For every word in file B, I need to search for that word in file A, so I do a strcmp against the data from file A and use it to return a result. I currently have a function that takes a word (from file B) as an argument and searches through all the words in file A. This function is called around 15000 times (once per word in file B), and it is taking ages to complete in MATLAB. Python, however, does the same job with the same method in much less time. Is there a way to improve the speed? The code that I have written:

function e = emission(word, tag)
global Table3;
count_word_tag = 0;
count_tag = 0;
for i = 1:size(Table3,2)
  if strcmpi(word,Table3{i})
      if mod(i,4)==0
          if strcmpi(tag, Table3{i-1}) 
              i
              count_word_tag = count_word_tag + str2num(Table3{i-3});
              break;
          end
      end
  end
end
for i = 1:size(Table3,2)
  if strcmp(tag,Table3{i})
      if mod(i,4) == 3
          count_tag = count_tag + str2num(Table3{i-2});
      end
      if isempty(count_tag)
          break;
      end
  end
end
e = count_word_tag./count_tag;
end

And the code where I have called the function is

    for i = 1:size(inputTable,2)
      e1 = emission(inputTable{i},'O');
      e2 = emission(inputTable{i},'I-GENE');
      f(i) = max(e1,e2);
      disp('iteration no.'),i
    end

This question is closed.

Answer 1

Cedric on 16 Aug 2013

Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/84883-code-to-search-for-a-word-in-a-file-taking-too-long-to-execute#answer_94427

Edited: Cedric on 16 Aug 2013

Have you tried using regular expressions? To count the occurrences in file A of each word from file B, you could do something like

 content_A = fileread('file_A.txt') ;
 counts = zeros(1e6, 1) ;                 % Overshoot prealloc.
 words  = cell(1e6, 1) ;
 wordId = 0 ;
 fid_B = fopen('file_B.txt', 'r') ;
 while ~feof(fid_B)
    wordId         = wordId + 1 ;
    words{wordId}  = fgetl(fid_B) ;
    starts  = regexpi(content_A, regexptranslate('escape', words{wordId})) ;    % If case doesn't matter; escape regex metacharacters.
    %starts  = strfind(content_A, words{wordId}) ;   % If case matters.
    counts(wordId) = length(starts) ;
 end
 fclose(fid_B) ;
 counts = counts(1:wordId) ;              % Truncate to true size.
 words  = words(1:wordId) ;

But I don't really understand what you are doing in your code. How do you read files, how is emission() defined, how do you define word, tag, Table3, inputTable?

2 Comments

Samyukta Ramnath on 16 Aug 2013

function e = emission(word, tag)
global Table3;
count_word_tag = 0;
count_tag = 0;
for i = 1:size(Table3,2)
  if strcmpi(word,Table3{i})
      if mod(i,4)==0
          if strcmpi(tag, Table3{i-1}) 
              i
              count_word_tag = count_word_tag + str2num(Table3{i-3});
              break;
          end
      end
  end
end
for i = 1:size(Table3,2)
  if strcmp(tag,Table3{i})
      if mod(i,4) == 3
          count_tag = count_tag + str2num(Table3{i-2});
      end
      if isempty(count_tag)
          break;
      end
  end
end
e = count_word_tag./count_tag;
end

This is the emission function. Table3 is a cell array of strings with 0.4 million elements, and inputTable is another cell array of strings with 15000 elements. I need to iteratively take every element in inputTable and compare it with every element in Table3.
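
Since the same table is scanned once per input word, one option is to index Table3 a single time and then answer each query with a lookup. The following is only a minimal sketch, not the poster's method. It assumes that Table3 holds records of four cells each (count, an unused label, tag, word), which is the layout the index arithmetic in emission() implies, and that no word contains the '|' character used here as a key separator. The toy data and the names T, lookup, key and hits are hypothetical.

```matlab
% Hypothetical toy data, assuming records of four cells: count, label, tag, word.
Table3     = {'3','WORDTAG','O','the', '1','WORDTAG','I-GENE','p53', '2','WORDTAG','O','of'};
inputTable = {'The', 'p53', 'unseen'};

T      = reshape(Table3, 4, []);       % one column per record
counts = str2double(T(1, :));          % row 1 holds the counts

% One pass over Table3: the key is the lower-case word plus the upper-case tag.
lookup = containers.Map('KeyType', 'char', 'ValueType', 'double');
for k = 1:size(T, 2)
    key = [lower(T{4, k}) '|' upper(T{3, k})];
    if ~isKey(lookup, key)             % keep the first match, like the break in emission()
        lookup(key) = counts(k);
    end
end

% Each query is now a single map lookup instead of a scan of Table3.
tag  = 'O';
hits = zeros(1, numel(inputTable));
for i = 1:numel(inputTable)
    key = [lower(inputTable{i}) '|' upper(tag)];
    if isKey(lookup, key)
        hits(i) = lookup(key);         % stays 0 when the word/tag pair is absent
    end
end
```

Here hits plays the role of count_word_tag for each input word. The per-tag total (count_tag) does not depend on the word, so it could be computed once outside the loop.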

Cedric on 16 Aug 2013

Edited: Cedric on 16 Aug 2013

Why iteratively? Couldn't you work on the whole cell arrays in one shot? And what happens, for example, when you use the tag 'I-GENE'? It looks like when strcmpi(tag, Table3{i-1}) is true, which means that Table3{i-1} is 'I-GENE', you are converting this string to num and adding the result to a counter, which makes little sense. Also, it seems that Table3 has two dimensions; what are they? Yet you address this cell array with a single index i.

I think that the simplest would be to make a small example which shows e.g. 20 entries of Table3, a word and a tag, and explain what you would like to obtain with that.
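
As an illustration of working on the whole cell array in one shot, here is a hedged sketch of a loop-free version of the emission computation. It is not taken from the thread. It assumes that Table3 consists of four-cell records (count, an unused label, tag, word), as the mod(i,4) tests in the posted function suggest, and that the number of cells is a multiple of four. The toy data and the variable names T, isTag, isWordTag and firstHit are hypothetical.

```matlab
% Hypothetical toy data, assuming records of four cells: count, label, tag, word.
Table3 = {'3','WORDTAG','O','the', '1','WORDTAG','I-GENE','p53', '2','WORDTAG','O','of'};
word   = 'the';
tag    = 'O';

T      = reshape(Table3, 4, []);       % one column per record
counts = str2double(T(1, :));          % row 1: counts as numbers
tags   = T(3, :);                      % row 3: tags
words  = T(4, :);                      % row 4: words

% Whole-array comparisons replace the two for-loops.
isTag     = strcmp(tags, tag);                            % case-sensitive, as in the second loop
isWordTag = strcmpi(words, word) & strcmpi(tags, tag);    % case-insensitive, as in the first loop
firstHit  = find(isWordTag, 1);                           % the original loop stops at the first match

count_word_tag = sum(counts(firstHit));   % sum of an empty selection is 0
count_tag      = sum(counts(isTag));
e = count_word_tag / count_tag;
```

In practice the reshape and str2double steps would be done once, outside any loop over input words, so that each call only performs the logical comparisons.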
