Fastest way to find text keywords out of large amount of textual news sentences?

Question

Song Decn on 24 Jan 2021

Answered: Walter Roberson on 8 Feb 2021

Hello, I have a database containing over 900,000 lines of news, and I want to scan these lines of text for certain keywords. I tried

tic; strfind(newsDb.SingleNewline, kws{1}); toc
tic; contains(newsDb.SingleNewline, kws{1}); toc

Both take over 0.003 sec to search for one keyword in one news line.

If I want to create a new database with over 20,000 keywords, then it would take

900000 * 20000 * 0.003 / 60 / 60 / 24

over 600 days to do this. :(

Does anyone have an idea how to do this within one or two days?

Thank you very much

Walter Roberson on 24 Jan 2021

You have not defined your desired output. Is it:

for each different keyword, a list of the positions that the word occurs at, for each different line, for each news article?
for each news article, a list of all of the keywords found in it?
for each news article, a list per line of all of the keywords found on the line?
for each keyword, a list of all of the news articles the keyword was found in?

Because if what you really want to know is which keywords were found in each news article, or if you just want to know which news articles match at least one keyword, then there are more efficient ways.

Question: what do you want to do about substrings, such as "bus" occurring inside "busy", or about the fact that the word "strudels" contains a rude word? What do you want to do about pluralizations, which may or may not be regular -- if the keyword is "cat" then should "cats" be matched? If "bus" is the keyword should "busses" be matched? If "mouse" is the keyword should "mice" be matched? If "moose" is the keyword, should "meese" be matched?

Song Decn on 28 Jan 2021

Hi Walter, thank you for your questions and advice. My question is this: I have a database containing the following news headlines (ca. 890,000 lines):

"Elon musk is the richest man on the planet"
"Elon musk is the poorest man on the mars"
"Trump is the president of US"
"Trump is the not president of US"

Then I have a database of tags like "Musk" "Trump" "Mars"

I would like to create a new database with index:

"Musk" - {1,2}

"Trump" - {3,4}

"Mars" - {2}

regardless of the case in which the tag appears in the news headlines, matching whole words only.

The problem is that the tag database contains ca. 4300 tags, and strfind / contains each take too much time.

----------------------------------------

I have now found that "regexp" can scan a line of text for multiple tags at once very quickly (I don't know why). With the help of parfor, I have reduced the processing time to approx. 3 hours for the whole sorting job. If you have a much better solution, please let me know.

Walter Roberson on 28 Jan 2021

What do you want to do about substrings, and plurals, and upper/lowercase and the other factors I asked about? For example if the headline were "Elon visits Oak Hammock Marsh" then is it acceptable that this would match "Mars" ? And "Elon eats musk-melon" ? And "Eucre trumps Bridge in recent poll" ?
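
To make the difference concrete, here is a minimal sketch comparing a substring test with a whole-word, case-sensitive test. The three headlines are made up for illustration (the third is not from the thread), and the variable names are hypothetical.

```matlab
% Illustrative headlines (assumed data, not from the real database).
headlines = ["Elon visits Oak Hammock Marsh"
             "Elon eats musk-melon"
             "Elon Musk lands on Mars"];

% Substring test: contains() is case-sensitive by default, but it still
% reports "Mars" inside "Marsh".
substringHit = contains(headlines, "Mars");

% Whole-word test: \< and \> anchor the match at the start and end of a word.
isWholeWord = @(txt, tag) ~isempty(regexp(txt, ['\<' tag '\>'], 'once'));
marsHit = cellfun(@(h) isWholeWord(h, 'Mars'), cellstr(headlines));
muskHit = cellfun(@(h) isWholeWord(h, 'Musk'), cellstr(headlines));
```

With these inputs, the substring test flags the "Marsh" headline while the whole-word test does not, and the lowercase "musk" in "musk-melon" is not matched by the case-sensitive tag "Musk".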

Song Decn on 8 Feb 2021

Hi Walter. Searching is based on whole word and same case. Thx.

Answer 1

Walter Roberson on 8 Feb 2021

You can do the search phase efficiently:

   S = [ "Elon Musk is the richest man on the planet"
    "Elon Musk is the poorest man on Mars"
    "Trump is the president of US"
    "Elon eats musk-melon"
    "Eucre Trumps Bridge in recent poll"
    "Trump is the not president of US"]
S = 6×1 string array
    "Elon Musk is the richest man on the planet"
    "Elon Musk is the poorest man on Mars"
    "Trump is the president of US"
    "Elon eats musk-melon"
    "Eucre Trumps Bridge in recent poll"
    "Trump is the not president of US"
  Tags = ["Musk" "Trump" "Mars"]
Tags = 1×3 string array
    "Musk"    "Trump"    "Mars"
 numTags = length(Tags);
  
  pattern = "\<(?<word>(" + strjoin(Tags, "|") + "))\>"
pattern = "\<(?<word>(Musk|Trump|Mars))\>"
  search_results = regexp(S, pattern, 'names')
search_results = 6x1 cell array
    {1×1 struct}
    {1×2 struct}
    {1×1 struct}
    {0×0 struct}
    {0×0 struct}
    {1×1 struct}

However, the output is not really what you want: it lists the tags that were matched in each headline, and it needs to be rearranged to show where each tag was found.

  tags_matched = cellfun(@(C) string({C.word}), search_results, 'uniform', 0).'
tags_matched = 1x6 cell array
    {["Musk"]}    {1×2 string}    {["Trump"]}    {0×0 string}    {0×0 string}    {["Trump"]}
 TagWasFoundAt = cell(numTags,1);
 for K = 1 : numTags; TagWasFoundAt{K} = find(cellfun(@(C) ismember(Tags{K}, C), tags_matched)); end
 [cellstr(Tags(:)), TagWasFoundAt]
ans = 3x2 cell array
    {'Musk' }    {1×2 double}
    {'Trump'}    {1×2 double}
    {'Mars' }    {[       2]}
 
 %OR
 match_bits = cell2mat(cellfun(@(C) ismember(Tags, string({C.word})), search_results, 'uniform', 0));
 TagWasFoundAt = arrayfun(@(COL) find(match_bits(:,COL)).', (1:numTags).', 'uniform', 0);
 [cellstr(Tags(:)), TagWasFoundAt]
ans = 3x2 cell array
    {'Musk' }    {1×2 double}
    {'Trump'}    {1×2 double}
    {'Mars' }    {[       2]}

It is likely that there are other ways to do the matching from tags to entries.

The first of those two is probably more efficient, but the match_bits array would be useful if you wanted a single data structure that you could easily query to find out which articles contain a particular tag, or which tags a particular article contains. The match_bits array is also good for boolean searches, for example finding articles that contain Musk or Mars but not Trump:

 (match_bits(:,1) | match_bits(:,3)) & ~match_bits(:,2)
ans = 6x1 logical array
   1
   1
   0
   0
   0
   0
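
A few more queries against the same kind of logical matrix, as a sketch. To keep the example self-contained, match_bits is rebuilt here with a plain loop over the tags rather than with the cellfun call above; the loop and the result variable names are assumptions for illustration only.

```matlab
S = ["Elon Musk is the richest man on the planet"
     "Elon Musk is the poorest man on Mars"
     "Trump is the president of US"
     "Elon eats musk-melon"
     "Eucre Trumps Bridge in recent poll"
     "Trump is the not president of US"];
Tags = ["Musk" "Trump" "Mars"];
numTags = numel(Tags);

% Rows are articles, columns are tags (whole-word, case-sensitive match).
match_bits = false(numel(S), numTags);
for K = 1 : numTags
    hits = regexp(cellstr(S), char("\<" + Tags(K) + "\>"), 'once');
    match_bits(:,K) = ~cellfun(@isempty, hits);
end

tagsInArticle2    = Tags(match_bits(2,:));         % tags found in article 2
articlesWithTrump = find(match_bits(:,2)).';       % articles containing "Trump"
articlesPerTag    = sum(match_bits, 1);            % hit count for each tag
articlesWithNoTag = find(~any(match_bits, 2)).';   % articles matching no tag
```

Row indexing answers "which tags does this article contain", column indexing answers "which articles contain this tag", and sum/any give per-tag counts and unmatched articles without another pass over the text.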

There might be better ways of doing the matching.
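
One possible alternative, sketched here as an assumption rather than a tested recommendation: split every headline into words once, look all words up against the tag list with ismember, and group the hits by tag with accumarray. This swaps the regexp alternation for a word lookup, and it only gives the same answer when every tag is a single word made of letters, digits, or underscores. All variable names other than S, Tags, and TagWasFoundAt are hypothetical.

```matlab
S = ["Elon Musk is the richest man on the planet"
     "Elon Musk is the poorest man on Mars"
     "Trump is the president of US"
     "Elon eats musk-melon"
     "Eucre Trumps Bridge in recent poll"
     "Trump is the not president of US"];
Tags = ["Musk" "Trump" "Mars"];

% Split each headline into words (runs of letters, digits, underscore).
words = regexp(cellstr(S), '\w+', 'match');        % one cell of words per line

% Flatten to one long word list and remember which line each word came from.
wordsPerLine = cellfun(@numel, words);
lineIdx  = repelem((1:numel(S)).', wordsPerLine);
allWords = [words{:}].';

% Case-sensitive lookup of every word in the tag list.
[isTag, tagIdx] = ismember(allWords, cellstr(Tags));

% Group line numbers by tag; tags that never occur get an empty cell.
TagWasFoundAt = accumarray(tagIdx(isTag), lineIdx(isTag), ...
                           [numel(Tags) 1], @(v) {unique(v).'});
[cellstr(Tags(:)), TagWasFoundAt]
```

On the six example headlines this should give the same index as above: Musk in lines 1 and 2, Trump in lines 3 and 6, Mars in line 2. Whether it is faster on 900,000 lines would need to be timed on the real data.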
