Problem with Fuzzy Substring Matching (brainteaser)

Question

Léon on 15 Dec 2011

https://de.mathworks.com/matlabcentral/answers/24036-problem-with-fuzzy-substring-matching-brainteaser

Hello,

I implemented the Levenshtein algorithm in MATLAB and modified it so that it can search for substrings within a longer string (by setting the insertion costs at the beginning to zero).

Example:

original_levenshtein('Hello Tim','Tim') = 6
modified_levenshtein('Hello Tim','Tim') = 0

Where 0 is a perfect match and every additional point means 1 modification is necessary to change the second string to match the first one exactly.
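For readers who want to reproduce these numbers, here is a minimal sketch of what such a modified distance could look like. It is not the poster's code: it assumes a Sellers-style variant of Levenshtein in which skipping text before and after the match costs nothing, which is consistent with the example scores given in this question. The function body and all variable names are assumptions.

```matlab
function d = modified_levenshtein(text, pattern)
% Sketch (assumed implementation): minimum number of edits needed to
% turn PATTERN into some substring of TEXT.
    text = char(text);
    pattern = char(pattern);
    n = length(text);
    m = length(pattern);
    % prev(j+1): cost of aligning the pattern prefix processed so far
    % with a substring of text that ends at position j.
    prev = zeros(1, n + 1);          % empty pattern matches anywhere for free
    for i = 1:m
        curr = zeros(1, n + 1);
        curr(1) = i;                 % i pattern characters against empty text
        for j = 1:n
            sub = prev(j) + (pattern(i) ~= text(j));   % match or substitution
            del = prev(j + 1) + 1;                     % drop a pattern character
            ins = curr(j) + 1;                         % absorb an extra text character
            curr(j + 1) = min([sub, del, ins]);
        end
        prev = curr;
    end
    d = min(prev);                   % the match may end anywhere in text
end
```

Saved as modified_levenshtein.m, this sketch returns 0 for modified_levenshtein('Hello Tim','Tim') and 1 for modified_levenshtein('Hello Tim','Xim').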

So this works very well, but unfortunately I have to compare a huge number of strings, and not all of them are that clean and nice. For example, I have string1 ('Hello Tim') on the one hand and a bunch of candidate strings on the other, and I want to know which of these is closest to string1. With Levenshtein I can rank the candidates by score and that's it.

The problem is that shorter strings always get a better score and ruin my rankings.

Example:

modified_levenshtein('Hello Tim','Tim') = 0
modified_levenshtein('Hello Tim','i') = 0

So although 'Tim' is 3 times longer than 'i' and achieves a perfect match, both get the same score and are 'indistinguishable' with regard to the best matching for 'Hello Tim'. But of course the second one is nonsense. So my question is how can I link the score to the length of the string to compensate for that?

Answer 1

Jan on 16 Dec 2011

https://de.mathworks.com/matlabcentral/answers/24036-problem-with-fuzzy-substring-matching-brainteaser#answer_31643

What about a simple scaling:

score = (modified_levenshtein(S1, S2) + 1) / length(S2)
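To make the effect concrete, here is a small sketch that applies this scaling to the two candidates from the question. The distances are taken from the question (both 0) rather than recomputed, and the variable names are illustrative only. A lower score means a better match.

```matlab
candidates = {'Tim', 'i'};
dist = [0, 0];                         % scores against 'Hello Tim' as reported in the question
len = cellfun(@length, candidates);    % candidate lengths: [3, 1]
score = (dist + 1) ./ len;             % (0+1)/3 and (0+1)/1
[~, order] = sort(score);              % ascending: lower score = better match
ranked = candidates(order);
```

With these numbers 'Tim' scores about 0.33 and 'i' scores 1, so 'Tim' is ranked first.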

Answer 2

Léon on 16 Dec 2011

https://de.mathworks.com/matlabcentral/answers/24036-problem-with-fuzzy-substring-matching-brainteaser#answer_31645

Hello Jan,

I actually didn't think of that.

My current approach is to find the shortest string in the list and subtract from each score the difference in length between the current string and that shortest one.

Like:

string1 = 'Hello Tim';
candidates = {'Tim'; 'i'};                      % cell array, because the strings differ in length
mlength = min(cellfun(@length, candidates));    % length of the shortest candidate
score = modified_levenshtein(string1, candidates{1}) - (length(candidates{1}) - mlength);

Which one serves best?

Comment

Jan on 16 Dec 2011

The choice depends on your needs. If you search within the string "Tim", should "Xim" or "im" be preferred?
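One way to explore that question is to run both proposals on those two candidates. The sketch below assumes substring distances against "Tim" of 1 for "Xim" (one substitution) and 0 for "im"; these values and the variable names are assumptions for illustration only.

```matlab
candidates = {'Xim', 'im'};
dist = [1, 0];                               % assumed substring distances against 'Tim'
len = cellfun(@length, candidates);          % candidate lengths: [3, 2]
score_scaled = (dist + 1) ./ len;            % scaling proposed in Answer 1
score_offset = dist - (len - min(len));      % length offset proposed in Answer 2
```

Under these assumptions the scaling gives 2/3 for "Xim" and 1/2 for "im", so it prefers "im", while the offset gives 0 for both and leaves them tied.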
