Create Extension Dictionary for Spelling Correction

This example shows how to create a Hunspell extension dictionary for spelling correction.

The correctSpelling function may change some words that are already spelled correctly. To prevent this, you can pass a string array of known words directly using the 'KnownWords' option. Alternatively, you can specify a Hunspell extension dictionary (also known as a personal dictionary), which can list known words and can also specify forbidden words and words alongside affix rules.

Specify Known Words

Create an array of tokenized documents.

str = [
    "Use MATLAB to correct spelling of words."
    "Correctly spelled worrds are important for lemmatizing."
    "Text Analytics Toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str);

Correct the spelling of the documents using the correctSpelling function.

updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  3x1 tokenizedDocument:

    9 tokens: Use MAT LAB to correct spelling of words .
    8 tokens: Correctly spelled words are important for legitimatizing .
    9 tokens: Text Analytic Toolbox provides functions for spelling correction .

The function has corrected the spelling of the words "worrds" and "providesfunctions", though it has also updated some correctly spelled words:

  • The input word "MATLAB" has been split into the two words "MAT" and "LAB".

  • The input word "lemmatizing" has been changed to "legitimatizing".

  • The input word "Analytics" has been changed to "Analytic".
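To spot changes like these without reading the output by eye, you can compare the tokens before and after correction. The following is a minimal sketch, assuming the Text Analytics Toolbox function doc2cell is available; the variable names original, corrected, and changed are illustrative and not part of the original example.

```matlab
% Recreate the documents from this section so the snippet runs on its own.
str = [
    "Use MATLAB to correct spelling of words."
    "Correctly spelled worrds are important for lemmatizing."
    "Text Analytics Toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str);
updatedDocuments = correctSpelling(documents);

% Convert each document to a string array of tokens.
original  = doc2cell(documents);
corrected = doc2cell(updatedDocuments);

for i = 1:numel(original)
    % Tokens present before correction but absent afterwards.
    changed = setdiff(original{i}, corrected{i}, 'stable');
    if ~isempty(changed)
        fprintf("Document %d changed: %s\n", i, join(changed, ", "));
    end
end
```

The list includes genuine corrections as well as unwanted changes, so it is a starting point for deciding which words belong in a list of known words.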

To create a Hunspell extension dictionary containing a list of known words, create a .dic file with one word per line. Create an extension dictionary file named knownWords.dic containing the words "MATLAB", "Analytics", and "lemmatizing".

MATLAB
Analytics
lemmatizing
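You can create this file in any text editor. As one possible alternative, the sketch below writes it from MATLAB using base file I/O functions. Writing to the current folder is an assumption made for illustration.

```matlab
% Known words to add, one entry per line of the dictionary file.
words = ["MATLAB" "Analytics" "lemmatizing"];

% Open the file for writing in the current folder (assumed location).
fid = fopen("knownWords.dic", "w");
assert(fid ~= -1, "Could not open knownWords.dic for writing.");

% Join the words with newlines and write them in a single call.
fprintf(fid, "%s\n", join(words, newline));
fclose(fid);

% Display the file contents to confirm what was written.
type knownWords.dic
```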

Correct the spelling of the documents again and specify the extension dictionary knownWords.dic.

updatedDocuments = correctSpelling(documents,'ExtensionDictionary','knownWords.dic')
updatedDocuments = 
  3x1 tokenizedDocument:

    8 tokens: Use MATLAB to correct spelling of words .
    8 tokens: Correctly spelled words are important for lemmatizing .
    9 tokens: Text Analytics Toolbox provides functions for spelling correction .
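For a short list of words, the 'KnownWords' option mentioned in the introduction avoids the need for a separate file. The following sketch passes the same three words as a string array. The variable name knownWords is illustrative, and the output is not shown here because it has not been verified against the extension dictionary result.

```matlab
% Recreate the documents from this section so the snippet runs on its own.
str = [
    "Use MATLAB to correct spelling of words."
    "Correctly spelled worrds are important for lemmatizing."
    "Text Analytics Toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str);

% Pass the known words directly instead of using a .dic file.
knownWords = ["MATLAB" "Analytics" "lemmatizing"];
updatedDocuments = correctSpelling(documents, 'KnownWords', knownWords)
```

Unlike an extension dictionary, a plain list of known words cannot express forbidden words or affix rules.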

Specify Affix Rules

When specifying multiple words with the same root word (for example, the words "lemmatize", "lemmatizer", "lemmatized", and so on), it can be easier to indicate a set of affix rules. Instead of listing the same word multiple times with different affixes, you can specify a particular word from which to inherit a set of affix rules.

For example, create an array of tokenized documents and use the correctSpelling function.

str = [
    "A lemmatizer reduces words to their dictionary forms."
    "To lemmatize words, use the normalizeWords function."
    "Before lemmatizing, add part of speech details to the text."
    "Display lemmatized words in a word cloud."];
documents = tokenizedDocument(str);
updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  4x1 tokenizedDocument:

     9 tokens: A legitimatize reduces words to their dictionary forms .
    10 tokens: To legitimatize words , use the normalize Words function .
    12 tokens: Before legitimatizing , add part of speech details to the text .
     8 tokens: Display legitimatized words in a word cloud .

Notice that the function changes the word "normalizeWords" and the variants of "lemmatize", even though they are spelled correctly.

Create an extension dictionary file named knownWordsWithAffixes.dic containing the words "normalizeWords" and "lemmatize". For the word "lemmatize", use the "/" symbol to also include the valid affixes of the word "equalize".

normalizeWords
lemmatize/equalize

Correct the spelling of the documents again and specify the extension dictionary knownWordsWithAffixes.dic.

updatedDocuments = correctSpelling(documents,'ExtensionDictionary','knownWordsWithAffixes.dic')
updatedDocuments = 
  4x1 tokenizedDocument:

     9 tokens: A lemmatizer reduces words to their dictionary forms .
     9 tokens: To lemmatize words , use the normalizeWords function .
    12 tokens: Before lemmatizing , add part of speech details to the text .
     8 tokens: Display lemmatized words in a word cloud .

Notice that the variants of "lemmatize" have not been changed. The default dictionary contains the word "equalize" and also recognizes the words "equalizer" and "equalized" via the "-r" and "-d" suffixes, respectively. By specifying the entry "lemmatize/equalize", the software recognizes the word "lemmatize" as well as the words formed by applying the affixes of "equalize" to it, for example "lemmatizer" and "lemmatized".

Specify Forbidden Words

The correctSpelling function may output undesirable words, even if a more desirable word is in the dictionary. For example, for the input word "MALTAB", the correctSpelling function may output the two words "MALT AB" or the single word "MALTA". To ensure that certain words do not appear in the output, you can specify forbidden words in the extension dictionary.

For example, create an array of tokenized documents and correct the spelling using the extension dictionary knownWords.dic. Note that this dictionary contains the word "MATLAB".

str = [
    "Analyze text data using MATLAB."
    "Use MALTAB for text analysis."];
documents = tokenizedDocument(str);
updatedDocuments = correctSpelling(documents,'ExtensionDictionary','knownWords.dic')
updatedDocuments = 
  2x1 tokenizedDocument:

    6 tokens: Analyze text data using MATLAB .
    7 tokens: Use MALT AB for text analysis .

Even though the word "MATLAB" is in the dictionary or extension dictionary, the software may still choose other words as matches to incorrectly spelled words close to "MATLAB".

Create an extension dictionary file named knownWordsWithForbiddenWords.dic containing the word "MATLAB", and specify the forbidden words "malt" and "Malta" using the "*" symbol.

MATLAB
*malt
*Malta

Correct the spelling using the extension dictionary knownWordsWithForbiddenWords.dic.

updatedDocuments = correctSpelling(documents,'ExtensionDictionary','knownWordsWithForbiddenWords.dic')
updatedDocuments = 
  2x1 tokenizedDocument:

    6 tokens: Analyze text data using MATLAB .
    6 tokens: Use MATLAB for text analysis .

Here, the word "MALTAB" is corrected to "MATLAB".
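The three entry types shown in this example (plain known words, words with a "/" affix model, and forbidden words marked with "*") can be generated from ordinary string arrays. The sketch below builds the entries and writes them to a single file. The file name combinedExtension.dic and the variable names are hypothetical, and mixing all three entry types in one file is an assumption based on the description of extension dictionaries in the introduction; this example only demonstrates them in separate files.

```matlab
% Inputs for each entry type (values taken from the sections above).
known      = ["MATLAB" "normalizeWords"];  % plain known words
affixWord  = "lemmatize";                  % word that inherits affixes
affixModel = "equalize";                   % word whose affixes are reused
forbidden  = ["malt" "Malta"];             % words that must not be output

% Build one dictionary entry per element.
entries = [known, affixWord + "/" + affixModel, "*" + forbidden];

% Write the entries, one per line, to a hypothetical file name.
fid = fopen("combinedExtension.dic", "w");
assert(fid ~= -1, "Could not open combinedExtension.dic for writing.");
fprintf(fid, "%s\n", join(entries, newline));
fclose(fid);

% Display the file contents to confirm what was written.
type combinedExtension.dic
```

Building the entries with string concatenation keeps the "/" and "*" markers in one place, which helps when the word lists come from another source such as a spreadsheet.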
