correctSpelling

Correct spelling of words

Description

Use correctSpelling to correct spelling of words in string arrays or documents.

The function supports English, German, and Korean text.

updatedDocuments = correctSpelling(documents) corrects the spelling of the words in the tokenizedDocument array documents.

updatedWords = correctSpelling(words) corrects the spelling of the words in the string vector words.

updatedWords = correctSpelling(words,'Language',language) also specifies the language of the words in the string vector words.

[___,unknownWords] = correctSpelling(___) also returns a vector of words in the input that were not found in the dictionary and for which no suggestion was found.

___ = correctSpelling(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Create a tokenized document array.

str = [
    "A documnent containing some misspelled worrds."
    "Another documnent cntaining typos."];
documents = tokenizedDocument(str);

Correct the spelling of the words in the documents using the correctSpelling function.

updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  2×1 tokenizedDocument:

    7 tokens: A document containing some misspelled words .
    5 tokens: Another document containing typos .

Create a string array of words.

words = ["A" "strng" "array" "containing" "misspelled" "worrds" "."];

Correct the spelling of the words in the string array using the correctSpelling function.

updatedWords = correctSpelling(words)
updatedWords = 1x7 string
  Columns 1 through 6

    "A"    "string"    "array"    "containing"    "misspelled"    "words"

  Column 7

    "."

Create a tokenized document array.

str = [
    "Analyze text data using MATLAB."
    "Another documnent cntaining typos."];
documents = tokenizedDocument(str);

Correct the spelling of the words in the documents using the correctSpelling function.

updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  2×1 tokenizedDocument:

    7 tokens: Analyze text data using MAT LAB .
    5 tokens: Another document containing typos .

Notice that the word "MATLAB" gets split into the two words "MAT" and "LAB".

Correct the spelling of the documents and specify "MATLAB" as a known word using the 'KnownWords' option.

updatedDocuments = correctSpelling(documents,'KnownWords',"MATLAB")
updatedDocuments = 
  2×1 tokenizedDocument:

    6 tokens: Analyze text data using MATLAB .
    5 tokens: Another document containing typos .

Input Arguments

Input documents, specified as a tokenizedDocument array.

Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell

Word language, specified as one of the following:

  • 'en' – English language

  • 'de' – German language

  • 'ko' – Korean language

If you do not specify language, then the software detects the language automatically.
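As a brief sketch of the language argument, the following call passes 'de' explicitly for a vector of German words instead of relying on automatic detection. The sample words are made up for illustration, and no output is shown because the corrections depend on the dictionary in use.

```matlab
% Hypothetical German input containing one misspelled word ("Wrot" for "Wort").
words = ["Ein" "Wrot" "mit" "Tippfehler"];

% Specify the language explicitly instead of relying on automatic detection.
updatedWords = correctSpelling(words,'Language','de');
```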

Data Types: char | string

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: correctSpelling(documents,'KnownWords',["MathWorks" "MATLAB"]) corrects the spelling of the words in documents and treats the words "MathWorks" and "MATLAB" as correctly spelled words.

Words to be treated as correct, specified as the comma-separated pair consisting of 'KnownWords' and a string array or a cell array of character vectors.

If you specify a list of known words, then these words remain unchanged when the function corrects spelling. The software may also substitute misspelled words with words from the list of known words.

Example: ["MathWorks" "MATLAB"]

Data Types: char | string | cell

Hunspell extension dictionary file (also known as personal dictionary file), specified as the comma-separated pair consisting of 'ExtensionDictionary' and a file path of a Hunspell extension dictionary file.

A Hunspell extension dictionary file is a .dic file containing a list of words in the following format:

word1/affixWord1
word2/affixWord2
...
wordN/affixWordN
*forbiddenWord1
*forbiddenWord2
...
*forbiddenWordM
where:

  • word1, word2, …, wordN is a list of words to extend the Hunspell dictionary with.

  • affixWord1, affixWord2, …, affixWordN (optional) indicate words in the Hunspell dictionary that share affixes. Indicate affixes by concatenating them to the corresponding word with a forward slash (/). For example, the entry exxxtreme/extreme indicates that affixes that apply to the word "extreme" also apply to the custom word "exxxtreme".

  • forbiddenWord1, forbiddenWord2, …, forbiddenWordM is a list of forbidden words to use for spelling correction. Indicate forbidden words using an asterisk (*).

The entries in the Hunspell extension dictionary file can appear in any order.

For example, to create a Hunspell extension dictionary file specifying:

  • The words "MathWorks", "MATLAB", and "exxxtreme".

  • The affixes that apply to the word "extreme" also apply to the word "exxxtreme".

  • The word "MATLOB" is a forbidden word.

use:

MathWorks
MATLAB
exxxtreme/extreme
*MATLOB

For an example showing how to create Hunspell extension dictionary files, see Create Extension Dictionary for Spelling Correction. For more information about the options of Hunspell dictionary files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.
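One way to try the extension dictionary shown above is to write its entries to a file from MATLAB and pass the file path to correctSpelling. This is a minimal sketch: the file name myExtension.dic and the sample document are assumptions chosen for illustration, and the output is omitted because it depends on the underlying dictionary.

```matlab
% Write the extension dictionary entries to a .dic file in the temporary
% folder. The file name is arbitrary and chosen for illustration.
filename = fullfile(tempdir,"myExtension.dic");
entries = ["MathWorks" "MATLAB" "exxxtreme/extreme" "*MATLOB"];
fid = fopen(filename,'w');
fprintf(fid,"%s\n",entries);   % one entry per line
fclose(fid);

% Correct spelling using the extension dictionary.
documents = tokenizedDocument("Analyze text data using MATLAB.");
updatedDocuments = correctSpelling(documents,'ExtensionDictionary',filename);
```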

Data Types: char | string

Hunspell dictionary file, specified as the comma-separated pair consisting of 'Dictionary' and a file path of a Hunspell dictionary file.

A Hunspell dictionary file is a .dic file containing the number of words in the dictionary followed by a list of the words in the following format:

N
word1/flags1
word2/flags2
...
wordN/flagsN

where N is the number of words in the dictionary file, word1, word2, …, wordN are the N words in the dictionary, and flags1, …, flagsN specify optional flags corresponding to the words word1, word2, …, wordN, respectively. Use flags to specify word attributes, for example affixes. To specify a Hunspell affix file, use the 'Affixes' option.

For example, to create a Hunspell dictionary file containing the 4 words "MathWorks", "MATLAB", "correctSpelling", and "tokenizedDocument", use:

4
MathWorks
MATLAB
correctSpelling
tokenizedDocument

For more information about the options of Hunspell dictionary files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.

Data Types: char | string

Hunspell affix file, specified as the comma-separated pair consisting of 'Affixes' and a file path of a Hunspell affix file.

A Hunspell affix file is a .aff file containing a list of options in the following format:

option1 values1
option2 values2
...
optionM valuesM

where M is the number of options in the affix file, option1, option2, …, optionM are the M options, and values1, …, valuesM specify the values corresponding to the options option1, option2, …, optionM, respectively. Use these options to specify affixes.

Prefixes

To define a prefix rule, use the PFX option with the format:

PFX flag crossProduct K
PFX flag stripping1 prefix1 condition1
...
PFX flag strippingK prefixK conditionK
where:

  • flag corresponds to the flags used in the Hunspell dictionary file.

  • crossProduct indicates whether prefixes and suffixes can be mixed, specified as Y or N.

  • K is the number of prefixes defined for the specified flag.

  • stripping1, stripping2, …, strippingK indicate characters to be stripped from the word when applying the prefix. If the stripping value is 0, then no stripping takes place.

  • prefix1, prefix2, …, prefixK specify the prefixes to use.

  • condition1, condition2, …, conditionK specify the optional conditions for which to apply the prefixes prefix1, prefix2, …, prefixK, respectively. For the trivial condition, specify ".".

Suffixes

To define a suffix rule, use the SFX option with the format:

SFX flag crossProduct K
SFX flag stripping1 suffix1 condition1
...
SFX flag strippingK suffixK conditionK
where suffix1, suffix2, …, suffixK specify the suffixes to use, and the flag, cross product, K, stripping, and condition values are the same as in the prefix format.

Example

To create a Hunspell affix file defining the following affix rules:

  • Flag A:

    • prefix words with "re"

  • Flag B:

    • suffix words not ending with "y" with "ed".

    • suffix words ending with "y" with "ied", removing "y".

use the Hunspell affix file:

PFX A Y 1
PFX A 0 re .

SFX B Y 2
SFX B 0 ed [^y]
SFX B y ied y

To use these flags in a Hunspell dictionary file, append the appropriate flags to the words using a forward slash (/). For each word, you can specify multiple flags. For example, to specify a dictionary file containing:

  • The words "ptest" and "ptry".

  • For the word "ptest" only, also include the prefix "re" using flag A.

  • For both words, also include the suffixes "ed" or "ied" where appropriate using flag B
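A dictionary file matching this description would plausibly contain the count 2 followed by the entries ptest/AB and ptry/B. The sketch below writes that dictionary file together with an affix file defining flags A and B, then passes both files to correctSpelling. The file names and sample words are assumptions for illustration. The header for flag B declares two rules because two SFX rule lines follow it. No output is shown because the result depends on how the function ranks candidates.

```matlab
% Write the affix file: flag A adds the prefix "re"; flag B adds the
% suffix "ed", or replaces a trailing "y" with "ied".
affixFile = fullfile(tempdir,"myAffixes.aff");
affixLines = [
    "PFX A Y 1"
    "PFX A 0 re ."
    "SFX B Y 2"
    "SFX B 0 ed [^y]"
    "SFX B y ied y"];
fid = fopen(affixFile,'w');
fprintf(fid,"%s\n",affixLines);
fclose(fid);

% Write the dictionary file: the word count, then each word with its flags.
dictionaryFile = fullfile(tempdir,"myDictionary.dic");
dictionaryLines = ["2"; "ptest/AB"; "ptry/B"];
fid = fopen(dictionaryFile,'w');
fprintf(fid,"%s\n",dictionaryLines);
fclose(fid);

% Correct spelling using the custom dictionary and affix files.
% The input words are hypothetical.
words = ["reptest" "ptried" "ptesst"];
[updatedWords,unknownWords] = correctSpelling(words, ...
    'Dictionary',dictionaryFile,'Affixes',affixFile);
```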

For more information about the options of Hunspell affix files, see https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html.

Data Types: char | string

Method to retokenize documents, specified as the comma-separated pair consisting of 'RetokenizeMethod' and one of the following:

  • 'split' – Correct spelling by splitting tokens. For example, split the incorrectly spelled token "twowords" into the correctly spelled tokens "two" and "words".

  • 'none' – Do not split tokens for spelling correction.
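As a short illustration of this option, the following sketch disables token splitting, which is one way to keep a token such as "MATLAB" from being split into "MAT" and "LAB". The sample text is an assumption, and the function may still replace a token as a whole, so no output is shown.

```matlab
documents = tokenizedDocument("Analyze text data using MATLAB.");

% Disable token splitting so that each token is corrected as a whole.
updatedDocuments = correctSpelling(documents,'RetokenizeMethod','none');
```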

Output Arguments

Corrected documents, returned as a tokenizedDocument array. If the 'RetokenizeMethod' option is 'split', then the number of words in each updated document may differ from the number of words in the corresponding input document.

If there are multiple candidates for corrected words, then the function automatically selects a single word for correction.

Corrected words, returned as a string vector. If the 'RetokenizeMethod' option is 'split', then the number of updated words may differ from the number of input words.

If there are multiple candidates for corrected words, then the function automatically selects a single word for correction.

Unknown words, returned as a string vector. The string vector unknownWords contains the input words that are not in the spelling correction dictionary and for which no suggestions are found.
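To inspect the unknown words, request the second output. In this sketch, "qzxvkj" is a made-up token that is assumed to have no close dictionary match. Whether it actually appears in unknownWords depends on the dictionary, so no output is shown.

```matlab
% "qzxvkj" is a hypothetical nonsense token used for illustration.
words = ["A" "strng" "array" "containing" "qzxvkj"];

% Return the corrected words and the words with no suggestion.
[updatedWords,unknownWords] = correctSpelling(words);
```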

Introduced in R2020a