Analyze Text Data Using Multiword Phrases

This example shows how to analyze text using n-gram frequency counts.

N-Grams

An n-gram is a tuple of n consecutive words. For example, a bigram (the case when n = 2) is a pair of consecutive words such as "heavy rainfall", and a unigram (the case when n = 1) is a single word. A bag-of-n-grams model records the number of times that each different n-gram appears in a collection of documents.

Compared with a bag-of-words model, a bag-of-n-grams model retains more information about word ordering in the original text data. For example, a bag-of-n-grams model is better suited to capturing short phrases that appear in the text, such as "heavy rainfall" and "thunderstorm winds".

To create a bag-of-n-grams model, use bagOfNgrams. You can input bagOfNgrams objects into other Text Analytics Toolbox functions such as wordcloud and fitlda.
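
For example, here is a minimal sketch (using an invented sentence rather than the example data) that lists the bigrams in a single document:

documents = tokenizedDocument("heavy rainfall and thunderstorm winds reported");
bag = bagOfNgrams(documents);   % the default n-gram length is 2
bag.Ngrams

The Ngrams property then contains one row per bigram, beginning with "heavy" "rainfall".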

Load and Extract Text Data

To reproduce the results of this example, set rng to 'default'.

rng('default')

Load the example data. The file weatherReports.csv contains weather reports, including a text description and categorical labels for each event. Remove the rows with empty reports.

filename = "weatherReports.csv";
data = readtable(filename,'TextType','string');
idx = strlength(data.event_narrative) == 0;
data(idx,:) = [];

Extract the text data from the table and view the first few reports.

textData = data.event_narrative;
textData(1:5)
ans = 5×1 string array
    "Large tree down between Plantersville and Nettleton."
    "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
    "NWS Columbia relayed a report of trees blown down along Tom Hall St."
    "Media reported two trees blown down along I-40 in the Old Fort area."
    "A few tree limbs greater than 6 inches down on HWY 18 in Roseland."

Prepare Text Data for Analysis

Create a function that tokenizes and preprocesses the text data so that it can be used for analysis. The function preprocessWeatherNarratives, listed at the end of the example, performs the following steps in order:

  1. Convert the text data to lowercase using lower.

  2. Tokenize the text using tokenizedDocument.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

  7. Lemmatize the words using normalizeWords. To improve lemmatization, the function first adds part-of-speech details to the documents using addPartOfSpeechDetails.

Use the example preprocessing function preprocessWeatherNarratives to prepare the text data.

documents = preprocessWeatherNarratives(textData);
documents(1:5)
ans = 
  5×1 tokenizedDocument:

   (1,1)   5 tokens: large tree down plantersville nettleton
   (2,1)  18 tokens: two foot deep standing water develop street winthrop unive…
   (3,1)   9 tokens: nws columbia relayed report tree blow down tom hall
   (4,1)  10 tokens: medium report two tree blow down i40 old fort area
   (5,1)   8 tokens: few tree limb great inches down hwy roseland

Create Word Cloud of Bigrams

Create a word cloud of bigrams by first creating a bag-of-n-grams model using bagOfNgrams, and then inputting the model to wordcloud.

To count the n-grams of length 2 (bigrams), use bagOfNgrams with the default options.

bag = bagOfNgrams(documents)
bag = 
  bagOfNgrams with properties:

          Counts: [28138×117043 double]
      Vocabulary: [1×18409 string]
          Ngrams: [117043×2 string]
    NgramLengths: 2
       NumNgrams: 117043
    NumDocuments: 28138

Visualize the bag-of-n-grams model using a word cloud.

figure
wordcloud(bag);
title("Weather Reports: Preprocessed Bigrams")

Fit Topic Model to Bag-of-N-Grams

A Latent Dirichlet Allocation (LDA) model is a topic model which discovers underlying topics in a collection of documents and infers the word probabilities in topics.

Create an LDA topic model with 10 topics using fitlda. The function fits an LDA model by treating the n-grams as single words.

mdl = fitlda(bag,10);
Initial topic assignments sampled in 0.741989 seconds.
=====================================================================================
| Iteration  |  Time per  |  Relative  |  Training  |     Topic     |     Topic     |
|            | iteration  | change in  | perplexity | concentration | concentration |
|            | (seconds)  |   log(L)   |            |               |   iterations  |
=====================================================================================
|          0 |       2.81 |            |  2.043e+04 |         2.500 |             0 |
|          1 |       3.62 | 6.8345e-02 |  1.083e+04 |         2.500 |             0 |
|          2 |       3.54 | 1.9129e-03 |  1.064e+04 |         2.500 |             0 |
|          3 |       3.79 | 2.4671e-04 |  1.061e+04 |         2.500 |             0 |
|          4 |       3.81 | 8.5912e-05 |  1.060e+04 |         2.500 |             0 |
=====================================================================================
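
To assess the fit numerically, one option is to compute the log-probability and perplexity of the documents under the fitted model using logp (a sketch; lower perplexity indicates a better fit to the data):

% Compute per-document log-probabilities and the corpus perplexity.
[logProb,ppl] = logp(mdl,bag);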

Visualize the first four topics as word clouds.

figure
for i = 1:4
    subplot(2,2,i)
    wordcloud(mdl,i);
    title("LDA Topic " + i)
end

The word clouds highlight commonly co-occurring bigrams in the LDA topics. The function plots the bigrams with sizes according to their probabilities within the specified LDA topics.
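
To examine a topic numerically instead of visually, you can list its highest-probability bigrams using topkwords (a sketch for the first topic; the model treats each bigram as a single word):

% List the ten most probable bigrams in topic 1.
tbl = topkwords(mdl,10,1);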

Analyze Text Using Longer Phrases

To analyze text using longer phrases, specify the 'NGramLengths' option in bagOfNgrams to be a larger value.

When working with longer phrases, it can be useful to keep stop words in the model. For example, to detect the phrase "is not happy", keep the stop words "is" and "not" in the model.
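
For instance, this sketch (with an invented sentence) shows that the trigram "is not happy" appears only because the stop words are kept:

documents = tokenizedDocument("the customer is not happy with the service");
bag = bagOfNgrams(documents,'NGramLengths',3);
bag.Ngrams   % includes the trigram "is" "not" "happy"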

Preprocess the text. Erase the punctuation using erasePunctuation, and tokenize using tokenizedDocument.

cleanTextData = erasePunctuation(textData);
documents = tokenizedDocument(cleanTextData);

To count the n-grams of length 3 (trigrams), use bagOfNgrams and specify 'NGramLengths' to be 3.

bag = bagOfNgrams(documents,'NGramLengths',3);

Visualize the bag-of-n-grams model using a word cloud. The word cloud of trigrams better shows the context of the individual words.

figure
wordcloud(bag);
title("Weather Reports: Trigrams")

View the top 10 trigrams and their frequency counts using topkngrams.

tbl = topkngrams(bag,10)
tbl=10×3 table
                     Ngram                      Count    NgramLength
    ________________________________________    _____    ___________

    "inches"    "of"              "snow"        2075          3     
    "across"    "the"             "county"      1318          3     
    "were"      "blown"           "down"        1189          3     
    "wind"      "gust"            "of"           934          3     
    "A"         "tree"            "was"          860          3     
    "the"       "intersection"    "of"           812          3     
    "inches"    "of"              "rain"         739          3     
    "hail"      "was"             "reported"     648          3     
    "was"       "blown"           "down"         638          3     
    "and"       "power"           "lines"        631          3     

Example Preprocessing Function

The function preprocessWeatherNarratives performs the following steps in order:

  1. Convert the text data to lowercase using lower.

  2. Tokenize the text using tokenizedDocument.

  3. Erase punctuation using erasePunctuation.

  4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

  5. Remove words with 2 or fewer characters using removeShortWords.

  6. Remove words with 15 or more characters using removeLongWords.

  7. Lemmatize the words using normalizeWords. To improve lemmatization, the function first adds part-of-speech details to the documents using addPartOfSpeechDetails.

function documents = preprocessWeatherNarratives(textData)

% Convert the text data to lowercase.
cleanTextData = lower(textData);

% Tokenize the text.
documents = tokenizedDocument(cleanTextData);

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Lemmatize the words. To improve lemmatization, first add part-of-speech
% details to the documents.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');
end
