This example shows how to use the Latent Dirichlet Allocation (LDA) topic model to analyze text data. An LDA model is a topic model that discovers underlying topics in a collection of documents and infers the word probabilities within those topics.
To reproduce the results of this example, set rng to 'default'.

rng('default')
Load the example data. The file weatherReports.csv contains weather reports, including a text description and categorical labels for each event.

data = readtable("weatherReports.csv",'TextType','string');
head(data)
ans=8×16 table
Time event_id state event_type damage_property damage_crops begin_lat begin_lon end_lat end_lon event_narrative storm_duration begin_day end_day year end_timestamp
____________________ __________ ________________ ___________________ _______________ ____________ _________ _________ _______ _______ _________________________________________________________________________________________________________________________________________________________________________________________________ ______________ _________ _______ ____ ____________________
22-Jul-2016 16:10:00 6.4433e+05 "MISSISSIPPI" "Thunderstorm Wind" "" "0.00K" 34.14 -88.63 34.122 -88.626 "Large tree down between Plantersville and Nettleton." 00:05:00 22 22 2016 22-Jul-0016 16:15:00
15-Jul-2016 17:15:00 6.5182e+05 "SOUTH CAROLINA" "Heavy Rain" "2.00K" "0.00K" 34.94 -81.03 34.94 -81.03 "One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water." 00:00:00 15 15 2016 15-Jul-0016 17:15:00
15-Jul-2016 17:25:00 6.5183e+05 "SOUTH CAROLINA" "Thunderstorm Wind" "0.00K" "0.00K" 35.01 -80.93 35.01 -80.93 "NWS Columbia relayed a report of trees blown down along Tom Hall St." 00:00:00 15 15 2016 15-Jul-0016 17:25:00
16-Jul-2016 12:46:00 6.5183e+05 "NORTH CAROLINA" "Thunderstorm Wind" "0.00K" "0.00K" 35.64 -82.14 35.64 -82.14 "Media reported two trees blown down along I-40 in the Old Fort area." 00:00:00 16 16 2016 16-Jul-0016 12:46:00
15-Jul-2016 14:28:00 6.4332e+05 "MISSOURI" "Hail" "" "" 36.45 -89.97 36.45 -89.97 "" 00:07:00 15 15 2016 15-Jul-0016 14:35:00
15-Jul-2016 16:31:00 6.4332e+05 "ARKANSAS" "Thunderstorm Wind" "" "0.00K" 35.85 -90.1 35.838 -90.087 "A few tree limbs greater than 6 inches down on HWY 18 in Roseland." 00:09:00 15 15 2016 15-Jul-0016 16:40:00
15-Jul-2016 16:03:00 6.4343e+05 "TENNESSEE" "Thunderstorm Wind" "20.00K" "0.00K" 35.056 -89.937 35.05 -89.904 "Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins." 00:07:00 15 15 2016 15-Jul-0016 16:10:00
15-Jul-2016 17:27:00 6.4344e+05 "TENNESSEE" "Hail" "" "" 35.385 -89.78 35.385 -89.78 "Quarter size hail near Rosemark." 00:05:00 15 15 2016 15-Jul-0016 17:32:00
Extract the text data from the field event_narrative.

textData = data.event_narrative;
textData(1:10)
ans = 10×1 string array
"Large tree down between Plantersville and Nettleton."
"One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water."
"NWS Columbia relayed a report of trees blown down along Tom Hall St."
"Media reported two trees blown down along I-40 in the Old Fort area."
""
"A few tree limbs greater than 6 inches down on HWY 18 in Roseland."
"Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins."
"Quarter size hail near Rosemark."
"Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area."
"Powerlines down at Walnut Grove and Cherry Lane roads."
Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessText, listed at the end of the example, performs the following steps in order:

Tokenize the text using tokenizedDocument.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.
Use the preprocessing function preprocessText to prepare the text data.

documents = preprocessText(textData);
documents(1:5)
ans = 
  5×1 tokenizedDocument:

     5 tokens: large tree down plantersville nettleton
    18 tokens: two foot deep standing water develop street winthrop university campus inch rain fall less hour vehicle stall water
     9 tokens: nws columbia relay report tree blow down tom hall
    10 tokens: medium report two tree blow down i40 old fort area
     0 tokens:
Create a bag-of-words model from the tokenized documents.
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [36176×18469 double]
      Vocabulary: [1×18469 string]
        NumWords: 18469
    NumDocuments: 36176
Remove words from the bag-of-words model that do not appear more than two times in total. Remove any documents containing no words from the bag-of-words model.

bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag)
bag = 
  bagOfWords with properties:

          Counts: [28137×6974 double]
      Vocabulary: [1×6974 string]
        NumWords: 6974
    NumDocuments: 28137
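As a quick sanity check of the cleaned vocabulary, you can list the most frequent remaining words using topkwords. (This check is illustrative and not part of the original workflow; the returned counts depend on the data.)

```matlab
% View the 10 most frequent words in the reduced bag-of-words model.
tbl = topkwords(bag,10)
```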
Fit an LDA model with 7 topics. For an example showing how to choose the number of topics, see Choose Number of Topics for LDA Model. To suppress verbose output, set 'Verbose' to 0.
numTopics = 7;
mdl = fitlda(bag,numTopics,'Verbose',0);
If you have a large dataset, then the stochastic approximate variational Bayes solver is usually better suited because it can fit a good model in fewer passes of the data. The default solver for fitlda (collapsed Gibbs sampling) can be more accurate at the cost of taking longer to run. To use stochastic approximate variational Bayes, set the 'Solver' option to 'savb'. For an example showing how to compare LDA solvers, see Compare LDA Solvers.
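For example, the same 7-topic model can be fit with the stochastic solver as follows (a sketch; the resulting topics will generally differ from the Gibbs fit above):

```matlab
% Fit an LDA model using the stochastic approximate variational
% Bayes solver instead of the default collapsed Gibbs sampling.
mdlSAVB = fitlda(bag,numTopics,'Solver','savb','Verbose',0);
```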
You can use word clouds to view the words with the highest probabilities in each topic. Visualize the first four topics using word clouds.
figure
for topicIdx = 1:4
    subplot(2,2,topicIdx)
    wordcloud(mdl,topicIdx);
    title("Topic " + topicIdx)
end
Use transform to transform the documents into vectors of topic probabilities.

newDocument = tokenizedDocument("A tree is downed outside Apple Hill Drive, Natick");
topicMixture = transform(mdl,newDocument);

figure
bar(topicMixture)
xlabel("Topic Index")
ylabel("Probability")
title("Document Topic Probabilities")
Visualize multiple topic mixtures using stacked bar charts. Visualize the topic mixtures of the first 5 input documents.

figure
topicMixtures = transform(mdl,documents(1:5));
barh(topicMixtures(1:5,:),'stacked')
xlim([0 1])
title("Topic Mixtures")
xlabel("Topic Probability")
ylabel("Document")
legend("Topic " + string(1:numTopics),'Location','northeastoutside')
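Besides word clouds, you can also inspect a topic programmatically by listing its highest-probability words with topkwords. A minimal sketch (the choice of topic 1 and of 10 words here is arbitrary):

```matlab
% List the 10 highest-probability words in topic 1 of the fitted model.
tbl = topkwords(mdl,10,1)
```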
The function preprocessText performs the following steps in order:

Tokenize the text using tokenizedDocument.
Lemmatize the words using normalizeWords.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.
function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or greater
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
addPartOfSpeechDetails | bagOfWords | fitlda | ldaModel | removeEmptyDocuments | removeInfrequentWords | removeStopWords | tokenizedDocument | transform | wordcloud