Use Dictionaries to Process Text Data
A dictionary is a data structure that associates unique keys with corresponding values. You can efficiently access data stored as values through the corresponding keys and extract collections of keys, values, or key-value pairs independently. This example creates a dictionary to extract information from a text file.
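As a minimal sketch of these ideas, the following code builds a small dictionary by hand, looks up one value through its key, and extracts the keys and values separately. The fruit names, prices, and variable names are made-up illustration data and are not part of sonnets.txt.

```matlab
% Hypothetical toy data: fruit names (keys) mapped to prices (values).
fruit = ["apple" "banana" "cherry"];
price = [1.5 0.25 3];
dToy = dictionary(fruit,price);

% Access a value through its key.
bananaPrice = dToy("banana");

% Extract the keys and the values independently, as column arrays.
kToy = keys(dToy);
vToy = values(dToy);
```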
Create Dictionary from Text File
Create a dictionary that counts the occurrences of words in the text file sonnets.txt. The file contains “The Sonnets” by William Shakespeare.
Import and Preprocess Text Data
Import the text file and preprocess it by replacing punctuation marks with spaces. Convert the text into a string array of lowercase words.
sonnets = string(fileread("sonnets.txt"));
p = ["." "?" "!" "," ";" ":" "'"];
sonnets = replace(sonnets,p," ");
sonnets = lower(sonnets);
words = split(sonnets);
Create Word Count Dictionary
Create an empty dictionary to store word counts. For each word, if it already exists in the dictionary, increase its value by one. Otherwise, add a new entry with the word as the key and 1 as the value.
d = configureDictionary("string","double");
for word = words'
    if isKey(d,word)
        d(word) = d(word)+1;
    else
        d(word) = 1;
    end
end
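The loop above is one way to count words. As an alternative sketch, not the method this example uses, you could count the words without a loop by combining unique and accumarray and then passing the results to dictionary. The short word list and the names toyWords, uniqueWords, counts, and dAlt are hypothetical and stand in for the words array.

```matlab
% Hypothetical stand-in for the words array (a column of lowercase words).
toyWords = ["the"; "rose"; "the"; "love"; "rose"; "the"];

% Find the unique words in order of first appearance, along with an index
% that maps each word to its position in uniqueWords.
[uniqueWords,~,idx] = unique(toyWords,"stable");

% Count how many times each index occurs.
counts = accumarray(idx,1);

% Build the dictionary from the unique words and their counts.
dAlt = dictionary(uniqueWords,counts);
```

With the "stable" option, the entries are added in the order in which the words first appear, which matches the insertion order that the loop produces.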
Look Up Word Counts
After creating the dictionary from the text file, you can inspect the dictionary for word counts. The keys are the unique words in sonnets.txt, and the values are the corresponding counts of the words.
Look up a value in the dictionary by using the corresponding key. For example, look up the word count for the word “rose.”
d("rose")
ans = 6
You can also look up multiple words at the same time.
d(["rose" "love"])
ans = 1×2
6 189
Alternatively, you can use the lookup function to access values corresponding to specific keys.
lookup(d,"rose")
ans = 6
lookup(d,["rose" "love"])
ans = 1×2
6 189
The lookup function also lets you specify a fallback value to return for any key that is not found in the dictionary.
lookup(d,["rose" "algorithm"],FallbackValue=0)
ans = 1×2
6 0
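A fallback value matters because indexing a dictionary with a key that it does not contain results in an error. The following sketch, which uses a small hand-built dictionary with the two counts shown above, checks which keys exist with isKey before falling back to 0 for the missing one. The names dToy, queries, tf, and counts are hypothetical.

```matlab
% Hypothetical toy dictionary that reuses the counts for "rose" and "love".
dToy = dictionary(["rose" "love"],[6 189]);

% "algorithm" is not a key in dToy.
queries = ["rose" "algorithm"];

% isKey returns one logical value per queried key.
tf = isKey(dToy,queries);

% lookup substitutes the fallback value for keys that are not found.
counts = lookup(dToy,queries,FallbackValue=0);
```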
You can use numEntries to determine the number of key-value pairs stored in d.
N = numEntries(d)
N = 3265
Find Most Frequent Words
You can use the word count dictionary to find the most frequent words in sonnets.txt.
Extract and Sort Word Count Values
Extract all values from a dictionary using the values function. For example, extract all word counts from the dictionary.
v = values(d);
The values function returns an N-by-1 array that you can manipulate like any other array. The values in the returned array maintain the order in which the entries were added to the dictionary.
Sort the word counts in descending order.
[s,ind] = sort(v,"descend");
Extract and Find Most Frequent Words
To find the most frequent words, first extract all keys from the dictionary using the keys function.
k = keys(d);
The keys function returns an N-by-1 array that you can manipulate like any other array. The keys in the returned array maintain the order in which the entries were added to the dictionary.
Display the five words with the highest counts using the sort index obtained earlier.
k(ind(1:5))
ans = 5×1 string
"and"
"the"
"to"
"my"
"of"
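If you need only the top few entries, a full sort is not strictly necessary. As a sketch of an alternative, assuming the built-in maxk function is available in your release, the following code finds the two largest counts in a small hand-built dictionary and uses the returned indices to select the matching keys. The dictionary dToy and the other variable names are hypothetical, and the counts are taken from the output shown in this example.

```matlab
% Hypothetical toy dictionary that reuses four counts from this example.
dToy = dictionary(["rose" "and" "love" "the"],[6 490 189 436]);

% Keys and values are returned in the same (insertion) order,
% so an index into one array also selects the matching element of the other.
kToy = keys(dToy);
vToy = values(dToy);

% Find the two largest counts and their positions.
[topCounts,topInd] = maxk(vToy,2);

% Select the corresponding words.
topWords = kToy(topInd);
```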
Extract Dictionary Entries
You can use the entries function to extract all key-value pairs stored in the dictionary. For example, extract all pairs of words and their corresponding counts from the dictionary as a table. Display the first five rows of the table.
T = entries(d);
head(T,5);
         Key          Value
    _____________     _____
    "the"              436
    "sonnets"            1
    "by"                94
    "william"            1
    "shakespeare"        1
The rows in the returned table maintain the order in which the entries were added to the dictionary.
Use sortrows to sort the rows in T by word counts and display the top five most frequent words and their counts.
Ts = sortrows(T,"Value","descend");
head(Ts,5);
     Key      Value
    _____     _____
    "and"      490
    "the"      436
    "to"       409
    "my"       371
    "of"       370
Alternatively, you can use the sort index obtained earlier.
T(ind(1:5),:)
ans=5×2 table
     Key      Value
    _____     _____
    "and"      490
    "the"      436
    "to"       409
    "my"       371
    "of"       370
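Because entries returns an ordinary table with the variables Key and Value, you can also filter it with logical indexing. As an illustrative sketch, the following code keeps only the words that occur exactly once in a small hand-built dictionary. The names dToy, TToy, and TOnce are hypothetical, and the counts are taken from the first rows of the table shown earlier.

```matlab
% Hypothetical toy dictionary that reuses four entries from this example.
dToy = dictionary(["the" "sonnets" "by" "william"],[436 1 94 1]);

% Extract all key-value pairs as a table with variables Key and Value.
TToy = entries(dToy);

% Keep only the rows whose count equals 1.
TOnce = TToy(TToy.Value == 1,:);
```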
See Also
dictionary | configureDictionary | lookup | numEntries | values | keys | entries