How to extract a list of unique words from a set of one-row strings

Basically I have a set of 11 strings of words, and each string has no repeating words, but I need a list of every unique word in all 11 strings.
I've found that this works for one string at a time, but I can't get a list for all 11 strings this way.
A{1} = updatedDocuments(1,1)
B{1} = strjoin(unique(strtrim(strsplit(A{1}, ',')))', '')
Is it possible to index A{1} as updatedDocuments(1:11,1) or do something similar?

Accepted Answer

Madheswaran on 14 Nov 2024
Edited: Madheswaran on 15 Nov 2024
I am assuming the following:
  • 'updatedDocuments' is an array of 'tokenizedDocument'
  • Each document contains comma-separated text and does not end with a comma
To get the unique words across the entire set of documents, you can use the following approach:
% Remove commas from the documents if you don't want them
% included in 'uniqueWords'
updatedDocuments = removeWords(updatedDocuments, ",");
uniqueWords = updatedDocuments.Vocabulary;
If 'updatedDocuments' is instead a cell array of character vectors, you can use the following approach:
% Joining with ',' keeps the documents separated without leaving a trailing
% comma, which would otherwise produce an empty entry after splitting
allWords = strjoin(updatedDocuments(1:11,1), ','); % Join all documents into one comma-separated char vector
allWords = strtrim(strsplit(allWords, ',')); % Split at commas and trim surrounding whitespace
uniqueWords = unique(allWords); % Sorted unique words (1 x n cell, where n is the number of unique words)
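As a rough illustration of the cell-array approach, here is a minimal sketch on made-up data. The variable 'docs' and its contents are hypothetical stand-ins for 'updatedDocuments', and the sketch assumes every cell holds a comma-separated character vector with no trailing comma.

```matlab
% Hypothetical sample data standing in for updatedDocuments (an assumption,
% not the asker's actual data): a column cell array of char vectors
docs = {'apple, banana'; 'banana,cherry'; 'cherry , apple'};

% strjoin is documented for 1-by-n input, so reshape the column to a row first
joined = strjoin(reshape(docs, 1, []), ',');

% Split at commas, then trim the whitespace left around each word
allWords = strtrim(strsplit(joined, ','));

% unique returns the distinct words in sorted order
uniqueWords = unique(allWords);
disp(uniqueWords)
```

If the order of first appearance matters more than alphabetical order, unique(allWords, 'stable') should preserve it.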
For more information, refer to the following documentation pages:
  1. https://mathworks.com/help/textanalytics/ref/tokenizeddocument.html
  2. https://mathworks.com/help/matlab/ref/double.unique.html
Hope this helps!
  3 Comments
Madheswaran on 15 Nov 2024
That is because I assumed 'updatedDocuments' to be a cell array of character vectors. If 'updatedDocuments' is an array of 'tokenizedDocument', resolving this issue is straightforward. I have updated the answer to include a solution for the 'tokenizedDocument' case, in addition to the existing explanation.
Let me know if that helps!


More Answers (1)

Paul on 14 Nov 2024
If UpdatedDocuments is a 1D cell array of chars ...
UpdatedDocuments{1} = 'one,two,three,one';
UpdatedDocuments{2} = 'one,two,three,two';
UpdatedDocuments{3} = 'one,two,three,three';
result = cellfun(@(S) strjoin(unique(strtrim(strsplit(S, ','))), ','), UpdatedDocuments, 'UniformOutput', false)
result = 1x3 cell array
{'one,three,two'} {'one,three,two'} {'one,three,two'}
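The result above is unique per document. To get a single list across all documents, as the question asks, one possible variation is to pool the split words before calling unique. This is only a sketch that reuses the sample data from this answer; the names 'parts', 'allWords', and 'uniqueWords' are introduced here for illustration.

```matlab
% Same sample data as in the answer above
UpdatedDocuments = {'one,two,three,one', 'one,two,three,two', 'one,two,three,three'};

% Split each document into a 1-by-k cell array of trimmed words
parts = cellfun(@(S) strtrim(strsplit(S, ',')), UpdatedDocuments, 'UniformOutput', false);

% Concatenate the per-document word lists into one long row cell array
allWords = [parts{:}];

% Distinct words across every document, in sorted order
uniqueWords = unique(allWords);
disp(uniqueWords)
```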
  1 Comment
Paul on 15 Nov 2024
The Vocabulary property of tokenizedDocument returns the unique words in the array
documents = tokenizedDocument([
"an example of a short sentence an example of a short sentence "
"a second short sentence a second short sentence"]);
documents
documents =
  2x1 tokenizedDocument:
    12 tokens: an example of a short sentence an example of a short sentence
     8 tokens: a second short sentence a second short sentence
documents.Vocabulary
ans = 1x7 string array
"an" "example" "of" "a" "short" "sentence" "second"
