Word frequency for each label

Rachele Franceschini on 7 Jul 2022
I have a dataset with two columns: text and data. The data column contains two labels, 0 and 1. I would like to calculate the frequency of each word for each label. For example, how many times does the word "damage" appear within class 1 and within class 0? How can I do this? Also, I don't understand whether I have to use tokens or not. Maybe I can use a for loop? I'm not sure.
Here is a small image showing a similar result. I would like a table like it.

Accepted Answer

Karim on 7 Jul 2022
Edited: Karim on 7 Jul 2022
Edit: updated so that the code works with the example data that was added later.
% read the file
data = readtable("dati_classificati.xlsx",'TextType','string');
% split each sentence into words, assuming that spaces are used as delimiter...
cell_text = arrayfun(@(x) data.text(x,:),1:size(data.text,1),'UniformOutput',false)';
cell_text = cellfun(@(x) split(x,' '), cell_text,'UniformOutput',false);
% count the number of words in each sentence
numWords = cellfun(@numel, cell_text);
% expand the labels to match the number of words for each sentence
expandedLabels = repelem( data.label ,numWords);
% gather the words in 1 big string array
expandedWords = vertcat(cell_text{:});
% list a few words to count the frequency...
MyWords = ["strada" "il" "Via" "donne" "della"];
% allocate a table for the results
varTypes = ["string","double","double"]; % data type for each column
varNames = ["Words","Ones","Zeros"]; % variable name for each column
MyResult = table('Size',[numel(MyWords) 3],'VariableTypes',varTypes,'VariableNames',varNames);
MyResult.Words = MyWords(:);
% count the labels for each word
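for i = 1:numel(MyWords)
    % note: contains matches substrings, so a short word such as "il" is also counted inside longer words
    currLabels = expandedLabels( contains(expandedWords,MyResult.Words(i)) );
    MyResult.Ones(i) = sum(currLabels==1);
    MyResult.Zeros(i) = sum(currLabels==0);
end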
% display the results
MyResult
MyResult = 5×3 table
     Words      Ones    Zeros
    ________    ____    _____
    "strada"     48       1
    "il"         34      20
    "Via"        53       0
    "donne"       0       2
    "della"       3      14
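The loop above uses contains, which matches substrings, so a short word is also counted when it appears inside a longer one. If whole-word counts are wanted, comparing with == is one option. The following is only a minimal sketch: the texts and labels variables are made-up stand-in data, not the contents of the original spreadsheet.

```matlab
% Hypothetical stand-in data (assumption, not the original file)
texts  = ["il gatto in via Roma"; "aprile in strada"; "il cane"];
labels = [1; 0; 1];

% Split each sentence into words and expand the labels to one per word
words = arrayfun(@(s) split(s, " "), texts, 'UniformOutput', false);
numWords = cellfun(@numel, words);
expandedLabels = repelem(labels, numWords);
expandedWords = vertcat(words{:});

% Whole-word comparison: == on string arrays is an exact match
isIl = expandedWords == "il";
onesExact  = sum(expandedLabels(isIl) == 1);
zerosExact = sum(expandedLabels(isIl) == 0);

% Substring comparison for contrast: "aprile" also contains "il"
zerosSub = sum(expandedLabels(contains(expandedWords, "il")) == 0);
```

With this stand-in data the exact comparison finds "il" only in the two label-1 sentences, while contains also counts "aprile" from the label-0 sentence.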
9 Comments
Karim on 7 Jul 2022
I modified the original answer according to the file you provided; see the top of the answer. Note that I just used the raw text and only included a few words, but this should show how the concept works.
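To extend the idea from a few hand-picked words to every distinct word, one possible approach is to combine unique with accumarray. This is a sketch under assumptions: the texts and labels variables are made-up stand-in data, and the table name allCounts is chosen here for illustration only.

```matlab
% Hypothetical stand-in data (assumption, not the original file)
texts  = ["il gatto in via Roma"; "aprile in strada"; "il cane"];
labels = [1; 0; 1];

% Split into words and expand the labels to one per word
words = arrayfun(@(s) split(s, " "), texts, 'UniformOutput', false);
expandedLabels = repelem(labels, cellfun(@numel, words));
expandedWords = vertcat(words{:});

% Map every word to its index in the sorted list of distinct words
[uniqueWords, ~, idx] = unique(expandedWords);

% Sum the label indicators per distinct word
n = numel(uniqueWords);
onesCount  = accumarray(idx, double(expandedLabels == 1), [n 1]);
zerosCount = accumarray(idx, double(expandedLabels == 0), [n 1]);

% Collect the counts in a table with the same layout as MyResult
allCounts = table(uniqueWords, onesCount, zerosCount, ...
    'VariableNames', ["Words","Ones","Zeros"]);
```

Because unique compares whole strings, this variant counts exact word matches and treats "Via" and "via" as different words unless the text is lower-cased first.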
Rachele Franceschini on 7 Jul 2022
Thank you very, very much! I also tried it with preprocessing and it works!
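The preprocessing that was actually used is not shown in this thread. As an illustration only, a simple clean-up step before splitting might look like the sketch below. The raw variable and the list of punctuation characters are assumptions, and only base MATLAB functions are used.

```matlab
% Hypothetical raw sentences (assumption, not the original file)
raw = ["Il gatto, in Via Roma!"; "  Il cane.  "];

% Lower-case so that "Via" and "via" are counted as the same word
clean = lower(raw);

% Remove a fixed set of punctuation characters (extend as needed)
clean = regexprep(clean, '[.,;:!?"()]', '');

% Collapse repeated whitespace and trim the ends,
% so that split does not produce empty strings
clean = strtrim(regexprep(clean, '\s+', ' '));

% Split each cleaned sentence into words, as in the answer above
tokens = arrayfun(@(s) split(s, " "), clean, 'UniformOutput', false);
```

An explicit punctuation list is used here instead of a pattern such as [^\w\s], since \w may not cover the accented letters that are common in Italian text.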

Version

R2021b
