Latent Dirichlet Allocation: what are corpus topic probabilities?

Kim Maria Damiani on 7 Nov 2021
I don't think I understand the definition of corpus topic probabilities in LDA.
In the corpus on which I perform LDA, each document belongs to one of two classes, labelled "yes" and "no" (assigned independently of LDA). I want to see whether the document topic distribution differs between the two classes. For each class, I sum the document topic matrix down each topic column, weighting each document by its share of the tokens in that class. I then combine the two resulting row vectors, weighting each by its class's share of the tokens in the whole corpus. The row vector I obtain differs from the corpus topic probabilities.
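To make the weighting explicit, here is the same calculation in symbols. The notation is my own shorthand, not the toolbox's: $\theta_d$ is row $d$ of DocumentTopicProbabilities, $n_d$ is the token count of document $d$ (from doclength), and $Y$ and $N$ are the sets of "yes" and "no" documents. Written this way, the class split cancels and the combined vector is simply the token-weighted mean of the document topic probabilities over the whole corpus.

```latex
\bar{\theta}_{\text{yes}} = \frac{\sum_{d \in Y} n_d\,\theta_d}{\sum_{d \in Y} n_d},
\qquad
\bar{\theta}_{\text{no}} = \frac{\sum_{d \in N} n_d\,\theta_d}{\sum_{d \in N} n_d}

\bar{\theta}_{\text{combined}}
  = \frac{\left(\sum_{d \in Y} n_d\right)\bar{\theta}_{\text{yes}}
        + \left(\sum_{d \in N} n_d\right)\bar{\theta}_{\text{no}}}
         {\sum_{d \in Y} n_d + \sum_{d \in N} n_d}
  = \frac{\sum_{d \in Y \cup N} n_d\,\theta_d}{\sum_{d \in Y \cup N} n_d}
```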
% Documents are in class "yes" where myindex is true, "no" otherwise
% (myindex is a logical vector with one entry per document)
% Weight each document's topic probabilities by its token count
yes = mdl.DocumentTopicProbabilities(myindex,:) .* doclength(docs(myindex));
no = mdl.DocumentTopicProbabilities(~myindex,:) .* doclength(docs(~myindex));
% Total number of tokens in each class
sum_length_yes = sum(doclength(docs(myindex)))
sum_length_no = sum(doclength(docs(~myindex)))
% Token-weighted mean topic probabilities per class
% (sum along dimension 1 so a class with a single document still works)
plot_yes = sum(yes,1) ./ sum_length_yes
plot_no = sum(no,1) ./ sum_length_no
% Document topic matrix weighted by token fraction, over the whole corpus
my_calculation = (plot_yes.*sum_length_yes + plot_no.*sum_length_no) ./ (sum_length_yes + sum_length_no)
% Fraction of documents assigned to each topic (k is the number of topics)
number = 1:k;
[~,topTopics] = max(mdl.DocumentTopicProbabilities,[],2);
x = sum(topTopics == number, 1);
x = x / sum(x)
% Unweighted mean of the document topic matrix
mean(mdl.DocumentTopicProbabilities,1)
% All of the above differ from the corpus topic probabilities
corpus_topic = mdl.CorpusTopicProbabilities
The corpus topic probabilities also differ from the fraction of documents assigned to each topic across the whole corpus (taking the topic with maximum probability for each document).
My question is how the corpus topic probabilities are computed, so that I can tell whether there is a sensible way to split the calculation between the two classes.

Answers (0)

Version: R2021b