fitlsa

Description

A latent semantic analysis (LSA) model discovers relationships between documents and the words that they contain. An LSA model is a dimensionality reduction tool useful for running low-dimensional statistical models on high-dimensional word counts. If the model was fit using a bag-of-n-grams model, then the software treats the n-grams as individual words.

example

mdl = fitlsa(bag,numComponents) fits an LSA model with numComponents components to the bag-of-words or bag-of-n-grams model bag.

example

mdl = fitlsa(counts,numComponents) fits an LSA model to the documents represented by the matrix of word counts counts.

example

mdl = fitlsa(___,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

collapse all

Fit a Latent Semantic Analysis model to a collection of documents.

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents) 
bag = 
  bagOfWords with properties:

          Counts: [154x3092 double]
      Vocabulary: [1x3092 string]
        NumWords: 3092
    NumDocuments: 154

Fit an LSA model with 20 components.

numComponents = 20;
mdl = fitlsa(bag,numComponents)
mdl = 
  lsaModel with properties:

              NumComponents: 20
           ComponentWeights: [1x20 double]
             DocumentScores: [154x20 double]
                 WordScores: [3092x20 double]
                 Vocabulary: [1x3092 string]
    FeatureStrengthExponent: 2

Transform new documents into lower dimensional space using the LSA model.

newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
dscores = transform(mdl,newDocuments)
dscores = 2×20

    0.1338    0.1623    0.1680   -0.0541   -0.2464   -0.0134    0.2604   -0.0205   -0.1127    0.0627    0.3311   -0.2327    0.1689   -0.2695    0.0228    0.1241    0.1198    0.2535   -0.0607    0.0305
    0.2547    0.5576   -0.0095    0.5660   -0.0643   -0.1236   -0.0082    0.0522    0.0690   -0.0330    0.0385    0.0803   -0.0373    0.0384   -0.0005    0.1943    0.0207    0.0278    0.0001   -0.0469

Load the example data. sonnetsCounts.mat contains a matrix of word counts corresponding to preprocessed versions of Shakespeare's sonnets.

load sonnetsCounts.mat
size(counts)
ans = 1×2

         154        3092

Fit LSA model with 20 components. Set the feature strength exponent to 4.

numComponents = 20;
exponent = 4;
mdl = fitlsa(counts,numComponents, ...
    'FeatureStrengthExponent',exponent)
mdl = 
  lsaModel with properties:

              NumComponents: 20
           ComponentWeights: [1x20 double]
             DocumentScores: [154x20 double]
                 WordScores: [3092x20 double]
                 Vocabulary: [1x3092 string]
    FeatureStrengthExponent: 4

Input Arguments

collapse all

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats the n-grams as individual words.

Number of components, specified as a positive integer. This value must be less than the number of the input documents, and the vocabulary size of the input documents.

Example: 200

Frequency counts of words, specified as a matrix of nonnegative integers. If you specify 'DocumentsIn' to be 'rows', then the value counts(i,j) corresponds to the number of times the jth word of the vocabulary appears in the ith document. Otherwise, the value counts(i,j) corresponds to the number of times the ith word of the vocabulary appears in the jth document.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'FeatureStrengthExponent',4 sets the feature strength exponent to 4.

Orientation of documents in the word count matrix, specified as the comma-separated pair consisting of 'DocumentsIn' and one of the following:

  • 'rows' – Input is a matrix of word counts with rows corresponding to documents.

  • 'columns' – Input is a transposed matrix of word counts with columns corresponding to documents.

This option only applies if you specify the input documents as a matrix of word counts.

Note

If you orient your word count matrix so that documents correspond to columns and specify 'DocumentsIn','columns', then you might experience a significant reduction in optimization-execution time.

Initial feature strength exponent, specified as a nonnegative scalar. This value scales the feature component strengths for the documentScores, wordScores, and transform functions.

Example: 'FeatureStrengthExponent',4

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Output Arguments

collapse all

Output LSA model, returned as an lsaModel object.

Introduced in R2017b