Hybrid method to sentiment analysis column number error

Question

oliver on 12 Apr 2023

Direct link to this question

https://de.mathworks.com/matlabcentral/answers/1945859-hybrid-method-to-sentiment-analysis-column-number-error

Edited: Sanguk on 14 Apr 2023

IMBD_reviews_smol.csv

I am trying to perform a hybrid approach to sentiment analysis that combines VADER sentiment scores with machine learning using an SVM. I want to concatenate the sentiment scores with the bag-of-words features, predict the sentiment labels for the test set, and then evaluate the performance. However, my X data is one column off from what the model expects at test time. The error I receive is:

"Error using classreg.learning.internal.numPredictorsCheck

X data must have 1662 column(s).

Error in classreg.learning.classif.CompactClassificationECOC/predict (line 335)

classreg.learning.internal.numPredictorsCheck(X,...

Error in assessment_hybrid (line 50)

YPred = predict(mdl, XTest);"

My code is:

% Load the movie review dataset
filename = "IMBD_reviews_first5000.csv"; 
data = readtable(filename,'TextType','string');
data.sentiment = categorical(data.sentiment);
% Split dataset into training and test sets using holdout
cvp = cvpartition(data.sentiment, 'Holdout', 0.1);
dataTrain = data(cvp.training, :);
dataTest = data(cvp.test, :);
% Extract review text and sentiment labels from training and test set 
textDataTrain = dataTrain.review;
textDataTest = dataTest.review;
YTrain = dataTrain.sentiment;
YTest = dataTest.sentiment;
% Preprocess training set
documents = preprocessText(textDataTrain);
% Create bag of words and remove infrequent words
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
% Encode training set using bag of words
XTrain = bag.Counts;
% Train SVM classifier
mdl = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Preprocess test set
documentsTest = preprocessText(textDataTest);
documentsTrain = preprocessText(textDataTrain);
% Encode test set using bag of words
XTest = encode(bag, documentsTest);
% Compute sentiment scores for training and test sets using VADER
sentimentScoresTrain = vaderSentimentScores(documentsTrain);
sentimentScoresTest = vaderSentimentScores(documentsTest);
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Predict sentiment labels for test set
YPred = predict(mdl, XTest);
% Evaluate performance
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
confusion = confusionmat(YTest, YPred);
truePositive = confusion(1, 1);
falsePositive = confusion(2, 1);
trueNegative = confusion(2, 2);
falseNegative = confusion(1, 2);
% Compute precision, recall, and F-measure
precision = truePositive / (truePositive + falsePositive);
recall = truePositive / (truePositive + falseNegative);
fMeasure = 2 * precision * recall / (precision + recall);
% Compute accuracy 
accuracy2 = (truePositive + trueNegative) / numel(YTest);
% Display results 
disp(['True positive: ' num2str(truePositive)]);
disp(['False positive: ' num2str(falsePositive)]);
disp(['True negative: ' num2str(trueNegative)]);
disp(['False negative: ' num2str(falseNegative)]);
disp(['Precision: ' num2str(precision)]);
disp(['Recall: ' num2str(recall)]);
disp(['F-measure: ' num2str(fMeasure)]);
function documents = preprocessText(textData)
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end


Answer 1

Drew on 12 Apr 2023

Direct link to this answer

https://de.mathworks.com/matlabcentral/answers/1945859-hybrid-method-to-sentiment-analysis-column-number-error#answer_1214554

Edited: Drew on 12 Apr 2023


Updating this answer based on the comment below:

You have built one SVM classifier based on the bag-of-words features. It sounds like what you want is another classifier that uses both the bag-of-words features and the VADER sentiment score feature. So, after you concatenate the sentiment scores with the bag-of-words features, train a new model on the combined features, and then evaluate that model.

% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Build new svm model using both bag-of-words and vader sentiment scores as
% features
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");  
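
To illustrate the retraining step with something that runs on its own, here is a minimal sketch that uses small synthetic matrices in place of the real bag-of-words counts and VADER scores. The names bowTrain, bowTest, scoreTrain, and scoreTest, and all of the data, are made up for illustration. The only point is that the model passed to predict is the one trained on the concatenated features. It assumes Statistics and Machine Learning Toolbox is available.

```matlab
% Synthetic stand-ins for the real features (hypothetical data).
rng(0);                                   % reproducible random numbers
nTrain = 40; nTest = 10; nWords = 5;
bowTrain = randi([0 3], nTrain, nWords);  % fake word counts, training set
bowTest  = randi([0 3], nTest,  nWords);  % fake word counts, test set
scoreTrain = 2*rand(nTrain, 1) - 1;       % fake scores in [-1, 1]
scoreTest  = 2*rand(nTest,  1) - 1;

% Fake labels derived from the sign of the score.
YTrain = categorical(double(scoreTrain > 0), [0 1], ["negative" "positive"]);
YTest  = categorical(double(scoreTest  > 0), [0 1], ["negative" "positive"]);

% Concatenate first, so training and test matrices have the same columns.
XTrain = [bowTrain, scoreTrain];
XTest  = [bowTest,  scoreTest];

% Train on the combined features, then predict with that same model.
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");
assert(size(XTest, 2) == size(XTrain, 2), "Train and test feature counts differ.");
YPred = predict(mdl2, XTest);

accuracy = mean(YPred == YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
```

In the script from the question, the same pattern would mean calling predict(mdl2, XTest) rather than predict(mdl, XTest), since mdl was trained before the extra column was added.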

Original answer:

As the error message indicates, on line 50 of your code (shown below), the number of predictors in XTest does not match the number of predictors used to train the model.

YPred = predict(mdl, XTest);

After the model was trained, the lines below appended one extra predictor column, hence the mismatch.

% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];

The number of predictors at model training time needs to match the number of predictors at model testing time.

2 comments

oliver on 12 Apr 2023

Do you know how I would change the code to prevent adding one extra predictor?

Sanguk on 13 Apr 2023

Edited: Sanguk on 14 Apr 2023


I get this error

"Error using horzcat

Dimensions of arrays being concatenated are not consistent.

Error in reun (line 44)

XTrain = [XTrain, sentimentScoresTrain];"

from running:

filename = "sentiment_irrelevantdrop_Chegg"; 
data = readtable(filename,'TextType','string');
data.sentiment = categorical(data.sentiment);
% Split dataset into training and test sets using holdout
cvp = cvpartition(data.sentiment, 'Holdout', 0.1);
dataTrain = data(cvp.training, :);
dataTest = data(cvp.test, :);
% Extract review text and sentiment labels from training and test set 
textDataTrain = dataTrain.text;
textDataTest = dataTest.text;
YTrain = dataTrain.sentiment;
YTest = dataTest.sentiment;
% Preprocess training set
documents = preprocessText(textDataTrain);
% Create bag of words and remove infrequent words
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
% Encode training set using bag of words
XTrain = bag.Counts;
% Train SVM classifier
mdl = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Preprocess test set
documentsTest = preprocessText(textDataTest);
documentsTrain = preprocessText(textDataTrain);
% Encode test set using bag of words
XTest = encode(bag, documentsTest);
% Compute sentiment scores for training and test sets using VADER
sentimentScoresTrain = vaderSentimentScores(documentsTrain);
sentimentScoresTest = vaderSentimentScores(documentsTest);
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Build new svm model using both bag-of-words and vader sentiment scores as
% features
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");  
% Predict sentiment labels for test set
YPred = predict(mdl, XTest);
% Evaluate performance
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
confusion = confusionmat(YTest, YPred);
truePositive = confusion(1, 1);
falsePositive = confusion(2, 1);
trueNegative = confusion(2, 2);
falseNegative = confusion(1, 2);
% Compute precision, recall, and F-measure
precision = truePositive / (truePositive + falsePositive);
recall = truePositive / (truePositive + falseNegative);
fMeasure = 2 * precision * recall / (precision + recall);
% Compute accuracy 
accuracy2 = (truePositive + trueNegative) / numel(YTest);
% Display results 
disp(['True positive: ' num2str(truePositive)]);
disp(['False positive: ' num2str(falsePositive)]);
disp(['True negative: ' num2str(trueNegative)]);
disp(['False negative: ' num2str(falseNegative)]);
disp(['Precision: ' num2str(precision)]);
disp(['Recall: ' num2str(recall)]);
disp(['F-measure: ' num2str(fMeasure)]);
function documents = preprocessText(textData)
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end

By the way, thanks for your code, Oliver.
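
One possible cause of the horzcat error above, offered as an assumption rather than a confirmed diagnosis: removeEmptyDocuments drops rows from bag (and the script drops the same rows from YTrain), but documentsTrain is rebuilt from the full textDataTrain, so sentimentScoresTrain may have more rows than XTrain. The sketch below reproduces that situation with plain matrices; counts, scores, and labels are hypothetical stand-ins, and idx is computed by hand to mimic the indices that removeEmptyDocuments returns.

```matlab
% Toy stand-ins (hypothetical values): 4 documents, 3 vocabulary words.
counts = [1 0 2;
          0 0 0;    % empty document
          3 1 0;
          0 0 0];   % empty document
scores = [0.5; -0.2; 0.8; 0.1];   % one VADER-style score per document
labels = categorical(["pos"; "neg"; "pos"; "neg"]);

% Indices of documents with no remaining words, mimicking the second
% output of removeEmptyDocuments.
idx = find(sum(counts, 2) == 0);

% Remove the same rows from every per-document array.
counts(idx, :) = [];
labels(idx)    = [];
scores(idx)    = [];   % without this line, [counts, scores] fails in horzcat

% Row counts now agree, so concatenation works.
X = [counts, scores];
disp(size(X));
```

In the script above, the equivalent step would be something like documentsTrain(idx) = []; before calling vaderSentimentScores, or sentimentScoresTrain(idx) = []; afterwards. Note also that the script still calls predict(mdl, XTest); with the combined features it would need to use mdl2.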
