
CNN for text classification

5 views (last 30 days)
christian
christian on 6 Jul 2023
Commented: christian on 8 Jul 2023
Hi, I am working with textual data, and the main objective is to classify text strings into different classes. I have referred to the example in the link Classify Text Data Using Convolutional Neural Network - MATLAB & Simulink - MathWorks Italia.
I have a question regarding the role of the word embedding layer when the input is already represented as numbers due to the encoding process. Since the input is numerical, I am uncertain how the word embedding layer contributes to the classification task. Could you help me understand this?
Thank you in advance for your reply!

Accepted Answer

Pratyush
Pratyush on 7 Jul 2023
Edited: Pratyush on 7 Jul 2023
Hi Christian. The reason for applying a word embedding layer to the texts is to give them semantic meaning. You are right that the words are already represented as numbers after encoding, but those encodings are simply tokens and carry no semantic sense.
For example, we can represent the words "apple" and "fruit" with numbers, but there is no way to express the similarity between the two words through the encoded numbers alone. A word embedding assigns a vector to each word such that semantically similar words have vectors that are close together, and vice versa.
So after the word embedding layer, the vectors assigned to "fruit" and "apple" will be close. On the other hand, two words that are not semantically close, such as "fruit" and "car", will be assigned vectors that are far apart.
Applying the word embedding layer therefore makes training easier for the network. Hope this resolves your doubt.
You may refer to the resources below to learn more about the word embedding layer.
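The distance intuition above can be sketched in MATLAB with made-up vectors (these numbers are purely illustrative, not the weights of any trained model):

```matlab
% Hypothetical embedding vectors; in practice these would come from the
% trained weights of a wordEmbeddingLayer.
vApple = [0.9 0.8 0.1];
vFruit = [0.8 0.9 0.2];
vCar   = [-0.7 0.1 0.9];

% Euclidean distances: semantically similar words sit closer together.
dSimilar    = norm(vApple - vFruit)   % small distance: "apple" ~ "fruit"
dDissimilar = norm(vApple - vCar)     % large distance: "apple" vs "car"
```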
  3 Comments
Pratyush
Pratyush on 7 Jul 2023
Yes, you are right. The embedding layer provides a semantically meaningful vector for each word even though the words arrive in encoded format; encoding alone does not provide meaningful semantic representations.
The vocabulary is simply the set of unique words or tokens in the training dataset. Say the training dataset has 1023 unique words. The word embedding layer takes the number of unique words as a parameter (here 1023), and for each word index from 1 to 1023 it learns a semantically meaningful vector. Here is an example from the documentation you provided earlier.
wordEmbeddingLayer(embeddingDimension,numWords,Name="emb")
The second parameter, numWords, was computed earlier as the number of unique words or tokens in the training dataset, as shown below.
enc = wordEncoding(documentsTrain); % Encoding of words
numWords = enc.NumWords; % Total number of unique words
So the required vocabulary information is provided to the network through the numWords parameter.
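Putting those pieces together, a minimal sketch (assuming documentsTrain holds the tokenized training documents from the linked example; the embeddingDimension of 100 is an illustrative choice):

```matlab
% Assumes documentsTrain contains tokenizedDocument objects, as in the
% "Classify Text Data Using Convolutional Neural Network" example.
enc = wordEncoding(documentsTrain);   % assign an integer index to each unique word
numWords = enc.NumWords;              % vocabulary size, e.g. 1023
embeddingDimension = 100;             % length of each learned word vector (illustrative)

% The layer maps each of the numWords word indices to a trainable
% vector of length embeddingDimension.
emb = wordEmbeddingLayer(embeddingDimension,numWords,Name="emb");
```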
christian
christian on 8 Jul 2023
Thank you very much. Have a nice day!


More Answers (0)

