Text classification/similarity measures when text data is very similar

Ess Gee on 19 Feb 2021
Answered: Anay on 3 Jul 2025
I'm trying to use MATLAB to perform a type of text classification/similarity measure, and I'm having some difficulty figuring out the right algorithm for my specific use case.
I have a list of 'gold standard' product names which can be quite similar to one another e.g.
productnames_examples = [
"Apple iPhone 11 128GB Black"
"Apple iPhone 11 128GB Blue"
"Apple iPhone 11 256GB Black"
"Apple iPhone 11 256GB Blue"
"Apple iPhone 11 Pro 128GB Black"
"Apple iPhone 11 Pro 128GB Blue"
"Apple iPhone 11 Pro 256GB Black"
"Apple iPhone 11 Pro 256GB Blue"
"Apple iPhone 12 128GB Black"
"Apple iPhone 12 128GB Blue"
"Apple iPhone 12 256GB Black"
"Apple iPhone 12 256GB Blue"
"Apple iPhone 12 Pro 128GB Black"
"Apple iPhone 12 Pro 128GB Blue"
"Apple iPhone 12 Pro 256GB Black"
"Apple iPhone 12 Pro 256GB Blue"]
I'm trying to match a long list of queries to these product names. Examples of the query values are:
querynames_examples = [
"Apple iPhone 11 256 Gb Black"
"Apple iPhone 12 256GB Pro Blue"
"Apple iPhone 12 Pro iOS 10 5.5 5G LTE 128GB Blue"
"Apple iPhone 11 Pro Black 5.5 256GB"
"iPhone 12 Pro Blue 256GB" ]
Edit distance algorithms (such as the editDistance function) don't seem to be appropriate, because edit distances don't discriminate well when the product names are as similar as they are in this example. For example, the query "Apple iPhone 12 256GB Pro Blue" is closer in edit distance to "Apple iPhone 12 256GB Blue" than to the correct "Apple iPhone 12 Pro 256GB Blue".
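The effect can be checked directly. The snippet below is a minimal sketch, assuming Text Analytics Toolbox is installed; the variable names are only illustrative. It compares the query against both candidates, once at the character level and once at the token level.

```matlab
% Requires Text Analytics Toolbox (editDistance, tokenizedDocument).
query      = "Apple iPhone 12 256GB Pro Blue";
wrongMatch = "Apple iPhone 12 256GB Blue";       % lacks "Pro"
rightMatch = "Apple iPhone 12 Pro 256GB Blue";   % same words, different order

% Character-level edit distance (the default for string input).
dWrong = editDistance(query, wrongMatch);
dRight = editDistance(query, rightMatch);

% Token-level edit distance: whole tokens are inserted, deleted, or substituted.
dWrongTokens = editDistance(tokenizedDocument(query), tokenizedDocument(wrongMatch));
dRightTokens = editDistance(tokenizedDocument(query), tokenizedDocument(rightMatch));

fprintf("characters: wrong = %d, right = %d\n", dWrong, dRight);
fprintf("tokens:     wrong = %d, right = %d\n", dWrongTokens, dRightTokens);
```

In both cases the wrong product is the closer one: dropping "Pro" costs fewer edits than moving it to a different position.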
I've also looked at the family of BM25 algorithms, and again they don't seem to be able to get past the similarity of the contents of productnames.
I've also looked at training a simple text classifier on word frequency counts using a bag-of-words model, based on https://mathworks.com/help/textanalytics/ug/create-simple-text-model-for-classification.html with some alterations (e.g. no minimum word length in the preprocessing, so that numeric values like 11 can be captured) and using pre-existing matched data as training data, but again I don't seem to be getting anything useful out of it.
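For reference, the preprocessing alteration described above could look roughly like the following. This is a sketch under assumptions, not the exact code used: it assumes Text Analytics Toolbox, uses three of the product names as sample input, and simply omits the removeShortWords step so that short tokens are kept.

```matlab
% Requires Text Analytics Toolbox (tokenizedDocument, bagOfWords).
names = [
    "Apple iPhone 11 128GB Black"
    "Apple iPhone 11 Pro 128GB Black"
    "Apple iPhone 12 Pro 256GB Blue"];

% Lowercase and tokenize only. removeShortWords is deliberately not called,
% so short numeric tokens such as "11" and "12" stay in the vocabulary.
docs = tokenizedDocument(lower(names));
bag = bagOfWords(docs);

disp(bag.Vocabulary)      % tokens seen across all names
disp(full(bag.Counts))    % one row of word counts per product name
```

The rows of bag.Counts are the word-frequency features a classifier would be trained on.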
Is there a function in MATLAB that can be used to suitably match these queries?

Answers (1)

Anay on 3 Jul 2025
Hi Ess,
Based on your description, I suggest trying a "dense document embedding". These embeddings come from neural networks with attention layers, which help capture the semantic and contextual meaning of the text beyond the individual words. To use dense document embeddings, you need to install the Text Analytics Toolbox add-on.
Models like BM25 and classifiers trained on bag-of-words features use sparse vectors, which capture neither the semantic meaning of the text nor the order of the words. That may make them unsuitable for your application, where the product names are very similar to one another.
You can use the following code for reference:
documents = [
"Apple iPhone 11 128GB Black"
"Apple iPhone 11 128GB Blue"
"Apple iPhone 11 256GB Black"
"Apple iPhone 11 256GB Blue"
"Apple iPhone 11 Pro 128GB Black"
"Apple iPhone 11 Pro 128GB Blue"
"Apple iPhone 11 Pro 256GB Black"
"Apple iPhone 11 Pro 256GB Blue"
"Apple iPhone 12 128GB Black"
"Apple iPhone 12 128GB Blue"
"Apple iPhone 12 256GB Black"
"Apple iPhone 12 256GB Blue"
"Apple iPhone 12 Pro 128GB Black"
"Apple iPhone 12 Pro 128GB Blue"
"Apple iPhone 12 Pro 256GB Black"
"Apple iPhone 12 Pro 256GB Blue"];
% Load the pretrained embedding model and embed the product names
emb = documentEmbedding(Model="all-MiniLM-L12-v2");
embeddedDocuments = embed(emb,documents);
% Embed the query with the same model
query = "Apple iPhone 12 Pro iOS 10 5.5 5G LTE 128GB Blue";
embeddedQuery = embed(emb,query);
% Cosine similarity between each product name and the query
scores = cosineSimilarity(embeddedDocuments,embeddedQuery);
[~,idx] = sort(scores,"descend");
% Display all product names, ranked from most to least similar
disp("query: " + query)
disp("ranked documents (in descending order of similarity):")
disp(documents(idx));
I get this output:
To use the "all-MiniLM-L12-v2" model, you need to install the "Text Analytics Toolbox Model for all-MiniLM-L12-v2 Network" support package.
The Text Analytics Toolbox documentation covers dense document embeddings in more detail.
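The reference code above handles a single query. For a long list of queries, the same functions can be applied in one batch. The following is a sketch that assumes the toolbox and support package named above are installed; the queries are the examples from the question, and the table and its column names are only illustrative.

```matlab
% Requires Text Analytics Toolbox and the all-MiniLM-L12-v2 support package.
documents = [
    "Apple iPhone 11 128GB Black"
    "Apple iPhone 11 128GB Blue"
    "Apple iPhone 11 256GB Black"
    "Apple iPhone 11 256GB Blue"
    "Apple iPhone 11 Pro 128GB Black"
    "Apple iPhone 11 Pro 128GB Blue"
    "Apple iPhone 11 Pro 256GB Black"
    "Apple iPhone 11 Pro 256GB Blue"
    "Apple iPhone 12 128GB Black"
    "Apple iPhone 12 128GB Blue"
    "Apple iPhone 12 256GB Black"
    "Apple iPhone 12 256GB Blue"
    "Apple iPhone 12 Pro 128GB Black"
    "Apple iPhone 12 Pro 128GB Blue"
    "Apple iPhone 12 Pro 256GB Black"
    "Apple iPhone 12 Pro 256GB Blue"];
queries = [
    "Apple iPhone 11 256 Gb Black"
    "Apple iPhone 12 256GB Pro Blue"
    "Apple iPhone 12 Pro iOS 10 5.5 5G LTE 128GB Blue"
    "Apple iPhone 11 Pro Black 5.5 256GB"
    "iPhone 12 Pro Blue 256GB"];

emb = documentEmbedding(Model="all-MiniLM-L12-v2");
embeddedDocuments = embed(emb, documents);
embeddedQueries = embed(emb, queries);

% scores(i,j) is the cosine similarity between query i and product name j.
scores = cosineSimilarity(embeddedQueries, embeddedDocuments);

% For each query, keep the product name with the highest similarity.
[bestScore, bestIdx] = max(scores, [], 2);
results = table(queries, documents(bestIdx), bestScore, ...
    VariableNames=["Query" "BestMatch" "Score"]);
disp(results)
```

Because the product names differ by only one or two tokens, it is worth checking the returned matches against known pairs before relying on them, and inspecting the score gap between the first and second candidates.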
