What Is Lemmatization?

Lemmatization is a text normalization technique in natural language processing. It uses vocabulary and morphological analysis to reduce each word to its base or dictionary form, known as the lemma. For example, “building has floors” reduces to “build have floor” upon lemmatization.
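The idea of combining a vocabulary with morphological rules can be sketched in a few lines of Python. This is a minimal illustration, not a production lemmatizer: the vocabulary, the irregular-form table, and the suffix list are all hypothetical and only large enough to cover the example sentence.

```python
# Toy resources (assumptions for illustration only; a real lemmatizer
# relies on a full lexicon and part-of-speech information).
VOCABULARY = {"build", "have", "floor"}
IRREGULAR = {"has": "have", "had": "have"}
SUFFIXES = ("ing", "s")


def lemmatize(word):
    """Return the lemma of a word, or the word itself if none is found."""
    word = word.lower()
    # Irregular forms are resolved by direct lookup.
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Words that are already dictionary forms are returned unchanged.
    if word in VOCABULARY:
        return word
    # Strip a suffix only when the result is a known dictionary word.
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in VOCABULARY:
                return candidate
    return word


def lemmatize_text(text):
    """Lemmatize a whitespace-separated string word by word."""
    return " ".join(lemmatize(w) for w in text.split())


print(lemmatize_text("building has floors"))  # build have floor
```

Checking each candidate against the vocabulary is what keeps the output to valid dictionary words: a word such as “this” is left alone because stripping the final “s” does not yield a known entry.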

Lemmatization Applications

Lemmatization is often used for:

Information retrieval for expanding search criteria
Reducing dimensionality of problems in text classification, sentiment analysis, or topic modeling
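Both uses listed above can be illustrated with a small Python sketch. The lemma table and the sample documents are made up for the example; the point is only that mapping inflected forms to one lemma shrinks the vocabulary and lets a single query term match several surface forms.

```python
# Hypothetical lemma table, just large enough for the sample documents.
LEMMAS = {"runs": "run", "running": "run", "ran": "run", "cats": "cat"}

DOCS = ["the cat runs", "cats ran", "the cat is running"]


def lemma(word):
    """Look up the lemma of a word, falling back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


def vocabulary(docs, normalize=str.lower):
    """Collect the set of distinct (normalized) tokens across documents."""
    return {normalize(w) for doc in docs for w in doc.split()}


def search(query, docs):
    """Return the documents containing any inflected form of the query."""
    target = lemma(query)
    return [doc for doc in docs if target in {lemma(w) for w in doc.split()}]


raw_vocab = vocabulary(DOCS)            # one feature per surface form
lemma_vocab = vocabulary(DOCS, lemma)   # one feature per lemma

print(len(raw_vocab), len(lemma_vocab))  # 7 4
print(search("running", DOCS))           # all three documents match
```

In a bag-of-words model, each distinct token is typically one feature, so the smaller lemma vocabulary translates directly into fewer dimensions.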

Lemmatization vs. Stemming

Stemming, a related approach, relies on simple heuristic rules for stripping affixes. It often produces roots or word fragments that are not actual words, whereas lemmatization returns valid dictionary words.

Examples of lemmatization and stemming are shown below.

Actual Word	Lemmatization	Stemming
Requirement	Requirement	Requir
Applied	Apply	Appli
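The contrast in the table can be reproduced with a rough Python sketch. The suffix rules below are a simplified stand-in for a heuristic stemmer (they are not the Porter algorithm or any library's implementation), and the lemma table is a hypothetical two-entry dictionary.

```python
# Simplified suffix-stripping rules (an assumption for illustration).
STEM_SUFFIXES = ("ement", "ing", "ed", "s")

# Hypothetical dictionary mapping inflected forms to lemmas.
LEMMA_TABLE = {"applied": "apply", "requirement": "requirement"}


def stem(word):
    """Strip the first matching suffix without checking any dictionary."""
    word = word.lower()
    for suffix in STEM_SUFFIXES:
        # Keep at least three characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Return the dictionary form if known, otherwise the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


for w in ("Requirement", "Applied"):
    print(w, lemmatize(w), stem(w))
# Requirement requirement requir
# Applied apply appli
```

The stemmer never consults a vocabulary, which is why it can return fragments such as “requir” and “appli”, while the dictionary-backed lookup returns real words.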

In MATLAB®, lemmatization can be done using the normalizeWords function with the 'Style' option set to 'lemma'. To learn more about lemmatization and about building predictive models from text data in MATLAB, see Text Analytics Toolbox™.

Examples and How To

Prepare Text Data for Analysis - Example
Create Simple Text Model for Classification - Example

Software Reference

Language Considerations - Documentation
Getting Started with Text Analytics Toolbox - Documentation
normalizeWords - Function
addLemmaDetails - Function
tokenizedDocument - Function
