19. NLP - basic techniques to analyse text data#

Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language. It allows computers to read, understand, interpret, and derive meaning from human languages in a valuable and structured manner. NLP stands at the intersection of computer science, artificial intelligence, and computational linguistics, aiming to build computational models of human language understanding for the development of various practical applications.

The importance of NLP in data science cannot be overstated. As we generate vast amounts of data in textual form each day, from social media posts and product reviews to emails and support tickets, NLP provides the tools and techniques necessary to make sense of this data. By transforming unstructured text into structured data, NLP allows us to analyze and extract insights from human language, providing valuable context to support decision-making processes.

NLP is a very large and active field. Here we will only cover a few basics to make the first steps towards applying modern NLP techniques in data science workflows.

19.1. Example areas for the use of NLP techniques#

Just to get an idea of how broad and how relevant NLP is, here are some of the most common applications:

  1. Text Classification: This NLP application is tasked with assigning predefined categories or tags to text. With its automatic text analysis capabilities, it streamlines the process of organizing and categorizing text, proving essential for tasks such as spam detection, content filtering, and topic labeling.

  2. Sentiment Analysis: This technique employs NLP to identify and quantify subjective information within a text. It measures the sentiment or emotional tone behind words, for instance to help businesses understand customer sentiment towards products, services, or brand topics.

  3. Summarization and Topic Modeling: These techniques involve distilling large volumes of text into concise summaries or extracting the main topics from a document or a collection of documents. By automatically identifying key points and themes, these applications can make a large corpus of text more accessible and digestible.

  4. Spell Checking: This is a commonly used application that proofreads text for spelling errors. By comparing words against a dictionary of correctly spelled words, spell checking tools can suggest corrections, thus enhancing the clarity and credibility of the written text.

  5. Machine Translation: This complex NLP task involves translating text from one language to another. With the ability to handle vast amounts of information, machine translation systems can greatly expedite multilingual communication and overcome language barriers. This technology is, for instance, used to handle large volumes of information that would be impractical to translate manually.

  6. Chatbots: NLP plays a pivotal role in the functioning of chatbots, enabling them to understand human language, interpret user queries, and respond in a conversational manner. By leveraging NLP, chatbots can deliver customer service, provide information, and even entertain, all in real time. As every reader will know, this field has seen several breakthroughs in recent years.

19.2. Python NLP libraries#

Several Python libraries are commonly used for modern Natural Language Processing (NLP) tasks. Here are a few of the most popular ones:

  1. NLTK (Natural Language Toolkit): This is a widely used library for symbolic and statistical NLP. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet. NLTK also includes text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. It’s excellent for teaching and working with the basics of NLP.

  2. SpaCy: This library is known for its advanced NLP capabilities and efficient performance. SpaCy is designed to handle large volumes of text, and its features include named entity recognition, part-of-speech tagging, dependency parsing, and sentence segmentation. Its flexibility and speed make it ideal for production-grade NLP tasks.

  3. Gensim: This is a robust open-source vector space modeling and topic modeling toolkit. Gensim is designed to handle large text collections using data streaming and incremental algorithms, which is different from most other scientific software packages that only target batch and in-memory processing. It’s especially good for tasks that involve topic modeling and document similarity analysis.

  4. TextBlob: This library simplifies text processing tasks by providing a consistent API for diving into common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more. TextBlob is very beginner-friendly and is an excellent choice for basic NLP tasks and for people getting started with NLP in Python.

  5. Transformers (by Hugging Face): This library is based on the transformer architecture (like BERT, GPT-2, RoBERTa, XLM, etc) and has pre-trained models for many NLP tasks. It offers simple, yet powerful, APIs for performing tasks such as text classification, named entity recognition, translation, summarization, and more. It is a go-to library for state-of-the-art NLP (but clearly beyond the NLP basics that we will cover here).

Here, we will work with SpaCy and NLTK.

# Uncomment to install the required libraries if needed:
#!pip install nltk
#!pip install spacy

import os
from matplotlib import pyplot as plt

# NLP-related libraries to work with text data
import nltk
import spacy

19.3. Tokenization, Stemming, Lemmatization#

Before we dive into any complex operations with text data, we must first prepare it by dividing it into smaller, more manageable units. Similar to how we break down a paragraph into sentences and sentences into words when reading, we apply the same concept in Natural Language Processing (NLP) - but in a slightly different manner. This process is called tokenization.

  1. Tokenization: This is the first step in many NLP pipelines. Tokenization is the process of breaking up the original raw text into smaller pieces, known as tokens. These tokens help us understand the context in which they’re used and draw meaning from them. Tokens are often words, but they can also be phrases, sentences, or other units, depending on the level of detail needed. For example, the sentence “This is an example sentence” would be tokenized into [‘This’, ‘is’, ‘an’, ‘example’, ‘sentence’].

Following tokenization, the resulting tokens often need to be normalized, a process that refines these tokens to improve their usefulness in further analysis.

  2. Token Normalization: Token normalization is a process that includes converting all text to the same case (usually lower case), removing punctuation, and similar tasks. It involves two major techniques: stemming and lemmatization.

    • Stemming: Stemming is the method of reducing inflected or derived words to their base or root form. For example, "running" and "runs" are both variations of the word "run", so stemming reduces them to the stem "run". However, stemming can be rather crude, often cutting off word endings in a way that leaves a base that is not a real word. For example, "arguing" might be stemmed to "argu".

    • Lemmatization: Lemmatization, on the other hand, is a more sophisticated process. While it has the same goal as stemming—to reduce a word to its base form—it uses a detailed lexicon and morphological analysis of words to achieve this. For example, lemmatization correctly identifies that the lemma for “better” is “good”. While it is more accurate than stemming, it is also more computationally intensive.

Tokenization, followed by token normalization, forms the initial preprocessing steps for most NLP tasks. They transform raw text data into a more digestible and analyzable format, preparing the ground for more advanced NLP techniques.
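
To make this difference concrete before we turn to the libraries in detail, here is a minimal sketch using NLTK (assuming the wordnet and omw-1.4 resources have been downloaded, see below); note that NLTK's WordNetLemmatizer needs a part-of-speech hint (pos="a" for adjectives) to map "better" to "good":

import nltk

stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()

print(stemmer.stem("studies"))                  # 'studi' - a crude, non-word stem
print(lemmatizer.lemmatize("studies"))          # 'study' - a proper dictionary form
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'  - requires the adjective POS hint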

# Download the required NLTK resources (only needed once); uncomment to run:
#nltk.download('wordnet')
#nltk.download('omw-1.4')
#nltk.download('punkt')

19.4. NLTK#

Here, we demonstrate the NLTK processes of tokenization, stemming, and lemmatization.

19.4.1. Tokenization#

Tokenization is the process of splitting a large paragraph or text into sentences or words. These sentences or words are known as tokens. This is a crucial step in NLP as we often deal with words in text data.

In this block of code, we define a string of text containing words in different grammatical forms. Using NLTK's TreebankWordTokenizer, we break the text down into individual words, or "tokens". The output is a list of these tokens.

text = "feet cats wolves talking talked?"

tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
['feet', 'cats', 'wolves', 'talking', 'talked', '?']

19.4.2. Stemming#

Stemming is the process of reducing inflected (and sometimes derived) words to their word stem or root form. The stem need not be identical to the morphological root of the word; it is enough that related words map to the same stem.

Here, we create a PorterStemmer object and use it to find the root stem of each word in our list of tokens. The result is a list of these stems. You’ll notice that the stems aren’t always valid words (like ‘wolv’ for ‘wolves’), as stemming operates on a rule-based approach without understanding the context.

stemmer = nltk.stem.PorterStemmer()
print([stemmer.stem(w) for w in tokens])
['feet', 'cat', 'wolv', 'talk', 'talk', '?']

19.4.3. Lemmatization#

Lemmatization is the process of reducing inflected words to their word base or dictionary form. It’s similar to stemming but is more accurate as it takes the context and meaning of the word into consideration.

Instead of the PorterStemmer, we use NLTK’s WordNetLemmatizer to find the dictionary base form (or lemma) of each word. This results in a list of lemmas. As you can see, lemmatization provides a more accurate root form (‘wolf’ for ‘wolves’) as compared to stemming.

lemmatizer = nltk.stem.WordNetLemmatizer()  # requires the 'wordnet' resource (see the download calls above)
print([lemmatizer.lemmatize(w) for w in tokens])
Note: if the WordNet data has not been downloaded yet, this cell raises a LookupError asking you to run nltk.download('wordnet') first (see the commented download calls above). Once the resource is available, the lemmatizer returns proper dictionary forms, such as 'foot', 'cat', and 'wolf' for the first three tokens.

19.5. NLP for languages other than English#

Natural Language Processing (NLP) is a truly global discipline, extending its reach to languages far beyond just English.

However, it’s worth noting that the effectiveness and ease of applying NLP techniques may vary across languages. For instance, languages with complex morphology like Finnish or Turkish, or those with little word delimitation like Chinese, can present unique challenges. Furthermore, resources and pre-trained models, especially those for machine learning, are more readily available for some languages, particularly English, than for others.
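
Before we try some German below, it is worth knowing that NLTK ships rule-based Snowball stemmers for a whole range of languages. A minimal sketch (assuming NLTK is installed as above) to list them:

import nltk

# The Snowball stemmer family covers a range of languages besides English
print(nltk.stem.SnowballStemmer.languages)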

19.5.1. Let’s try some German#

text = "Füsse Katzen Wölfe sprechen gesprochen?"  # Not an actual German sentence. Only some words for illustrative purposes.

tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
tokens

stemmer = nltk.stem.SnowballStemmer("german")
print([stemmer.stem(token) for token in tokens])
['fuss', 'katz', 'wolf', 'sprech', 'gesproch', '?']

19.6. Applying SpaCy Models for Lemmatization#

SpaCy is a highly versatile and efficient Python library for Natural Language Processing (NLP). It offers comprehensive and advanced functionalities, outperforming NLTK in terms of efficiency and speed. You can find extensive details in SpaCy’s official documentation.

Having familiarized ourselves with the concept of lemmatization, let’s now explore its practical application using SpaCy.

Initially, you need to ensure that SpaCy and the relevant language models are installed in your environment. In the case of English, en_core_web_sm is a suitable model, whereas for German, de_core_news_sm can be utilized. SpaCy offers a variety of models for different languages which you can explore on the SpaCy models page.

Installation of SpaCy and downloading of language models can be performed via the following terminal commands:

pip install spacy
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm

Download the required language models first:

# Uncomment these lines to download the models
#!python -m spacy download en_core_web_sm
#!python -m spacy download de_core_news_sm

Now that the models are installed, we can load the desired one:

#nlp = spacy.load('de_core_news_lg')  # large German language model
nlp = spacy.load('de_core_news_sm')  # small German language model

Let’s now define a text and pass it through the loaded model:

text = "Füsse Katzen Wölfe sprechen gesprochen?"
doc = nlp(text)  # create NLP object
doc
Füsse Katzen Wölfe sprechen gesprochen?

19.6.1. Tokenization#

By passing the text through the loaded NLP model, SpaCy already performs tokenization and a host of other operations under the hood:

[token.text for token in doc]
['Füsse', 'Katzen', 'Wölfe', 'sprechen', 'gesprochen', '?']

19.6.2. Lemmatization#

Unlike NLTK, SpaCy offers no option for stemming. Instead, it provides pretrained models for many different languages that allow for good lemmatization.

[token.lemma_ for token in doc]
['Fuß', 'Katze', 'Wölfe', 'sprechen', 'sprechen', '--']

Each word in the text is replaced with its base form or lemma, taking into account its usage in the sentence. This helps in text normalization, a critical step in text preprocessing for NLP tasks.
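
The lemma is only one of many annotations SpaCy attaches to each token. As a small sketch (reusing the de_core_news_sm model and the doc object from above), we can print text, lemma, and part-of-speech tag side by side:

# Inspect several token attributes at once: raw text, lemma, part-of-speech tag
for token in doc:
    print(f"{token.text:12} {token.lemma_:12} {token.pos_}")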

19.7. Apply tokenization and lemmatization#

"The War of the Worlds" by H. G. Wells

In the following we work with the text of the book "The War of the Worlds" by H. G. Wells (freely available via Project Gutenberg; the file is provided on Moodle).

# Define the filename and open the file
filename = "../datasets/wells_war_of_the_worlds.txt"
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()

# Perform some basic cleaning: replace newline characters with spaces
text = text.replace("\n", " ")
# How many characters?
len(text)
338168
# Have a look at the first part of the text
text[:1000]
'The Project Gutenberg eBook of The War of the Worlds, by H. G. Wells  This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.  Title: The War of the Worlds  Author: H. G. Wells  Release Date: July 1992 [eBook #36] [Most recently updated: November 27, 2021]  Language: English   *** START OF THE PROJECT GUTENBERG EBOOK THE WAR OF THE WORLDS ***  cover      The War of the Worlds  by H. G. Wells        ‘But who shall dwell in these worlds if they be inhabited?     . . . Are we or they Lords of the World? . . . And     how are all things made for man?’                     KEPLER (quoted in _The Anatomy of Melan'
# Load the English language model
nlp = spacy.load('en_core_web_sm') 

# Create an NLP object by processing the text
doc = nlp(text) 
# Tokenization: split the text into individual tokens (words)
tokens = [token.text for token in doc]
print(tokens[:20]) 
['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'War', 'of', 'the', 'Worlds', ',', 'by', 'H.', 'G.', 'Wells', ' ', 'This', 'eBook', 'is', 'for']

Now that we have all tokens of our book, we can obviously count the number of tokens (which is not the number of words!). But we can also look at how many different tokens there are by using the Python set() function.

# Print the total number of tokens and the number of unique tokens
print(f"Total tokens: {len(tokens)}")
print(f"Unique tokens: {len(set(tokens))}")
Total tokens: 71440
Unique tokens: 7292

Let us now do the same, but with lemmatization.

# Lemmatization: reduce each token to its base or root form
lemmas = [token.lemma_ for token in doc]
print(lemmas[:40])
['the', 'Project', 'Gutenberg', 'eBook', 'of', 'the', 'War', 'of', 'the', 'Worlds', ',', 'by', 'H.', 'G.', 'Wells', ' ', 'this', 'eBook', 'be', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'part', 'of', 'the', 'world', 'at', 'no', 'cost', 'and']

We can also select tokens more specifically by using one of many attributes or methods from SpaCy (see documentation).

For instance:

  • .is_punct returns True if a token is punctuation.

  • .is_alpha returns True if a token consists of alphabetic characters.

  • .is_stop returns True if a token belongs to a so-called "stop list" (common but less informative words; we will come back to this later).

Since we only want to count actual words here, we filter on .is_alpha:

lemmas = [token.lemma_ for token in doc if token.is_alpha]
print(lemmas[:40])
['the', 'Project', 'Gutenberg', 'eBook', 'of', 'the', 'War', 'of', 'the', 'Worlds', 'by', 'Wells', 'this', 'eBook', 'be', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'part', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restriction']
# Print the total number of lemmas and the number of unique lemmas
print(f"Total lemmas: {len(lemmas)}")
print(f"Unique lemmas: {len(set(lemmas))}")
Total lemmas: 60629
Unique lemmas: 5589

By doing this, we are effectively shrinking the size of the dataset we are working with, while still retaining the essential meaning. Note that we have not removed "stop words" yet - common words such as "and", "the", "a" - which usually carry little information and are therefore often removed in NLP preprocessing; SpaCy's .is_stop attribute makes this easy, as shown in the sketch below.
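
A minimal sketch of such a stop-word filter, reusing the doc object from above together with SpaCy's built-in .is_stop attribute:

# Keep only alphabetic tokens that are not on SpaCy's stop list
content_lemmas = [token.lemma_.lower() for token in doc
                  if token.is_alpha and not token.is_stop]

print(f"Lemmas without stop words: {len(content_lemmas)}")
print(f"Unique lemmas without stop words: {len(set(content_lemmas))}")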

In further steps, we could now investigate which words are the most common, identify named entities (such as people or places), or use this text data to train a machine learning model (for example a text classifier or a sentiment analysis model).
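
For instance, here is a small sketch of both ideas, reusing doc and the content_lemmas list from the sketch above; the entity labels come from the small English model and should be taken with a grain of salt:

from collections import Counter

# The ten most frequent content words in the book
print(Counter(content_lemmas).most_common(10))

# Named entities recognized by the model (text and label), first few only
print([(ent.text, ent.label_) for ent in doc.ents][:10])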

19.8. Mini-Exercise!#

Why do we get more tokens than lemmas? Have a look at both and find the answer!

19.9. Chapter Summary and Outlook#

Throughout this chapter, we delved into the world of Natural Language Processing (NLP), exploring several key techniques for handling and processing text data effectively:

  • Cleaning: This is often the first step in processing text data, involving tasks like removing URLs, emojis, and special characters, or replacing unwanted line breaks ("\n"); see the short sketch after this list.

  • Tokenization: This involves breaking down text into smaller parts called tokens. Tokens can be as small as individual words or can even correspond to sentences or paragraphs, depending on the level of analysis required.

  • Stemming: Words can appear in different forms depending on gender, number, person, tense, and so on. Stemming involves reducing these words to their root or stem form. For example, the word “finding” could be stemmed to “find”. This process is heuristic and sometimes may lead to non-meaningful stems.

  • Lemmatization: Similar to stemming, lemmatization aims to reduce words to their base form, but with a more sophisticated approach that takes vocabulary and morphological analysis into account. Lemmatization ensures that only the inflectional endings are removed, thus isolating the canonical form of a word known as a lemma. For example, “found” would be lemmatized to “find”.

  • Other Operations: These could include removing numbers, punctuation marks, symbols, and stop words (commonly used words like “and”, “the”, “a”, etc.), as well as converting text to lowercase for uniformity.
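
As a closing illustration of the cleaning step, here is a minimal sketch using regular expressions; the patterns and the example string are deliberately simple and hypothetical, and would need adapting for real data:

import re

raw = "Check https://www.gutenberg.org for free books!!! 😀\nGreat resource."

# Remove URLs (very rough pattern, for illustration only)
cleaned = re.sub(r"https?://\S+", "", raw)
# Replace line breaks with spaces
cleaned = cleaned.replace("\n", " ")
# Drop everything that is not a letter, digit, whitespace, or basic punctuation
cleaned = re.sub(r"[^\w\s.,!?'-]", "", cleaned)
# Collapse repeated whitespace and convert to lower case
cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()

print(cleaned)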

We’ve also discussed the application of these concepts using powerful Python libraries like NLTK and SpaCy, which provide intuitive and efficient tools for dealing with NLP tasks.