From Text to Tokens#
Before we can analyze text computationally, we need to transform it into a form that algorithms can work with. Raw text is usually messy: it may contain HTML tags, URLs, punctuation, inconsistent spelling, capitalization, emojis, line breaks, or other artifacts. Moreover, a text is initially just one long sequence of characters. Most NLP methods, however, operate on smaller units such as words, sentences, or subword pieces.
This chapter introduces the basic preprocessing steps that turn raw text into a more structured representation. We will start with raw text cleaning, then split text into tokens, clean these tokens further, and finally normalize them using techniques such as stemming or lemmatization. These steps form the basis for many later NLP workflows, including word frequency analysis, TF-IDF, text classification, topic modeling, and other machine-learning approaches for text data.
Python NLP Libraries#
There are several Python libraries that are commonly used for Natural Language Processing (NLP). They differ in their goals, complexity, and typical use cases. Some are especially useful for teaching and experimenting with basic concepts, while others are designed for efficient large-scale or production-oriented NLP workflows.
NLTK (Natural Language Toolkit): NLTK is a widely used library for symbolic and statistical NLP. It provides easy-to-use interfaces to many corpora and lexical resources, such as WordNet. NLTK also includes tools for tokenization, stemming, tagging, parsing, classification, and semantic analysis. It is especially useful for teaching and for understanding the basic building blocks of NLP.
spaCy: spaCy is designed for efficient and modern NLP workflows. It provides fast and robust tools for tokenization, sentence segmentation, part-of-speech tagging, named entity recognition, dependency parsing, and more. Because of its speed and well-designed API, spaCy is often used for larger-scale and production-oriented NLP tasks [Vasiliev, 2020].
Gensim: Gensim is an open-source library for vector space modeling, topic modeling, and document similarity analysis. It is designed to handle large text collections efficiently using streaming and incremental algorithms. Gensim is particularly useful for tasks such as topic modeling, word embeddings, and similarity search.
TextBlob: TextBlob provides a simple and beginner-friendly API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and basic text classification. It is useful for quick experiments and for getting started with NLP, although it is less flexible and less powerful than some more specialized libraries.
Transformers by Hugging Face: The Transformers library provides access to many pre-trained transformer-based language models, such as BERT, RoBERTa, GPT-style models, and many others. It supports tasks such as text classification, named entity recognition, translation, summarization, and question answering. This library is central to many state-of-the-art NLP applications, but it goes beyond the NLP basics covered in this chapter.
In this course, we will mainly work with spaCy and NLTK. NLTK is useful for understanding fundamental NLP concepts, while spaCy provides a modern and efficient framework for applying many of these concepts in practice.
NLP Preprocessing Workflow#
Text data can come from many different sources and file formats, such as .txt, .csv, .html, PDFs, social media exports, log files, or scraped web pages. Before we can systematically analyze such data or use it for machine-learning models, we usually need to preprocess it in a consistent way.
A typical NLP preprocessing workflow is shown in Fig. 40.
Fig. 40 A typical NLP preprocessing workflow consists of several stages: raw text cleaning, tokenization, token cleaning, and token normalization. These steps prepare text for later analysis or modeling.#
Raw Text Cleaning#
When we obtain text from real-world sources, it often contains elements that are not part of the actual linguistic content. For example, scraped web pages may contain HTML tags, social media posts may contain URLs and emojis, and PDF exports may contain irregular line breaks or broken words.
Typical raw text cleaning steps include:
removing or replacing HTML/XML tags,
removing or replacing URLs, email addresses, and file paths,
normalizing whitespace, such as multiple spaces, tabs, and line breaks,
handling special characters, emojis, or encoding problems,
standardizing quotation marks, apostrophes, or hyphens,
optionally converting text to lowercase.
The goal of raw text cleaning is not always to remove everything that looks unusual. Sometimes, URLs, emojis, hashtags, punctuation, or casing may carry important information. For example, emojis can be useful in sentiment analysis, and capitalization can be important for named entity recognition. Therefore, cleaning decisions should always depend on the task.
After raw text cleaning, we aim to have a more regular and manageable text string that can be passed to the next stage.
Tokenization#
After initial cleaning, the text is split into smaller units called tokens. A token is a unit of text that we want to treat as one element in our analysis. In many simple workflows, tokens correspond roughly to words. However, tokens can also be punctuation marks, numbers, hashtags, sentence boundaries, subwords, or even characters, depending on the tokenizer and the application.
For example, the sentence
This is an example sentence. It contains punctuation and mixed case.
could be tokenized as:
["This", "is", "an", "example", "sentence", ".", "It", "contains", "punctuation", "and", "mixed", "case", "."]
At first glance, tokenization may seem easy: we could simply split a string at every space. However, real text quickly becomes more complicated. Consider punctuation, abbreviations, contractions, hyphenated words, numbers, or domain-specific terms such as C++, U.S.A., don't, COVID-19, or data-driven. A good tokenizer needs to handle such cases more carefully than a simple whitespace split.
Depending on the task, we may use different types of tokenization:
Word-level tokenization splits text into word-like units and punctuation marks.
Sentence-level tokenization splits a text into sentences, which is useful for tasks such as summarization or sentence-level sentiment analysis.
Subword tokenization splits words into smaller units and is widely used in modern transformer-based language models. We will not focus on this in the current chapter.
In this chapter, we will mainly work with word-level tokenization.
Token Cleaning#
After tokenization, we often clean the resulting tokens further. This step works on individual tokens rather than on the original raw text.
Common token cleaning steps include:
removing punctuation-only tokens, such as
".",",", or"!",removing or replacing numbers,
removing very short or very long tokens,
filtering out tokens that contain unwanted characters,
removing stop words such as
"the","and","to", or"is".
Stop-word removal is common in tasks such as word frequency analysis or topic modeling, where very frequent function words may not be informative. However, it is not always useful. For example, in sentiment analysis, words such as "not" can be crucial, and in authorship analysis, small function words may carry important stylistic information.
After token cleaning, a token list may look like this:
["example", "sentence", "contains", "punctuation", "mixed", "case"]
This list is shorter and more focused than the original token sequence, but it may still contain different grammatical forms of the same word.
Token Normalization#
Token normalization aims to reduce variation between related word forms. For example, the words "connect", "connects", "connected", and "connecting" are different surface forms, but they are closely related in meaning. Depending on the analysis, we may want to treat them as variants of the same underlying word.
Two common normalization techniques are stemming and lemmatization.
Stemming cuts words down to a shorter base form, often using simple rules. For example:
connected → connect
connecting → connect
studies → studi
Stemming is fast and simple, but the resulting stems are not always valid words.
Lemmatization maps a word to its dictionary form, called a lemma. For example:
connected → connect
connecting → connect
studies → study
better → good
Lemmatization is usually linguistically more accurate than stemming, but it often requires more information, such as the part of speech of a word.
Whether normalization is useful depends on the task. For word counts, search, or topic modeling, it can help reduce vocabulary size and group related word forms. For other tasks, such as some forms of text generation or linguistic analysis, preserving the original word forms may be more appropriate.
How These Stages Fit Together#
A simplified preprocessing workflow can be summarized as follows:
Raw text cleaning prepares the original text by removing or standardizing unwanted elements such as markup, URLs, irregular whitespace, or encoding artifacts.
Tokenization splits the cleaned text into smaller units called tokens.
Token cleaning removes or filters tokens that are not useful for the current task, such as punctuation-only tokens or stop words.
Token normalization reduces variation between related word forms using methods such as stemming or lemmatization.
Together, these steps transform raw text into a more consistent and analyzable representation. The exact workflow is not fixed: each decision depends on the data, the language, and the goal of the analysis. In the following sections, we will implement these steps in Python and compare simple manual approaches with tools from established NLP libraries.
#!pip install nltk
#!pip install spacy
import os
from matplotlib import pyplot as plt
# NLP related libraries to work with text data
import nltk
import spacy
nltk.download('wordnet')
#nltk.download('omw-1.4')
#nltk.download('punkt')
[nltk_data] Downloading package wordnet to /home/runner/nltk_data...
True
NLTK#
Here, we briefly demonstrate the NLTK processes of tokenization, stemming, and lemmatization.
Tokenization#
In the following block of code, we define a simple string of text. The text contains a list of words with different grammatical forms. Using NLTK’s TreebankWordTokenizer, we break down the text into individual words, or “tokens”. The output is a list of these tokens.
text = "feet cats wolves talking talked?"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)
['feet', 'cats', 'wolves', 'talking', 'talked', '?']
Stemming#
Here, we create a PorterStemmer object and use it to find the root stem of each word in our list of tokens. The result is a list of these stems. You’ll notice that the stems aren’t always valid words (like ‘wolv’ for ‘wolves’), as stemming operates on a rule-based approach without understanding the context.
stemmer = nltk.stem.PorterStemmer()
print([stemmer.stem(w) for w in tokens])
['feet', 'cat', 'wolv', 'talk', 'talk', '?']
Lemmatization#
Instead of the PorterStemmer, we use NLTK’s WordNetLemmatizer to find the dictionary base form (or lemma) of each word. This results in a list of lemmas. As you can see, lemmatization provides a more accurate root form (‘wolf’ for ‘wolves’) as compared to stemming.
stemmer = nltk.stem.WordNetLemmatizer()
print([stemmer.lemmatize(w) for w in tokens])
['foot', 'cat', 'wolf', 'talking', 'talked', '?']
NLP for languages other than English#
Natural Language Processing (NLP) is a truly global discipline, extending its reach to languages far beyond just English.
However, it’s worth noting that the effectiveness and ease of applying NLP techniques may vary across languages. For instance, languages with complex morphology like Finnish or Turkish, or those with little word delimitation like Chinese, can present unique challenges. Furthermore, resources and pre-trained models, especially those for machine learning, are more readily available for some languages, particularly English, than for others.
Let’s try some German#
text = "Füsse Katzen Wölfe sprechen gesprochen?" # Not an actual German sentence. Only some words for illustrative purposes.
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
tokens
stemmer = nltk.stem.SnowballStemmer("german")
print([stemmer.stem(token) for token in tokens])
['fuss', 'katz', 'wolf', 'sprech', 'gesproch', '?']
Applying SpaCy Models for Lemmatization#
SpaCy is a highly versatile and efficient Python library for Natural Language Processing (NLP). It offers comprehensive and advanced functionalities, outperforming NLTK in terms of efficiency and speed. You can find extensive details in SpaCy’s official documentation.
Having familiarized ourselves with the concept of lemmatization, let’s now explore its practical application using SpaCy.
Initially, you need to ensure that SpaCy and the relevant language models are installed in your environment. In the case of English, en_core_web_sm is a suitable model, whereas for German, de_core_news_sm can be utilized. SpaCy offers a variety of models for different languages which you can explore on the SpaCy models page.
Installation of SpaCy and downloading of language models can be performed via the following terminal commands:
pip install spacy
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
Download the required language models first:
try:
# Check if already installed
nlp = spacy.load("en_core_web_sm")
except:
# If not, download the model
!python -m spacy download en_core_web_sm
# There are many other models, here a small model for German:
#!python -m spacy download de_core_news_sm
Collecting en-core-web-sm==3.8.0
Downloading en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
?25l ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/12.8 MB ? eta -:--:--
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 260.6 MB/s 0:00:00
?25h
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Now that the models are installed, we can load the desired one:
nlp = spacy.load("en_core_web_sm")
Let’s now define a text and pass it through the loaded model:
text = "Feet cats wolves, speak, spoken?"
doc = nlp(text) # create NLP object
print(doc)
Feet cats wolves, speak, spoken?
# Or, try an example in German:
#nlp = spacy.load('de_core_news_lg') # large german language model
#nlp = spacy.load('de_core_news_sm') # small german lanugage model
#text = "Füsse Katzen Wölfe sprechen gesprochen?"
#doc = nlp(text) # create NLP object
#print(doc)
Tokenization#
By passing the text through the loaded NLP model, SpaCy already performs tokenization and a host of other operations under the hood:
[token.text for token in doc]
['Feet', 'cats', 'wolves', ',', 'speak', ',', 'spoken', '?']
Lemmatization#
Unlike NLTK, SpaCy has not option for stemming. But it provides many different language models (for many different languages) that allow for good lemmatization.
[token.lemma_ for token in doc]
['foot', 'cat', 'wolf', ',', 'speak', ',', 'speak', '?']
Each word in the text is replaced with its base form or lemma, taking into account its usage in the sentence. This helps in text normalization, a critical step in text preprocessing for NLP tasks.
Apply tokenization and lemmatization#
“War of the worlds” von H.G. Wells
In the following, we will work with the text of the book “War of the Worlds” from H.G. Wells which is freely available via the Gutenberg Project.
# Define the filename and open the file
filename = "../datasets/wells_war_of_the_worlds.txt"
with open(filename, "r", encoding="utf-8") as file:
text = file.read()
# Perform some basic cleaning: replace newline characters with spaces
text = text.replace("\n", " ")
# How many characters?
len(text)
338168
# Have a look at the first part of the text
text[:1000]
'The Project Gutenberg eBook of The War of the Worlds, by H. G. Wells This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook. Title: The War of the Worlds Author: H. G. Wells Release Date: July 1992 [eBook #36] [Most recently updated: November 27, 2021] Language: English *** START OF THE PROJECT GUTENBERG EBOOK THE WAR OF THE WORLDS *** cover The War of the Worlds by H. G. Wells ‘But who shall dwell in these worlds if they be inhabited? . . . Are we or they Lords of the World? . . . And how are all things made for man?’ KEPLER (quoted in _The Anatomy of Melan'
# Load the English language model
nlp = spacy.load('en_core_web_sm')
# Create an NLP object by processing the text
doc = nlp(text)
# Tokenization: split the text into individual tokens (words)
tokens = [token.text for token in doc]
print(tokens[:20])
['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'War', 'of', 'the', 'Worlds', ',', 'by', 'H.', 'G.', 'Wells', ' ', 'This', 'eBook', 'is', 'for']
Now that we have all tokens of our book, we can obviously count the number of tokens (which is not the number of words!). But we can also look at how many different tokens there are by using the Python set() function.
# Print the total number of tokens and the number of unique tokens
print(f"Total tokens: {len(tokens)}")
print(f"Unique tokens: {len(set(tokens))}")
Total tokens: 71440
Unique tokens: 7292
Let us now do the same, but with lemmatization.
# Lemmatization: reduce each token to its base or root form
lemmas = [token.lemma_ for token in doc]
print(lemmas[:40])
['the', 'Project', 'Gutenberg', 'eBook', 'of', 'the', 'War', 'of', 'the', 'Worlds', ',', 'by', 'H.', 'G.', 'Wells', ' ', 'this', 'eBook', 'be', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'part', 'of', 'the', 'world', 'at', 'no', 'cost', 'and']
We can also select tokens more specifically by using one of many attributes or methods from SpaCy (see documentation).
For instance:
.is_punctreturnsTrueif a token is a punctuation..is_alphareturnsTrueif a token contains alphabetic characters.is_stopreturnsTrueif word belongs to a so called “stop list” (less important words, we will come to this later)
Since we here only want to count words:
lemmas = [token.lemma_ for token in doc if token.is_alpha]
print(lemmas[:40])
['the', 'Project', 'Gutenberg', 'eBook', 'of', 'the', 'War', 'of', 'the', 'Worlds', 'by', 'Wells', 'this', 'eBook', 'be', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'part', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restriction']
# Print the total number of lemmas and the number of unique lemmas
print(f"Total lemmas: {len(lemmas)}")
print(f"Unique lemmas: {len(set(lemmas))}")
Total lemmas: 60629
Unique lemmas: 5587
By doing this, we are effectively shrinking the size of the dataset we are working with, while still retaining the essential meaning. It’s worth noting that we also removed “stop words” - common words such as “and”, “the”, “a” - during lemmatization, which usually do not contain important information and are often removed in NLP.
In the following steps, we could now investigate which words are the most common ones, we could identify named entities (such as people or places) or use this text data to train a machine learning model (like a text classifier or a sentiment analysis model).
Mini-Exercise!#
Why do we get more tokens than lemmas? Have a look at both and find the answer!
Search specific word types or combinations#
With Spacy we can in principle also do more complex searches. We could, for instance search for nouns, verbs, or adjectives.
nouns = [token for token in doc if token.pos_ == "NOUN"]
print(nouns[:40])
[use, parts, world, cost, restrictions, terms, online, laws, country, Title, Author, eBook, Language, START, PROJECT, WORLDS, worlds, things, man, Contents, BOOK, MARTIANS, CHOBHAM, FIGHTING, DESTRUCTION, XIII, EXODUS, THUNDER, CHILD, BOOK, MARTIANS, FOOT, DAYS, DEATH, WORK, DAYS, IX, COMING, MARTIANS, years]
But we cannot only search all nouns or verbs, but also for specific combinations. As an example, we can search for all combinations of likeor love with a noun:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": {"IN": ["like", "love"]}},
{"POS": "NOUN"}]
matcher.add("like/love", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
string_id = nlp.vocab.strings[match_id] # Get string representation
span = doc[start:end] # The matched span
print(string_id, start, end, span.text)
like/love 4797 4799 like end
like/love 6222 6224 like eyes
like/love 8394 8396 like object
like/love 8970 8972 like water
like/love 9644 9646 like puffs
like/love 12536 12538 like summer
like/love 16160 16162 like daylight
like/love 19145 19147 like parade
like/love 21471 21473 like distance
like/love 23134 23136 like thunderclaps
like/love 29624 29626 like men
like/love 32952 32954 like forms
like/love 33577 33579 like ghosts
like/love 37823 37825 like clerks
like/love 42739 42741 like generator
like/love 46982 46984 like mud
like/love 49375 49377 like branches
like/love 67580 67582 like sheep
Additionally, Spacy can be combined with regular expressions to create even more complex searches, but will not be shown here. Please consult Spacy’s documentation in case you want to build more complex search patterns.
Chapter Summary and Outlook#
Throughout this chapter, we delved into the world of Natural Language Processing (NLP), exploring several key techniques for handling and processing text data effectively:
Cleaning: This is often the first step in processing text data, involving tasks like removing URLs, Emojis, and special characters, or replacing unwanted line breaks (
"\n").Tokenization: This involves breaking down text into smaller parts called tokens. Tokens can be as small as individual words or can even correspond to sentences or paragraphs, depending on the level of analysis required.
Stemming: Words can appear in different forms depending on gender, number, person, tense, and so on. Stemming involves reducing these words to their root or stem form. For example, the word “finding” could be stemmed to “find”. This process is heuristic and sometimes may lead to non-meaningful stems.
Lemmatization: Similar to stemming, lemmatization aims to reduce words to their base form, but with a more sophisticated approach that takes vocabulary and morphological analysis into account. Lemmatization ensures that only the inflectional endings are removed, thus isolating the canonical form of a word known as a lemma. For example, “found” would be lemmatized to “find”.
Other Operations: These could include removing numbers, punctuation marks, symbols, and stop words (commonly used words like “and”, “the”, “a”, etc.), as well as converting text to lowercase for uniformity.
We’ve also discussed the application of these concepts using powerful Python libraries like NLTK and SpaCy, which provide intuitive and efficient tools for dealing with NLP tasks.