Introduction to Working with Text Data

21. Introduction to Working with Text Data#

Data Science is a versatile field that integrates techniques from different domains to extract meaningful information from data. Often, the data we work with is numerical, perhaps tabular and structured. However, a significant portion of the data available today is unstructured and in the form of text. This chapter focuses on why and how we work with such text data.

Text data can be found in various forms and domains. It’s ubiquitous in today’s digital world and holds immense value when it comes to generating insights and making informed decisions. In general, we can categorize the applications of text data in data science into two broad categories:

21.1. Data and Structures as Strings#

String data forms a large part of the information we handle in our day-to-day lives and in scientific research. It can include:

File and Folder names: These are often encoded as text and carry essential information about the content they represent. Proper management and manipulation of these names can significantly enhance our productivity and efficiency.
Categorical Data: Many real-world datasets contain categorical features such as Pokémon type, credit card usage, eye color, etc. These categories are typically represented as text data.
Names and Designations: In various fields like social sciences, human resources, and even technical domains, names and designations (for people, products, companies, etc.) are prevalent and crucial.

21.2. Human Language in Text Form#

Much of our communication today happens via written text. Such data is rich and diverse, and its analysis can provide a wealth of insights. It includes:

Social Media, Emails, Chats: A substantial amount of information is exchanged through social media posts, emails, and chat conversations. Analyzing this data can help understand user behavior, sentiments, trends, and more.
News and Press Releases, Print Media: These sources provide a constant stream of new information about the world. Text analysis on such data can reveal patterns, detect events, and even predict future occurrences.
The Internet: The Internet is predominantly text, from information-dense Wikipedia articles to YouTube video tags and titles. Web scraping and text mining can help extract valuable insights from this vast reservoir of data.
Protocols and Transcripts: Transcriptions of meetings, court proceedings, and more provide a formal record of events. Analyzing these can reveal decision patterns, biases, and other organizational dynamics.
Legislative Texts, Manuals: Legal texts and manuals contain structured and vital information. Text analysis can aid in understanding, summarizing, and even automating parts of these documents.

In the following sections of this chapter, we will delve into how we can effectively manage and process these text data to extract valuable insights. This includes techniques from simple string manipulation to advanced natural language processing. Understanding text data and how to work with it is an invaluable skill in the modern data science landscape.

But let us start with the most fundamental basics: string handling! Here is a simple short text to work with (the text was generated by ChatGPT with its unique sense of humor…):

my_text = """Once upon a time, in a far-off land of data, lived a strange yet
curious character known as Stringman. Stringman wasn't an ordinary
resident of this land. He had a unique ability to transform himself
into different forms.\n
One day, String-man decided to visit the 'Tuple Twins'. As he was
journeying, he had to pass through the eerie 'List Forest'.
Suddenly, a wild 'IndexError' appeared. It was well-known that
'IndexErrors' were not fond of anyone from the 'String' family. 
String man, being clever, quickly transformed into 'string-man'
and tricked the 'IndexError' saying, "You must be mistaken, I am
not a Stringman, but a mere hyphenated man."\n
Fooled by his disguise, the 'IndexError' let him pass. String-man
then continued his journey, relishing his victory over the 'IndexError'
and looking forward to meeting the 'Tuple Twins'.
"""

print(my_text)

Once upon a time, in a far-off land of data, lived a strange yet
curious character known as Stringman. Stringman wasn't an ordinary
resident of this land. He had a unique ability to transform himself
into different forms.

One day, String-man decided to visit the 'Tuple Twins'. As he was
journeying, he had to pass through the eerie 'List Forest'.
Suddenly, a wild 'IndexError' appeared. It was well-known that
'IndexErrors' were not fond of anyone from the 'String' family. 
String man, being clever, quickly transformed into 'string-man'
and tricked the 'IndexError' saying, "You must be mistaken, I am
not a Stringman, but a mere hyphenated man."

Fooled by his disguise, the 'IndexError' let him pass. String-man
then continued his journey, relishing his victory over the 'IndexError'
and looking forward to meeting the 'Tuple Twins'.

21.3. Basic Python string handling methods#

The Python string data type already comes with many basic string handling methods. As a refresher, here are some of the most common ones:

21.3.1. Python String Methods#

Method	Description
count()	Returns the number of times a specified value occurs in a string
encode()	Returns an encoded version of the string
endswith()	Returns True if the string ends with the specified value
index()	Searches the string for a specified value and returns the position of where it was found
islower()	Returns True if all characters in the string are lower case
isupper()	Returns True if all characters in the string are upper case
join()	Joins the elements of an iterable into one string, using the string it is called on as the separator
lower()	Converts a string into lower case
strip()	Returns a trimmed version of the string
lstrip()	Returns a left trim version of the string
replace()	Returns a string where every occurrence of a specified value is replaced with another specified value
rstrip()	Returns a right trim version of the string
split()	Splits the string at the specified separator, and returns a list
splitlines()	Splits the string at line breaks and returns a list
startswith()	Returns True if the string starts with the specified value
upper()	Converts a string into upper case

This is not a full introduction to basic string handling with Python. For more information, you can easily find plenty of material online, for instance the w3schools tutorial on Python strings. Here, we will simply go through a few common examples as a refresher.

my_text.lower()

'once upon a time, in a far-off land of data, lived a strange yet\ncurious character known as stringman. stringman wasn\'t an ordinary\nresident of this land. he had a unique ability to transform himself\ninto different forms.\n\none day, string-man decided to visit the \'tuple twins\'. as he was\njourneying, he had to pass through the eerie \'list forest\'.\nsuddenly, a wild \'indexerror\' appeared. it was well-known that\n\'indexerrors\' were not fond of anyone from the \'string\' family. \nstring man, being clever, quickly transformed into \'string-man\'\nand tricked the \'indexerror\' saying, "you must be mistaken, i am\nnot a stringman, but a mere hyphenated man."\n\nfooled by his disguise, the \'indexerror\' let him pass. string-man\nthen continued his journey, relishing his victory over the \'indexerror\'\nand looking forward to meeting the \'tuple twins\'.\n'

One of the first things to note when working with text data in Python is the presence of special character sequences like \' and \n. You might have noticed these in our story about Stringman when we displayed the lowercased text using my_text.lower() (they show up in this raw string representation, but not in the print() output). These are known as escape sequences.

The \' sequence is used to include literal single quotes (') in our text string. This is necessary because Python interprets single quotes as marking the start or end of a string. Therefore, to use a single quote as part of the string itself (for instance, in contractions like “don’t” or to denote possession as in “Python’s”), we ‘escape’ it using a backslash (\) before the quote.

On the other hand, \n is a newline character that is used to start a new line. Python interprets this sequence as a single character that moves the cursor to the next line. This is why we see the text broken into multiple lines when we print the content of my_text.

Understanding and being able to handle such escape sequences is an essential part of working with text data, as they can affect the processing and analysis of the text.
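
To make the difference between the printed text and its string representation more tangible, here is a minimal sketch. The variable names (sample, same_text, etc.) are made up for this illustration only.

```python
# A short string containing an escaped single quote and a newline.
sample = 'It\'s line one\nline two'

print(sample)        # print() renders the newline as an actual line break
print(repr(sample))  # repr() shows the escape sequence \n explicitly

# Each escape sequence stands for ONE character in the string.
newline_length = len("\n")
quote_length = len('\'')

# Using double quotes as delimiters avoids the need to escape a single quote.
same_text = "It's line one\nline two"
texts_equal = sample == same_text
```

Both ways of writing the string lead to exactly the same content. The escape sequences are only a notation in the source code and in the string representation.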

21.3.2. Modify strings#

A very common way to modify strings in Python is the replace() method. It is well suited to replace specific characters or sequences of characters.
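
Two details of replace() are easy to overlook: it returns a new string and leaves the original untouched, and it accepts an optional third argument that limits the number of replacements. A small sketch with a made-up example string:

```python
word = "banana bread"

# replace() returns a new string; strings themselves are immutable.
all_replaced = word.replace("a", "o")
original_unchanged = word == "banana bread"

# The optional third argument limits how many occurrences are replaced,
# counting from the left.
first_two_replaced = word.replace("a", "o", 2)

print(all_replaced)
print(first_two_replaced)
```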

s = "We don't always like all characters and sequences we have."
s.replace("i", "!").replace("e", "3")

"W3 don't always l!k3 all charact3rs and s3qu3nc3s w3 hav3."

21.3.2.1. Mini Quiz!#

Make a guess: What would the following code return?
print("abc".replace("ab", "cc").replace("c", "x"))
a) ccx
b) ab
c) xxx
d) abx

21.3.3. Count words#

Counting words is a very common task. It is even central to many data science methods. Python strings include methods like count() which, as the name says, can do some of this counting for us. But we will quickly see that this is often not good enough.

my_text.count("no")

Careful. This will not count all words “no”, but every single occurrence of the letter sequence no in our string, which would also include words like “none” or “nothing” etc.
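
The difference between counting letter sequences and counting words can be sketched as follows. The example sentence is made up for illustration, and stripping a handful of punctuation characters is a simplifying assumption rather than a complete solution.

```python
sentence = "No, there is nothing new, and none of us know why. No idea."

# Substring counting: also finds "no" inside "nothing", "none", "know".
substring_count = sentence.lower().count("no")

# Word counting: split into tokens, strip punctuation from the ends,
# then compare whole tokens only.
tokens = [t.strip(".,!?") for t in sentence.lower().split()]
word_count = tokens.count("no")

print(substring_count, word_count)
```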

my_text.count("stringman"), my_text.count("string-man")

(0, 1)

my_text.lower().count("stringman"), my_text.lower().count("string-man")

(3, 3)

21.3.4. Tokenize#

words = my_text.lower().split(" ")  # still many wrong words in there
print(words[:40])

['once', 'upon', 'a', 'time,', 'in', 'a', 'far-off', 'land', 'of', 'data,', 'lived', 'a', 'strange', 'yet\ncurious', 'character', 'known', 'as', 'stringman.', 'stringman', "wasn't", 'an', 'ordinary\nresident', 'of', 'this', 'land.', 'he', 'had', 'a', 'unique', 'ability', 'to', 'transform', 'himself\ninto', 'different', 'forms.\n\none', 'day,', 'string-man', 'decided', 'to', 'visit']

words = my_text.lower().replace(".", "").replace(",", "").replace("\n", " ").split(" ")
print(words[:40])

['once', 'upon', 'a', 'time', 'in', 'a', 'far-off', 'land', 'of', 'data', 'lived', 'a', 'strange', 'yet', 'curious', 'character', 'known', 'as', 'stringman', 'stringman', "wasn't", 'an', 'ordinary', 'resident', 'of', 'this', 'land', 'he', 'had', 'a', 'unique', 'ability', 'to', 'transform', 'himself', 'into', 'different', 'forms', '', 'one']
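
Note the empty string '' near the end of the list above: split(" ") splits at every single space, so two consecutive spaces produce an empty token. One possible way around this, sketched here on a short made-up text, is to call split() without arguments and to strip punctuation with the help of string.punctuation from the standard library.

```python
import string

text = "Once upon a time,\nin a far-off land.\n\nOne day, 'String-man' left."

# split() without arguments splits on any run of whitespace
# (spaces, tabs, newlines) and never produces empty strings.
raw_tokens = text.lower().split()

# strip() with string.punctuation removes punctuation from both ends of
# each token but keeps inner characters such as the hyphen in "far-off".
tokens = [t.strip(string.punctuation) for t in raw_tokens]
tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation

print(tokens)
```

This is still a rough tokenizer. It would, for example, also strip a leading "#" from hashtags, which may or may not be what we want.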

21.3.5. Common problem: different variations of a word#

words.count("stringman"), words.count("string-man")

(3, 2)

Solution 1:

variations = ["string-man", "stringman", "string man"]

# Solution 1
translations = {"string-man": "stringman",
               "string man": "stringman",
               "stringman": "stringman"}

print([translations[w] for w in variations])

['stringman', 'stringman', 'stringman']

Solution 2 - real bad:

We might be tempted to again simply use replace() here, but that won’t easily cover all the cases we have.

print([w.replace("-", "") for w in variations])

['stringman', 'stringman', 'string man']
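
The translation dictionary from Solution 1 can also be applied to a whole text before tokenizing, which matters for variants such as "string man" that would otherwise be split into two tokens. The following is only a sketch: the helper name normalize and the sample sentence are assumptions made for this illustration.

```python
translations = {"string-man": "stringman",
                "string man": "stringman"}

def normalize(text, translations):
    """Lowercase the text and replace every known variant by its canonical form."""
    text = text.lower()
    for variant, canonical in translations.items():
        text = text.replace(variant, canonical)
    return text

sample = "Stringman met String-man and String man."
normalized = normalize(sample, translations)
count = normalized.count("stringman")

print(normalized)
```

The obvious drawback is that every variant has to be listed by hand.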

21.3.5.1. Identify special words or tags#

Using only the basic string methods mentioned above, we can already achieve many basic tasks when handling text data in Python. For instance, we can divide text into individual words (more or less at least), and identify special terms or tags.

my_text2 = "The key to success is not in my blog post but in the tags #education #innovation #technology"

words = my_text2.lower().split(" ")
words = [w.strip(".,!? ") for w in words]
print(words)

['the', 'key', 'to', 'success', 'is', 'not', 'in', 'my', 'blog', 'post', 'but', 'in', 'the', 'tags', '#education', '#innovation', '#technology']

for w in words:
    if w.startswith("#"):
        print(f"Hashtag: {w}")

Hashtag: #education
Hashtag: #innovation
Hashtag: #technology

21.4. Regular Expressions (Regex)#

Using only basic string operations quickly becomes very limiting, or at least the required code becomes highly complex once we need to go beyond two or three such operations. This is usually the point where people turn to regular expressions, which are far more versatile than individual replace() or split() operations.

Regular Expressions, often shortened to “regex,” are a powerful tool for working with text data. They’re a sequence of characters forming a search pattern, primarily used for pattern matching with strings or string manipulation. In the world of Data Science, regular expressions find widespread use in text processing tasks.

21.4.1. What are Regular Expressions?#

At their core, regular expressions are a means to describe patterns within strings. They offer a flexible and concise way to identify strings of text such as particular words, patterns of characters, or a combination of these. This can be as simple as searching for a specific word or as complicated as extracting all email addresses from a text.

21.4.2. Uses of Regular Expressions#

Regular expressions are used for several text processing tasks:

Validation: They can check if the input data follows a certain format, such as an email address or a telephone number.
Search: You can use regex to locate specific strings or substrings within a larger piece of text.
Substitution: They can be used to replace certain patterns in a string.
Splitting: Regular expressions can define the delimiter to split a larger string into a list of smaller substrings.
Data Extraction: They are often used to scrape web data, where specific patterns need to be extracted from HTML code.
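
Each of these uses roughly corresponds to one function in Python's re module. The sketch below runs them on a made-up contact string. The patterns are deliberately simple and only meant as illustrations, not as robust phone or email patterns.

```python
import re

text = "Contact: anna@example.org, bob@example.com; phone 555-1234"

# Validation: does the WHOLE string have the form ddd-dddd?
is_phone = re.fullmatch(r"\d{3}-\d{4}", "555-1234") is not None

# Search: find the first match anywhere in the text.
first_number = re.search(r"\d{3}-\d{4}", text).group()

# Substitution: mask every digit.
masked = re.sub(r"\d", "#", text)

# Splitting: split at commas or semicolons followed by optional whitespace.
parts = re.split(r"[,;]\s*", text)

# Data extraction: everything that roughly looks like an email address.
emails = re.findall(r"[\w.]+@[\w.]+\w", text)

print(is_phone, first_number, masked, parts, emails, sep="\n")
```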

21.4.3. Pros and Cons of Regular Expressions#

Pros:

Versatile: Regular expressions can handle a multitude of string matching problems with concise expressions.
Portable: The principles of regular expressions can be applied across many programming languages, command-line tools (like grep, sed, awk), databases, etc.
Powerful: Regular expressions can match complex patterns and perform sophisticated string manipulations.

Cons:

Complexity: Regular expressions can become complex and hard to understand and maintain, especially for intricate patterns.
Readability: A complicated regular expression can be quite cryptic, and lack of standardization in regex flavors can cause compatibility issues.
Efficiency: Depending on the complexity, regular expressions might not be the most efficient way to perform string operations, especially for very large text data.

21.4.4. Using Regular Expressions in Python#

Python’s built-in re module allows us to use regular expressions (see documentation). The module provides several functions, including match(), search(), findall(), split(), sub(), and others, each designed to manipulate strings in different ways.

import re

search = re.search(r"string[- ]?man", my_text.lower())

search

<re.Match object; span=(92, 101), match='stringman'>

search.group()

'stringman'

21.4.4.1. Search vs. findall#

Contrary to what the name might suggest, re.search will only search until it finds the first match of the pattern.

If we want to find all instances matching the given regex pattern, then we have to use re.findall:

results = re.findall("string[- ]?man", my_text.lower())

results

['stringman',
 'stringman',
 'string-man',
 'string man',
 'string-man',
 'stringman',
 'string-man']
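
If we also need the positions of all matches, re.finditer is a useful third option, since it yields one match object per hit. A small sketch on a made-up sentence:

```python
import re

text = "stringman met string-man and string man"
pattern = r"string[- ]?man"

first = re.search(pattern, text)      # match object for the first hit only
all_hits = re.findall(pattern, text)  # list of all matched strings

# finditer() yields one match object per hit, including its position.
positions = [(m.group(), m.start(), m.end())
             for m in re.finditer(pattern, text)]

# search() returns None if nothing matches, so check before calling .group().
no_hit = re.search(r"tupleman", text)

print(positions)
```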

In this context it is good to know that in Python we can work with “raw strings” by adding an “r” in front of the string. This essentially means that Python will not interpret backslash escape sequences such as \n, which is convenient because regular expressions make heavy use of backslashes themselves.

r"\nmno"

'\\nmno'

"\nmno"

'\nmno'

print(r"\nmno")
print("\nmno")

\nmno

mno
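
Why this matters for regular expressions can be seen with \b, which means “word boundary” in a regex but “backspace” in a normal Python string. A minimal sketch with a made-up sentence:

```python
import re

text = "the cat sat on the catalog"

# Raw string: the regex engine receives a backslash followed by "b",
# which it interprets as a word boundary.
with_raw = re.findall(r"\bcat\b", text)

# Normal string: Python turns "\b" into a backspace character first,
# so the pattern searches for backspace + "cat" + backspace.
without_raw = re.findall("\bcat\b", text)

# Escaping the backslash by hand also works, but is harder to read.
double_escaped = re.findall("\\bcat\\b", text)

print(with_raw, without_raw, double_escaped)
```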

There are many resources on the internet on how to build and use regular expressions. Here are some of the most common rules:

21.4.5. Text, Numbers, Classes#

. any character except “new line” (“\n”)
\d a digit, that is 0, 1, 2, … 9
\D no digit, i.e., any character except 0, 1, 2, … 9
\w a word character, which includes lower + upper case letters, digits, and underscore
\W no word character
\s a whitespace character (space, tab, newline, etc.)
\S no whitespace character, i.e., almost all other characters
[aAbBcC] any of the contained characters
[^aAbBcC] NONE of the contained characters
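
A quick way to get a feeling for these classes is to try them with re.findall on a small test string. The strings below are made up for illustration.

```python
import re

sample = "Room 42b,\tfloor 3\nopen_24h"

digits = re.findall(r"\d", sample)        # every single digit
word_runs = re.findall(r"\w+", sample)    # runs of word characters
whitespace = re.findall(r"\s", sample)    # space, tab, and newline

vowels = re.findall(r"[aeiou]", "regex")       # any of the listed characters
non_vowels = re.findall(r"[^aeiou]", "regex")  # none of the listed characters

dot_matches = re.findall(r".", "a\nb")    # "." skips the newline

print(digits, word_runs, whitespace, vowels, non_vowels, dot_matches, sep="\n")
```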

21.4.6. Positions in the String#

^ beginning of the string
$ end of the string
\b word boundary
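
The effect of these position markers is easiest to see when the same search term is combined with each of them. A small sketch on a made-up string:

```python
import re

text = "cat concatenate cat"

anywhere = re.findall(r"cat", text)          # also inside "concatenate"
at_start = re.findall(r"^cat", text)         # only at the very beginning
at_end = re.findall(r"cat$", text)           # only at the very end
whole_words = re.findall(r"\bcat\b", text)   # standalone words only

print(len(anywhere), len(at_start), len(at_end), len(whole_words))
```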

21.4.7. Repetitions#

? zero or once, i.e., the expression is optional
+ at least once
* any number of times (including none)
{n} exactly n times
{min,max} at least min times and at most max times
{min,} at least min times
{,max} at most max times
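
The following sketch applies each repetition rule to a short made-up test string. Note that a quantifier always refers to the element directly in front of it.

```python
import re

optional = re.findall(r"colou?r", "color colour colouur")  # "u" zero or once
at_least_once = re.findall(r"go+d", "gd god good goood")   # "o" once or more
any_number = re.findall(r"go*d", "gd god good goood")      # "o" zero or more

numbers = "7 42 1999 2024 12345"
exactly_four = re.findall(r"\b\d{4}\b", numbers)     # exactly 4 digits
two_to_four = re.findall(r"\b\d{2,4}\b", numbers)    # 2 to 4 digits
three_or_more = re.findall(r"\b\d{3,}\b", numbers)   # at least 3 digits

# {,2} means "at most twice", which includes zero times.
too_many = re.fullmatch(r"a{,2}", "aaa")
none_is_fine = re.fullmatch(r"a{,2}", "")

print(optional, at_least_once, any_number, sep="\n")
print(exactly_four, two_to_four, three_or_more, sep="\n")
```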

21.4.8. Characters#

[A-Z] ABCDEFGHIJKLMNOPQRSTUVWXYZ
[a-z] abcdefghijklmnopqrstuvwxyz
[0-9] (or \d) 0123456789
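\w Word character = a single letter, digit, or underscore (for ASCII text, short for [a-zA-Z0-9_]; in Python 3, \w also matches non-ASCII letters such as é by default).
\W No word character. For ASCII text, short for [^a-zA-Z0-9_]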
\d Digit (short for [0-9]).
\D No digit (short for [^0-9]).

21.4.9. Groups and Branches#

| is an or –> A|B therefore looks for matches with A or B.
[] contains a set of characters. [abc] would therefore expect an a, b, or c.
(...) denotes a (capturing) group. Its content is returned separately, e.g., when we execute re.findall() or re.search().
(?:...) is a “non-capturing group”, i.e., a group whose content is not returned separately.
(?=...), (?!...) “Positive Lookahead”, “Negative Lookahead” describe patterns that must (or must not) follow. However, they are not included in the match itself.
(?<=...), (?<!...) “Positive Lookbehind”, “Negative Lookbehind” describe patterns that must (or must not) precede. However, they are not included in the match itself.
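
How groups change what re.findall returns, and what “not included in the match” means for lookaheads and lookbehinds, is sketched below on a made-up price list.

```python
import re

text = "price: 100 USD, 250 EUR, 75 USD"

# One capturing group: findall returns only the group content.
currencies = re.findall(r"\d+ (USD|EUR)", text)

# Non-capturing group: findall returns the whole match instead.
full_matches = re.findall(r"\d+ (?:USD|EUR)", text)

# Two capturing groups: findall returns one tuple per match.
pairs = re.findall(r"(\d+) (USD|EUR)", text)

# Positive lookahead: numbers followed by " USD", without " USD" in the match.
usd_amounts = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: complete numbers NOT followed by " USD".
# The \b markers prevent matching only part of a number ("10" of "100").
non_usd_amounts = re.findall(r"\b\d+\b(?! USD)", text)

# Positive lookbehind: the word directly preceded by "250 ".
after_250 = re.findall(r"(?<=250 )\w+", text)

print(currencies, full_matches, pairs, usd_amounts, non_usd_amounts, after_250, sep="\n")
```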

# Few more examples

re.findall(r"\.\w{2,3}", "dot.com or dot.de")

['.com', '.de']

re.findall(r"\.\w{2,3}$", "dot.com or dot.de")

['.de']

nerd_string = "500GB for 100$ is not so great. \
I want 1000.0 Gigabytes for less than 99.90 $"

regex = r"[0-9\.]+\s*(GB|[Gg]igabytes)"
re.findall(regex, nerd_string)

['GB', 'Gigabytes']

regex = r"([0-9\.]+)\s*(GB|[Gg]igabytes)"
re.findall(regex, nerd_string)

[('500', 'GB'), ('1000.0', 'Gigabytes')]

import time

def smallest_chatbot_in_the_world():
    """
    A simple chatbot that responds to user complaints about problems or issues.
    """
    print("Hi. What's up?")
    user_input = input(">>> ").strip()
    
    # Mini chat-bot
    regex_problem = r"(problem|issues?) with ([a-zA-Z\s]+)"
    problem_mentioned = re.findall(regex_problem, user_input, re.IGNORECASE)

    if problem_mentioned:
        problem = problem_mentioned[0][1].strip()
        problem = re.sub(r"\b(my|our)\b", "your", problem, flags=re.IGNORECASE)
        print("Come on! ... *MIMIMI* ... ")
        time.sleep(1)
        print(f"Stop complaining about {problem}!!")
    else:
        print("Yeah, I know what you mean.")
    
    time.sleep(2)
    print("Let's try something better next time... gotta go.")

automated_run = True  # Set to False to run it yourself!

if automated_run:
    from unittest.mock import patch

    # Mock user input for testing
    with patch("builtins.input",
               return_value="All good."):
        smallest_chatbot_in_the_world()
else:
    smallest_chatbot_in_the_world()

Hi. What's up?
Yeah, I know what you mean.

Let's try something better next time... gotta go.

automated_run = True  # Set to False to run it yourself!

if automated_run:
    from unittest.mock import patch

    # Mock user input for testing
    with patch("builtins.input",
               return_value="I have a problem with my computer"):
        smallest_chatbot_in_the_world()
else:
    smallest_chatbot_in_the_world()

Hi. What's up?
Come on! ... *MIMIMI* ... 

Stop complaining about your computer!!

Let's try something better next time... gotta go.

user_input = "Hi there. I have a problem with my broken bike."
#user_input = "Hello... I have a real problem!"
#user_input = "Hello... I have a real problem with all the violence!"

regex_problem = r"(problem|issue) with ([a-zA-Z\s]*\b)"
problem_mentioned = re.findall(regex_problem, user_input)

# mini chat-bot
print(">>", user_input)
if len(problem_mentioned) > 0:
    problem = problem_mentioned[0][1]
    problem = re.sub(r"\b(my|our)\b", "your", problem)
    print(f"What exactly is wrong with {problem}?")
else:
    print("Yeah, I know what you mean.")

>> Hi there. I have a problem with my broken bike.
What exactly is wrong with your broken bike?

import random

reflections = {
    "am": "are",
    "was": "were",
    "i": "you",
    "i'd": "you would",
    "i've": "you have",
    "i'll": "you will",
    "my": "your",
    "are": "am",
    "you've": "I have",
    "you'll": "I will",
    "your": "my",
    "yours": "mine",
    "you": "me",
    "me": "you"
}
 
psychobabble = [
    [r'I need (.*)',
     ["Why do you need {0}?",
      "Would it really help you to get {0}?",
      "Are you sure you need {0}?"]],
 
    [r'Why don\'?t you ([^\?]*)\??',
     ["Do you really think I don't {0}?",
      "Perhaps eventually I will {0}.",
      "Do you really want me to {0}?"]],
 
    [r'Why can\'?t I ([^\?]*)\??',
     ["Do you think you should be able to {0}?",
      "If you could {0}, what would you do?",
      "I don't know -- why can't you {0}?",
      "Have you really tried?"]],
 
    [r'I can\'?t (.*)',
     ["How do you know you can't {0}?",
      "Perhaps you could {0} if you tried.",
      "What would it take for you to {0}?"]],
 
    [r'I am (.*)',
     ["Did you come to me because you are {0}?",
      "How long have you been {0}?",
      "How do you feel about being {0}?"]],

    [r'(.*)\?',
     ["Why do you ask that?",
      "Please consider whether you can answer your own question.",
      "Perhaps the answer lies within yourself?",
      "Why don't you tell me?"]],
]


def reflect(fragment):
    """Swap first- and second-person words (e.g. "my" -> "your") in a fragment."""
    tokens = fragment.lower().split()
    for i, token in enumerate(tokens):
        if token in reflections:
            tokens[i] = reflections[token]
    return ' '.join(tokens)

def eliza_answer(user_input):
    """Answer with a random response of the first matching pattern."""
    for pattern, responses in psychobabble:
        match = re.search(pattern, str(user_input))
        if match:
            response = random.choice(responses)
            return response.format(*[reflect(g) for g in match.groups()])
    return "..."  # fallback if no pattern matches

user_input = "I am so ashamed of who I was"
print(">>", user_input)
print(eliza_answer(user_input))

user_input = "Its just that I can't forget about it"
print(">>", user_input)
print(eliza_answer(user_input))

user_input = "How should I go on from here?"
print(">>", user_input)
print(eliza_answer(user_input))

>> I am so ashamed of who I was
How long have you been so ashamed of who you were?
>> Its just that I can't forget about it
Perhaps you could forget about it if you tried.
>> How should I go on from here?
Perhaps the answer lies within yourself?

More ELIZA-like example code can be found in the repository graylu21/ELIZA-ChatterBot.

21.4.10. Developing Regex solutions#

Luckily, there are nice tools to help develop regular expressions more interactively. It remains a game of often tricky puzzles, but such tools often help a lot.

Try, for instance, to distinguish valid email addresses from invalid ones:

dummy.dummy@something.com
this goes to all people@all
This-is-my-mail@my-mail.home.com
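
One possible starting point for this puzzle is sketched below. The pattern is a strongly simplified assumption of what an email address looks like (real-world address rules are considerably more permissive and complicated), so it should be seen as an exercise solution and not as a production-ready validator.

```python
import re

candidates = [
    "dummy.dummy@something.com",
    "this goes to all people@all",
    "This-is-my-mail@my-mail.home.com",
]

# Simplified assumption: local part, "@", then a domain consisting of
# at least two dot-separated parts. No whitespace allowed anywhere.
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

# fullmatch() requires the ENTIRE string to match the pattern.
valid = [c for c in candidates if re.fullmatch(email_pattern, c)]
invalid = [c for c in candidates if not re.fullmatch(email_pattern, c)]

print("valid:", valid)
print("invalid:", invalid)
```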

21.4.10.1. Let bots do the work#

To be honest, building regular expressions used to be a time-consuming and often unstructured trial and error process, at least for many people I know.

Large language models are often a good shortcut. As usual, they won’t always come up with the best solution, and sometimes their solutions are even wrong. But so are mine when it comes to regex. So, whenever something is getting a bit more complicated, or when you just want to have a quick test, language models are a decent way to go. Obviously, never just copy+paste their suggestion, but try it out on some test samples!

The following was just a quick result from ChatGPT 4o on the prompt

“Please provide a regex for Python for finding all instances in my_text which are regular english verbs in past tense. Exclude cases such as red”.

import re
pattern = r'\b\w{3,}ed\b'
matches = re.findall(pattern, my_text, re.IGNORECASE)

# Output the matches
print(matches)

['lived', 'decided', 'appeared', 'transformed', 'tricked', 'hyphenated', 'Fooled', 'continued']
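
Trying a suggested pattern on a few test samples can be as simple as the following sketch. The test words and their expected outcomes are hand-picked assumptions for this illustration. The check reveals that the pattern above also matches words that are not verbs and misses short verbs such as “used”.

```python
import re

pattern = r"\b\w{3,}ed\b"  # the suggested pattern from above

# Hand-picked test words with the expected outcome (True = should match).
test_cases = {
    "walked": True,
    "decided": True,
    "used": True,
    "red": False,      # explicitly excluded in the prompt
    "bed": False,
    "hundred": False,  # ends with "ed" but is not a verb
    "indeed": False,   # ends with "ed" but is not a verb
}

# Collect all words for which the pattern disagrees with our expectation.
failures = {
    word: expected
    for word, expected in test_cases.items()
    if bool(re.fullmatch(pattern, word)) != expected
}

for word, expected in failures.items():
    print(f"{word!r}: expected match={expected}, got match={not expected}")
```

A purely pattern-based approach cannot know whether a word is a verb, so for tasks like this one we will sooner or later need proper natural language processing tools.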