How to handle text data in Pandas

This article explores techniques for cleaning, transforming, and analyzing text data in Pandas DataFrames.

Loading Text Data

Text data can be loaded into a Pandas DataFrame using functions like read_csv:


import pandas as pd

# Load text data from a CSV file
data = pd.read_csv("text_data.csv")
print(data.head())
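
By default, read_csv infers column types and treats strings such as "NA" as missing values, which can silently change text columns. The sketch below is a minimal illustration of keeping such values as text; the in-memory CSV and the column names id and text_column are assumptions made for the example, not part of any real file.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for "text_data.csv"
csv_text = io.StringIO(
    "id,text_column\n"
    "1,Hello World\n"
    "2,NA\n"
    "3,007\n"
)

# dtype="string" keeps values such as "007" as text instead of the number 7;
# keep_default_na=False stops the literal text "NA" from becoming a missing value
sample = pd.read_csv(
    csv_text,
    dtype={"text_column": "string"},
    keep_default_na=False,
)
print(sample)
```

Whether "NA" should count as missing depends on the dataset, so treat keep_default_na=False as a choice to make per file rather than a default.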

Basic Text Operations

Pandas provides built-in string methods for text manipulation. These methods are accessed using the .str accessor:

  • data['column'].str.lower(): Converts text to lowercase.
  • data['column'].str.upper(): Converts text to uppercase.
  • data['column'].str.strip(): Removes leading and trailing whitespace.

# Convert text to lowercase
data['cleaned_text'] = data['text_column'].str.lower()

# Strip whitespace
data['cleaned_text'] = data['cleaned_text'].str.strip()
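
Because each .str method returns a new Series, the calls can be chained into a single expression. The following is a small sketch on made-up data (the demo frame is hypothetical); it also shows that missing entries stay missing rather than raising an error.

```python
import pandas as pd

# Hypothetical sample data, including one missing entry
demo = pd.DataFrame({"text_column": ["  Hello World ", "PANDAS\t", None]})

# Strip whitespace, then lowercase, in one chained expression
demo["cleaned_text"] = demo["text_column"].str.strip().str.lower()

# Missing values pass through the .str methods unchanged
print(demo["cleaned_text"].tolist())
```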

Handling Missing Text Data

Deal with missing values in text columns using Pandas functions:

  • data['column'].fillna('default_value'): Fill missing values with a default string.
  • data.dropna(subset=['column']): Drop rows with missing text data.

# Fill missing values
data['text_column'] = data['text_column'].fillna('Unknown')

# Drop rows with missing text
data = data.dropna(subset=['text_column'])
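
One thing to watch for: empty or whitespace-only strings are not treated as missing by fillna or dropna. A possible approach, sketched here on hypothetical data, is to convert them to missing values first and then fill or drop as usual.

```python
import pandas as pd

# Hypothetical sample data: one real missing value and one blank string
demo = pd.DataFrame({"text_column": ["alpha", None, "   ", "beta"]})

# Turn whitespace-only entries into missing values
stripped = demo["text_column"].str.strip()
demo["text_column"] = stripped.mask(stripped == "")

# Both rows are now handled by the usual tools
filled = demo["text_column"].fillna("Unknown")
dropped = demo.dropna(subset=["text_column"])

print(filled.tolist())
print(len(dropped))
```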

Text Splitting and Joining

Split text into multiple parts or join text elements together:


# Split text into parts
data['split_column'] = data['text_column'].str.split(' ')

# Join text elements
data['joined_text'] = data['split_column'].str.join('-')
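
str.split returns a list per row by default. When the parts should land in separate columns instead, the expand and n parameters are useful. A short sketch, assuming a hypothetical full_name column:

```python
import pandas as pd

# Hypothetical sample data
demo = pd.DataFrame({"full_name": ["Ada Lovelace", "Grace Brewster Hopper"]})

# n=1 splits only at the first space; expand=True returns one column per part
demo[["first", "rest"]] = demo["full_name"].str.split(" ", n=1, expand=True)

# str.join works on a column whose entries are lists of strings
demo["slug"] = demo["full_name"].str.lower().str.split(" ").str.join("-")

print(demo)
```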

Extracting Substrings

Extract specific patterns or substrings from text using regular expressions:


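# Extract the email domain (the part after "@"), including subdomains and hyphens;
# expand=False returns a Series rather than a one-column DataFrame
data['email_domain'] = data['email_column'].str.extract(r'@([\w.-]+\.\w+)', expand=False)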
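
When a pattern has several capture groups, str.extract returns a DataFrame with one column per group, and named groups become the column names. The example below is a simplified sketch: the sample addresses are made up, and the pattern is an illustration, not a full email validator.

```python
import pandas as pd

# Hypothetical sample data; the last entry does not match the pattern
demo = pd.DataFrame(
    {"email_column": ["ann@example.com", "bob@mail.example.org", "not an email"]}
)

# Named groups (?P<name>...) become column names in the result
parts = demo["email_column"].str.extract(
    r"^(?P<user>[^@\s]+)@(?P<domain>[\w.-]+\.\w+)$"
)

# Rows that do not match get missing values in every column
print(parts)
```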

Finding and Replacing Text

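Find and replace text using the .str.replace method. Pass regex=False to match the text literally, or regex=True to treat the pattern as a regular expression:


# Replace a specific word (literal match, so special characters are not interpreted)
data['text_column'] = data['text_column'].str.replace('old_word', 'new_word', regex=False)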
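
The difference between literal and regular-expression replacement matters most when the text contains characters such as "$" or ".", which have special meaning in a regex. A brief sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical sample data containing regex metacharacters
demo = pd.DataFrame({"text_column": ["price: $5.00", "cost: $12.50 (approx.)"]})

# regex=False: "$" is matched as a literal character
literal = demo["text_column"].str.replace("$", "USD ", regex=False)

# regex=True: the pattern is a regular expression; runs of digits become "#"
masked = demo["text_column"].str.replace(r"\d+", "#", regex=True)

print(literal.tolist())
print(masked.tolist())
```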

Analyzing Text Data

Use Pandas and Python libraries for basic text analysis:

  • Count characters: data['column'].str.len().
  • Check whether each entry contains a substring: data['column'].str.contains('substring').
  • Apply custom text analysis functions using apply.

# Count characters in each text entry
data['char_count'] = data['text_column'].str.len()

# Count words in each text entry (missing entries stay missing instead of raising an error)
data['word_count'] = data['text_column'].str.split().str.len()
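
To filter rows by a substring or count how often a particular word appears, str.contains and str.count can be combined with a few optional parameters. The sketch below uses made-up data; the search word "the" is only an illustration.

```python
import re

import pandas as pd

# Hypothetical sample data, including one missing entry
demo = pd.DataFrame(
    {"text_column": ["The cat sat on the mat", "No match here", None]}
)

# case=False ignores case; na=False treats missing entries as "no match",
# which makes the result safe to use as a boolean filter
has_the = demo["text_column"].str.contains("the", case=False, na=False)
matching_rows = demo[has_the]

# Count whole-word occurrences of "the" in each entry
demo["the_count"] = demo["text_column"].str.count(r"\bthe\b", flags=re.IGNORECASE)

print(matching_rows)
print(demo["the_count"].tolist())
```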

Advanced Text Handling with External Libraries

For more advanced text processing, integrate Pandas with libraries like NLTK or spaCy:


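import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's tokenizer data, downloaded once;
# the resource is named 'punkt' or 'punkt_tab' depending on the NLTK version
nltk.download('punkt')

# Tokenize text column (missing values are filled first, since word_tokenize expects a string)
data['tokens'] = data['text_column'].fillna('').apply(word_tokenize)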
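
For quick experiments where installing an external library is not practical, a rough tokenizer can be approximated with str.findall. This is a simplified stand-in, not an equivalent of word_tokenize: the pattern below is an assumption that keeps only lowercase letters, digits, and apostrophes, and it ignores most of what a real tokenizer handles.

```python
import pandas as pd

# Hypothetical sample data
demo = pd.DataFrame({"text_column": ["Hello, world!", "Pandas handles text."]})

# Lowercase the text, then collect runs of letters, digits, and apostrophes
demo["tokens"] = demo["text_column"].str.lower().str.findall(r"[a-z0-9']+")

print(demo["tokens"].tolist())
```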

Explore more in the Pandas Documentation.
