This article explores techniques for cleaning, transforming, and analyzing text data in Pandas DataFrames.
Loading Text Data
Text data can be loaded into a Pandas DataFrame using functions like read_csv:
import pandas as pd
# Load text data from a CSV file
data = pd.read_csv("text_data.csv")
print(data.head())
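By default, read_csv turns empty fields and strings such as "NA" into missing values, which may not be what you want for free text. The sketch below compares the default behaviour with a stricter load. The CSV content and column names are made up for illustration and are read from memory rather than from text_data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for text_data.csv
csv_text = "id,text_column\n1,  Hello World \n2,NA\n3,\n"

# Default parsing: the literal string "NA" and the empty field both become NaN
default = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every column as text; keep_default_na=False stops
# pandas from converting strings like "NA" into missing values
raw = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(default["text_column"].isna().sum())
print(raw["text_column"].tolist())
```

Which variant is appropriate depends on whether "NA" is a real value in your data or a marker for a missing one.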
Basic Text Operations
Pandas provides built-in string methods for text manipulation. These methods are accessed using the .str accessor:
- data['column'].str.lower(): Converts text to lowercase.
- data['column'].str.upper(): Converts text to uppercase.
- data['column'].str.strip(): Removes leading and trailing whitespace.
# Convert text to lowercase
data['cleaned_text'] = data['text_column'].str.lower()
# Strip whitespace
data['cleaned_text'] = data['cleaned_text'].str.strip()
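The .str methods return a new Series, so they can be chained, and they pass missing values through instead of raising an error. A minimal sketch on a small made-up DataFrame (the sample values are assumptions for illustration):

```python
import pandas as pd

# Small illustrative frame; the third entry is missing
df = pd.DataFrame({"text_column": ["  Hello World ", "PANDAS\t", None]})

# Chain strip and lower in one expression; the missing value stays missing
df["cleaned_text"] = df["text_column"].str.strip().str.lower()

print(df)
```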
Handling Missing Text Data
Deal with missing values in text columns using Pandas functions:
- data['column'].fillna('default_value'): Fill missing values with a default string.
- data.dropna(subset=['column']): Drop rows with missing text data.
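# Option 1: fill missing values with a placeholder string
data['text_column'] = data['text_column'].fillna('Unknown')
# Option 2: drop rows with missing text instead (has no effect if the values were already filled)
data = data.dropna(subset=['text_column'])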
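Filling and dropping are alternatives, and neither treats an empty string as missing. The sketch below, on made-up data, counts missing entries, applies each option to the original column, and shows one possible way to convert blank strings into missing values first.

```python
import pandas as pd

df = pd.DataFrame({"text_column": ["first", None, "", "last"]})

# Count missing entries before deciding what to do with them
n_missing = df["text_column"].isna().sum()

# Alternative 1: keep every row, substitute a placeholder
filled = df["text_column"].fillna("Unknown")

# Alternative 2: remove the rows that have no text
dropped = df.dropna(subset=["text_column"])

# Empty or whitespace-only strings are not missing values;
# mask() turns them into missing values where the condition is True
is_blank = df["text_column"].str.strip().eq("")
normalized = df["text_column"].mask(is_blank)

print(n_missing, len(dropped), normalized.isna().sum())
```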
Text Splitting and Joining
Split text into multiple parts or join text elements together:
# Split text into parts
data['split_column'] = data['text_column'].str.split(' ')
# Join text elements
data['joined_text'] = data['split_column'].str.join('-')
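str.split returns a list per row by default. Passing expand=True spreads the parts into separate columns, and n limits the number of splits. A short sketch with hypothetical name data:

```python
import pandas as pd

df = pd.DataFrame({"full_name": ["Ada Lovelace", "Grace Brewster Hopper"]})

# Split once on the first space and expand the parts into columns 0 and 1
parts = df["full_name"].str.split(" ", n=1, expand=True)
df["first"] = parts[0]
df["rest"] = parts[1]

# Split on any whitespace (the default), then join the pieces with hyphens
df["slug"] = df["full_name"].str.lower().str.split().str.join("-")

print(df)
```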
Extracting Substrings
Extract specific patterns or substrings from text using regular expressions:
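# Extract the email domain (everything after the @, including subdomains and hyphens)
# expand=False returns a Series rather than a one-column DataFrame
data['email_domain'] = data['email_column'].str.extract(r'@([\w.-]+\.\w+)', expand=False)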
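When a pattern has several capture groups, str.extract returns one column per group, and named groups become the column names. Rows that do not match yield missing values. The pattern below is a deliberately simple illustration, not a complete email validator, and the sample addresses are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"email_column": ["ann@example.com", "bob@mail.example.org", "not an email"]}
)

# Named groups: "user" before the @, "domain" after it
pattern = r"(?P<user>[\w.+-]+)@(?P<domain>[\w-]+(?:\.[\w-]+)+)"
parts = df["email_column"].str.extract(pattern)

print(parts)
```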
Finding and Replacing Text
Find and replace text using the .str.replace method:
# Replace a specific word (regex=False treats 'old_word' as a literal string, not a pattern)
data['text_column'] = data['text_column'].str.replace('old_word', 'new_word', regex=False)
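The regex argument decides whether the search string is a literal or a regular expression, which matters for characters such as "." that have a special meaning in patterns. A sketch on made-up strings:

```python
import pandas as pd

s = pd.Series(["price: $5.00", "a.b.c", "colour and  color"])

# Literal replacement: "." is an ordinary character here
literal = s.str.replace(".", "-", regex=False)

# Pattern replacement: "colou?r" matches both spellings
pattern = s.str.replace(r"colou?r", "shade", regex=True)

# Pattern replacement: collapse runs of whitespace into a single space
collapsed = s.str.replace(r"\s+", " ", regex=True)

print(literal.tolist())
print(pattern.tolist())
print(collapsed.tolist())
```

Passing regex explicitly is a reasonable habit, since the default has differed between pandas versions.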
Analyzing Text Data
Use Pandas and Python libraries for basic text analysis:
- Count characters: data['column'].str.len().
- Count words: data['column'].str.split().str.len().
- Find occurrences of a substring: data['column'].str.contains('substring').
- Apply custom text analysis functions using apply.
# Count characters in each text entry
data['char_count'] = data['text_column'].str.len()
# Count words in each text entry (.str.len() also handles missing values, unlike apply(len))
data['word_count'] = data['text_column'].str.split().str.len()
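The sketch below combines these operations on a small made-up column: a case-insensitive substring test, a count of how often a word occurs, and a custom function passed to apply. The helper longest_word is hypothetical and only serves as an example of a custom analysis.

```python
import pandas as pd

df = pd.DataFrame(
    {"text_column": ["Pandas makes text easy", "text text TEXT", None]}
)

# True where the entry contains "text" in any case; missing entries count as False
df["has_text"] = df["text_column"].str.contains("text", case=False, na=False)

# Number of times the whole word "text" occurs, ignoring case
df["text_hits"] = df["text_column"].str.count(r"(?i)\btext\b")

# Number of whitespace-separated words per entry
df["word_count"] = df["text_column"].str.split().str.len()


def longest_word(text):
    """Return the longest word in text, or None for non-string input."""
    if not isinstance(text, str):
        return None
    return max(text.split(), key=len, default="")


df["longest_word"] = df["text_column"].apply(longest_word)

print(df)
```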
Advanced Text Handling with External Libraries
For more advanced text processing, integrate Pandas with libraries like NLTK or spaCy:
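import nltk
from nltk.tokenize import word_tokenize
# word_tokenize needs the Punkt tokenizer models; download them once
# (newer NLTK releases name this resource 'punkt_tab')
nltk.download('punkt')
# Tokenize text column (assumes no missing values; fill or drop them first)
data['tokens'] = data['text_column'].apply(word_tokenize)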
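If an external tokenizer is not available, a rough approximation can be built with Pandas alone. The sketch below is not equivalent to word_tokenize: it simply lowercases the text and keeps runs of letters, digits, and apostrophes, then counts token frequencies. The sample sentences and the pattern are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"text_column": ["Hello, world!", "It's 3 p.m."]})

# Crude tokenizer: lowercase, then keep runs of letters, digits, and apostrophes
df["tokens"] = df["text_column"].str.lower().str.findall(r"[a-z0-9']+")

# One row per token, then count how often each token appears
token_counts = df["tokens"].explode().value_counts()

print(df["tokens"].tolist())
print(token_counts)
```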
Explore more in the Pandas Documentation.