Filtering data is a foundational task in data analysis with pandas, enabling you to focus on the relevant subsets of a dataset. Beyond basic filtering with loc and iloc, pandas offers powerful options for more complex filtering needs. Let me introduce advanced filtering techniques using regular expressions and custom functions, accompanied by practical code examples to enhance your data analysis workflow.
Using Regular Expressions for Text Data
When dealing with textual data, regular expressions (regex) are invaluable for complex pattern matching and data filtering. Pandas integrates seamlessly with regex through its string methods, allowing for efficient and flexible text data filtering.
import pandas as pd

# Sample DataFrame
data = {'Name': ['John Doe', 'Jane Smith', 'Alex Brown'],
        'Email': ['john.doe@example.com', 'jane.smith@example.net', 'alex.brown@example.org']}
df = pd.DataFrame(data)

# Filtering emails from a specific domain
filtered_df = df[df['Email'].str.contains(r'example\.com$', regex=True)]
print(filtered_df)
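If the column contains mixed-case text or missing values, str.contains also accepts case and na arguments. The short sketch below reuses the df defined above and shows a case-insensitive match that treats NaN entries as non-matches.

# Case-insensitive match; na=False makes missing emails evaluate to False
filtered_ci = df[df['Email'].str.contains(r'EXAMPLE\.COM$', case=False, regex=True, na=False)]
print(filtered_ci)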
Custom Filtering Functions with apply
For more customized data filtering, pandas' apply method lets you apply a function element-wise to a Series, or row- or column-wise to a DataFrame. This is particularly useful when the filtering criteria are too complex to express with the standard string or comparison methods.
# Custom function to filter based on email length
def filter_email_length(email):
    return len(email) >= 20

# Applying the custom filter function
filtered_df = df[df['Email'].apply(filter_email_length)]
print(filtered_df)
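When a filter depends on several columns at once, DataFrame.apply with axis=1 passes each row to the function. The sketch below reuses the df defined earlier; the specific criterion (last name appears in the email address and the address ends with .com) is just an illustration.

# Row-wise filter combining the Name and Email columns
def name_matches_com(row):
    last_name = row['Name'].split()[-1].lower()
    return last_name in row['Email'] and row['Email'].endswith('.com')

filtered_rows = df[df.apply(name_matches_com, axis=1)]
print(filtered_rows)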