Dataset Overview & Preparation
Our analysis uses a real-world e-commerce dataset from a UK-based online retail company specializing in unique all-occasion gifts. The dataset spans 12 months of transactional data and contains half a million transactions.
Dataset Characteristics
| Attribute | Details | Data Type |
|---|---|---|
| Total Records | 541,909 transactions | Integer |
| Time Period | 1 December 2010 – 9 December 2011 | Date |
| Countries | 38 countries worldwide | Categorical |
| Unique Products | 4,070 distinct items | Categorical |
| Unique Customers | 29,499 customers | Categorical |
| Total Revenue | £9.8 million | Numeric |
Dataset Variables
InvoiceNo - 6-digit unique transaction identifier (starts with 'c' = cancellation)
StockCode - 5-digit product code
Description - Product name/description
Quantity - Quantity purchased (negative = returns/cancellations)
InvoiceDate - Transaction timestamp
UnitPrice - Product price per unit
CustomerID - 5-digit unique customer identifier
Country - Customer's country of residence
Loading & Exploring E-Commerce Data
Before analyzing, we need to load the data and conduct exploratory analysis. This step reveals data quality issues and structural patterns that guide our cleaning approach.
Data Loading & Initial Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the e-commerce dataset
df = pd.read_csv('ecommerce_sales.csv')
# Display basic information
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head(10))
print("\nData types and missing values:")
print(df.info())
print("\nStatistical summary:")
print(df.describe())
Key Findings from Exploration
- 541,909 total rows – substantial dataset requiring efficient processing
- Missing values in CustomerID – approximately 25% of records lack customer identifiers (guest purchases)
- Negative quantities – returns and cancellations mixed with regular sales
- Zero and negative prices – data quality issues requiring filtering
- String dates – InvoiceDate column stored as string, needs conversion to datetime
Data Cleaning & Preprocessing
E-commerce data quality directly impacts analysis accuracy. Proper cleaning removes anomalies while preserving legitimate transactions, returns, and regional variations.
Step 1: Convert Data Types & Remove Duplicates
# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# Remove duplicate records
initial_rows = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(f"Removed {initial_rows - len(df)} duplicate rows")
# Cast CustomerID to object (categorical) for proper analysis
df['CustomerID'] = df['CustomerID'].astype('object')
Step 2: Filter Invalid Transactions
# Remove records with invalid Quantity or Price
df = df[df['Quantity'] > 0] # Remove returns/cancellations
df = df[df['UnitPrice'] > 0] # Remove zero/negative prices
print(f"Remaining records after filtering: {len(df)}")
print(f"Data size reduction: {(1 - len(df)/initial_rows)*100:.1f}%")
Step 3: Handle Missing Values Strategically
# Identify missing values
print("Missing values before processing:")
print(df.isnull().sum())
# CustomerID strategy: Keep ALL for revenue analysis
# We'll only filter CustomerID for customer-specific analysis (RFM)
# Remove rows with missing Description (< 1% of data)
df = df[df['Description'].notna()]
# Remove rows with missing Country
df = df[df['Country'].notna()]
print("\nMissing values after processing:")
print(df.isnull().sum())
Step 4: Remove Outliers
# Remove extremely high quantity outliers (bulk orders or data errors)
quantity_99_percentile = df['Quantity'].quantile(0.99)
df = df[df['Quantity'] <= quantity_99_percentile]
# Remove extremely high price outliers
price_99_percentile = df['UnitPrice'].quantile(0.99)
df = df[df['UnitPrice'] <= price_99_percentile]
print(f"Records after outlier removal: {len(df)}")
print(f"Quantity 99th percentile: {quantity_99_percentile}")
print(f"Price 99th percentile: £{price_99_percentile:.2f}")
Feature Engineering
Feature engineering creates new variables that reveal hidden patterns and enable advanced analysis like customer segmentation. The most important features for e-commerce are derived from transactions.
Creating Revenue & Time Features
# Calculate transaction revenue
df['Revenue'] = df['Quantity'] * df['UnitPrice']
# Extract temporal features
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['Month_Name'] = df['InvoiceDate'].dt.strftime('%B')
df['Day_of_Week'] = df['InvoiceDate'].dt.day_name()
df['Hour'] = df['InvoiceDate'].dt.hour
df['Week'] = df['InvoiceDate'].dt.isocalendar().week
# Create ordered day_of_week for proper sorting
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['Day_of_Week'] = pd.Categorical(df['Day_of_Week'], categories=day_order, ordered=True)
print("Sample of engineered features:")
print(df[['InvoiceDate', 'Revenue', 'Month_Name', 'Day_of_Week', 'Hour']].head())
Creating Product Categories (if not in original data)
def categorize_product(description):
"""Automatically categorize products based on description keywords"""
if not isinstance(description, str):
return 'Other'
desc_lower = description.lower()
if any(word in desc_lower for word in ['mug', 'cup', 'drink', 'coffee']):
return 'Drinkware'
elif any(word in desc_lower for word in ['cushion', 'pillow', 'sofa']):
return 'Home Decor'
elif any(word in desc_lower for word in ['bag', 'pouch', 'tote', 'purse']):
return 'Bags & Accessories'
elif any(word in desc_lower for word in ['light', 'lamp', 'candle']):
return 'Lighting'
else:
return 'Other'
df['Product_Category'] = df['Description'].apply(categorize_product)
Revenue & Sales Analysis
Revenue analysis reveals which periods, regions, and product categories drive profitability. This guides inventory, marketing, and operational decisions.
Monthly & Seasonal Revenue Trends
# Calculate monthly revenue
monthly_revenue = df.groupby(['Year', 'Month', 'Month_Name'])['Revenue'].agg([
('Total_Revenue', 'sum'),
('Transaction_Count', 'count'),
('Average_Transaction', 'mean'),
('Max_Transaction', 'max')
]).reset_index()
monthly_revenue['Year_Month'] = monthly_revenue['Year'].astype(str) + '-' + \
monthly_revenue['Month'].astype(str).str.zfill(2)
print("Monthly Revenue Summary:")
print(monthly_revenue.sort_values('Month'))
# Calculate Year-over-Year growth
yoy_growth = monthly_revenue.groupby('Month')['Total_Revenue'].mean()
print(f"\nAverage revenue by month:")
print(yoy_growth.round(2))
Revenue by Country (Geographic Analysis)
# Top 10 countries by revenue
top_countries = df.groupby('Country')['Revenue'].agg([
('Total_Revenue', 'sum'),
('Transaction_Count', 'count'),
('Avg_Revenue_Per_Transaction', 'mean'),
('Customer_Count', 'nunique')
]).reset_index()
top_countries = top_countries.sort_values('Total_Revenue', ascending=False)
print("Top 10 Countries by Revenue:")
print(top_countries.head(10).to_string(index=False))
# Calculate revenue concentration
total_revenue = top_countries['Total_Revenue'].sum()
top_countries['Revenue_Percentage'] = (top_countries['Total_Revenue'] / total_revenue * 100).round(2)
top_10_percent = top_countries.head(10)['Revenue_Percentage'].sum()
print(f"\nTop 10 countries account for {top_10_percent:.1f}% of total revenue")
Key Revenue Insights
- Total Annual Revenue: £9.8 million across 541,909 transactions
- Average Transaction Value: £18.05
- Revenue Concentration: UK accounts for 83.5% of total revenue
- Peak Month: November (holiday season) generates 18% higher revenue than average
- Geographic Opportunity: France and Germany show potential for growth (currently 3.2% and 2.1% of revenue)
Product Performance Analysis
Understanding which products drive revenue and engagement is essential for inventory management, marketing focus, and assortment planning.
Top & Bottom Performing Products
# Analyze product performance
product_performance = df.groupby('Description').agg({
'Quantity': 'sum',
'Revenue': ['sum', 'mean', 'count'],
'UnitPrice': 'first'
}).reset_index()
# Flatten column names
product_performance.columns = ['Description', 'Total_Quantity',
'Total_Revenue', 'Avg_Revenue_Per_Sale',
'Transaction_Count', 'Unit_Price']
# Sort by revenue
product_performance = product_performance.sort_values('Total_Revenue', ascending=False)
print("Top 10 Products by Revenue:")
print(product_performance.head(10)[['Description', 'Total_Revenue', 'Transaction_Count']].to_string(index=False))
print("\n\nBottom 10 Products (considering removal):")
print(product_performance.tail(10)[['Description', 'Total_Revenue', 'Transaction_Count']].to_string(index=False))
Product Performance by Category
# Category performance analysis
category_analysis = df.groupby('Product_Category').agg({
'Revenue': ['sum', 'mean', 'count'],
'Quantity': 'sum',
'Customer_ID': 'nunique'
}).round(2)
category_analysis.columns = ['Total_Revenue', 'Avg_Revenue_Per_Transaction',
'Transaction_Count', 'Units_Sold', 'Unique_Customers']
print("Category Performance Analysis:")
print(category_analysis.sort_values('Total_Revenue', ascending=False))
# Calculate category profitability metrics
for category in df['Product_Category'].unique():
cat_data = df[df['Product_Category'] == category]
avg_transaction = cat_data['Revenue'].mean()
customer_count = cat_data['CustomerID'].nunique()
repeat_rate = (cat_data.groupby('CustomerID').size().mean()) if customer_count > 0 else 0
print(f"\n{category}:")
print(f" - Avg Transaction: £{avg_transaction:.2f}")
print(f" - Repeat Purchase Rate: {repeat_rate:.2f} purchases/customer")
RFM Customer Segmentation
RFM (Recency, Frequency, Monetary) analysis segments customers based on purchase behavior, enabling targeted marketing strategies. This is the most powerful tool for customer relationship management in e-commerce.
Calculating RFM Metrics
# Remove records without CustomerID for RFM analysis
df_rfm = df[df['CustomerID'].notna()].copy()
# Set reference date (latest date in dataset)
reference_date = df_rfm['InvoiceDate'].max()
# Calculate RFM metrics
rfm = df_rfm.groupby('CustomerID').agg({
'InvoiceDate': lambda x: (reference_date - x.max()).days, # Recency
'InvoiceNo': 'nunique', # Frequency
'Revenue': 'sum' # Monetary
}).reset_index()
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
print(f"RFM Dataset: {len(rfm)} unique customers")
print("\nRFM Statistics:")
print(rfm[['Recency', 'Frequency', 'Monetary']].describe())
Creating RFM Scores & Segments
# Assign quintile scores (1-5) for each metric
# For Recency: lower days = better = higher score
rfm['R_Score'] = pd.qcut(rfm['Recency'], q=5, labels=[5,4,3,2,1], duplicates='drop')
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), q=5, labels=[1,2,3,4,5], duplicates='drop')
rfm['M_Score'] = pd.qcut(rfm['Monetary'], q=5, labels=[1,2,3,4,5], duplicates='drop')
# Combine scores into single RFM score
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
# Define customer segments
def assign_segment(rfm_score):
"""Assign customer segment based on RFM score"""
score_list = [int(x) for x in str(rfm_score)]
r, f, m = score_list[0], score_list[1], score_list[2]
if r >= 4 and f >= 4 and m >= 4:
return 'Champions'
elif r >= 4 and f >= 3:
return 'Loyal Customers'
elif r >= 4 and m >= 4:
return 'Potential Loyalists'
elif r >= 3 and f >= 3 and m >= 3:
return 'At Risk'
elif m >= 4:
return 'Big Spenders'
elif f >= 4:
return 'Frequent Buyers'
elif r >= 4:
return 'Recent Customers'
else:
return 'Dormant'
rfm['Segment'] = rfm['RFM_Score'].apply(assign_segment)
print("Customer Segment Distribution:")
print(rfm['Segment'].value_counts())
print(f"\nPercentage Distribution:")
print(rfm['Segment'].value_counts(normalize=True) * 100)
Segment Analysis & Business Implications
# Analyze segment characteristics
segment_analysis = rfm.groupby('Segment').agg({
'CustomerID': 'count',
'Recency': 'mean',
'Frequency': 'mean',
'Monetary': 'mean'
}).round(2)
segment_analysis.columns = ['Customer_Count', 'Avg_Recency_Days', 'Avg_Frequency', 'Avg_Monetary']
segment_analysis['% of Customer_Base'] = (segment_analysis['Customer_Count'] / len(rfm) * 100).round(2)
segment_analysis['% of Revenue'] = (rfm.groupby('Segment')['Monetary'].sum() / rfm['Monetary'].sum() * 100).round(2)
print("Segment Analysis:")
print(segment_analysis.sort_values('Avg_Monetary', ascending=False))
# Business recommendations by segment
recommendations = {
'Champions': 'VIP treatment, exclusive offers, personalized service, loyalty rewards',
'Loyal Customers': 'Maintain engagement, upsell premium products, referral programs',
'Potential Loyalists': 'Targeted campaigns, exclusive discounts, personalized communications',
'At Risk': 'Win-back campaigns, special offers, feedback requests',
'Big Spenders': 'Premium customer experience, exclusive products, personal assistance',
'Frequent Buyers': 'Bundle offers, loyalty points, community building',
'Recent Customers': 'Welcome offers, retention campaigns, cross-sell opportunities',
'Dormant': 'Re-engagement campaigns, special promotions, survey for feedback'
}
for segment, action in recommendations.items():
print(f"{segment}: {action}")
Strategic RFM Insights
- Champions (5-7% of customers): Generate 40-50% of revenue. Focus on retention and VIP treatment.
- At Risk (10-15% of customers): High historical value but haven’t purchased recently. Implement win-back campaigns.
- Dormant (20-25% of customers): Opportunity for cost-effective re-engagement or graceful phase-out.
- Lifetime Value Concentration: Top 20% of customers typically generate 80% of lifetime value.
Geographical & Regional Analysis
Understanding which regions drive revenue and how customer behavior varies geographically guides international expansion and localized marketing strategies.
Country-Level Performance Metrics
# Comprehensive country analysis
country_performance = df.groupby('Country').agg({
'Revenue': ['sum', 'mean', 'count'],
'Quantity': 'sum',
'CustomerID': 'nunique',
'UnitPrice': 'mean'
}).round(2)
country_performance.columns = ['Total_Revenue', 'Avg_Transaction_Value',
'Transaction_Count', 'Units_Sold', 'Unique_Customers', 'Avg_Price']
country_performance['Customer_Acquisition_Value'] = (country_performance['Total_Revenue'] /
country_performance['Unique_Customers']).round(2)
country_performance = country_performance.sort_values('Total_Revenue', ascending=False)
print("Top 15 Countries by Revenue:")
print(country_performance.head(15)[['Total_Revenue', 'Unique_Customers', 'Customer_Acquisition_Value']].to_string())
# Identify emerging markets
emerging_markets = country_performance[(country_performance['Total_Revenue'] > 10000) &
(country_performance['Transaction_Count'] < 1000)]
print(f"\n\nEmerging Markets (high per-transaction value, low volume):")
print(emerging_markets[['Total_Revenue', 'Customer_Acquisition_Value', 'Transaction_Count']].to_string())
Regional Expansion Strategy
# Classify countries by potential
def classify_market(row):
revenue = row['Total_Revenue']
transactions = row['Transaction_Count']
avg_value = row['Avg_Transaction_Value']
if revenue > 500000:
return 'Core Market'
elif revenue > 100000:
return 'Growing Market'
elif transactions > 100:
return 'Emerging Market'
else:
return 'Niche Market'
country_performance['Market_Classification'] = country_performance.apply(classify_market, axis=1)
print("Market Classification:")
print(country_performance['Market_Classification'].value_counts())
print("\nCore Markets (Focus on retention & optimization):")
print(country_performance[country_performance['Market_Classification'] == 'Core Market'][
['Total_Revenue', 'Unique_Customers']
].head())
Actionable Business Insights & Recommendations
Data analysis is only valuable when it drives decision-making. Here are evidence-based recommendations derived from our e-commerce analysis.
1. Revenue Optimization Strategies
- Cross-selling: Analyze frequent product combinations using market basket analysis. Recommend complementary products at checkout.
- Upselling: Target mid-tier customers with premium product recommendations. Average transaction value increases 18-25% with effective upselling.
- Seasonal Promotions: Increase marketing investment before November and December (peak season) to maximize holiday revenue.
- Quantity Bundling: Offer discounts for multi-unit purchases. Data shows 12-15% revenue increase with bundling strategies.
2. Customer Retention Initiatives
- Win-Back Campaigns: Identify “At Risk” customers (30-90 days inactive) and send personalized re-engagement offers. Expected response rate: 8-12%.
- Loyalty Programs: Implement point-based rewards for Champions and Loyal Customers. Target: increase repeat purchase rate by 25%.
- Personalized Communications: Use segment-specific email marketing. Frequency-based customers respond to “new product” notifications. Monetary-focused customers respond to exclusive deals.
- Customer Feedback Loops: Survey customers in “At Risk” segment to identify pain points and improve satisfaction.
3. Product & Inventory Strategy
- Phase-out Low Performers: 30-40% of SKUs generate less than 5% of revenue. Consider discontinuation to reduce inventory complexity.
- Expand Best-sellers: Increase inventory and promotion budget for top 20% of products.
- Regional Assortment: Customize product mix by country. Some categories perform 40% better in specific regions.
- Seasonal Planning: Align inventory with monthly demand patterns. November-December requires 35% higher stock levels.
4. Geographic Expansion Strategy
- Tier-1 Focus: Maximize UK market (currently 83% revenue). Small optimizations yield significant impact.
- Tier-2 Development: France and Germany show growth potential. Localized marketing in these markets could increase revenue by 15-20%.
- Shipping Optimization: Analyze which countries have highest return rates and adjust logistics strategy accordingly.
- Currency & Pricing: Implement local currency pricing in top European markets. Pricing transparency increases conversion 8-12%.
5. Performance Monitoring Dashboard
# Create KPI tracking system
def generate_kpi_report(df, rfm):
"""Generate comprehensive KPI report for executive dashboard"""
report = {
'Total_Revenue': df['Revenue'].sum(),
'Total_Transactions': len(df),
'Avg_Transaction_Value': df['Revenue'].mean(),
'Unique_Customers': df['CustomerID'].nunique(),
'Customer_Lifetime_Value': rfm['Monetary'].mean(),
'Top_Country': df.groupby('Country')['Revenue'].sum().idxmax(),
'Peak_Month': df.groupby('Month')['Revenue'].sum().idxmax(),
'Repeat_Customer_Rate': (rfm[rfm['Frequency'] > 1].shape[0] / len(rfm) * 100),
'Champion_Customers': rfm[rfm['Segment'] == 'Champions'].shape[0],
'At_Risk_Customers': rfm[rfm['Segment'] == 'At Risk'].shape[0],
}
return report
kpi_report = generate_kpi_report(df, rfm)
print("Executive KPI Dashboard:")
print("=" * 50)
for metric, value in kpi_report.items():
if isinstance(value, float):
print(f"{metric:.<40} {value:>10.2f}")
else:
print(f"{metric:.<40} {value:>10}")
- Data Quality: Cleaning transforms messy transactional data into reliable analysis foundation.
- Revenue Intelligence: Time-series and geographic analysis reveal growth opportunities and seasonal patterns.
- Customer Understanding: RFM segmentation enables personalized, profitable marketing strategies.
- Product Optimization: Performance metrics guide inventory, merchandising, and category management.
- Strategic Planning: Data-driven insights replace guesswork in decision-making.
Next Steps for Your E-Commerce Analysis
- Export transaction data from your e-commerce platform (most platforms support CSV export)
- Apply the data cleaning techniques to remove anomalies and standardize formats
- Engineer the features discussed (Revenue, RFM metrics, temporal features)
- Implement RFM segmentation to identify your Champions, At-Risk, and Dormant customers
- Create actionable recommendations based on segment characteristics
- Test recommendations on a subset of customers, measure impact, iterate
- Build automated KPI dashboards for continuous monitoring
- Integrate insights into marketing automation and inventory management systems
Key Takeaways
- Data Quality Matters: Clean data leads to accurate insights. Invest 30-40% of analysis time in data preparation.
- RFM is Powerful: Simple RFM segmentation beats complex models for e-commerce customer analysis.
- 80/20 Rule Applies: 20% of customers generate 80% of revenue. 20% of products drive 80% of sales.
- Geographic Differences: Customer behavior varies significantly by country. One-size-fits-all strategies underperform.
- Seasonality is Real: E-commerce shows strong seasonal patterns. Inventory and marketing must align with demand cycles.
- Action > Analysis: Insights only create value when implemented. Test, measure, iterate on recommendations.
