
Customer Segmentation

With rising customer acquisition costs and declining profit margins, Amazon sought to optimize its marketing strategy by targeting potential Prime members. This project analyzes a pilot dataset of 10,000 customers, exploring monetary value, recency, and frequency to uncover actionable insights for boosting ROI and refining predictive models for subscription campaigns.
Timeline: Nov 2024
Field: Marketing Analytics
Dataset Description

The pilot dataset for this project was provided as a case study at UCL's School of Management, developed by Dr. Wei Miao. It represents a sample of 10,000 Amazon customers selected for a pilot marketing campaign aimed at promoting Prime Membership.

The dataset includes the following features:

  • user_id: Unique identifier for each customer.
  • gender: Gender of the customer, represented as "F" for female and "M" for male.
  • first: Number of days since the customer's first purchase.
  • last: Number of days since the customer's most recent purchase.
  • electronics: Total spending by the customer on electronics in the past year.
  • nonelectronics: Total spending by the customer on non-electronics in the past year.
  • home, sports, clothes, health, books, digital, toys: Number of purchases made by the customer in each respective product category over the past year.
  • subscribe: Indicates whether the customer subscribed to Amazon Prime for one month during the pilot campaign ("yes" or "no").
  • city: The city where the customer resides.
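The field descriptions above imply a few checks that can be run before any analysis. The sketch below is one possible sanity check, not part of the original project: `check_schema` is a hypothetical helper, and the three-row `sample` frame holds made-up values that only mirror the documented columns.

```python
import pandas as pd

# Tiny hand-made sample with the documented columns (illustrative values only).
sample = pd.DataFrame({
    "user_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "first": [40, 12, 7],
    "last": [10, 12, 7],
    "electronics": [20, 0, 55],
    "nonelectronics": [150, 90, 210],
    "home": [1, 0, 2], "sports": [0, 0, 1], "clothes": [2, 1, 0],
    "health": [0, 0, 0], "books": [1, 0, 3], "digital": [0, 0, 1],
    "toys": [0, 1, 0],
    "subscribe": ["no", "no", "yes"],
    "city": ["London", "Leeds", "London"],
})

EXPECTED_COLUMNS = [
    "user_id", "gender", "first", "last", "electronics", "nonelectronics",
    "home", "sports", "clothes", "health", "books", "digital", "toys",
    "subscribe", "city",
]


def check_schema(frame):
    """Hypothetical helper: return a list of problems (empty means the frame looks fine)."""
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if not frame["gender"].isin(["F", "M"]).all():
        problems.append("gender contains values other than 'F' and 'M'")
    if not frame["subscribe"].isin(["yes", "no"]).all():
        problems.append("subscribe contains values other than 'yes' and 'no'")
    # Days since the first purchase can never be fewer than days since the last one.
    if (frame["first"] < frame["last"]).any():
        problems.append("some rows have first < last")
    return problems


problems = check_schema(sample)
print(problems or "schema looks consistent")
```

With the real file, the same call would be `check_schema(df)` after loading the CSV.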
Exploring the Dataset

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

Loading the Dataset

df = pd.read_csv('Amazon.csv')

Exploring the Dataset

df.head()
User ID  Gender  First  Last  Electronics  Non-Electronics  Home  Sports  Clothes  Health  Books  Digital  Toys  Subscribe  City
10001    M       49     29    109          248              3     2       2        0       1      0        2     no         London
10002    M       39     27    35           103              0     1       0        1       0      0        1     no         London
10003    F       19     15    25           147              0     0       2        0       0      0        0     no         London
10004    F       7      7     15           257              0     0       0        0       1      0        0     no         London
10005    F       15     15    15           134              0     0       1        0       0      0        0     no         Birmingham

df.info()
 #   Column          Non-Null Count  Dtype
 0   user_id         10000 non-null  int64
 1   gender          10000 non-null  object
 2   first           10000 non-null  int64
 3   last            10000 non-null  int64
 4   electronics     10000 non-null  int64
 5   nonelectronics  10000 non-null  int64
 6   home            10000 non-null  int64
 7   sports          10000 non-null  int64
 8   clothes         10000 non-null  int64
 9   health          10000 non-null  int64
 10  books           10000 non-null  int64
 11  digital         10000 non-null  int64
 12  toys            10000 non-null  int64
 13  subscribe       10000 non-null  object
 14  city            10000 non-null  object

df.shape

(10000, 15)

Exploratory Data Analysis (EDA)
Gender Distribution
fig_gender_dist = px.bar(
    df['gender'].value_counts().reset_index(),
    y='count',
    x='gender',
    labels={'count': 'Count', 'gender': 'Gender'},
    title='Gender Distribution',
    color_discrete_sequence=['#4c6ef5'],
    )

This bar chart shows the number of customers by gender (F for female, M for male). Female customers make up a significantly larger share of the customer base than male customers.

The count of female customers is approximately double that of male customers, which suggests that women are a key demographic for this business. Marketing campaigns could be tailored toward female customers, since they form the majority of the customer base.

Electronics Purchase Amount Distribution
fig_electronics_box = px.box(
    df,
    y='electronics',
    title='Electronics Purchase Amount Distribution',
    color_discrete_sequence=['#4c6ef5'],)
    
fig_electronics_box.update_layout(yaxis_title='Electronics Purchase Amount (GBP)')

The box plot shows the distribution of electronics spending amounts across all customers.

The median spending on electronics is around £28, much lower than the median for non-electronics.
The IQR for electronics spending is relatively narrow, ranging from approximately £15 to £70, indicating that most customers spend small amounts on electronics. There is a single notable outlier with spending above £150, which suggests a very rare high-value electronics purchase. Electronics spending appears to be a less significant driver of revenue than non-electronics, which might warrant further investigation into the types of electronics products offered.
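The median, IQR, and outlier readings above come from the box plot, but the same quantities can be computed directly. The following is a minimal sketch on made-up spending values (not the case-study data), using the common 1.5 × IQR rule for flagging outliers. Plotly may use a slightly different quartile convention, so the numbers it shows on hover can differ a little from pandas' defaults.

```python
import pandas as pd

# Illustrative spending values in GBP, not taken from the case-study data.
spend = pd.Series([10, 15, 20, 28, 30, 45, 70, 160], name="electronics")

# Quartiles with pandas' default linear interpolation.
q1, median, q3 = spend.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = spend[(spend < lower_fence) | (spend > upper_fence)]

print(f"median={median}, IQR={q1}..{q3}, outliers={outliers.tolist()}")
```

On the real data, `spend` would simply be `df['electronics']`.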

Non-Electronics Purchase Amount Distribution
fig_nonelectronics_box = px.box(df, y='nonelectronics',
                                title='Non-Electronics Purchase Amount Distribution'
                                , color_discrete_sequence=['#4c6ef5'])
                                
fig_nonelectronics_box.update_layout(yaxis_title='Non-Electronics Purchase Amount (GBP)')

The box plot shows the distribution of non-electronics spending amounts across all customers.

The median spending on non-electronics is around £160, with the interquartile range (IQR) falling between approximately £100 and £250. There are no significant outliers in this category, as all data points lie within the whiskers.

Subscription Proportions
sub_counts = df['subscribe'].value_counts().reset_index()
sub_counts.columns = ['Subscribe', 'Count']

fig_sub = px.pie(sub_counts, names='Subscribe', values='Count',
                 title='Subscription Proportions of Customers',
                 # One color per slice; a single-color sequence would paint both slices the same
                 color_discrete_sequence=['#4c6ef5', '#fd7e14'])
                 
fig_sub.update_traces(opacity=0.9)

The pie chart illustrates the proportion of customers who subscribed to Amazon Prime during the pilot campaign versus those who did not. Only 8.38% of customers subscribed, so subscription penetration is relatively low. This presents an opportunity to explore strategies that increase subscriptions, such as incentives, promotions, or exclusive benefits for Prime members.

Location Distribution
location_counts = df['city'].value_counts().reset_index()
location_counts.columns = ['City', 'Count']

fig_location = px.bar(
    location_counts,
    x='City',
    y='Count',
    text='Count',
    title='Customer Count by City',
    color_discrete_sequence=['#4c6ef5'])

This bar chart illustrates the distribution of customers across different cities. In this graph, London has the highest number of customers (5,012), significantly outnumbering other cities.

Subscribers Contribution to Revenue
df['monetary_value'] = df['electronics'] + df['nonelectronics']
subscription_data = df.groupby('subscribe')['monetary_value'].sum().reset_index()

fig_subscription = px.bar(
    subscription_data,
    x='subscribe',
    y='monetary_value',
    text='monetary_value',
    title='Subscribers Impact on Total Purchase Amount',
    color_discrete_sequence=['#4c6ef5'])
    
fig_subscription.update_layout(xaxis_title='Subscription Status',
                               yaxis_title='Total Purchase Amount (GBP)')
Do subscribers tend to spend more?
fig_subscription_vs_spending = px.box(
    df,
    x='subscribe',
    y='monetary_value',
    title='Effect of Subscription on Spending',
    color='subscribe',
    color_discrete_sequence=['#fd7e14', '#4c6ef5'])

fig_subscription_vs_spending.update_layout(xaxis_title='Subscription Status',
                                           yaxis_title='Purchase Amount (GBP)')

This box plot compares the purchase amounts of Prime subscribers and non-subscribers. The median purchase amount for subscribers is significantly higher than that of non-subscribers. Subscribers not only spend more on average but also show less variation in their spending.
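The comparison of central tendency and spread between the two groups can also be expressed as a small summary table. This is a sketch on a made-up eight-row frame, assuming the same `subscribe` and `monetary_value` column names as above; the `iqr` helper is introduced here for illustration only.

```python
import pandas as pd

# Made-up values chosen only to illustrate the calculation.
toy = pd.DataFrame({
    "subscribe": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "monetary_value": [100, 150, 200, 400, 250, 280, 300, 320],
})


def iqr(values):
    """Interquartile range: the spread of the middle 50% of the values."""
    return values.quantile(0.75) - values.quantile(0.25)


# One row per subscription status, one column per statistic.
summary = toy.groupby("subscribe")["monetary_value"].agg(["median", "mean", "std", iqr])
print(summary)
```

A higher median together with a smaller `iqr` or `std` for the "yes" row is what "spend more and vary less" looks like numerically.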

Impact of Subscription on Recency
fig_sub_recency = px.box(
    df,
    x='subscribe',
    y='last',
    labels={'subscribe': 'Amazon Prime Subscription',
            'last': 'Days Since Last Purchase'},
    title='Impact of Subscription on Recency',
    color_discrete_sequence=['#4c6ef5'])

This box plot shows the distribution of days since the last purchase for customers who subscribed to Amazon Prime and those who did not. The median recency for Prime subscribers is lower than for non-subscribers, meaning that subscribers tend to have purchased more recently, which is consistent with more active buying.

Electronics Spending by Gender
fig_gender_box1 = px.box(
    df,
    x='gender',
    y='electronics',
    title='Electronics Spending by Gender',
    labels={'gender': 'Gender', 'electronics': 'Spending on Electronics'},
    color_discrete_sequence=['#4c6ef5'])

This box plot compares the distribution of spending on electronics products between male and female customers. Male customers have a higher median spending on electronics (around £45) compared to females (around £25). Electronics marketing efforts could be tailored more toward male customers, as they tend to spend more and have a higher variability in spending.

Non-Electronics Spending by Gender
fig_gender_box2 = px.box(
    df,
    x='gender',
    y='nonelectronics',
    title='Non-Electronics Spending by Gender',
    labels={'gender': 'Gender',
            'nonelectronics': 'Spending on Non-Electronics'},
    color_discrete_sequence=['#4c6ef5'])

This box plot compares the distribution of spending on non-electronics products between male and female customers. The median spending on non-electronics is very similar for both genders, around £160.

Purchase Frequency
# Wrap the sum in parentheses so the expression can span two lines
df['frequency'] = (df['home'] + df['sports'] + df['toys'] + df['digital']
                   + df['health'] + df['books'] + df['clothes'])
                
fig_sub_frequency = px.box(
    df,
    x='subscribe',
    y='frequency',
    labels={'subscribe': 'Amazon Prime Subscription',
            'frequency': 'Number of Purchases in the Past Year'},
    title='Impact of Subscription on Frequency',
    color_discrete_sequence=['#4c6ef5'])

This box plot compares the number of purchases made in the past year by Prime subscribers and non-subscribers. Prime subscribers have a higher median number of purchases than non-subscribers.
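The frequency feature used here is the row-wise total of the seven category purchase counts. An equivalent way to build it, sketched below, is to list the columns once and sum across them; the two rows are the first two customers from the `df.head()` output, and `category_cols` is a name introduced for this example.

```python
import pandas as pd

# The seven per-category purchase-count columns described in the dataset section.
category_cols = ["home", "sports", "clothes", "health", "books", "digital", "toys"]

# First two customers from the df.head() output shown earlier.
toy = pd.DataFrame(
    [[3, 2, 2, 0, 1, 0, 2],
     [0, 1, 0, 1, 0, 0, 1]],
    columns=category_cols,
)

# axis=1 sums across columns, giving one total per customer.
toy["frequency"] = toy[category_cols].sum(axis=1)
print(toy["frequency"].tolist())
```

Keeping the column names in a list makes it harder to forget a category than a long chain of `+` operators.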

Prime Subscription Percentage by City
fig_city_sub = px.bar(
    df.groupby('city')['subscribe'].value_counts(normalize=True).reset_index(name='percentage'),
    x='city',
    y='percentage',
    color='subscribe',
    title='Prime Subscription Percentage by City',
    labels={'city': 'City', 'percentage': 'Subscription Percentage'},
    color_discrete_sequence=['#fd7e14', '#4c6ef5'])

This stacked bar chart shows the proportion of customers subscribed and not subscribed to Amazon Prime, broken down by city. All cities show a very small percentage of Prime subscribers, with most customers being non-subscribers. Cities like London and Birmingham, which have the largest customer bases, represent key areas to focus subscription growth efforts.
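Because `value_counts(normalize=True)` returns fractions between 0 and 1, it can help to look at the per-city subscription rate as an explicit percentage alongside the customer counts. The sketch below uses a made-up eight-row frame and assumes the same `city` and `subscribe` columns; `is_sub` and `rate_pct` are illustrative names.

```python
import pandas as pd

# Made-up rows chosen only to illustrate the calculation.
toy = pd.DataFrame({
    "city": ["London", "London", "London", "London",
             "Birmingham", "Birmingham", "Leeds", "Leeds"],
    "subscribe": ["yes", "no", "no", "no", "no", "no", "yes", "no"],
})

# Turn the yes/no flag into a boolean so that sum counts subscribers
# and mean gives the subscription rate.
by_city = (
    toy.assign(is_sub=toy["subscribe"].eq("yes"))
       .groupby("city")["is_sub"]
       .agg(["sum", "size", "mean"])
       .rename(columns={"sum": "subscribers", "size": "customers", "mean": "rate"})
)
by_city["rate_pct"] = (by_city["rate"] * 100).round(1)
print(by_city.sort_values("customers", ascending=False))
```

Reading the rate next to the customer count separates "large city" from "high-converting city", which the stacked proportions alone do not show.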

Machine Learning

Customer segmentation is a crucial step in understanding and targeting distinct customer groups, allowing businesses to tailor their marketing strategies and improve customer satisfaction. By leveraging machine learning techniques, we can uncover patterns in customer behavior that may not be immediately apparent through traditional analysis. In this section, we use clustering to group customers based on their spending habits, recency of activity, and frequency of purchases. These groups provide actionable insights that can help optimize marketing efforts, enhance customer experiences, and increase profitability.

To achieve this, we first preprocess the data to ensure that the features are on a comparable scale, as clustering algorithms are sensitive to differences in magnitude. We focus on three key features:

  • Monetary Value: The amount a customer has spent.
  • Recency (Last): The number of days since the customer’s last purchase.
  • Frequency: The number of purchases a customer has made.
from sklearn.preprocessing import StandardScaler

# Select features for clustering
features = df[['monetary_value', 'last', 'frequency']]

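# Handle missing values if any (df.info() reports none, so no rows are dropped
# and the cluster labels assigned later still line up with df)
features = features.dropna()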

# Standardize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
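To make the scaling step concrete: standardization subtracts each column's mean and divides by its standard deviation, so that spending in pounds does not dominate purchase counts in the distance calculation. The sketch below reproduces that arithmetic with NumPy on three made-up rows; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Made-up rows: monetary value (GBP), recency (days), frequency (purchases).
X = np.array([
    [100.0, 5.0, 1.0],
    [200.0, 10.0, 3.0],
    [300.0, 30.0, 8.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# z-score: every column ends up with mean 0 and standard deviation 1.
Z = (X - mean) / std
print(Z.round(3))
```

After this transformation, a one-unit difference means "one standard deviation" in every column, regardless of the original units.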
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

silhouette_scores = []
k_range = range(2, 11) 

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_features)
    score = silhouette_score(scaled_features, kmeans.labels_)
    silhouette_scores.append(score)

fig_kmeans = px.line(
    x=list(k_range),
    y=silhouette_scores,
    labels={'x': 'Number of Clusters', 'y': 'Silhouette Score'},
    title='Silhouette Method for Optimal K',
    color_discrete_sequence=['#4c6ef5'])

To choose the number of clusters, we use the silhouette method, which measures how well each data point fits within its assigned cluster compared with the nearest neighboring cluster. Scores range from -1 to 1, and higher values indicate more clearly separated clusters.
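For a single point, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below works this out by hand on four made-up one-dimensional points; `silhouette_values` is an illustrative helper that assumes every cluster has at least two points, and it is not a replacement for scikit-learn's `silhouette_score`.

```python
import numpy as np

# Four made-up points on a line, forming two obvious groups.
points = np.array([0.0, 1.0, 10.0, 11.0])
labels = np.array([0, 0, 1, 1])


def silhouette_values(points, labels):
    """Per-point silhouette values for 1-D data (illustrative helper)."""
    dist = np.abs(points[:, None] - points[None, :])  # pairwise distances
    values = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dist[i, same].mean()  # cohesion: mean distance within own cluster
        b = min(                  # separation: nearest other cluster
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        values.append((b - a) / max(a, b))
    return np.array(values)


scores = silhouette_values(points, labels)
print(scores.round(3), round(scores.mean(), 3))
```

The overall silhouette score for a given k is the mean of these per-point values, which is the quantity plotted against the number of clusters.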

The graph shows that the optimal number of clusters is 2, as it has the highest silhouette score. This means dividing the data into two groups provides the best separation and clarity between clusters. Beyond 2, the scores drop, indicating less distinct grouping.

# Choose the optimal number of clusters
optimal_k = 2

# Fit the K-Means model
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
clusters = kmeans.fit_predict(scaled_features)

# Add cluster labels to the original dataset
df['Cluster'] = clusters
# Group by cluster and calculate mean values for each feature
cluster_analysis = df.groupby('Cluster')[['monetary_value', 'last', 'frequency']].mean()
# Add the number of customers in each cluster
cluster_analysis['count'] = df.groupby('Cluster').size()
print(cluster_analysis.round(2))
Cluster  Monetary Value  Last   Frequency  Count
0        173.16          12.19  1.96       7263
1        301.03          12.44  8.87       2737

The table reveals two distinct customer clusters. Cluster 0, comprising 7,263 customers, represents occasional buyers with lower monetary value (173.16) and purchase frequency (1.96). Cluster 1, with 2,737 customers, consists of high-value, frequent buyers who spend significantly more (301.03) and purchase more often (8.87). Despite the difference in spending behavior, both clusters have similar recency, around 12 days. These insights highlight the larger proportion of occasional buyers, emphasizing the need for tailored strategies to engage and retain these two groups effectively.
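One way to connect these segments back to the Prime campaign is to compare the subscription rate within each cluster. The source does not report this breakdown, so the sketch below only shows the mechanics on a made-up eight-row frame that reuses the `Cluster` and `subscribe` column names.

```python
import pandas as pd

# Made-up labels chosen only to illustrate the calculation.
toy = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 1, 1, 1, 1],
    "subscribe": ["no", "no", "no", "yes", "no", "yes", "yes", "no"],
})

# normalize="index" turns each row into shares that sum to 1 per cluster.
rate_by_cluster = pd.crosstab(toy["Cluster"], toy["subscribe"], normalize="index")
print(rate_by_cluster)
```

If one cluster subscribes at a clearly higher rate on the real data, that segment would be the natural first target for a wider rollout.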

df["Cluster"] = df["Cluster"].astype(str)
fig_clusters = px.scatter(
    df,
    x="monetary_value",
    y="frequency",
    color="Cluster",
    title="Customer Segments",
    labels={"monetary_value": "Monetary Value", "frequency": "Frequency", "Cluster": "Cluster"},
    color_discrete_sequence=['#4c6ef5','#fd7e14'])


fig_clusters.update_layout(
    xaxis_title="Monetary Value",
    yaxis_title="Frequency",
    legend_title="Cluster")