

Customer Segmentation
The pilot dataset for this project was provided as a case study at UCL's School of Management, developed by Dr. Wei Miao. It represents a sample of 10,000 Amazon customers selected for a pilot marketing campaign aimed at promoting Prime Membership.
The dataset includes the following features:
- user_id: Unique identifier for each customer.
- gender: Gender of the customer, represented as "F" for female and "M" for male.
- first: Number of days since the customer's first purchase.
- last: Number of days since the customer's most recent purchase.
- electronics: Total spending by the customer on electronics in the past year.
- nonelectronics: Total spending by the customer on non-electronics in the past year.
- home, sports, clothes, health, books, digital, toys: Number of purchases made by the customer in each respective product category over the past year.
- subscribe: Indicates whether the customer subscribed to Amazon Prime for one month during the pilot campaign ("yes" or "no").
- city: The city where the customer resides.
Importing Libraries
Loading the Dataset
Exploring the Dataset
Shape of the dataset (rows, columns): (10000, 15)
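The loading and inspection steps above can be sketched as follows. This is a minimal illustration, not the original notebook code. The source file is not named in this write-up, so a hypothetical two-row inline sample (`SAMPLE_CSV`, with made-up values) stands in for it; only the column names come from the feature list above.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real file.
# Column names follow the feature list above; the values are invented.
SAMPLE_CSV = """user_id,gender,first,last,electronics,nonelectronics,home,sports,clothes,health,books,digital,toys,subscribe,city
1,F,120,10,25.0,150.0,1,0,2,0,1,0,0,no,London
2,M,300,14,45.0,170.0,0,1,0,1,0,2,1,yes,Birmingham
"""

# With the real data, a file path would replace the StringIO buffer.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)         # (rows, columns)
print(df.dtypes)        # column types, useful for spotting mis-parsed fields
print(df.isna().sum())  # missing values per column
```

On the full dataset, the same `df.shape` call would be expected to report the 10,000 rows and 15 columns shown above.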
This bar chart shows the count of customers by gender (F for female, M for male). Female customers represent a significantly larger portion of the customer base than male customers.
The count of female customers is approximately double that of male customers, indicating that women might be a key demographic for this business. Marketing campaigns could be tailored to further target female customers since they form the majority of the customer base.
The box plot shows the distribution of electronics spending amounts across all customers.
The median spending on electronics is around £28, much lower than the median spending on non-electronics.
The IQR for electronics spending is relatively narrow, ranging from approximately £15 to £70, indicating that most customers spend small amounts on electronics. There is a single notable outlier with spending above £150, which suggests a very rare high-value electronics purchase. Electronics spending appears to be a less significant driver of revenue compared to non-electronics, which might warrant further investigation into the types of electronics products offered.
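The median, IQR, and outlier readings above can be reproduced numerically. The sketch below assumes the box plots use the common 1.5 × IQR whisker rule (the default in matplotlib and seaborn); the original plotting code is not shown, so that is an assumption. The function name `box_plot_stats` and the toy values are hypothetical.

```python
import pandas as pd


def box_plot_stats(values):
    """Return the summary statistics a standard box plot is drawn from."""
    s = pd.Series(values, dtype="float64")
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Assumed whisker rule: points beyond 1.5 * IQR from the quartiles are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = s[(s < lower_fence) | (s > upper_fence)].tolist()
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Toy values, not taken from the dataset: one point sits far above the rest.
toy_spend = [1, 2, 3, 4, 5, 6, 7, 8, 100]
stats = box_plot_stats(toy_spend)
print(stats)
```

Applied to the `electronics` column, a function like this would give the exact median and quartiles rather than values read off the chart.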
The box plot shows the distribution of non-electronics spending amounts across all customers.
The median spending on non-electronics is around £160, with the interquartile range (IQR) falling between approximately £100 and £250. There are no significant outliers in this category, as all data points lie within the whiskers.
The pie chart illustrates the proportion of customers who are subscribed and not subscribed to Amazon Prime. Only 8.38% of the customer base has a Prime subscription, indicating that the subscription penetration is relatively low. This presents an opportunity for the business to explore strategies to increase subscriptions, such as offering incentives, promotions, or exclusive benefits for Prime members.
This bar chart illustrates the distribution of customers across different cities. In this graph, London has the highest number of customers (5,012), significantly outnumbering other cities.
This box plot compares the purchase amounts for Prime subscribers and non-subscribers. The median purchase amount for subscribers is significantly higher than that of non-subscribers. Subscribers not only spend more at the median but also show less variation in their spending patterns.
This box plot shows the distribution of days since last purchase for customers who are subscribed and not subscribed to Amazon Prime. The median recency for Prime subscribers is lower than for non-subscribers, meaning subscribers have typically purchased more recently, which suggests they tend to buy more frequently.
This box plot compares the distribution of spending on electronics products between male and female customers. Male customers have a higher median spending on electronics (around £45) compared to females (around £25). Electronics marketing efforts could be tailored more toward male customers, as they tend to spend more and have a higher variability in spending.
This box plot compares the distribution of spending on non-electronics products between male and female customers. The median spending on non-electronics is very similar for both genders, around £160.
This box plot compares the number of purchases made in the past year between Prime subscribers and non-subscribers. Prime subscribers have a higher median number of purchases compared to non-subscribers.
This stacked bar chart shows the proportion of customers subscribed and not subscribed to Amazon Prime, broken down by city. All cities show a very small percentage of Prime subscribers, with most customers being non-subscribers. Cities like London and Birmingham, which have the largest customer bases, represent key areas to focus subscription growth efforts.
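The subscription shares behind the pie chart and the stacked bar chart can be computed directly. The sketch below uses a small made-up frame (`toy`) with the `city` and `subscribe` columns described earlier; the counts are illustrative and are not the pilot results.

```python
import pandas as pd

# Toy data, not the pilot dataset: 6 customers across two cities.
toy = pd.DataFrame({
    "city": ["London", "London", "London", "London", "Birmingham", "Birmingham"],
    "subscribe": ["no", "no", "no", "yes", "no", "yes"],
})

# Overall share of subscribers vs. non-subscribers (the pie-chart numbers).
overall_share = toy["subscribe"].value_counts(normalize=True)

# Share within each city; each row sums to 1 (the stacked-bar numbers).
share_by_city = pd.crosstab(toy["city"], toy["subscribe"], normalize="index")

print(overall_share)
print(share_by_city)
```

Normalizing by row (`normalize="index"`) makes cities of very different sizes comparable, which matters here because London accounts for about half of the sample.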
Customer segmentation is a crucial step in understanding and targeting distinct customer groups, allowing businesses to tailor their marketing strategies and improve customer satisfaction. By leveraging machine learning techniques, we can uncover patterns in customer behavior that may not be immediately apparent through traditional analysis. In this section, we use clustering to group customers based on their spending habits, recency of activity, and frequency of purchases. These groups provide actionable insights that can help optimize marketing efforts, enhance customer experiences, and increase profitability.
To achieve this, we first preprocess the data to ensure that the features are on a comparable scale, as clustering algorithms are sensitive to differences in magnitude. We focus on three key features:
- Monetary Value: The amount a customer has spent.
- Recency (Last): The number of days since the customer’s last purchase.
- Frequency: The number of purchases a customer has made.
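A minimal sketch of this preprocessing step follows. The text does not state how the three features are derived, so the sketch assumes that monetary value is `electronics + nonelectronics` and that frequency is the sum of the seven category purchase counts. It also assumes scikit-learn's `StandardScaler` for the scaling. The helper name `build_rfm_features` and the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

CATEGORY_COLS = ["home", "sports", "clothes", "health", "books", "digital", "toys"]


def build_rfm_features(df):
    """Derive the three clustering features (assumed definitions, see lead-in)."""
    features = pd.DataFrame(index=df.index)
    # Assumption: monetary value = total spend across both spending columns.
    features["monetary"] = df["electronics"] + df["nonelectronics"]
    # Recency is used as given: days since the most recent purchase.
    features["last"] = df["last"]
    # Assumption: frequency = total purchase count over all product categories.
    features["frequency"] = df[CATEGORY_COLS].sum(axis=1)
    return features


# Toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "electronics": [20.0, 50.0, 10.0, 80.0],
    "nonelectronics": [100.0, 200.0, 150.0, 300.0],
    "last": [5, 12, 20, 9],
    "home": [1, 0, 2, 3],
    "sports": [0, 1, 0, 2],
    "clothes": [1, 0, 0, 1],
    "health": [0, 0, 1, 1],
    "books": [0, 2, 0, 1],
    "digital": [0, 0, 0, 2],
    "toys": [0, 1, 0, 0],
})

features = build_rfm_features(toy)

# Standardize each feature to zero mean and unit variance so that
# monetary value (hundreds) does not dominate frequency (single digits).
scaled = StandardScaler().fit_transform(features)
print(features)
print(scaled.round(2))
```

Standardization is one common choice here; min-max scaling would also put the features on a comparable range.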
Next, we determine the optimal number of clusters. The silhouette method evaluates a candidate clustering by measuring how well each data point fits within its assigned cluster compared to the nearest neighboring cluster. Scores range from -1 to 1, with higher values indicating better-separated clusters.
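The silhouette comparison can be sketched as below. The text does not name the clustering algorithm, so K-means is assumed here, and the candidate range of 2 to 6 clusters is also an assumption. The function name `silhouette_by_k` and the synthetic data `X_toy` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=0):
    """Fit K-means (assumed algorithm) for each k and return {k: mean silhouette}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic stand-in for the scaled features: two tight, well-separated groups.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 3)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 3)),
])

scores = silhouette_by_k(X_toy, range(2, 7))
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette

for k, score in scores.items():
    print(k, round(float(score), 3))
print("best k:", best_k)
```

The silhouette score is only defined for at least two clusters, which is why the candidate range starts at 2.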
The graph shows that the optimal number of clusters is 2, as it has the highest silhouette score. This means dividing the data into two groups provides the best separation and clarity between clusters. Beyond 2, the scores drop, indicating less distinct grouping.
The table reveals two distinct customer clusters. Cluster 0, comprising 7,263 customers, represents occasional buyers with lower monetary value (173.16) and purchase frequency (1.96). Cluster 1, with 2,737 customers, consists of high-value, frequent buyers who spend significantly more (301.03) and purchase more often (8.87). Despite the difference in spending behavior, both clusters have similar recency, around 12 days. These insights highlight the larger proportion of occasional buyers, emphasizing the need for tailored strategies to engage and retain these two groups effectively.
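A cluster summary table of this kind can be built by grouping the unscaled features by cluster label. The sketch below uses made-up values and labels; the helper name `profile_clusters` and the column name `n_customers` are hypothetical, and the numbers are not the pilot results.

```python
import pandas as pd


def profile_clusters(features, labels):
    """Return cluster sizes and per-cluster feature means (on unscaled values)."""
    labelled = features.assign(cluster=list(labels))
    grouped = labelled.groupby("cluster")
    profile = grouped.mean()
    # Put the cluster size first so segment proportions are easy to read.
    profile.insert(0, "n_customers", grouped.size())
    return profile


# Toy features and labels, not taken from the dataset.
toy_features = pd.DataFrame({
    "monetary": [100.0, 120.0, 300.0, 320.0, 110.0],
    "last": [10, 12, 11, 13, 14],
    "frequency": [2, 1, 9, 8, 3],
})
toy_labels = [0, 0, 1, 1, 0]

profile = profile_clusters(toy_features, toy_labels)
print(profile)
```

Averaging the original values rather than the standardized ones keeps the table in interpretable units, such as pounds spent and number of purchases.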