Analyzing Customers with the K-Means Clustering Algorithm

vlambda
2022-04-01

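Marketing teams are always working to learn more about who their customers are. The better a team understands its users, the better it can design marketing campaigns, promotions, special offers, and so on around customer behavior.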

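In this article, I will demonstrate how to use the K-Means clustering algorithm to segment customers by income and spending score, using the mall customer dataset (data link).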

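Clustering Model for Mall Customer Segmentation

Goal: build customer profiles based on customer income and spending score

Guidelines:

1. Data preparation, cleaning, and wrangling
2. Exploratory data analysis
3. Developing the clustering model

Data description:

1. CustomerID: unique ID of each customer
2. Genre: gender of the customer
3. Age: current age of the customer
4. Annual Income (k$): annual income of the customer (in thousands of US dollars)
5. Spending Score (1-100): spending habits of the customer (a higher score means more spending, a lower score means less)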

Data preparation, cleaning, and wrangling

# Import libraries and load the file
import pandas as pd
import numpy as np

df = pd.read_csv('/kaggle/input/mall-customers/Mall_Customers.csv')
df.info()  # check data types and total null values

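Figure: DataFrame summary

From the output, we can see that the DataFrame has 5 columns and 200 rows, and that there are no null values in the data.

Let's check whether the DataFrame contains any duplicated rows.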

# Check whether there are any duplicated rows
print(f'Total Duplicated Rows : {df.duplicated().sum()}')

Figure: number of duplicated rows
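The null and duplicate checks above can be tried on a small hand-made frame. This is only a sketch: the rows below are invented for illustration and do not come from Mall_Customers.csv.

```python
import pandas as pd

# Toy data (hypothetical values, not from the mall dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 3],
    "Genre": ["Male", "Female", "Female", "Female"],
    "Age": [19, None, 20, 20],
})

null_counts = toy.isnull().sum()          # number of nulls per column
duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row

print(null_counts)
print(f"Total Duplicated Rows : {duplicate_rows}")
```

On the real dataset both checks come back clean, so no rows need to be dropped or imputed.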

Next, let's look at a percentile summary, from the 0th to the 100th percentile, for each numeric column.

# Let's see the percentiles of each numerical column in the dataset
def percentile(df, column):
    print(f'{column} Percentile Summary :')
    for a in range(0, 101, 10):
        print(f'- {a}th Percentile : {round(np.percentile(df[column], a), 2)}')

# Percentile for Age
percentile(df, 'Age')
# Annual Income percentile
percentile(df, 'Annual Income (k$)')
# Spending Score percentile
percentile(df, 'Spending Score (1-100)')

Figure: percentile summary of the numeric columns
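The same summary can also be produced in one table with `DataFrame.quantile`. The sketch below uses randomly generated stand-in data, since the real file is not bundled here; the value ranges are assumptions chosen only to look plausible.

```python
import numpy as np
import pandas as pd

# Random stand-in data (assumed ranges, not the real mall dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(18, 71, size=200),
    "Annual Income (k$)": rng.integers(15, 138, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})

# Quantile levels 0.0, 0.1, ..., 1.0 correspond to the 0th, 10th, ..., 100th percentiles
levels = np.arange(0, 101, 10) / 100
summary = toy.quantile(levels).round(2)
print(summary)
```

Both `np.percentile` and `DataFrame.quantile` use linear interpolation by default, so the two approaches give the same numbers.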

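# Count the total for each gender
# Naming the columns explicitly keeps this working across pandas versions,
# since the default column names of value_counts().reset_index() changed in pandas 2.0
gender_total = df['Genre'].value_counts().rename_axis('Genre').reset_index(name='total')
gender_total['perc_genre'] = (gender_total['total'] / gender_total['total'].sum() * 100).round(2)
gender_total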

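Figure: number of customers by gender

So far we have checked for null values and duplicates, and displayed the percentiles of the numeric columns and the count of each unique value in the categorical column.

Next, we will explore the data in more detail to better understand the dataset.

Exploratory data analysis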

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

num_cols = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

def plot_stats(df, col_list):
    # For each column, draw a distribution plot and a box plot side by side
    for a in col_list:
        fig, ax = plt.subplots(1, 2, figsize=(9, 6))

        # histplot with a KDE curve stands in for distplot,
        # which is deprecated in recent seaborn releases
        sns.histplot(df[a], kde=True, ax=ax[0])
        sns.boxplot(x=df[a], ax=ax[1])

        # Mean in green, median in red
        ax[0].axvline(df[a].mean(), linestyle='--', linewidth=2, color='green')
        ax[0].axvline(df[a].median(), linestyle='--', linewidth=2, color='red')

        ax[0].set_ylabel('Frequency')
        ax[0].set_title('Distribution Plot')
        ax[1].set_title('Box Plot')

        plt.suptitle(a)
        plt.show()

plot_stats(df, num_cols)

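Figure: distribution and box plots of the numeric columns

"Age" and "Annual Income (k$)" are positively skewed. To limit the influence of extreme values, we replace values below the 10th percentile with the 10th percentile and values above the 90th percentile with the 90th percentile (flooring and capping).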

# Flooring and capping: replace outliers with the 10th and 90th percentiles

# Age 10th and 90th percentiles
tenth_percentile_age = np.percentile(df['Age'], 10)
ninetieth_percentile_age = np.percentile(df['Age'], 90)

df['Age'] = np.where(df['Age'] < tenth_percentile_age, tenth_percentile_age, df['Age'])
df['Age'] = np.where(df['Age'] > ninetieth_percentile_age, ninetieth_percentile_age, df['Age'])

# Annual Income 10th and 90th percentiles
tenth_percentile_annualincome = np.percentile(df['Annual Income (k$)'], 10)
ninetieth_percentile_annualincome = np.percentile(df['Annual Income (k$)'], 90)

df['Annual Income (k$)'] = np.where(df['Annual Income (k$)'] < tenth_percentile_annualincome, tenth_percentile_annualincome, df['Annual Income (k$)'])
df['Annual Income (k$)'] = np.where(df['Annual Income (k$)'] > ninetieth_percentile_annualincome, ninetieth_percentile_annualincome, df['Annual Income (k$)'])

# Check the distributions after replacing outliers with the 10th and 90th percentiles
plot_stats(df, num_cols)

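Figure: distribution and box plots after replacing outliers

After flooring and capping, the plots above show no remaining outliers in these columns.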
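Flooring and capping at fixed percentiles can also be written more compactly with `Series.clip`. The helper name `floor_and_cap` and the income values below are made up for illustration; this is a sketch of the same idea, not the code used to produce the plots above.

```python
import numpy as np
import pandas as pd

def floor_and_cap(series, lower_pct=10, upper_pct=90):
    """Clamp values to the [lower_pct, upper_pct] percentile range (hypothetical helper)."""
    lower = np.percentile(series, lower_pct)
    upper = np.percentile(series, upper_pct)
    return series.clip(lower=lower, upper=upper)

# Invented income values in k$, with a long right tail
income = pd.Series([15, 16, 20, 40, 60, 62, 78, 90, 120, 137], name="Annual Income (k$)")
capped = floor_and_cap(income)

print(pd.DataFrame({"original": income, "capped": capped}))
```

Values inside the percentile range are left unchanged; only the tails are pulled in, which is why the box plots no longer show points beyond the whiskers.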

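Most customers are female (56%), and by age group most mall visitors are young adults (the 20-35 age group).

From the scatter plot of annual income against spending score, we can see that most customers have an average income and an average spending score. Beyond that group, the dataset contains four other distinct groups based on income and spending score.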

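Overall, the scatter plot can be divided into the following categories:

1. High income, low spending
2. High income, high spending
3. Average income, average spending
4. Low income, low spending
5. Low income, high spending

Next, we will label our data with the five categories above.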
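The five groups above were read off the scatter plot by eye. One common way to sanity-check such a choice is to fit K-Means for a range of k and see where the within-cluster sum of squares (`inertia_`) stops dropping sharply. The sketch below runs on synthetic blobs whose centers are invented to mimic five income/spending groups; it is not run on the mall data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for (annual income, spending score); the centers are made up
centers = np.array([[25, 20], [25, 80], [55, 50], [88, 17], [88, 82]])
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=5.0, random_state=0)

# Fit K-Means for k = 1..10 and record the inertia of each fit
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=16)
    km.fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 11), inertias):
    print(f"k={k:2d}  inertia={inertia:10.1f}")
```

If the data really contains five compact groups, the inertia should fall steeply up to k = 5 and flatten afterwards.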

Developing the clustering model

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Normalize numeric features (columns 3 and 4: annual income and spending score)
scaled_features = MinMaxScaler().fit_transform(df.iloc[:, 3:5])

# Get 2 principal components
pca = PCA(n_components=2).fit(scaled_features)
features_2d = pca.transform(scaled_features)

# 5-centroid model
model = KMeans(n_clusters=5, init='k-means++', n_init=100, max_iter=1000, random_state=16)

# Fit to the data and predict the cluster assignment of each data point
# Note: the model is fit on the unscaled income and spending values, not on scaled_features
feature = df.iloc[:, 3:5]
km_clusters = model.fit_predict(feature.values)
km_clusters
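Once each customer has a cluster label, a per-cluster summary turns the labels into readable profiles. The sketch below is an assumption about how that step might look: it uses synthetic points around five invented centers instead of the mall data, and the column name `Cluster` and the summary column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

cols = ["Annual Income (k$)", "Spending Score (1-100)"]

# Synthetic customers scattered around five made-up (income, spending) centers
rng = np.random.default_rng(16)
centers = [(25, 20), (25, 80), (55, 50), (88, 17), (88, 82)]
points = np.vstack([rng.normal(loc=c, scale=5.0, size=(40, 2)) for c in centers])
toy = pd.DataFrame(points, columns=cols)

# Same kind of model as above, with a smaller n_init to keep the sketch fast
model = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=16)
toy["Cluster"] = model.fit_predict(toy[cols].values)

# Size and average income/spending of each cluster
profile = toy.groupby("Cluster").agg(
    customers=("Annual Income (k$)", "size"),
    mean_income=("Annual Income (k$)", "mean"),
    mean_spending=("Spending Score (1-100)", "mean"),
).round(1)
print(profile)
```

Reading the mean income and mean spending of each row against the five categories listed earlier (for example, high income with low spending) gives each numeric cluster ID a descriptive name.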