KMeans Clustering Case Study (Automobile Product Clustering Analysis)

Tianchi Competition: Automobile Product Clustering Analysis (KMeans + PCA)

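Preface

This is a Tianchi competition on product clustering analysis. The task provides a car purchase table. The dataset is small and fairly easy to analyze, yet still quite representative.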

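Task requirements: with competitor analysis as the background, the task is to cluster the data and assign each car to a cluster. For a given car model, the clustering can then be used to find its competing models. Let's start the analysis right away (all of the code runs in a notebook).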

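Some of the images in this blog post may be hard to read, and some of the displayed output is inconvenient to show here. If you want a closer look, you can fork my notebook from the link below.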

0. Imports and Plot Settings in the Notebook

import pandas as pd

from sklearn.preprocessing import LabelEncoder, StandardScaler

from sklearn.cluster import KMeans

from sklearn.metrics import silhouette_score

from pylab import *

import matplotlib.pyplot as plt

import seaborn as sns

import warnings

# Suppress warnings

warnings.filterwarnings("ignore")

# Render plots inline

%matplotlib inline

%config InlineBackend.figure_format = 'svg'

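# Display minus signs correctly

plt.rcParams['axes.unicode_minus'] = False

# Display Chinese characters correctly (assumption: the original setting was lost; SimHei is a commonly used font and must be installed)

plt.rcParams['font.sans-serif'] = ['SimHei']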

pd.set_option('display.max_columns', None)

pd.set_option('display.width', 500)

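1. Analyzing the df_car_price_dictionary File

The task also provides a table explaining the field names. It is not a well-structured table, so it needs some cleanup, as the code below shows.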

df_car_price_dictionary = pd.read_excel('./data/car_price_dictionary.xlsx', skiprows=2)

df_car_price = pd.read_csv('./data/car_price.csv')

# Drop the rows and columns that are entirely empty

df_car_price_dictionary.dropna(axis='index', how='all', inplace=True)

df_car_price_dictionary.dropna(axis='columns', how='all', inplace=True)

df_car_price_dictionary.drop('Unnamed: 13', axis=1, inplace=True)

df_car_price_dictionary.drop('DATA DICTONARY', axis=1,inplace=True)

df_car_price_dictionary.drop(28, axis=0, inplace=True)

# Rename the columns ('名词' = term, '解释' = explanation)

df_car_price_dictionary.columns = ['名词', '解释']

df_car_price_dictionary.set_index('名词', inplace=True)

df_car_price_dictionary

名词 (term)                 解释 (explanation)
Car_ID                     Unique id of each observation (Interger)
Symboling                  Its assigned insurance risk rating, A
carCompany                 Name of car company (Categorical)
fueltype                   Car fuel type i.e gas or diesel (Categorical)
aspiration                 Aspiration used in a car (Categorical)
doornumber                 Number of doors in a car (Categorical)
carbody                    body of car (Categorical)
drivewheel                 type of drive wheel (Categorical)
enginelocation             Location of car engine (Categorical)
wheelbase                  Weelbase of car (Numeric)
carlength                  Length of car (Numeric)
carwidth                   Width of car (Numeric)
carheight                  height of car (Numeric)
curbweight                 The weight of a car without occupants
enginetype                 Type of engine. (Categorical)
cylindernumber             cylinder placed in the car (Categorical)
enginesize                 Size of car (Numeric)
fuelsystem                 Fuel system of car (Categorical)
boreratio                  Boreratio of car (Numeric)
stroke                     Stroke or volume inside the engine (Numeric)
compressionratio           compression ratio of car (Numeric)
horsepower                 Horsepower (Numeric)
peakrpm                    car peak rpm (Numeric)
citympg                    Mileage in city (Numeric)
highwaympg                 Mileage on highway (Numeric)
price(Dependent variable)  Price of car (Numeric)

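2. Analyzing the car_price File

The analysis of the car_price file provided by the task covers three kinds of columns: the name and ID columns, the string columns (dtype object in the DataFrame), and the numeric columns. Each kind calls for different handling, so we go through them one by one.

First, take an overall look at car_price.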

df_car_price.info()

df_car_price.duplicated().sum()

"""

 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   car_ID            205 non-null    int64
 1   symboling         205 non-null    int64
 2   CarName           205 non-null    object
 3   fueltype          205 non-null    object
 4   aspiration        205 non-null    object
 5   doornumber        205 non-null    object
 6   carbody           205 non-null    object
 7   drivewheel        205 non-null    object
 8   enginelocation    205 non-null    object
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64
 14  enginetype        205 non-null    object
 15  cylindernumber    205 non-null    object
 16  enginesize        205 non-null    int64
 17  fuelsystem        205 non-null    object
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64
 22  peakrpm           205 non-null    int64
 23  citympg           205 non-null    int64
 24  highwaympg        205 non-null    int64
 25  price             205 non-null    float64

"""

The data has no missing values and no duplicate rows.

2.1 Analyzing the String (object) Columns

# Select the object columns

df_object = df_car_price.select_dtypes(include='object').drop(columns='CarName')

# Look at the distinct values in each object column

for col in df_object.columns:

    print(col, df_object[col].unique())

fueltype ['gas' 'diesel']

aspiration ['std' 'turbo']

doornumber ['two' 'four']

carbody ['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']

drivewheel ['rwd' 'fwd' '4wd']

enginelocation ['front' 'rear']

enginetype ['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf' 'dohcv']

cylindernumber ['four' 'six' 'five' 'three' 'twelve' 'two' 'eight']

fuelsystem ['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' '4bbl' 'idi' 'spdi']

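Here we look at the values the string columns take. From the glossary above, every column except cylindernumber (number of cylinders) is a category label and can be one-hot encoded. cylindernumber is a spelled-out number for which bigger is better, so it can be treated as numeric.

We encode cylindernumber directly first and then handle it as numeric data.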

enc_cylindernumber = {'two':2, 'three':3, 'four':4, 'five':5, 'six':6, 'eight':8, 'twelve': 12}

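# Map the spelled-out cylinder counts to integers (avoids chained assignment)

df_object['cylindernumber'] = df_object['cylindernumber'].map(enc_cylindernumber)

# Assumption: the original definition of df_numeric was lost; here it is taken to be the non-object columns

df_numeric = df_car_price.select_dtypes(exclude='object').copy()

df_numeric['cylindernumber'] = df_object['cylindernumber'].astype('int32')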

df_object.drop(columns='cylindernumber',inplace=True)
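The one-hot encoding mentioned above for the remaining categorical columns is not shown in this part of the post. A minimal sketch, assuming pandas' `pd.get_dummies` is used and with a small made-up frame (`toy_object`) standing in for `df_object`, could look like this:

```python
import pandas as pd

# Hypothetical stand-in for df_object; column names follow car_price.csv,
# but the rows are made up for illustration.
toy_object = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "drivewheel": ["rwd", "fwd", "4wd"],
})

# One indicator column per (column, category) pair, stored as 0/1 integers.
toy_onehot = pd.get_dummies(toy_object, dtype=int)

print(toy_onehot.columns.tolist())
```

Each original column expands into as many indicator columns as it has distinct values, so every row has exactly one 1 per original column.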

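2.2 Analyzing the Numeric Columns

For the numeric data, the main goals are to check for outliers and to examine the correlations between columns, in preparation for the clustering and dimensionality reduction that follow.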

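The box plots show that the data has few outliers overall. The features that do have outlying points, such as price, are normal for the automotive industry (there are both high-end and low-end cars), so the data can be considered sound.

Next, we use a heatmap to analyze the correlations in the data.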
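The box plot code itself is not included in this part of the post. As a rough numeric counterpart, the sketch below counts the points that fall outside the 1.5 × IQR whiskers a default matplotlib box plot would draw. The helper name `iqr_outlier_counts` and the sample values are made up for illustration.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns.
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()

# Made-up values: one high-end price, no unusual heights.
toy_numeric = pd.DataFrame({
    "price": [7000, 8000, 9000, 10000, 11000, 45000],
    "carheight": [52.0, 53.0, 54.0, 55.0, 56.0, 57.0],
})

counts = iqr_outlier_counts(toy_numeric)
print(counts)
```

A column such as price can be flagged this way without being wrong, which is why the flagged points are kept rather than dropped.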

df_numeric_corr = df_numeric.corr()

plt.figure(figsize=(10,10))

mask = np.zeros_like(df_numeric_corr, dtype=bool)

# Set the upper-right triangle of the mask (column index >= row index) to True

mask[np.triu_indices_from(mask)] = True

sns.heatmap(df_numeric_corr,annot=True, mask=mask)

plt.show()

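Since we need to cluster the data, strongly correlated attributes can be merged into one. For example, carlength (car length), carwidth (car width), wheelbase, and curbweight (curb weight) are strongly correlated, so the analysis can use just one of them.

The same applies to enginesize (engine size), horsepower, and price,

and to citympg (city mileage) and highwaympg (highway mileage).

Of course, whether every strongly correlated group should really be treated as a single attribute has to be judged case by case from the results.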
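Reading the strongly correlated groups off a heatmap by eye works for a table this small, but the pairs can also be listed programmatically. The sketch below is one possible way to do it. The helper `correlated_pairs`, the 0.8 threshold, and the sample values are assumptions for illustration, not taken from the original notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up values: length and width move together, peakrpm does not.
toy = pd.DataFrame({
    "carlength": [160.0, 170.0, 175.0, 180.0, 190.0],
    "carwidth": [64.0, 65.5, 66.0, 67.0, 69.0],
    "peakrpm": [5000.0, 4800.0, 5500.0, 5200.0, 5000.0],
})

pairs = correlated_pairs(toy)
print(pairs)
```

From each listed pair, one column can be kept and the other dropped, subject to the caveat above that the effect on the clustering still has to be checked.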

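2.3 Analyzing the Car Names and IDs

Since the final goal is a competitor analysis for volkswagen, we need to be able to look cars up easily. So we need to understand the car name data: check whether it has any problems and process it into a form that is convenient to inspect.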
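One way to make the names searchable is to split off the brand as the first word of CarName and normalize spelling variants before filtering. The sketch below assumes that the brand comes first and is separated by a space. The sample names and the `brand_fixes` mapping are hypothetical, and the real data should be inspected before choosing the fixes.

```python
import pandas as pd

# Hypothetical sample of CarName values.
toy_names = pd.Series(
    ["volkswagen rabbit", "toyota corolla", "vw dasher", "audi 100 ls"],
    name="CarName",
)

# Take everything before the first space as the brand.
brand = toy_names.str.split(" ", n=1).str[0]

# Hypothetical mapping of spelling variants to a canonical brand name.
brand_fixes = {"vw": "volkswagen"}
brand = brand.replace(brand_fixes)

# Select the rows belonging to the brand of interest.
vw_rows = toy_names[brand == "volkswagen"]
print(vw_rows.tolist())
```

Keeping the brand in a separate Series leaves the original CarName untouched, so individual models can still be looked up by their full names.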
