Tianchi Competition: Car Product Clustering Analysis (KMeans+PCA)
Preface
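Task: the competition is framed as a competitor analysis. The goal is to cluster the cars in the dataset so that, for a given model, its competing models can be found from the clustering result. Let's go straight into the analysis (all of the code runs in a notebook).
Some of the images in this post may be hard to read, and some of the output is inconvenient to show on the blog. If you want a closer look, you can fork my notebook from the link below.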
0. Imports and plotting settings in the notebook
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from pylab import *
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
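# Suppress warnings
warnings.filterwarnings("ignore")
# Render plots inline
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
# Display minus signs correctly
plt.rcParams['axes.unicode_minus'] = False
# Display Chinese characters correctly
# (the original line was lost; the font name 'SimHei' is an assumption)
plt.rcParams['font.sans-serif'] = ['SimHei']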
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
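1. Analyzing the df_car_price_dictionary file
The task also provides a glossary table that explains the column names. The table is irregularly laid out and needs some cleanup, which the code below does.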
df_car_price_dictionary = pd.read_excel('./data/car_price_dictionary.xlsx', skiprows=2)
df_car_price = pd.read_csv('./data/car_price.csv')
# Drop rows and columns that are entirely empty
df_car_price_dictionary.dropna(axis='index', how='all', inplace=True)
df_car_price_dictionary.dropna(axis='columns', how='all', inplace=True)
df_car_price_dictionary.drop('Unnamed: 13', axis=1, inplace=True)
df_car_price_dictionary.drop('DATA DICTONARY', axis=1,inplace=True)
df_car_price_dictionary.drop(28, axis=0, inplace=True)
# Rename the columns ('名词' = term, '解释' = explanation)
df_car_price_dictionary.columns = ['名词', '解释']
df_car_price_dictionary.set_index('名词', inplace=True)
df_car_price_dictionary
                           解释
名词
Car_ID                     Unique id of each observation (Interger)
Symboling                  Its assigned insurance risk rating, A
carCompany                 Name of car company (Categorical)
fueltype                   Car fuel type i.e gas or diesel (Categorical)
aspiration                 Aspiration used in a car (Categorical)
doornumber                 Number of doors in a car (Categorical)
carbody                    body of car (Categorical)
drivewheel                 type of drive wheel (Categorical)
enginelocation             Location of car engine (Categorical)
wheelbase                  Weelbase of car (Numeric)
carlength                  Length of car (Numeric)
carwidth                   Width of car (Numeric)
carheight                  height of car (Numeric)
curbweight                 The weight of a car without occupants
enginetype                 Type of engine. (Categorical)
cylindernumber             cylinder placed in the car (Categorical)
enginesize                 Size of car (Numeric)
fuelsystem                 Fuel system of car (Categorical)
boreratio                  Boreratio of car (Numeric)
stroke                     Stroke or volume inside the engine (Numeric)
compressionratio           compression ratio of car (Numeric)
horsepower                 Horsepower (Numeric)
peakrpm                    car peak rpm (Numeric)
citympg                    Mileage in city (Numeric)
highwaympg                 Mileage on highway (Numeric)
price(Dependent variable)  Price of car (Numeric)
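2. Analyzing the car_price file
We now analyze the car_price file provided by the task. Its columns fall into three groups: name/ID columns, string columns (dtype object in the DataFrame), and numeric columns. Each group needs different handling, so we go through them one at a time.
First, an overall look at car_price: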
df_car_price.info()
df_car_price.duplicated().sum()
"""
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 car_ID 205 non-null int64
1 symboling 205 non-null int64
2 CarName 205 non-null object
3 fueltype 205 non-null object
4 aspiration 205 non-null object
5 doornumber 205 non-null object
6 carbody 205 non-null object
7 drivewheel 205 non-null object
8 enginelocation 205 non-null object
9 wheelbase 205 non-null float64
10 carlength 205 non-null float64
11 carwidth 205 non-null float64
12 carheight 205 non-null float64
13 curbweight 205 non-null int64
14 enginetype 205 non-null object
15 cylindernumber 205 non-null object
16 enginesize 205 non-null int64
17 fuelsystem 205 non-null object
18 boreratio 205 non-null float64
19 stroke 205 non-null float64
20 compressionratio 205 non-null float64
21 horsepower 205 non-null int64
22 peakrpm 205 non-null int64
23 citympg 205 non-null int64
24 highwaympg 205 non-null int64
25 price 205 non-null float64
"""
The data has no missing values and no duplicate rows.
2.1 Analyzing the string columns
# Select the object-dtype columns
df_object = df_car_price.select_dtypes(include='object').drop(columns='CarName')
# List the distinct values in each object column
for col in df_object.columns:
    print(col, df_object[col].unique())
fueltype ['gas' 'diesel']
aspiration ['std' 'turbo']
doornumber ['two' 'four']
carbody ['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']
drivewheel ['rwd' 'fwd' '4wd']
enginelocation ['front' 'rear']
enginetype ['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf' 'dohcv']
cylindernumber ['four' 'six' 'five' 'three' 'twelve' 'two' 'eight']
fuelsystem ['mpfi' '2bbl' 'mfi' '1bbl' 'spfi' '4bbl' 'idi' 'spdi']
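Looking at the values of the string columns together with the glossary above, every column except cylindernumber (number of cylinders) is a plain category label and can be one-hot encoded. The values of cylindernumber are counts with a natural order (more is treated as better here), so it can be handled as numeric data.
Here we first encode cylindernumber directly and then treat it as a numeric column.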
enc_cylindernumber = {'two':2, 'three':3, 'four':4, 'five':5, 'six':6, 'eight':8, 'twelve': 12}
# Assumption: df_numeric holds the numeric columns; its definition is not shown in this post
df_numeric = df_car_price.select_dtypes(include='number').copy()
# Map the words to integers in one vectorized step (avoids chained assignment)
df_numeric['cylindernumber'] = df_object['cylindernumber'].map(enc_cylindernumber).astype('int32')
df_object.drop(columns='cylindernumber', inplace=True)
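The one-hot encoding mentioned above for the remaining string columns could look roughly like this. It is a minimal sketch using `pd.get_dummies` on a toy frame; `df_toy` and its values are stand-ins made up for illustration, not read from car_price.csv, and the post does not show which encoder was actually used.

```python
import pandas as pd

# Toy stand-in for df_object; the values are illustrative only
df_toy = pd.DataFrame({
    'fueltype': ['gas', 'diesel', 'gas'],
    'drivewheel': ['rwd', 'fwd', '4wd'],
})

# One 0/1 indicator column per (column, category) pair
df_onehot = pd.get_dummies(df_toy, dtype=int)
print(df_onehot.columns.tolist())
```

Each row ends up with exactly one 1 per original column, so no artificial ordering is imposed on the categories.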
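2.2 Analyzing the numeric data
For the numeric data, the main goals are to check for outliers and to examine the correlations between features, in preparation for the clustering and dimensionality reduction that follow.
The box plots show that the data has few outliers overall. The features that do show outliers, such as price, are normal for the car industry (there are high-end and low-end cars), so the data can be considered sound.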
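The outlier check can also be done numerically. The sketch below assumes the usual box-plot rule, under which a point is an outlier if it lies more than 1.5 × IQR outside the quartiles. The helper `iqr_outlier_counts` and the toy values are hypothetical and not taken from the post's notebook.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return outside.sum()

# Toy data: one expensive car among otherwise similar prices
df_toy = pd.DataFrame({
    'price': [7000, 8000, 9000, 10000, 11000, 45000],
    'carheight': [50.0, 52.0, 53.0, 54.0, 55.0, 56.0],
})
counts = iqr_outlier_counts(df_toy)
print(counts)
```

A column such as price can legitimately show outliers here, which matches the reasoning above that high-end cars are not data errors.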
Next, we use a heatmap to examine the correlations in the data.
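df_numeric_corr = df_numeric.corr()
plt.figure(figsize=(10,10))
# np.bool was removed in recent NumPy releases; use the built-in bool
mask = np.zeros_like(df_numeric_corr, dtype=bool)
# Set the upper-right triangle of the mask (column index >= row index) to True
mask[np.triu_indices_from(mask)] = True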
sns.heatmap(df_numeric_corr,annot=True, mask=mask)
plt.show()
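Since we need to cluster the data, strongly correlated attributes can be merged into a single attribute. For example, carlength (car length), carwidth (car width), wheelbase and curbweight (curb weight) are strongly correlated, so it is enough to pick one of them for the analysis.
The same applies to enginesize (engine size), horsepower and price,
and to citympg (city mileage) and highwaympg (highway mileage).
Of course, whether strongly correlated attributes should really all be treated as one attribute depends on how the results turn out in practice.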
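Reading groups of correlated features off a heatmap by eye can be cross-checked in code. The following is a minimal sketch: the helper `strongly_correlated_pairs`, the 0.8 threshold, and the toy values are all assumptions made for illustration and do not come from the post's notebook.

```python
import pandas as pd

def strongly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for each column pair with |r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Toy data: the two mileage columns move together, peakrpm does not
df_toy = pd.DataFrame({
    'citympg': [20, 25, 30, 35, 40],
    'highwaympg': [26, 30, 36, 41, 45],
    'peakrpm': [5000, 4800, 5200, 4800, 5000],
})
pairs = strongly_correlated_pairs(df_toy)
print(pairs)
```

The threshold is a tuning choice. As noted above, whether a correlated group should really be collapsed to one feature has to be judged from the clustering results.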
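2.3 Analyzing the car names and IDs
Because the final goal is a competitor analysis for volkswagen, and to make lookups easier, we need to understand the car-name data, check whether it has any problems, and process it into a form that is convenient to inspect.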
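One way to make the names easier to search is to split a brand out of each name and normalize its spelling. This is a sketch only: the sample names, the "first token is the brand" rule, and the `aliases` table are hypothetical, and the actual CarName values may follow a different pattern.

```python
import pandas as pd

# Hypothetical sample names, made up for illustration
names = pd.Series(['volkswagen rabbit', 'Toyota corolla', 'vw dasher', 'audi 100 ls'])

# Assumption: the first space-separated token is the brand
brand = names.str.split(' ').str[0].str.lower()

# Hypothetical alias table for collapsing spelling variants into one brand
aliases = {'vw': 'volkswagen'}
brand = brand.replace(aliases)

print(brand.value_counts())
```

With a clean brand column, all volkswagen rows can be selected with a single equality filter before looking up their clusters.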