Using SVC to Predict Whether a User Will Buy
2018-12-24
Abstract: This article walks through using an SVC (support vector classifier) to predict whether a user will make a purchase, illustrating each step with concrete code.
Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
Load the data and inspect its structure
data = pd.read_csv("tree_data_full.csv")
# Inspect the shape of the data
print("dataShape ", data.shape)
data['ID'] = data['ID'].astype('object')
print("dataDescribe ", data.describe())
print("-"*40)
## Check for duplicates
print("-"*40)
print("duplicate rows:", data.shape[0] - data.drop_duplicates().shape[0])
There are no duplicate rows.
The columns education, age_detail, marital, and home_owner contain null values:
| name | count NA |
|---|---|
| education | 741 |
| age_detail | 6709 |
| marital | 14027 |
| home_owner | 3377 |
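Null counts like the ones above come straight from `isnull().sum()`. A minimal, self-contained sketch on toy data (the original CSV is not available, so the column names are just taken from the article):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for tree_data_full.csv
toy = pd.DataFrame({
    "education":  [1.0, np.nan, 3.0],
    "age_detail": [np.nan, 25.0, 30.0],
    "marital":    ["Single", None, "Married"],
    "home_owner": ["Renter", "Owner", None],
})

# One NA count per column, in column order
na_counts = toy.isnull().sum()
print(na_counts)
```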
Data processing
Drop columns
del data['ID']
del data['age']
Fill the nulls according to each variable's real-world meaning
1- Merge the 'F' and 'U' values of gender
data['gender'] = data['gender'].replace('U', 'F')
2- Fill missing education values with 0
data.education.fillna(0, inplace= True)
3- Fill missing marital values with 'Single'
data['marital'].fillna('Single',inplace=True)
4- Merge the '0', 'N', 'U' values of poc (presence of children)
# Every value other than 'Y' becomes 'N'
poc_uniq = data.poc.unique()
for i in poc_uniq:
    if i != 'Y':
        data['poc'] = data['poc'].replace(i, 'N')
5- Fill missing home_owner values with 'Renter'
data['home_owner'] = data['home_owner'].replace(np.nan, 'Renter')
6- Fill missing age_detail (age bracket) values with 0
data['age_detail'].fillna(0,inplace=True)
7- Convert the string household-income variable to numeric
data['home_income'] = data['home_income'].map({'U': 0, 'A': 1, 'B': 2,
    'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'H': 8, 'I': 9, 'J': 10, 'K': 11, 'L': 12})
8- Convert mosaic_group (consumer-psychology groupings derived from residential area) to numeric
# Binary recoding: groups 'A', 'B' and 'C' become 1, every other group
# (including 'U') becomes 0 — this is why mosaic_group shows up as a
# binary variable in the classification step below
mosaic_group_uniq = data.mosaic_group.unique()
for i in mosaic_group_uniq:
    if i in ('A', 'B', 'C'):
        data['mosaic_group'] = data['mosaic_group'].replace(i, 1)
    else:
        data['mosaic_group'] = data['mosaic_group'].replace(i, 0)
# 9- 将房屋价值的缺失值赋值为0
data['home_value'].fillna(0, inplace=True)
# 10- Cast new_car to object
data.new_car = data.new_car.astype("object")
11- Classify the variables
# Continuous
varNum = []
# Binary
varChar2 = []
# Multi-category
varCharM = []
varList = data.columns.tolist()
for var in varList:
    if len(data.loc[:, var].unique()) == 2:
        varChar2.append(var)
    elif data.loc[:, var].dtypes == "int64" or data.loc[:, var].dtypes == "float64":
        varNum.append(var)
    else:
        varCharM.append(var)
varNum
['home_value', 'age_detail', 'home_income']
varChar2
['target_flag', 'gender', 'buy_online', 'mosaic_group', 'marital', 'poc', 'home_owner']
varCharM
['education', 'occupation', 'mortgage', 'region', 'new_car']
12- Convert the multi-category variables to numeric
DIC1 = []  # stores each (original values, codes) pair after conversion
def Chage(data, var):
    var2 = data[var].unique()
    a = list(range(len(var2)))
    data[var] = data[var].replace(var2, a)
    DIC1.append([var2, a])

for var in varCharM:
    Chage(data, var)
13- Re-encode the binary variables
for var in varChar2:
    Chage(data, var)
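Because `Chage` remembers each `(original values, codes)` pair in `DIC1`, the label encoding can be inverted later. A minimal, self-contained round-trip sketch on a toy column:

```python
import pandas as pd

DIC1 = []

def Chage(data, var):
    # Replace each distinct value with its index and record the mapping
    var2 = data[var].unique()
    a = list(range(len(var2)))
    data[var] = data[var].replace(var2, a)
    DIC1.append([var2, a])

toy = pd.DataFrame({"occupation": ["clerk", "nurse", "clerk", "driver"]})
Chage(toy, "occupation")        # occupation is now [0, 1, 0, 2]

# Rebuild the original label for each code from the stored mapping
labels, codes = DIC1[0]
decode = dict(zip(codes, labels))
decoded = toy["occupation"].map(decode)
```

Note that the code assigned to a category depends on the order `unique()` returns values, so the mapping is only stable for a fixed dataset.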
Define roc_ks_curve
def roc_ks_curve(y_test, y_best_pred):
    fpr, tpr, thresholds = roc_curve(y_test, y_best_pred, drop_intermediate=True)
    plt.subplot(2, 1, 1)
    plt.plot(fpr, tpr, label="auc")
    plt.legend()
    plt.plot([0, 1], [0, 1])
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.title("ROC")
    plt.subplot(2, 1, 2)
    plt.title("KS")
    plt.plot(1 - thresholds, tpr, "r:", label="tpr")
    plt.plot(1 - thresholds, fpr, "b--", label="fpr")
    plt.plot(1 - thresholds, tpr - fpr, "g", label="tpr-fpr")
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.legend()
    plt.tight_layout()
    plt.show()
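The KS statistic plotted by this function is just the maximum gap between TPR and FPR, so it can also be read off numerically without a plot. A small sketch on made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities standing in for real model output
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.5, 0.2])

# Keep every threshold so the maximum of tpr - fpr is exact
fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
ks = np.max(tpr - fpr)
print("KS =", ks)
```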
Build the model
from sklearn.metrics import roc_curve,roc_auc_score,confusion_matrix
from sklearn.model_selection import train_test_split
X = data.iloc[:, 1: ]
y = data.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1, stratify = y)
from sklearn.svm import SVC
'''
C=1.0, kernel='rbf', degree=3, gamma='auto',
coef0=0.0, shrinking=True, probability=False,
tol=1e-3, cache_size=200, class_weight=None,
verbose=False, max_iter=-1, decision_function_shape='ovr',
random_state=None
'''
svc_clf = SVC(C=1.0, kernel= "rbf", degree= 3, gamma=1, probability= True, shrinking= True,
cache_size= 200, verbose= 10, max_iter= -1, decision_function_shape= "ovr")
svc_clf.fit(X_train, y_train)
y_test_pred = svc_clf.predict_proba(X_test)
y_pred = svc_clf.predict(X_test)
print("auc = %.2f%%" % (100*roc_auc_score(y_test, y_test_pred[:,1])))
# Pass the predicted probabilities so the ROC/KS curves sweep all thresholds
roc_ks_curve(y_test, y_test_pred[:, 1])
auc = 52.80%
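An AUC of 52.80% is barely above chance. One common culprit with an RBF-kernel SVC is unscaled features: the kernel's distance computation is dominated by whichever column has the largest range. A hedged sketch of the usual remedy, standardizing inside a pipeline (run on synthetic data, since the original CSV is not available):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the article's data, with one wildly scaled column
X, y = make_classification(n_samples=1000, n_features=15, n_informative=5,
                           random_state=1)
X = X * np.array([1] * 14 + [1000])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Scaling lives inside the pipeline, so the test set is transformed
# with statistics learned from the training set only
model = make_pipeline(StandardScaler(),
                      SVC(C=1.0, kernel="rbf", gamma="scale", probability=True))
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
print("auc = %.2f%%" % (100 * roc_auc_score(y_test, proba)))
```

Tuning `C` and `gamma` with cross-validation would be the natural next step once scaling is in place.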
Compiled and published by Zhizuobiao (职坐标).