Analyzing E-commerce Reviews with a Naive Bayes Model

vlambda
2021-08-08

Import the data and keep only the fields we need

import pandas as pd

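evaluation = pd.read_excel('./data/jd_comment_data.xlsx', engine='openpyxl')
evaluation.head(2)
# Drop rows whose content is the placeholder "此用户未填写评价内容" (the user left no review text)
evaluation = evaluation.drop(evaluation[evaluation['评价内容(content)']=="此用户未填写评价内容"].index)
# Use a regular expression to strip digits and ASCII letters from the reviews
# (regex=True is required on pandas 2.0+, where str.replace defaults to literal matching)
evaluation['评价内容(content)'] = evaluation['评价内容(content)'].str.replace('[0-9a-zA-Z]', '', regex=True)
# Keep only the columns we need
importance_features = ['评价时间(publish_time)','评价内容(content)','评价者(author_name)','评分（总分5分）(score)']
evaluation = evaluation[importance_features]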
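
To make the cleaning step concrete, here is a small standard-library sketch of what the character class `[0-9a-zA-Z]` removes. The sample strings are invented for illustration and are not taken from the dataset.

```python
import re

# Hypothetical sample reviews, not taken from the dataset
samples = ["iPhone12很好用", "物流3天到货", "非常满意"]

# Same character class as in the pandas call: drop ASCII digits and letters
pattern = re.compile(r"[0-9a-zA-Z]")
cleaned = [pattern.sub("", s) for s in samples]

print(cleaned)
```

Note that product names and quantities written in digits or Latin letters disappear too, so this choice trades some information for a smaller vocabulary.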

Suppress warnings and segment the text with jieba

import warnings
warnings.filterwarnings('ignore')

import jieba
# Load the custom dictionary
jieba.load_userdict('./data/all_words.txt')
# Read the stop words
with open('./data/mystopwords.txt', encoding='UTF-8') as words:
    stop_words = [i.strip() for i in words.readlines()]
# Segment a sentence and drop stop words along the way
def cut_word(sentence):
    words = [i for i in jieba.lcut(sentence) if i not in stop_words]
    # Join the remaining tokens with spaces
    result = ' '.join(words)
    return result
# Segment every review
words = evaluation['评价内容(content)'].apply(cut_word)
# Show the first 5 segmented reviews
words[:5]

Encode the text as numeric features

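from sklearn.feature_extraction.text import CountVectorizer
# Count how often each term occurs in each review, and drop terms with
# sparsity above 99% (terms that appear in fewer than 1% of the reviews)
counts = CountVectorizer(min_df=0.01)
# Document-term matrix
dtm_counts = counts.fit_transform(words).toarray()
# Column names of the matrix
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
columns = counts.get_feature_names_out()
# Convert the matrix to a DataFrame: this is the X variable
X = pd.DataFrame(dtm_counts, columns=columns)
# Sentiment label: the review score
y = evaluation['评分（总分5分）(score)']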
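
The effect of `min_df` can be sketched without scikit-learn. The snippet below is a simplified illustration, not the library's implementation: it builds a document-term matrix from a few invented, already-segmented reviews and keeps only the terms whose document frequency reaches `min_df * n_docs`. The threshold of 0.5 is chosen only so that the tiny example filters something; the article uses 0.01.

```python
from collections import Counter

# Hypothetical, already-segmented reviews (tokens separated by spaces)
docs = ["质量 很好", "物流 很快", "质量 一般", "很好 推荐"]
min_df = 0.5  # illustrative threshold; the article uses 0.01

tokenized = [d.split() for d in docs]

# Document frequency: number of reviews that contain each term at least once
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Keep terms whose document frequency reaches min_df * n_docs
min_count = min_df * len(docs)
vocab = sorted(t for t, df in doc_freq.items() if df >= min_count)

# Document-term matrix: one row per review, one column per kept term
dtm = [[tokens.count(t) for t in vocab] for tokens in tokenized]

print(vocab)
print(dtm)
```

One difference worth keeping in mind: with its default `token_pattern`, `CountVectorizer` also ignores tokens that are a single character long, which this sketch does not reproduce.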

Build the model

from sklearn import model_selection
from sklearn import naive_bayes
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

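# Split the data into training and test sets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=1)
# Build a Bernoulli naive Bayes classifier
bnb = naive_bayes.BernoulliNB()
# Fit the model on the training set
bnb.fit(X_train, y_train)
# Predict on the test set
bnb_pred = bnb.predict(X_test)
# Build the confusion matrix (rows: predicted score, columns: real score)
cm = pd.crosstab(bnb_pred, y_test)
# Plot the confusion matrix as a heatmap
sns.heatmap(cm, annot=True, cmap='GnBu', fmt='d')
# Label the x and y axes
plt.xlabel('Real')
plt.ylabel('Predict')
# Show the figure
plt.show()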
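
For readers who want to see what a Bernoulli naive Bayes classifier computes, here is a minimal pure-Python sketch. It is an illustration under simplifying assumptions rather than scikit-learn's implementation: counts are binarized to "term present or absent", per-class probabilities use Laplace smoothing with `alpha=1`, and the toy data and the helper names `fit_bernoulli_nb` and `predict_bernoulli_nb` are invented for this example.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class feature probabilities."""
    classes = sorted(set(y))
    n_features = len(X[0])
    model = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Laplace-smoothed P(term present | class); counts are binarized with > 0
        probs = [
            (sum(1 for r in rows if r[j] > 0) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (math.log(len(rows) / len(y)), probs)
    return model

def predict_bernoulli_nb(model, x):
    """Return the class with the highest log posterior for one sample."""
    def score(c):
        log_prior, probs = model[c]
        # Absent terms contribute log(1 - p), which is what makes it "Bernoulli"
        return log_prior + sum(
            math.log(p) if v > 0 else math.log(1.0 - p)
            for v, p in zip(x, probs)
        )
    return max(model, key=score)

# Toy data: the two columns stand for counts of two hypothetical terms
X_toy = [[1, 0], [1, 0], [2, 0], [0, 1], [0, 3]]
y_toy = [5, 5, 5, 1, 1]

model = fit_bernoulli_nb(X_toy, y_toy)
print(predict_bernoulli_nb(model, [1, 0]))
print(predict_bernoulli_nb(model, [0, 1]))
```

Because only presence matters, a term that appears three times in a review counts the same as one that appears once. `BernoulliNB` behaves the same way on the count matrix built earlier, since its default `binarize=0.0` thresholds every count at zero.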

# Prediction accuracy and per-class evaluation report
print('Model accuracy:\n', metrics.accuracy_score(y_test, bnb_pred))
print('Classification report:\n', metrics.classification_report(y_test, bnb_pred))

Model accuracy:
 0.92931748604358
Classification report:
               precision    recall  f1-score   support

           1       0.21      0.10      0.14       253
           2       0.00      0.00      0.00        55
           3       0.00      0.00      0.00       135
           4       0.09      0.02      0.04       223
           5       0.94      0.99      0.96     10440

    accuracy                           0.93     11106
   macro avg       0.25      0.22      0.23     11106
weighted avg       0.89      0.93      0.91     11106
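
The report is easier to interpret next to a trivial baseline. The sketch below uses only the support and recall figures printed in the report and compares the accuracy with what a classifier that always predicts five stars would score on the same test set.

```python
# Per-class support and recall copied from the classification report
support = {1: 253, 2: 55, 3: 135, 4: 223, 5: 10440}
recall = {1: 0.10, 2: 0.00, 3: 0.00, 4: 0.02, 5: 0.99}

total = sum(support.values())

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = max(support.values()) / total

# Macro recall weights every class equally, regardless of its size
macro_recall = sum(recall.values()) / len(recall)

print(f"test samples:      {total}")
print(f"majority baseline: {baseline_accuracy:.4f}")
print(f"macro recall:      {macro_recall:.2f}")
```

The baseline comes out at about 0.940, slightly above the model's 0.929, so the high accuracy mostly reflects how dominant five-star reviews are. The macro averages (around 0.22 to 0.25) give a more honest picture of how poorly scores 1 to 4 are recognized.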

Evaluate the model

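# Probability of the positive class (five-star reviews), used to build the ROC curve.
# predict_proba orders its columns like bnb.classes_, so look up the column for score 5.
pos_col = list(bnb.classes_).index(5)
y_score = bnb.predict_proba(X_test)[:, pos_col]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score, pos_label=5)
# Compute the AUC
roc_auc = metrics.auc(fpr, tpr)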

# Draw the area under the curve
plt.stackplot(fpr, tpr, color='steelblue', alpha = 0.5, edgecolor = 'black')
# Add the outline of the ROC curve
plt.plot(fpr, tpr, color='black', lw = 1)
# Add the diagonal reference line
plt.plot([0,1],[0,1], color = 'red', linestyle = '--')
# Annotate the plot with the AUC value
plt.text(0.5,0.3,'ROC curve (area = %0.2f)' % roc_auc)
# Label the x and y axes
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')
# Show the figure
plt.show()
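
`metrics.auc` integrates the curve with the trapezoidal rule. As a rough illustration of that calculation, the sketch below applies the rule to a handful of invented ROC points; `trapezoid_auc` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a piecewise-linear ROC curve via the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        # Each segment contributes width times the average of its two heights
        area += width * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# Hypothetical ROC points, sorted by increasing false positive rate
fpr_toy = [0.0, 0.0, 0.5, 1.0]
tpr_toy = [0.0, 0.5, 1.0, 1.0]

print(trapezoid_auc(fpr_toy, tpr_toy))

# The red diagonal in the plot corresponds to random guessing
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))
```

An AUC close to the diagonal's 0.5 means the scores barely separate five-star reviews from the rest, while a value close to 1 means they separate them well.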