Random forest: how do I add more features to a sparse matrix and identify the terms in the feature importances?

2022-04-22 00:00:00 python random-forest

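Problem description

I need to use features generated by bag-of-words (BOW) together with extra features (such as Grp and Rating) in a random forest model.

  1. Since the BOW is a sparse matrix, how can I add the extra features to create a new sparse matrix? Currently I convert the sparse matrix to a dense matrix and concatenate the extra features to build a DataFrame (e.g. df2). Is there a way to add the extra features to the BOW sparse matrix directly?

  2. If we use a sparse matrix as X_train, how can I identify the terms in the feature importances? Currently I use the columns of df2.

Thanks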

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier


bards_words =["The fool doth think he is wise,",
"man fool"]

vect = CountVectorizer()

bow=vect.fit_transform(bards_words)

vocab=vect.vocabulary_

new_vocab = dict([(value, key) for key, value in vocab.items()])

df0 = pd.DataFrame(bow.toarray())
df0.rename(columns=new_vocab , inplace=True)

df1 = pd.DataFrame({'Grp': ['3' , '10'],
                   'Rating': ['1', '2']
                   })



df2=pd.concat([df0, df1], axis=1)

X_train=df2.values

# y_train (the target labels) is not shown in the post and is assumed to be defined elsewhere
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest = forest.fit(X_train, y_train)
feature_importances = pd.DataFrame(forest.feature_importances_, index=df2.columns, columns=['importance']).sort_values('importance', ascending=False)


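Solution

Use sparse data structures throughout. At the moment you convert the sparse matrix to a dense matrix, then to a DataFrame, then to another DataFrame, and finally back to a dense matrix. That is inefficient.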

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from scipy import sparse
import numpy as np
import pandas as pd

bards_words =["The fool doth think he is wise,",
"man fool"]

df1 = pd.DataFrame({'Grp': ['3' , '10'],
                   'Rating': ['1', '2']
                   })

vect = CountVectorizer()
bow=vect.fit_transform(bards_words)

# Stack the two df1 columns onto the left of the sparse matrix
bow = sparse.hstack((sparse.csr_matrix(df1.astype(int).values), bow))

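# Keep track of features, in the same column order as the stacked matrix
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
features = np.concatenate((df1.columns.values, vect.get_feature_names_out()))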

>>> features
array(['Grp', 'Rating', 'doth', 'fool', 'he', 'is', 'man', 'the', 'think',
       'wise'], dtype=object)

>>> bow.toarray()
array([[ 3,  1,  1,  1,  1,  1,  0,  1,  1,  1],
       [10,  2,  0,  1,  0,  0,  1,  0,  0,  0]])

# Do your random forest
# y_train (the target labels) is not shown in the post and is assumed to be defined elsewhere
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest = forest.fit(bow, y_train)
feature_importances = pd.DataFrame(forest.feature_importances_, index=features, columns=['importance']).sort_values('importance', ascending=False)
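
As a sanity check, the sketch below rebuilds the same matrix both ways (sparse stacking and the dense route from the question) and confirms that they agree and that the name list lines up with the columns. The labels in y_train are made up for illustration, since the post does not show them, and the variable names X_sparse, X_dense and names are not from the original. The term names are read from vect.vocabulary_ sorted by column index, which should work regardless of the scikit-learn version.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

bards_words = ["The fool doth think he is wise,", "man fool"]
df1 = pd.DataFrame({'Grp': ['3', '10'], 'Rating': ['1', '2']})
y_train = np.array([0, 1])  # hypothetical labels, not part of the original post

vect = CountVectorizer()
bow = vect.fit_transform(bards_words)

# Sparse route: extra columns on the left, BOW columns on the right
extra = sparse.csr_matrix(df1.astype(int).values)
X_sparse = sparse.hstack((extra, bow), format='csr')

# Feature names in the same order as the columns of X_sparse;
# vocabulary_ maps each term to its column index in bow
names = list(df1.columns) + sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# Dense route (as in the question), used here only for comparison
X_dense = np.hstack((df1.astype(int).values, bow.toarray()))
same = np.array_equal(X_sparse.toarray(), X_dense)

# Fit on the sparse matrix and label the importances with the names
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_sparse, y_train)
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(same)
print(importances)
```

With only two training rows the importance values themselves carry no meaning; the point is that each value is matched to the right feature name without ever building a dense DataFrame for training.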
