Random forest: how do I add extra features to a sparse matrix, and identify the terms in the feature importances?
Problem description
I need to fit a random forest model on features generated by a bag-of-words (BOW) model together with extra features such as Grp and Rating.
Since the BOW output is a sparse matrix, how can I append the extra features to create a new sparse matrix? Currently I convert the sparse matrix to a dense one and concatenate the extra features to build a DataFrame (df2 below). Is there a way to add the extra features directly to the BOW sparse matrix?
And if we use the sparse matrix as X_train, how can I identify the terms in the feature importances? Currently I use the columns of df2.
Thanks
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
bards_words =["The fool doth think he is wise,",
"man fool"]
vect = CountVectorizer()
bow = vect.fit_transform(bards_words)
vocab = vect.vocabulary_
# Invert the vocabulary: index -> term
new_vocab = dict((value, key) for key, value in vocab.items())
df0 = pd.DataFrame(bow.toarray())
df0.rename(columns=new_vocab, inplace=True)
df1 = pd.DataFrame({'Grp': ['3', '10'],
                    'Rating': ['1', '2']})
df2 = pd.concat([df0, df1], axis=1)
X_train = df2.values
# y_train is assumed to be defined elsewhere (the training labels)
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest = forest.fit(X_train, y_train)
feature_importances = pd.DataFrame(forest.feature_importances_,
                                   index=df2.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
Solution
Work with sparse data structures throughout. At the moment you convert the sparse matrix to a dense array, wrap it in a DataFrame, concatenate a second DataFrame, and then turn that back into a dense array for training. That is inefficient, and for a realistically sized vocabulary the dense intermediate can exhaust memory.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from scipy import sparse
import numpy as np
import pandas as pd
bards_words =["The fool doth think he is wise,",
"man fool"]
df1 = pd.DataFrame({'Grp': ['3', '10'],
                    'Rating': ['1', '2']})
vect = CountVectorizer()
bow = vect.fit_transform(bards_words)
# Stack the two df1 columns onto the left of the sparse matrix;
# sparse.hstack returns a COO matrix, so convert back to CSR
bow = sparse.hstack((sparse.csr_matrix(df1.astype(int).values), bow)).tocsr()
# Keep track of the feature names: extra columns first, then the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
features = np.concatenate((df1.columns.values, vect.get_feature_names_out()))
>>> features
array(['Grp', 'Rating', 'doth', 'fool', 'he', 'is', 'man', 'the', 'think',
'wise'], dtype=object)
>>> bow.A
array([[ 3, 1, 1, 1, 1, 1, 0, 1, 1, 1],
[10, 2, 0, 1, 0, 0, 1, 0, 0, 0]])
# Fit the random forest (y_train is assumed to be defined elsewhere)
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest = forest.fit(bow, y_train)
feature_importances = pd.DataFrame(forest.feature_importances_,
                                   index=features,
                                   columns=['importance']).sort_values('importance', ascending=False)
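A side note on why the `.tocsr()` conversion is worth doing: `scipy.sparse.hstack` returns a COO matrix by default, which does not support row slicing, while CSR does and is what scikit-learn estimators work with efficiently. A minimal self-contained sketch with toy matrices (not the data above) illustrating the stacking:

```python
import numpy as np
from scipy import sparse

# Two sparse blocks: a 2x2 "extra features" block and a 2x3 "BOW" block
extra = sparse.csr_matrix(np.array([[3, 1], [10, 2]]))
bow = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))

# hstack glues them column-wise; the result is COO, so convert to CSR
X = sparse.hstack((extra, bow)).tocsr()

print(X.shape)         # (2, 5)
print(X[0].toarray())  # row slicing works on CSR: [[3 1 1 0 1]]
```

Because the column order is extra features first, then vocabulary terms, the `features` array built with `np.concatenate` lines up index-for-index with the columns of the stacked matrix, which is what makes the feature-importance lookup correct.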