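Using Decision Trees for Feature Engineering in Python
The steps for using a decision tree for feature engineering in Python are as follows:
- Data preparation: prepare the data to be engineered, normally as a training set and a test set. The toy example here has only four labeled strings, so it is not split.

    import pandas as pd

    data = {
        'content': ['Welcome to pidancode.com',
                    'Join us at PIDA Website',
                    'Learn coding with PIDA',
                    'Start your coding journey at pidancode.com'],
        'label': [1, 1, 0, 0]
    }
    df = pd.DataFrame(data)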
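The step above mentions a training set and a test set, but the example data is never split. A minimal sketch of such a split, assuming scikit-learn's `train_test_split` is acceptable here; the 50/50 ratio and `random_state=0` are illustrative choices, not part of the original example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = {
    'content': ['Welcome to pidancode.com',
                'Join us at PIDA Website',
                'Learn coding with PIDA',
                'Start your coding journey at pidancode.com'],
    'label': [1, 1, 0, 0],
}
df = pd.DataFrame(data)

# Hold out half of the rows (assumed ratio). Stratifying on the label
# keeps one example of each class in both the training and test sets.
train_df, test_df = train_test_split(
    df, test_size=0.5, stratify=df['label'], random_state=0
)

print(len(train_df), len(test_df))
```

With only four rows the split is purely illustrative; on real data, the vectorizer and the tree would be fitted on `train_df` only and evaluated on `test_df`.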
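- Feature extraction: for text content, features can be extracted with methods such as the bag-of-words model or TF-IDF. The example uses bag-of-words counts via CountVectorizer.

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['content'])
    y = df['label']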
- Model training: fit a decision tree model on the extracted features.

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier()
    clf.fit(X, y)
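- Feature selection: the fitted model exposes a feature_importances_ attribute, which can be used to pick the most important features.

    importances = clf.feature_importances_
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement.
    feature_names = vectorizer.get_feature_names_out()

    # Print each feature with its importance
    for feature_name, importance in zip(feature_names, importances):
        print(feature_name, importance)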
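- Feature generation: based on the selected features, derive new features from the original data.

    # For example, use the number of occurrences of "pidancode.com" as a new feature
    new_feature = []
    for content in df['content']:
        count = content.count('pidancode.com')
        new_feature.append(count)
    df['pidancode.com'] = new_feature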
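The generated column lives in the DataFrame, while the bag-of-words features live in a sparse matrix. A minimal sketch of one way to combine the two and retrain, assuming `scipy.sparse.hstack` is acceptable; the names `X_words`, `extra`, and `X_combined` are introduced here for illustration:

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    'content': ['Welcome to pidancode.com',
                'Join us at PIDA Website',
                'Learn coding with PIDA',
                'Start your coding journey at pidancode.com'],
    'label': [1, 1, 0, 0],
})

# Hand-crafted feature from the step above: occurrences of "pidancode.com".
df['pidancode.com'] = [c.count('pidancode.com') for c in df['content']]

vectorizer = CountVectorizer()
X_words = vectorizer.fit_transform(df['content'])

# Append the new column to the right of the bag-of-words matrix.
extra = csr_matrix(df[['pidancode.com']].to_numpy())
X_combined = hstack([X_words, extra]).tocsr()

clf = DecisionTreeClassifier(random_state=0).fit(X_combined, df['label'])
print(X_combined.shape)
```

Refitting and checking `clf.feature_importances_` again shows whether the new feature adds anything beyond the tokens already present.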
The complete code is as follows:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    # Data preparation
    data = {
        'content': ['Welcome to pidancode.com',
                    'Join us at PIDA Website',
                    'Learn coding with PIDA',
                    'Start your coding journey at pidancode.com'],
        'label': [1, 1, 0, 0]
    }
    df = pd.DataFrame(data)

    # Feature extraction
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['content'])
    y = df['label']

    # Model training
    clf = DecisionTreeClassifier()
    clf.fit(X, y)

    # Feature selection
    importances = clf.feature_importances_
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vectorizer.get_feature_names_out()

    # Print each feature with its importance
    for feature_name, importance in zip(feature_names, importances):
        print(feature_name, importance)

    # Feature generation
    new_feature = []
    for content in df['content']:
        count = content.count('pidancode.com')
        new_feature.append(count)
    df['pidancode.com'] = new_feature

    print(df)
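The output is as follows. CountVectorizer produces 15 tokens from the four strings, and coding is the only token that separates the two labels, so the tree needs a single split and assigns it all of the importance (exact column spacing may vary with the pandas version):

    at 0.0
    coding 1.0
    com 0.0
    join 0.0
    journey 0.0
    learn 0.0
    pida 0.0
    pidancode 0.0
    start 0.0
    to 0.0
    us 0.0
    website 0.0
    welcome 0.0
    with 0.0
    your 0.0
                                          content  label  pidancode.com
    0                    Welcome to pidancode.com      1              1
    1                     Join us at PIDA Website      1              0
    2                      Learn coding with PIDA      0              0
    3  Start your coding journey at pidancode.com      0              1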