Feature Engineering with Decision Trees in Python

2023-04-14 00:00:00 feature engineering, decision tree

The steps for using a decision tree for feature engineering in Python are as follows:

  1. Data preparation: prepare the data to be used for feature engineering, including a training set and a test set.
import pandas as pd

data = {
    'content': ['Welcome to pidancode.com', 'Join us at PIDA Website', 'Learn coding with PIDA', 'Start your coding journey at pidancode.com'],
    'label': [1, 1, 0, 0]
}
df = pd.DataFrame(data)
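
The step above mentions a training set and a test set, but the toy DataFrame is never split. A minimal sketch of one way to do it, assuming scikit-learn's `train_test_split`; the `test_size`, `stratify` and `random_state` values are illustrative choices and do not come from the article:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy data as in the article.
data = {
    'content': ['Welcome to pidancode.com', 'Join us at PIDA Website',
                'Learn coding with PIDA', 'Start your coding journey at pidancode.com'],
    'label': [1, 1, 0, 0]
}
df = pd.DataFrame(data)

# Hold out half of the rows. stratify keeps both labels in each part;
# random_state only makes the split repeatable (assumed values).
train_df, test_df = train_test_split(
    df, test_size=0.5, stratify=df['label'], random_state=0
)
print(len(train_df), len(test_df))
```

With only four rows the split is purely illustrative; on real data the vectorizer and the tree would be fitted on the training part only.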
  2. Feature extraction: for the text content in the data, features can be extracted with methods such as the bag-of-words model or TF-IDF.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['content'])
y = df['label']
  3. Model training: fit a decision tree model on the extracted features.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X, y)
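  4. Feature selection: use the fitted model's feature_importances_ attribute to pick the most important features.
importances = clf.feature_importances_
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
# Print each feature with its importance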
for feature_name, importance in zip(feature_names, importances):
    print(feature_name, importance)
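  5. Feature generation: based on the selected features, create new features in the original data.
# For example, use the number of occurrences of "pidancode.com" as a new feature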
new_feature = []
for content in df['content']:
    count = content.count('pidancode.com')
    new_feature.append(count)

df['pidancode.com'] = new_feature
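
The loop above can also be written with pandas string methods, which extends naturally to several terms. A small sketch; the list `selected_terms` and the `count_` column prefix are hypothetical names chosen for illustration:

```python
import re
import pandas as pd

data = {
    'content': ['Welcome to pidancode.com', 'Join us at PIDA Website',
                'Learn coding with PIDA', 'Start your coding journey at pidancode.com'],
    'label': [1, 1, 0, 0]
}
df = pd.DataFrame(data)

# Hypothetical list of terms to turn into count features.
selected_terms = ['pidancode.com', 'coding']

for term in selected_terms:
    # Series.str.count takes a regular expression, so escape the term
    # to match it literally (the "." would otherwise match any character).
    df[f'count_{term}'] = df['content'].str.count(re.escape(term))

print(df)
```

The match is case-sensitive, like `str.count` in the original loop, so `PIDA` and `pida` would be counted separately.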

The complete code is as follows:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Data preparation
data = {
    'content': ['Welcome to pidancode.com', 'Join us at PIDA Website', 'Learn coding with PIDA', 'Start your coding journey at pidancode.com'],
    'label': [1, 1, 0, 0]
}
df = pd.DataFrame(data)

# Feature extraction
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['content'])
y = df['label']

# Model training
clf = DecisionTreeClassifier()
clf.fit(X, y)

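# Feature selection
importances = clf.feature_importances_
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
# Print each feature with its importance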
for feature_name, importance in zip(feature_names, importances):
    print(feature_name, importance)

# Feature generation
new_feature = []
for content in df['content']:
    count = content.count('pidancode.com')
    new_feature.append(count)

df['pidancode.com'] = new_feature
print(df)

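The code prints one line per vocabulary token, followed by the DataFrame. On this toy data only "coding" separates the two labels perfectly, so the tree needs a single split and should assign it all of the importance:

at 0.0
coding 1.0
com 0.0
join 0.0
journey 0.0
learn 0.0
pida 0.0
pidancode 0.0
start 0.0
to 0.0
us 0.0
website 0.0
welcome 0.0
with 0.0
your 0.0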
                                      content  label  pidancode.com
0                    Welcome to pidancode.com      1              1
1                     Join us at PIDA Website      1              0
2                      Learn coding with PIDA      0              0
3  Start your coding journey at pidancode.com      0              1
