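Outlier Detection with Decision Trees in Python
In Python, decision-tree outlier detection can be implemented with DecisionTreeRegressor from the scikit-learn library. The idea is to fit a regression tree to the data and treat samples that the tree predicts poorly, that is, samples with large residuals, as outliers. The steps are as follows: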
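1. Import the required libraries and load the dataset
# load_boston was removed in scikit-learn 1.2; the California housing data stands in for it here
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()  # downloaded and cached on first use
X = data['data']
y = data['target']  # median house value, in units of $100,000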
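2. Build the decision tree model and train it
# max_depth=5 is an added assumption: a tree with no depth limit memorizes the training set, which leaves every residual at zero
clf = DecisionTreeRegressor(max_depth=5, random_state=0)
clf.fit(X, y)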
3. Compute each sample's residual under the model
residuals = y - clf.predict(X)
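One caveat about this step: a DecisionTreeRegressor grown without a depth limit can reproduce its training targets exactly, so residuals computed on the training data carry no signal. The sketch below illustrates the effect and one possible workaround, out-of-fold predictions from cross_val_predict. The synthetic data and the names X_demo and y_demo are invented for illustration and are not part of the original recipe.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for this illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=300)

# A fully grown tree ends up with one training sample per leaf here,
# so it predicts its own training targets exactly.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
in_sample = y_demo - full_tree.predict(X_demo)

# Out-of-fold residuals: each sample is predicted by a tree that never saw it.
out_of_fold = y_demo - cross_val_predict(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo, cv=5
)

print("max |in-sample residual|:  ", np.abs(in_sample).max())
print("max |out-of-fold residual|:", np.abs(out_of_fold).max())
```

Limiting the tree with max_depth or min_samples_leaf is a lighter alternative when refitting several times is too costly.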
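4. Choose an outlier threshold from the distribution of the residuals
Plot a histogram or a boxplot of the residuals and pick a threshold that separates the bulk of the samples from the tails. For example, samples whose absolute residual exceeds 2.5 can be marked as outliers. Residuals are expressed in the units of the target, so a suitable value depends on the dataset.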
import matplotlib.pyplot as plt

# Histogram of the residuals
plt.figure(figsize=(6, 4))
plt.hist(residuals, bins=50)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residuals distribution')
plt.show()

# Boxplot of the residuals (drawn vertically, so the values run along the y-axis)
plt.figure(figsize=(6, 4))
plt.boxplot(residuals)
plt.ylabel('Residuals')
plt.title('Residuals boxplot')
plt.show()
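The boxplot's whisker rule can also be turned into numeric thresholds instead of reading a value off the chart. Below is a minimal sketch of that idea using Tukey's 1.5 × IQR rule, which is the default matplotlib uses to decide which points to draw as fliers. The array residuals_demo is invented stand-in data, not the residuals from the model above.

```python
import numpy as np

# Stand-in residuals, invented for illustration: mostly small noise plus
# three deliberately large values appended at the end.
rng = np.random.default_rng(0)
residuals_demo = np.concatenate([rng.normal(0, 1, size=200), [8.0, -9.0, 10.0]])

# Tukey's rule: anything beyond 1.5 * IQR from the quartiles is flagged.
q1, q3 = np.percentile(residuals_demo, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (residuals_demo < lower) | (residuals_demo > upper)
print(f"fences: [{lower:.2f}, {upper:.2f}], flagged: {is_outlier.sum()}")
```

Unlike a fixed cutoff such as 2.5, these fences adapt to the scale of the residuals, at the cost of always flagging some points when the distribution has heavy tails.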
5. Mark the outliers
threshold = 2.5
outliers = X[abs(residuals) > threshold]
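Selecting rows of X returns the feature values but loses track of which samples they were. The sketch below shows one way to keep the row indices and rank the flagged samples by how far off the prediction was. The small arrays X_demo and residuals_demo are made up for illustration.

```python
import numpy as np

# Stand-in arrays, invented for illustration; in the walkthrough these would
# be the feature matrix and the residuals from step 3.
X_demo = np.arange(12, dtype=float).reshape(6, 2)
residuals_demo = np.array([0.3, -4.0, 1.1, 2.9, -0.2, 6.5])
threshold = 2.5

# Row indices of the flagged samples.
outlier_idx = np.flatnonzero(np.abs(residuals_demo) > threshold)

# Order the flagged rows from largest to smallest absolute residual.
order = np.argsort(-np.abs(residuals_demo[outlier_idx]))
ranked_idx = outlier_idx[order]

for i in ranked_idx:
    print(f"row {i}: residual={residuals_demo[i]:+.1f}, features={X_demo[i]}")
```

Keeping indices makes it straightforward to look the flagged rows up in the original data source or to drop them with a boolean mask.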
Complete code example:
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt

# Load data (load_boston was removed in scikit-learn 1.2; California housing stands in for it)
data = fetch_california_housing()
X = data['data']
y = data['target']

# Fit decision tree; max_depth is an added assumption, since an unrestricted tree leaves zero residuals
clf = DecisionTreeRegressor(max_depth=5, random_state=0)
clf.fit(X, y)

# Calculate residuals
residuals = y - clf.predict(X)

# Plot residuals distribution
plt.figure(figsize=(6, 4))
plt.hist(residuals, bins=50)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residuals distribution')
plt.show()

# Plot residuals boxplot
plt.figure(figsize=(6, 4))
plt.boxplot(residuals)
plt.ylabel('Residuals')
plt.title('Residuals boxplot')
plt.show()

# Set threshold for outliers and mark them
threshold = 2.5
outliers = X[abs(residuals) > threshold]
print('Outliers:', outliers)
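To check that the approach behaves as intended, the steps can be wrapped in a function and run on data where the corrupted samples are known in advance. The sketch below does this under stated assumptions: tree_residual_outliers is a hypothetical helper written for this example (not a scikit-learn function), the data is synthetic, and the max_depth and min_samples_leaf settings are illustrative choices rather than recommended values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def tree_residual_outliers(X, y, threshold, max_depth=4, min_samples_leaf=20):
    """Return a boolean mask of samples whose |residual| exceeds threshold.

    Hypothetical helper for this example. min_samples_leaf keeps the tree
    from isolating a corrupted sample in a leaf of its own, which would
    drive that sample's residual to zero and hide it.
    """
    model = DecisionTreeRegressor(
        max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=0
    )
    model.fit(X, y)
    residuals = y - model.predict(X)
    return np.abs(residuals) > threshold


# Synthetic data, invented for illustration: a noisy linear trend.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 0.3, size=500)

# Corrupt three targets so there is something to find.
planted = np.array([10, 120, 333])
y_demo[planted] += 15.0

mask = tree_residual_outliers(X_demo, y_demo, threshold=6.0)
print("planted:", planted)
print("flagged:", np.flatnonzero(mask))
```

On real data the corrupted rows are unknown, so a check like this only shows that the chosen tree settings and threshold are capable of surfacing errors of a given size.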