Feature Selection Methods

Classification by the number of variables considered at a time:

  • Univariate methods, variable ranking: consider the input variables (features, attributes) one by one.

  • Multivariate methods, variable subset selection: consider whole groups of variables together.

Classification by how the machine learning model is used during selection:

  • Filter: selects a subset of variables independently of the model that shall subsequently use them.
  • Wrapper: selects a subset of variables taking into account the model that shall use them.
  • Embedded: the feature selection method is built into the ML model (or rather its training algorithm) itself (e.g. decision trees).

Common Filter Methods

Filter methods rank features with a specific evaluation criterion such as distance, information, dependency, or consistency, and select variables based on that ranking, hence the name "filter". They are typically used as a data preprocessing step: the selection is independent of any machine learning algorithm. Features are ranked by statistical scores that favor features correlated with the outcome variable.

1. F Test (ANOVA)

Scikit-learn provides SelectKBest to keep the K best features according to an F-test:

sklearn.feature_selection.f_regression

for regression problems, and

sklearn.feature_selection.f_classif

for classification problems.
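A minimal sketch of SelectKBest with f_classif; the iris dataset and k=2 are illustrative choices, not from the original text:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Iris: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)  # (150, 4) -> (150, 2)
print("F-scores:", selector.scores_)
```

The per-feature scores in `selector.scores_` can also be inspected directly to decide how many features are worth keeping.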

2. Mutual Information

The F-test only captures linear relationships between a feature and the label; mutual information also handles nonlinear relationships well.

sklearn.feature_selection.mutual_info_regression
sklearn.feature_selection.mutual_info_classif
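A small synthetic example (the quadratic target and noise level are my own illustration) showing the point above: a purely nonlinear dependency is invisible to the F-test but picked up by mutual information:

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(1000, 2))
# The target depends quadratically on the first feature only;
# the second feature is pure noise. The linear correlation of
# x with x**2 on a symmetric interval is ~0.
y = X[:, 0] ** 2 + 0.05 * rng.randn(1000)

f_scores, _ = f_regression(X, y)
mi_scores = mutual_info_regression(X, y, random_state=0)

print("F-scores: ", f_scores)    # both small: no linear relation detected
print("MI scores:", mi_scores)   # clearly higher for the first feature
```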

Common Wrapper Methods

Select a subset of features, then evaluate its modeling performance.

1. Forward Search

2. Recursive Feature Elimination

Wrapper methods greedily search for the best feature subset. Their drawback is that they require training a large number of models, which is computationally expensive.
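Both wrapper strategies above are available in scikit-learn; a sketch using a logistic regression as the wrapped model (the synthetic dataset and the choice of 3 features are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

model = LogisticRegression(max_iter=1000)

# Forward search: start from the empty set and greedily add the feature
# that most improves the cross-validated score, until 3 are selected.
sfs = SequentialFeatureSelector(model, n_features_to_select=3,
                                direction="forward")
sfs.fit(X, y)

# Recursive feature elimination: fit on all features, then repeatedly
# drop the feature with the smallest coefficient magnitude.
rfe = RFE(model, n_features_to_select=3)
rfe.fit(X, y)

print("Forward search kept:", sfs.get_support())
print("RFE kept:           ", rfe.get_support())
```

Note how many model fits each strategy costs: forward search refits the model for every candidate feature at every step, which is exactly the computational burden described above.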

Common Embedded Methods

1. LASSO Linear Regression

2. Tree-based models
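A sketch of LASSO as an embedded selector; the synthetic dataset and alpha=1.0 are illustrative choices. The L1 penalty drives the coefficients of irrelevant features to exactly zero, so selection happens as a side effect of training:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic regression data: 10 features, only 3 carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# The L1 penalty zeroes out coefficients of uninformative features.
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(lasso.coef_))

# SelectFromModel keeps only the features with non-zero coefficients.
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)
print("kept", X_selected.shape[1], "of", X.shape[1], "features")
```

Tree-based models work the same way through their `feature_importances_` attribute, which SelectFromModel can use as the selection criterion instead of coefficients.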

