# 4.3. Preprocessing data

The `sklearn.preprocessing` package provides several utility functions and transformer classes to change raw feature vectors into a representation better suited to the downstream estimators.

## 4.3.1. Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators: they may behave badly if the individual features do not more or less look like standard normally distributed data (zero mean and unit variance). The function `scale` provides a quick way to perform this operation on a single array-like dataset:

```
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
```

Scaled data has zero mean and unit variance:

```
>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])

>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])
```

The `preprocessing` module also provides a utility class `StandardScaler` that implements the `Transformer` API to compute the mean and standard deviation on a training set, so that the same scaling can later be reapplied to a test set. This class is hence suitable for use in the early steps of a `sklearn.pipeline.Pipeline`:

```
>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)

>>> scaler.mean_
array([ 1. ...,  0. ...,  0.33...])

>>> scaler.scale_
array([ 0.81...,  0.81...,  1.24...])

>>> scaler.transform(X)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
```

The scaler instance can then be used on new data to transform it the same way it did on the training set:

```
>>> scaler.transform([[-1.,  1., 0.]])
array([[-2.44...,  1.22..., -0.26...]])
```
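The standardization performed by `scale` and `StandardScaler` can also be reproduced by hand, which makes the underlying formula explicit. This is a minimal numpy sketch; note that the population standard deviation (`ddof=0`, numpy's default) is used:

```python
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# Standardization: subtract the per-column mean and divide by the
# per-column standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0. 0. 0.]
print(X_scaled.std(axis=0))   # [1. 1. 1.]
```

The result matches the `preprocessing.scale(X)` output shown above.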

### 4.3.1.1. Scaling features to a range

An alternative standardization is scaling features to lie within a given minimum and maximum value, often between zero and one, using `MinMaxScaler` or `MaxAbsScaler`. The motivation for using this scaling includes robustness to very small standard deviations of features and preserving zero entries in sparse data. Here is an example of scaling a toy data matrix to the `[0, 1]` range:

```
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])
```

The same instance of the transformer can then be applied to new test data unseen during the fit call; the same scaling and shifting operations are applied:

```
>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])
```

It is possible to introspect the scaler attributes to find out the exact nature of the transformation learned on the training data:

```
>>> min_max_scaler.scale_
array([ 0.5       ,  0.5       ,  0.33...])

>>> min_max_scaler.min_
array([ 0.        ,  0.5       ,  0.33...])
```

The full formula used by `MinMaxScaler`, where `min, max` denote the desired `feature_range`, is:

```
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
```
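As a sanity check, the first formula above can be evaluated directly with numpy and compared against the `MinMaxScaler` output shown earlier. This is a minimal sketch for the default `feature_range=(0, 1)`, where the second step is the identity:

```python
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# Min-max scaling to [0, 1] by hand: shift by the per-column minimum,
# then divide by the per-column range.
data_min = X_train.min(axis=0)
data_range = X_train.max(axis=0) - data_min
X_std = (X_train - data_min) / data_range

print(X_std)  # matches X_train_minmax above
```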

`MaxAbsScaler` works in a very similar fashion, but scales the training data into the range `[-1, 1]` by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero, or for sparse data. Here is how to use the data from the previous example with this scaler:

```
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> max_abs_scaler = preprocessing.MaxAbsScaler()
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
>>> X_train_maxabs
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_maxabs = max_abs_scaler.transform(X_test)
>>> X_test_maxabs
array([[-1.5, -1. ,  2. ]])
>>> max_abs_scaler.scale_
array([ 2.,  1.,  2.])
```

### 4.3.1.3. Scaling data with outliers

Further discussion on the importance of centering and scaling data is available on this FAQ: Should I normalize/standardize/rescale the data?

If your data contains many outliers, scaling using the mean and variance is likely to work poorly; in that case, `robust_scale` and `RobustScaler`, which center on the median and scale by the interquartile range, can be used as drop-in replacements. Note also that `scale` and `StandardScaler` accept one-dimensional arrays, which is handy for scaling a target or response variable in regression.
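To see why robust statistics help with outliers, here is a minimal numpy sketch that centers each column on its median and scales by its interquartile range (the approach taken by scikit-learn's `RobustScaler`); the toy matrix below is an illustrative assumption, not from the original examples:

```python
import numpy as np

X = np.array([[ 1., -2.,  2.],
              [-2.,  1.,  3.],
              [ 4.,  1., -2.]])

# Robust scaling: the median and the interquartile range (IQR) are
# barely affected by a few extreme values, unlike the mean and std.
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
X_robust = (X - median) / (q3 - q1)

print(np.median(X_robust, axis=0))  # every column now has median 0
```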

## 4.3.2. Normalization

Normalization is the process of scaling individual samples to have unit norm. The function `normalize` provides a quick way to perform this operation on a single array-like dataset, using the `l1` or `l2` norm:

```
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')

>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
```

The `preprocessing` module also provides a utility class `Normalizer` that implements the same operation using the `Transformer` API (`fit` does nothing here: the class is stateless, since normalization treats each sample independently of the others).

This class is hence suitable for use in the early steps of a `sklearn.pipeline.Pipeline`:

```
>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
>>> normalizer
Normalizer(copy=True, norm='l2')
```

```
>>> normalizer.transform(X)
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

>>> normalizer.transform([[-1.,  1., 0.]])
array([[-0.70...,  0.70...,  0.  ...]])
```

Both `normalize` and `Normalizer` accept dense array-like data as well as sparse matrices from `scipy.sparse` as input.
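The `l2` normalization shown above amounts to dividing each row (sample) by its Euclidean norm; a minimal numpy sketch:

```python
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# L2 normalization rescales each *row* to unit Euclidean length.
norms = np.linalg.norm(X, ord=2, axis=1, keepdims=True)
X_normalized = X / norms

print(np.linalg.norm(X_normalized, axis=1))  # every row now has norm 1
```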

## 4.3.3. Binarization

### 4.3.3.1. Feature binarization

Feature binarization is the process of thresholding numerical features to get boolean values. As with `Normalizer`, the utility class `Binarizer` is meant for the early stages of a `Pipeline`; its `fit` method does nothing, as each sample is treated independently of the others:

```
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]

>>> binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)

>>> binarizer.transform(X)
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
```

It is possible to adjust the threshold of the binarizer:

```
>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])
```
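Binarization is simply an element-wise threshold comparison: values strictly greater than the threshold become 1, everything else becomes 0. A minimal numpy sketch reproducing the default `threshold=0.0` output above:

```python
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# Values strictly greater than the threshold map to 1, the rest to 0.
threshold = 0.0
X_binary = (X > threshold).astype(float)

print(X_binary)
```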

## 4.3.4. Encoding categorical features

Often features are categorical rather than continuous. Integer codes for categories cannot be used directly with scikit-learn estimators, which expect continuous input and would interpret the codes as ordered values. One way to convert categorical features into usable ones is the one-of-K or one-hot encoding, implemented in `OneHotEncoder`, which transforms each categorical feature with `m` possible values into `m` binary features, only one of which is active:

```
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<... 'float'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
```
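The encoding above can be sketched by hand with `np.eye`: each integer column with `n` categories maps to `n` binary columns, and the per-column blocks are concatenated (here 2 + 3 + 4 = 9 columns, matching the 9-element output). This is an illustrative sketch, assuming the categories in each column are the consecutive integers `0..n-1`:

```python
import numpy as np

X = np.array([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# Number of categories per column, inferred from the training data.
n_values = X.max(axis=0) + 1                       # [2, 3, 4]

# np.eye(n)[k] is the one-hot (unit) vector for category k.
blocks = [np.eye(n)[X[:, j]] for j, n in enumerate(n_values)]
X_onehot = np.hstack(blocks)                       # shape (4, 9)

sample = np.array([[0, 1, 3]])
encoded = np.hstack([np.eye(n)[sample[:, j]] for j, n in enumerate(n_values)])
print(encoded)  # one row: 1,0 | 0,1,0 | 0,0,0,1
```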

## 4.3.5. Imputation of missing values

The `Imputer` class provides basic strategies for imputing missing values, such as replacing them with the mean, median, or most frequent value of the row or column in which they are located. The class also supports different encodings of missing values.

```
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]
```

The `Imputer` class also supports sparse matrices:

```
>>> import scipy.sparse as sp
>>> X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.fit(X)
Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
>>> X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
>>> print(imp.transform(X_test))
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]
```

`Imputer` can be used in a `Pipeline` as a way to build a composite estimator that supports imputation. See *Imputing missing values before building an estimator*.
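The `strategy='mean'` behaviour can be sketched directly in numpy. Note that, unlike `Imputer`, which learns the column means from the data passed to `fit`, this sketch computes them from the matrix being transformed:

```python
import numpy as np

X = np.array([[np.nan, 2.],
              [6., np.nan],
              [7., 6.]])

# Mean imputation by hand: replace each NaN with the mean of the
# non-missing values in its column.
col_means = np.nanmean(X, axis=0)              # [6.5, 4.]
X_imputed = np.where(np.isnan(X), col_means, X)

print(X_imputed)
```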

## 4.3.6. Generating polynomial features

It is often useful to add complexity to a model by considering nonlinear features of the input data. A simple and common method is polynomial features, which capture higher-order and interaction terms; they are implemented in `PolynomialFeatures`:

```
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])
```

In some cases, only interaction terms among features are required, and they can be obtained with the setting `interaction_only=True`:

```
>>> X = np.arange(9).reshape(3, 3)
>>> X
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> poly = PolynomialFeatures(degree=3, interaction_only=True)
>>> poly.fit_transform(X)
array([[   1.,    0.,    1.,    2.,    0.,    0.,    2.,    0.],
       [   1.,    3.,    4.,    5.,   12.,   15.,   20.,   60.],
       [   1.,    6.,    7.,    8.,   42.,   48.,   56.,  336.]])
```

See Polynomial interpolation for Ridge regression using created polynomial features.
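For a two-column input, `PolynomialFeatures(2)` expands `(a, b)` to `(1, a, b, a^2, a*b, b^2)`; a minimal numpy sketch reproducing the first output above:

```python
import numpy as np

X = np.arange(6).reshape(3, 2)

# Degree-2 polynomial expansion of (a, b) by hand:
# bias term, the two originals, then all degree-2 monomials.
a, b = X[:, 0], X[:, 1]
X_poly = np.column_stack([np.ones_like(a, dtype=float),
                          a, b, a**2, a * b, b**2])

print(X_poly)  # row for (2, 3): [1. 2. 3. 4. 6. 9.]
```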

## 4.3.7. Custom transformers

Often you will want to turn an existing Python function into a transformer, for example to assist in data cleaning inside a pipeline. `FunctionTransformer` builds a transformer from an arbitrary function; here it applies a log transformation:

```
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])
```
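A note on why `np.log1p` is a good candidate for such a transformer: it computes `log(1 + x)` but stays accurate for very small `x`, where naively forming `1 + x` first rounds away most of `x`:

```python
import numpy as np

x = 1e-10
# log1p evaluates log(1 + x) without forming 1 + x explicitly, so it
# keeps full precision even when x is tiny.
print(np.log1p(x))    # accurate, ~1e-10
print(np.log(1 + x))  # loses precision: 1 + x rounds before the log
```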