Transformers¶

One great feature of scikit-learn is the concept of the Pipeline alongside transformers

By default, scikit-learn’s transformers will convert a pandas DataFrame to numpy arrays - losing valuable column information in the process. We have implemented a number of transformers that accept a pandas DataFrame and return a pandas DataFrame.

Select¶

A column selector - Provide a list of columns to be passed on in the pipeline

Example¶

Pass a list of column names to be selected

>>> from ml_tooling.transformers import Select
>>> import pandas as pd
>>> df = pd.DataFrame({
...    "id": [1, 2, 3, 4],
...    "status": ["OK", "Error", "OK", "Error"],
...    "sales": [2000, 3000, 4000, 5000]
... })
>>> select = Select(['id', 'status'])
>>> select.fit_transform(df)
   id status
0   1     OK
1   2  Error
2   3     OK
3   4  Error

FillNA¶

Fills NA values with given value or strategy. Either a value or a strategy has to be supplied.

Examples¶

You can pass any value to replace NaNs with

>>> from ml_tooling.transformers import FillNA
>>> import numpy as np
>>> df = pd.DataFrame({
...        "id": [1, 2, 3, 4],
...        "sales": [2000, 3000, 4000, np.nan]
... })
>>> fill_na = FillNA(value = 0)
>>> fill_na.fit_transform(df)
   id   sales
0   1  2000.0
1   2  3000.0
2   3  4000.0
3   4     0.0

You can also use one of the built-in strategies.

mean
median
most_freq
max
min

>>> fill_na = FillNA(strategy='mean')
>>> fill_na.fit_transform(df)
   id   sales
0   1  2000.0
1   2  3000.0
2   3  4000.0
3   4  3000.0

In addition, FillNa will indicate if a value in a column was missing if you set indicate_nan=True. This creates a new column of 1 and 0 indicating missing values

>>> fill_na = FillNA(strategy='mean', indicate_nan=True)
>>> fill_na.fit_transform(df)
   id   sales  id_is_nan  sales_is_nan
0   1  2000.0          0             0
1   2  3000.0          0             0
2   3  4000.0          0             0
3   4  3000.0          0             1

ToCategorical¶

Performs one-hot encoding of categorical values through pandas.Categorical. All categorical values not found in training data will be set to 0

Example¶

>>> from ml_tooling.transformers import ToCategorical
>>> df = pd.DataFrame({
...    "status": ["OK", "Error", "OK", "Error"]
... })
>>> onehot = ToCategorical()
>>> onehot.fit_transform(df)
   status_Error  status_OK
0             0          1
1             1          0
2             0          1
3             1          0

FuncTransformer¶

Applies a given function to each column

Example¶

We can use any arbitrary function that accepts a pandas.Series - under the hood, FuncTransformer uses apply()

>>> from ml_tooling.transformers import FuncTransformer
>>> df = pd.DataFrame({
...    "status": ["OK", "Error", "OK", "Error"]
... })
>>> uppercase = FuncTransformer(lambda x: x.str.upper())
>>> uppercase.fit_transform(df)
  status
0     OK
1  ERROR
2     OK
3  ERROR

FuncTransformer also supports passing keyword arguments to the function

>>> from ml_tooling.transformers import FuncTransformer
>>> def custom_func(input, word1, word2):
...    result = ""
...    if input == "OK":
...       result = word1
...    elif input == "Error":
...       result = word2
...    return result
>>> def wrapper(df, word1, word2):
...   return df.apply(custom_func,args=(word1,word2))
>>> df = pd.DataFrame({
...     "status": ["OK", "Error", "OK", "Error"]
... })
>>> kwargs = {'word1': 'Okay','word2': 'Fail'}
>>> wordchange = FuncTransformer(wrapper,**kwargs)
>>> wordchange.fit_transform(df)
  status
0   Okay
1   Fail
2   Okay
3   Fail

Binner¶

Bins numerical data into supplied bins. Bins are passed on to pandas.cut()

Example¶

Here we want to bin our sales data into 3 buckets

>>> from ml_tooling.transformers import Binner
>>> df = pd.DataFrame({
...    "sales": [1500, 2000, 2250, 7830]
... })
>>> binned = Binner(bins=[0, 1000, 2000, 8000])
>>> binned.fit_transform(df)
          sales
0  (1000, 2000]
1  (1000, 2000]
2  (2000, 8000]
3  (2000, 8000]

Renamer¶

Renames columns to be equal to the passed list - must be in order

Example¶

>>> from ml_tooling.transformers import Renamer
>>> df = pd.DataFrame({
...     "Total Sales": [1500, 2000, 2250, 7830]
... })
>>> rename = Renamer(['sales'])
>>> rename.fit_transform(df)
   sales
0   1500
1   2000
2   2250
3   7830

DateEncoder¶

Adds year, month, day and week columns based on a datefield. Each date type can be toggled in the initializer

Example¶

>>> from ml_tooling.transformers import DateEncoder
>>> df = pd.DataFrame({
...     "sales_date": [pd.to_datetime('2018-01-01'), pd.to_datetime('2018-02-02')]
... })
>>> dates = DateEncoder(week=False)
>>> dates.fit_transform(df)
   sales_date_day  sales_date_month  sales_date_year
0               1                 1             2018
1               2                 2             2018

FreqFeature¶

Converts a column into a normalized frequency

Example¶

>>> from ml_tooling.transformers import FreqFeature
>>> df = pd.DataFrame({
...     "sales_category": ['Sale', 'Sale', 'Not Sale']
... })
>>> freq = FreqFeature()
>>> freq.fit_transform(df)
   sales_category
0        0.666667
1        0.666667
2        0.333333

DFFeatureUnion¶

A FeatureUnion equivalent for DataFrames. Concatenates the result of multiple transformers

Example¶

>>> from ml_tooling.transformers import FreqFeature, Binner, Select, DFFeatureUnion
>>> from sklearn.pipeline import Pipeline
>>> df = pd.DataFrame({
...     "sales_category": ['Sale', 'Sale', 'Not Sale', 'Not Sale'],
...     "sales": [1500, 2000, 2250, 7830]
... })
>>> freq = Pipeline([
...     ('select', Select('sales_category')),
...     ('freq', FreqFeature())
... ])
>>> binned = Pipeline([
...     ('select', Select('sales')),
...     ('bin', Binner(bins=[0, 1000, 2000, 8000]))
...     ])
>>> union = DFFeatureUnion([
...    ('sales_category', freq),
...    ('sales', binned)
... ])
>>> union.fit_transform(df)
   sales_category         sales
0             0.5  (1000, 2000]
1             0.5  (1000, 2000]
2             0.5  (2000, 8000]
3             0.5  (2000, 8000]

DFRowFunc¶

Row-wise operation on pandas.DataFrame. Strategy can either be one of the predefined or a callable. If some elements in the row are NaN these elements are ignored for the built-in strategies. The built-in strategies are ‘sum’, ‘min’ and ‘max’

Example¶

>>> from ml_tooling.transformers import DFRowFunc
>>> df = pd.DataFrame({
...    "number_1": [1, np.nan, 3, 4],
...    "number_2": [1, 3, 2, 4]
... })
>>> rowfunc = DFRowFunc(strategy = 'sum')
>>> rowfunc.fit_transform(df)
     0
0  2.0
1  3.0
2  5.0
3  8.0

You can also use any callable that takes a pandas.Series

>>> rowfunc = DFRowFunc(strategy = np.mean)
>>> rowfunc.fit_transform(df)
     0
0  1.0
1  3.0
2  2.5
3  4.0

Binarize¶

Convenience transformer which returns 1 where the column value is equal to given value else 0.

Example¶

>>> from ml_tooling.transformers import Binarize
>>> df = pd.DataFrame({
...     "number_1": [1, np.nan, 3, 4],
...     "number_2": [1, 3, 2, 4]
... })
>>> binarize = Binarize(value = 3)
>>> binarize.fit_transform(df)
   number_1  number_2
0         0         0
1         0         1
2         1         0
3         0         0

RareFeatureEncoder¶

Replaces categories with a value, if they occur less than a threshold. - Using pandas.Series.value_counts(). The fill value can be any value and the threshold can be either a percent or int value.

The column names needs to be identical when using Train & Test dataset

The Transformer does not count NaN.

Example¶

>>> from ml_tooling.transformers import RareFeatureEncoder
>>> df = pd.DataFrame({
...         "categorical_a": [1, "a", "a", 2, "b", np.nan],
...         "categorical_b": [1, 2, 2, 3, 3, 3],
...         "categorical_c": [1, "a", "a", 2, "b", "b"],
... })

>>> rare = RareFeatureEncoder(threshold=2, fill_rare="Rare")
>>> rare.fit_transform(df)
  categorical_a categorical_b categorical_c
0          Rare          Rare          Rare
1             a             2             a
2             a             2             a
3          Rare             3          Rare
4          Rare             3             b
5           NaN             3             b