Model¶
The Model base class contains all the neat functionality of ML Tooling.
To take advantage of this functionality, simply wrap a model that follows the scikit-learn API in the Model class.
See also
Refer to Model for a full overview of methods
We will be using scikit-learn’s built-in Boston house prices dataset to demonstrate how to use ML Tooling.
We use the load_demo_dataset() method to load the dataset.
We then simply wrap a LinearRegression in our Model class and we are ready to begin!
>>> from ml_tooling.data import load_demo_dataset
>>>
>>> bostondata = load_demo_dataset("boston")
>>> # Remember to setup a train test split!
>>> bostondata.create_train_test()
<BostonData - Dataset>
Creating your model¶
The first thing to do after creating a dataset object is to create a model object.
This is done by supplying an estimator to the Model class.
>>> from ml_tooling import Model
>>> from sklearn.linear_model import LinearRegression
>>>
>>> linear = Model(LinearRegression())
>>> linear
<Model: LinearRegression>
Scoring your model¶
In order to evaluate the performance of the model, use the score_estimator() method.
This will train the estimator on the training split of our bostondata Dataset and evaluate it on the test split.
If no training split has been created from the data, the method will create one using the default configuration values.
It returns an instance of Result, which we can then introspect further.
>>> result = linear.score_estimator(bostondata)
>>> result
<Result LinearRegression: {'r2': 0.68}>
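A Result bundles the fitted estimator with its scores and can be introspected further. A minimal sketch, assuming the metrics attribute and the plotting accessor described in the Result API docs:
>>> metrics = result.metrics             # the scored metric(s) shown in the repr
>>> ax = result.plot.residuals()         # residual plot for the test split
>>> ax = result.plot.prediction_error()  # predicted vs. actual values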
Testing multiple estimators¶
To test which estimator performs best, use the test_estimators() method.
This method trains each estimator on the train split and evaluates its performance on the test split.
It returns a new Model instance wrapping the best-performing estimator, along with a ResultGroup of all the results.
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.ensemble import RandomForestRegressor
>>> best_model, results = Model.test_estimators(
... bostondata,
... [LinearRegression(), RandomForestRegressor(n_estimators=10, random_state=1337)],
... metrics='r2')
>>> results
ResultGroup(results=[<Result RandomForestRegressor: {'r2': 0.82}>, <Result LinearRegression: {'r2': 0.68}>])
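You can also compare the estimators on several metrics at once; passing a list of scikit-learn metric names is assumed here to work the same way as a single metric name:
>>> best_model, results = Model.test_estimators(
...     bostondata,
...     [LinearRegression(), RandomForestRegressor(n_estimators=10, random_state=1337)],
...     metrics=['r2', 'neg_mean_squared_error'])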
Training your model¶
When the best model has been found, use train_estimator() to train the model on the full dataset.
Note
This should be the last step before saving the model for production.
>>> linear.train_estimator(bostondata)
<Model: LinearRegression>
Predicting with your model¶
To make a prediction, use the make_prediction() method.
This will call the load_prediction_data() method defined in your dataset.
>>> customer_id = 42
>>> linear.make_prediction(bostondata, customer_id)
Prediction
0 25.203866
make_prediction() also has a proba parameter, which will return the underlying class probabilities when working on a classification problem.
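Since the Boston data is a regression problem, proba does not apply to it. A minimal sketch of the classification case, assuming the demo loader also exposes scikit-learn's iris dataset:
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> iris_data = load_demo_dataset("iris")  # assumed to be available like "boston"
>>> clf = Model(LogisticRegression())
>>> clf.train_estimator(iris_data)
<Model: LogisticRegression>
>>> # Returns a DataFrame of per-class probabilities instead of labels
>>> probabilities = clf.make_prediction(iris_data, 0, proba=True)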
Defining a Feature Pipeline¶
It is very common to define a feature preprocessing pipeline to preprocess your data before passing it to the estimator.
Using a Pipeline ensures that the preprocessing is “learned” on the training split and only applied, not refitted, on the validation split.
When passing a feature_pipeline to Model, it will automatically create a Pipeline with two steps: features and estimator.
>>> from ml_tooling import Model
>>> from ml_tooling.transformers import DFStandardScaler
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LinearRegression
>>>
>>> feature_pipeline = Pipeline([("scaler", DFStandardScaler())])
>>> model = Model(LinearRegression(), feature_pipeline=feature_pipeline)
>>> model.estimator
Pipeline(steps=[('features', Pipeline(steps=[('scaler', DFStandardScaler())])),
('estimator', LinearRegression())])
Performing a gridsearch¶
To find the best hyperparameters for an estimator, you can use gridsearch(), passing a dictionary of hyperparameters to try.
>>> best_estimator, results = linear.gridsearch(bostondata, { "normalize": [False, True] })
>>> results
ResultGroup(results=[<Result LinearRegression: {'r2': 0.72}>, <Result LinearRegression: {'r2': 0.72}>])
The input hyperparameters have a similar format to GridSearchCV, so if we are gridsearching using a Pipeline, we can pass hyperparameters using the same <step>__<parameter> syntax.
>>> from sklearn.pipeline import Pipeline
>>> from ml_tooling.transformers import DFStandardScaler
>>> from ml_tooling import Model
>>>
>>> feature_pipe = Pipeline([('scale', DFStandardScaler())])
>>> pipe_model = Model(LinearRegression(), feature_pipeline=feature_pipe)
>>> best_estimator, results = pipe_model.gridsearch(bostondata, { "estimator__normalize": [False, True]})
>>> results
ResultGroup(results=[<Result LinearRegression: {'r2': 0.72}>, <Result LinearRegression: {'r2': 0.72}>])
Using the logging capability of Model's log() method, we can save each result to a yaml file.
>>> with linear.log("./bostondata_linear"):
... best_estimator, results = linear.gridsearch(bostondata, { "normalize": [False, True] })
This will generate a yaml file for each result:
created_time: 2019-10-31 17:32:08.233522
estimator:
- classname: LinearRegression
  module: sklearn.linear_model.base
  params:
    copy_X: true
    fit_intercept: true
    n_jobs: null
    normalize: true
estimator_path: null
git_hash: afa6def92a1e8a0ac571bec254129818bb337c49
metrics:
  r2: 0.7160133196648374
model_name: BostonData_LinearRegression
versions:
  ml_tooling: 0.9.1
  pandas: 0.25.2
  sklearn: 0.21.3
Performing a randomized search¶
Similar to the interface of the gridsearch above, you can make a more efficient but less exhaustive search of the parameter space with a randomized search.
>>> from sklearn.ensemble import RandomForestRegressor
>>> from ml_tooling.search import Real
>>> rand_forest = Model(RandomForestRegressor())
>>>
>>> search_space = {
... "max_depth": [1, 3],
... "min_weight_fraction_leaf": Real(0, 0.5),
... }
>>> best_estimator, results = rand_forest.randomsearch(bostondata, search_space, n_iter=2)
>>> results
ResultGroup(results=[<Result RandomForestRegressor: {'r2': 0.83}>, <Result RandomForestRegressor: {'r2': 0.56}>])
Here we specify n_iter=2 just for demonstration purposes; n_iter is the number of samples drawn from the parameter space.
ML-Tooling uses skopt's Spaces to define a sampling space. You can import them from ml_tooling.search or from skopt directly.
When a list is given in the search space, a linear distribution is used by default, but you may also pass other distributions.
ML-Tooling supports Real, Integer and Categorical. Each of these also supports prior distributions if more granular control is required, as sketched below.
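A minimal sketch of a search space using all three dimension types; the log-uniform prior comes from skopt, and the RandomForestRegressor parameter names are only illustrative:
>>> from ml_tooling.search import Real, Integer, Categorical
>>>
>>> search_space = {
...     "min_weight_fraction_leaf": Real(0.01, 0.5, prior="log-uniform"),
...     "max_depth": Integer(1, 10),
...     "max_features": Categorical(["sqrt", "log2"]),
... }
>>> best_estimator, results = rand_forest.randomsearch(bostondata, search_space, n_iter=2)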
Performing a Bayesian Search¶
ML-Tooling also supports Bayesian search - a stepwise search, where we build a surrogate model to estimate the effect
of changing a given hyperparameter on the error. This surrogate model allows us to take steps in directions where the
model thinks it can improve the error. Bayesian search is implemented using skopt and is a drop-in replacement for randomsearch().
>>> from sklearn.ensemble import RandomForestRegressor
>>> from ml_tooling.search import Real
>>> rand_forest = Model(RandomForestRegressor())
>>>
>>> search_space = {
... "max_depth": [1, 3],
... "min_weight_fraction_leaf": Real(0, 0.5),
... }
>>> best_estimator, results = rand_forest.bayesiansearch(bostondata, search_space, n_iter=2)
>>> results
ResultGroup(results=[<Result RandomForestRegressor: {'r2': 0.83}>, <Result RandomForestRegressor: {'r2': 0.56}>])
Storage¶
In order to store our estimators for later use or comparison, we use a Storage class and pass it to save_estimator().
>>> from ml_tooling.storage import FileStorage
>>>
>>> estimator_dir = './estimator_dir'
>>> storage = FileStorage(estimator_dir)
>>> estimator_path = linear.save_estimator(storage)
>>> estimator_path.name
'LinearRegression_2019-10-23_13:23:22.058684.pkl'
Model creates a filename for the saved estimator based on the current date and time and the estimator name.
We can also load an estimator from storage by passing the filename to load from the Storage directory.
>>> loaded_linear = linear.load_estimator(estimator_path.name, storage=storage)
>>> loaded_linear
<Model: LinearRegression>
Saving an estimator ready for production¶
Once you have a trained estimator, you can save it on your filesystem ready for use in production.
Users of your model package can then always find your estimator through load_production_estimator(), using the module name.
By default, if no storage is specified, ML-Tooling will save estimators in a folder called estimators in your current working directory.
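A minimal sketch of the round trip; the prod flag on save_estimator is an assumption here, and 'my_model_package' is a hypothetical package name:
>>> # Mark the saved estimator as the production estimator (prod flag assumed)
>>> path = linear.save_estimator(storage, prod=True)
>>> # Consumers of the package can then load it by module name (hypothetical name)
>>> model = Model.load_production_estimator('my_model_package')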
Configuration¶
To change the default configuration values, modify the config attributes directly:
>>> linear.config.RANDOM_STATE = 2
See also
Refer to Config for a list of available configuration options
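For example, two more one-liners; TEST_SIZE and N_JOBS are assumed here to be among the available options, so check Config for the authoritative list:
>>> linear.config.TEST_SIZE = 0.25  # assumed option: fraction of data held out for the test split
>>> linear.config.N_JOBS = -1       # assumed option: use all available cores where supported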
Logging¶
We also have the ability to log our experiments using the Model.log() context manager.
The results will be saved in the directory passed to log(), in this case test_dir.
>>> with linear.log('test_dir'):
... linear.score_estimator(bostondata)
<Result LinearRegression: {'r2': 0.68}>
This will write a yaml file specifying attributes of the model, results, git-hash of the model and other pertinent information.
See also
Check out Model.log()
for more info on what is logged
Continue to Storage