Model

The Model base class contains all the neat functionality of ML Tooling.

In order to take advantage of this functionality, simply wrap an estimator that follows the scikit-learn API in the Model class.

See also

Refer to Model for a full overview of methods

We will be using scikit-learn’s built-in Boston house prices dataset to demonstrate how to use ML Tooling. We use the method load_demo_dataset() to load the dataset.

We then simply wrap a LinearRegression using our Model class and we are ready to begin!

>>> from ml_tooling.data import load_demo_dataset
>>>
>>> bostondata = load_demo_dataset("boston")
>>> # Remember to setup a train test split!
>>> bostondata.create_train_test()
<BostonData - Dataset>

Creating your model

The first thing to do after creating a dataset object is to create a model object. This is done by supplying an estimator to the Model.

>>> from ml_tooling import Model
>>> from sklearn.linear_model import LinearRegression
>>>
>>> linear = Model(LinearRegression())
>>> linear
<Model: LinearRegression>

Scoring your model

In order to evaluate the performance of the model, use the score_estimator() method. This will train the estimator on the training split of our bostondata Dataset and evaluate it on the test split. If no train-test split has been created from the data, the method will create one using the default configuration values. It returns an instance of Result, which we can then introspect further.

>>> result = linear.score_estimator(bostondata)
>>> result
<Result LinearRegression: {'r2': 0.68}>
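
A Result can be introspected further, for example through its metrics or plots. Here is a hedged sketch (the attribute and method names are assumptions about the Result API, so the calls are left commented out):

>>> # Assumed attribute/method names - verify against the Result documentation
>>> # result.metrics           # the calculated metric(s)
>>> # result.plot.residuals()  # e.g. a residual plot for regression results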

Testing multiple estimators

To test which estimator performs best, use the test_estimators() method. This method trains each estimator on the train split and evaluates the performance on the test split. It returns a new Model instance containing the best-performing estimator, along with a ResultGroup.

>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.ensemble import RandomForestRegressor
>>> best_model, results = Model.test_estimators(
...     bostondata,
...     [LinearRegression(), RandomForestRegressor(n_estimators=10, random_state=1337)],
...     metrics='r2')
>>> results
ResultGroup(results=[<Result RandomForestRegressor: {'r2': 0.82}>, <Result LinearRegression: {'r2': 0.68}>])

Training your model

When the best model has been found, use train_estimator() to train the model on the full dataset.

Note

This should be the last step before saving the model for production.

>>> linear.train_estimator(bostondata)
<Model: LinearRegression>

Predicting with your model

To make a prediction, use the make_prediction() method. This will call the load_prediction_data() method defined in your dataset.

>>> customer_id = 42
>>> linear.make_prediction(bostondata, customer_id)
   Prediction
0   25.203866
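
make_prediction() forwards its extra arguments (here customer_id) to load_prediction_data(). As a hedged sketch of how that hook might look in a custom dataset (the file name, column names, and lookup logic are purely illustrative):

>>> import pandas as pd
>>> from ml_tooling.data import Dataset
>>>
>>> class CustomerData(Dataset):
...     def load_training_data(self):
...         df = pd.read_csv("customers.csv")  # illustrative data source
...         return df.drop(columns="target"), df["target"]
...
...     def load_prediction_data(self, customer_id):
...         # Receives the arguments passed to make_prediction()
...         df = pd.read_csv("customers.csv")  # illustrative data source
...         return df.loc[[customer_id]].drop(columns="target")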

make_prediction() also has a proba parameter, which returns the underlying class probabilities when working on a classification problem.
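
Since the Boston data is a regression problem, probabilities do not apply here; as a purely hypothetical sketch with a classifier (the classifier and the classification_data dataset below are assumptions):

>>> from sklearn.linear_model import LogisticRegression
>>>
>>> # Hypothetical classification setup - classification_data is assumed
>>> # classifier = Model(LogisticRegression())
>>> # classifier.score_estimator(classification_data)
>>> # classifier.make_prediction(classification_data, customer_id, proba=True)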

Defining a Feature Pipeline

It is very common to define a feature preprocessing pipeline to preprocess your data before passing it to the estimator. Using a Pipeline ensures that the preprocessing is “learned” on the training split and only applied on the validation split. When passing a feature_pipeline, Model will automatically create a Pipeline with two steps: features and estimator.

>>> from ml_tooling import Model
>>> from ml_tooling.transformers import DFStandardScaler
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LinearRegression
>>>
>>> feature_pipeline = Pipeline([("scaler", DFStandardScaler())])
>>> model = Model(LinearRegression(), feature_pipeline=feature_pipeline)
>>> model.estimator
Pipeline(steps=[('features', Pipeline(steps=[('scaler', DFStandardScaler())])),
                ('estimator', LinearRegression())])

Performing a gridsearch

To find the best hyperparameters for an estimator you can use gridsearch(), passing a dictionary of hyperparameters to try.

>>> best_estimator, results = linear.gridsearch(bostondata, { "normalize": [False, True] })
>>> results
ResultGroup(results=[<Result LinearRegression: {'r2': 0.72}>, <Result LinearRegression: {'r2': 0.72}>])

The input hyperparameters have a similar format to GridSearchCV, so if we are gridsearching using a Pipeline, we can pass hyperparameters using the same syntax.

>>> from sklearn.pipeline import Pipeline
>>> from ml_tooling.transformers import DFStandardScaler
>>> from ml_tooling import Model
>>>
>>> feature_pipe = Pipeline([('scale', DFStandardScaler())])
>>> pipe_model = Model(LinearRegression(), feature_pipeline=feature_pipe)
>>> best_estimator, results = pipe_model.gridsearch(bostondata, { "estimator__normalize": [False, True]})
>>> results
ResultGroup(results=[<Result LinearRegression: {'r2': 0.72}>, <Result LinearRegression: {'r2': 0.72}>])

Using the logging capability of the Model.log() method, we can save each result to a yaml file.

>>> with linear.log("./bostondata_linear"):
...     best_estimator, results = linear.gridsearch(bostondata, { "normalize": [False, True] })

This will generate a yaml file for each result:

created_time: 2019-10-31 17:32:08.233522
estimator:
- classname: LinearRegression
module: sklearn.linear_model.base
params:
    copy_X: true
    fit_intercept: true
    n_jobs: null
    normalize: true
estimator_path: null
git_hash: afa6def92a1e8a0ac571bec254129818bb337c49
metrics:
    r2: 0.7160133196648374
model_name: BostonData_LinearRegression
versions:
    ml_tooling: 0.9.1
    pandas: 0.25.2
    sklearn: 0.21.3

Storage

In order to store our estimators for later use or comparison, we use a Storage class and pass it to save_estimator().

>>> from ml_tooling.storage import FileStorage
>>>
>>> estimator_dir = './estimator_dir'
>>> storage = FileStorage(estimator_dir)
>>> estimator_path = linear.save_estimator(storage)
>>> estimator_path.name 
'LinearRegression_2019-10-23_13:23:22.058684.pkl' 

A filename for the saved estimator is generated from the estimator name and the current date and time.

We can also load an estimator from storage by specifying the filename of an estimator in the Storage directory.

>>> loaded_linear = linear.load_estimator(estimator_path.name, storage=storage)
>>> loaded_linear
<Model: LinearRegression>

Saving an estimator ready for production

Once you have a trained estimator, you can save it for use in production on your filesystem.
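
A hedged sketch of that step (the prod flag is an assumption about save_estimator()'s signature and is not confirmed here):

>>> # Assumed: prod=True saves the estimator for production use
>>> # linear.save_estimator(storage, prod=True)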

Now users of your model package can always find your estimator through load_production_estimator() using the module name.
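
For example (hedged; the module name below is hypothetical):

>>> # Assumed usage - "my_model_package" is a placeholder module name
>>> # Model.load_production_estimator("my_model_package")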

By default, if no storage is specified, ML Tooling will save estimators in a folder called estimators in your current working directory.

Configuration

To change the default configuration values, modify the config attributes directly:

>>> linear.config.RANDOM_STATE = 2

See also

Refer to Config for a list of available configuration options

Logging

We also have the ability to log our experiments using the Model.log() context manager. The results will be saved in the specified directory:

>>> with linear.log('test_dir'):
...     linear.score_estimator(bostondata)
<Result LinearRegression: {'r2': 0.68}>

This will write a yaml file specifying attributes of the model, the results, the git hash of the current commit, and other pertinent information.

See also

Check out Model.log() for more info on what is logged

Continue to Storage