API

Model

A wrapper for scikit-learn based estimators. Implements all the base functionality needed to create the wrapper.

class ml_tooling.baseclass.Model(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], feature_pipeline: sklearn.pipeline.Pipeline = None)

Wrapper class for Estimators

Parameters:
  • estimator (Estimator) – Any scikit-learn compatible estimator
  • feature_pipeline (Pipeline) – Optionally pass a feature preprocessing Pipeline. Model will automatically insert the estimator into a preprocessing pipeline
bayesiansearch(data: ml_tooling.data.base_data.Dataset, param_distributions: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, n_iter: int = 10, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]

Runs a cross-validated Bayesian Search on the estimator with a randomized sampling of the passed parameter distributions

Parameters:
  • data (Dataset) – An instance of a Dataset object
  • param_distributions (dict) – Parameter distributions to use for randomizing search. Should be a dictionary mapping param_names to one of:
    • ml_tooling.search.Integer
    • ml_tooling.search.Categorical
    • ml_tooling.search.Real
  • metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
  • cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
  • n_iter (int) – Number of parameter settings that are sampled.
  • refit (bool) – Whether or not to refit the best model
Returns:

  • best_estimator (Model) – Best estimator as found by the Bayesian Search
  • result_group (ResultGroup) – ResultGroup object containing each individual score

default_metric

Defines the default metric based on whether the estimator is a regressor or a classifier. Returns CLASSIFIER_METRIC for classifiers and REGRESSION_METRIC for regressors.

Returns:Name of the metric
Return type:str
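
The selection described above can be pictured with a small sketch. It is not the library's implementation: SketchConfig and pick_default_metric are hypothetical names, and only the two metric values are taken from the DefaultConfig listing in this reference.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Hypothetical stand-in holding the two defaults listed under DefaultConfig
    CLASSIFIER_METRIC: str = "accuracy"
    REGRESSION_METRIC: str = "r2"


def pick_default_metric(is_classifier: bool, config: SketchConfig) -> str:
    """Return the configured metric name for the kind of estimator (hypothetical helper)."""
    return config.CLASSIFIER_METRIC if is_classifier else config.REGRESSION_METRIC


config = SketchConfig()
classifier_metric = pick_default_metric(True, config)
regressor_metric = pick_default_metric(False, config)
```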
gridsearch(data: ml_tooling.data.base_data.Dataset, param_grid: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]

Runs a cross-validated gridsearch on the estimator with the passed in parameter grid.

Parameters:
  • data (Dataset) – An instance of a Dataset object
  • param_grid (dict) – Parameters to use for grid search
  • metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
  • cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
  • refit (bool) – Whether or not to refit the best model
Returns:

  • best_estimator (Model) – Best estimator as found by the gridsearch
  • result_group (ResultGroup) – ResultGroup object containing each individual score
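
To get a feel for how many candidates a grid produces, the sketch below expands a grid into every combination of values. It assumes param_grid maps parameter names to lists of candidate values, as in scikit-learn's grid search; expand_grid and the parameter names are illustrative only and are not part of ML Tooling.

```python
from itertools import product


def expand_grid(param_grid: dict) -> list:
    """List every combination of values in a parameter grid (illustrative helper)."""
    names = sorted(param_grid)  # fixed order so the output is deterministic
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical grid: 2 x 3 = 6 candidate settings
param_grid = {"max_depth": [2, 4], "n_estimators": [10, 50, 100]}
candidates = expand_grid(param_grid)
```

Each candidate is then cross-validated, so the number of fits grows with both the grid size and the number of folds.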

static list_estimators(storage: ml_tooling.storage.base.Storage) → List[pathlib.Path]

Gets a list of estimators from the given Storage

Parameters:storage (Storage) – Storage class to list the estimators with

Example

storage = FileStorage('path/to/estimators_dir')
estimator_list = Model.list_estimators(storage)

Returns:list of Paths
Return type:List[pathlib.Path]
classmethod load_estimator(path: Union[str, pathlib.Path], storage: ml_tooling.storage.base.Storage = None) → ml_tooling.baseclass.Model

Instantiates the class with a joblib pickled estimator.

Parameters:
  • storage (Storage) – Storage class to load the estimator with
  • path (str, pathlib.Path, optional) – Path to estimator pickle file

Example

We can load a trained estimator from disk:

storage = FileStorage('path/to/dir')
my_estimator = Model.load_estimator('my_model.pkl', storage=storage)

We now have a trained estimator loaded.

We can also use the default storage:

my_estimator = Model.load_estimator('my_model.pkl')

This will use the default FileStorage defined in Model.config.default_storage

Returns:Instance of Model with a saved estimator
Return type:Model
classmethod load_production_estimator(module_name: str)

Loads a model from a python package. Given that the package is an ML-Tooling package, this will load the production model from the package and create an instance of Model with that package

Parameters:module_name (str) – The name of the package to load a model from
log(run_directory: str)

log() is a context manager that lets you turn on logging for any scoring methods that follow. You can pass a run_directory to specify a subdirectory to store the output in. The output is a YAML file recording estimator parameters, package version numbers, metrics and other useful information.

Parameters:run_directory (str) – Name of the folder to save the details in

Example

If we want to log an estimator run in the score folder we can write:

with estimator.log('score'):
    estimator.score_estimator(data)

This will save the results of estimator.score_estimator() to runs/score/
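
The general pattern behind a logging context manager like this can be sketched with contextlib. This is an assumption-laden illustration rather than the library's code: SketchModel, log_enabled and log_target are hypothetical names, and the only detail borrowed from this reference is the default run folder ./runs.

```python
from contextlib import contextmanager
from pathlib import Path


class SketchModel:
    """Hypothetical stand-in that only tracks whether logging is switched on."""

    def __init__(self, run_dir="./runs"):
        self.run_dir = Path(run_dir)
        self.log_enabled = False
        self.log_target = None

    @contextmanager
    def log(self, run_directory):
        # Remember the previous state so it can be restored on exit
        previous = (self.log_enabled, self.log_target)
        self.log_enabled = True
        self.log_target = self.run_dir / run_directory
        try:
            yield self
        finally:
            # Restore the state even if the body raised an exception
            self.log_enabled, self.log_target = previous


model = SketchModel()
with model.log("score"):
    inside = (model.log_enabled, model.log_target)
after = (model.log_enabled, model.log_target)
```

Inside the block logging is on and points at runs/score; once the block exits, the previous state is back in place.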

make_prediction(data: ml_tooling.data.base_data.Dataset, *args, proba: bool = False, threshold: float = None, use_index: bool = False, use_cache: bool = False, **kwargs) → pandas.core.frame.DataFrame

Makes a prediction given an input, for example a customer number. Calls load_prediction_data(*args) and passes the resulting data to predict() on the estimator.

Parameters:
  • data (Dataset) – an instantiated Dataset object
  • proba (bool) – Whether prediction is returned as a probability or not. Note that the return value is an n-dimensional array where n = number of classes
  • threshold (float) – Threshold to use for predicting a binary class
  • use_index (bool) – Whether the index from the prediction data should be used for the result.
  • use_cache (bool) – Whether or not to use the cached data in dataset to make predictions. Useful for seeing probability distributions of the model
Returns:

A DataFrame with a prediction per row.

Return type:

pd.DataFrame
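
As an illustration of what a binary threshold does, the sketch below turns positive-class probabilities into labels. Whether the library compares with >= or > is not stated in this reference, so the comparison used here is an assumption, and apply_threshold is a hypothetical helper.

```python
def apply_threshold(positive_probabilities, threshold=0.5):
    """Label an observation 1 when its positive-class probability reaches the threshold.

    The >= comparison is an assumption made for this sketch.
    """
    return [int(p >= threshold) for p in positive_probabilities]


probabilities = [0.10, 0.45, 0.50, 0.80]
default_labels = apply_threshold(probabilities)
strict_labels = apply_threshold(probabilities, threshold=0.7)
```

Raising the threshold makes the positive class harder to predict, which typically trades recall for precision.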

randomsearch(data: ml_tooling.data.base_data.Dataset, param_distributions: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, n_iter: int = 10, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]

Runs a cross-validated randomsearch on the estimator with a randomized sampling of the passed parameter distributions

Parameters:
  • data (Dataset) – An instance of a Dataset object
  • param_distributions (dict) – Parameter distributions to use for randomizing search
  • metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
  • cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
  • n_iter (int) – Number of parameter settings that are sampled.
  • refit (bool) – Whether or not to refit the best model
Returns:

  • best_estimator (Model) – Best estimator as found by the randomsearch
  • result_group (ResultGroup) – ResultGroup object containing each individual score
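
The role of n_iter can be sketched as follows: instead of trying every combination, a fixed number of settings is drawn at random. The sketch assumes each distribution is given as a plain list of candidate values; sample_settings and the parameter names are hypothetical, and the real search may accept other distribution types.

```python
import random


def sample_settings(param_distributions, n_iter=10, seed=42):
    """Draw n_iter parameter settings, picking one random value per parameter each time."""
    rng = random.Random(seed)  # seeded so the draw is repeatable
    return [
        {name: rng.choice(values) for name, values in param_distributions.items()}
        for _ in range(n_iter)
    ]


# Hypothetical search space
param_distributions = {"max_depth": [2, 4, 8, 16], "criterion": ["gini", "entropy"]}
settings = sample_settings(param_distributions, n_iter=10)
```

The cost of the search is controlled by n_iter rather than by the size of the search space.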

save_estimator(storage: ml_tooling.storage.base.Storage = None, prod=False) → pathlib.Path

Saves the estimator as a binary file.

Parameters:
  • storage (Storage) – Storage class to save the estimator with
  • prod (bool) – Whether this is a production model to be saved

Example

If we have trained an estimator and we want to save it to disk we can write:

storage = FileStorage('/path/to/save/dir')
model = Model(LinearRegression())
saved_filename = model.save_estimator(storage)

to save in the given folder.

Returns:The path to where the estimator file was saved
Return type:pathlib.Path
score_estimator(data: ml_tooling.data.base_data.Dataset, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = False) → ml_tooling.result.result.Result

Scores the estimator based on training data from data and validates based on validation data from data.

Defaults to no cross-validation. If you want to cross-validate the results, pass number of folds to cv. If cross-validation is used, score_estimator only cross-validates on training data and doesn’t use the validation data.

If the dataset does not have a train set, it will create one using the default config.

Returns a Result object containing all result parameters

Parameters:
  • data (Dataset) – An instantiated Dataset object with create_train_test called
  • metrics (string, list of strings) – Metric or metrics to use for scoring the estimator. Any sklearn metric string
  • cv (int, optional) – Whether or not to use cross validation. Number of folds if an int is passed. If False, don’t use cross validation
Returns:

A Result object that contains the results of the scoring

Return type:

Result

classmethod test_estimators(data: ml_tooling.data.base_data.Dataset, estimators: Sequence[Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]], feature_pipeline: sklearn.pipeline.Pipeline = None, metrics: Union[str, List[str]] = 'default', cv: Union[int, bool] = False, log_dir: str = None, refit: bool = False) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]

Trains each estimator passed and returns a sorted list of results

Parameters:
  • data (Dataset) – An instantiated Dataset object with train_test data
  • estimators (Sequence[Estimator]) – List of estimators to train
  • feature_pipeline (Pipeline) – A pipeline for transforming features
  • metrics (str, list of str) – Metric or list of metrics to use in scoring of estimators
  • cv (int, bool) – Whether or not to use cross-validation. If an int is passed, use that many folds
  • log_dir (str, optional) – Where to store logged estimators. If None, don’t log
  • refit (bool) – Whether or not to refit the best model on all the training data
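Returns:

  • best_estimator (Model) – Best estimator among those tested
  • result_group (ResultGroup) – ResultGroup object containing each individual Result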

to_dict() → List[dict]

Serializes the estimator to a dictionary

Returns:
Return type:List of dicts
train_estimator(data: ml_tooling.data.base_data.Dataset) → ml_tooling.baseclass.Model

Loads all training data and trains the estimator on all data. Typically used as the last step when estimator tuning is complete.

Warning

This will set self.result attribute to None. This method trains the estimator using all the data, so there is no validation data to measure results against

Returns:Returns an estimator trained on all the data, with no train-test split
Return type:Model

Datasets

class ml_tooling.data.FileDataset(path: Union[str, pathlib.Path])

An Abstract Base Class for use in creating file-based Datasets. This class is intended to be subclassed and must provide a load_training_data() and load_prediction_data() method.

FileDataset takes a path as its initialization argument, pointing to a file which must be a filetype supported by Pandas, such as csv or parquet. The extension determines the pandas method used to read and write the data.

Instantiates a FileDataset pointing at a given path.

Parameters:path (Pathlike) – Path to location of file
load_prediction_data(*args, **kwargs) → pandas.core.frame.DataFrame

Used to load prediction data for a given idx - returns features

load_training_data(*args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]

Used to load the full training dataset - returns features and targets

read_file(**kwargs)

Read the data from the passed file path

Parameters:kwargs (dict) – Kwargs are passed to the relevant pd.read_*() method for the given extension
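
Choosing a reader from the file extension can be sketched as a simple lookup. The mapping below covers only the two file types named above and returns the name of the pandas function rather than calling it, so the sketch runs without pandas installed; READERS and reader_name are hypothetical and the library's actual dispatch may differ.

```python
from pathlib import Path

# Extension -> name of the pandas reader function (only the two types named above)
READERS = {".csv": "read_csv", ".parquet": "read_parquet"}


def reader_name(path):
    """Return the name of the pandas reader matching the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None


csv_reader = reader_name("data/train.csv")
parquet_reader = reader_name("data/train.PARQUET")
```

With pandas available, the looked-up name could be resolved with getattr(pd, name) and called with the path and any extra kwargs.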
class ml_tooling.data.SQLDataset(conn: Union[str, sqlalchemy.engine.interfaces.Connectable], schema: Optional[str], **kwargs)

An Abstract Base Class for use in creating SQL Datasets. This class is intended to be subclassed and must provide a load_training_data() and load_prediction_data() method.

These methods must accept a conn argument which is an instance of a SQLAlchemy connection. This connection will be passed to the method by the SQLDataset at runtime.

table

SQLAlchemy table definition to use when loading the dataset. Is the table that will be copied when using .copy_to and should be the canonical definition of the feature set. Do not define a schema - that is set at runtime

Type:sa.Table

Instantiates a dataset with the necessary arguments to connect to the database.

Parameters:
  • conn (Connectable) – Either a valid DB_URL string or an engine to connect to the database
  • schema (str) – A string naming the schema to use - allows for swapping schemas at runtime
  • kwargs (dict) – Kwargs are passed to create_engine if conn is a string
copy_to(target: ml_tooling.data.sql.SQLDataset) → ml_tooling.data.sql.SQLDataset

Copies data from one database table into another. This will truncate the target table and load new data in.

Parameters:target (SQLDataset) – A SQLDataset object representing the table you want to copy the data into
Returns:The target dataset to copy to
Return type:SQLDataset
create_connection() → sqlalchemy.engine.base.Connection

Instantiates a connection to be used in reading and writing to the database.

Ensures that connections are closed properly and dynamically inserts the schema into the database connection

Returns:An open connection to the database, with a dynamically defined schema
Return type:sa.engine.Connection
load_prediction_data(idx, conn, *args, **kwargs) → pandas.core.frame.DataFrame

Used to load prediction data for a given idx - returns features

load_training_data(conn, *args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]

Used to load the full training dataset - returns features and targets

class ml_tooling.data.Dataset

Baseclass for creating Datasets. Subclass Dataset and provide a load_training_data() and load_prediction_data() method

create_train_test(stratify: bool = False, shuffle: bool = True, test_size: float = 0.25, seed: int = 42) → ml_tooling.data.base_data.Dataset

Creates a training and testing dataset and stores it on the data object.

Parameters:
  • stratify (DataType, optional) – What to stratify the split on. Usually y if given a classification problem
  • shuffle – Whether or not to shuffle the data
  • test_size – What percentage of the data will be part of the test set
  • seed – Random seed for train_test_split
Returns:

Return type:

self
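
The effect of shuffle, test_size and seed can be sketched on row indices alone. This is a simplified illustration that ignores stratification; train_test_indices is a hypothetical helper, and rounding the test size up is an assumption of the sketch.

```python
import math
import random


def train_test_indices(n_rows, test_size=0.25, shuffle=True, seed=42):
    """Split row indices into train and test parts (stratification not covered)."""
    indices = list(range(n_rows))
    if shuffle:
        # A seeded generator makes the split repeatable between runs
        random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # rounding up is an assumption
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_indices(100)
```

With the defaults shown in the signature, 100 rows yield 75 training rows and 25 test rows, and the same seed always gives the same split.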

load_prediction_data(*args, **kwargs) → pandas.core.frame.DataFrame

Abstract method to be implemented by the user. Defines data to be used at prediction time, defined as a DataFrame

Returns:DataFrame of input features to get a prediction
Return type:pd.DataFrame
load_training_data(*args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]

Abstract method to be implemented by user. Defines data to be used at training time where X is a dataframe and y is a numpy array

Returns:x, y – Training data to be used by the models
Return type:Tuple of DataTypes
ml_tooling.data.load_demo_dataset(dataset_name: str, **kwargs) → ml_tooling.data.base_data.Dataset

Create a Dataset implementing the demo datasets from sklearn.datasets

Parameters:
  • dataset_name (str) –

    Name of the dataset to use. If ‘openml’ is passed either parameter name or data_id needs to be specified.

    One of:
    • iris
    • boston
    • diabetes
    • digits
    • linnerud
    • wine
    • breast_cancer
    • openml
  • **kwargs – Kwargs are passed on to the scikit-learn dataset function
Returns:

An instance of Dataset

Return type:

Dataset

Dataset Plotting Methods

class ml_tooling.plots.viz.data_viz.DataVisualize(data)
missing_data(ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, feature_pipeline: Optional[sklearn.pipeline.Pipeline] = None) → matplotlib.axes._axes.Axes

Plot number of missing data points per column. Sorted by number of missing values.

Also allows for selecting top_n/bottom_n number or percent of columns by passing an int or float

Parameters:
  • ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
  • top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
  • bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
  • feature_pipeline (Pipeline) – A feature transformation pipeline to be applied before graphing the final results
Returns:

Return type:

plt.Axes
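
The int-versus-float behaviour of top_n and bottom_n can be sketched as below. How the library rounds a fractional count is not stated here, so truncation with int() is an assumption; resolve_count and the column names are hypothetical.

```python
def resolve_count(n, total):
    """Turn an int count or a float fraction in (0, 1) into a number of items."""
    if isinstance(n, float):
        if not 0 < n < 1:
            raise ValueError("float values must lie strictly between 0 and 1")
        return int(total * n)  # truncation is an assumption of this sketch
    return n


# Hypothetical missing-value counts per column, ranked from most to least missing
missing = {"age": 12, "income": 40, "city": 0, "phone": 7, "email": 3}
ranked = sorted(missing, key=missing.get, reverse=True)

top_two = ranked[:resolve_count(2, len(ranked))]
top_40_percent = ranked[:resolve_count(0.4, len(ranked))]
bottom_one = ranked[len(ranked) - resolve_count(1, len(ranked)):]
```

Here an integer 2 and a fraction 0.4 select the same two columns, because 40 percent of five columns is two.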

target_correlation(method: str = 'spearman', ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, feature_pipeline: Optional[sklearn.pipeline.Pipeline] = None) → matplotlib.axes._axes.Axes

Plot the correlation between each feature and the target variable using the given method.

Also allows selecting how many features to show by setting the top_n and/or bottom_n parameters.

Parameters:
  • method (str) – Which method to use when calculating correlation. Supports one of ‘pearson’, ‘spearman’, ‘kendall’.
  • ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
  • top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
  • bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
  • feature_pipeline (Pipeline) – A feature transformation pipeline to be applied before graphing the data
Returns:

Return type:

plt.Axes

Storage

class ml_tooling.storage.FileStorage(dir_path: Union[str, pathlib.Path] = None)

File Storage class for handling storage of estimators to the file system

get_list() → List[pathlib.Path]

Finds a list of estimator file paths in the FileStorage directory.

Example

Find and return estimator paths in a given directory:
my_estimators = FileStorage('path/to/dir').get_list()
Returns:list of paths to files sorted by filename
Return type:List[Path]
load(file_path: Union[str, pathlib.Path]) → Any

Loads a joblib pickled estimator from given filepath and returns the unpickled object

Parameters:file_path (Pathlike) – Path where to load the estimator file relative to FileStorage

Example

We can load a saved pickled estimator from disk directly from FileStorage:

storage = FileStorage('path/to/dir')
my_estimator = storage.load('mymodel.pkl')

We now have a trained estimator loaded.

Returns:The object loaded from disk
Return type:Object
save(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], filename: str, prod: bool = False) → pathlib.Path

Save a joblib pickled estimator.

Parameters:
  • estimator (obj) – The estimator object
  • filename (str) – filename of estimator pickle file
  • prod (bool) – Whether or not to save in “production mode” - Production mode saves to /src/<projectname>/ regardless of what FileStorage was instantiated with

Example

To save your trained estimator, call save() on a FileStorage instance:

storage = FileStorage('/path/to/save/dir/')
file_path = storage.save(estimator, 'filename')

We now have saved an estimator to a pickle file.

Returns:Path to the saved object
Return type:Path
class ml_tooling.storage.Storage

Base class for Storage classes

get_list() → List[pathlib.Path]

Abstract method to be implemented by the user. Defines method used to show which objects have been saved

Returns:Paths to each of the estimators sorted lexically
Return type:List[Path]
load(file_path: Union[str, pathlib.Path]) → Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]

Abstract method to be implemented by the user. Defines method used to load data from the storage type

Returns:Returns the unpickled object
Return type:Estimator
save(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], file_path: Union[str, pathlib.Path], prod: bool = False) → Union[str, pathlib.Path]

Abstract method to be implemented by the user. Defines method used to save data to the storage type

Returns:Path to where the pickled object is saved
Return type:Pathlike
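
A custom storage backend has to supply the three methods listed above. The sketch below shows the shape of such a subclass without importing ML Tooling: SketchStorage is a hypothetical stand-in for the Storage base class, DirectoryStorage is an illustrative name, and the standard library's pickle is used where the library's own FileStorage is described as using joblib.

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class SketchStorage(ABC):
    """Hypothetical stand-in for the Storage base class: same three method names."""

    @abstractmethod
    def get_list(self): ...

    @abstractmethod
    def load(self, file_path): ...

    @abstractmethod
    def save(self, estimator, file_path, prod=False): ...


class DirectoryStorage(SketchStorage):
    """Stores pickled objects in a single directory."""

    def __init__(self, dir_path):
        self.dir_path = Path(dir_path)

    def get_list(self):
        # Sorted so the listing is in lexical order
        return sorted(self.dir_path.glob("*.pkl"))

    def load(self, file_path):
        with open(self.dir_path / file_path, "rb") as f:
            return pickle.load(f)

    def save(self, estimator, file_path, prod=False):
        # The prod flag is accepted but ignored in this sketch
        self.dir_path.mkdir(parents=True, exist_ok=True)
        target = self.dir_path / file_path
        with open(target, "wb") as f:
            pickle.dump(estimator, f)
        return target


with tempfile.TemporaryDirectory() as tmp:
    storage = DirectoryStorage(tmp)
    storage.save({"coef": [1.0, 2.0]}, "b_model.pkl")
    storage.save({"coef": [3.0]}, "a_model.pkl")
    saved_names = [p.name for p in storage.get_list()]
    restored = storage.load("a_model.pkl")
```

A plain dictionary stands in for an estimator here; any picklable object would work the same way.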
class ml_tooling.storage.ArtifactoryStorage(artifactory_url: str, repo: str, apikey: Optional[str] = None, auth: Optional[Tuple[str, str]] = None)

Artifactory Storage class for handling storage of estimators to JFrog artifactory

Example

Instantiate this class with a url and path to the repo like so:

storage = ArtifactoryStorage('http://artifactory.com', 'path/to/artifact')
get_list() → List[ArtifactoryPath]

Finds a list of estimator artifact paths in the ArtifactoryStorage repo.

Example

Find and return estimator paths in a given directory:
my_estimators = ArtifactoryStorage('http://artifactory.com', 'path/to/repo').get_list()
Returns:list of paths to files sorted by filename
Return type:List[ArtifactoryPath]
load(file_path: Union[str, pathlib.Path]) → Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]

Loads a pickled estimator from given filepath and returns the estimator

Parameters:file_path (Pathlike) – Path to load the estimator relative to ArtifactoryStorage

Example

We can load a saved pickled estimator directly from Artifactory:

storage = ArtifactoryStorage('http://artifactory.com', 'path/to/repo')
my_estimator = storage.load('estimatorfile')

We now have a trained estimator loaded.

Returns:estimator unpickled object
Return type:Object
save(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], filename: str, prod: bool = False) → ArtifactoryPath

Save a pickled estimator to artifactory.

Parameters:
  • estimator (Estimator) – The estimator object
  • filename (str) – filename of estimator pickle file
  • prod (bool) – Production variable, set to True if saving a production-ready estimator

Example

To save your trained estimator:

storage = ArtifactoryStorage('http://artifactory.com', 'path/to/repo')
artifactory_path = storage.save(estimator, 'estimator.pkl')

We now have saved an estimator to a pickle file.

Returns:File path to stored estimator
Return type:ArtifactoryPath

Config

All configuration options available

class ml_tooling.config.DefaultConfig

Configuration for Models

VERBOSITY = 0
The level of verbosity from output
CLASSIFIER_METRIC = 'accuracy'
Default metric for classifiers
REGRESSION_METRIC = 'r2'
Default metric for regressions
CROSS_VALIDATION = 10
Default Number of cross validation folds to use
N_JOBS = -1
Default number of cores to use when doing multiprocessing. -1 means use all available
RANDOM_STATE = 42
Default random state seed for all functions involving randomness
RUN_DIR = './runs'
Default folder to store run logging files
ESTIMATOR_DIR = './models'
Default folder to store pickled models in
LOG = False
Toggles whether or not to log runs to a file. Set to True if you want every run to be logged, else use the log() context manager
TRAIN_TEST_SHUFFLE = True
Default whether or not to shuffle data for test set
TEST_SIZE = 0.25
Default percentage of data that will be part of the test set

Result

Result class to work with results from scoring a model

class ml_tooling.result.Result(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], metrics: ml_tooling.metrics.metric.Metrics, data: ml_tooling.data.base_data.Dataset)

Contains the result of a given training run. Contains plotting methods, as well as being comparable with other results

Parameters:
  • estimator (Estimator) – Estimator used to generate the result
  • metrics (Metrics) – Metrics used to score the model
  • data (Dataset) – Dataset used to generate the result

Method generated by attrs for class Result.

ResultGroup

A container of Results - some methods in ML Tooling return multiple results, which will be grouped into a ResultGroup. A ResultGroup is sorted by the Result metric and proxies attributes to the best result

class ml_tooling.result.ResultGroup(results: List[ml_tooling.result.result.Result])

A container for results. Proxies attributes to the best result. Supports indexing like a list.

Method generated by attrs for class ResultGroup.
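
The behaviour described for ResultGroup (sorted, list-like, proxying attributes to the best result) can be sketched with a small container. SketchResult and SketchResultGroup are hypothetical names, and treating a higher score as better is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class SketchResult:
    name: str
    score: float


class SketchResultGroup:
    """Hypothetical container: sorted best-first, indexable, proxies to the best result."""

    def __init__(self, results):
        # Assumes a higher score is better
        self.results = sorted(results, key=lambda r: r.score, reverse=True)

    def __getitem__(self, index):
        return self.results[index]

    def __len__(self):
        return len(self.results)

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails
        if attr == "results":
            raise AttributeError(attr)  # avoid recursion before __init__ has run
        return getattr(self.results[0], attr)


group = SketchResultGroup(
    [SketchResult("tree", 0.81), SketchResult("forest", 0.90), SketchResult("linear", 0.75)]
)
best_name = group.name        # proxied to the best result
worst_name = group[-1].name   # list-style indexing
```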

Classification Result Visualizations

class ml_tooling.plots.viz.ClassificationVisualize(estimator, data)

Visualization class for Classification models

confusion_matrix(normalized: bool = True, threshold: Optional[float] = None, **kwargs) → matplotlib.axes._axes.Axes

Visualize a confusion matrix for a classification estimator. Any kwargs are passed onto matplotlib.

Parameters:
  • normalized (bool) – Whether or not to normalize annotated class counts
  • threshold (float) – Threshold to use for classification - defaults to 0.5
Returns:

Returns a Confusion Matrix plot

Return type:

plt.Axes

default_metric

Finds estimator_type for estimator in a BaseVisualize and returns default metric for this class stated in .config. If passed estimator is a Pipeline, assume last step is the estimator.

Returns:Name of the metric
Return type:str
feature_importance(top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, class_index: int = None, add_label: bool = True, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes

Visualizes feature importance of the estimator through permutation.

Parameters:
  • top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
  • bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
  • class_index (int, optional) – In a multi-class setting, plot the feature importances for the given label. If None, assume a binary classification
  • add_label (bool) – Toggles value labels on end of each bar
  • ax (Axes) – Draws graph on passed ax - otherwise creates new ax
  • kwargs (dict) – Passed to plt.barh
Returns:

Return type:

matplotlib.Axes

learning_curve(cv: int = None, scoring: str = 'default', n_jobs: int = None, train_sizes: Sequence[float] = array([0.1, 0.325, 0.55, 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes

Generates a learning_curve() plot, used to determine model performance as a function of number of training examples.

Illustrates whether or not number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.

Parameters:
  • cv (int) – Number of CV iterations to run
  • scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
  • n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
  • train_sizes (Sequence of floats) – Percentage intervals of data to use when training
  • ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
  • kwargs (dict) – Passed along to matplotlib line plots
Returns:

Return type:

plt.Axes

lift_curve(**kwargs) → matplotlib.axes._axes.Axes

Visualize a Lift Curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.

Parameters:kwargs (optional) – Keyword arguments to pass on to matplotlib
Returns:
Return type:plt.Axes
permutation_importance(n_repeats: int = 5, scoring: str = 'default', top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, add_label: bool = True, n_jobs: int = None, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes

Visualizes feature importance of the estimator through permutation.

Parameters:
  • n_repeats (int) – Number of times to permute a feature
  • scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
  • top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
  • bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
  • add_label (bool) – Toggles value labels on end of each bar
  • ax (Axes) – Draws graph on passed ax - otherwise creates new ax
  • n_jobs (int, optional) – Number of parallel jobs to run. Defaults to N_JOBS setting in config.
  • kwargs (dict) – Passed to plt.barh
Returns:

Return type:

matplotlib.Axes
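
The idea behind permutation importance can be sketched without any plotting: shuffle one feature at a time, re-score, and record how much the score drops. This is a generic illustration of the technique and not the library's implementation; the toy model, the data and both helper functions are hypothetical, and accuracy is used as the scoring metric.

```python
import random


def accuracy(predict, rows, targets):
    """Fraction of rows where the prediction matches the target."""
    return sum(predict(row) == target for row, target in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_repeats=5, seed=42):
    """Mean drop in accuracy per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced
            permuted = [row[:col] + (value,) + row[col + 1:] for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, targets))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(row):
    # Toy model that only looks at the first feature
    return int(row[0] > 0.5)


rows = [(i / 10, (i * 7) % 10 / 10) for i in range(10)]
targets = [predict(row) for row in rows]
importances = permutation_importance(predict, rows, targets)
```

Because the toy model ignores the second feature, shuffling it never changes a prediction and its importance is zero.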

precision_recall_curve(labels: List[str] = None, **kwargs) → matplotlib.axes._axes.Axes

Visualize a Precision-Recall curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.

Parameters:
  • labels (List of str) – Labels to use for the class names if multi-class
  • kwargs (optional) – Keyword arguments to pass on to matplotlib
Returns:

Plot of precision-recall curve

Return type:

plt.Axes

roc_curve(labels: List[str] = None, **kwargs) → matplotlib.axes._axes.Axes

Visualize a ROC curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.

Parameters:
  • labels (List of str) – Labels to use for the class names if multi-class
  • kwargs (optional) – Keyword arguments to pass on to matplotlib
Returns:

Returns a ROC AUC plot

Return type:

plt.Axes

validation_curve(param_name: str, param_range: Sequence[T_co], n_jobs: int = None, cv: int = None, scoring: str = 'default', ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes

Generates a validation_curve() plot, graphing the impact of changing a hyperparameter on the scoring metric.

This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.

Parameters:
  • param_name (str) – Name of hyperparameter to plot
  • param_range (Sequence) – The individual values to plot for param_name
  • n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
  • cv (int) – Number of CV iterations to run. Defaults to value in Model.config. Uses a StratifiedKFold if estimator is a classifier - otherwise a KFold is used.
  • scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
  • ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
  • kwargs (dict) – Passed along to matplotlib line plots
Returns:

Return type:

plt.Axes

Regression Result Visualizations

class ml_tooling.plots.viz.RegressionVisualize(estimator, data)

Visualization class for Regression models

default_metric

Finds estimator_type for estimator in a BaseVisualize and returns default metric for this class stated in .config. If passed estimator is a Pipeline, assume last step is the estimator.

Returns:Name of the metric
Return type:str
feature_importance(top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, class_index: int = None, add_label: bool = True, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes

Visualizes feature importance of the estimator through permutation.

Parameters:
  • top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
  • bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
  • class_index (int, optional) – In a multi-class setting, plot the feature importances for the given label. If None, assume a binary classification
  • add_label (bool) – Toggles value labels on end of each bar
  • ax (Axes) – Draws graph on passed ax - otherwise creates new ax
  • kwargs (dict) – Passed to plt.barh
Returns:

Return type:

matplotlib.Axes

learning_curve(cv: int = None, scoring: str = 'default', n_jobs: int = None, train_sizes: Sequence[float] = array([0.1, 0.325, 0.55, 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes

Generates a learning_curve() plot, used to determine model performance as a function of number of training examples.

Illustrates whether or not number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.

Parameters:
  • cv (int) – Number of CV iterations to run
  • scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
  • n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
  • train_sizes (Sequence of floats) – Percentage intervals of data to use when training
  • ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
  • kwargs (dict) – Passed along to matplotlib line plots
Returns:

Return type:

plt.Axes
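
The default train_sizes are five evenly spaced fractions from 0.1 to 1.0. A minimal sketch of how such fractions might translate into absolute training-set sizes is shown below. The helper names are hypothetical, and the assumptions that each k-fold training split holds (k - 1) / k of the data and that sizes are truncated to integers are mine, not taken from the text above.

```python
def evenly_spaced(start, stop, num):
    """Return `num` evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [round(start + i * step, 3) for i in range(num)]


def absolute_train_sizes(fractions, n_samples, cv):
    """Translate fractions into sample counts for one k-fold training split."""
    # Assumption: one fold is held out, the rest is available for training
    n_train = n_samples - n_samples // cv
    # Assumption: truncate to whole samples, never fewer than one
    return [max(1, int(fraction * n_train)) for fraction in fractions]


train_sizes = evenly_spaced(0.1, 1.0, 5)
sizes = absolute_train_sizes(train_sizes, n_samples=1000, cv=5)
```

Each entry in sizes is one point on the x-axis of the learning curve; the estimator is refitted once per size and per fold.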

permutation_importance(n_repeats: int = 5, scoring: str = 'default', top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, add_label: bool = True, n_jobs: int = None, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes

Visualizes feature importance of the estimator through permutation.

Parameters:
  • n_repeats (int) – Number of times to permute a feature
  • scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
  • top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
  • bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
  • add_label (bool) – Toggles value labels on end of each bar
  • ax (Axes) – Draws graph on passed ax - otherwise creates new ax
  • n_jobs (int, optional) – Number of parallel jobs to run. Defaults to N_JOBS setting in config.
  • kwargs (dict) – Passed to plt.barh
Returns:

Return type:

matplotlib.Axes
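
The idea behind permutation importance can be sketched without any library: shuffle one feature column at a time, re-score the fitted model, and record the average drop in score over n_repeats shuffles. The code below is a simplified illustration under stated assumptions, not the code ml_tooling runs; predict stands in for a fitted estimator and accuracy for the scoring metric.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(y, [predict(row) for row in rows])
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle a copy of one column and leave the others untouched
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [
                row[:col] + [value] + row[col + 1:]
                for row, value in zip(rows, shuffled)
            ]
            score = accuracy(y, [predict(row) for row in permuted])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical "fitted model": predicts 1 when the first feature is positive
# and ignores the second feature entirely.
def predict(row):
    return int(row[0] > 0)


rows = [[-2.0, 7.0], [-1.0, 3.0], [1.0, 5.0], [2.0, 1.0], [3.0, 9.0], [-3.0, 4.0]]
y = [0, 0, 1, 1, 1, 0]
importances = permutation_importance(predict, rows, y)
```

Because the stand-in model never reads the second feature, shuffling it cannot change the score, so its importance is exactly zero.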

prediction_error(**kwargs) → matplotlib.axes._axes.Axes

Visualizes prediction error of a regression estimator. Any kwargs are passed on to matplotlib.

Returns:Plot of the estimator’s prediction error
Return type:matplotlib.Axes
residuals(**kwargs) → matplotlib.axes._axes.Axes

Visualizes residuals of a regression estimator. Any kwargs are passed on to matplotlib.

Returns:Plot of the estimator’s residuals
Return type:matplotlib.Axes
validation_curve(param_name: str, param_range: Sequence[T_co], n_jobs: int = None, cv: int = None, scoring: str = 'default', ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes

Generates a validation_curve() plot, graphing the impact of changing a hyperparameter on the scoring metric.

This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.

Parameters:
  • param_name (str) – Name of hyperparameter to plot
  • param_range (Sequence) – The individual values to plot for param_name
  • n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
  • cv (int) – Number of CV iterations to run. Defaults to value in Model.config. Uses a StratifiedKFold if estimator is a classifier - otherwise a KFold is used.
  • scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
  • ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
  • kwargs (dict) – Passed along to matplotlib line plots
Returns:

Return type:

plt.Axes

Plots

ml_tooling.plots.plot_confusion_matrix(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], normalized: bool = True, title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: Sequence[str] = None) → matplotlib.axes._axes.Axes

Plots a confusion matrix of predicted labels vs actual labels

Parameters:
  • y_true – True labels
  • y_pred – Predicted labels from estimator
  • normalized – Whether to normalize counts in matrix
  • title – Title for plot
  • ax – Pass your own ax
  • labels – Pass custom list of labels
Returns:

matplotlib.Axes
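
What the normalized flag changes can be illustrated with a small pure-Python sketch. It assumes that normalization divides each row (one row per true label) by its total, which the text above does not state; confusion_matrix here is a hypothetical helper, not the library function.

```python
def confusion_matrix(y_true, y_pred, labels, normalized=True):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    if not normalized:
        return matrix
    # Assumption: normalize each row so it sums to 1 (share of each true class)
    return [
        [count / sum(row) if sum(row) else 0.0 for count in row]
        for row in matrix
    ]


y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "dog", "dog"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "dog", "dog", "cat"]
counts = confusion_matrix(y_true, y_pred, labels=["cat", "dog"], normalized=False)
shares = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
```

Raw counts make large classes dominate the plot, while row-normalized shares make per-class error rates comparable.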

ml_tooling.plots.plot_target_correlation(features: pandas.core.frame.DataFrame, target: Union[pandas.core.series.Series, numpy.ndarray], method: str = 'spearman', ax: matplotlib.axes._axes.Axes = None, top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, title: str = 'Feature-Target Correlation') → matplotlib.axes._axes.Axes

Plot the correlation between each feature and the target variable using the given method.

Also allows selecting how many features to show by setting the top_n and/or bottom_n parameters.

Parameters:
  • features (pd.DataFrame) – Features to plot
  • target (np.Array or pd.Series) – Target to calculate correlation with
  • method (str) – Which method to use when calculating correlation. Supports one of ‘pearson’, ‘spearman’, ‘kendall’.
  • ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
  • top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
  • bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
  • title (str) – Title of graph
Returns:

Return type:

plt.Axes
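
As a rough illustration of the default 'spearman' method: Spearman correlation is the Pearson correlation computed on ranks, so it captures monotonic rather than strictly linear relationships. The sketch below is plain Python with hypothetical helper names and made-up data; it does not reproduce the library's implementation.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


features = {
    "squared": [1, 4, 9, 16, 25],  # monotonic in the target, not linear
    "reversed": [5, 4, 3, 2, 1],
}
target = [1, 2, 3, 4, 5]
correlations = {name: spearman(col, target) for name, col in features.items()}
```

The squared feature gets a perfect rank correlation even though its relationship to the target is not linear, which is one reason to prefer a rank-based method for a first look at features.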

ml_tooling.plots.plot_feature_importance(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, ax: matplotlib.axes._axes.Axes = None, class_index: int = None, bottom_n: Union[int, float] = None, top_n: Union[int, float] = None, add_label: bool = True, title: str = '', **kwargs) → matplotlib.axes._axes.Axes

Plot either the estimator coefficients or the estimator feature importances depending on what is provided by the estimator.

See also ml_tooling.plot.plot_permutation_importance for an unbiased version of feature importance using permutation importance.

Parameters:
  • estimator (Estimator) – Estimator to use to calculate feature importance
  • x (DataType) – Features to calculate feature importance for
  • ax (Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
  • class_index (int, optional) – In a multi-class setting, choose which class to get feature importances for. If None, will assume a binary classifier
  • bottom_n (int) – Plot only bottom n features
  • top_n (int) – Plot only top n features
  • add_label (bool) – Whether or not to plot text labels for the bars
  • title (str) – Title to add to the plot
  • kwargs (dict) – Any kwargs are passed to matplotlib
Returns:

Return type:

plt.Axes

ml_tooling.plots.plot_lift_curve(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None, threshold: float = 0.5) → matplotlib.axes._axes.Axes

Plot a lift chart from results. Also calculates the lift score at the given threshold (0.5 by default).

Parameters:
  • y_true (DataType) – True labels
  • y_proba (DataType) – Model’s predicted probability
  • title (str) – Plot title
  • ax (Axes) – Pass your own ax
  • labels (List of str) – Labels to use per class
  • threshold (float) – Threshold to use when determining lift score
Returns:

Return type:

matplotlib.Axes
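
One common definition of a lift score is sketched below: the share of true positives among the rows predicted positive, divided by the overall share of positives. Whether ml_tooling uses exactly this formula, and whether the threshold comparison is inclusive, are assumptions; lift_score here is a hypothetical stand-alone helper.

```python
def lift_score(y_true, y_proba, threshold=0.5):
    """Positive rate among predicted positives divided by the base rate."""
    # Assumption: a probability equal to the threshold counts as positive
    predicted_positive = [t for t, p in zip(y_true, y_proba) if p >= threshold]
    if not predicted_positive:
        return 0.0
    base_rate = sum(y_true) / len(y_true)
    hit_rate = sum(predicted_positive) / len(predicted_positive)
    return hit_rate / base_rate


y_true = [1, 0, 1, 0, 0, 0, 1, 0]
y_proba = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]
lift = lift_score(y_true, y_proba)
strict_lift = lift_score(y_true, y_proba, threshold=0.85)
```

A lift of 2 means the rows flagged by the model contain positives at twice the rate of the full dataset. Raising the threshold typically flags fewer rows with a higher hit rate.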

ml_tooling.plots.plot_prediction_error(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None) → matplotlib.axes._axes.Axes

Plots prediction error of regression estimator

Parameters:
  • y_true – True values
  • y_pred – Model’s predicted values
  • title – Plot title
  • ax – Pass your own ax
Returns:

matplotlib.Axes

ml_tooling.plots.plot_residuals(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None) → matplotlib.axes._axes.Axes

Plots residuals from a regression.

Parameters:
  • y_true – True values
  • y_pred – Model’s predicted values
  • title – Plot title
  • ax – Pass your own ax
Returns:

matplotlib.Axes

ml_tooling.plots.plot_roc_auc(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None) → matplotlib.axes._axes.Axes

Plot ROC AUC curve. Works only with probabilities

Parameters:
  • y_true (DataType) – True labels
  • y_proba (DataType) – Probability estimate from estimator
  • title (str) – Plot title
  • ax (Axes) – Pass in your own ax
  • labels (List of str) – Optionally specify label names
Returns:

Plot of ROC AUC curve

Return type:

plt.Axes

ml_tooling.plots.plot_pr_curve(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None) → matplotlib.axes._axes.Axes

Plot precision-recall curve. Works only with probabilities.

Parameters:
  • y_true (DataType) – True labels
  • y_proba (DataType) – Probability estimate from estimator
  • title (str) – Plot title
  • ax (plt.Axes) – Pass in your own ax
  • labels (List of str, optional) – Labels for each class
Returns:

Plot of precision-recall curve

Return type:

plt.Axes

ml_tooling.plots.plot_learning_curve(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, y: Union[pandas.core.series.Series, numpy.ndarray], cv: int = 5, scoring: str = 'default', n_jobs: int = -1, train_sizes: Sequence[T_co] = array([0.1 , 0.325, 0.55 , 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, random_state: int = None, title: str = 'Learning Curve', **kwargs) → matplotlib.axes._axes.Axes

Generates a learning_curve() plot, used to determine model performance as a function of number of training examples.

Illustrates whether or not number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.

Parameters:
  • estimator (sklearn-compatible estimator) – An instance of a sklearn estimator
  • x (pd.DataFrame) – DataFrame of features
  • y (pd.Series or np.Array) – Target values to predict
  • cv (int) – Number of CV iterations to run. Uses a StratifiedKFold if estimator is a classifier - otherwise a KFold is used.
  • scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
  • n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
  • train_sizes (Sequence of floats) – Percentage intervals of data to use when training
  • ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
  • random_state (int) – Random state to use in CV splitting
  • title (str) – Title to be used on the plot
  • kwargs (dict) – Passed along to matplotlib line plots
Returns:

Return type:

plt.Axes

ml_tooling.plots.plot_validation_curve(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, y: Union[pandas.core.series.Series, numpy.ndarray], param_name: str, param_range: Sequence[T_co], cv: int = 5, scoring: str = 'default', n_jobs: int = -1, ax: matplotlib.axes._axes.Axes = None, title: str = '', **kwargs) → matplotlib.axes._axes.Axes

Plots a validation_curve(), graphing the impact of changing a hyperparameter on the scoring metric.

This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.

Parameters:
  • estimator (sklearn-compatible estimator) – An instance of a sklearn estimator
  • x (pd.DataFrame) – DataFrame of features
  • y (pd.Series or np.Array) – Target values to predict
  • param_name (str) – Name of hyperparameter to plot
  • param_range (Sequence) – The individual values to plot for param_name
  • cv (int) – Number of CV iterations to run. Uses a StratifiedKFold if estimator is a classifier - otherwise a KFold is used.
  • scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
  • n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
  • ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
  • title (str) – Title to be used on the plot
  • kwargs (dict) – Passed along to matplotlib line plots
Returns:

Return type:

plt.Axes

ml_tooling.plots.plot_missing_data(df: pandas.core.frame.DataFrame, ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, **kwargs) → matplotlib.axes._axes.Axes

Plot number of missing data points per column. Sorted by number of missing values.

Also allows for selecting top_n/bottom_n number or percent of columns by passing an int or float

Parameters:
  • df (pd.DataFrame) – Feature DataFrame to calculate missing values from
  • ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
  • top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
  • bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
Returns:

Return type:

plt.Axes

Transformers

class ml_tooling.transformers.Binarize(value: Any = None)

Sets all instances of value to 1 and all others to 0. Returns a pandas DataFrame.

Parameters:value (Any) – The value to be set to 1
class ml_tooling.transformers.Binner(bins: Union[int, list] = 5, labels: list = None)

Bins data according to passed bins and labels. Uses pandas.cut() under the hood; see the pandas.cut() documentation for further details.

Parameters:
  • bins (int, list) – The criteria to bin by. An int value defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x. If a list is passed, defines the bin edges allowing for non-uniform width and no extension of the range of x is done.
  • labels (list) – Specifies the labels for the returned bins. Must be the same length as the resulting bins.
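
The list-of-edges case can be sketched in plain Python as follows. The sketch assumes the pandas.cut() defaults (right-closed intervals, values outside the edges left unbinned); bin_values is a hypothetical helper and not part of ml_tooling.

```python
import bisect


def bin_values(values, edges, labels):
    """Assign each value to a right-closed interval (edges[i], edges[i + 1]]."""
    if len(labels) != len(edges) - 1:
        raise ValueError("labels must have one entry per bin")
    binned = []
    for value in values:
        # Index of the interval whose right edge is the first edge >= value
        position = bisect.bisect_left(edges, value) - 1
        inside = edges[0] < value <= edges[-1]
        # Assumption: values outside all bins are left missing (None here)
        binned.append(labels[position] if inside else None)
    return binned


ages = [5, 18, 19, 64, 65, 120]
groups = bin_values(ages, edges=[0, 18, 65, 100], labels=["child", "adult", "senior"])
```

Note that four edges define three bins, which is why labels must be one shorter than the list of edges.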
class ml_tooling.transformers.ToCategorical

Converts a column into a one-hot encoded column through pd.Categorical

class ml_tooling.transformers.DateEncoder(day: bool = True, month: bool = True, week: bool = True, year: bool = True)

Converts a date column into multiple columns, one each for day, month, week and year.

Parameters:
  • day (bool) – If True, a new day column will be added.
  • month (bool) – If True, a new month column will be added.
  • week (bool) – If True, a new week column will be added.
  • year (bool) – If True, a new year column will be added.
class ml_tooling.transformers.DFFeatureUnion(transformer_list: list)

Merges together two pipelines based on index.

Parameters:transformer_list (list) – A list of (name, transformer) tuples, where each transformer implements fit/transform.
class ml_tooling.transformers.FillNA(value: Union[str, int, None] = None, strategy: Optional[str] = None, indicate_nan: bool = False)

Fills NA values with given value or strategy. Either a value or a strategy must be passed.

Parameters:
  • value (str, int) – A specific value to replace NaNs with.
  • strategy (str) – A named strategy to replace NaNs with. One of ‘mean’, ‘median’, ‘most_freq’, ‘max’, ‘min’
  • indicate_nan (bool) – If True, a new column is added which indicates if a value in a column was missing.
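
The named strategies can be illustrated on a single column with the standard library. This is a hedged sketch only: fill_na and is_missing are hypothetical helpers, and raising an error when both value and strategy are passed is an assumption that goes beyond the text above.

```python
import math
import statistics

STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "most_freq": statistics.mode,
    "max": max,
    "min": min,
}


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_na(column, value=None, strategy=None):
    """Replace missing entries with a fixed value or a computed statistic."""
    # Assumption: exactly one of value or strategy must be given
    if (value is None) == (strategy is None):
        raise ValueError("Pass exactly one of value or strategy")
    present = [v for v in column if not is_missing(v)]
    fill = value if value is not None else STRATEGIES[strategy](present)
    return [fill if is_missing(v) else v for v in column]


filled_mean = fill_na([1.0, float("nan"), 3.0], strategy="mean")
filled_mode = fill_na(["a", None, "a", "b"], strategy="most_freq")
filled_value = fill_na([1, None], value=0)
```

The statistic is computed from the non-missing entries only, so the missing values themselves never influence the fill value.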
class ml_tooling.transformers.FreqFeature

Converts a column into its normalized value count

class ml_tooling.transformers.FuncTransformer(func: Callable[[...], pandas.core.frame.DataFrame] = None, **kwargs)

Applies a given function to each column

Parameters:
  • func (Callable[..., pd.DataFrame]) – Define the function which should be applied on each column.
  • kwargs – Specific for the selected func.
class ml_tooling.transformers.DFRowFunc(strategy: Union[Callable[[...], pandas.core.frame.DataFrame], str] = None)

Row-wise operation on Pandas DataFrame.

Parameters:strategy (Callable[.., pd.DataFrame], str) –

Strategy can either be one of the predefined or a callable. If some elements in the row are NaN these elements are ignored for the built-in strategies. Valid strategies are:

  • sum
  • min
  • max
  • mean

If a callable is used, it must return a pd.Series

class ml_tooling.transformers.RareFeatureEncoder(threshold: Union[int, float] = 0.2, fill_rare: Any = 'Rare')

Replaces categories with a specified value, if they occur less often than the provided threshold.

Parameters:
  • threshold (int, float) – Sets the threshold for when a value is considered rare. Any value which occurs less than the threshold will be replaced with fill_rare. If threshold is a float, it will be considered a percentage and if it is an int, threshold will be considered the minimum number of observations.
  • fill_rare (Any) – Fill value to use when replacing rare categories.
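
The difference between a float and an int threshold can be sketched with a Counter. encode_rare is a hypothetical helper written for illustration and the city data is made up; it mirrors the rule described above but is not the library's code.

```python
from collections import Counter


def encode_rare(values, threshold=0.2, fill_rare="Rare"):
    """Replace categories that occur less often than the threshold."""
    counts = Counter(values)
    # A float threshold is a share of all rows, an int is an absolute count
    if isinstance(threshold, float):
        minimum = threshold * len(values)
    else:
        minimum = threshold
    return [v if counts[v] >= minimum else fill_rare for v in values]


cities = ["Oslo"] * 6 + ["Bergen"] * 3 + ["Hamar"]
by_share = encode_rare(cities, threshold=0.4)
by_count = encode_rare(cities, threshold=2)
```

With ten rows, a threshold of 0.4 requires at least four occurrences, so both Bergen and Hamar are replaced; a threshold of 2 only replaces Hamar.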
class ml_tooling.transformers.Renamer(column_names: Union[list, str] = None)

Renames columns to passed names.

Parameters:column_names (list, str) – The column names which should replace the original column names.
class ml_tooling.transformers.DFStandardScaler(copy: bool = True, with_mean: bool = True, with_std: bool = True)

Wrapping of the StandardScaler from scikit-learn for Pandas DataFrames. See: StandardScaler

Parameters:
  • copy (bool) – If True, a copy of the dataframe is made.
  • with_mean (bool) – If True, center the data before scaling.
  • with_std (bool) – If True, scale the data to unit standard deviation.
class ml_tooling.transformers.Select(columns: Union[List[str], str] = None)

Selects columns from DataFrame

Parameters:columns (List[str], str, None) – Specify which columns are selected.
class ml_tooling.transformers.Pipeline(steps, *, memory=None, verbose=False)

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using memory argument.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to ‘passthrough’ or None.

Read more in the User Guide.

New in version 0.5.

Parameters:
  • steps (list) – List of (name, transform) tuples (implementing fit/transform) that are chained, in the order in which they are chained, with the last object an estimator.
  • memory (str or object with the joblib.Memory interface, default=None) – Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
  • verbose (bool, default=False) – If True, the time elapsed while fitting each step will be printed as it is completed.
named_steps

Dictionary-like object, with the following attributes. Read-only attribute to access any step parameter by user given name. Keys are step names and values are steps parameters.

Type:Bunch

See also

sklearn.pipeline.make_pipeline
Convenience function for simplified pipeline construction.

Examples

>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=0)
>>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
>>> # The pipeline can be used as any other estimator
>>> # and avoids leaking the test set into the train set
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
>>> pipe.score(X_test, y_test)
0.88
decision_function(X)

Apply transforms, and decision_function of the final estimator

Parameters:X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
Returns:y_score
Return type:array-like of shape (n_samples, n_classes)
fit(X, y=None, **fit_params)

Fit the model

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters:
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Returns:

self – This estimator

Return type:

Pipeline
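
The s__p naming convention for fit_params can be illustrated with a small routing function. This is a simplified sketch of the convention only, not scikit-learn's implementation; route_fit_params is a hypothetical helper.

```python
def route_fit_params(step_names, **fit_params):
    """Group `step__param=value` keyword arguments by pipeline step."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        if "__" not in key:
            raise ValueError(f"Expected 'step__parameter', got {key!r}")
        # Split on the first double underscore only, so nested names survive
        step, param = key.split("__", 1)
        routed[step][param] = value
    return routed


routed = route_fit_params(
    ["scaler", "svc"],
    svc__sample_weight=[1.0, 2.0, 1.0],
)
```

Here the svc step would receive sample_weight in its fit call, while the scaler step receives no extra parameters.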

fit_predict(X, y=None, **fit_params)

Applies fit_predict of last step in pipeline after transforms.

Applies fit_transforms of a pipeline to the data, followed by the fit_predict method of the final estimator in the pipeline. Valid only if the final estimator implements fit_predict.

Parameters:
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Returns:

y_pred

Return type:

array-like

fit_transform(X, y=None, **fit_params)

Fit the model and transform with the final estimator

Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.

Parameters:
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Returns:

Xt – Transformed samples

Return type:

array-like of shape (n_samples, n_transformed_features)

get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
inverse_transform

Apply inverse transformations in reverse order

All estimators in the pipeline must support inverse_transform.

Parameters:Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.
Returns:Xt
Return type:array-like of shape (n_samples, n_features)
predict(X, **predict_params)

Apply transforms to the data, and predict with the final estimator

Parameters:
  • X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
  • **predict_params (dict of string -> object) –

    Parameters to the predict called at the end of all transformations in the pipeline. Note that while this may be used to return uncertainties from some models with return_std or return_cov, uncertainties that are generated by the transformations in the pipeline are not propagated to the final estimator.

    New in version 0.20.

Returns:

y_pred

Return type:

array-like

predict_log_proba(X)

Apply transforms, and predict_log_proba of the final estimator

Parameters:X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
Returns:y_score
Return type:array-like of shape (n_samples, n_classes)
predict_proba(X)

Apply transforms, and predict_proba of the final estimator

Parameters:X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
Returns:y_proba
Return type:array-like of shape (n_samples, n_classes)
score(X, y=None, sample_weight=None)

Apply transforms, and score with the final estimator

Parameters:
  • X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
  • y (iterable, default=None) – Targets used for scoring. Must fulfill label requirements for all steps of the pipeline.
  • sample_weight (array-like, default=None) – If not None, this argument is passed as sample_weight keyword argument to the score method of the final estimator.
Returns:

score

Return type:

float

score_samples(X)

Apply transforms, and score_samples of the final estimator.

Parameters:X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
Returns:y_score
Return type:ndarray of shape (n_samples,)
set_params(**kwargs)

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params().

Returns:
Return type:self
transform

Apply transforms, and transform with the final estimator

This also works where final estimator is None: all prior transformations are applied.

Parameters:X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.
Returns:Xt
Return type:array-like of shape (n_samples, n_transformed_features)

Metric

class ml_tooling.metrics.Metric(name: str, score: float = None, cross_val_scores: Optional[numpy.ndarray] = None)

Represents a single metric, containing a metric name and its corresponding score. Can be instantiated using any sklearn-compatible scoring string.

A Metric knows how to generate its own score by calling score_metric(), passing an estimator, an x and a y. A Metric can also get a cross-validated score by calling score_metric_cv() and passing a cv value - either a CV object or an int specifying the number of folds.

Examples

>>> from ml_tooling.metrics import Metric
>>> from sklearn.linear_model import LinearRegression
>>> import numpy as np
>>> metric = Metric('r2')
>>> x = np.array([[1],[2],[3],[4]])
>>> y = np.array([[2], [4], [6], [8]])
>>> estimator = LinearRegression().fit(x, y)
>>> metric.score_metric(estimator, x, y)
Metric(name='r2', score=1.0)
>>> metric.score
1.0
>>> metric.name
'r2'
>>> metric.score_metric_cv(estimator, x, y, cv=2)
Metric(name='r2', score=1.0)
>>> metric.score
1.0
>>> metric.name
'r2'
>>> metric.cross_val_scores
array([1., 1.])
>>> metric.std
0.0


score_metric(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: Union[pandas.core.series.Series, numpy.ndarray], y: Union[pandas.core.series.Series, numpy.ndarray]) → ml_tooling.metrics.metric.Metric

Calculates the score for this metric. Takes a fitted estimator, x and y values. Scores are calculated with sklearn metrics - using the string defined in self.name to look up the appropriate scoring function.

Parameters:
  • estimator (Pipeline or BaseEstimator) – A fitted estimator to score
  • x (np.ndarray, pd.DataFrame) – Features to score model with
  • y (np.ndarray, pd.Series) – Target to score model with
Returns:

Return type:

self

score_metric_cv(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: Union[pandas.core.series.Series, numpy.ndarray], y: Union[pandas.core.series.Series, numpy.ndarray], cv: Any, n_jobs: int = -1, verbose: int = 0) → ml_tooling.metrics.metric.Metric

Score metric using cross-validation. When scoring with cross-validation, self.cross_val_scores is populated with the cross-validated scores and self.score is set to the mean value of self.cross_val_scores. Cross-validation can be parallelized by passing the n_jobs parameter.

Parameters:
  • estimator (Pipeline or BaseEstimator) – Fitted estimator to score
  • x (np.ndarray or pd.DataFrame) – Features to use in scoring
  • y (np.ndarray or pd.Series) – Target to use in scoring
  • cv (int, BaseCrossValidator) – If an int is passed, cross-validate using K-Fold with cv folds. If BaseCrossValidator is passed, use that object instead
  • n_jobs (int) – Number of jobs to use in parallelizing. Pass None to not do CV in parallel
  • verbose (int) – Verbosity level of output
Returns:

Return type:

self
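
The relationship between cross_val_scores, score and std can be sketched with the standard library. summarize_cv is a hypothetical helper and the fold scores are made up; using the population standard deviation is an assumption about how std is computed.

```python
import statistics


def summarize_cv(cross_val_scores):
    """Return (score, std) where score is the mean of the fold scores."""
    score = statistics.mean(cross_val_scores)
    # Assumption: population standard deviation across folds
    std = statistics.pstdev(cross_val_scores)
    return score, std


score, std = summarize_cv([0.80, 0.90, 0.85, 0.85])
perfect_score, perfect_std = summarize_cv([1.0, 1.0])
```

A small std relative to the score suggests the estimator performs consistently across folds; identical fold scores, as in the Metric example above, give a std of 0.0.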

class ml_tooling.metrics.Metrics(metrics: List[ml_tooling.metrics.metric.Metric])

Represents a collection of Metric. This is the default object used when scoring an estimator.

There are two alternate constructors:

  • from_list() takes a list of metric names and instantiates one metric per list item
  • from_dict() takes a dictionary of name -> score and instantiates one metric with the given score per dictionary item

Calling either score_metrics() or score_metrics_cv() will in turn call score_metric() or score_metric_cv() of each Metric in its collection

Examples

To score multiple metrics, create a metrics object from a list and call score_metrics() to score all metrics in one operation

We can convert metrics to a dictionary

or a list
