API¶

Model¶

A wrapper for scikit-learn based estimators. Implements all the base functionality needed to create the wrapper.

class ml_tooling.baseclass.Model(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], feature_pipeline: sklearn.pipeline.Pipeline = None)¶

Wrapper class for Estimators

Parameters:

estimator (Estimator) – Any scikit-learn compatible estimator
feature_pipeline (Pipeline) – Optionally pass a feature preprocessing Pipeline. Model will automatically insert the estimator into a preprocessing pipeline

bayesiansearch(data: ml_tooling.data.base_data.Dataset, param_distributions: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, n_iter: int = 10, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]¶

Runs a cross-validated Bayesian Search on the estimator with a randomized sampling of the passed parameter distributions

Parameters:

data (Dataset) – An instance of a Dataset object
param_distributions (dict) – Parameter distributions to use for randomizing search. Should be a dictionary mapping parameter names to one of ml_tooling.search.Integer, ml_tooling.search.Categorical or ml_tooling.search.Real
metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
n_iter (int) – Number of parameter settings that are sampled.
refit (bool) – Whether or not to refit the best model

Returns:

best_estimator (Model) – Best estimator as found by the Bayesian Search
result_group (ResultGroup) – ResultGroup object containing each individual score

default_metric¶

Defines the default metric based on whether the estimator is a regressor or a classifier. Returns CLASSIFIER_METRIC for classifiers and REGRESSION_METRIC for regressors.

Returns:	Name of the metric
Return type:	str
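
As a rough illustration of the selection described above, the sketch below picks a metric name from a stand-in config. SketchConfig and default_metric_for are hypothetical names, and passing the estimator type as a string is an assumption made for brevity; this is not ML Tooling's implementation.

```python
class SketchConfig:
    # Values mirror the defaults documented for DefaultConfig.
    CLASSIFIER_METRIC = "accuracy"
    REGRESSION_METRIC = "r2"


def default_metric_for(estimator_type: str, config=SketchConfig) -> str:
    """Return the default metric name for a classifier or a regressor."""
    if estimator_type == "classifier":
        return config.CLASSIFIER_METRIC
    if estimator_type == "regressor":
        return config.REGRESSION_METRIC
    raise ValueError(f"Unknown estimator type: {estimator_type!r}")


print(default_metric_for("classifier"))
print(default_metric_for("regressor"))
```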

gridsearch(data: ml_tooling.data.base_data.Dataset, param_grid: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]¶

Runs a cross-validated gridsearch on the estimator with the passed in parameter grid.

Parameters:

data (Dataset) – An instance of a Dataset object
param_grid (dict) – Parameters to use for grid search
metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
refit (bool) – Whether or not to refit the best model

Returns:

best_estimator (Model) – Best estimator as found by the gridsearch
result_group (ResultGroup) – ResultGroup object containing each individual score

static list_estimators(storage: ml_tooling.storage.base.Storage) → List[pathlib.Path]¶

Gets a list of estimators from the given Storage

Parameters:	storage (Storage) – Storage class to list the estimators with

Example

storage = FileStorage('path/to/estimators_dir')
estimator_list = Model.list_estimators(storage)

Returns:	list of Paths
Return type:	List[pathlib.Path]

classmethod load_estimator(path: Union[str, pathlib.Path], storage: ml_tooling.storage.base.Storage = None) → ml_tooling.baseclass.Model¶

Instantiates the class with a joblib pickled estimator.

Parameters:

path (str, pathlib.Path) – Path to estimator pickle file
storage (Storage, optional) – Storage class to load the estimator with

Example

We can load a trained estimator from disk:

storage = FileStorage('path/to/dir')
my_estimator = Model.load_estimator('my_model.pkl', storage=storage)

We now have a trained estimator loaded.

We can also use the default storage:

my_estimator = Model.load_estimator('my_model.pkl')

This will use the default FileStorage defined in Model.config.default_storage

Returns:	Instance of Model with a saved estimator
Return type:	Model

classmethod load_production_estimator(module_name: str)¶

Loads a model from a Python package. If the package is an ML-Tooling package, this loads the production model from the package and creates an instance of Model from it.

Parameters:	module_name (str) – The name of the package to load a model from

log(run_directory: str)¶

log() is a context manager that lets you turn on logging for any scoring methods that follow. You can pass a run_directory to specify a subdirectory to store the run details in. The output is a YAML file recording estimator parameters, package version numbers, metrics and other useful information.

Parameters:	run_directory (str) – Name of the folder to save the details in

Example

If we want to log an estimator run in the score folder we can write:

with estimator.log('score'):
    estimator.score_estimator(data)

This will save the results of estimator.score_estimator() to runs/score/
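
The general pattern behind a logging context manager like this can be sketched with contextlib. The sketch below is a minimal stand-in, not ML Tooling's code: SketchConfig and this free-standing log function are hypothetical, and the idea that the flag and run folder are restored on exit is an assumption.

```python
from contextlib import contextmanager


class SketchConfig:
    LOG = False         # logging is off by default
    RUN_DIR = "./runs"  # documented default run folder


@contextmanager
def log(config, run_directory: str):
    """Enable logging inside the with-block, then restore the old settings."""
    old_log, old_dir = config.LOG, config.RUN_DIR
    config.LOG = True
    config.RUN_DIR = f"{old_dir}/{run_directory}"
    try:
        yield config
    finally:
        # Restore the previous values even if the body raises.
        config.LOG, config.RUN_DIR = old_log, old_dir


config = SketchConfig()
with log(config, "score"):
    inside = (config.LOG, config.RUN_DIR)
after = (config.LOG, config.RUN_DIR)
print(inside, after)
```

Restoring the settings in a finally clause means a failed scoring call does not leave logging switched on for later runs.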

make_prediction(data: ml_tooling.data.base_data.Dataset, *args, proba: bool = False, threshold: float = None, use_index: bool = False, use_cache: bool = False, **kwargs) → pandas.core.frame.DataFrame¶

Makes a prediction given an input, for example a customer number. Calls load_prediction_data(*args) and passes the resulting data to predict() on the estimator.

Parameters:

data (Dataset) – An instantiated Dataset object
proba (bool) – Whether prediction is returned as a probability or not. Note that the return value is an n-dimensional array where n = number of classes
threshold (float) – Threshold to use for predicting a binary class
use_index (bool) – Whether the index from the prediction data should be used for the result
use_cache (bool) – Whether or not to use the cached data in dataset to make predictions. Useful for seeing probability distributions of the model

Returns:	A DataFrame with a prediction per row.
Return type:	pd.DataFrame
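
To make the threshold parameter concrete, here is a minimal sketch of turning positive-class probabilities into binary labels. apply_threshold is a hypothetical helper, and using >= rather than > at the boundary is an assumption; the library may differ.

```python
from typing import List, Sequence


def apply_threshold(positive_proba: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Label a row 1 when its positive-class probability reaches the threshold."""
    return [int(p >= threshold) for p in positive_proba]


probabilities = [0.10, 0.45, 0.50, 0.80]
print(apply_threshold(probabilities))
print(apply_threshold(probabilities, threshold=0.4))
```

Lowering the threshold trades precision for recall: more rows are labelled as the positive class.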

randomsearch(data: ml_tooling.data.base_data.Dataset, param_distributions: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, n_iter: int = 10, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]¶

Runs a cross-validated randomsearch on the estimator with a randomized sampling of the passed parameter distributions

Parameters:

data (Dataset) – An instance of a Dataset object
param_distributions (dict) – Parameter distributions to use for randomizing search
metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
n_iter (int) – Number of parameter settings that are sampled.
refit (bool) – Whether or not to refit the best model

Returns:

best_estimator (Model) – Best estimator as found by the randomsearch
result_group (ResultGroup) – ResultGroup object containing each individual score

save_estimator(storage: ml_tooling.storage.base.Storage = None, prod=False) → pathlib.Path¶

Saves the estimator as a binary file.

Parameters:	storage (Storage) – Storage class to save the estimator with prod (bool) – Whether this is a production model to be saved

Example

If we have trained an estimator and we want to save it to disk we can write:

storage = FileStorage('/path/to/save/dir')
model = Model(LinearRegression())
saved_filename = model.save_estimator(storage)

to save in the given folder.

Returns:	The path to where the estimator file was saved
Return type:	pathlib.Path

score_estimator(data: ml_tooling.data.base_data.Dataset, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = False) → ml_tooling.result.result.Result¶

Scores the estimator based on training data from data and validates based on validation data from data.

Defaults to no cross-validation. If you want to cross-validate the results, pass number of folds to cv. If cross-validation is used, score_estimator only cross-validates on training data and doesn’t use the validation data.

If the dataset does not have a train set, it will create one using the default config.

Returns a Result object containing all result parameters

Parameters:

data (Dataset) – An instantiated Dataset object with create_train_test called
metrics (string, list of strings) – Metric or metrics to use for scoring the estimator. Any sklearn metric string
cv (int, optional) – Whether or not to use cross validation. Number of folds if an int is passed. If False, don’t use cross validation

Returns:	A Result object that contains the results of the scoring
Return type:	Result

classmethod test_estimators(data: ml_tooling.data.base_data.Dataset, estimators: Sequence[Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]], feature_pipeline: sklearn.pipeline.Pipeline = None, metrics: Union[str, List[str]] = 'default', cv: Union[int, bool] = False, log_dir: str = None, refit: bool = False) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]¶

Trains each estimator passed and returns a sorted list of results

Parameters:

data (Dataset) – An instantiated Dataset object with train_test data
estimators (Sequence[Estimator]) – List of estimators to train
feature_pipeline (Pipeline) – A pipeline for transforming features
metrics (str, list of str) – Metric or list of metrics to use in scoring of estimators
cv (int, bool) – Whether or not to use cross-validation. If an int is passed, use that many folds
log_dir (str, optional) – Where to store logged estimators. If None, don’t log
refit (bool) – Whether or not to refit the best model on all the training data

Returns:	The best estimator and a ResultGroup containing the sorted results
Return type:	Tuple[Model, ResultGroup]

to_dict() → List[dict]¶

Serializes the estimator to a dictionary

Returns:
Return type:	List of dicts

train_estimator(data: ml_tooling.data.base_data.Dataset) → ml_tooling.baseclass.Model¶

Loads all training data and trains the estimator on all data. Typically used as the last step when estimator tuning is complete.

Warning

This will set self.result attribute to None. This method trains the estimator using all the data, so there is no validation data to measure results against

Returns:	Returns an estimator trained on all the data, with no train-test split
Return type:	Model

Datasets¶

class ml_tooling.data.FileDataset(path: Union[str, pathlib.Path])¶

An Abstract Base Class for use in creating file-based Datasets. This class is intended to be subclassed and must provide a load_training_data() and load_prediction_data() method.

FileDataset takes a path as its initialization argument, pointing to a file which must be a filetype supported by Pandas, such as csv, parquet etc. The extension determines the pandas method used to read and write the data

Instantiates a FileDataset pointing at a given path.

Parameters:	path (Pathlike) – Path to location of file

load_prediction_data(*args, **kwargs) → pandas.core.frame.DataFrame¶: Used to load prediction data for a given idx - returns features

load_training_data(*args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]¶: Used to load the full training dataset - returns features and targets

read_file(**kwargs)¶

Read the data from the passed file path

Parameters:	kwargs (dict) – Kwargs are passed to the relevant `pd.read_*()` method for the given extension
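
Choosing a reader from the file extension can be sketched as a simple lookup. The READERS table and reader_name are hypothetical; only pd.read_csv and pd.read_parquet are named here, and the set of extensions the library actually supports may differ.

```python
from pathlib import Path

# Hypothetical lookup table: file suffix -> name of the pandas reader function.
READERS = {".csv": "read_csv", ".parquet": "read_parquet"}


def reader_name(path) -> str:
    """Return the name of the pandas reader matching the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None


print(reader_name("data/train.csv"))
print(reader_name(Path("data/train.parquet")))
```

With pandas installed, the returned name could be resolved with getattr(pd, reader_name(path)) and called with the path and any extra kwargs.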

class ml_tooling.data.SQLDataset(conn: Union[str, sqlalchemy.engine.interfaces.Connectable], schema: Optional[str], **kwargs)¶

An Abstract Base Class for use in creating SQL Datasets. This class is intended to be subclassed and must provide a load_training_data() and load_prediction_data() method.

These methods must accept a conn argument which is an instance of a SQLAlchemy connection. This connection will be passed to the method by the SQLDataset at runtime.

table¶

SQLAlchemy table definition to use when loading the dataset. Is the table that will be copied when using .copy_to and should be the canonical definition of the feature set. Do not define a schema - that is set at runtime

Type:	sa.Table

Instantiates a dataset with the necessary arguments to connect to the database.

Parameters:

conn (Connectable) – Either a valid DB_URL string or an engine to connect to the database
schema (str) – A string naming the schema to use - allows for swapping schemas at runtime
kwargs (dict) – Kwargs are passed to create_engine if conn is a string

copy_to(target: ml_tooling.data.sql.SQLDataset) → ml_tooling.data.sql.SQLDataset¶

Copies data from one database table into another. This will truncate the target table and load the new data in.

Parameters:	target (SQLDataset) – A SQLDataset object representing the table you want to copy the data into
Returns:	The target dataset to copy to
Return type:	SQLDataset

create_connection() → sqlalchemy.engine.base.Connection¶

Instantiates a connection to be used in reading and writing to the database.

Ensures that connections are closed properly and dynamically inserts the schema into the database connection

Returns:	An open connection to the database, with a dynamically defined schema
Return type:	sa.engine.Connection

load_prediction_data(idx, conn, *args, **kwargs) → pandas.core.frame.DataFrame¶: Used to load prediction data for a given idx - returns features

load_training_data(conn, *args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]¶: Used to load the full training dataset - returns features and targets

class ml_tooling.data.Dataset¶

Baseclass for creating Datasets. Subclass Dataset and provide a load_training_data() and load_prediction_data() method

create_train_test(stratify: bool = False, shuffle: bool = True, test_size: float = 0.25, seed: int = 42) → ml_tooling.data.base_data.Dataset¶

Creates a training and a testing dataset and stores them on the data object.

Parameters:

stratify (DataType, optional) – What to stratify the split on. Usually y if given a classification problem
shuffle – Whether or not to shuffle the data
test_size – What percentage of the data will be part of the test set
seed – Random seed for train_test_split

Returns:
Return type:	self
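
How shuffle, test_size and seed interact can be sketched with the standard library alone. split_indices is a hypothetical helper; it ignores stratification, and rounding the test size up is an assumption rather than a statement about the library.

```python
import math
import random
from typing import List, Tuple


def split_indices(n_rows: int, test_size: float = 0.25, shuffle: bool = True,
                  seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for a dataset with n_rows rows."""
    indices = list(range(n_rows))
    if shuffle:
        # A dedicated Random instance keeps the split reproducible per seed.
        random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # rounding up is an assumption
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(8)
print(len(train_idx), len(test_idx))
```

Calling the function twice with the same seed yields the same split, which is what makes scores comparable between runs.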

load_prediction_data(*args, **kwargs) → pandas.core.frame.DataFrame¶

Abstract method to be implemented by the user. Defines data to be used at prediction time, defined as a DataFrame

Returns:	DataFrame of input features to get a prediction
Return type:	pd.DataFrame

load_training_data(*args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]¶

Abstract method to be implemented by user. Defines data to be used at training time where X is a dataframe and y is a numpy array

Returns:	x, y – Training data to be used by the models
Return type:	Tuple of DataTypes
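
The subclassing contract can be illustrated without any dependencies. In the sketch below SketchDataset stands in for the real base class and plain lists stand in for DataFrames; CustomerDataset, its rows and its targets are all made up for illustration.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class SketchDataset(ABC):
    """Stand-in base class: subclasses must provide both loading methods."""

    @abstractmethod
    def load_training_data(self, *args, **kwargs) -> Tuple[List[dict], List[int]]:
        ...

    @abstractmethod
    def load_prediction_data(self, *args, **kwargs) -> List[dict]:
        ...


class CustomerDataset(SketchDataset):
    # Made-up feature rows and targets, keyed by customer number.
    ROWS = {1: {"age": 34, "visits": 3}, 2: {"age": 51, "visits": 9}}
    TARGETS = {1: 0, 2: 1}

    def load_training_data(self):
        ids = sorted(self.ROWS)
        return [self.ROWS[i] for i in ids], [self.TARGETS[i] for i in ids]

    def load_prediction_data(self, idx):
        # Features only: no target is known at prediction time.
        return [self.ROWS[idx]]


features, targets = CustomerDataset().load_training_data()
print(len(features), targets)
```

Because both methods are abstract, instantiating the base class directly raises a TypeError, which catches incomplete subclasses early.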

ml_tooling.data.load_demo_dataset(dataset_name: str, **kwargs) → ml_tooling.data.base_data.Dataset¶

Create a Dataset implementing the demo datasets from sklearn.datasets

Parameters:

dataset_name (str) – Name of the dataset to use. If ‘openml’ is passed, either the parameter name or data_id needs to be specified. One of: iris, boston, diabetes, digits, linnerud, wine, breast_cancer, openml
**kwargs – Kwargs are passed on to the scikit-learn dataset function

Returns:	An instance of `Dataset`
Return type:	Dataset

Dataset Plotting Methods¶

class ml_tooling.plots.viz.data_viz.DataVisualize(data)¶

missing_data(ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, feature_pipeline: Optional[sklearn.pipeline.Pipeline] = None) → matplotlib.axes._axes.Axes¶

Plot number of missing data points per column. Sorted by number of missing values.

Also allows for selecting top_n/bottom_n number or percent of columns by passing an int or float

Parameters:

ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
feature_pipeline (Pipeline) – A feature transformation pipeline to be applied before graphing the final results

Returns:
Return type:	plt.Axes
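
The int-versus-float behaviour of top_n and bottom_n can be sketched as follows. select and _count are hypothetical helpers operating on an already sorted sequence; rounding fractional counts up and clamping to the sequence length are assumptions, and overlapping selections are not de-duplicated.

```python
import math
from typing import List, Optional, Sequence, Union

Number = Union[int, float]


def _count(n: Number, total: int) -> int:
    """Turn an int (a count) or a float in (0, 1) (a share) into a row count."""
    if isinstance(n, float):
        if not 0 < n < 1:
            raise ValueError("float values must lie strictly between 0 and 1")
        n = math.ceil(n * total)  # rounding up is an assumption
    return min(n, total)


def select(ranked: Sequence, top_n: Optional[Number] = None,
           bottom_n: Optional[Number] = None) -> List:
    """Keep the first top_n and/or the last bottom_n items of a sorted sequence."""
    if top_n is None and bottom_n is None:
        return list(ranked)
    total = len(ranked)
    head = list(ranked[:_count(top_n, total)]) if top_n is not None else []
    tail = list(ranked[total - _count(bottom_n, total):]) if bottom_n is not None else []
    return head + tail


columns = [("a", 9), ("b", 5), ("c", 2), ("d", 0)]  # sorted by missing count
print(select(columns, top_n=2))
print(select(columns, top_n=0.5))
print(select(columns, top_n=1, bottom_n=1))
```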

target_correlation(method: str = 'spearman', ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, feature_pipeline: Optional[sklearn.pipeline.Pipeline] = None) → matplotlib.axes._axes.Axes¶

Plot the correlation between each feature and the target variable using the given method.

Also allows selecting how many features to show by setting the top_n and/or bottom_n parameters.

Parameters:

method (str) – Which method to use when calculating correlation. Supports one of ‘pearson’, ‘spearman’, ‘kendall’
ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
feature_pipeline (Pipeline) – A feature transformation pipeline to be applied before graphing the data

Returns:
Return type:	plt.Axes

Storage¶

class ml_tooling.storage.FileStorage(dir_path: Union[str, pathlib.Path] = None)¶

File Storage class for handling storage of estimators to the file system

get_list() → List[pathlib.Path]¶

Finds a list of estimator file paths in the FileStorage directory.

Example

Find and return estimator paths in a given directory:

my_estimators = FileStorage('path/to/dir').get_list()

Returns:	list of paths to files sorted by filename
Return type:	List[Path]

load(file_path: Union[str, pathlib.Path]) → Any¶

Loads a joblib pickled estimator from given filepath and returns the unpickled object

Parameters:	file_path (Pathlike) – Path where to load the estimator file relative to FileStorage

Example

We can load a saved pickled estimator from disk directly from FileStorage:

storage = FileStorage('path/to/dir')
my_estimator = storage.load('mymodel.pkl')

We now have a trained estimator loaded.

Returns:	The object loaded from disk
Return type:	Object

save(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], filename: str, prod: bool = False) → pathlib.Path¶

Save a joblib pickled estimator.

Parameters:

estimator (obj) – The estimator object
filename (str) – Filename of estimator pickle file
prod (bool) – Whether or not to save in “production mode” - Production mode saves to /src/<projectname>/ regardless of what FileStorage was instantiated with

Example

To save your trained estimator, call save() on a FileStorage instance:

storage = FileStorage('/path/to/save/dir/')
file_path = storage.save(estimator, 'filename')

We now have saved an estimator to a pickle file.

Returns:	Path to the saved object
Return type:	Path

class ml_tooling.storage.Storage¶

Base class for Storage classes

get_list() → List[pathlib.Path]¶

Abstract method to be implemented by the user. Defines method used to show which objects have been saved

Returns:	Paths to each of the estimators sorted lexically
Return type:	List[Path]

load(file_path: Union[str, pathlib.Path]) → Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]¶

Abstract method to be implemented by the user. Defines method used to load data from the storage type

Returns:	Returns the unpickled object
Return type:	Estimator

save(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], file_path: Union[str, pathlib.Path], prod: bool = False) → Union[str, pathlib.Path]¶

Abstract method to be implemented by the user. Defines method used to save data to the storage type

Returns:	Path to where the pickled object is saved
Return type:	Pathlike
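
A custom backend only needs the three methods listed above. The sketch below uses a stand-in abstract class, SketchStorage, rather than the real Storage base class, and a hypothetical InMemoryStorage that keeps pickled bytes in a dictionary; it ignores the prod flag.

```python
import pickle
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, List


class SketchStorage(ABC):
    """Stand-in for the abstract interface: get_list, load and save."""

    @abstractmethod
    def get_list(self) -> List[Path]:
        ...

    @abstractmethod
    def load(self, file_path) -> Any:
        ...

    @abstractmethod
    def save(self, estimator, file_path, prod: bool = False) -> Path:
        ...


class InMemoryStorage(SketchStorage):
    def __init__(self):
        self._blobs: Dict[Path, bytes] = {}

    def get_list(self) -> List[Path]:
        # Sorted so the listing order is stable.
        return sorted(self._blobs)

    def load(self, file_path) -> Any:
        return pickle.loads(self._blobs[Path(file_path)])

    def save(self, estimator, file_path, prod: bool = False) -> Path:
        path = Path(file_path)
        self._blobs[path] = pickle.dumps(estimator)
        return path


storage = InMemoryStorage()
storage.save({"alpha": 0.1}, "b_model.pkl")
storage.save({"alpha": 1.0}, "a_model.pkl")
print(storage.get_list())
```

A plain dictionary stands in for a fitted estimator here; any picklable object would round-trip the same way.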

class ml_tooling.storage.ArtifactoryStorage(artifactory_url: str, repo: str, apikey: Optional[str] = None, auth: Optional[Tuple[str, str]] = None)¶

Artifactory Storage class for handling storage of estimators to JFrog artifactory

Example

Instantiate this class with a url and path to the repo like so:

storage = ArtifactoryStorage('http://artifactory.com', 'path/to/artifact')

get_list() → List[ArtifactoryPath]¶

Finds a list of estimator artifact paths in the ArtifactoryStorage repo.

Example

Find and return estimator paths in a given directory:

my_estimators = ArtifactoryStorage('http://artifactory.com', 'path/to/repo').get_list()

Returns:	list of paths to files sorted by filename
Return type:	List[ArtifactoryPath]

load(file_path: Union[str, pathlib.Path]) → Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]¶

Loads a pickled estimator from given filepath and returns the estimator

Parameters:	file_path (Pathlike) – Path to load the estimator relative to ArtifactoryStorage

Example

We can load a saved pickled estimator directly from Artifactory:

storage = ArtifactoryStorage('http://artifactory.com', 'path/to/repo')
my_estimator = storage.load('estimatorfile')

We now have a trained estimator loaded.

Returns:	estimator unpickled object
Return type:	Object

save(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], filename: str, prod: bool = False) → ArtifactoryPath¶

Save a pickled estimator to artifactory.

Parameters:

estimator (Estimator) – The estimator object
filename (str) – Filename of estimator pickle file
prod (bool) – Production variable, set to True if saving a production-ready estimator

Example

To save your trained estimator:

storage = ArtifactoryStorage('http://artifactory.com', 'path/to/repo')
artifactory_path = storage.save(estimator, 'estimator.pkl')

We now have saved an estimator to a pickle file.

Returns:	File path to stored estimator
Return type:	ArtifactoryPath

Config¶

All configuration options available

class ml_tooling.config.DefaultConfig¶

Configuration for Models

VERBOSITY = 0: The level of verbosity from output
CLASSIFIER_METRIC = 'accuracy': Default metric for classifiers
REGRESSION_METRIC = 'r2': Default metric for regressions
CROSS_VALIDATION = 10: Default Number of cross validation folds to use
N_JOBS = -1: Default number of cores to use when doing multiprocessing. -1 means use all available
RANDOM_STATE = 42: Default random state seed for all functions involving randomness
RUN_DIR = './runs': Default folder to store run logging files
ESTIMATOR_DIR = './models': Default folder to store pickled models in
LOG = False: Toggles whether or not to log runs to a file. Set to True if you want every run to be logged, else use the log() context manager
TRAIN_TEST_SHUFFLE = True: Default whether or not to shuffle data for test set
TEST_SIZE = 0.25: Default percentage of data that will be part of the test set

Result¶

Result class to work with results from scoring a model

class ml_tooling.result.Result(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], metrics: ml_tooling.metrics.metric.Metrics, data: ml_tooling.data.base_data.Dataset)¶

Contains the result of a given training run. Contains plotting methods, as well as being comparable with other results

Parameters:

estimator (Estimator) – Estimator used to generate the result
metrics (Metrics) – Metrics used to score the model
data (Dataset) – Dataset used to generate the result

Method generated by attrs for class Result.

ResultGroup¶

A container of Results - some methods in ML Tooling return multiple results, which will be grouped into a ResultGroup. A ResultGroup is sorted by the Result metric and proxies attributes to the best result

class ml_tooling.result.ResultGroup(results: List[ml_tooling.result.result.Result])¶

A container for results. Proxies attributes to the best result. Supports indexing like a list.

Method generated by attrs for class ResultGroup.
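
The container behaviour described above (sorted results, list-style indexing, attribute access forwarded to the best result) can be sketched in a few lines. SketchResult and SketchResultGroup are hypothetical stand-ins, and treating a higher score as better is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchResult:
    name: str
    score: float


class SketchResultGroup:
    def __init__(self, results: List[SketchResult]):
        # Best result first; "higher is better" is an assumption.
        self.results = sorted(results, key=lambda r: r.score, reverse=True)

    def __getitem__(self, index):
        return self.results[index]

    def __len__(self) -> int:
        return len(self.results)

    def __getattr__(self, name):
        # Only called when normal lookup fails; guard against recursion
        # if "results" itself has not been set yet.
        if name == "results":
            raise AttributeError(name)
        return getattr(self.results[0], name)


group = SketchResultGroup([SketchResult("tree", 0.81), SketchResult("forest", 0.90)])
print(group.name, group.score)
print(group[1].name, len(group))
```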

Classification Result Visualizations¶

class ml_tooling.plots.viz.ClassificationVisualize(estimator, data)¶

Visualization class for Classification models

confusion_matrix(normalized: bool = True, threshold: Optional[float] = None, **kwargs) → matplotlib.axes._axes.Axes¶

Visualize a confusion matrix for a classification estimator. Any kwargs are passed onto matplotlib.

Parameters:

normalized (bool) – Whether or not to normalize annotated class counts
threshold (float) – Threshold to use for classification - defaults to 0.5

Returns:	Returns a Confusion Matrix plot
Return type:	plt.Axes

default_metric¶

Finds estimator_type for estimator in a BaseVisualize and returns default metric for this class stated in .config. If passed estimator is a Pipeline, assume last step is the estimator.

Returns:	Name of the metric
Return type:	str

feature_importance(top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, class_index: int = None, add_label: bool = True, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶

Visualizes feature importance of the estimator through permutation.

Parameters:

top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
class_index (int, optional) – In a multi-class setting, plot the feature importances for the given label. If None, assume a binary classification
add_label (bool) – Toggles value labels on end of each bar
ax (Axes) – Draws graph on passed ax - otherwise creates new ax
kwargs (dict) – Passed to plt.barh

Returns:
Return type:	matplotlib.Axes

learning_curve(cv: int = None, scoring: str = 'default', n_jobs: int = None, train_sizes: Sequence[float] = array([0.1, 0.325, 0.55, 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶

Generates a learning_curve() plot, used to determine model performance as a function of number of training examples.

Illustrates whether or not number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.

Parameters:

cv (int) – Number of CV iterations to run
scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
train_sizes (Sequence of floats) – Percentage intervals of data to use when training
ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created
kwargs (dict) – Passed along to matplotlib line plots

Returns:
Return type:	plt.Axes

lift_curve(**kwargs) → matplotlib.axes._axes.Axes¶

Visualize a Lift Curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.

Parameters:	kwargs (optional) – Keyword arguments to pass on to matplotlib
Returns:
Return type:	plt.Axes

permutation_importance(n_repeats: int = 5, scoring: str = 'default', top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, add_label: bool = True, n_jobs: int = None, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶

Visualizes feature importance of the estimator through permutation.

Parameters:

n_repeats (int) – Number of times to permute a feature
scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
add_label (bool) – Toggles value labels on end of each bar
ax (Axes) – Draws graph on passed ax - otherwise creates new ax
n_jobs (int, optional) – Number of parallel jobs to run. Defaults to N_JOBS setting in config
kwargs (dict) – Passed to plt.barh

Returns:
Return type:	matplotlib.Axes

precision_recall_curve(labels: List[str] = None, **kwargs) → matplotlib.axes._axes.Axes¶

Visualize a Precision-Recall curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.

Parameters:

labels (List of str) – Labels to use for the class names if multi-class
kwargs (optional) – Keyword arguments to pass on to matplotlib

Returns:	Plot of precision-recall curve
Return type:	plt.Axes

roc_curve(labels: List[str] = None, **kwargs) → matplotlib.axes._axes.Axes¶

Visualize a ROC curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.

Parameters:

labels (List of str) – Labels to use for the class names if multi-class
kwargs (optional) – Keyword arguments to pass on to matplotlib

Returns:	Returns a ROC AUC plot
Return type:	plt.Axes

validation_curve(param_name: str, param_range: Sequence[T_co], n_jobs: int = None, cv: int = None, scoring: str = 'default', ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶

Generates a validation_curve() plot, graphing the impact of changing a hyperparameter on the scoring metric.

This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.

Parameters:

param_name (str) – Name of hyperparameter to plot
param_range (Sequence) – The individual values to plot for param_name
n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
cv (int) – Number of CV iterations to run. Defaults to value in Model.config. Uses a `StratifiedKFold` if `estimator` is a classifier - otherwise a `KFold` is used
scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created
kwargs (dict) – Passed along to matplotlib line plots

Returns:
Return type:	plt.Axes

Regression Result Visualizations¶

class ml_tooling.plots.viz.RegressionVisualize(estimator, data)¶

Visualization class for Regression models

default_metric¶

Finds estimator_type for estimator in a BaseVisualize and returns default metric for this class stated in .config. If passed estimator is a Pipeline, assume last step is the estimator.

Returns:	Name of the metric
Return type:	str

feature_importance(top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, class_index: int = None, add_label: bool = True, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶

Visualizes feature importance of the estimator through permutation.

Parameters:

top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
class_index (int, optional) – In a multi-class setting, plot the feature importances for the given label. If None, assume a binary classification
add_label (bool) – Toggles value labels on end of each bar
ax (Axes) – Draws graph on passed ax - otherwise creates new ax
kwargs (dict) – Passed to plt.barh

Returns:
Return type:	matplotlib.Axes

learning_curve(cv: int = None, scoring: str = 'default', n_jobs: int = None, train_sizes: Sequence[float] = array([0.1, 0.325, 0.55, 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶

Generates a learning_curve() plot, used to determine model performance as a function of number of training examples.

Illustrates whether or not number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.

Parameters:

cv (int) – Number of CV iterations to run
scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
train_sizes (Sequence of floats) – Percentage intervals of data to use when training
ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created
kwargs (dict) – Passed along to matplotlib line plots

Returns:
Return type:	plt.Axes
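
The default train_sizes are five evenly spaced fractions from 0.1 to 1.0. A minimal sketch of how such fractions could map to sample counts follows. Both helpers are hypothetical, and truncating to whole samples with a floor of one is an assumption rather than documented ml_tooling behaviour.

```python
from typing import List, Sequence


def default_train_sizes(n_points: int = 5, start: float = 0.1, stop: float = 1.0) -> List[float]:
    """Evenly spaced fractions from start to stop, both inclusive."""
    step = (stop - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]


def absolute_train_sizes(fractions: Sequence[float], n_max_train: int) -> List[int]:
    """Convert fractions of the largest available training set into sample counts."""
    # Assumption: truncate to whole samples and always keep at least one sample.
    return [max(1, int(fraction * n_max_train)) for fraction in fractions]
```

Each resulting count is one point on the x-axis of the learning curve.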

permutation_importance(n_repeats: int = 5, scoring: str = 'default', top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, add_label: bool = True, n_jobs: int = None, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶

Visualizes feature importance of the estimator through permutation.

Parameters:

n_repeats (int) – Number of times to permute a feature
scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float in (0, 1), return that fraction of the features.
bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float in (0, 1), return that fraction of the features.
add_label (bool) – Toggles value labels on the end of each bar.
ax (Axes) – Draws the graph on the passed ax, otherwise creates a new ax.
n_jobs (int, optional) – Number of parallel jobs to run. Defaults to N_JOBS setting in config.
kwargs (dict) – Passed to plt.barh.

Returns:
Return type:	matplotlib.Axes
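
As a rough illustration of the idea behind permutation importance: shuffle one feature column at a time and measure how much the score drops. The sketch below uses plain Python lists and accuracy as the score. The function names and the seed parameter are assumptions for illustration, not the ml_tooling implementation.

```python
import random
from typing import Callable, List, Sequence

Row = List[float]


def accuracy(predict: Callable[[Row], int], rows: Sequence[Row], y: Sequence[int]) -> float:
    """Share of rows where the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(rows, y)) / len(y)


def permutation_importance(
    predict: Callable[[Row], int],
    rows: Sequence[Row],
    y: Sequence[int],
    n_repeats: int = 5,
    seed: int = 0,
) -> List[float]:
    """Mean drop in accuracy per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, y)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Copy every row, replacing only the permuted column.
            permuted = [row[:col] + [value] + row[col + 1:] for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the predictor never reads gets an importance of exactly zero, because shuffling it cannot change any prediction.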

prediction_error(**kwargs) → matplotlib.axes._axes.Axes¶

Visualizes prediction error of a regression estimator. Any kwargs are passed on to matplotlib.

Returns:	Plot of the estimator’s prediction error
Return type:	matplotlib.Axes

residuals(**kwargs) → matplotlib.axes._axes.Axes¶

Visualizes residuals of a regression estimator. Any kwargs are passed on to matplotlib.

Returns:	Plot of the estimator’s residuals
Return type:	matplotlib.Axes
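
For reference, the quantity shown in a residual plot can be computed as below. This is a sketch assuming the common convention residual = observed - predicted; the sign convention used by ml_tooling is not stated here, and the helper name is hypothetical.

```python
from typing import List, Sequence


def compute_residuals(y_true: Sequence[float], y_pred: Sequence[float]) -> List[float]:
    """Residual per observation, assuming residual = observed - predicted."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return [observed - predicted for observed, predicted in zip(y_true, y_pred)]
```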

validation_curve(param_name: str, param_range: Sequence[T_co], n_jobs: int = None, cv: int = None, scoring: str = 'default', ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶

Generates a validation_curve() plot, graphing the impact of changing a hyperparameter on the scoring metric.

This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.

Parameters:

param_name (str) – Name of hyperparameter to plot
param_range (Sequence) – The individual values to plot for param_name
n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
cv (int) – Number of CV iterations to run. Defaults to value in Model.config. Uses a `StratifiedKFold` if `estimator` is a classifier - otherwise a `KFold` is used.
scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
kwargs (dict) – Passed along to matplotlib line plots

Returns:
Return type:	plt.Axes

Plots¶

ml_tooling.plots.plot_confusion_matrix(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], normalized: bool = True, title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: Sequence[str] = None) → matplotlib.axes._axes.Axes¶

Plots a confusion matrix of predicted labels vs actual labels

Parameters:

y_true – True labels
y_pred – Predicted labels from estimator
normalized – Whether to normalize counts in matrix
title – Title for plot
ax – Pass your own ax
labels – Pass custom list of labels

Returns:	matplotlib.Axes
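
To make the normalized option concrete, here is a minimal sketch of a confusion matrix built from label lists. The helper is hypothetical, and normalizing each row (so that cells are shares of each true class) is an assumption about what normalization means here.

```python
from typing import Hashable, List, Sequence


def confusion_matrix(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    labels: Sequence[Hashable],
    normalized: bool = True,
) -> List[List[float]]:
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0.0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    if normalized:
        # Assumption: divide each row by its total, skipping classes with no samples.
        for row in matrix:
            total = sum(row)
            if total:
                row[:] = [count / total for count in row]
    return matrix
```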

ml_tooling.plots.plot_target_correlation(features: pandas.core.frame.DataFrame, target: Union[pandas.core.series.Series, numpy.ndarray], method: str = 'spearman', ax: matplotlib.axes._axes.Axes = None, top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, title: str = 'Feature-Target Correlation') → matplotlib.axes._axes.Axes¶

Plot the correlation between each feature and the target variable using the given method.

Also allows selecting how many features to show by setting the top_n and/or bottom_n parameters.

Parameters:

features (pd.DataFrame) – Features to plot
target (np.ndarray or pd.Series) – Target to calculate correlation with
method (str) – Which method to use when calculating correlation. Supports one of ‘pearson’, ‘spearman’, ‘kendall’.
ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float in (0, 1), return that fraction of the features.
bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float in (0, 1), return that fraction of the features.
title (str) – Title of graph

Returns:
Return type:	plt.Axes
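
The default method, ‘spearman’, is the Pearson correlation of the ranks of the two variables. The stdlib-only sketch below shows that relationship. The function names are illustrative, and this is not how ml_tooling or pandas computes it internally.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            result[i] = average
        start = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because only ranks matter, any strictly increasing relationship between a feature and the target gives a Spearman correlation of 1, even when the relationship is not linear.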

ml_tooling.plots.plot_feature_importance(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, ax: matplotlib.axes._axes.Axes = None, class_index: int = None, bottom_n: Union[int, float] = None, top_n: Union[int, float] = None, add_label: bool = True, title: str = '', **kwargs) → matplotlib.axes._axes.Axes¶

Plot either the estimator coefficients or the estimator feature importances depending on what is provided by the estimator.

See also ml_tooling.plots.plot_permutation_importance for an unbiased version of feature importance using permutation importance.

Parameters:

estimator (Estimator) – Estimator to use to calculate feature importance
x (DataType) – Features to calculate feature importance for
ax (Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
class_index (int, optional) – In a multi-class setting, choose which class to get feature importances for. If None, will assume a binary classifier
bottom_n (int) – Plot only bottom n features
top_n (int) – Plot only top n features
add_label (bool) – Whether or not to plot text labels for the bars
title (str) – Title to add to the plot
kwargs (dict) – Any kwargs are passed to matplotlib

Returns:
Return type:	plt.Axes

ml_tooling.plots.plot_lift_curve(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None, threshold: float = 0.5) → matplotlib.axes._axes.Axes¶

Plot a lift chart from results. Also calculates the lift score at the given threshold (0.5 by default).

Parameters:

y_true (DataType) – True labels
y_proba (DataType) – Model’s predicted probability
title (str) – Plot title
ax (Axes) – Pass your own ax
labels (List of str) – Labels to use per class
threshold (float) – Threshold to use when determining lift score

Returns:
Return type:	matplotlib.Axes
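
A lift score is commonly defined as the precision among predicted positives divided by the overall share of positives. The sketch below assumes that definition, binary 0/1 labels, and that probabilities at or above the threshold count as positive. It is an illustration, not a copy of the library's calculation.

```python
from typing import Sequence


def lift_score(y_true: Sequence[int], y_proba: Sequence[float], threshold: float = 0.5) -> float:
    """Precision among predicted positives divided by the base rate of positives."""
    # Labels of the rows that the model flags as positive at this threshold.
    flagged = [label for label, proba in zip(y_true, y_proba) if proba >= threshold]
    if not flagged:
        raise ValueError("No predictions at or above the threshold")
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("y_true contains no positive labels")
    precision = sum(flagged) / len(flagged)
    return precision / base_rate
```

A lift above 1 means the flagged group contains a higher share of positives than the data as a whole.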

ml_tooling.plots.plot_prediction_error(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None) → matplotlib.axes._axes.Axes¶

Plots prediction error of regression estimator

Parameters:

y_true – True values
y_pred – Model’s predicted values
title – Plot title
ax – Pass your own ax

Returns:	matplotlib.Axes

ml_tooling.plots.plot_residuals(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None) → matplotlib.axes._axes.Axes¶

Plots residuals from a regression.

Parameters:

y_true – True values
y_pred – Model’s predicted values
title – Plot title
ax – Pass your own ax

Returns:	matplotlib.Axes

ml_tooling.plots.plot_roc_auc(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None) → matplotlib.axes._axes.Axes¶

Plot the ROC AUC curve. Works only with probabilities.

Parameters:

y_true (DataType) – True labels
y_proba (DataType) – Probability estimate from estimator
title (str) – Plot title
ax (Axes) – Pass in your own ax
labels (List of str) – Optionally specify label names

Returns:	Plot of ROC AUC curve
Return type:	plt.Axes
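
The reason probabilities are required is that the curve is traced by sweeping a threshold over them. A minimal sketch, assuming binary 0/1 labels with both classes present (the helper names are hypothetical):

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int], y_proba: Sequence[float]) -> List[Tuple[float, float]]:
    """(false positive rate, true positive rate) pairs, one per distinct threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step flags more and more rows as positive.
    for threshold in sorted(set(y_proba), reverse=True):
        tp = sum(1 for label, proba in zip(y_true, y_proba) if proba >= threshold and label == 1)
        fp = sum(1 for label, proba in zip(y_true, y_proba) if proba >= threshold and label == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
```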

ml_tooling.plots.plot_pr_curve(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None) → matplotlib.axes._axes.Axes¶

Plot precision-recall curve. Works only with probabilities.

Parameters:

y_true (DataType) – True labels
y_proba (DataType) – Probability estimate from estimator
title (str) – Plot title
ax (plt.Axes) – Pass in your own ax
labels (List of str, optional) – Labels for each class

Returns:	Plot of precision-recall curve
Return type:	plt.Axes

ml_tooling.plots.plot_learning_curve(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, y: Union[pandas.core.series.Series, numpy.ndarray], cv: int = 5, scoring: str = 'default', n_jobs: int = -1, train_sizes: Sequence[T_co] = array([0.1 , 0.325, 0.55 , 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, random_state: int = None, title: str = 'Learning Curve', **kwargs) → matplotlib.axes._axes.Axes¶

Generates a learning_curve() plot, used to determine model performance as a function of number of training examples.

Illustrates whether or not number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.

Parameters:

estimator (sklearn-compatible estimator) – An instance of a sklearn estimator
x (pd.DataFrame) – DataFrame of features
y (pd.Series or np.ndarray) – Target values to predict
cv (int) – Number of CV iterations to run. Uses a `StratifiedKFold` if estimator is a classifier - otherwise a `KFold` is used.
scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
train_sizes (Sequence of floats) – Percentage intervals of data to use when training
ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
random_state (int) – Random state to use in CV splitting
title (str) – Title to be used on the plot
kwargs (dict) – Passed along to matplotlib line plots

Returns:
Return type:	plt.Axes

ml_tooling.plots.plot_validation_curve(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, y: Union[pandas.core.series.Series, numpy.ndarray], param_name: str, param_range: Sequence[T_co], cv: int = 5, scoring: str = 'default', n_jobs: int = -1, ax: matplotlib.axes._axes.Axes = None, title: str = '', **kwargs) → matplotlib.axes._axes.Axes¶

Plots a validation_curve(), graphing the impact of changing a hyperparameter on the scoring metric.

This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.

Parameters:

estimator (sklearn-compatible estimator) – An instance of a sklearn estimator
x (pd.DataFrame) – DataFrame of features
y (pd.Series or np.ndarray) – Target values to predict
param_name (str) – Name of hyperparameter to plot
param_range (Sequence) – The individual values to plot for param_name
cv (int) – Number of CV iterations to run. Uses a `StratifiedKFold` if estimator is a classifier - otherwise a `KFold` is used.
scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
title (str) – Title to be used on the plot
kwargs (dict) – Passed along to matplotlib line plots

Returns:
Return type:	plt.Axes

ml_tooling.plots.plot_missing_data(df: pandas.core.frame.DataFrame, ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, **kwargs) → matplotlib.axes._axes.Axes¶

Plot number of missing data points per column. Sorted by number of missing values.

Also allows for selecting top_n/bottom_n number or percent of columns by passing an int or float

Parameters:

df (pd.DataFrame) – Feature DataFrame to calculate missing values from
ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float in (0, 1), return that fraction of the features.
bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float in (0, 1), return that fraction of the features.

Returns:
Return type:	plt.Axes

Transformers¶

class ml_tooling.transformers.Binarize(value: Any = None)¶

Sets all instances of value to 1 and all others to 0. Returns a pandas DataFrame.

Parameters:	value (Any) – The value to be set to 1

class ml_tooling.transformers.Binner(bins: Union[int, list] = 5, labels: list = None)¶

Bins data according to passed bins and labels. Uses pandas.cut() under the hood; see its documentation for further details.

Parameters:

bins (int, list) – The criteria to bin by. An int value defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x. If a list is passed, defines the bin edges allowing for non-uniform width and no extension of the range of x is done.
labels (list) – Specifies the labels for the returned bins. Must be the same length as the resulting bins.
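
When bins is a list of edges, pandas.cut() by default assigns each value to a half-open interval that excludes the left edge and includes the right edge. The sketch below mimics that behaviour in plain Python for illustration; bin_values is a hypothetical helper, and None stands in for the NaN that pandas would produce for out-of-range values.

```python
from bisect import bisect_left
from typing import Any, List, Optional, Sequence


def bin_values(
    values: Sequence[float], edges: Sequence[float], labels: Sequence[Any]
) -> List[Optional[Any]]:
    """Label each value by the interval (edges[i], edges[i + 1]] it falls into."""
    if len(labels) != len(edges) - 1:
        raise ValueError("labels must have exactly one entry per bin")
    binned = []
    for value in values:
        # bisect_left puts a value equal to an edge into the bin ending at that edge.
        idx = bisect_left(edges, value) - 1
        binned.append(labels[idx] if 0 <= idx < len(labels) else None)
    return binned
```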

class ml_tooling.transformers.ToCategorical¶

Converts a column into a one-hot encoded column through pd.Categorical

class ml_tooling.transformers.DateEncoder(day: bool = True, month: bool = True, week: bool = True, year: bool = True)¶

Converts a date column into separate day, month, week and year columns

Parameters:

day (bool) – If True, a new day column will be added.
month (bool) – If True, a new month column will be added.
week (bool) – If True, a new week column will be added.
year (bool) – If True, a new year column will be added.
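
The effect of the four flags can be sketched for a single date using the standard library. The output key names and the use of the ISO week number are assumptions for illustration; the column names produced by DateEncoder are not specified here.

```python
from datetime import date
from typing import Dict


def encode_date(
    value: date, day: bool = True, month: bool = True, week: bool = True, year: bool = True
) -> Dict[str, int]:
    """Split one date into the requested components."""
    encoded = {}
    if day:
        encoded["day"] = value.day
    if month:
        encoded["month"] = value.month
    if week:
        # Assumption: "week" means the ISO 8601 week number.
        encoded["week"] = value.isocalendar()[1]
    if year:
        encoded["year"] = value.year
    return encoded
```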

class ml_tooling.transformers.DFFeatureUnion(transformer_list: list)¶

Merges together two pipelines based on index.

Parameters:	transformer_list (list) – A list of (name, transformer) tuples, where each transformer implements fit/transform.

class ml_tooling.transformers.FillNA(value: Union[str, int, None] = None, strategy: Optional[str] = None, indicate_nan: bool = False)¶

Fills NA values with given value or strategy. Either a value or a strategy must be passed.

Parameters:

value (str, int) – A specific value to replace NaNs with.
strategy (str) – A named strategy to replace NaNs with. One of ‘mean’, ‘median’, ‘most_freq’, ‘max’, ‘min’
indicate_nan (bool) – If True, a new column is added which indicates if a value in a column was missing.
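
The interplay of value, strategy and indicate_nan can be sketched on a single column represented as a list, with None standing in for NaN. The helper below is hypothetical; requiring exactly one of value or strategy is an assumption based on the description above.

```python
from statistics import mean, median, mode
from typing import Any, List, Optional, Sequence, Tuple

STRATEGIES = {"mean": mean, "median": median, "most_freq": mode, "max": max, "min": min}


def fill_na(
    values: Sequence[Any],
    value: Any = None,
    strategy: Optional[str] = None,
    indicate_nan: bool = False,
) -> Tuple[List[Any], Optional[List[int]]]:
    """Return the filled column and, if requested, a 0/1 missing indicator."""
    # Assumption: exactly one of value or strategy must be given.
    if (value is None) == (strategy is None):
        raise ValueError("Pass exactly one of value or strategy")
    missing = [v is None for v in values]
    if strategy is not None:
        # The fill value is computed from the observed entries only.
        observed = [v for v in values if v is not None]
        value = STRATEGIES[strategy](observed)
    filled = [value if is_missing else v for v, is_missing in zip(values, missing)]
    indicator = [int(flag) for flag in missing] if indicate_nan else None
    return filled, indicator
```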

class ml_tooling.transformers.FreqFeature¶

Converts a column into its normalized value count

class ml_tooling.transformers.FuncTransformer(func: Callable[[...], pandas.core.frame.DataFrame] = None, **kwargs)¶

Applies a given function to each column

Parameters:

func (Callable[..., pd.DataFrame]) – The function to apply to each column.
kwargs – Keyword arguments specific to the selected func.

class ml_tooling.transformers.DFRowFunc(strategy: Union[Callable[[...], pandas.core.frame.DataFrame], str] = None)¶

Row-wise operation on Pandas DataFrame.

Parameters:

strategy (Callable[..., pd.DataFrame], str) – Either one of the predefined strategies or a callable. For the built-in strategies, NaN elements in a row are ignored. Valid strategies are ‘sum’, ‘min’, ‘max’ and ‘mean’. If a callable is used, it must return a pd.Series.
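
A minimal sketch of the row-wise behaviour, using lists of rows with None standing in for NaN. apply_row_wise is a hypothetical helper; it only illustrates that built-in strategies skip missing entries while a callable receives the row unchanged, which is an assumption about the callable case.

```python
from statistics import mean
from typing import Any, Callable, List, Optional, Sequence, Union

BUILT_IN = {"sum": sum, "min": min, "max": max, "mean": mean}


def apply_row_wise(
    rows: Sequence[Sequence[Optional[float]]],
    strategy: Union[str, Callable[[Sequence[Optional[float]]], Any]],
) -> List[Any]:
    """Reduce each row to one value with a named strategy or a custom callable."""
    results = []
    for row in rows:
        if callable(strategy):
            # Assumption: a custom callable sees the raw row, missing values included.
            results.append(strategy(row))
        else:
            present = [v for v in row if v is not None]
            results.append(BUILT_IN[strategy](present))
    return results
```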

class ml_tooling.transformers.RareFeatureEncoder(threshold: Union[int, float] = 0.2, fill_rare: Any = 'Rare')¶

Replaces categories with a specified value, if they occur less often than the provided threshold.

Parameters:

threshold (int, float) – Sets the threshold for when a value is considered rare. Any value which occurs less than the threshold will be replaced with fill_rare. If threshold is a float, it will be considered a percentage and if it is an int, threshold will be considered the minimum number of observations.
fill_rare (Any) – Fill value to use when replacing rare categories.
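
The two readings of threshold can be sketched with a Counter over a single column. encode_rare is a hypothetical helper, and treating a category as rare only when its count is strictly below the threshold is an interpretation of the description above.

```python
from collections import Counter
from typing import Any, List, Sequence, Union


def encode_rare(
    values: Sequence[Any], threshold: Union[int, float] = 0.2, fill_rare: Any = "Rare"
) -> List[Any]:
    """Replace categories that occur less often than the threshold."""
    counts = Counter(values)
    # A float is a share of all rows; an int is a minimum number of observations.
    min_count = threshold * len(values) if isinstance(threshold, float) else threshold
    return [v if counts[v] >= min_count else fill_rare for v in values]
```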

class ml_tooling.transformers.Renamer(column_names: Union[list, str] = None)¶

Renames columns to passed names.

Parameters:	column_names (list, str) – The column names which should replace the original column names.

class ml_tooling.transformers.DFStandardScaler(copy: bool = True, with_mean: bool = True, with_std: bool = True)¶

Wrapping of the StandardScaler from scikit-learn for Pandas DataFrames. See: StandardScaler

Parameters:

copy (bool) – If True, a copy of the dataframe is made.
with_mean (bool) – If True, center the data before scaling.
with_std (bool) – If True, scale the data to unit standard deviation.

class ml_tooling.transformers.Select(columns: Union[List[str], str] = None)¶

Selects columns from DataFrame

Parameters:	columns (List[str], str, None) – Specify which columns are selected.

class ml_tooling.transformers.Pipeline(steps, *, memory=None, verbose=False)¶

Pipeline of transforms with a final estimator.

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using memory argument.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to ‘passthrough’ or None.
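
The step name and parameter name convention can be illustrated without scikit-learn itself. The sketch below groups flat parameter names of the form step__param by step. group_step_params and the step names are hypothetical, and this is not scikit-learn's own implementation.

```python
from typing import Any, Dict


def group_step_params(params: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Group 'step__param' keys into one parameter dict per step."""
    grouped: Dict[str, Dict[str, Any]] = {}
    for key, value in params.items():
        # Split on the first '__' only, so nested names stay intact for the step.
        step, separator, name = key.partition("__")
        if not separator or not name:
            raise ValueError(f"Expected '<step>__<param>', got {key!r}")
        grouped.setdefault(step, {})[name] = value
    return grouped
```

A key such as clf__C therefore addresses parameter C of the step named clf, which is what makes it possible to tune every step of a pipeline in a single search.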

Metric¶

class ml_tooling.metrics.Metric(name: str, score: float = None, cross_val_scores: Optional[numpy.ndarray] = None)¶

Represents a single metric, containing a metric name and its corresponding score. Can be instantiated using any sklearn-compatible scoring string.

A Metric knows how to generate its own score by calling score_metric(), passing an estimator, an X and a Y. A Metric can also get a cross-validated score by calling score_metric_cv() and passing a CV value, either a CV object or an int specifying the number of folds.

Examples

>>> from ml_tooling.metrics import Metric
>>> from sklearn.linear_model import LinearRegression
>>> import numpy as np
>>> metric = Metric('r2')
>>> x = np.array([[1],[2],[3],[4]])
>>> y = np.array([[2], [4], [6], [8]])
>>> estimator = LinearRegression().fit(x, y)
>>> metric.score_metric(estimator, x, y)
Metric(name='r2', score=1.0)
>>> metric.score
1.0
>>> metric.name
'r2'

>>> metric.score_metric_cv(estimator, x, y, cv=2)
Metric(name='r2', score=1.0)
>>> metric.score
1.0
>>> metric.name
'r2'
>>> metric.cross_val_scores
array([1., 1.])
>>> metric.std
0.0

score_metric(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: Union[pandas.core.series.Series, numpy.ndarray], y: Union[pandas.core.series.Series, numpy.ndarray]) → ml_tooling.metrics.metric.Metric¶

Calculates the score for this metric. Takes a fitted estimator, x and y values. Scores are calculated with sklearn metrics, using the string defined in self.name to look up the appropriate scoring function.

Parameters:

estimator (Pipeline or BaseEstimator) – A fitted estimator to score
x (np.ndarray, pd.DataFrame) – Features to score model with
y (np.ndarray, pd.Series) – Target to score model with

Returns:
Return type:	self

score_metric_cv(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: Union[pandas.core.series.Series, numpy.ndarray], y: Union[pandas.core.series.Series, numpy.ndarray], cv: Any, n_jobs: int = -1, verbose: int = 0) → ml_tooling.metrics.metric.Metric¶

Score metric using cross-validation. When scoring with cross-validation, self.cross_val_scores is populated with the cross-validated scores and self.score is set to the mean value of self.cross_val_scores. Cross-validation can be parallelized by passing the n_jobs parameter.

Parameters:

estimator (Pipeline or BaseEstimator) – Fitted estimator to score
x (np.ndarray or pd.DataFrame) – Features to use in scoring
y (np.ndarray or pd.Series) – Target to use in scoring
cv (int, BaseCrossValidator) – If an int is passed, cross-validate using K-Fold with cv folds. If BaseCrossValidator is passed, use that object instead
n_jobs (int) – Number of jobs to use in parallelizing. Pass None to not do CV in parallel
verbose (int) – Verbosity level of output

Returns:
Return type:	self
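
The relationship between cross_val_scores, score and std can be sketched with a small stand-in class. ToyMetric and record_cv are hypothetical names, not the ml_tooling classes, and using the population standard deviation is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Sequence


@dataclass
class ToyMetric:
    name: str
    score: Optional[float] = None
    cross_val_scores: Optional[List[float]] = None

    def record_cv(self, fold_scores: Sequence[float]) -> "ToyMetric":
        """Store the per-fold scores and set score to their mean."""
        self.cross_val_scores = list(fold_scores)
        self.score = mean(self.cross_val_scores)
        return self

    @property
    def std(self) -> Optional[float]:
        """Spread of the fold scores; None until cross-validation has run."""
        if self.cross_val_scores is None:
            return None
        # Assumption: population standard deviation of the fold scores.
        return pstdev(self.cross_val_scores)
```

Returning self mirrors the documented return type and allows the call to be chained directly after construction.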

class ml_tooling.metrics.Metrics(metrics: List[ml_tooling.metrics.metric.Metric])¶

Represents a collection of Metric. This is the default object used when scoring an estimator.

There are two alternate constructors:

- from_list() takes a list of metric names and instantiates one metric per list item
- from_dict() takes a dictionary of name -> score and instantiates one metric with the given score per dictionary item

Calling either score_metrics() or score_metrics_cv() will in turn call score_metric() or score_metric_cv() of each Metric in its collection.

Examples

To score multiple metrics, create a metrics object from a list and call score_metrics() to score all metrics in one operation

We can convert metrics to a dictionary

or a list
