API¶
Model¶
A wrapper for scikit-learn based estimators. Implements all the base functionality needed to create the wrapper.
-
class
ml_tooling.baseclass.
Model
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], feature_pipeline: sklearn.pipeline.Pipeline = None)¶ Wrapper class for Estimators
Parameters: - estimator (Estimator) – Any scikit-learn compatible estimator
- feature_pipeline (Pipeline) – Optionally pass a feature preprocessing Pipeline. Model will automatically insert the estimator into a preprocessing pipeline
-
bayesiansearch
(data: ml_tooling.data.base_data.Dataset, param_distributions: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, n_iter: int = 10, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]¶ Runs a cross-validated Bayesian Search on the estimator with a randomized sampling of the passed parameter distributions
Parameters: - data (Dataset) – An instance of a DataSet object
- param_distributions (dict) – Parameter distributions to use for randomizing search. Should be a dictionary of param_names -> one of
  - ml_tooling.search.Integer
  - ml_tooling.search.Categorical
  - ml_tooling.search.Real
- metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
- cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
- n_iter (int) – Number of parameter settings that are sampled.
- refit (bool) – Whether or not to refit the best model
Returns: - best_estimator (Model) – Best estimator as found by the Bayesian Search
- result_group (ResultGroup) – ResultGroup object containing each individual score
-
default_metric
¶ Defines default metric based on whether or not the estimator is a regressor or classifier. Then CLASSIFIER_METRIC or REGRESSION_METRIC is returned.
Returns: Name of the metric Return type: str
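The selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: `CONFIG` and `pick_default_metric` are hypothetical stand-ins, not part of ml_tooling, and the metric names are the defaults listed in the Config section.

```python
# Hypothetical stand-in for the documented config defaults; not imported from ml_tooling.
CONFIG = {"CLASSIFIER_METRIC": "accuracy", "REGRESSION_METRIC": "r2"}


def pick_default_metric(estimator_type: str, config: dict = CONFIG) -> str:
    """Return the configured metric name for a classifier or a regressor."""
    if estimator_type == "classifier":
        return config["CLASSIFIER_METRIC"]
    if estimator_type == "regressor":
        return config["REGRESSION_METRIC"]
    # Anything else has no sensible default in this sketch.
    raise ValueError(f"Unsupported estimator type: {estimator_type!r}")


classifier_metric = pick_default_metric("classifier")
regressor_metric = pick_default_metric("regressor")
```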
-
gridsearch
(data: ml_tooling.data.base_data.Dataset, param_grid: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]¶ Runs a cross-validated gridsearch on the estimator with the passed in parameter grid.
Parameters: - data (Dataset) – An instance of a DataSet object
- param_grid (dict) – Parameters to use for grid search
- metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
- cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
- refit (bool) – Whether or not to refit the best model
Returns: - best_estimator (Model) – Best estimator as found by the gridsearch
- result_group (ResultGroup) – ResultGroup object containing each individual score
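To make the size of a grid search concrete, the sketch below expands a parameter grid into the candidate settings that would each be cross-validated. It uses only the standard library; `expand_param_grid` and the parameter names are hypothetical and chosen for illustration, not taken from ml_tooling.

```python
from itertools import product


def expand_param_grid(param_grid: dict) -> list:
    """Return every combination of the listed parameter values as a dict."""
    keys = sorted(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Hypothetical grid: 3 values x 2 values gives 6 candidate settings.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5]}
candidates = expand_param_grid(param_grid)
```

Each candidate is fitted once per fold, so the total number of fits grows as the product of the grid sizes times cv.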
-
static
list_estimators
(storage: ml_tooling.storage.base.Storage) → List[pathlib.Path]¶ Gets a list of estimators from the given Storage
Parameters: storage (Storage) – Storage class to list the estimators with
Example
storage = FileStorage('path/to/estimators_dir')
estimator_list = Model.list_estimators(storage)
Returns: list of Paths Return type: List[pathlib.Path]
-
classmethod
load_estimator
(path: Union[str, pathlib.Path], storage: ml_tooling.storage.base.Storage = None) → ml_tooling.baseclass.Model¶ Instantiates the class with a joblib pickled estimator.
Parameters: - storage (Storage, optional) – Storage class to load the estimator with. If omitted, the default storage is used
- path (str, pathlib.Path) – Path to estimator pickle file
Example
We can load a trained estimator from disk:
storage = FileStorage('path/to/dir')
my_estimator = Model.load_estimator('my_model.pkl', storage=storage)
We now have a trained estimator loaded.
We can also use the default storage:
my_estimator = Model.load_estimator('my_model.pkl')
This will use the default FileStorage defined in Model.config.default_storage
Returns: Instance of Model with a saved estimator Return type: Model
-
classmethod
load_production_estimator
(module_name: str)¶ Loads a model from a python package. Given that the package is an ML-Tooling package, this will load the production model from the package and create an instance of Model with that package
Parameters: module_name (str) – The name of the package to load a model from
-
log
(run_directory: str)¶ log() is a context manager that lets you turn on logging for any scoring methods that follow. You can pass a run_directory to specify the subdirectory to save the details in. The output is a yaml file recording estimator parameters, package version numbers, metrics and other useful information.
Parameters: run_directory (str) – Name of the folder to save the details in
Example
If we want to log an estimator run in the score folder we can write:
with estimator.log('score'):
    estimator.score_estimator(data)
This will save the results of estimator.score_estimator() to runs/score/
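The context-manager behaviour can be sketched with `contextlib`. The class below is a hypothetical stand-in, not the ml_tooling Model: it only shows how logging could be switched on for the duration of a `with` block and restored afterwards, even if the block raises.

```python
from contextlib import contextmanager


class RunLogger:
    """Hypothetical stand-in that toggles a logging flag inside a with block."""

    def __init__(self):
        self.log_enabled = False
        self.run_directory = None

    @contextmanager
    def log(self, run_directory: str):
        # Remember the previous state so nested or failing blocks restore it.
        previous = (self.log_enabled, self.run_directory)
        self.log_enabled, self.run_directory = True, run_directory
        try:
            yield self
        finally:
            self.log_enabled, self.run_directory = previous


logger = RunLogger()
with logger.log("score"):
    state_inside = (logger.log_enabled, logger.run_directory)
state_after = (logger.log_enabled, logger.run_directory)
```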
-
make_prediction
(data: ml_tooling.data.base_data.Dataset, *args, proba: bool = False, threshold: float = None, use_index: bool = False, use_cache: bool = False, **kwargs) → pandas.core.frame.DataFrame¶ Makes a prediction given an input. For example a customer number. Calls load_prediction_data(*args) and passes resulting data to predict() on the estimator
Parameters: - data (Dataset) – an instantiated Dataset object
- proba (bool) – Whether prediction is returned as a probability or not. Note that the return value is an n-dimensional array where n = number of classes
- threshold (float) – Threshold to use for predicting a binary class
- use_index (bool) – Whether the index from the prediction data should be used for the result.
- use_cache (bool) – Whether or not to use the cached data in dataset to make predictions. Useful for seeing probability distributions of the model
Returns: A DataFrame with a prediction per row.
Return type: pd.DataFrame
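As an illustration of what a threshold does for a binary classifier, the sketch below turns positive-class probabilities into class labels. The function name and the choice of `>=` for the comparison are assumptions made for this example; the library may compare differently.

```python
def apply_threshold(positive_probabilities, threshold: float = 0.5) -> list:
    """Label a row 1 when its positive-class probability reaches the threshold."""
    return [int(probability >= threshold) for probability in positive_probabilities]


probabilities = [0.2, 0.5, 0.9]
default_labels = apply_threshold(probabilities)
strict_labels = apply_threshold(probabilities, threshold=0.7)
```

Raising the threshold trades recall for precision: fewer rows are labelled positive, but those that are carry a higher predicted probability.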
-
randomsearch
(data: ml_tooling.data.base_data.Dataset, param_distributions: dict, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = None, n_iter: int = 10, refit: bool = True) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]¶ Runs a cross-validated randomsearch on the estimator with a randomized sampling of the passed parameter distributions
Parameters: - data (Dataset) – An instance of a DataSet object
- param_distributions (dict) – Parameter distributions to use for randomizing search
- metrics (str, list of str) – Metrics to use for scoring. “default” sets metric equal to self.default_metric. First metric is used to sort results.
- cv (int, optional) – Cross validation to use. Defaults to value in config.CROSS_VALIDATION
- n_iter (int) – Number of parameter settings that are sampled.
- refit (bool) – Whether or not to refit the best model
Returns: - best_estimator (Model) – Best estimator as found by the randomsearch
- result_group (ResultGroup) – ResultGroup object containing each individual score
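In contrast to a grid search, a randomized search draws a fixed number of settings. The sketch below samples n_iter settings with the standard library, treating every distribution as a plain list of choices. That simplification, the function name and the parameter names are assumptions for illustration only.

```python
import random


def sample_param_settings(param_distributions: dict, n_iter: int = 10, seed: int = 42) -> list:
    """Draw n_iter parameter settings, picking one value per parameter each time."""
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return [
        {name: rng.choice(values) for name, values in param_distributions.items()}
        for _ in range(n_iter)
    ]


# Hypothetical search space with 4 x 3 = 12 possible combinations.
param_distributions = {"max_depth": [3, 5, 10, 20], "criterion": ["gini", "entropy", "log_loss"]}
settings = sample_param_settings(param_distributions, n_iter=5)
```

The cost is controlled by n_iter rather than by the size of the search space, which is why the randomized search scales better to many parameters.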
-
save_estimator
(storage: ml_tooling.storage.base.Storage = None, prod=False) → pathlib.Path¶ Saves the estimator as a binary file.
Parameters: - storage (Storage) – Storage class to save the estimator with
- prod (bool) – Whether this is a production model to be saved
Example
If we have trained an estimator and we want to save it to disk we can write:
storage = FileStorage('/path/to/save/dir')
model = Model(LinearRegression())
saved_filename = model.save_estimator(storage)
to save in the given folder.
Returns: The path to where the estimator file was saved Return type: pathlib.Path
-
score_estimator
(data: ml_tooling.data.base_data.Dataset, metrics: Union[str, List[str]] = 'default', cv: Optional[int] = False) → ml_tooling.result.result.Result¶ Scores the estimator based on training data from data and validates based on validation data from data.
Defaults to no cross-validation. If you want to cross-validate the results, pass number of folds to cv. If cross-validation is used, score_estimator only cross-validates on training data and doesn’t use the validation data.
If the dataset does not have a train set, it will create one using the default config.
Returns a Result object containing all result parameters.
Parameters: - data (Dataset) – An instantiated Dataset object with create_train_test called
- metrics (string, list of strings) – Metric or metrics to use for scoring the estimator. Any sklearn metric string
- cv (int, optional) – Whether or not to use cross validation. Number of folds if an int is passed. If False, don’t use cross validation
Returns: A Result object that contains the results of the scoring
Return type: Result
-
classmethod
test_estimators
(data: ml_tooling.data.base_data.Dataset, estimators: Sequence[Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]], feature_pipeline: sklearn.pipeline.Pipeline = None, metrics: Union[str, List[str]] = 'default', cv: Union[int, bool] = False, log_dir: str = None, refit: bool = False) → Tuple[ml_tooling.baseclass.Model, ml_tooling.result.result_group.ResultGroup]¶ Trains each estimator passed and returns a sorted list of results
Parameters: - data (Dataset) – An instantiated Dataset object with train_test data
- estimators (Sequence[Estimator]) – List of estimators to train
- feature_pipeline (Pipeline) – A pipeline for transforming features
- metrics (str, list of str) – Metric or list of metrics to use in scoring of estimators
- cv (int, bool) – Whether or not to use cross-validation. If an int is passed, use that many folds
- log_dir (str, optional) – Where to store logged estimators. If None, don’t log
- refit (bool) – Whether or not to refit the best model on all the training data
Returns: The best estimator and a ResultGroup of the sorted results Return type: Tuple[Model, ResultGroup]
-
to_dict
() → List[dict]¶ Serializes the estimator to a dictionary
Returns: Return type: List of dicts
-
train_estimator
(data: ml_tooling.data.base_data.Dataset) → ml_tooling.baseclass.Model¶ Loads all training data and trains the estimator on all data. Typically used as the last step when estimator tuning is complete.
Warning
This will set self.result attribute to None. This method trains the estimator using all the data, so there is no validation data to measure results against
Returns: Returns an estimator trained on all the data, with no train-test split Return type: Model
Datasets¶
-
class
ml_tooling.data.
FileDataset
(path: Union[str, pathlib.Path])¶ An Abstract Base Class for use in creating file-based Datasets. This class is intended to be subclassed and must provide a load_training_data() and load_prediction_data() method.
FileDataset takes a path as its initialization argument, pointing to a file which must be a filetype supported by Pandas, such as csv, parquet etc. The extension determines the pandas method used to read and write the data.
Instantiates a FileDataset pointing at a given path.
Parameters: path (Pathlike) – Path to location of file
-
load_prediction_data
(*args, **kwargs) → pandas.core.frame.DataFrame¶ Used to load prediction data for a given idx - returns features
-
load_training_data
(*args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]¶ Used to load the full training dataset - returns features and targets
-
read_file
(**kwargs)¶ Read the data from the passed file path
Parameters: kwargs (dict) – Kwargs are passed to the relevant pd.read_*() method for the given extension
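One way to picture the extension-based dispatch is a lookup from file suffix to the name of a pandas reader. The mapping and helper below are assumptions for illustration (the set of extensions ml_tooling actually supports is not stated here); only the reader names `read_csv`, `read_parquet` and `read_json` are real pandas functions.

```python
from pathlib import Path

# Assumed subset of extensions, mapped to the name of the matching pandas reader.
READERS = {".csv": "read_csv", ".parquet": "read_parquet", ".json": "read_json"}


def reader_name_for(path) -> str:
    """Return the name of the pandas reader matching the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


csv_reader = reader_name_for("data/train.csv")
parquet_reader = reader_name_for(Path("data") / "train.PARQUET")
```

With pandas installed, the actual read would then be something like `getattr(pd, reader_name_for(path))(path, **kwargs)`.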
-
-
class
ml_tooling.data.
SQLDataset
(conn: Union[str, sqlalchemy.engine.interfaces.Connectable], schema: Optional[str], **kwargs)¶ An Abstract Base Class for use in creating SQL Datasets. This class is intended to be subclassed and must provide a
load_training_data() and load_prediction_data() method.
These methods must accept a conn argument which is an instance of a SQLAlchemy connection. This connection will be passed to the method by the SQLDataset at runtime.
-
table
¶ SQLAlchemy table definition to use when loading the dataset. Is the table that will be copied when using .copy_to and should be the canonical definition of the feature set. Do not define a schema - that is set at runtime
Type: sa.Table
Instantiates a dataset with the necessary arguments to connect to the database.
Parameters: - conn (Connectable) – Either a valid DB_URL string or an engine to connect to the database
- schema (str) – A string naming the schema to use - allows for swapping schemas at runtime
- kwargs (dict) – Kwargs are passed to create_engine if conn is a string
-
copy_to
(target: ml_tooling.data.sql.SQLDataset) → ml_tooling.data.sql.SQLDataset¶ Copies data from one database table into another. This will truncate the table and load new data in.
Parameters: target (SQLDataset) – A SQLDataset object representing the table you want to copy the data into
Returns: The target dataset to copy to Return type: SQLDataset
-
create_connection
() → sqlalchemy.engine.base.Connection¶ Instantiates a connection to be used in reading and writing to the database.
Ensures that connections are closed properly and dynamically inserts the schema into the database connection
Returns: An open connection to the database, with a dynamically defined schema Return type: sa.engine.Connection
-
load_prediction_data
(idx, conn, *args, **kwargs) → pandas.core.frame.DataFrame¶ Used to load prediction data for a given idx - returns features
-
load_training_data
(conn, *args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]¶ Used to load the full training dataset - returns features and targets
-
-
class
ml_tooling.data.
Dataset
¶ Baseclass for creating Datasets. Subclass Dataset and provide a
load_training_data() and load_prediction_data() method.
-
create_train_test
(stratify: bool = False, shuffle: bool = True, test_size: float = 0.25, seed: int = 42) → ml_tooling.data.base_data.Dataset¶ Creates a training and testing dataset and stores it on the data object.
Parameters: - stratify (DataType, optional) – What to stratify the split on. Usually y if given a classification problem
- shuffle – Whether or not to shuffle the data
- test_size – What percentage of the data will be part of the test set
- seed – Random seed for train_test_split
Returns: Return type: self
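The effect of shuffle, test_size and seed can be sketched with the standard library. This is not the library's implementation (the parameter description mentions train_test_split); `train_test_indices` is a hypothetical helper, and rounding the test set size up is a choice made for this example.

```python
import math
import random


def train_test_indices(n_samples: int, test_size: float = 0.25, shuffle: bool = True, seed: int = 42):
    """Split row indices into (train, test) lists, reproducibly for a given seed."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # assumption: round the test set up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_indices(8)
```

Fixing the seed means repeated calls produce the same split, which keeps scores comparable between runs.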
-
load_prediction_data
(*args, **kwargs) → pandas.core.frame.DataFrame¶ Abstract method to be implemented by the user. Defines data to be used at prediction time, defined as a DataFrame
Returns: DataFrame of input features to get a prediction Return type: pd.DataFrame
-
load_training_data
(*args, **kwargs) → Tuple[pandas.core.frame.DataFrame, Union[pandas.core.series.Series, numpy.ndarray]]¶ Abstract method to be implemented by user. Defines data to be used at training time where X is a dataframe and y is a numpy array
Returns: x, y – Training data to be used by the models Return type: Tuple of DataTypes
-
-
ml_tooling.data.
load_demo_dataset
(dataset_name: str, **kwargs) → ml_tooling.data.base_data.Dataset¶ Create a
Dataset implementing the demo datasets from sklearn.datasets.
Parameters: - dataset_name (str) – Name of the dataset to use. If 'openml' is passed, either the name or the data_id parameter needs to be specified.
- One of:
- iris
- boston
- diabetes
- digits
- linnerud
- wine
- breast_cancer
- openml
- **kwargs – Kwargs are passed on to the scikit-learn dataset function
Returns: An instance of Dataset
Return type: Dataset
Dataset Plotting Methods¶
-
class
ml_tooling.plots.viz.data_viz.
DataVisualize
(data)¶ -
missing_data
(ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, feature_pipeline: Optional[sklearn.pipeline.Pipeline] = None) → matplotlib.axes._axes.Axes¶ Plot number of missing data points per column. Sorted by number of missing values.
Also allows for selecting top_n/bottom_n number or percent of columns by passing an int or float
Parameters: - ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
- top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
- bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
- feature_pipeline (Pipeline) – A feature transformation pipeline to be applied before graphing the final results
Returns: Return type: plt.Axes
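The int-or-float convention for top_n can be illustrated as follows. `select_top_n` is a hypothetical helper, and rounding the fractional count down is an assumption; the library may round differently.

```python
def select_top_n(sorted_features, top_n=None) -> list:
    """Keep the first top_n items: a count if int, a fraction of the list if float."""
    features = list(sorted_features)
    if top_n is None:
        return features
    if isinstance(top_n, float):
        if not 0 < top_n < 1:
            raise ValueError("A float top_n must lie strictly between 0 and 1")
        count = int(len(features) * top_n)  # assumption: round down
    else:
        count = top_n
    return features[:count]


columns = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
top_three = select_top_n(columns, 3)
top_half = select_top_n(columns, 0.5)
```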
-
target_correlation
(method: str = 'spearman', ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, feature_pipeline: Optional[sklearn.pipeline.Pipeline] = None) → matplotlib.axes._axes.Axes¶ Plot the correlation between each feature and the target variable using the given method.
Also allows selecting how many features to show by setting the top_n and/or bottom_n parameters.
Parameters: - method (str) – Which method to use when calculating correlation. Supports one of ‘pearson’, ‘spearman’, ‘kendall’.
- ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
- top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
- bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
- feature_pipeline (Pipeline) – A feature transformation pipeline to be applied before graphing the data
Returns: Return type: plt.Axes
-
Storage¶
-
class
ml_tooling.storage.
FileStorage
(dir_path: Union[str, pathlib.Path] = None)¶ File Storage class for handling storage of estimators to the file system
-
get_list
() → List[pathlib.Path]¶ Finds a list of estimator file paths in the FileStorage directory.
Example
Find and return estimator paths in a given directory:
my_estimators = FileStorage('path/to/dir').get_list()
Returns: list of paths to files sorted by filename Return type: List[Path]
-
load
(file_path: Union[str, pathlib.Path]) → Any¶ Loads a joblib pickled estimator from given filepath and returns the unpickled object
Parameters: file_path (Pathlike) – Path where to load the estimator file relative to FileStorage
Example
We can load a saved pickled estimator from disk directly from FileStorage:
storage = FileStorage('path/to/dir')
my_estimator = storage.load('mymodel.pkl')
We now have a trained estimator loaded.
Returns: The object loaded from disk Return type: Object
-
save
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], filename: str, prod: bool = False) → pathlib.Path¶ Save a joblib pickled estimator.
Parameters: - estimator (obj) – The estimator object
- filename (str) – filename of estimator pickle file
- prod (bool) – Whether or not to save in “production mode” - Production mode saves to /src/<projectname>/ regardless of what FileStorage was instantiated with
Example
To save your trained estimator, use the FileStorage save method:
storage = FileStorage('/path/to/save/dir/')
file_path = storage.save(estimator, 'filename')
We now have saved an estimator to a pickle file.
Returns: Path to the saved object Return type: Path
-
-
class
ml_tooling.storage.
Storage
¶ Base class for Storage classes
-
get_list
() → List[pathlib.Path]¶ Abstract method to be implemented by the user. Defines method used to show which objects have been saved
Returns: Paths to each of the estimators sorted lexically Return type: List[Path]
-
load
(file_path: Union[str, pathlib.Path]) → Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]¶ Abstract method to be implemented by the user. Defines method used to load data from the storage type
Returns: Returns the unpickled object Return type: Estimator
-
save
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], file_path: Union[str, pathlib.Path], prod: bool = False) → Union[str, pathlib.Path]¶ Abstract method to be implemented by the user. Defines method used to save data from the storage type
Returns: Path to where the pickled object is saved Return type: Pathlike
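A custom backend implements the three abstract methods above. The sketch below mirrors that interface with a local abstract class so it runs without ml_tooling installed; `StorageSketch` and `InMemoryStorage` are hypothetical names. In real use you would subclass ml_tooling.storage.Storage instead.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class StorageSketch(ABC):
    """Local stand-in mirroring the documented get_list/load/save interface."""

    @abstractmethod
    def get_list(self): ...

    @abstractmethod
    def load(self, file_path): ...

    @abstractmethod
    def save(self, estimator, file_path, prod=False): ...


class InMemoryStorage(StorageSketch):
    """Keeps saved objects in a dict keyed by path; handy for unit tests."""

    def __init__(self):
        self._objects = {}

    def get_list(self):
        # Sorted, matching the documented "sorted lexically" contract.
        return sorted(self._objects)

    def load(self, file_path):
        return self._objects[Path(file_path)]

    def save(self, estimator, file_path, prod=False):
        path = Path(file_path)
        self._objects[path] = estimator
        return path


storage = InMemoryStorage()
storage.save("second estimator", "b.pkl")
saved_path = storage.save("first estimator", "a.pkl")
```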
-
-
class
ml_tooling.storage.
ArtifactoryStorage
(artifactory_url: str, repo: str, apikey: Optional[str] = None, auth: Optional[Tuple[str, str]] = None)¶ Artifactory Storage class for handling storage of estimators to JFrog artifactory
Example
Instantiate this class with a url and path to the repo like so:
storage = ArtifactoryStorage('http://artifactory.com', 'path/to/artifact')
-
get_list
() → List[ArtifactoryPath]¶ Finds a list of estimator artifact paths in the ArtifactoryStorage repo.
Example
Find and return estimator paths in a given directory:
my_estimators = ArtifactoryStorage('http://artifactory.com', 'path/to/repo').get_list()
Returns: list of paths to files sorted by filename Return type: List[ArtifactoryPath]
-
load
(file_path: Union[str, pathlib.Path]) → Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline]¶ Loads a pickled estimator from given filepath and returns the estimator
Parameters: file_path (Pathlike) – Path to load the estimator relative to ArtifactoryStorage
Example
We can load a saved pickled estimator directly from Artifactory:
storage = ArtifactoryStorage('http://artifactory.com', 'path/to/repo')
my_estimator = storage.load('estimatorfile')
We now have a trained estimator loaded.
Returns: estimator unpickled object Return type: Object
-
save
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], filename: str, prod: bool = False) → ArtifactoryPath¶ Save a pickled estimator to artifactory.
Parameters: - estimator (Estimator) – The estimator object
- filename (str) – filename of estimator pickle file
- prod (bool) – Production variable, set to True if saving a production-ready estimator
Example
To save your trained estimator:
storage = ArtifactoryStorage('http://artifactory.com', 'path/to/repo')
artifactory_path = storage.save(estimator, 'estimator.pkl')
We now have saved an estimator to a pickle file.
Returns: File path to stored estimator Return type: ArtifactoryPath
-
Config¶
All configuration options available
-
class
ml_tooling.config.
DefaultConfig
¶ Configuration for Models
- VERBOSITY = 0 – The level of verbosity from output
- CLASSIFIER_METRIC = 'accuracy' – Default metric for classifiers
- REGRESSION_METRIC = 'r2' – Default metric for regressions
- CROSS_VALIDATION = 10 – Default number of cross validation folds to use
- N_JOBS = -1 – Default number of cores to use when doing multiprocessing. -1 means use all available
- RANDOM_STATE = 42 – Default random state seed for all functions involving randomness
- RUN_DIR = './runs' – Default folder to store run logging files
- ESTIMATOR_DIR = './models' – Default folder to store pickled models in
- LOG = False – Toggles whether or not to log runs to a file. Set to True if you want every run to be logged, else use the log() context manager
- TRAIN_TEST_SHUFFLE = True – Default whether or not to shuffle data for test set
- TEST_SIZE = 0.25 – Default percentage of data that will be part of the test set
Result¶
Result class to work with results from scoring a model
-
class
ml_tooling.result.
Result
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], metrics: ml_tooling.metrics.metric.Metrics, data: ml_tooling.data.base_data.Dataset)¶ Contains the result of a given training run. Contains plotting methods, as well as being comparable with other results
Method generated by attrs for class Result.
ResultGroup¶
A container of Results - some methods in ML Tooling return multiple results, which will be grouped into a ResultGroup. A ResultGroup is sorted by the Result metric and proxies attributes to the best result
-
class
ml_tooling.result.
ResultGroup
(results: List[ml_tooling.result.result.Result])¶ A container for results. Proxies attributes to the best result. Supports indexing like a list.
Method generated by attrs for class ResultGroup.
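The "sorted container that proxies to the best result" idea can be sketched like this. Both classes are hypothetical stand-ins, not the ml_tooling implementations, and the sketch assumes that a higher score is better.

```python
from dataclasses import dataclass


@dataclass
class ScoreSketch:
    """Hypothetical stand-in for a single result."""
    name: str
    score: float


class ResultGroupSketch:
    """Holds results sorted best-first and forwards unknown attributes to the best one."""

    def __init__(self, results):
        # Assumption: higher score is better.
        self.results = sorted(results, key=lambda result: result.score, reverse=True)

    def __getitem__(self, index):
        return self.results[index]

    def __len__(self):
        return len(self.results)

    def __getattr__(self, name):
        # Called only when normal lookup fails; guard against recursion on 'results'.
        if name == "results":
            raise AttributeError(name)
        return getattr(self.results[0], name)


group = ResultGroupSketch([ScoreSketch("linear", 0.71), ScoreSketch("forest", 0.84)])
best_name = group.name          # proxied to the best result
runner_up = group[1].name       # list-style indexing
```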
Classification Result Visualizations¶
-
class
ml_tooling.plots.viz.
ClassificationVisualize
(estimator, data)¶ Visualization class for Classification models
-
confusion_matrix
(normalized: bool = True, threshold: Optional[float] = None, **kwargs) → matplotlib.axes._axes.Axes¶ Visualize a confusion matrix for a classification estimator. Any kwargs are passed onto matplotlib.
Parameters: - normalized (bool) – Whether or not to normalize annotated class counts
- threshold (float) – Threshold to use for classification - defaults to 0.5
Returns: Returns a Confusion Matrix plot
Return type: plt.Axes
-
default_metric
¶ Finds estimator_type for estimator in a BaseVisualize and returns default metric for this class stated in .config. If passed estimator is a Pipeline, assume last step is the estimator.
Returns: Name of the metric Return type: str
-
feature_importance
(top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, class_index: int = None, add_label: bool = True, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶ Visualizes feature importance of the estimator through permutation.
Parameters: - top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
- bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
- class_index (int, optional) – In a multi-class setting, plot the feature importances for the given label. If None, assume a binary classification
- add_label (bool) – Toggles value labels on end of each bar
- ax (Axes) – Draws graph on passed ax - otherwise creates new ax
- kwargs (dict) – Passed to plt.barh
Returns: Return type: matplotlib.Axes
-
learning_curve
(cv: int = None, scoring: str = 'default', n_jobs: int = None, train_sizes: Sequence[float] = array([0.1, 0.325, 0.55, 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶ Generates a
learning_curve() plot, used to determine model performance as a function of number of training examples.
Illustrates whether or not number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.
Parameters: - cv (int) – Number of CV iterations to run
- scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
- n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
- train_sizes (Sequence of floats) – Percentage intervals of data to use when training
- ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
- kwargs (dict) – Passed along to matplotlib line plots
Returns: Return type: plt.Axes
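The default train_sizes in the signature are five evenly spaced fractions from 0.1 to 1.0. The sketch below reproduces those fractions with the standard library and converts them to sample counts. `train_size_counts` is a hypothetical helper and the rounding is a choice made for this example, not necessarily what scikit-learn does internally.

```python
def train_size_counts(n_training_samples: int, n_steps: int = 5, start: float = 0.1, stop: float = 1.0) -> list:
    """Convert evenly spaced training fractions into numbers of samples."""
    step = (stop - start) / (n_steps - 1)
    fractions = [start + i * step for i in range(n_steps)]
    # Rounding (rather than truncating) avoids floating point off-by-one results.
    return [round(n_training_samples * fraction) for fraction in fractions]


counts = train_size_counts(1000)
```

Each count is the size of the training subset used for one point on the learning curve, so the curve shows how the score changes as more data becomes available.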
-
lift_curve
(**kwargs) → matplotlib.axes._axes.Axes¶ Visualize a Lift Curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.
Parameters: kwargs (optional) – Keyword arguments to pass on to matplotlib Returns: Return type: plt.Axes
-
permutation_importance
(n_repeats: int = 5, scoring: str = 'default', top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, add_label: bool = True, n_jobs: int = None, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶ Visualizes feature importance of the estimator through permutation.
Parameters: - n_repeats (int) – Number of times to permute a feature
- scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
- top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
- bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
- add_label (bool) – Toggles value labels on end of each bar
- ax (Axes) – Draws graph on passed ax - otherwise creates new ax
- n_jobs (int, optional) – Number of parallel jobs to run. Defaults to N_JOBS setting in config.
- kwargs (dict) – Passed to plt.barh
Returns: Return type: matplotlib.Axes
-
precision_recall_curve
(labels: List[str] = None, **kwargs) → matplotlib.axes._axes.Axes¶ Visualize a Precision-Recall curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.
Parameters: - labels (List of str) – Labels to use for the class names if multi-class
- kwargs (optional) – Keyword arguments to pass on to matplotlib
Returns: Plot of precision-recall curve
Return type: plt.Axes
-
roc_curve
(labels: List[str] = None, **kwargs) → matplotlib.axes._axes.Axes¶ Visualize a ROC curve for a classification estimator. Estimator must implement a predict_proba method. Any kwargs are passed onto matplotlib.
Parameters: - labels (List of str) – Labels to use for the class names if multi-class
- kwargs (optional) – Keyword arguments to pass on to matplotlib
Returns: Returns a ROC AUC plot
Return type: plt.Axes
-
validation_curve
(param_name: str, param_range: Sequence[T_co], n_jobs: int = None, cv: int = None, scoring: str = 'default', ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶ Generates a
validation_curve() plot, graphing the impact of changing a hyperparameter on the scoring metric.
This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.
Parameters: - param_name (str) – Name of hyperparameter to plot
- param_range (Sequence) – The individual values to plot for param_name
- n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
- cv (int) – Number of CV iterations to run. Defaults to value in Model.config. Uses a StratifiedKFold if estimator is a classifier - otherwise a KFold is used.
- scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
- ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
- kwargs (dict) – Passed along to matplotlib line plots
Returns: Return type: plt.Axes
-
Regression Result Visualizations¶
-
class
ml_tooling.plots.viz.
RegressionVisualize
(estimator, data)¶ Visualization class for Regression models
-
default_metric
¶ Finds estimator_type for estimator in a BaseVisualize and returns default metric for this class stated in .config. If passed estimator is a Pipeline, assume last step is the estimator.
Returns: Name of the metric Return type: str
-
feature_importance
(top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, class_index: int = None, add_label: bool = True, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶ Visualizes feature importance of the estimator through permutation.
Parameters: - top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
- bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
- class_index (int, optional) – In a multi-class setting, plot the feature importances for the given label. If None, assume a binary classification
- add_label (bool) – Toggles value labels on end of each bar
- ax (Axes) – Draws graph on passed ax - otherwise creates new ax
- kwargs (dict) – Passed to plt.barh
Returns: Return type: matplotlib.Axes
-
learning_curve
(cv: int = None, scoring: str = 'default', n_jobs: int = None, train_sizes: Sequence[float] = array([0.1, 0.325, 0.55, 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶ Generates a
learning_curve() plot, used to determine model performance as a function of number of training examples.
Illustrates whether or not number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.
Parameters: - cv (int) – Number of CV iterations to run
- scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
- n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
- train_sizes (Sequence of floats) – Percentage intervals of data to use when training
- ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
- kwargs (dict) – Passed along to matplotlib line plots
Returns: Return type: plt.Axes
-
permutation_importance
(n_repeats: int = 5, scoring: str = 'default', top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, add_label: bool = True, n_jobs: int = None, ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶ Visualizes feature importance of the estimator through permutation.
Parameters: - n_repeats (int) – Number of times to permute a feature
- scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
- top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
- bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
- add_label (bool) – Toggles value labels on end of each bar
- ax (Axes) – Draws graph on passed ax - otherwise creates new ax
- n_jobs (int, optional) – Number of parallel jobs to run. Defaults to N_JOBS setting in config.
- kwargs (dict) – Passed to plt.barh
Returns: Return type: matplotlib.Axes
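The idea behind permutation importance can be sketched with scikit-learn's own `sklearn.inspection.permutation_importance`. This is a minimal illustration under assumptions, not necessarily how `permutation_importance` is implemented here; the synthetic dataset and the variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: 4 features, 2 of them informative (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=0)
estimator = LinearRegression().fit(X, y)

# Shuffle each feature n_repeats times and record the drop in the score
result = permutation_importance(
    estimator, X, y, n_repeats=5, scoring="r2", random_state=0
)

# Rank features from most to least important and keep the top 2,
# which is roughly what an integer top_n=2 would select
order = np.argsort(result.importances_mean)[::-1]
top_2 = order[:2]
```

`result.importances` holds one row per feature and one column per repeat, so the mean over columns is what would typically be drawn as a bar.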
-
prediction_error
(**kwargs) → matplotlib.axes._axes.Axes¶ Visualizes prediction error of a regression estimator Any kwargs are passed onto matplotlib
Returns: Plot of the estimator’s prediction error Return type: matplotlib.Axes
-
residuals
(**kwargs) → matplotlib.axes._axes.Axes¶ Visualizes residuals of a regression estimator. Any kwargs are passed onto matplotlib
Returns: Plot of the estimator’s residuals Return type: matplotlib.Axes
-
validation_curve
(param_name: str, param_range: Sequence[T_co], n_jobs: int = None, cv: int = None, scoring: str = 'default', ax: matplotlib.axes._axes.Axes = None, **kwargs) → matplotlib.axes._axes.Axes¶ Generates a
validation_curve()
plot, graphing the impact of changing a hyperparameter on the scoring metric. This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.
Parameters: - param_name (str) – Name of hyperparameter to plot
- param_range (Sequence) – The individual values to plot for param_name
- n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
- cv (int) – Number of CV iterations to run. Defaults to value in Model.config. Uses a StratifiedKFold if estimator is a classifier - otherwise a KFold is used.
- scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
- ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
- kwargs (dict) – Passed along to matplotlib line plots
Returns: Return type: plt.Axes
-
Plots¶
-
ml_tooling.plots.
plot_confusion_matrix
(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], normalized: bool = True, title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: Sequence[str] = None) → matplotlib.axes._axes.Axes¶ Plots a confusion matrix of predicted labels vs actual labels
Parameters: - y_true – True labels
- y_pred – Predicted labels from estimator
- normalized – Whether to normalize counts in matrix
- title – Title for plot
- ax – Pass your own ax
- labels – Pass custom list of labels
Returns: matplotlib.Axes
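As a rough sketch of what a normalized confusion matrix contains, the counts can be computed with scikit-learn and divided by the row totals. Normalizing per true class (row-wise) is an assumption made for this example; the labels are made up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

# Rows are true labels, columns are predicted labels
counts = confusion_matrix(y_true, y_pred)

# Row-normalize so that each true class sums to 1 (assumed normalization)
normalized = counts / counts.sum(axis=1, keepdims=True)
```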
-
ml_tooling.plots.
plot_target_correlation
(features: pandas.core.frame.DataFrame, target: Union[pandas.core.series.Series, numpy.ndarray], method: str = 'spearman', ax: matplotlib.axes._axes.Axes = None, top_n: Union[int, float] = None, bottom_n: Union[int, float] = None, title: str = 'Feature-Target Correlation') → matplotlib.axes._axes.Axes¶ Plot the correlation between each feature and the target variable using the given method.
Also allows selecting how many features to show by setting the top_n and/or bottom_n parameters.
Parameters: - features (pd.DataFrame) – Features to plot
- target (np.Array or pd.Series) – Target to calculate correlation with
- method (str) – Which method to use when calculating correlation. Supports one of ‘pearson’, ‘spearman’, ‘kendall’.
- ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
- top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
- bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
- title (str) – Title of graph
Returns: Return type: plt.Axes
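The quantity being plotted can be sketched in plain pandas. The example below is an illustration under assumptions, not the library's implementation: it computes a Spearman correlation as the Pearson correlation of ranks, and it sorts by the raw correlation value, which may differ from how the plot orders its bars.

```python
import pandas as pd

features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [5, 3, 4, 1, 2],
    "c": [2, 1, 4, 3, 5],
})
target = pd.Series([10, 20, 30, 40, 50])

# Spearman correlation is the Pearson correlation of the ranks
correlations = features.rank().corrwith(target.rank())

# Keep the two most correlated features, similar in spirit to top_n=2
top_2 = correlations.sort_values(ascending=False).head(2)
```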
-
ml_tooling.plots.
plot_feature_importance
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, ax: matplotlib.axes._axes.Axes = None, class_index: int = None, bottom_n: Union[int, float] = None, top_n: Union[int, float] = None, add_label: bool = True, title: str = '', **kwargs) → matplotlib.axes._axes.Axes¶ Plot either the estimator coefficients or the estimator feature importances depending on what is provided by the estimator.
See also ml_tooling.plots.plot_permutation_importance for an unbiased version of feature importance using permutation importance.
Parameters: - estimator (Estimator) – Estimator to use to calculate permuted feature importance
- x (DataType) – Features to calculate permuted feature importance for
- ax (Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
- class_index (int, optional) – In a multi-class setting, choose which class to get feature importances for. If None, will assume a binary classifier
- bottom_n (int) – Plot only bottom n features
- top_n (int) – Plot only top n features
- add_label (bool) – Whether or not to plot text labels for the bars
- title (str) – Title to add to the plot
- kwargs (dict) – Any kwargs are passed to matplotlib
Returns: Return type: plt.Axes
-
ml_tooling.plots.
plot_lift_curve
(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None, threshold: float = 0.5) → matplotlib.axes._axes.Axes¶ Plot a lift chart from results. Also calculates lift score based on a .5 threshold
Parameters: - y_true (DataType) – True labels
- y_proba (DataType) – Model’s predicted probability
- title (str) – Plot title
- ax (Axes) – Pass your own ax
- labels (List of str) – Labels to use per class
- threshold (float) – Threshold to use when determining lift score
Returns: Return type: matplotlib.Axes
-
ml_tooling.plots.
plot_prediction_error
(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None) → matplotlib.axes._axes.Axes¶ Plots prediction error of regression estimator
Parameters: - y_true – True values
- y_pred – Model’s predicted values
- title – Plot title
- ax – Pass your own ax
Returns: matplotlib.Axes
-
ml_tooling.plots.
plot_residuals
(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_pred: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None) → matplotlib.axes._axes.Axes¶ Plots residuals from a regression.
Parameters: - y_true – True values
- y_pred – Model’s predicted values
- title – Plot title
- ax – Pass your own ax
Returns: matplotlib.Axes
-
ml_tooling.plots.
plot_roc_auc
(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None) → matplotlib.axes._axes.Axes¶ Plot ROC AUC curve. Works only with probabilities
Parameters: - y_true (DataType) – True labels
- y_proba (DataType) – Probability estimate from estimator
- title (str) – Plot title
- ax (Axes) – Pass in your own ax
- labels (List of str) – Optionally specify label names
Returns: Plot of ROC AUC curve
Return type: plt.Axes
-
ml_tooling.plots.
plot_pr_curve
(y_true: Union[pandas.core.series.Series, numpy.ndarray], y_proba: Union[pandas.core.series.Series, numpy.ndarray], title: str = None, ax: matplotlib.axes._axes.Axes = None, labels: List[str] = None) → matplotlib.axes._axes.Axes¶ Plot precision-recall curve. Works only with probabilities.
Parameters: - y_true (DataType) – True labels
- y_proba (DataType) – Probability estimate from estimator
- title (str) – Plot title
- ax (plt.Axes) – Pass in your own ax
- labels (List of str, optional) – Labels for each class
Returns: Plot of precision-recall curve
Return type: plt.Axes
-
ml_tooling.plots.
plot_learning_curve
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, y: Union[pandas.core.series.Series, numpy.ndarray], cv: int = 5, scoring: str = 'default', n_jobs: int = -1, train_sizes: Sequence[T_co] = array([0.1 , 0.325, 0.55 , 0.775, 1. ]), ax: matplotlib.axes._axes.Axes = None, random_state: int = None, title: str = 'Learning Curve', **kwargs) → matplotlib.axes._axes.Axes¶ Generates a
learning_curve()
plot, used to determine model performance as a function of the number of training examples. Illustrates whether or not the number of training examples is the performance bottleneck. Also used to diagnose underfitting or overfitting, by seeing how the training set and validation set performance differ.
Parameters: - estimator (sklearn-compatible estimator) – An instance of a sklearn estimator
- x (pd.DataFrame) – DataFrame of features
- y (pd.Series or np.Array) – Target values to predict
- cv (int) – Number of CV iterations to run. Uses a StratifiedKFold if estimator is a classifier - otherwise a KFold is used.
- scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
- n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
- train_sizes (Sequence of floats) – Percentage intervals of data to use when training
- ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
- random_state (int) – Random state to use in CV splitting
- title (str) – Title to be used on the plot
- kwargs (dict) – Passed along to matplotlib line plots
Returns: Return type: plt.Axes
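The numbers behind a learning curve can be obtained directly from scikit-learn's `learning_curve`. This is a hedged sketch of the underlying computation only; the dataset, the estimator and the variable names are assumptions for illustration, and no plotting is done.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=200, random_state=0)

# Five evenly spaced fractions of the training data, from 10% to 100%
train_sizes = np.linspace(0.1, 1.0, 5)

sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    train_sizes=train_sizes,
    scoring="accuracy",
)

# One row per training size, one column per CV fold; average over folds
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
```

A large gap between `mean_train` and `mean_test` that does not close as the training size grows usually points to overfitting, while two low, converged curves suggest underfitting.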
-
ml_tooling.plots.
plot_validation_curve
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: pandas.core.frame.DataFrame, y: Union[pandas.core.series.Series, numpy.ndarray], param_name: str, param_range: Sequence[T_co], cv: int = 5, scoring: str = 'default', n_jobs: int = -1, ax: matplotlib.axes._axes.Axes = None, title: str = '', **kwargs) → matplotlib.axes._axes.Axes¶ Plots a
validation_curve()
, graphing the impact of changing a hyperparameter on the scoring metric. This lets us examine how a hyperparameter affects over/underfitting by examining train/test performance with different values of the hyperparameter.
Parameters: - estimator (sklearn-compatible estimator) – An instance of a sklearn estimator
- x (pd.DataFrame) – DataFrame of features
- y (pd.Series or np.Array) – Target values to predict
- param_name (str) – Name of hyperparameter to plot
- param_range (Sequence) – The individual values to plot for param_name
- cv (int) – Number of CV iterations to run. Uses a StratifiedKFold if estimator is a classifier - otherwise a KFold is used.
- scoring (str) – Metric to use in scoring - must be a scikit-learn compatible scoring method
- n_jobs (int) – Number of jobs to use in parallelizing the estimator fitting and scoring
- ax (plt.Axes) – The plot will be drawn on the passed ax - otherwise a new figure and ax will be created.
- title (str) – Title to be used on the plot
- kwargs (dict) – Passed along to matplotlib line plots
Returns: Return type: plt.Axes
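The scores that a validation curve displays can be sketched with scikit-learn's `validation_curve`. The estimator, the hyperparameter (`max_depth`) and the data below are assumptions chosen for illustration; the sketch stops before plotting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
param_range = [1, 2, 4, 8]

train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)

# One row per value in param_range, one column per CV fold
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
```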
-
ml_tooling.plots.
plot_missing_data
(df: pandas.core.frame.DataFrame, ax: Optional[matplotlib.axes._axes.Axes] = None, top_n: Union[int, float, None] = None, bottom_n: Union[int, float, None] = None, **kwargs) → matplotlib.axes._axes.Axes¶ Plot number of missing data points per column. Sorted by number of missing values.
Also allows for selecting top_n/bottom_n number or percent of columns by passing an int or float
Parameters: - df (pd.DataFrame) – Feature DataFrame to calculate missing values from
- ax (plt.Axes) – Matplotlib axes to draw the graph on. Creates a new one by default
- top_n (int, float) – If top_n is an integer, return top_n features. If top_n is a float between (0, 1), return top_n percent features
- bottom_n (int, float) – If bottom_n is an integer, return bottom_n features. If bottom_n is a float between (0, 1), return bottom_n percent features
Returns: Return type: plt.Axes
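The values behind such a plot can be approximated in plain pandas. This is a small sketch under assumptions (made-up data, integer top_n only), not the library's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Count missing values per column and sort from most to least missing
missing = df.isna().sum().sort_values(ascending=False)

# Keep only the column with the most missing values, similar to top_n=1
top_1 = missing.head(1)
```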
Transformers¶
-
class
ml_tooling.transformers.
Binarize
(value: Any = None)¶ Sets all instances of value to 1 and all others to 0 Returns a pandas DataFrame
Parameters: value (Any) – The value to be set to 1
-
class
ml_tooling.transformers.
Binner
(bins: Union[int, list] = 5, labels: list = None)¶ Bins data according to passed bins and labels. Uses pandas.cut() under the hood; see its documentation for further details.
Parameters: - bins (int, list) – The criteria to bin by. An int value defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x. If a list is passed, defines the bin edges allowing for non-uniform width and no extension of the range of x is done.
- labels (list) – Specifies the labels for the returned bins. Must be the same length as the resulting bins.
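Since the binning is delegated to `pandas.cut()`, its behaviour with explicit bin edges can be illustrated directly. The data and labels below are made up for the example.

```python
import pandas as pd

ages = pd.Series([1, 5, 10])

# Explicit bin edges give the intervals (0, 3], (3, 7], (7, 10];
# one label is required per resulting bin
binned = pd.cut(ages, bins=[0, 3, 7, 10], labels=["low", "mid", "high"])
```

Passing an integer instead of a list of edges would split the observed range into that many equal-width bins.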
-
class
ml_tooling.transformers.
ToCategorical
¶ Converts a column into a one-hot encoded column through pd.Categorical
-
class
ml_tooling.transformers.
DateEncoder
(day: bool = True, month: bool = True, week: bool = True, year: bool = True)¶ Converts a date column into multiple day-month-year columns
Parameters: - day (bool) – If True, a new day column will be added.
- month (bool) – If True, a new month column will be added.
- week (bool) – If True, a new week column will be added.
- year (bool) – If True, a new year column will be added.
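The kind of output this produces can be sketched with the pandas `.dt` accessor. The column names used here (`date_day` and so on) are hypothetical, and using the ISO week number is an assumption of this example.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-01-15", "2021-12-31"]), name="date")

# One new column per date part; names are illustrative only
encoded = pd.DataFrame({
    "date_day": dates.dt.day,
    "date_month": dates.dt.month,
    "date_week": dates.dt.isocalendar().week.astype(int),  # ISO week number
    "date_year": dates.dt.year,
})
```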
-
class
ml_tooling.transformers.
DFFeatureUnion
(transformer_list: list)¶ Merges together two pipelines based on index.
Parameters: transformer_list (list) – transformer_list is a list of (name, transformer) tuples, where transformer implements fit/transform.
-
class
ml_tooling.transformers.
FillNA
(value: Union[str, int, None] = None, strategy: Optional[str] = None, indicate_nan: bool = False)¶ Fills NA values with given value or strategy. Either a value or a strategy must be passed.
Parameters: - value (str, int) – A specific value to replace NaNs with.
- strategy (str) – A named strategy to replace NaNs with. One of ‘mean’, ‘median’, ‘most_freq’, ‘max’, ‘min’
- indicate_nan (bool) – If True, a new column is added which indicates if a value in a column was missing.
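A plain-pandas sketch of the 'mean' strategy combined with a missing-value indicator is shown below. The indicator column name `x_is_nan` is hypothetical and the code is an illustration, not the transformer's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# Record which values were missing before filling (hypothetical column name)
df["x_is_nan"] = df["x"].isna().astype(int)

# 'mean' strategy: replace NaN with the mean of the observed values
df["x"] = df["x"].fillna(df["x"].mean())
```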
-
class
ml_tooling.transformers.
FreqFeature
¶ Converts a column into its normalized value count
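A normalized value count can be sketched in pandas by mapping each value to its relative frequency. This is an illustrative equivalent under assumptions, not the transformer's code.

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "a"])

# Relative frequency of each category, mapped back onto every row
freq = s.map(s.value_counts(normalize=True))
```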
-
class
ml_tooling.transformers.
FuncTransformer
(func: Callable[[...], pandas.core.frame.DataFrame] = None, **kwargs)¶ Applies a given function to each column
Parameters: - func (Callable[.., pd.DataFrame]) – Define the function which should be applied on each column.
- kwargs – Specific for the selected func.
-
class
ml_tooling.transformers.
DFRowFunc
(strategy: Union[Callable[[...], pandas.core.frame.DataFrame], str] = None)¶ Row-wise operation on Pandas DataFrame.
Parameters: strategy (Callable[.., pd.DataFrame], str) – Strategy can either be one of the predefined or a callable. If some elements in the row are NaN these elements are ignored for the built-in strategies. Valid strategies are:
- sum
- min
- max
- mean
If a callable is used, it must return a pd.Series
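The NaN-ignoring behaviour described for the built-in strategies matches the pandas default for row-wise reductions. The following sketch uses plain pandas with made-up data and is not the transformer's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, 4.0]})

# axis=1 reduces across columns, row by row; NaN is skipped by default
row_sum = df.sum(axis=1)
row_mean = df.mean(axis=1)
```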
-
class
ml_tooling.transformers.
RareFeatureEncoder
(threshold: Union[int, float] = 0.2, fill_rare: Any = 'Rare')¶ Replaces categories with a specified value, if they occur less often than the provided threshold.
Parameters: - threshold (int, float) – Sets the threshold for when a value is considered rare. Any value which occurs less than the threshold will be replaced with fill_rare. If threshold is a float, it will be considered a percentage and if it is an int, threshold will be considered the minimum number of observations.
- fill_rare (Any) – Fill value to use when replacing rare categories.
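The float-threshold case can be sketched in pandas as follows. The data is made up and the code only illustrates the rule described above (categories occurring less often than the threshold are replaced); it is not the encoder's implementation.

```python
import pandas as pd

s = pd.Series(["a", "a", "a", "a", "b", "b", "b", "b", "b", "c"])
threshold = 0.2  # float: interpreted as a share of observations

# Relative frequency per category: a=0.4, b=0.5, c=0.1
frequencies = s.value_counts(normalize=True)
rare = frequencies[frequencies < threshold].index

# Keep common categories, replace rare ones with the fill value
encoded = s.where(~s.isin(rare), "Rare")
```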
-
class
ml_tooling.transformers.
Renamer
(column_names: Union[list, str] = None)¶ Renames columns to passed names.
Parameters: column_names (list, str) – The column names which should replace the original column names.
-
class
ml_tooling.transformers.
DFStandardScaler
(copy: bool = True, with_mean: bool = True, with_std: bool = True)¶ Wrapping of the StandardScaler from scikit-learn for Pandas DataFrames. See:
StandardScaler
Parameters: - copy (bool) – If True, a copy of the dataframe is made.
- with_mean (bool) – If True, center the data before scaling.
- with_std (bool) – If True, scale the data to unit standard deviation.
-
class
ml_tooling.transformers.
Select
(columns: Union[List[str], str] = None)¶ Selects columns from DataFrame
Parameters: columns (List[str], str, None) – Specify which columns are selected.
-
class
ml_tooling.transformers.
Pipeline
(steps, *, memory=None, verbose=False)¶ Pipeline of transforms with a final estimator.
Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to ‘passthrough’ or None.
Read more in the User Guide.
New in version 0.5.
Parameters: - steps (list) – List of (name, transform) tuples (implementing fit/transform) that are chained, in the order in which they are chained, with the last object an estimator.
- memory (str or object with the joblib.Memory interface, default=None) – Used to cache the fitted transformers of the pipeline. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming.
- verbose (bool, default=False) – If True, the time elapsed while fitting each step will be printed as it is completed.
-
named_steps
¶ Dictionary-like object, with the following attributes. Read-only attribute to access any step parameter by user given name. Keys are step names and values are steps parameters.
Type: Bunch
See also
sklearn.pipeline.make_pipeline
- Convenience function for simplified pipeline construction.
Examples
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=0)
>>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
>>> # The pipeline can be used as any other estimator
>>> # and avoids leaking the test set into the train set
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
>>> pipe.score(X_test, y_test)
0.88
-
decision_function
(X)¶ Apply transforms, and decision_function of the final estimator
Parameters: X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline. Returns: y_score Return type: array-like of shape (n_samples, n_classes)
-
fit
(X, y=None, **fit_params)¶ Fit the model
Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
Parameters: - X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
- y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
- **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Returns: self – This estimator
Return type:
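The `s__p` prefixing rule for fit parameters can be illustrated with a scikit-learn pipeline. In this sketch the step names, the uniform weights and the dataset are assumptions; `svc__sample_weight` routes `sample_weight` to the `fit` method of the step named `svc`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
weights = np.ones(len(y))  # uniform weights, for illustration only

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameter 'sample_weight' for step 'svc' is passed as 'svc__sample_weight'
pipe.fit(X, y, svc__sample_weight=weights)
predictions = pipe.predict(X)
```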
-
fit_predict
(X, y=None, **fit_params)¶ Applies fit_predict of last step in pipeline after transforms.
Applies fit_transforms of a pipeline to the data, followed by the fit_predict method of the final estimator in the pipeline. Valid only if the final estimator implements fit_predict.
Parameters: - X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
- y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
- **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Returns: y_pred
Return type: array-like
-
fit_transform
(X, y=None, **fit_params)¶ Fit the model and transform with the final estimator
Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.
Parameters: - X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
- y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
- **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Returns: Xt – Transformed samples
Return type: array-like of shape (n_samples, n_transformed_features)
-
get_params
(deep=True)¶ Get parameters for this estimator.
Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators. Returns: params – Parameter names mapped to their values. Return type: mapping of string to any
-
inverse_transform
¶ Apply inverse transformations in reverse order
All estimators in the pipeline must support inverse_transform.
Parameters: Xt (array-like of shape (n_samples, n_transformed_features)) – Data samples, where n_samples is the number of samples and n_features is the number of features. Must fulfill input requirements of last step of pipeline’s inverse_transform method.
Returns: Xt Return type: array-like of shape (n_samples, n_features)
-
predict
(X, **predict_params)¶ Apply transforms to the data, and predict with the final estimator
Parameters: - X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- **predict_params (dict of string -> object) – Parameters to the predict called at the end of all transformations in the pipeline. Note that while this may be used to return uncertainties from some models with return_std or return_cov, uncertainties that are generated by the transformations in the pipeline are not propagated to the final estimator.
New in version 0.20.
Returns: y_pred
Return type: array-like
-
predict_log_proba
(X)¶ Apply transforms, and predict_log_proba of the final estimator
Parameters: X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline. Returns: y_score Return type: array-like of shape (n_samples, n_classes)
-
predict_proba
(X)¶ Apply transforms, and predict_proba of the final estimator
Parameters: X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline. Returns: y_proba Return type: array-like of shape (n_samples, n_classes)
-
score
(X, y=None, sample_weight=None)¶ Apply transforms, and score with the final estimator
Parameters: - X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
- y (iterable, default=None) – Targets used for scoring. Must fulfill label requirements for all steps of the pipeline.
- sample_weight (array-like, default=None) – If not None, this argument is passed as sample_weight keyword argument to the score method of the final estimator.
Returns: score
Return type: float
-
score_samples
(X)¶ Apply transforms, and score_samples of the final estimator.
Parameters: X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline. Returns: y_score Return type: ndarray of shape (n_samples,)
-
set_params
(**kwargs)¶ Set the parameters of this estimator.
Valid parameter keys can be listed with get_params().
Returns: Return type: self
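The double-underscore convention for step parameters, and the option to disable a step with 'passthrough', can be shown with a short scikit-learn sketch. The step names and the value of `C` are arbitrary choices for the example.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# <step name>__<parameter name> addresses a parameter of one step
pipe.set_params(svc__C=10)

# Setting a step to 'passthrough' removes that transformer from the pipeline
pipe.set_params(scaler="passthrough")

params = pipe.get_params()
```

The same keys are what a grid or randomized search expects in its parameter grid when the estimator is a pipeline.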
-
transform
¶ Apply transforms, and transform with the final estimator
This also works where final estimator is None: all prior transformations are applied.
Parameters: X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline. Returns: Xt Return type: array-like of shape (n_samples, n_transformed_features)
Metric¶
-
class
ml_tooling.metrics.
Metric
(name: str, score: float = None, cross_val_scores: Optional[numpy.ndarray] = None)¶ Represents a single metric, containing a metric name and its corresponding score. Can be instantiated using any sklearn-compatible score string.
A Metric knows how to generate its own score by calling score_metric(), passing an estimator, an X and a Y. A Metric can also get a cross-validated score by calling score_metric_cv() and passing a CV value - either a CV object or an int specifying the number of folds.
Examples
>>> from ml_tooling.metrics import Metric
>>> from sklearn.linear_model import LinearRegression
>>> import numpy as np
>>> metric = Metric('r2')
>>> x = np.array([[1],[2],[3],[4]])
>>> y = np.array([[2], [4], [6], [8]])
>>> estimator = LinearRegression().fit(x, y)
>>> metric.score_metric(estimator, x, y)
Metric(name='r2', score=1.0)
>>> metric.score
1.0
>>> metric.name
'r2'
>>> metric.score_metric_cv(estimator, x, y, cv=2)
Metric(name='r2', score=1.0)
>>> metric.score
1.0
>>> metric.name
'r2'
>>> metric.cross_val_scores
array([1., 1.])
>>> metric.std
0.0
Method generated by attrs for class Metric.
-
score_metric
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: Union[pandas.core.series.Series, numpy.ndarray], y: Union[pandas.core.series.Series, numpy.ndarray]) → ml_tooling.metrics.metric.Metric¶ Calculates the score for this metric. Takes a fitted estimator, x and y values. Scores are calculated with sklearn metrics - using the string defined in self.metric to look up the appropriate scoring function.
Parameters: - estimator (Pipeline or BaseEstimator) – A fitted estimator to score
- x (np.ndarray, pd.DataFrame) – Features to score model with
- y (np.ndarray, pd.Series) – Target to score model with
Returns: Return type: self
-
score_metric_cv
(estimator: Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline], x: Union[pandas.core.series.Series, numpy.ndarray], y: Union[pandas.core.series.Series, numpy.ndarray], cv: Any, n_jobs: int = -1, verbose: int = 0) → ml_tooling.metrics.metric.Metric¶ Score metric using cross-validation. When scoring with cross_validation, self.cross_val_scores is populated with the cross validated scores and self.score is set to the mean value of self.cross_val_scores. Cross validation can be parallelized by passing the n_jobs parameter
Parameters: - estimator (Pipeline or BaseEstimator) – Fitted estimator to score
- x (np.ndarray or pd.DataFrame) – Features to use in scoring
- y (np.ndarray or pd.Series) – Target to use in scoring
- cv (int, BaseCrossValidator) – If an int is passed, cross-validate using K-Fold with cv folds. If BaseCrossValidator is passed, use that object instead
- n_jobs (int) – Number of jobs to use in parallelizing. Pass None to not do CV in parallel
- verbose (int) – Verbosity level of output
Returns: Return type: self
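The relationship between the per-fold scores, their mean and their standard deviation can be sketched with scikit-learn's `cross_val_score`. This is an assumption-laden illustration of the quantities involved, not the implementation of `score_metric_cv`; the data is a made-up perfect linear relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2, 4, 6, 8, 10, 12])

# An int cv on a regressor uses K-Fold; one score is returned per fold
scores = cross_val_score(LinearRegression(), x, y, cv=3, scoring="r2")

# The mean plays the role of the overall score, the std shows fold-to-fold spread
mean_score = scores.mean()
std_score = scores.std()
```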
-
-
class
ml_tooling.metrics.
Metrics
(metrics: List[ml_tooling.metrics.metric.Metric])¶ Represents a collection of Metric. This is the default object used when scoring an estimator.
There are two alternate constructors:
- from_list() takes a list of metric names and instantiates one metric per list item
- from_dict() takes a dictionary of name -> score and instantiates one metric with the given score per dictionary item
Calling either score_metrics() or score_metrics_cv() will in turn call score_metric() or score_metric_cv() of each Metric in its collection.
Examples
To score multiple metrics, create a metrics object from a list and call score_metrics() to score all metrics in one operation.
We can convert metrics to a dictionary or a list.
Method generated by attrs for class Metrics.