# model_utils
## ks_abc

`ks_abc(y_true, y_pred, ax=None, figsize=None, colors=('darkorange', 'b'), title=None, xlim=(0.,1.), ylim=(0.,1.), fmt='.2f', lw=2, legend='best', plot=True, filename=None)`
Perform the Kolmogorov–Smirnov test over the positive and negative distributions of a binary classifier, and compute the area between the two curves.

The KS test plots the fraction of positives and negatives predicted correctly below each threshold, then finds the optimal threshold, i.e., the one enabling the best class separation.

The area between the curves gives further insight into the separation: the higher the area (1 being the maximum), the closer the centers of mass of the positive and negative distributions are to 1 and 0, respectively.

Based on scikit-plot's `plot_ks_statistic` method.
- `y_true` : array-like
  The true labels of the dataset.
- `y_pred` : array-like
  The probabilities predicted by a binary classifier.
- `ax` : matplotlib ax, default: None
  Matplotlib Axes on which the curves will be plotted.
- `figsize` : (int, int) or None, default: None
  A Matplotlib figure-size tuple. If None, falls back to Matplotlib's default. Only used if `ax=None`.
- `colors` : list of Matplotlib color strings, default: ('darkorange', 'b')
  List of colors to be used for the plotted curves.
- `title` : string or None, default: None
  Plotted graph title. If None, a default title is used.
- `xlim` : (float, float), default: (0., 1.)
  X-axis limits.
- `ylim` : (float, float), default: (0., 1.)
  Y-axis limits.
- `fmt` : string, default: '.2f'
  String formatting of displayed numbers.
- `lw` : int, default: 2
  Line width.
- `legend` : string or None, default: 'best'
  A Matplotlib legend location string. See the Matplotlib documentation for possible options.
- `plot` : Boolean, default: True
  Whether to plot the KS curves.
- `filename` : string or None, default: None
  If not None, the plot will be saved to the given file name.
Returns: A dictionary with the following keys:

- `abc` : area between curves
- `ks_stat` : computed statistic of the KS test
- `eopt` : estimated optimal threshold
- `ax` : the ax used to plot the curves
Example: See examples.
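A minimal usage sketch (the labels and probabilities below are illustrative assumptions, not taken from the library's docs):

```python
import numpy as np
from dython.model_utils import ks_abc

# Illustrative binary labels and predicted positive-class probabilities
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0.20, 0.35, 0.50, 0.60, 0.80, 0.90])

results = ks_abc(y_true, y_pred)
print(results['ks_stat'])  # KS statistic
print(results['abc'])      # area between curves
print(results['eopt'])     # estimated optimal threshold
```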
## metric_graph
`metric_graph(y_true, y_pred, metric, micro=True, macro=True, eoptimal_threshold=True, class_names=None, colors=None, ax=None, figsize=None, xlim=(0.,1.), ylim=(0.,1.02), lw=2, ls='-', ms=10, fmt='.2f', legend='best', plot=True, title=None, filename=None, force_multiclass=False)`
Plot a metric graph of a predictor's results (including AUC scores), where each row of `y_true` and `y_pred` represents a single example.

ROC: Plots the true-positive rate as a function of the false-positive rate of the positive label in a binary classification, where \(TPR = TP / (TP + FN)\) and \(FPR = FP / (FP + TN)\). A naive algorithm will display a straight line from (0,0) to (1,1), and therefore have an area under the curve (AUC) of 0.5.

Precision-Recall: Plots precision as a function of recall of the positive label in a binary classification, where \(Precision = TP / (TP + FP)\) and \(Recall = TP / (TP + FN)\). A naive algorithm will display a horizontal line, with precision equal to the ratio of positive examples in the dataset.

Based on scikit-learn examples (as seen in April 2018).
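To make these formulas concrete, here is a minimal sketch evaluating them on illustrative confusion-matrix counts (the numbers are assumptions, not library output):

```python
# Illustrative confusion-matrix counts (assumed for this example)
TP, FP, TN, FN = 80, 10, 90, 20

tpr = TP / (TP + FN)         # true-positive rate (= recall): 0.8
fpr = FP / (FP + TN)         # false-positive rate: 0.1
precision = TP / (TP + FP)   # precision: ~0.889

print(f"TPR={tpr}, FPR={fpr}, Precision={precision:.3f}")
```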
- `y_true` : list / NumPy ndarray
  The true classes of the predicted data. If only one or two columns exist, the data is treated as a binary classification (see the input example below). If there are more than 2 columns, each column is considered a unique class, and a metric graph and AUC score will be computed for each.
- `y_pred` : list / NumPy ndarray
  The predicted classes. Must have the same shape as `y_true`.
- `metric` : string
  The metric graph to plot. Currently supported: 'roc' for the Receiver Operating Characteristic curve and 'pr' for the Precision-Recall curve.
- `micro` : Boolean, default: True
  Whether to calculate a micro-average graph (not applicable to binary cases).
- `macro` : Boolean, default: True
  Whether to calculate a macro-average graph (ROC metric only, not applicable to binary cases).
- `eoptimal_threshold` : Boolean, default: True
  Whether to calculate and display the estimated-optimal threshold (eOpT) for each metric graph. For ROC curves, the estimated-optimal threshold is the computed threshold whose (fpr, tpr) values are closest to (0, 1). For PR curves, it is the one closest to (1, 1) (perfect recall and precision).
- `class_names` : list or string, default: None
  Names of the different classes. In a multi-class classification, the order must match the order of the class probabilities in the input data. In a binary classification, can be a string or a list; if a list, only the last element will be used.
- `colors` : list of Matplotlib color strings or None, default: None
  List of colors to be used for the plotted curves. If None, falls back to a predefined default.
- `ax` : matplotlib ax, default: None
  Matplotlib Axes on which the curves will be plotted.
- `figsize` : (int, int) or None, default: None
  A Matplotlib figure-size tuple. If None, falls back to Matplotlib's default. Only used if `ax=None`.
- `xlim` : (float, float), default: (0., 1.)
  X-axis limits.
- `ylim` : (float, float), default: (0., 1.02)
  Y-axis limits.
- `lw` : int, default: 2
  Line width.
- `ls` : string, default: '-'
  Matplotlib line-style string.
- `ms` : int, default: 10
  Marker size.
- `fmt` : string, default: '.2f'
  String formatting of displayed AUC and threshold numbers.
- `legend` : string or None, default: 'best'
  A Matplotlib legend location string. See the Matplotlib documentation for possible options.
- `plot` : Boolean, default: True
  Whether to plot the metric graph.
- `title` : string or None, default: None
  Plotted graph title. If None, a default title is used.
- `filename` : string or None, default: None
  If not None, the plot will be saved to the given file name.
- `force_multiclass` : Boolean, default: False
  Only applicable if `y_true` and `y_pred` have two columns. If True, the data is considered multi-class rather than binary (useful when plotting curves of different models against one another).
Returns: A dictionary with one key per class. Each value is another dictionary holding that class's AUC and eOpT values.
Example: See examples.
Binary Classification Input Example: Consider a dataset of two data points, where the true class of the first is 0, predicted with a probability of 0.6, and the true class of the second is 1, predicted with a probability of 0.8.

```python
# First option:
>>> metric_graph(y_true=[0,1], y_pred=[0.6,0.8], metric='roc')

# Second option:
>>> metric_graph(y_true=[[1,0],[0,1]], y_pred=[[0.6,0.4],[0.2,0.8]], metric='roc')

# Both yield the same result
```
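For the multi-class case, a hedged sketch (the dataset, model, and class names below are illustrative assumptions, not from the library's docs):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from dython.model_utils import metric_graph

# Illustrative three-class setup (assumed, not from the library's docs)
X, y = load_iris(return_X_y=True)
y_one_hot = np.eye(3)[y]  # one column per class, as expected for multi-class input

probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# Plots one ROC curve (with AUC and eOpT) per class, plus micro/macro averages
results = metric_graph(y_true=y_one_hot, y_pred=probs, metric='roc',
                       class_names=['setosa', 'versicolor', 'virginica'])
```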
## random_forest_feature_importance

`random_forest_feature_importance(forest, features, precision=4)`
Given a trained `sklearn.ensemble.RandomForestClassifier`, plot the different features based on their importance according to the classifier, from the most important to the least.
- `forest` : sklearn.ensemble.RandomForestClassifier
  A trained RandomForestClassifier.
- `features` : list
  A list of the names of the features the classifier was trained on, in the same order as they appeared in the training data.
- `precision` : int, default: 4
  Precision of the displayed feature-importance values.
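A minimal usage sketch (the dataset and feature names are illustrative assumptions, not from the library's docs):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from dython.model_utils import random_forest_feature_importance

# Illustrative data and feature names (assumed, not from the library's docs)
X, y = load_iris(return_X_y=True)
feature_names = ['sepal length', 'sepal width', 'petal length', 'petal width']

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
random_forest_feature_importance(forest, feature_names)
```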