model_utils
ks_abc
ks_abc(y_true, y_pred, ax=None, figsize=None, colors=('darkorange', 'b'), title=None, xlim=(0.,1.), ylim=(0.,1.), fmt='.2f', lw=2, legend='best', plot=True, filename=None)
Perform the Kolmogorov–Smirnov test over the positive and negative distributions of a binary classifier, and compute the area between curves.
The KS test plots the fraction of positives and negatives predicted correctly below each threshold. It then finds the optimal threshold, that is, the one that gives the best class separation.
The area between curves gives further insight into the separation. The larger the area (1 being the maximum), the closer the centers of mass of the positive and negative distributions are to 1 and 0, respectively.
Based on the scikit-plot plot_ks_statistic method.
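As an illustration of the two quantities described above, here is a minimal sketch in plain Python. It is not the library's implementation. It assumes scores lie in [0, 1], builds the empirical cumulative distribution of each class on an evenly spaced threshold grid, takes the largest gap between them as the KS statistic, and integrates the gap with the trapezoid rule. The function name `ks_and_area`, its `n_grid` parameter, and the returned keys are hypothetical.

```python
from bisect import bisect_right


def ks_and_area(y_true, y_score, n_grid=1001):
    """Sketch of a KS statistic and an area between the two class CDFs."""
    neg = sorted(s for t, s in zip(y_true, y_score) if t == 0)
    pos = sorted(s for t, s in zip(y_true, y_score) if t == 1)
    thresholds = [i / (n_grid - 1) for i in range(n_grid)]

    # Empirical CDFs: fraction of each class scored at or below a threshold
    cdf_neg = [bisect_right(neg, thr) / len(neg) for thr in thresholds]
    cdf_pos = [bisect_right(pos, thr) / len(pos) for thr in thresholds]
    gaps = [n - p for n, p in zip(cdf_neg, cdf_pos)]

    # KS statistic: the largest vertical distance between the two curves
    best = max(range(n_grid), key=lambda i: abs(gaps[i]))

    # Area between the curves, integrated with the trapezoid rule
    step = 1.0 / (n_grid - 1)
    area = sum((gaps[i] + gaps[i + 1]) / 2 * step for i in range(n_grid - 1))

    return {
        "ks_stat": abs(gaps[best]),
        "threshold": thresholds[best],
        "area": area,
    }


# Toy data: two negatives scored low, two positives scored high
result = ks_and_area([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(result)
```

For scores confined to [0, 1], the integral of this gap equals the mean positive score minus the mean negative score, which is why a larger area corresponds to class centers of mass nearer 1 and 0.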

y_true
: array-like
The true labels of the dataset

y_pred
: array-like
The probabilities predicted by a binary classifier

ax
: matplotlib ax
Default: None
Matplotlib Axis on which the curves will be plotted

figsize
: (int,int) or None
Default: None
A Matplotlib figure-size tuple. If None, falls back to Matplotlib's default. Only used if ax=None.

colors
: list of Matplotlib color strings
Default: ('darkorange', 'b')
List of colors to be used for the plotted curves

title
: string or None
Default: None
Plotted graph title. If None, the default title is used.

xlim
:(float, float)
Default: (0.,1.)
X-axis limits.

ylim
:(float,float)
Default: (0.,1.)
Y-axis limits.

fmt
:string
Default: '.2f'
String formatting of displayed numbers.

lw
:int
Default: 2
Linewidth.

legend
: string or None
Default: 'best'
A Matplotlib legend location string. See the Matplotlib documentation for possible options.

plot
: Boolean
Default: True
Plot the KS curves

filename
: string or None
Default: None
If not None, plot will be saved to the given file name.

Returns: A dictionary with the following keys:

abc
: area between curves

ks_stat
: computed statistic of the KS test

eopt
: estimated optimal threshold

ax
: the ax used to plot the curves

Example: See examples.
metric_graph
metric_graph(y_true, y_pred, metric, micro=True, macro=True, eoptimal_threshold=True, class_names=None, colors=None, ax=None, figsize=None, xlim=(0.,1.), ylim=(0.,1.02), lw=2, ls='', ms=10, fmt='.2f', title=None, filename=None, force_multiclass=False)
Plot a metric graph of a predictor's results (including AUC scores), where each row of y_true and y_pred represents a single example.
ROC: Plots the true-positive rate as a function of the false-positive rate of the positive label in a binary classification, where \(TPR = TP / (TP + FN)\) and \(FPR = FP / (FP + TN)\). A naive algorithm will display a straight line from (0,0) to (1,1), and therefore has an area under the curve (AUC) of 0.5.
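The TPR and FPR formulas above can be traced by hand with a short plain-Python sketch. It is only an illustration of the definitions, not how the library computes its curves. It assumes both classes are present in the labels and uses every distinct score as a threshold. The names `roc_points` and `trapezoid_auc` are hypothetical.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) pairs, sweeping thresholds from high to low."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= thr)
        # TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) pairs."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


points = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(points, trapezoid_auc(points))
```

Applying `trapezoid_auc` to the diagonal `[(0, 0), (1, 1)]` gives 0.5, the value quoted for a naive algorithm.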
Precision-Recall: Plots precision as a function of recall of the positive label in a binary classification, where \(Precision = TP / (TP + FP)\) and \(Recall = TP / (TP + FN)\). A naive algorithm will display a horizontal line whose precision equals the ratio of positive examples in the dataset.
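To make the precision and recall definitions concrete, the following plain-Python sketch evaluates them at a single threshold and computes the naive baseline. It is an illustration only, and assumes at least one positive example exists. The names `precision_recall` and `baseline_precision` are hypothetical, and returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall(y_true, y_score, threshold):
    """Precision and recall of the positive label at one decision threshold."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for t, s in pairs if t == 1 and s >= threshold)
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
    fn = sum(1 for t, s in pairs if t == 1 and s < threshold)
    # Assumed convention: precision is 1.0 when no example is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


def baseline_precision(y_true):
    """Height of the naive horizontal line: the share of positive examples."""
    return sum(1 for t in y_true if t == 1) / len(y_true)


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(precision_recall(y_true, y_score, threshold=0.35))
print(baseline_precision(y_true))
```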
Based on scikit-learn examples (as seen in April 2018).

y_true
:list / NumPy ndarray
The true classes of the predicted data. If only one or two columns exist, the data is treated as a binary classification (see input example below). If there are more than 2 columns, each column is considered a unique class, and a ROC graph and AUC score will be computed for each.

y_pred
:list / NumPy ndarray
The predicted classes. Must have the same shape as y_true.

metric
:string
The metric graph to plot. Currently supported: 'roc' for Receiver Operating Characteristic curve and 'pr' for Precision-Recall curve

micro
:Boolean
Default: True
Whether to calculate a Micro graph (not applicable for binary cases)

macro
:Boolean
Default: True
Whether to calculate a Macro graph (ROC metric only, not applicable for binary cases)

eopt
:Boolean
Default: True
Whether to calculate and display the estimated-optimal threshold for each metric graph. For ROC curves, the estimated-optimal threshold is the computed threshold whose (fpr,tpr) values are closest to (0,1). For PR curves, it is the one closest to (1,1) (perfect recall and precision).

class_names
: list or string
Default: None
Names of the different classes. In a multiclass classification, the order must match the order of the class probabilities in the input data. In a binary classification, can be a string or a list. If a list, only the last element will be used.

colors
: list of Matplotlib color strings or None
Default: None
List of colors to be used for the plotted curves. If None, falls back to a predefined default.

ax
: matplotlib ax
Default: None
Matplotlib Axis on which the curves will be plotted

figsize
: (int,int) or None
Default: None
A Matplotlib figure-size tuple. If None, falls back to Matplotlib's default. Only used if ax=None.

xlim
:(float, float)
Default: (0.,1.)
X-axis limits.

ylim
:(float,float)
Default: (0.,1.02)
Y-axis limits.

lw
:int
Default: 2
Linewidth.

ls
:string
Default: ''
Matplotlib linestyle string

ms
:int
Default: 10
Markersize.

fmt
:string
Default: '.2f'
String formatting of displayed AUC and threshold numbers.

legend
: string or None
Default: 'best'
A Matplotlib legend location string. See the Matplotlib documentation for possible options.

plot
: Boolean
Default: True
Plot the graph

title
: string or None
Default: None
Plotted graph title. If None, the default title is used.

filename
: string or None
Default: None
If not None, plot will be saved to the given file name.

force_multiclass
: Boolean
Default: False
Only applicable if y_true and y_pred have two columns. If so, treat the data as multiclass rather than binary (useful when plotting curves of different models against one another).
Returns: A dictionary, one key for each class. Each value is another dictionary, holding AUC and eOpT values.
Example: See examples.
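The estimated-optimal threshold rule given for the `eopt` parameter (closest point to (0,1) on a ROC curve, closest to (1,1) on a PR curve) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: both classes must be present, every distinct score is tried as a threshold, and ties go to the lowest threshold. The function name `estimated_optimal_threshold` is hypothetical.

```python
import math


def estimated_optimal_threshold(y_true, y_score, metric="roc"):
    """Pick the threshold whose curve point is nearest the ideal corner."""
    pairs = list(zip(y_true, y_score))
    best_thr, best_dist = None, math.inf
    for thr in sorted(set(y_score)):
        tp = sum(1 for t, s in pairs if t == 1 and s >= thr)
        fp = sum(1 for t, s in pairs if t == 0 and s >= thr)
        fn = sum(1 for t, s in pairs if t == 1 and s < thr)
        tn = sum(1 for t, s in pairs if t == 0 and s < thr)
        if metric == "roc":
            # Point is (fpr, tpr); the ideal corner is (0, 1)
            x, y, ideal = fp / (fp + tn), tp / (tp + fn), (0.0, 1.0)
        else:
            # Point is (recall, precision); the ideal corner is (1, 1).
            # tp + fp >= 1 here because thr is itself one of the scores.
            x, y, ideal = tp / (tp + fn), tp / (tp + fp), (1.0, 1.0)
        dist = math.hypot(x - ideal[0], y - ideal[1])
        if dist < best_dist:
            best_thr, best_dist = thr, dist
    return best_thr


y_true = [0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]
print(estimated_optimal_threshold(y_true, y_score, metric="roc"))
print(estimated_optimal_threshold(y_true, y_score, metric="pr"))
```

On this toy data the two metrics select different thresholds, because the ROC rule penalizes false positives relative to all negatives while the PR rule penalizes them relative to predicted positives.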
Binary Classification Input Example: Consider a dataset of two data points, where the true class of the first row is 0 and its predicted probability of belonging to class 1 is 0.6, and the true class of the second row is 1, with a predicted probability of 0.8.
# First option:
>>> metric_graph(y_true=[0,1], y_pred=[0.6,0.8], metric='roc')
# Second option:
>>> metric_graph(y_true=[[1,0],[0,1]], y_pred=[[0.4,0.6],[0.2,0.8]], metric='roc')
# Both yield the same result
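The relationship between the one-column and two-column input layouts can be sketched with two small helpers. This is a hedged illustration, assuming that a one-column y_pred holds the probability of class 1 and that the second column of the two-column layout corresponds to class 1. The helper names `to_two_columns` and `to_one_column` are hypothetical and are not part of the library.

```python
def to_two_columns(y_true, y_pred):
    """Expand 1-D labels and class-1 probabilities into [class 0, class 1] rows."""
    labels = [[1 - t, t] for t in y_true]
    probs = [[1 - p, p] for p in y_pred]
    return labels, probs


def to_one_column(labels, probs):
    """Collapse two-column rows back to 1-D labels and class-1 probabilities."""
    y_true = [row.index(max(row)) for row in labels]  # position of the 1 in each row
    y_pred = [row[1] for row in probs]  # keep only the class-1 column
    return y_true, y_pred


labels, probs = to_two_columns([0, 1, 1], [0.3, 0.9, 0.7])
print(labels)
print(probs)
print(to_one_column(labels, probs))
```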
random_forest_feature_importance
random_forest_feature_importance(forest, features, precision=4)
Given a trained sklearn.ensemble.RandomForestClassifier, plot the different features based on their importance according to the classifier, from the most important to the least.

forest
: sklearn.ensemble.RandomForestClassifier
A trained RandomForestClassifier

features
:list
A list of the names of the features the classifier was trained on, in the same order they appeared in the training data

precision
:int
Default: 4
Precision of feature importance.
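The ordering step described above can be sketched without plotting. This is a minimal illustration, not the library's implementation: the helper name `rank_features` is hypothetical, the importance values are made-up stand-ins, and treating `precision` as a number of decimal places is an assumption. With a fitted scikit-learn forest, the importances would come from its `feature_importances_` attribute.

```python
def rank_features(importances, features, precision=4):
    """Pair each feature name with its importance, most important first."""
    pairs = sorted(zip(features, importances), key=lambda p: p[1], reverse=True)
    # Assumption: `precision` is the number of decimal places to keep
    return [(name, round(score, precision)) for name, score in pairs]


# Stand-in values; a trained forest would supply forest.feature_importances_
importances = [0.12345, 0.61728, 0.25927]
features = ["age", "income", "height"]  # same order as the training columns

for name, score in rank_features(importances, features, precision=3):
    print(name, score)
```

The sketch relies on `features` being in training-column order, since the importances carry no names of their own.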