Skip to content

model_utils

ks_abc

ks_abc(y_true, y_pred, ax=None, figsize=None, colors=('darkorange', 'b'), title=None, xlim=(0.,1.), ylim=(0.,1.), fmt='.2f', lw=2, legend='best', plot=True, filename=None)

Perform the Kolmogorov–Smirnov test over the positive and negative distributions of a binary classifier, and compute the area between curves.

The KS test plots the fraction of positives and negatives predicted correctly below each threshold. It then finds the optimal threshold, being the one enabling the best class separation.

The area between curves allows a better insight into separation. The higher the area is (1 being the maximum), the more the positive and negative distributions' center-of-mass are closer to 1 and 0, respectively.

Based on scikit-plot plot_ks_statistic method.

  • y_true : array-like

    The true labels of the dataset

  • y_pred : array-like

    The probabilities predicted by a binary classifier

  • ax : matplotlib ax

    Default = None

    Matplotlib Axis on which the curves will be plotted

  • figsize : (int,int) or None

    Default = None

    a Matplotlib figure-size tuple. If None, falls back to Matplotlib's default. Only used if ax=None

  • colors : list of Matplotlib color strings

    Default = ('darkorange', 'b')

    List of colors to be used for the plotted curves

  • title : string or None

    Default = None

    Plotted graph title. If None, default title is used

  • xlim : (float, float)

    Default = (0.,1.)

    X-axis limits.

  • ylim : (float,float)

    Default = (0.,1.)

    Y-axis limits.

  • fmt : string

    Default = '.2f'

    String formatting of displayed numbers.

  • lw : int

    Default = 2

    Line-width.

  • legend: string or None

    Default = 'best'

    A Matplotlib legend location string. See Matplotlib documentation for possible options

  • plot: Boolean, default = True

    Plot the KS curves

  • filename: string or None

    Default = None

    If not None, plot will be saved to the given file name.

Returns: A dictionary of the following keys:

  • abc: area between curves

  • ks_stat: computed statistic of the KS test

  • eopt: estimated optimal threshold

  • ax: the ax used to plot the curves

Example: See examples.


metric_graph

metric_graph(y_true, y_pred, metric, micro=True, macro=True, eoptimal_threshold=True, class_names=None, colors=None, ax=None, figsize=None, xlim=(0.,1.), ylim=(0.,1.02), lw=2, ls='-', ms=10, fmt='.2f', title=None, filename=None, force_multiclass=False)

Plot a metric graph of predictor's results (including AUC scores), where each row of y_true and y_pred represent a single example.

ROC: Plots true-positive rate as a function of the false-positive rate of the positive label in a binary classification, where TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A naive algorithm will display a linear line going from (0,0) to (1,1), therefore having an area under-curve (AUC) of 0.5.

Precision-Recall: Plots precision as a function of recall of the positive label in a binary classification, where Precision = TP / (TP + FP) and Recall = TP / (TP + FN). A naive algorithm will display a horizontal linear line with precision of the ratio of positive examples in the dataset.

Based on scikit-learn examples (as was seen on April 2018):

  • y_true : list / NumPy ndarray

    The true classes of the predicted data. If only one or two columns exist, the data is treated as a binary classification (see input example below). If there are more than 2 columns, each column is considered a unique class, and a ROC graph and AUC score will be computed for each.

  • y_pred : list / NumPy ndarray

    The predicted classes. Must have the same shape as y_true.

  • metric : string

    The metric graph to plot. Currently supported: 'roc' for Receiver Operating Characteristic curve and 'pr' for Precision-Recall curve

  • micro : Boolean

    Default = True

    Whether to calculate a Micro graph (not applicable for binary cases)

  • macro : Boolean

    Default = True

    Whether to calculate a Macro graph (ROC metric only, not applicable for binary cases)

  • eopt : Boolean

    Default = True

    Whether to calculate and display the estimated-optimal threshold for each metric graph. For ROC curves, the estimated-optimal threshold is the closest computed threshold with (fpr,tpr) values closest to (0,1). For PR curves, it is the closest one to (1,1) (perfect recall and precision)

  • class_names: list or string

    Default = None

    Names of the different classes. In a multi-class classification, the order must match the order of the classes probabilities in the input data. In a binary classification, can be a string or a list. If a list, only the last element will be used.

  • colors : list of Matplotlib color strings or None

    Default = None

    List of colors to be used for the plotted curves. If None, falls back to a predefined default.

  • ax : matplotlib ax

    Default = None

    Matplotlib Axis on which the curves will be plotted

  • figsize : (int,int) or None

    Default = None

    A Matplotlib figure-size tuple. If None, falls back to Matplotlib's default. Only used if ax=None.

  • xlim : (float, float)

    Default = (0.,1.)

    X-axis limits.

  • ylim : (float,float)

    Default = (0.,1.02)

    Y-axis limits.

  • lw : int

    Default = 2

    Line-width.

  • ls : string

    Default = '-'

    Matplotlib line-style string

  • ms : int

    Default = 10

    Marker-size.

  • fmt : string

    Default = '.2f'

    String formatting of displayed AUC and threshold numbers.

  • legend: string or None

    Default = 'best'

    A Matplotlib legend location string. See Matplotlib documentation for possible options

  • plot: Boolean, default = True

    Plot the histogram

  • title: string or None

    Default = None

    Plotted graph title. If None, default title is used.

  • filename: string or None

    Default = None

    If not None, plot will be saved to the given file name.

  • force_multiclass: Boolean

    Default = False

    Only applicable if y_true and y_pred have two columns. If so, consider the data as a multiclass data rather than binary (useful when plotting curves of different models one against the other)

Returns: A dictionary, one key for each class. Each value is another dictionary, holding AUC and eOpT values.

Example: See examples.

Binary Classification Input Example: Consider a data-set of two data-points where the true class of the first line is class 0, which was predicted with a probability of 0.6, and the second line's true class is 1, with predicted probability of 0.8.

# First option: 
>> metric_graph(y_true=[0,1], y_pred=[0.6,0.8], metric='roc') 
# Second option:
>> metric_graph(y_true=[[1,0],[0,1]], y_pred=[[0.6,0.4],[0.2,0.8]], metric='roc')
# Both yield the same result


random_forest_feature_importance

random_forest_feature_importance(forest, features, precision=4)

Given a trained sklearn.ensemble.RandomForestClassifier, plot the different features based on their importance according to the classifier, from the most important to the least.

  • forest : sklearn.ensemble.RandomForestClassifier

    A trained RandomForestClassifier

  • features : list

    A list of the names of the features the classifier was trained on, ordered by the same order the appeared in the training data

  • precision : int

    Default = 4

    Precision of feature importance.