nominal¶

`associations`¶

associations(dataset, nominal_columns='auto', numerical_columns=None, mark_columns=False, nom_nom_assoc='cramer', num_num_assoc='pearson', nom_num_assoc='correlation_ratio', symmetric_nom_nom=True, symmetric_num_num=True, display_rows='all', display_columns='all', hide_rows=None, hide_columns=None, cramers_v_bias_correction=True, nan_strategy=_REPLACE, nan_replace_value=_DEFAULT_REPLACE_VALUE, ax=None, figsize=None, annot=True, fmt='.2f', cmap=None, sv_color='silver', cbar=True, vmax=1.0, vmin=None, plot=True, compute_only=False, clustering=False, title=None, filename=None, multiprocessing=False, max_cpu_cores=None)

Calculate the correlation/strength-of-association of features in a data-set with both categorical and continuous features using:
- Pearson's R for continuous-continuous cases
- Correlation Ratio for categorical-continuous cases
- Cramer's V or Theil's U for categorical-categorical cases
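The pairing of measures with column types can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper names `pearson_r` and `pick_measure`, and the feature names in the sample, are hypothetical and only mirror the default values of `nom_nom_assoc`, `num_num_assoc` and `nom_num_assoc`.

```python
import math


def pearson_r(x, y):
    """Pearson's R for two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def pick_measure(row_is_nominal, col_is_nominal):
    """Return the name of the default measure for a pair of column types."""
    if row_is_nominal and col_is_nominal:
        return "cramer"
    if not row_is_nominal and not col_is_nominal:
        return "pearson"
    return "correlation_ratio"


# Hypothetical data-set layout: one nominal column and two continuous ones.
nominal = {"city"}
features = ["city", "age", "income"]

# Which measure would fill each cell of the association matrix.
plan = {
    (row, col): pick_measure(row in nominal, col in nominal)
    for row in features
    for col in features
}
```

In this sketch every cell of the matrix is assigned one of the three measure names, so a mixed data-set ends up with a single table even though different statistics fill different cells.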

dataset : NumPy ndarray / Pandas DataFrame

The data-set for which the features' correlation is computed
nominal_columns : string / list / NumPy ndarray

Default: 'auto'

Names of columns of the data-set which hold categorical values. Can also be the string 'all' to state that all columns are categorical, 'auto' (default) to identify nominal columns automatically, or None to state none are categorical. Only used if numerical_columns is None.
numerical_columns : string / list / NumPy ndarray

Default: None

To be used instead of nominal_columns. Names of columns of the data-set which hold numerical values. Can also be the string 'all' to state that all columns are numerical (equivalent to nominal_columns=None) or 'auto' to try to identify numerical columns (equivalent to nominal_columns='auto'). If None, nominal_columns is used.
mark_columns : Boolean

Default: False

If True, the names of the output's columns will have a suffix of '(nom)' or '(con)' based on their type (nominal or continuous), as provided by nominal_columns
nom_nom_assoc : callable / string

Default: 'cramer'

Method signature change

This replaces the theil_u flag which was used till version 0.6.6.

If callable, a function which receives two pd.Series and returns a single number.

If string, name of nominal-nominal (categorical-categorical) association to use:
- cramer: Cramer's V
- theil: Theil's U. When selected, heat-map columns are the provided information (meaning: \(U = U(row|col)\))
num_num_assoc : callable / string

Default: 'pearson'

If callable, a function which receives two pd.Series and returns a single number.

If string, name of numerical-numerical association to use:
- pearson: Pearson's R
- spearman: Spearman's R
- kendall: Kendall's Tau
nom_num_assoc : callable / string

Default: 'correlation_ratio'

If callable, a function which receives two pd.Series and returns a single number.

If string, name of nominal-numerical association to use:
- correlation_ratio: correlation ratio
symmetric_nom_nom : Boolean

Default: True

Relevant only if nom_nom_assoc is a callable. If so, declare whether the function is symmetric (\(f(x,y) = f(y,x)\)). If False, heat-map values should be interpreted as \(f(row,col)\).
symmetric_num_num : Boolean

Default: True

Relevant only if num_num_assoc is a callable. If so, declare whether the function is symmetric (\(f(x,y) = f(y,x)\)). If False, heat-map values should be interpreted as \(f(row,col)\).
display_rows : list / string

Default: 'all'

Choose which of the dataset's features will be displayed in the rows of the output's correlations table. If string, can either be a single feature's name or 'all'. Only used if hide_rows is None.
display_columns : list / string

Default: 'all'

Choose which of the dataset's features will be displayed in the columns of the output's correlations table. If string, can either be a single feature's name or 'all'. Only used if hide_columns is None.
hide_rows : list / string

Default: None

Choose which of the dataset's features will not be displayed in the rows of the output's correlations table. If string, must be a single feature's name. If None, display_rows is used.
hide_columns : list / string

Default: None

Choose which of the dataset's features will not be displayed in the columns of the output's correlations table. If string, must be a single feature's name. If None, display_columns is used.
cramers_v_bias_correction : Boolean

Default: True

Method signature change

This replaces the bias_correction flag which was used till version 0.6.6.

Use bias correction for Cramer's V from Bergsma, Wicher, Journal of the Korean Statistical Society 42 (2013): 323-328.
nan_strategy : string

Default: 'replace'

How to handle missing values: can be either 'drop_samples' to remove samples with missing values, 'drop_features' to remove features (columns) with missing values, 'replace' to replace all missing values with the nan_replace_value, or 'drop_sample_pairs' to drop each pair of missing observables separately before calculating the corresponding coefficient. Missing values are None and np.nan.
nan_replace_value : any

Default: 0.0

The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'
ax : Matplotlib Axes

Default: None

Matplotlib Axes on which the heat-map will be plotted
figsize : (float, float) or None

Default: None

A Matplotlib figure-size tuple. If None, will attempt to set the size automatically. Only used if ax=None.
annot : Boolean

Default: True

Plot number annotations on the heat-map
fmt : string

Default: '.2f'

String formatting of annotations
cmap : Matplotlib colormap or None

Default: None

A colormap to be used for the heat-map. If None, falls back to Seaborn's heat-map default
sv_color : string

Default: 'silver'

A Matplotlib color. The color to be used when displaying single-value features over the heat-map
cbar : Boolean

Default: True

Display heat-map's color-bar
vmax : float

Default: 1.0

Set heat-map vmax option
vmin : float or None

Default: None

Set heat-map vmin option. If set to None, vmin will be chosen automatically between 0 and -1.0, depending on the types of associations used (-1.0 if Pearson's R is used, 0 otherwise)
plot : Boolean

Default: True

Plot a heat-map of the correlation matrix. If False, heat-map will still be drawn, but not shown. The heat-map's ax is part of this function's output.
compute_only : Boolean

Default: False

Use this flag only if you have no need of the plotting at all. This skips the entire plotting mechanism (similar to the old compute_associations method).
clustering : Boolean

Default: False

If True, the computed associations will be sorted into groups by similar correlations
title: string or None

Default: None

Plotted graph title.
filename: string or None

Default: None

If not None, plot will be saved to the given file name.
multiprocessing: Boolean

Default: False

If True, use multiprocessing to speed up computations. If False, falls back to single-core computation
max_cpu_cores: int or None

Default: None

If not None, ProcessPoolExecutor will use the given number of CPU cores

Returns: A dictionary with the following keys:

- corr: A DataFrame of the correlation/strength-of-association between all features
- ax: A Matplotlib Axes

Example: See examples.
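The four `nan_strategy` options described above can be illustrated on a plain dictionary of columns. This is a minimal sketch of the documented behaviour, not the library's code: the helper names `is_missing`, `apply_nan_strategy` and `drop_sample_pair`, and the sample data, are hypothetical.

```python
import math


def is_missing(value):
    """Missing values are None and NaN, as stated in the parameter description."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def apply_nan_strategy(columns, strategy="replace", replace_value=0.0):
    """Apply a data-set-wide strategy to a dict mapping column name -> list."""
    if strategy == "replace":
        return {
            name: [replace_value if is_missing(v) else v for v in values]
            for name, values in columns.items()
        }
    if strategy == "drop_features":
        # Remove every column that contains at least one missing value.
        return {
            name: list(values)
            for name, values in columns.items()
            if not any(is_missing(v) for v in values)
        }
    if strategy == "drop_samples":
        # Remove every row that contains at least one missing value.
        n_rows = len(next(iter(columns.values()), []))
        keep = [
            i for i in range(n_rows)
            if not any(is_missing(values[i]) for values in columns.values())
        ]
        return {name: [values[i] for i in keep] for name, values in columns.items()}
    raise ValueError(f"unknown strategy: {strategy}")


def drop_sample_pair(x, y):
    """'drop_sample_pairs': drop missing observations for one pair of columns only."""
    pairs = [(a, b) for a, b in zip(x, y) if not is_missing(a) and not is_missing(b)]
    return [a for a, _ in pairs], [b for _, b in pairs]


data = {
    "age": [31.0, None, 45.0, 52.0],
    "city": ["Oslo", "Rome", "Rome", "Oslo"],
    "income": [4.1, 3.9, float("nan"), 5.0],
}

replaced = apply_nan_strategy(data, "replace")
without_features = apply_nan_strategy(data, "drop_features")
without_samples = apply_nan_strategy(data, "drop_samples")
age_vs_city = drop_sample_pair(data["age"], data["city"])
```

In this sample, 'drop_samples' keeps only the two rows that are complete in every column, while the pairwise variant keeps three rows for the (age, city) pair because the missing income value is irrelevant to that pair.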

`cluster_correlations`¶

cluster_correlations(corr_mat, indexes=None)

Apply agglomerative clustering in order to sort a correlation matrix. Based on this clustering example.

corr_mat : Pandas DataFrame

A correlation matrix (as output from associations)
indexes : list / NumPy ndarray / Pandas Series

A sequence of cluster indexes for sorting. If not present, a clustering is performed.

Returns:

- a sorted correlation matrix (pd.DataFrame)
- cluster indexes based on the original dataset (list)

Example:

>>> assoc = associations(
  customers,
  plot=False
)
>>> correlations = assoc['corr']
>>> correlations, _ = cluster_correlations(correlations)

`compute_associations`¶

Deprecated

compute_associations was deprecated and removed. Use associations(compute_only=True)['corr'].

`conditional_entropy`¶

conditional_entropy(x, y, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE, log_base=math.e)

Given measurements x and y of random variables \(X\) and \(Y\), calculates the conditional entropy of \(X\) given \(Y\):

\[ S(X|Y) = - \sum_{x,y} p(x,y) \log\frac{p(x,y)}{p(y)} \]
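The formula can be evaluated directly from value counts. The following is a minimal pure-Python sketch of the equation, without any missing-value handling; the function name `conditional_entropy_sketch` is hypothetical and this is not the library's code.

```python
import math
from collections import Counter


def conditional_entropy_sketch(x, y, log_base=math.e):
    """S(X|Y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))."""
    n = len(x)
    joint_counts = Counter(zip(x, y))  # occurrences of each (x, y) pair
    y_counts = Counter(y)              # occurrences of each y value
    entropy = 0.0
    for (_, y_value), count in joint_counts.items():
        p_xy = count / n
        p_y = y_counts[y_value] / n
        entropy -= p_xy * math.log(p_xy / p_y, log_base)
    return entropy


# y fully determines x: no uncertainty about x remains once y is known.
determined = conditional_entropy_sketch(["a", "a", "b", "b"], ["u", "u", "v", "v"])

# y says nothing about x: the full entropy of x (ln 2 here) remains.
independent = conditional_entropy_sketch(["a", "b", "a", "b"], ["u", "u", "v", "v"])
```

Only pairs that actually occur contribute to the sum, so terms with \(p(x,y) = 0\) never have to be evaluated.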

`correlation_ratio`¶

correlation_ratio(categories, measurements, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)

Calculates the Correlation Ratio (\(\eta\)) for categorical-continuous association:

\[ \eta = \sqrt{\frac{\sum_x{n_x (\bar{y}_x - \bar{y})^2}}{\sum_{x,i}{(y_{xi}-\bar{y})^2}}} \]

where \(n_x\) is the number of observations in category \(x\), and we define:

\[\bar{y}_x = \frac{\sum_i{y_{xi}}}{n_x} , \quad \bar{y} = \frac{\sum_x{n_x \bar{y}_x}}{\sum_x{n_x}}\]

It answers the question: given a continuous measurement, is it possible to know which category it is associated with? The value is in the range [0,1], where 0 means a category cannot be determined by a continuous measurement, and 1 means a category can be determined with absolute certainty. Read more on Wikipedia.

categories : list / NumPy ndarray / Pandas Series

A sequence of categorical measurements
measurements : list / NumPy ndarray / Pandas Series

A sequence of continuous measurements
nan_strategy : string

Default: 'replace'

How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
nan_replace_value : any

Default: 0.0

The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.

Returns: float in the range of [0,1]
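The definition of \(\eta\) above translates almost line by line into code. The following is a minimal pure-Python sketch without missing-value handling; the name `correlation_ratio_sketch` is hypothetical, and returning 0.0 when all measurements are identical is a choice made for this sketch.

```python
import math
from collections import defaultdict


def correlation_ratio_sketch(categories, measurements):
    """Correlation ratio (eta) between a categorical and a continuous sequence."""
    groups = defaultdict(list)
    for category, value in zip(categories, measurements):
        groups[category].append(value)

    y_bar = sum(measurements) / len(measurements)  # overall mean

    # Numerator: weighted squared distance of each category mean from y_bar.
    numerator = sum(
        len(values) * (sum(values) / len(values) - y_bar) ** 2
        for values in groups.values()
    )
    # Denominator: total squared distance of every observation from y_bar.
    denominator = sum((value - y_bar) ** 2 for value in measurements)

    if denominator == 0:
        return 0.0  # all measurements identical (sketch-specific choice)
    return math.sqrt(numerator / denominator)


# Category fully determines the measurement: eta = 1.
eta_high = correlation_ratio_sketch(["a", "a", "b", "b"], [1, 1, 5, 5])

# Both categories have the same mean: eta = 0.
eta_low = correlation_ratio_sketch(["a", "a", "b", "b"], [1, 5, 1, 5])
```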

`cramers_v`¶

cramers_v(x, y, bias_correction=True, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)

Calculates Cramer's V statistic for categorical-categorical association. This is a symmetric coefficient: \(V(x,y) = V(y,x)\). Read more on Wikipedia.

Original function taken from this answer on StackOverflow.

Cramer's V limitations when applied on skewed or small datasets

As the Cramer's V measure of association depends directly on the counts of each samples-pair in the data, it tends to be suboptimal when applied on skewed or small datasets.

Consider each of the following cases, where we would expect Cramer's V to reach a high value, yet this only happens in the first scenario:

>>> x = ['a'] * 400 + ['b'] * 100
>>> y = ['X'] * 400 + ['Y'] * 100
>>> cramers_v(x,y)
0.9937374102534072

# skewed dataset
>>> x = ['a'] * 500 + ['b'] * 1
>>> y = ['X'] * 500 + ['Y'] * 1
>>> cramers_v(x,y)
0.4974896903293253

# very small dataset
>>> x = ['a'] * 4 + ['b'] * 1
>>> y = ['X'] * 4 + ['Y'] * 1
>>> cramers_v(x,y)
0.0
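The three values above can be reproduced from first principles, which makes the effect of the sample size visible. The following pure-Python sketch is not the library's code. It assumes that the chi-squared statistic is computed with Yates' continuity correction on 2x2 tables (the default behaviour of SciPy's `chi2_contingency`) and that the bias correction is the one referenced in the `bias_correction` parameter; the name `cramers_v_sketch` and the `yates` flag are hypothetical.

```python
import math
from collections import Counter


def cramers_v_sketch(x, y, bias_correction=True, yates=True):
    """Cramer's V computed from the contingency table of x and y."""
    n = len(x)
    row_counts, col_counts = Counter(x), Counter(y)
    joint_counts = Counter(zip(x, y))
    r, k = len(row_counts), len(col_counts)

    # Assumption: continuity correction applies to 2x2 tables only.
    use_yates = yates and r == 2 and k == 2

    chi2 = 0.0
    for row_value, row_total in row_counts.items():
        for col_value, col_total in col_counts.items():
            expected = row_total * col_total / n
            diff = abs(joint_counts[(row_value, col_value)] - expected)
            if use_yates:
                diff -= min(0.5, diff)  # move observed 0.5 towards expected
            chi2 += diff ** 2 / expected

    phi2 = chi2 / n
    if bias_correction:
        phi2 = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
        r = r - (r - 1) ** 2 / (n - 1)
        k = k - (k - 1) ** 2 / (n - 1)

    denominator = min(k - 1, r - 1)
    if denominator == 0:
        return 0.0  # a single-valued feature carries no association
    return math.sqrt(phi2 / denominator)


balanced = cramers_v_sketch(["a"] * 400 + ["b"] * 100, ["X"] * 400 + ["Y"] * 100)
skewed = cramers_v_sketch(["a"] * 500 + ["b"] * 1, ["X"] * 500 + ["Y"] * 1)
tiny = cramers_v_sketch(["a"] * 4 + ["b"] * 1, ["X"] * 4 + ["Y"] * 1)
```

Under these assumptions the continuity correction removes a fixed 0.5 from every cell deviation, which is negligible when deviations are large (first case) but dominates when the rare category has one observation (second case). In the third case the bias-correction term \((k-1)(r-1)/(n-1) = 1/4\) exceeds the corrected \(\phi^2\), so the result is clipped to zero.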

x : list / NumPy ndarray / Pandas Series

A sequence of categorical measurements
y : list / NumPy ndarray / Pandas Series

A sequence of categorical measurements
bias_correction : Boolean

Default: True

Use bias correction from Bergsma, Wicher, Journal of the Korean Statistical Society 42 (2013): 323-328.
nan_strategy : string

Default: 'replace'

How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
nan_replace_value : any

Default: 0.0

The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.

Returns: float in the range of [0,1]

`identify_nominal_columns`¶

identify_nominal_columns(dataset)

Given a dataset, identify categorical columns. This is used internally in associations and numerical_encoding, but can also be used directly.

Note:

This is a shortcut for data_utils.identify_columns_by_type(dataset, include=['object', 'category'])

dataset : np.ndarray / pd.DataFrame

Returns: list of categorical columns

Example:

>>> df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a'], 'col2': [3, 4, 2, 1]})
>>> identify_nominal_columns(df)
['col1']

`identify_numeric_columns`¶

identify_numeric_columns(dataset)

Given a dataset, identify numeric columns.

Note:

This is a shortcut for data_utils.identify_columns_by_type(dataset, include=['int64', 'float64'])

dataset : np.ndarray / pd.DataFrame

Returns: list of numerical columns

Example:

>>> df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a'], 'col2': [3, 4, 2, 1], 'col3': [1., 2., 3., 4.]})
>>> identify_numeric_columns(df)
['col2', 'col3']

`numerical_encoding`¶

numerical_encoding(dataset, nominal_columns='auto', drop_single_label=False, drop_fact_dict=True, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)

Encodes a data-set with mixed data (numerical and categorical) to a numerical-only data-set, using the following logic:

- categorical columns with only a single value will be marked as zero (or dropped, if requested)
- categorical columns with two values will be replaced with the result of Pandas factorize
- categorical columns with more than two values will be replaced with the result of Pandas get_dummies
- numerical columns will not be modified

dataset : NumPy ndarray / Pandas DataFrame

The data-set to encode
nominal_columns : sequence / string

Default: 'auto'

Names of columns of the data-set which hold categorical values. Can also be the string 'all' to state that all columns are categorical, 'auto' (default) to identify nominal columns automatically, or None to state none are categorical (nothing happens)
drop_single_label : Boolean

Default: False

If True, nominal columns with only a single value will be dropped.
drop_fact_dict : Boolean

Default: True

If True, the return value will be the encoded DataFrame alone. If False, it will be a tuple of the DataFrame and the dictionary of the binary factorization (originating from pd.factorize)
nan_strategy : string

Default: 'replace'

How to handle missing values: can be either 'drop_samples' to remove samples with missing values, 'drop_features' to remove features (columns) with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
nan_replace_value : any

Default: 0.0

The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'

Returns: pd.DataFrame or (pd.DataFrame, dict). If drop_fact_dict is True, returns the encoded DataFrame. Otherwise, returns a tuple of the encoded DataFrame and a dictionary, where each key is a two-value column and each value is that column's original labels, as supplied by Pandas factorize. The dictionary will be empty if no two-value columns are present in the data-set
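The encoding rules can be sketched on a plain dictionary of columns, without Pandas. This is only an illustration of the logic, not the library's code: the name `numerical_encoding_sketch` is hypothetical, binary codes follow order of first appearance (as `pd.factorize` does), and the `<column>_<label>` naming with sorted labels is an assumption modelled on the defaults of `pd.get_dummies`.

```python
def numerical_encoding_sketch(columns, nominal_columns, drop_single_label=False):
    """Encode a dict of column name -> list of values; returns (encoded, fact_dict)."""
    encoded = {}
    fact_dict = {}  # two-value column name -> original labels, in code order
    for name, values in columns.items():
        if name not in nominal_columns:
            encoded[name] = list(values)  # numerical columns are not modified
            continue

        labels = list(dict.fromkeys(values))  # unique labels, first-seen order
        if len(labels) == 1:
            if not drop_single_label:
                encoded[name] = [0] * len(values)
        elif len(labels) == 2:
            codes = {label: code for code, label in enumerate(labels)}
            encoded[name] = [codes[value] for value in values]
            fact_dict[name] = labels
        else:
            # One indicator column per label (one-hot encoding).
            for label in sorted(labels):
                encoded[f"{name}_{label}"] = [int(value == label) for value in values]
    return encoded, fact_dict


data = {
    "color": ["red", "green", "blue", "red"],
    "flag": ["yes", "no", "yes", "yes"],
    "const": ["k", "k", "k", "k"],
    "size": [1.5, 2.0, 3.5, 4.0],
}
encoded, fact_dict = numerical_encoding_sketch(data, {"color", "flag", "const"})
```

Here `fact_dict` plays the role of the dictionary returned when drop_fact_dict is False: it lets the binary codes be mapped back to the original labels.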

`replot_last_associations`¶

replot_last_associations(ax=None, figsize=None, annot=None, fmt=None, cmap=None, sv_color=None, cbar=None, vmax=None, vmin=None, plot=True, title=None, filename=None)

Re-plot the last computed associations heat-map. This method performs no new computations; it only allows changing the visual output of the last computed heat-map.

ax : Matplotlib Axes

Default: None

Matplotlib Axes on which the heat-map will be plotted
figsize : (int,int) or None

Default: None

A Matplotlib figure-size tuple. If None, uses the last associations call value. Only used if ax=None.
annot : Boolean or None

Default: None

Plot number annotations on the heat-map. If None, uses the last associations call value.
fmt : string

Default: None

String formatting of annotations. If None, uses the last associations call value.
cmap : Matplotlib colormap or None

Default: None

A colormap to be used for the heat-map. If None, uses the last associations call value.
sv_color : string

Default: None

A Matplotlib color. The color to be used when displaying single-value features over the heat-map. If None, uses the last associations call value.
cbar : Boolean or None

Default: None

Display heat-map's color-bar. If None, uses the last associations call value.
vmax : float or None

Default: None

Set heat-map vmax option. If None, uses the last associations call value.
vmin : float or None

Default: None

Set heat-map vmin option. If None, uses the last associations call value.
plot : Boolean

Default: True

Plot a heat-map of the correlation matrix. If False, plotting still happens, but the heat-map will not be displayed.
title : string or None

Default: None

Plotted graph title. If None, uses the last associations call value.
filename : string or None

Default: None

If not None, plot will be saved to the given file name. Note: in order to avoid accidental file overwrites, the last associations call value is never used, and when filename is set to None, no writing to file occurs.

Returns: A Matplotlib Axes

`theils_u`¶

theils_u(x, y, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)

Calculates Theil's U statistic (Uncertainty coefficient) for categorical-categorical association, defined as:

\[ U(X|Y) = \frac{S(X) - S(X|Y)}{S(X)} \]

where \(S(X)\) is the entropy of \(X\) and \(S(X|Y)\) is the conditional entropy of \(X\) given \(Y\).

This is the uncertainty of x given y: the value lies in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x. This is an asymmetric coefficient: in general, \(U(x,y) \neq U(y,x)\). Read more on Wikipedia.

x : list / NumPy ndarray / Pandas Series

A sequence of categorical measurements
y : list / NumPy ndarray / Pandas Series

A sequence of categorical measurements
nan_strategy : string

Default: 'replace'

How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
nan_replace_value : any

Default: 0.0

The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.

Returns: float in the range of [0,1]
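The asymmetry of the coefficient is easiest to see on a small example. The following pure-Python sketch evaluates the definition above directly, without missing-value handling; the names `entropy_sketch`, `conditional_entropy_sketch` and `theils_u_sketch` are hypothetical, and returning 1.0 when \(S(X) = 0\) is a choice made for this sketch.

```python
import math
from collections import Counter


def entropy_sketch(x):
    """S(X) = -sum over x of p(x) * log p(x)."""
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in Counter(x).values())


def conditional_entropy_sketch(x, y):
    """S(X|Y) = -sum over (x, y) of p(x, y) * log(p(x, y) / p(y))."""
    n = len(x)
    y_counts = Counter(y)
    return -sum(
        (c / n) * math.log((c / n) / (y_counts[y_value] / n))
        for (_, y_value), c in Counter(zip(x, y)).items()
    )


def theils_u_sketch(x, y):
    """U(X|Y) = (S(X) - S(X|Y)) / S(X)."""
    s_x = entropy_sketch(x)
    if s_x == 0:
        return 1.0  # x is constant, nothing left to explain (sketch-specific choice)
    return (s_x - conditional_entropy_sketch(x, y)) / s_x


x = ["a", "a", "b", "b"]
y = ["u", "v", "w", "z"]

u_x_given_y = theils_u_sketch(x, y)  # y identifies x completely
u_y_given_x = theils_u_sketch(y, x)  # x only narrows y down to two options
```

Knowing y pins down x exactly, so \(U(x|y) = 1\), while knowing x removes only half of the entropy of y, so \(U(y|x) = 0.5\).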
