nominal
associations
associations(dataset, nominal_columns='auto', numerical_columns=None, mark_columns=False, nom_nom_assoc='cramer', num_num_assoc='pearson', nom_num_assoc='correlation_ratio', symmetric_nom_nom=True, symmetric_num_num=True, display_rows='all', display_columns='all', hide_rows=None, hide_columns=None, cramers_v_bias_correction=True, nan_strategy=_REPLACE, nan_replace_value=_DEFAULT_REPLACE_VALUE, ax=None, figsize=None, annot=True, fmt='.2f', cmap=None, sv_color='silver', cbar=True, vmax=1.0, vmin=None, plot=True, compute_only=False, clustering=False, title=None, filename=None, multiprocessing=False, max_cpu_cores=None)
Calculate the correlation/strength-of-association of features in a data-set with both categorical and continuous features using:
- Pearson's R for continuous-continuous cases
- Correlation Ratio for categorical-continuous cases
- Cramer's V or Theil's U for categorical-categorical cases
-
dataset
:NumPy ndarray / Pandas DataFrame
The data-set for which the features' correlation is computed
-
nominal_columns
:string / list / NumPy ndarray
Default: 'auto'
Names of columns of the data-set which hold categorical values. Can also be the string 'all' to state that all columns are categorical, 'auto' (default) to identify nominal columns automatically, or None to state none are categorical. Only used if numerical_columns is None.
-
numerical_columns
:string / list / NumPy ndarray
Default: None
To be used instead of nominal_columns. Names of columns of the data-set which hold numerical values. Can also be the string 'all' to state that all columns are numerical (equivalent to nominal_columns=None) or 'auto' to try to identify numerical columns (equivalent to nominal_columns='auto'). If None, nominal_columns is used.
-
mark_columns
:Boolean
Default: False
If True, the output's column names will have a suffix of '(nom)' or '(con)' based on their type (nominal or continuous), as provided by nominal_columns.
-
nom_nom_assoc
:callable / string
Default: 'cramer'
Method signature change
This replaces the theil_u flag which was used till version 0.6.6.
If callable, a function which receives two pd.Series and returns a single number.
If string, name of nominal-nominal (categorical-categorical) association to use:
- cramer: Cramer's V
- theil: Theil's U. When selected, heat-map columns are the provided information (meaning: \(U = U(row|col)\))
-
num_num_assoc
:callable / string
Default: 'pearson'
If callable, a function which receives two pd.Series and returns a single number.
If string, name of numerical-numerical association to use:
- pearson: Pearson's R
- spearman: Spearman's R
- kendall: Kendall's Tau
-
nom_num_assoc
:callable / string
Default: 'correlation_ratio'
If callable, a function which receives two pd.Series and returns a single number.
If string, name of nominal-numerical association to use:
- correlation_ratio: correlation ratio
-
symmetric_nom_nom
:Boolean
Default: True
Relevant only if nom_nom_assoc is a callable. If so, declare whether the function is symmetric (\(f(x,y) = f(y,x)\)). If False, heat-map values should be interpreted as \(f(row,col)\).
-
symmetric_num_num
:Boolean
Default: True
Relevant only if num_num_assoc is a callable. If so, declare whether the function is symmetric (\(f(x,y) = f(y,x)\)). If False, heat-map values should be interpreted as \(f(row,col)\).
-
display_rows
:list / string
Default: 'all'
Choose which of the dataset's features will be displayed in the rows of the output's correlations table. If string, can either be a single feature's name or 'all'. Only used if hide_rows is None.
-
display_columns
:list / string
Default: 'all'
Choose which of the dataset's features will be displayed in the columns of the output's correlations table. If string, can either be a single feature's name or 'all'. Only used if hide_columns is None.
-
hide_rows
:list / string
Default: None
Choose which of the dataset's features will not be displayed in the rows of the output's correlations table. If string, must be a single feature's name. If None, display_rows is used.
-
hide_columns
:list / string
Default: None
Choose which of the dataset's features will not be displayed in the columns of the output's correlations table. If string, must be a single feature's name. If None, display_columns is used.
-
cramers_v_bias_correction
:Boolean
Default: True
Method signature change
This replaces the bias_correction flag which was used till version 0.6.6.
Use bias correction for Cramer's V from Wicher Bergsma, Journal of the Korean Statistical Society 42 (2013): 323-328.
-
nan_strategy
:string
Default: 'replace'
How to handle missing values: can be either 'drop_samples' to remove samples with missing values, 'drop_features' to remove features (columns) with missing values, 'replace' to replace all missing values with the nan_replace_value, or 'drop_sample_pairs' to drop each pair of missing observables separately before calculating the corresponding coefficient. Missing values are None and np.nan.
-
nan_replace_value
:any
Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'
-
ax
: Matplotlib Axes
Default: None
Matplotlib Axes on which the heat-map will be plotted
-
figsize
:(float, float) or None
Default: None
A Matplotlib figure-size tuple. If None, will attempt to set the size automatically. Only used if ax=None.
-
annot
:Boolean
Default: True
Plot number annotations on the heat-map
-
fmt
:string
Default: '.2f'
String formatting of annotations
-
cmap
: Matplotlib colormap or None
Default: None
A colormap to be used for the heat-map. If None, falls back to Seaborn's heat-map default
-
sv_color
:string
Default: 'silver'
A Matplotlib color. The color to be used when displaying single-value features over the heat-map
-
cbar
:Boolean
Default: True
Display heat-map's color-bar
-
vmax
:float
Default: 1.0
Set heat-map vmax option.
-
vmin
:float or None
Default: None
Set heat-map vmin option. If set to None, vmin will be chosen automatically between 0 and -1.0, depending on the types of associations used (-1.0 if Pearson's R is used, 0 otherwise).
-
plot
:Boolean
Default: True
Plot a heat-map of the correlation matrix. If False, the heat-map will still be drawn, but not shown. The heat-map's ax is part of this function's output.
-
compute_only
:Boolean
Default: False
Use this flag only if you have no need of the plotting at all. This skips the entire plotting mechanism (similar to the old compute_associations method).
-
clustering
:Boolean
Default: False
If True, the computed associations will be sorted into groups by similar correlations
-
title
:string or None
Default: None
Plotted graph title.
-
filename
:string or None
Default: None
If not None, plot will be saved to the given file name.
-
multiprocessing
:Boolean
Default: False
If True, use multiprocessing to speed up computations. If False, falls back to single-core computation.
-
max_cpu_cores
:int or None
Default: None
If not None, ProcessPoolExecutor will use the given number of CPU cores.
Returns: A dictionary with the following keys:
- corr: A DataFrame of the correlation/strength-of-association between all features
- ax: A Matplotlib Axes
Example: See examples.
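As an illustration of the default choice of measure described at the top of this section, the following minimal sketch picks a measure name for each pair of columns based only on whether each column is nominal. It is not the library's implementation: pick_measure and pairwise_measures are hypothetical helper names, the column names are invented, and the returned strings are labels rather than computed coefficients.

```python
from itertools import combinations


def pick_measure(a_is_nominal, b_is_nominal, nom_nom_assoc="cramer"):
    """Return the name of the default measure for one pair of columns."""
    if a_is_nominal and b_is_nominal:
        # categorical-categorical: Cramer's V by default, Theil's U on request
        return "Theil's U" if nom_nom_assoc == "theil" else "Cramer's V"
    if a_is_nominal or b_is_nominal:
        # categorical-continuous
        return "Correlation Ratio"
    # continuous-continuous
    return "Pearson's R"


def pairwise_measures(columns, nominal_columns, nom_nom_assoc="cramer"):
    """Map every unordered pair of distinct columns to a measure name."""
    nominal = set(nominal_columns)
    return {
        (a, b): pick_measure(a in nominal, b in nominal, nom_nom_assoc)
        for a, b in combinations(columns, 2)
    }


# Hypothetical column names, used only for illustration.
measures = pairwise_measures(
    ["age", "income", "city", "gender"], nominal_columns=["city", "gender"]
)
for pair, name in measures.items():
    print(pair, "->", name)
```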
cluster_correlations
cluster_correlations(corr_mat, indexes=None)
Apply agglomerative clustering in order to sort a correlation matrix. Based on this clustering example.
-
corr_mat
:Pandas DataFrame
A correlation matrix (as output from associations)
-
indexes
:list / NumPy ndarray / Pandas Series
A sequence of cluster indexes for sorting. If not present, a clustering is performed.
Returns:
- a sorted correlation matrix (pd.DataFrame)
- cluster indexes based on the original dataset (list)
Example:
>>> # customers is assumed to be an existing Pandas DataFrame
>>> assoc = associations(customers, plot=False)
>>> correlations = assoc['corr']
>>> correlations, _ = cluster_correlations(correlations)
compute_associations
Deprecated
compute_associations was deprecated and removed. Use associations(compute_only=True)['corr'].
conditional_entropy
conditional_entropy(x, y, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE, log_base=math.e)
Given measurements x and y of random variables \(X\) and \(Y\), calculates the conditional entropy of \(X\) given \(Y\):
\[ S(X|Y) = - \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)} \]
Read more on Wikipedia.
-
x
:list / NumPy ndarray / Pandas Series
A sequence of measurements
-
y
:list / NumPy ndarray / Pandas Series
A sequence of measurements
-
nan_strategy
:string
Default: 'replace'
How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value
:any
Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
-
log_base
:float
Default: math.e
Base of the logarithm used when calculating the entropy.
Returns: float
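The definition of conditional entropy can be sketched with the standard library alone, estimating probabilities from empirical frequencies. This is an illustrative sketch, not the library's code: conditional_entropy_sketch is a hypothetical name, and missing-value handling (nan_strategy) is left out entirely.

```python
import math
from collections import Counter


def conditional_entropy_sketch(x, y, log_base=math.e):
    """Estimate S(X|Y) from paired observations using empirical frequencies."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    joint_counts = Counter(zip(x, y))  # occurrences of each (x, y) pair
    y_counts = Counter(y)              # occurrences of each y value
    entropy = 0.0
    for (_, y_val), count in joint_counts.items():
        p_xy = count / n
        p_y = y_counts[y_val] / n
        # -p(x,y) * log(p(x,y) / p(y)) rewritten as p(x,y) * log(p(y) / p(x,y))
        entropy += p_xy * math.log(p_y / p_xy, log_base)
    return entropy


# y fully determines x, so no uncertainty about x remains
determined = conditional_entropy_sketch(list("aabb"), list("XXYY"))
# x and y are independent and x is a fair coin, so one full bit remains
independent = conditional_entropy_sketch(list("abab"), list("XXYY"), log_base=2)
print(determined, independent)
```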
correlation_ratio
correlation_ratio(categories, measurements, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)
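Calculates the Correlation Ratio (\(\eta\)) for categorical-continuous association:
\[ \eta = \sqrt{\frac{\sum_x n_x (\bar{y}_x - \bar{y})^2}{\sum_{x,i} (y_{xi} - \bar{y})^2}} \]
where \(n_x\) is the number of observations in category \(x\), and we define:
\[ \bar{y}_x = \frac{\sum_i y_{xi}}{n_x}, \quad \bar{y} = \frac{\sum_x n_x \bar{y}_x}{\sum_x n_x} \]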
Answers the question: given a continuous value of a measurement, is it possible to know which category it is associated with? The value is in the range [0,1], where 0 means a category cannot be determined by a continuous measurement, and 1 means a category can be determined with absolute certainty. Read more on Wikipedia.
-
categories
:list / NumPy ndarray / Pandas Series
A sequence of categorical measurements
-
measurements
:list / NumPy ndarray / Pandas Series
A sequence of continuous measurements
-
nan_strategy
:string
Default: 'replace'
How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value
:any
Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
Returns: float in the range of [0,1]
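The correlation ratio is the square root of the between-category variance divided by the total variance, which the following standard-library sketch computes directly. It is an illustration under stated assumptions, not the library's implementation: correlation_ratio_sketch is a hypothetical name, missing values are not handled, and returning 0.0 when all measurements are identical is a choice made for this sketch.

```python
import math
from collections import defaultdict


def correlation_ratio_sketch(categories, measurements):
    """Compute eta for a categorical sequence and a continuous sequence."""
    groups = defaultdict(list)
    for category, value in zip(categories, measurements):
        groups[category].append(float(value))

    n_total = sum(len(values) for values in groups.values())
    overall_mean = sum(sum(values) for values in groups.values()) / n_total

    # Weighted squared distance of each category mean from the overall mean
    between = sum(
        len(values) * (sum(values) / len(values) - overall_mean) ** 2
        for values in groups.values()
    )
    # Squared distance of every single observation from the overall mean
    total = sum(
        (value - overall_mean) ** 2
        for values in groups.values()
        for value in values
    )
    if total == 0:
        return 0.0  # assumption of this sketch: constant measurements give 0
    return math.sqrt(between / total)


# The category fully explains the measurement
perfect = correlation_ratio_sketch(list("aabb"), [1, 1, 5, 5])
# Both categories have the same mean, so the category explains nothing
useless = correlation_ratio_sketch(list("abab"), [1, 1, 5, 5])
print(perfect, useless)
```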
cramers_v
cramers_v(x, y, bias_correction=True, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)
Calculates Cramer's V statistic for categorical-categorical association. This is a symmetric coefficient: \(V(x,y) = V(y,x)\). Read more on Wikipedia.
Original function taken from this answer on StackOverflow.
Cramer's V limitations when applied on skewed or small datasets
As the Cramer's V measure of association depends directly on the counts of each samples-pair in the data, it tends to be suboptimal when applied on skewed or small datasets.
Consider each of the following cases, where we would expect Cramer's V to reach a high value, yet this only happens in the first scenario:
>>> x = ['a'] * 400 + ['b'] * 100
>>> y = ['X'] * 400 + ['Y'] * 100
>>> cramers_v(x,y)
0.9937374102534072
# skewed dataset
>>> x = ['a'] * 500 + ['b'] * 1
>>> y = ['X'] * 500 + ['Y'] * 1
>>> cramers_v(x,y)
0.4974896903293253
# very small dataset
>>> x = ['a'] * 4 + ['b'] * 1
>>> y = ['X'] * 4 + ['Y'] * 1
>>> cramers_v(x,y)
0.0
-
x
:list / NumPy ndarray / Pandas Series
A sequence of categorical measurements
-
y
:list / NumPy ndarray / Pandas Series
A sequence of categorical measurements
-
bias_correction
:Boolean
Default: True
Use bias correction from Wicher Bergsma, Journal of the Korean Statistical Society 42 (2013): 323-328.
-
nan_strategy
:string
Default: 'replace'
How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value
:any
Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
Returns: float in the range of [0,1]
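To see where the three values in the skewed and small dataset examples come from, here is a standard-library sketch of Cramer's V with the bias correction. It assumes that the chi-squared statistic uses Yates' continuity correction for 2x2 tables; that assumption is consistent with the values shown above but is not stated in this documentation. cramers_v_sketch is a hypothetical name, and missing values are not handled.

```python
import math
from collections import Counter


def cramers_v_sketch(x, y, bias_correction=True):
    """Cramer's V from two categorical sequences (illustrative only)."""
    n = len(x)
    if n < 2:
        return float("nan")
    observed = Counter(zip(x, y))
    row_totals, col_totals = Counter(x), Counter(y)
    r, k = len(row_totals), len(col_totals)

    # Assumption: Yates' continuity correction when there is 1 degree of freedom
    use_yates = (r - 1) * (k - 1) == 1
    chi2 = 0.0
    for row, row_count in row_totals.items():
        for col, col_count in col_totals.items():
            expected = row_count * col_count / n
            diff = abs(observed[(row, col)] - expected)
            if use_yates:
                diff -= min(0.5, diff)
            chi2 += diff ** 2 / expected

    phi2 = chi2 / n
    r_eff, k_eff = float(r), float(k)
    if bias_correction:
        # Shrink phi^2 and the effective table dimensions for finite samples
        phi2 = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
        r_eff = r - (r - 1) ** 2 / (n - 1)
        k_eff = k - (k - 1) ** 2 / (n - 1)
    denominator = min(k_eff - 1, r_eff - 1)
    if denominator == 0:
        return float("nan")  # undefined when a variable has a single value
    return math.sqrt(phi2 / denominator)


# The same three datasets as in the example above
balanced = cramers_v_sketch(["a"] * 400 + ["b"] * 100, ["X"] * 400 + ["Y"] * 100)
skewed = cramers_v_sketch(["a"] * 500 + ["b"] * 1, ["X"] * 500 + ["Y"] * 1)
small = cramers_v_sketch(["a"] * 4 + ["b"] * 1, ["X"] * 4 + ["Y"] * 1)
print(balanced, skewed, small)
```

In the small dataset the bias correction subtracts more than the uncorrected statistic provides, so the result is clipped to zero even though x and y agree on every sample.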
identify_nominal_columns
identify_nominal_columns(dataset)
Given a dataset, identify categorical columns. This is used internally in associations and numerical_encoding, but can also be used directly.
Note:
This is a shortcut for data_utils.identify_columns_by_type(dataset, include=['object', 'category'])
dataset
:np.ndarray / pd.DataFrame
Returns: list of categorical columns
Example:
>>> df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a'], 'col2': [3, 4, 2, 1]})
>>> identify_nominal_columns(df)
['col1']
identify_numeric_columns
identify_numeric_columns(dataset)
Given a dataset, identify numeric columns.
Note:
This is a shortcut for data_utils.identify_columns_by_type(dataset, include=['int64', 'float64'])
dataset
:np.ndarray / pd.DataFrame
Returns: list of numerical columns
Example:
>>> df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a'], 'col2': [3, 4, 2, 1], 'col3': [1., 2., 3., 4.]})
>>> identify_numeric_columns(df)
['col2', 'col3']
numerical_encoding
numerical_encoding(dataset, nominal_columns='auto', drop_single_label=False, drop_fact_dict=True, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)
Encodes a data-set with mixed data (numerical and categorical) into a numerical-only data-set, using the following logic:
- categorical columns with only a single value will be marked as zero (or dropped, if requested)
- categorical columns with two values will be replaced with the result of Pandas factorize
- categorical columns with more than two values will be replaced with the result of Pandas get_dummies
- numerical columns will not be modified
-
dataset
:NumPy ndarray / Pandas DataFrame
The data-set to encode
-
nominal_columns
:sequence / string
Default: 'auto'
Names of columns of the data-set which hold categorical values. Can also be the string 'all' to state that all columns are categorical, 'auto' (default) to identify nominal columns automatically, or None to state none are categorical (nothing happens)
-
drop_single_label
:Boolean
Default: False
If True, nominal columns with only a single value will be dropped.
-
drop_fact_dict
:Boolean
Default: True
If True, the return value will be the encoded DataFrame alone. If False, it will be a tuple of the DataFrame and the dictionary of the binary factorization (originating from pd.factorize)
-
nan_strategy
:string
Default: 'replace'
How to handle missing values: can be either 'drop_samples' to remove samples with missing values, 'drop_features' to remove features (columns) with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value
:any
Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'
Returns: pd.DataFrame or (pd.DataFrame, dict). If drop_fact_dict is True, returns the encoded DataFrame. Otherwise, returns a tuple of the encoded DataFrame and a dictionary, where each key is a two-value column and each value is that column's original labels, as supplied by Pandas factorize. The dictionary will be empty if no two-value columns are present in the data-set.
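The four encoding rules listed above can be sketched without Pandas, using a plain dictionary of column name to list of values in place of a DataFrame. This is an illustration only: numerical_encoding_sketch is a hypothetical name, the integer codes and one-hot column names are assigned in order of first appearance as a simplification, and the exact column naming and ordering produced by Pandas get_dummies may differ.

```python
def numerical_encoding_sketch(table, nominal_columns, drop_single_label=False):
    """Encode a dict of column name -> list of values into numbers only."""
    encoded = {}
    fact_dict = {}  # original labels of every two-value column
    for name, values in table.items():
        if name not in nominal_columns:
            encoded[name] = list(values)  # numerical columns are kept as-is
            continue
        labels = list(dict.fromkeys(values))  # unique labels, first-seen order
        if len(labels) == 1:
            if not drop_single_label:
                encoded[name] = [0] * len(values)
        elif len(labels) == 2:
            # Two values: replace each label with an integer code
            codes = {label: index for index, label in enumerate(labels)}
            encoded[name] = [codes[value] for value in values]
            fact_dict[name] = labels
        else:
            # More than two values: one indicator column per label
            for label in labels:
                encoded[f"{name}_{label}"] = [int(v == label) for v in values]
    return encoded, fact_dict


# Hypothetical data, used only for illustration.
table = {
    "color": ["r", "g", "b", "r"],
    "flag": ["y", "n", "y", "y"],
    "const": ["k", "k", "k", "k"],
    "size": [1.5, 2.0, 0.5, 3.0],
}
encoded, fact_dict = numerical_encoding_sketch(table, {"color", "flag", "const"})
for column, values in encoded.items():
    print(column, values)
```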
replot_last_associations
replot_last_associations(ax=None, figsize=None, annot=None, fmt=None, cmap=None, sv_color=None, cbar=None, vmax=None, vmin=None, plot=True, title=None, filename=None)
Re-plot the last computed associations heat-map. This method performs no new computations; it only allows changing the visual output of the last computed heat-map.
-
ax
: Matplotlib Axes
Default: None
Matplotlib Axes on which the heat-map will be plotted
-
figsize
:(int,int) or None
Default: None
A Matplotlib figure-size tuple. If None, uses the last associations call value. Only used if ax=None.
-
annot
:Boolean or None
Default: None
Plot number annotations on the heat-map. If None, uses the last associations call value.
-
fmt
:string
Default: None
String formatting of annotations. If None, uses the last associations call value.
-
cmap
: Matplotlib colormap or None
Default: None
A colormap to be used for the heat-map. If None, uses the last associations call value.
-
sv_color
:string
Default: None
A Matplotlib color. The color to be used when displaying single-value features over the heat-map. If None, uses the last associations call value.
-
cbar
:Boolean or None
Default: None
Display heat-map's color-bar. If None, uses the last associations call value.
-
vmax
:float or None
Default: None
Set heat-map vmax option. If None, uses the last associations call value.
-
vmin
:float or None
Default: None
Set heat-map vmin option. If None, uses the last associations call value.
-
plot
:Boolean
Default: True
Plot a heat-map of the correlation matrix. If False, plotting still happens, but the heat-map will not be displayed.
-
title
:string or None
Default: None
Plotted graph title. If None, uses the last associations call value.
-
filename
:string or None
Default: None
If not None, plot will be saved to the given file name. Note: in order to avoid accidental file overwrites, the last associations call value is never used, and when filename is set to None, no writing to file occurs.
Returns: A Matplotlib Axes
theils_u
theils_u(x, y, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)
Calculates Theil's U statistic (Uncertainty coefficient) for categorical-categorical association, defined as:
\[ U(X|Y) = \frac{S(X) - S(X|Y)}{S(X)} \]
where \(S(X)\) is the entropy of \(X\) and \(S(X|Y)\) is the conditional entropy of \(X\) given \(Y\).
This is the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x. This is an asymmetric coefficient: \(U(x,y) \neq U(y,x)\). Read more on Wikipedia.
-
x
:list / NumPy ndarray / Pandas Series
A sequence of categorical measurements
-
y
:list / NumPy ndarray / Pandas Series
A sequence of categorical measurements
-
nan_strategy
:string
Default: 'replace'
How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value
:any
Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
Returns: float in the range of [0,1]
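The asymmetry of Theil's U is easiest to see on a small example. The sketch below uses only the standard library and estimates entropies from empirical frequencies. It is not the library's implementation: theils_u_sketch and its two helpers are hypothetical names, missing values are not handled, and returning 1.0 when x has a single value is a convention chosen for this sketch.

```python
import math
from collections import Counter


def entropy_sketch(x):
    """Empirical entropy S(X) in nats."""
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in Counter(x).values())


def conditional_entropy_sketch(x, y):
    """Empirical conditional entropy S(X|Y) in nats."""
    n = len(x)
    y_counts = Counter(y)
    return sum(
        (count / n) * math.log(y_counts[y_val] / count)
        for (_, y_val), count in Counter(zip(x, y)).items()
    )


def theils_u_sketch(x, y):
    """U(X|Y): the fraction of the uncertainty in x that y explains."""
    s_x = entropy_sketch(x)
    if s_x == 0:
        return 1.0  # assumption of this sketch: nothing is left to explain
    return (s_x - conditional_entropy_sketch(x, y)) / s_x


# y has four distinct values and pins down x completely,
# while x only narrows y down to two of its four values.
x = ["a", "a", "b", "b"]
y = ["W", "X", "Y", "Z"]
u_x_given_y = theils_u_sketch(x, y)
u_y_given_x = theils_u_sketch(y, x)
print(u_x_given_y, u_y_given_x)
```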