nominal¶
associations¶
associations(dataset, nominal_columns='auto', numerical_columns=None, mark_columns=False, nom_nom_assoc='cramer', num_num_assoc='pearson', nom_num_assoc='correlation_ratio', symmetric_nom_nom=True, symmetric_num_num=True, display_rows='all', display_columns='all', hide_rows=None, hide_columns=None, cramers_v_bias_correction=True, nan_strategy=_REPLACE, nan_replace_value=_DEFAULT_REPLACE_VALUE, ax=None, figsize=None, annot=True, fmt='.2f', cmap=None, sv_color='silver', cbar=True, vmax=1.0, vmin=None, plot=True, compute_only=False, clustering=False, title=None, filename=None, multiprocessing=False, max_cpu_cores=None)
Calculate the correlation/strength-of-association of features in a data-set with both categorical and continuous features, using:
* Pearson's R for continuous-continuous cases
* Correlation Ratio for categorical-continuous cases
* Cramer's V or Theil's U for categorical-categorical cases
-
dataset: NumPy ndarray / Pandas DataFrame
The data-set for which the features' correlation is computed.
-
nominal_columns: string / list / NumPy ndarray. Default: 'auto'
Names of columns of the data-set which hold categorical values. Can also be the string 'all' to state that all columns are categorical, 'auto' (default) to identify nominal columns automatically, or None to state that none are categorical. Only used if numerical_columns is None.
-
numerical_columns: string / list / NumPy ndarray. Default: None
To be used instead of nominal_columns. Names of columns of the data-set which hold numerical values. Can also be the string 'all' to state that all columns are numerical (equivalent to nominal_columns=None) or 'auto' to try to identify numerical columns (equivalent to nominal_columns='auto'). If None, nominal_columns is used.
-
mark_columns: Boolean. Default: False
If True, the output's column names will have a suffix of '(nom)' or '(con)' based on their type (nominal or continuous), as provided by nominal_columns.
-
nom_nom_assoc: callable / string. Default: 'cramer'
Method signature change
This replaces the theil_u flag, which was used until version 0.6.6.
If callable, a function which receives two pd.Series and returns a single number.
If string, the name of the nominal-nominal (categorical-categorical) association to use:
-
cramer: Cramer's V
-
theil: Theil's U. When selected, heat-map columns are the provided information (meaning: \(U = U(row|col)\))
-
num_num_assoc: callable / string. Default: 'pearson'
If callable, a function which receives two pd.Series and returns a single number.
If string, the name of the numerical-numerical association to use:
-
pearson: Pearson's R
-
spearman: Spearman's R
-
kendall: Kendall's Tau
-
nom_num_assoc: callable / string. Default: 'correlation_ratio'
If callable, a function which receives two pd.Series and returns a single number.
If string, the name of the nominal-numerical association to use:
-
correlation_ratio: correlation ratio
-
symmetric_nom_nom: Boolean. Default: True
Relevant only if nom_nom_assoc is a callable. If so, declare whether the function is symmetric (\(f(x,y) = f(y,x)\)). If False, heat-map values should be interpreted as \(f(row,col)\).
-
symmetric_num_num: Boolean. Default: True
Relevant only if num_num_assoc is a callable. If so, declare whether the function is symmetric (\(f(x,y) = f(y,x)\)). If False, heat-map values should be interpreted as \(f(row,col)\).
-
display_rows: list / string. Default: 'all'
Choose which of the dataset's features will be displayed in the rows of the output's correlations table. If string, can either be a single feature's name or 'all'. Only used if hide_rows is None.
-
display_columns: list / string. Default: 'all'
Choose which of the dataset's features will be displayed in the columns of the output's correlations table. If string, can either be a single feature's name or 'all'. Only used if hide_columns is None.
-
hide_rows: list / string. Default: None
Choose which of the dataset's features will not be displayed in the rows of the output's correlations table. If string, must be a single feature's name. If None, display_rows is used.
-
hide_columns: list / string. Default: None
Choose which of the dataset's features will not be displayed in the columns of the output's correlations table. If string, must be a single feature's name. If None, display_columns is used.
-
cramers_v_bias_correction: Boolean. Default: True
Method signature change
This replaces the bias_correction flag, which was used until version 0.6.6.
Use the bias correction for Cramer's V from Wicher Bergsma, Journal of the Korean Statistical Society 42 (2013): 323-328.
-
nan_strategy: string. Default: 'replace'
How to handle missing values. One of: 'drop_samples' to remove samples with missing values, 'drop_features' to remove features (columns) with missing values, 'replace' to replace all missing values with the nan_replace_value, or 'drop_sample_pairs' to drop each pair of missing observables separately before calculating the corresponding coefficient. Missing values are None and np.nan.
-
nan_replace_value: any. Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
-
ax: Matplotlib Axes. Default: None
Matplotlib Axes on which the heat-map will be plotted.
-
figsize: (float, float) or None. Default: None
A Matplotlib figure-size tuple. If None, will attempt to set the size automatically. Only used if ax=None.
-
annot: Boolean. Default: True
Plot number annotations on the heat-map.
-
fmt: string. Default: '.2f'
String formatting of annotations.
-
cmap: Matplotlib colormap or None. Default: None
A colormap to be used for the heat-map. If None, falls back to Seaborn's heat-map default.
-
sv_color: string. Default: 'silver'
A Matplotlib color. The color to be used when displaying single-value features over the heat-map.
-
cbar: Boolean. Default: True
Display the heat-map's color-bar.
-
vmax: float. Default: 1.0
Set the heat-map vmax option.
-
vmin: float or None. Default: None
Set the heat-map vmin option. If set to None, vmin will be chosen automatically between 0 and -1.0, depending on the types of associations used (-1.0 if Pearson's R is used, 0 otherwise).
-
plot: Boolean. Default: True
Plot a heat-map of the correlation matrix. If False, the heat-map will still be drawn, but not shown. The heat-map's ax is part of this function's output.
-
compute_only: Boolean. Default: False
Use this flag only if you have no need of the plotting at all. This skips the entire plotting mechanism (similar to the old compute_associations method).
-
clustering: Boolean. Default: False
If True, the computed associations will be sorted into groups by similar correlations.
-
title: string or None. Default: None
Plotted graph title.
-
filename: string or None. Default: None
If not None, the plot will be saved to the given file name.
-
multiprocessing: Boolean. Default: False
If True, use multiprocessing to speed up computations. Otherwise, falls back to single-core computation.
-
max_cpu_cores: int or None. Default: None
If not None, ProcessPoolExecutor will use the given number of CPU cores.
Returns: A dictionary with the following keys:
- corr: A DataFrame of the correlation/strength-of-association between all features
- ax: A Matplotlib Axes
Example: See examples.
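The four nan_strategy options described above differ in what they discard. The following is a minimal standard-library sketch of those semantics, assuming a data-set held as a plain dict of equally long column lists. The helper names (is_missing, drop_samples, drop_features, replace, drop_sample_pairs) are hypothetical and only mirror the option names; this is not the library's implementation.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing, as the parameter description does."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_samples(columns):
    """Remove every row (sample) that has a missing value in any column."""
    n_rows = len(next(iter(columns.values())))
    keep = [i for i in range(n_rows)
            if not any(is_missing(col[i]) for col in columns.values())]
    return {name: [col[i] for i in keep] for name, col in columns.items()}


def drop_features(columns):
    """Remove every column (feature) that contains a missing value."""
    return {name: col for name, col in columns.items()
            if not any(is_missing(v) for v in col)}


def replace(columns, nan_replace_value=0.0):
    """Substitute nan_replace_value for each missing value."""
    return {name: [nan_replace_value if is_missing(v) else v for v in col]
            for name, col in columns.items()}


def drop_sample_pairs(x, y):
    """For one pair of columns, drop only the rows missing in x or in y."""
    pairs = [(a, b) for a, b in zip(x, y)
             if not (is_missing(a) or is_missing(b))]
    return [a for a, _ in pairs], [b for _, b in pairs]


# Hypothetical toy data: one gap in "height", one in "weight".
data = {
    "height": [1.0, None, 3.0, 4.0],
    "weight": [10.0, 20.0, float("nan"), 40.0],
    "color": ["r", "g", "b", "r"],
}
```

In this toy data, drop_samples keeps two rows and drop_features keeps only "color", while drop_sample_pairs(data["height"], data["color"]) keeps three rows, because the gap in "weight" is irrelevant to that pair.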
cluster_correlations¶
cluster_correlations(corr_mat, indexes=None)
Apply agglomerative clustering in order to sort a correlation matrix. Based on this clustering example.
-
corr_mat: Pandas DataFrame
A correlation matrix (as output from associations).
-
indexes: list / NumPy ndarray / Pandas Series
A sequence of cluster indexes for sorting. If not present, a clustering is performed.
Returns:
- a sorted correlation matrix (pd.DataFrame)
- cluster indexes based on the original dataset (list)
Example:
>>> assoc = associations(
...     customers,
...     plot=False
... )
>>> correlations = assoc['corr']
>>> correlations, _ = cluster_correlations(correlations)
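To make the role of indexes more concrete, here is a rough sketch of the sorting half only, assuming that features sharing a cluster index are simply placed next to each other. The function sort_by_cluster and the toy matrix are hypothetical. The clustering step itself is not shown, and the library may order features differently.

```python
def sort_by_cluster(matrix, names, indexes):
    """Reorder a square matrix so features with the same cluster index are adjacent."""
    # sorted() is stable, so features keep their relative order within a cluster.
    order = sorted(range(len(names)), key=lambda i: indexes[i])
    sorted_names = [names[i] for i in order]
    # Apply the same permutation to rows and columns.
    sorted_matrix = [[matrix[i][j] for j in order] for i in order]
    return sorted_matrix, sorted_names


# Hypothetical 3-feature correlation matrix and cluster indexes.
names = ["age", "city", "income"]
corr = [
    [1.0, 0.1, 0.8],
    [0.1, 1.0, 0.2],
    [0.8, 0.2, 1.0],
]
cluster_indexes = [2, 1, 2]

sorted_corr, sorted_names = sort_by_cluster(corr, names, cluster_indexes)
```

After sorting, "age" and "income" (both in cluster 2) sit side by side, so their strong association forms a visible block in a heat-map.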
compute_associations¶
Deprecated
compute_associations was deprecated and removed. Use associations(compute_only=True)['corr'].
conditional_entropy¶
conditional_entropy(x, y, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE, log_base=math.e)
Given measurements x and y of random variables \(X\) and \(Y\), calculates the conditional entropy of \(X\) given \(Y\):
\[ S(X|Y) = - \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)} \]
Read more on Wikipedia.
-
x: list / NumPy ndarray / Pandas Series
A sequence of measurements.
-
y: list / NumPy ndarray / Pandas Series
A sequence of measurements.
-
nan_strategy: string. Default: 'replace'
How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value: any. Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
-
log_base: float. Default: math.e
The logarithm base used when calculating the entropy.
Returns: float
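As an illustration of the quantity being computed, the sketch below estimates the conditional entropy of x given y from empirical frequencies, using only the standard library. The name conditional_entropy_sketch is hypothetical, missing-value handling is left out, and the code is not taken from the library.

```python
import math
from collections import Counter


def conditional_entropy_sketch(x, y, log_base=math.e):
    """Empirical conditional entropy of x given y (no missing-value handling)."""
    n = len(x)
    joint_counts = Counter(zip(x, y))   # counts of each (x, y) pair
    y_counts = Counter(y)               # counts of each y value
    entropy = 0.0
    for (_, y_value), count in joint_counts.items():
        p_xy = count / n
        p_y = y_counts[y_value] / n
        # -p(x,y) * log(p(x,y) / p(y)), written as +p(x,y) * log(p(y) / p(x,y))
        entropy += p_xy * math.log(p_y / p_xy, log_base)
    return entropy


# x is fully determined by y: knowing y leaves no uncertainty about x.
determined = conditional_entropy_sketch(["a", "a", "b", "b"], ["X", "X", "Y", "Y"])

# x is independent of y: knowing y does not reduce the uncertainty of x.
independent = conditional_entropy_sketch(["a", "b", "a", "b"], ["X", "X", "Y", "Y"])
```

With the natural logarithm, the second case equals ln 2, the full entropy of a fair two-valued variable. Passing log_base=2 would express the same quantity in bits.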
correlation_ratio¶
correlation_ratio(categories, measurements, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)
Calculates the Correlation Ratio (\(\eta\)) for categorical-continuous association:
\[ \eta = \sqrt{\frac{\sum_x n_x (\bar{y}_x - \bar{y})^2}{\sum_{x,i} (y_{xi} - \bar{y})^2}} \]
where \(n_x\) is the number of observations in category \(x\), and we define:
\[ \bar{y}_x = \frac{\sum_i y_{xi}}{n_x}, \qquad \bar{y} = \frac{\sum_x n_x \bar{y}_x}{\sum_x n_x} \]
Answers the question: given a continuous measurement, is it possible to know which category it is associated with? The value is in the range [0,1], where 0 means a category cannot be determined by a continuous measurement, and 1 means a category can be determined with absolute certainty. Read more on Wikipedia.
-
categories: list / NumPy ndarray / Pandas Series
A sequence of categorical measurements.
-
measurements: list / NumPy ndarray / Pandas Series
A sequence of continuous measurements.
-
nan_strategy: string. Default: 'replace'
How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value: any. Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
Returns: float in the range of [0,1]
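The ratio compares the spread of the per-category means with the total spread of the measurements. A small standard-library sketch of that computation follows. The name correlation_ratio_sketch is hypothetical, missing values are not handled, and returning 0.0 for constant measurements is a choice made for this sketch rather than documented behaviour.

```python
from collections import defaultdict
from math import sqrt


def correlation_ratio_sketch(categories, measurements):
    """Correlation ratio (eta) of continuous measurements grouped by category."""
    groups = defaultdict(list)
    for category, value in zip(categories, measurements):
        groups[category].append(value)

    overall_mean = sum(measurements) / len(measurements)
    # Weighted squared distance of each category mean from the overall mean.
    numerator = sum(
        len(values) * (sum(values) / len(values) - overall_mean) ** 2
        for values in groups.values()
    )
    # Total squared distance of every measurement from the overall mean.
    denominator = sum((value - overall_mean) ** 2 for value in measurements)
    if denominator == 0:
        return 0.0  # constant measurements: assumption made for this sketch
    return sqrt(numerator / denominator)


cats = ["a", "a", "b", "b"]
separated = correlation_ratio_sketch(cats, [1, 1, 5, 5])   # categories fully separate the values
overlapping = correlation_ratio_sketch(cats, [1, 3, 1, 3]) # identical category means
partial = correlation_ratio_sketch(cats, [1, 2, 3, 4])     # somewhere in between
```

The three cases land at the two ends of the [0,1] range and in between: identical category means give 0, values that are constant within each category give 1, and the last case gives the square root of 0.8.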
cramers_v¶
cramers_v(x, y, bias_correction=True, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)
Calculates Cramer's V statistic for categorical-categorical association. This is a symmetric coefficient: \(V(x,y) = V(y,x)\). Read more on Wikipedia.
Original function taken from this answer on StackOverflow.
Cramer's V limitations when applied on skewed or small datasets
As the Cramer's V measure of association depends directly on the counts of each samples-pair in the data, it tends to be suboptimal when applied on skewed or small datasets.
Consider each of the following cases, where we would expect Cramer's V to reach a high value, yet this only happens in the first scenario:
>>> x = ['a'] * 400 + ['b'] * 100
>>> y = ['X'] * 400 + ['Y'] * 100
>>> cramers_v(x,y)
0.9937374102534072
# skewed dataset
>>> x = ['a'] * 500 + ['b'] * 1
>>> y = ['X'] * 500 + ['Y'] * 1
>>> cramers_v(x,y)
0.4974896903293253
# very small dataset
>>> x = ['a'] * 4 + ['b'] * 1
>>> y = ['X'] * 4 + ['Y'] * 1
>>> cramers_v(x,y)
0.0
-
x: list / NumPy ndarray / Pandas Series
A sequence of categorical measurements.
-
y: list / NumPy ndarray / Pandas Series
A sequence of categorical measurements.
-
bias_correction: Boolean. Default: True
Use the bias correction from Wicher Bergsma, Journal of the Korean Statistical Society 42 (2013): 323-328.
-
nan_strategy: string. Default: 'replace'
How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value: any. Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
Returns: float in the range of [0,1]
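To see where the numbers in the limitations note come from, the following standard-library sketch computes a bias-corrected Cramer's V from the chi-squared statistic of the contingency table. It assumes a Yates continuity correction on 2×2 tables (the default behaviour of SciPy's chi2_contingency). That assumption is made here because it matches the three values quoted above to about six decimals, not because the source states it. The name cramers_v_sketch is hypothetical and missing values are not handled.

```python
from collections import Counter
from math import sqrt


def cramers_v_sketch(x, y, bias_correction=True):
    """Cramer's V from a chi-squared statistic, with optional bias correction."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals, col_totals = Counter(x), Counter(y)
    r, k = len(row_totals), len(col_totals)
    if min(r, k) < 2:
        return 0.0  # single-valued input: assumption made for this sketch

    # Assumption: Yates continuity correction when the table has one degree of freedom.
    yates = (r - 1) * (k - 1) == 1
    chi2 = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            diff = abs(joint[(a, b)] - expected)
            if yates:
                diff -= min(0.5, diff)
            chi2 += diff ** 2 / expected

    phi2 = chi2 / n
    if not bias_correction:
        return sqrt(phi2 / min(k - 1, r - 1))

    # Bias-corrected quantities: shrink phi^2 and the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denominator = min(k_corr - 1, r_corr - 1)
    return sqrt(phi2_corr / denominator) if denominator > 0 else 0.0


balanced = cramers_v_sketch(["a"] * 400 + ["b"] * 100, ["X"] * 400 + ["Y"] * 100)
skewed = cramers_v_sketch(["a"] * 500 + ["b"] * 1, ["X"] * 500 + ["Y"] * 1)
tiny = cramers_v_sketch(["a"] * 4 + ["b"] * 1, ["X"] * 4 + ["Y"] * 1)
```

In the very small dataset, the corrected \(\phi^2\) falls below the bias term \((k-1)(r-1)/(n-1)\) and is clipped to zero, which is why a perfectly associated pair still scores 0.0.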
identify_nominal_columns¶
identify_nominal_columns(dataset)
Given a dataset, identify categorical columns. This is used internally in associations and numerical_encoding,
but can also be used directly.
Note:
This is a shortcut for data_utils.identify_columns_by_type(dataset, include=['object', 'category'])
dataset: np.ndarray / pd.DataFrame
Returns: list of categorical columns
Example:
>>> df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a'], 'col2': [3, 4, 2, 1]})
>>> identify_nominal_columns(df)
['col1']
identify_numeric_columns¶
identify_numeric_columns(dataset)
Given a dataset, identify numeric columns.
Note:
This is a shortcut for data_utils.identify_columns_by_type(dataset, include=['int64', 'float64'])
dataset: np.ndarray / pd.DataFrame
Returns: list of numerical columns
Example:
>>> df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a'], 'col2': [3, 4, 2, 1], 'col3': [1., 2., 3., 4.]})
>>> identify_numeric_columns(df)
['col2', 'col3']
numerical_encoding¶
numerical_encoding(dataset, nominal_columns='auto', drop_single_label=False, drop_fact_dict=True, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)
Encodes a data-set with mixed data (numerical and categorical) into a numerical-only data-set, using the following logic:
-
categorical columns with only a single value will be marked as zero (or dropped, if requested)
-
categorical columns with two values will be replaced with the result of Pandas factorize
-
categorical columns with more than two values will be replaced with the result of Pandas get_dummies
-
numerical columns will not be modified
-
dataset: NumPy ndarray / Pandas DataFrame
The data-set to encode.
-
nominal_columns: sequence / string. Default: 'auto'
Names of columns of the data-set which hold categorical values. Can also be the string 'all' to state that all columns are categorical, 'auto' (default) to identify nominal columns automatically, or None to state that none are categorical (nothing happens).
-
drop_single_label: Boolean. Default: False
If True, nominal columns with only a single value will be dropped.
-
drop_fact_dict: Boolean. Default: True
If True, the return value will be the encoded DataFrame alone. If False, it will be a tuple of the DataFrame and the dictionary of the binary factorization (originating from pd.factorize).
-
nan_strategy: string. Default: 'replace'
How to handle missing values: can be either 'drop_samples' to remove samples with missing values, 'drop_features' to remove features (columns) with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value: any. Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
Returns: pd.DataFrame or (pd.DataFrame, dict). If drop_fact_dict is True, returns the encoded DataFrame. Otherwise, returns a tuple of the encoded DataFrame and a dictionary, where each key is a two-value column and each value holds the original labels, as supplied by Pandas factorize. The dictionary will be empty if no two-value columns are present in the data-set.
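The encoding rules listed above can be sketched without Pandas, assuming the data-set is a plain dict of column lists. The helpers encode_nominal_column and encode_dataset are hypothetical, the indicator column names and their order are choices made for this sketch, and the factorization dictionary is not returned. The real function relies on Pandas factorize and get_dummies, whose output may be named and ordered differently.

```python
def encode_nominal_column(name, values, drop_single_label=False):
    """Encode one categorical column following the single / two / many value rules."""
    labels = list(dict.fromkeys(values))  # unique labels, in order of first appearance
    if len(labels) == 1:
        # Single value: mark as zero, or drop the column if requested.
        return {} if drop_single_label else {name: [0] * len(values)}
    if len(labels) == 2:
        # Two values: integer codes 0 and 1 (a factorize-like encoding).
        codes = {label: code for code, label in enumerate(labels)}
        return {name: [codes[value] for value in values]}
    # More than two values: one indicator column per label (a get_dummies-like encoding).
    return {f"{name}_{label}": [int(value == label) for value in values]
            for label in labels}


def encode_dataset(columns, nominal_columns, drop_single_label=False):
    """Encode the listed nominal columns and pass numerical columns through unchanged."""
    encoded = {}
    for name, values in columns.items():
        if name in nominal_columns:
            encoded.update(encode_nominal_column(name, values, drop_single_label))
        else:
            encoded[name] = list(values)
    return encoded


# Hypothetical toy data covering all four rules.
columns = {
    "country": ["IL", "IL", "IL"],   # single value
    "sex": ["f", "m", "f"],          # two values
    "color": ["r", "g", "b"],        # more than two values
    "age": [30, 41, 25],             # numerical
}
encoded = encode_dataset(columns, nominal_columns=["country", "sex", "color"])
```

Here "country" becomes a column of zeros, "sex" becomes 0/1 codes, "color" expands into three indicator columns, and "age" is left untouched.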
replot_last_associations¶
replot_last_associations(ax=None, figsize=None, annot=None, fmt=None, cmap=None, sv_color=None, cbar=None, vmax=None, vmin=None, plot=True, title=None, filename=None)
Re-plot the last computed associations heat-map. This method performs no new computations; it only allows changing the visual output of the last computed heat-map.
-
ax: Matplotlib Axes. Default: None
Matplotlib Axes on which the heat-map will be plotted.
-
figsize: (int, int) or None. Default: None
A Matplotlib figure-size tuple. If None, uses the last associations call value. Only used if ax=None.
-
annot: Boolean or None. Default: None
Plot number annotations on the heat-map. If None, uses the last associations call value.
-
fmt: string. Default: None
String formatting of annotations. If None, uses the last associations call value.
-
cmap: Matplotlib colormap or None. Default: None
A colormap to be used for the heat-map. If None, uses the last associations call value.
-
sv_color: string. Default: None
A Matplotlib color. The color to be used when displaying single-value features. If None, uses the last associations call value.
-
cbar: Boolean or None. Default: None
Display the heat-map's color-bar. If None, uses the last associations call value.
-
vmax: float or None. Default: None
Set the heat-map vmax option. If None, uses the last associations call value.
-
vmin: float or None. Default: None
Set the heat-map vmin option. If None, uses the last associations call value.
-
plot: Boolean. Default: True
Plot a heat-map of the correlation matrix. If False, plotting still happens, but the heat-map will not be displayed.
-
title: string or None. Default: None
Plotted graph title. If None, uses the last associations call value.
-
filename: string or None. Default: None
If not None, the plot will be saved to the given file name. Note: in order to avoid accidental file overwrites, the last associations call value is never used, and when filename is set to None, no writing to file occurs.
Returns: A Matplotlib Axes
theils_u¶
theils_u(x, y, nan_strategy=REPLACE, nan_replace_value=DEFAULT_REPLACE_VALUE)
Calculates Theil's U statistic (Uncertainty coefficient) for categorical-categorical association, defined as:
\[ U(X|Y) = \frac{S(X) - S(X|Y)}{S(X)} \]
where \(S(X)\) is the entropy of \(X\) and \(S(X|Y)\) is the conditional entropy of \(X\) given \(Y\).
This is the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x. This is an asymmetric coefficient: \(U(x,y) \neq U(y,x)\). Read more on Wikipedia.
-
x: list / NumPy ndarray / Pandas Series
A sequence of categorical measurements.
-
y: list / NumPy ndarray / Pandas Series
A sequence of categorical measurements.
-
nan_strategy: string. Default: 'replace'
How to handle missing values: can be either 'drop' to remove samples with missing values, or 'replace' to replace all missing values with the nan_replace_value. Missing values are None and np.nan.
-
nan_replace_value: any. Default: 0.0
The value used to replace missing values with. Only applicable when nan_strategy is set to 'replace'.
Returns: float in the range of [0,1]
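The asymmetry of the coefficient is easiest to see on a small example. The sketch below computes Theil's U from empirical entropies with the standard library only. The function names are hypothetical, missing values are not handled, and returning 1.0 when x has zero entropy is an assumption made for this sketch.

```python
import math
from collections import Counter


def entropy_sketch(x):
    """Empirical entropy S(X), natural logarithm."""
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in Counter(x).values())


def conditional_entropy_sketch(x, y):
    """Empirical conditional entropy S(X|Y), natural logarithm."""
    n = len(x)
    joint_counts, y_counts = Counter(zip(x, y)), Counter(y)
    return sum(
        (count / n) * math.log(y_counts[y_value] / count)  # p(x,y) * log(p(y) / p(x,y))
        for (_, y_value), count in joint_counts.items()
    )


def theils_u_sketch(x, y):
    """Theil's U of x given y: the share of S(X) that is explained by y."""
    s_x = entropy_sketch(x)
    if s_x == 0:
        return 1.0  # x is constant: assumption made for this sketch
    return (s_x - conditional_entropy_sketch(x, y)) / s_x


x = ["a", "a", "b", "c"]
y = ["X", "X", "Y", "Y"]

u_x_given_y = theils_u_sketch(x, y)  # y narrows x down, but cannot tell "b" from "c"
u_y_given_x = theils_u_sketch(y, x)  # x determines y completely
```

In this example knowing x pins down y exactly, so the coefficient of y given x is 1, while knowing y removes only two thirds of the uncertainty about x.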