data_utils
identify_columns_with_na
identify_columns_with_na(dataset)
Given a dataset, return the names of the columns that contain NA values, sorted in descending order by their number of NAs.
- dataset : np.ndarray / pd.DataFrame
Returns: A pd.DataFrame of two columns (['column', 'na_count']), containing only the names of columns with NA values, sorted in descending order by their number of NA values.
Example:
>>> df = pd.DataFrame({'col1': ['a', np.nan, 'a', 'a'], 'col2': [3, np.nan, 2, np.nan], 'col3': [1., 2., 3., 4.]})
>>> identify_columns_with_na(df)
  column  na_count
1   col2         2
0   col1         1
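The behaviour described above can be approximated with plain pandas. The following is a minimal sketch, not the library's implementation; the function name columns_with_na_sketch is hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def columns_with_na_sketch(dataset):
    """Hypothetical re-creation of the documented behaviour (an assumption, not library code)."""
    df = pd.DataFrame(dataset)            # accepts an np.ndarray or a pd.DataFrame
    na_counts = df.isna().sum()           # Series: column name -> number of NAs
    na_counts = na_counts[na_counts > 0]  # keep only columns that contain NAs
    result = na_counts.reset_index()
    result.columns = ['column', 'na_count']
    # Sorting after reset_index keeps each column's original position as the row label
    return result.sort_values('na_count', ascending=False)
```

Sorting after resetting the index is what leaves the row labels out of order (1, then 0) in the documented example.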
identify_columns_by_type
identify_columns_by_type(dataset, include)
Given a dataset, identify columns of the types requested.
- dataset : np.ndarray / pd.DataFrame
- include : list
  The column types to filter by.
Returns: list of the names of the columns whose types match those given in include
Example:
>>> df = pd.DataFrame({'col1': ['a', 'b', 'c', 'a'], 'col2': [3, 4, 2, 1], 'col3': [1., 2., 3., 4.]})
>>> identify_columns_by_type(df, include=['int64', 'float64'])
['col2', 'col3']
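A similar result can be obtained with pandas' select_dtypes. This is a sketch under the assumption that include holds dtype names pandas understands; columns_by_type_sketch is a hypothetical name, not part of the library.

```python
import pandas as pd


def columns_by_type_sketch(dataset, include):
    """Hypothetical equivalent built on DataFrame.select_dtypes (an assumption)."""
    df = pd.DataFrame(dataset)  # accepts an np.ndarray or a pd.DataFrame
    # select_dtypes keeps only the columns whose dtype is listed in `include`
    return list(df.select_dtypes(include=include).columns)
```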
one_hot_encode
one_hot_encode(arr, classes=None)
One-hot encode a 1D array. Based on this StackOverflow answer.
- arr : array-like
  An array to be one-hot encoded. Must contain only non-negative integers.
- classes : int or None
  Number of classes. If None, it is inferred from the maximum value of the array (maximum + 1, as in the example below).
Returns: 2D one-hot encoded array
Example:
>>> one_hot_encode([1,0,5])
[[0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]]
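One common way to produce this encoding with NumPy is integer-array indexing into a zero matrix. The sketch below is an illustration only; one_hot_sketch is a hypothetical name, and the use of max + 1 columns when classes is None is inferred from the example above.

```python
import numpy as np


def one_hot_sketch(arr, classes=None):
    """Hypothetical one-hot encoder for a 1D array of non-negative integers."""
    arr = np.asarray(arr, dtype=int)
    if classes is None:
        classes = arr.max() + 1  # assumption: values 0..max each get a column
    encoded = np.zeros((arr.size, classes))
    # For row i, set the column given by arr[i] to 1
    encoded[np.arange(arr.size), arr] = 1
    return encoded
```

Passing classes explicitly is useful when the array at hand does not contain the largest class label, for example when encoding a single batch.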
split_hist
split_hist(dataset, values, split_by, title='', xlabel='', ylabel=None, figsize=None, legend='best', plot=True, **hist_kwargs)
Plot a histogram of values from a given dataset, split by the values of a chosen column.
- dataset : pd.DataFrame
- values : string
  The column name of the values to be displayed in the histogram.
- split_by : string
  The column name of the values to split the histogram by.
- title : string or None, default = ''
  The plot's title. If an empty string, will be '{values} by {split_by}'.
- xlabel : string or None, default = ''
  x-axis label. If an empty string, will be '{values}'.
- ylabel : string or None, default = None
  y-axis label.
- figsize : (int, int) or None, default = None
  A Matplotlib figure-size tuple. If None, falls back to Matplotlib's default.
- legend : string or None, default = 'best'
  A Matplotlib legend location string. See the Matplotlib documentation for possible options.
- plot : Boolean, default = True
  Plot the histogram.
- hist_kwargs : key-value pairs
  Key-value pairs to be passed to the Matplotlib hist method. See the Matplotlib documentation for possible options.
Returns: A Matplotlib Axes
Example: See examples.
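For orientation, a split histogram of this kind can be sketched with pandas groupby and Matplotlib. This is a rough approximation, not the library's code: split_hist_sketch is a hypothetical name, overlaying one histogram per group on a single Axes is an assumption, and the plot parameter is omitted because its exact behaviour is not described here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd


def split_hist_sketch(dataset, values, split_by, title='', xlabel='',
                      ylabel=None, figsize=None, legend='best', **hist_kwargs):
    """Hypothetical approximation: one overlaid histogram per group (an assumption)."""
    fig, ax = plt.subplots(figsize=figsize)
    for group, subset in dataset.groupby(split_by):
        # Each distinct value of `split_by` gets its own histogram and legend entry
        ax.hist(subset[values], label=str(group), **hist_kwargs)
    if title is not None:
        ax.set_title(title if title != '' else f'{values} by {split_by}')
    if xlabel is not None:
        ax.set_xlabel(xlabel if xlabel != '' else values)
    if ylabel:
        ax.set_ylabel(ylabel)
    if legend:
        ax.legend(loc=legend)
    return ax


# Usage with made-up data; the column names 'age' and 'group' are illustrative only
df = pd.DataFrame({'age': [20, 30, 40, 50], 'group': ['a', 'b', 'a', 'b']})
ax = split_hist_sketch(df, 'age', 'group', alpha=0.5)
```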