The first step in getting started with a new dataset is understanding it: finding out, for example, the ranges of values its variables take, their types, and the number of missing values.
The pandas library provides many useful tools for Exploratory Data Analysis (EDA). But before using them, you usually start with more general functions such as df.describe(). The possibilities such functions offer are limited, however, and the initial stages of EDA on any dataset tend to look very much alike.
The author of the material we are publishing today says he is not a fan of repetitive work. Looking for tools to perform exploratory data analysis quickly and efficiently, he found the pandas-profiling library.
Here we will look at how to use pandas-profiling, with the Titanic dataset as an example.
Exploratory data analysis with pandas
I decided to experiment with pandas-profiling on the Titanic dataset because it has different data types and missing values. I believe that the pandas-profiling library is especially interesting in cases where the data has not yet been cleaned and requires further processing depending on its characteristics. In order to successfully perform such processing, you need to know where to start and what to look for. This is where pandas-profiling comes in handy.
First, let's import the data and use pandas to get descriptive statistics:
# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np
# import the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')
# compute descriptive statistics
df.describe()
After executing this piece of code, you will get what is shown in the following figure.
Descriptive statistics obtained using standard pandas tools
Although there is a lot of useful information here, it does not tell us everything we might want to know about the data. For example, we can guess that the DataFrame has 891 rows, but checking that requires one more line of code to determine the frame's size. Such calculations are not particularly expensive, but repeating them over and over wastes time that would probably be better spent cleaning the data.
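Checks like these are one-liners in plain pandas. A minimal sketch, using a small toy frame standing in for the Titanic data (the real file lives at a local path):

```python
import pandas as pd

# A toy frame standing in for the Titanic data
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # number of missing values per column
```

Quick to type, but it is exactly this kind of repeated boilerplate that pandas-profiling is meant to replace.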
Exploratory data analysis with pandas-profiling
Now let's do the same using pandas-profiling:
pandas_profiling.ProfileReport(df)
Running the above line of code generates a report with exploratory data analysis indicators. The code shown above outputs the details inline, but it can also be made to produce an HTML file that, for example, can be shared with someone.
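A sketch of the HTML export, assuming the ProfileReport.to_file method available in pandas-profiling releases (the toy frame and output file name here are my own; the import is guarded because the library is an optional third-party dependency):

```python
import pandas as pd

# A small stand-in frame; the article uses the Titanic train.csv
df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

try:
    import pandas_profiling

    # Build the profile once, then write it out as a standalone
    # HTML file that can be shared with colleagues.
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file("titanic_report.html")
    status = "written"
except ImportError:
    status = "pandas_profiling not installed"

print(status)
```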
The first part of the report will contain an Overview section giving basic information about the data (number of observations, number of variables, and so on). In addition, it will contain a list of warnings notifying the analyst of what to pay special attention to. These warnings can serve as clues as to where to focus your data cleansing efforts.
Overview report section
Exploratory analysis of variables
After the Overview section, the report provides useful information about each variable, including small charts describing each variable's distribution.
Information about the numeric variable Age
As you can see from the previous example, pandas-profiling gives us several useful indicators, such as the number and percentage of missing values, along with the descriptive statistics we have already seen. Because Age is a numeric variable, visualizing its distribution as a histogram lets us see that it is skewed to the right.
When considering a categorical variable, the output measures are slightly different from those found for a numeric variable.
Information about the categorical variable Sex
Namely, instead of the mean, minimum, and maximum, the pandas-profiling library reports the number of classes. Because Sex is a binary variable, its values fall into two classes.
If you're like me and love digging into code, you might wonder how exactly pandas-profiling calculates these metrics. Finding out is not hard, since the library's code is open and available on GitHub. Not being a big fan of using "black boxes" in my projects, I took a look at the source. For example, this is how numeric variables are processed, in the describe_numeric_1d function:
def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).

    Also create histograms (mini and full) of its distribution.

    Parameters
    ----------
    series : Series
        The variable to describe.

    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"

    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']

    # To avoid computing it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)

    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)

    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)

    return pd.Series(stats, name=series.name)
Although this piece of code may look large and complex, it is actually very easy to understand. The library's source contains a function that determines the type of each variable; when it encounters a numeric one, the function above computes the indicators we just examined. It relies on standard pandas operations on Series objects, such as series.mean(). The results are stored in the stats dictionary. Histograms are built with an adapted version of matplotlib.pyplot.hist; the adaptation lets the function work with different kinds of datasets.
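The same quantities are easy to reproduce by hand with plain pandas. A minimal sketch mirroring a few of the keys computed by describe_numeric_1d on a toy Series (the data values here are my own):

```python
import numpy as np
import pandas as pd

series = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0], name="Age")

# A few of the stats keys from describe_numeric_1d, computed directly
stats = {
    "mean": series.mean(),   # NaN is skipped by pandas reductions
    "std": series.std(),
    "min": series.min(),
    "max": series.max(),
}
stats["range"] = stats["max"] - stats["min"]

# Drop NaN once so the quantiles don't have to
_series_no_na = series.dropna()
for percentile in (0.25, 0.5, 0.75):
    stats["{:.0%}".format(percentile)] = _series_no_na.quantile(percentile)
stats["iqr"] = stats["75%"] - stats["25%"]

# Zeros counted over the full series length, as in the library code
stats["n_zeros"] = len(series) - np.count_nonzero(series)

print(pd.Series(stats, name=series.name))
```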
Correlation scores and data sample
After the per-variable analysis, pandas-profiling outputs Pearson and Spearman correlation matrices in the Correlations section.
Pearson correlation matrix
If necessary, you can set threshold values for the correlation calculation in the line of code that generates the report, specifying how strong a correlation must be to matter for your analysis.
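The matrices themselves are the same ones plain pandas computes. A sketch of building both matrices and flagging pairs above a threshold of one's choosing (the toy frame and the 0.9 cutoff are arbitrary examples of mine, not library defaults):

```python
import pandas as pd

df = pd.DataFrame({
    "Age":    [22, 38, 26, 35, 54],
    "Fare":   [7.25, 71.28, 7.92, 53.10, 51.86],
    "Pclass": [3, 1, 3, 1, 1],
})

# The two matrices the Correlations section is built from
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Flag variable pairs whose absolute correlation exceeds a chosen threshold
# (excluding the trivial 1.0 on the diagonal)
threshold = 0.9
strong = (pearson.abs() > threshold) & (pearson.abs() < 1.0)

print(pearson.round(2))
print(strong)
```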
Finally, in the pandas-profiling report, in the Sample section, a piece of data taken from the beginning of the dataset is displayed as an example. This approach can lead to unpleasant surprises, since the first few observations may represent a sample that does not reflect the features of the entire data set.
Section containing a sample of the researched data
As a result, I do not recommend relying on this last section. Instead, use the command df.sample(5), which randomly selects 5 observations from the dataset.
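A minimal sketch on a toy frame (random_state is optional and only pins the draw for reproducibility):

```python
import pandas as pd

df = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
})

# Five rows drawn at random, rather than the first five in file order
sample = df.sample(5, random_state=42)
print(sample)
```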
Results
In summary, the pandas-profiling library gives the analyst several useful features that come in handy when you need a quick, rough overview of the data or a ready-made exploratory analysis report to hand to someone. The real work with the data, taking its particular features into account, is still performed by hand, just as it would be without pandas-profiling.
If you want to see what the whole exploratory analysis looks like in a single Jupyter notebook, take a look at
Dear readers, how do you start analyzing new datasets?
Source: habr.com