The first step in getting started with a new dataset is understanding it: finding out, for example, the ranges of values its variables take, their types, and the number of missing values.
The pandas library provides many useful tools for Exploratory Data Analysis (EDA). But before using them, you usually start with more general functions such as df.describe(). The possibilities such functions offer are limited, however, and the initial stages of EDA on any dataset tend to look very much alike.
The author of the material we are publishing today says he is not a fan of repetitive work. Looking for tools to perform exploratory data analysis quickly and efficiently, he found the pandas-profiling library.
Here we will look at how to use pandas-profiling, with the Titanic dataset as an example.
Exploratory data analysis with pandas
I decided to experiment with pandas-profiling on the Titanic dataset because it has different data types and missing values. I believe that the pandas-profiling library is especially interesting in cases where the data has not yet been cleaned and requires further processing depending on its characteristics. In order to successfully perform such processing, you need to know where to start and what to look for. This is where pandas-profiling comes in handy.
First, let's import the data and use pandas to get descriptive statistics:
# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np
# import the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')
# compute descriptive statistics
df.describe()
After executing this piece of code, you will get what is shown in the following figure.
Descriptive statistics obtained using standard pandas tools
Although there is a lot of useful information here, it does not tell us everything we might want to know about the data. For example, we can guess that the DataFrame has 891 rows, but checking that requires one more line of code to determine the frame's size. Such calculations are not particularly expensive, but repeating them over and over wastes time that would probably be better spent cleaning the data.
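Checks like these are one-liners in plain pandas. A minimal sketch, using a small toy frame standing in for the Titanic data (the real file lives at a local path):

```python
import pandas as pd

# A toy frame standing in for the Titanic data
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # number of missing values per column
```

Quick to type, but it is exactly this kind of repeated boilerplate that pandas-profiling is meant to replace.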
Exploratory data analysis with pandas-profiling
Now let's do the same using pandas-profiling:
pandas_profiling.ProfileReport(df)
Running the above line of code generates a report with exploratory data analysis indicators. The code shown above outputs the details inline, but it can also be made to produce an HTML file that, for example, can be shared with someone.
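A sketch of the HTML export, assuming the ProfileReport.to_file method available in pandas-profiling releases (the toy frame and output file name here are my own; the import is guarded because the library is an optional third-party dependency):

```python
import pandas as pd

# A small stand-in frame; the article uses the Titanic train.csv
df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

try:
    import pandas_profiling

    # Build the profile once, then write it out as a standalone
    # HTML file that can be shared with colleagues.
    profile = pandas_profiling.ProfileReport(df)
    profile.to_file("titanic_report.html")
    status = "written"
except ImportError:
    status = "pandas_profiling not installed"

print(status)
```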
The first part of the report will contain an Overview section giving basic information about the data (number of observations, number of variables, and so on). In addition, it will contain a list of warnings notifying the analyst of what to pay special attention to. These warnings can serve as clues as to where to focus your data cleansing efforts.
Overview report section
Exploratory analysis of variables
After the Overview section, the report provides useful information about each variable, including small charts describing each variable's distribution.
Information about the numeric variable Age
As you can see from the previous example, pandas-profiling gives us several useful indicators, such as the number and percentage of missing values, along with the descriptive statistics we have already seen. Because Age is a numeric variable, visualizing its distribution as a histogram lets us see that it is skewed to the right.
When considering a categorical variable, the output measures are slightly different from those found for a numeric variable.
Information about the categorical variable Sex
Namely, instead of the mean, minimum, and maximum, the pandas-profiling library reports the number of classes. Because Sex is a binary variable, its values fall into two classes.
If you're like me and love digging into code, you might wonder how exactly pandas-profiling calculates these metrics. Finding out is not hard, since the library's code is open and available on GitHub. Not being a big fan of using "black boxes" in my projects, I took a look at the source. For example, this is how numeric variables are processed, in the describe_numeric_1d function:
def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).

    Also create histograms (mini and full) of its distribution.

    Parameters
    ----------
    series : Series
        The variable to describe.

    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"

    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']

    # To avoid computing it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)

    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)

    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)

    return pd.Series(stats, name=series.name)
Although this piece of code may look large and complex, it is actually very easy to understand. The library's source contains a function that determines the type of each variable; when it encounters a numeric one, the function above computes the indicators we just examined. It relies on standard pandas operations on Series objects, such as series.mean(). The results are stored in the stats dictionary. Histograms are built with an adapted version of matplotlib.pyplot.hist; the adaptation lets the function work with different kinds of datasets.
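The same quantities are easy to reproduce by hand with plain pandas. A minimal sketch mirroring a few of the keys computed by describe_numeric_1d on a toy Series (the data values here are my own):

```python
import numpy as np
import pandas as pd

series = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0], name="Age")

# A few of the stats keys from describe_numeric_1d, computed directly
stats = {
    "mean": series.mean(),   # NaN is skipped by pandas reductions
    "std": series.std(),
    "min": series.min(),
    "max": series.max(),
}
stats["range"] = stats["max"] - stats["min"]

# Drop NaN once so the quantiles don't have to
_series_no_na = series.dropna()
for percentile in (0.25, 0.5, 0.75):
    stats["{:.0%}".format(percentile)] = _series_no_na.quantile(percentile)
stats["iqr"] = stats["75%"] - stats["25%"]

# Zeros counted over the full series length, as in the library code
stats["n_zeros"] = len(series) - np.count_nonzero(series)

print(pd.Series(stats, name=series.name))
```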
Correlation scores and data sample
After the per-variable analysis, pandas-profiling outputs Pearson and Spearman correlation matrices in the Correlations section.
Pearson correlation matrix
If necessary, you can set threshold values for the correlation calculation in the line of code that generates the report, specifying how strong a correlation must be to matter for your analysis.
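The matrices themselves are the same ones plain pandas computes. A sketch of building both matrices and flagging pairs above a threshold of one's choosing (the toy frame and the 0.9 cutoff are arbitrary examples of mine, not library defaults):

```python
import pandas as pd

df = pd.DataFrame({
    "Age":    [22, 38, 26, 35, 54],
    "Fare":   [7.25, 71.28, 7.92, 53.10, 51.86],
    "Pclass": [3, 1, 3, 1, 1],
})

# The two matrices the Correlations section is built from
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Flag variable pairs whose absolute correlation exceeds a chosen threshold
# (excluding the trivial 1.0 on the diagonal)
threshold = 0.9
strong = (pearson.abs() > threshold) & (pearson.abs() < 1.0)

print(pearson.round(2))
print(strong)
```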
Finally, in the pandas-profiling report, in the Sample section, a piece of data taken from the beginning of the dataset is displayed as an example. This approach can lead to unpleasant surprises, since the first few observations may represent a sample that does not reflect the features of the entire data set.
Section containing a sample of the researched data
As a result, I do not recommend relying on this last section. Instead, use the command df.sample(5), which randomly selects 5 observations from the dataset.
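A minimal sketch on a toy frame (random_state is optional and only pins the draw for reproducibility):

```python
import pandas as pd

df = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 4],
})

# Five rows drawn at random, rather than the first five in file order
sample = df.sample(5, random_state=42)
print(sample)
```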
Results
In summary, the pandas-profiling library gives the analyst several useful features that come in handy when you need a quick, rough overview of the data or a ready-made exploratory analysis report to hand to someone. The real work with the data, taking its particular features into account, is still performed by hand, just as it would be without pandas-profiling.
If you want to see what the whole exploratory analysis looks like in a single Jupyter notebook, take a look at
Dear readers, how do you start analyzing new datasets?
Source: habr.com