The first step when you start working with a new dataset is to understand it. To do that you need to find out, for example, the range of values each variable takes, the variable types, and the number of missing values.
The pandas library provides many useful tools for exploratory data analysis (EDA). Before using them, though, you usually have to start with more general functions such as df.describe(). The functionality such functions offer is limited, and the initial steps of working with a dataset during EDA are very often almost identical from one dataset to the next.
The author of the article we are publishing today says he is not a fan of performing repetitive actions. While looking for tools that make exploratory data analysis fast and efficient, he found the pandas-profiling library.
Here we look at what the pandas-profiling library can do, using the Titanic dataset as an example.
Exploratory data analysis with pandas
I decided to experiment with pandas-profiling on the Titanic dataset because it contains data of several different types and has missing values. I find the pandas-profiling library especially interesting when the data has not yet been cleaned and needs further processing that depends on its characteristics. To carry out that processing successfully, you need to know where to start and what to pay attention to. This is where the capabilities of pandas-profiling come in handy.
First, we import the data and use pandas to get descriptive statistics:
# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np
# import the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')
# compute descriptive statistics
df.describe()
Running this code produces the output shown in the following figure.
Descriptive statistics obtained with standard pandas tools
There is a lot of useful information here, but not everything that would be interesting to know about the data under study. For example, you might assume that the data frame (the DataFrame structure) has 891 rows. If you need to verify that, you need another line of code to determine the size of the frame. These calculations are not particularly resource-intensive, but repeating them every time wastes time that would be better spent cleaning the data.
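The extra checks mentioned above can be gathered into one small helper. What follows is a minimal sketch that uses only standard pandas calls. The `quick_overview` name and the toy data frame are assumptions made for illustration and are not taken from the Titanic files.

```python
import numpy as np
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count and share, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "p_missing": df.isna().mean(),
        "n_unique": df.nunique(),
    })


# Toy data standing in for a real dataset (hypothetical values).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

n_rows, n_cols = toy.shape  # the frame size, which needs its own line of code
overview = quick_overview(toy)
print(n_rows, n_cols)
print(overview)
```

Even a helper like this has to be written and carried from project to project, which is the kind of repetition the author wanted to avoid.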
Exploratory data analysis with pandas-profiling
Now let's do the same thing with pandas-profiling:
pandas_profiling.ProfileReport(df)
Running the line of code above generates a report with exploratory data analysis indicators. The code shown above outputs the findings about the data, but it can also be made to produce an HTML file that you can, for example, show to someone else.
The first part of the report contains an Overview section with basic information about the data (number of observations, number of variables, and so on). It also contains a list of alerts that tell the analyst what needs special attention. These alerts can hint at where to focus data-cleaning effort.
The Overview section of the report
Exploratory analysis of variables
Below the Overview section, the report shows useful information about each variable. Among other things, it includes small charts describing the distribution of each variable.
Information about the numeric variable Age
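As the previous example shows, pandas-profiling provides several useful indicators, such as the percentage and number of missing values, along with the descriptive statistics we have already seen. Because Age is a numeric variable, its distribution is visualized as a histogram, from which we can conclude that the distribution is skewed to the right.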
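The visual impression of a right-skewed histogram can be cross-checked numerically with `Series.skew()`. The sketch below uses made-up ages rather than the Titanic values; a positive result means a longer right tail, which usually also pulls the mean above the median.

```python
import pandas as pd

# Hypothetical ages: most values are low, a few large ones form a long right tail.
ages = pd.Series([2, 18, 19, 21, 22, 24, 25, 28, 30, 36, 54, 71, 80], name="Age")

skewness = ages.skew()  # sample skewness; positive means a longer right tail
mean_age, median_age = ages.mean(), ages.median()

print(f"skewness={skewness:.2f}, mean={mean_age:.1f}, median={median_age:.1f}")
```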
For a categorical variable, the output differs slightly from what is shown for a numeric variable.
Information about the categorical variable Sex
Specifically, instead of the mean, minimum, and maximum, the pandas-profiling library found the number of classes. Because Sex is a binary variable, its values are represented by two classes.
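The class counts reported for a categorical variable correspond to what plain pandas returns from `nunique()` and `value_counts()`. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical values for a binary categorical column.
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

n_classes = sex.nunique()                  # number of distinct classes
counts = sex.value_counts()                # absolute frequency per class
shares = sex.value_counts(normalize=True)  # relative frequency per class

print(n_classes)
print(counts)
print(shares)
```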
If, like me, you enjoy digging into code, you may be curious how exactly the pandas-profiling library computes these metrics. Finding out is not hard, because the library's code is open and available on GitHub. I am not a big fan of using black boxes in my projects, so I took a look at the library's source code. For example, this is the mechanism for processing numeric variables, implemented by the describe_numeric_1d function:
def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).
    Also create histograms (mini an full) of its distribution.
    Parameters
    ----------
    series : Series
        The variable to describe.
    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)
This piece of code may look large and complicated, but it is actually very easy to follow. The key point is that the library's source code contains a function that determines the type of each variable. When the library finds that it is dealing with a numeric variable, the function above computes the metrics we have been looking at. The function uses standard pandas operations on Series objects, such as series.mean(). The results are stored in the stats dictionary. The histograms are generated with an adapted version of the matplotlib.pyplot.hist function. The purpose of the adaptation is to let the function work with different kinds of datasets.
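The dispatch described above (decide the variable type, then compute the matching statistics) can be sketched with public pandas helpers. This is a simplified illustration, not the library's actual implementation: the `describe_column` name and the reduced set of statistics are assumptions. The sketch also avoids `Series.mad()` and `np.NaN`, which the quoted source uses but which were removed in pandas 2.0 and NumPy 2.0 respectively.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_column(series: pd.Series) -> pd.Series:
    """Summarize one column, choosing the statistics by its dtype."""
    stats = {"n_missing": int(series.isna().sum())}
    if is_numeric_dtype(series):
        stats["type"] = "numeric"
        stats["mean"] = series.mean()
        stats["std"] = series.std()
        q25, q75 = series.quantile([0.25, 0.75])
        stats["iqr"] = q75 - q25
        # The coefficient of variation is undefined when the mean is zero.
        stats["cv"] = stats["std"] / stats["mean"] if stats["mean"] else np.nan
        stats["n_zeros"] = int((series == 0).sum())
    else:
        stats["type"] = "categorical"
        stats["n_classes"] = series.nunique()
    return pd.Series(stats, name=series.name)


# Hypothetical columns used only to exercise both branches.
fare = pd.Series([0.0, 10.0, 20.0, 30.0, np.nan], name="Fare")
port = pd.Series(["S", "C", "S", None], name="Embarked")

fare_stats = describe_column(fare)
port_stats = describe_column(port)
print(fare_stats)
print(port_stats)
```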
Correlation indicators and a sample of the data under study
After the results of the variable analysis, pandas-profiling shows the Pearson and Spearman correlation matrices in the Correlations section.
Pearson correlation matrix
If necessary, you can set the thresholds used when calculating correlations in the line of code that triggers report generation. This lets you specify how strong a correlation must be to count as important for your analysis.
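Both matrices can also be reproduced with plain pandas through `DataFrame.corr`. The sketch below uses invented columns and an arbitrary cut-off of 0.9 to list strongly correlated pairs. The cut-off and the filtering step are assumptions for illustration and do not reproduce the report's own threshold setting.

```python
import pandas as pd

# Hypothetical numeric columns: y is linear in x, z is monotonic but non-linear in x.
data = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
data["y"] = 2 * data["x"] + 1
data["z"] = data["x"] ** 3
data["w"] = [3, 1, 4, 1, 5]

pearson = data.corr(method="pearson")    # linear association
spearman = data.corr(method="spearman")  # rank-based (monotonic) association

threshold = 0.9  # assumed cut-off; pick whatever suits the analysis
strong_pairs = [
    (a, b, round(pearson.loc[a, b], 3))
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) >= threshold
]

print(pearson.round(3))
print(spearman.round(3))
print(strong_pairs)
```

The Spearman coefficient for x and z is 1 because z is a strictly increasing function of x, while the Pearson coefficient for the same pair stays below 1 because the relationship is not linear.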
Finally, the Sample section of the pandas-profiling report shows, as an example, a piece of data taken from the beginning of the dataset. This approach can lead to unpleasant surprises, because the first few observations may not reflect the characteristics of the dataset as a whole.
The section containing a sample of the data under study
For that reason, I do not recommend relying on this last section. It is better to use the command df.sample(5), which randomly selects 5 observations from the dataset.
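The difference between looking at the first rows and drawing a random sample is easiest to see on data that happens to be sorted. The frame below is hypothetical, and `random_state` is fixed only so the draw is repeatable.

```python
import pandas as pd

# Hypothetical frame sorted by class, so the first rows all share one value.
df_sorted = pd.DataFrame({
    "Pclass": [1] * 10 + [2] * 10 + [3] * 10,
    "PassengerId": range(1, 31),
})

first_rows = df_sorted.head(5)                     # only ever shows Pclass == 1 here
random_rows = df_sorted.sample(5, random_state=0)  # drawn from the whole frame

print(first_rows["Pclass"].unique())
print(random_rows)
```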
Summary
To sum up, the pandas-profiling library gives the analyst several useful capabilities that come in handy when you need to get a quick, rough idea of the data or hand an exploratory analysis report to someone else. The real work with the data, which takes its particular features into account, is still done by hand, just as it is without pandas-profiling.
If you would like to see how a complete exploratory data analysis looks in a single Jupyter notebook, see below.
Dear readers! Where do you start when analyzing a new dataset?
Source: habr.com