Speeding up exploratory data analysis with the pandas-profiling library

The first step when starting to work with a new dataset is to understand it. To do this, you need, for example, to find out the ranges of values the variables take and their types, and to learn how many values are missing.

The pandas library gives us many useful tools for exploratory data analysis (EDA). But before using them, you usually start with more general functions such as df.describe(). It should be noted, however, that the capabilities of such functions are limited, and that the initial stages of working with any dataset during EDA are very often similar.

The author of the material we are publishing today says he is not a fan of repetitive actions. So, while looking for tools that make exploratory data analysis fast and efficient, he found the pandas-profiling library. Its output is not a handful of individual indicators but a fairly detailed HTML report containing most of the information about a dataset that you need to know before you start working with it.

Here we will look at the features of the pandas-profiling library, using the Titanic dataset as an example.

Exploratory data analysis with pandas

I decided to experiment with pandas-profiling on the Titanic dataset because of the variety of data types it contains and because it has missing values. I believe pandas-profiling is especially interesting when the data has not yet been cleaned and needs further processing that depends on its characteristics. To carry out such processing successfully, you need to know where to start and what to pay attention to. This is where pandas-profiling comes in handy.

First, we import the data and use pandas to get descriptive statistics:

# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np

# import the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')

# вычислСниС ΠΏΠΎΠΊΠ°Π·Π°Ρ‚Π΅Π»Π΅ΠΉ ΠΎΠΏΠΈΡΠ°Ρ‚Π΅Π»ΡŒΠ½ΠΎΠΉ статистики
df.describe()

After running this code, you will get what is shown in the following figure.

Descriptive statistics obtained with standard pandas tools

Although there is a lot of useful information here, it does not include everything that would be interesting to know about the dataset under study. For example, one might assume that the DataFrame has 891 rows. If this needs to be checked, another line of code is required to determine the size of the frame. These computations are not particularly resource-intensive, but repeating them every time wastes time that would probably be better spent cleaning the data.
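The extra checks mentioned above are each a one-liner in pandas. Here is a minimal sketch on a small made-up frame: the column names mimic the Titanic data, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic frame; the values are invented.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

n_rows, n_cols = df.shape    # size of the frame
dtypes = df.dtypes           # type of every column
missing = df.isna().sum()    # number of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(missing)
```

None of these calls is expensive, but together with df.describe() they make up the boilerplate that has to be retyped for every new dataset.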

Exploratory data analysis with pandas-profiling

Now let's do the same thing using pandas-profiling:

pandas_profiling.ProfileReport(df)

Running the line of code above generates a report with the exploratory data analysis indicators. As written, the code displays the report inline, but you can also have it write an HTML file that you can, for example, show to someone else.
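Writing the report to a file might look like the sketch below. It assumes that the installed pandas-profiling release exposes a to_file method on the report object; the helper name save_profile_report and the file name used afterwards are made up for illustration.

```python
import pandas as pd


def save_profile_report(df: pd.DataFrame, path: str) -> None:
    """Write a pandas-profiling report for df to an HTML file at path."""
    # Imported lazily so that this sketch can be loaded even where
    # pandas-profiling is not installed.
    import pandas_profiling

    report = pandas_profiling.ProfileReport(df)
    # The path is passed positionally because the keyword name of this
    # argument has differed between releases of the library.
    report.to_file(path)
```

With the Titanic frame loaded as above, calling save_profile_report(df, "titanic_report.html") should produce a standalone HTML file that can be opened in a browser.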

The first part of the report is the Overview section, which gives basic information about the data (the number of observations, the number of variables, and so on). It also contains a list of warnings that tell the analyst what needs special attention. These warnings can hint at where to focus your data-cleaning efforts.

The Overview section of the report
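To get a feel for what such an overview and its warnings summarize, here is a rough approximation in plain pandas. It is not the library's own logic: the 50% limit and the wording of the warning are assumptions chosen for illustration, and the data is invented.

```python
import numpy as np
import pandas as pd

# Invented data with one badly incomplete column.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Age": [22.0, np.nan, np.nan, 35.0, np.nan],
    "Sex": ["male", "female", "female", "male", "male"],
})

overview = {
    "n_observations": len(df),
    "n_variables": df.shape[1],
    "missing_cells": int(df.isna().sum().sum()),
}

# Hypothetical rule of thumb: warn about columns that are more than half empty.
MISSING_SHARE_LIMIT = 0.5
missing_share = df.isna().mean()
warnings = [
    f"{col} has {share:.0%} missing values"
    for col, share in missing_share.items()
    if share > MISSING_SHARE_LIMIT
]

print(overview)
print(warnings)
```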

Exploratory analysis of variables

Below the Overview section of the report you can find useful information about each variable. Among other things, it includes small charts that describe the distribution of each variable.

Information about the numeric variable Age

As you can see from the previous example, pandas-profiling gives us several useful indicators, such as the percentage and number of missing values, as well as the descriptive statistics we have already seen. Since Age is a numeric variable, visualizing its distribution as a histogram lets us conclude that the distribution is skewed to the right.
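The same indicators can be reproduced by hand, which is a useful sanity check on the report. The sketch below uses an invented Age series, not the real Titanic column; a positive skewness value corresponds to a longer right tail.

```python
import numpy as np
import pandas as pd

# Invented ages with two missing entries and one much older passenger.
age = pd.Series(
    [18, 19, 21, 22, 24, 25, 28, 30, 36, 71, np.nan, np.nan],
    name="Age",
    dtype=float,
)

n_missing = int(age.isna().sum())    # number of missing values
p_missing = n_missing / len(age)     # share of missing values
skewness = age.skew()                # NaN values are skipped by default

print(f"missing: {n_missing} ({p_missing:.1%})")
print(f"skewness: {skewness:.2f}")
```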

For a categorical variable, the output differs slightly from what is shown for a numeric variable.

Information about the categorical variable Sex

Namely, instead of finding the mean, minimum, and maximum, pandas-profiling finds the number of classes. Since Sex is a binary variable, its values are represented by two classes.
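In plain pandas, the class-based view of a categorical variable can be approximated as follows. This is a sketch on invented values, not the library's own code.

```python
import pandas as pd

# Invented values for a binary categorical variable.
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

n_classes = sex.nunique()                     # number of distinct classes
counts = sex.value_counts()                   # observations per class
shares = sex.value_counts(normalize=True)     # share of each class

print(n_classes)
print(counts)
print(shares)
```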

If you like digging into code as much as I do, you may be interested in how exactly pandas-profiling computes these indicators. Since the library's code is open and available on GitHub, finding out is not difficult. I am not a big fan of using black boxes in my projects, so I took a look at the library's source code. For example, this is the mechanism for processing numeric variables, represented by the describe_numeric_1d function:

def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).
    Also create histograms (mini an full) of its distribution.
    Parameters
    ----------
    series : Series
        The variable to describe.
    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)

Although this piece of code may look rather large and complex, it is actually very easy to understand. The library's source code contains a function that determines the types of the variables. If the library turns out to have encountered a numeric variable, the function above computes the indicators we have been looking at. This function uses standard pandas operations for working with Series objects, such as series.mean(). The computed results are stored in the stats dictionary. Histograms are generated with an adapted version of the matplotlib.pyplot.hist function. The adaptation is meant to ensure that the function can work with different types of datasets.
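The idea of detecting a variable's type first and then dispatching to a type-specific routine can be illustrated with a much simplified sketch. This is not the library's implementation: the function names, the "NUM" and "CAT" labels, and the chosen statistics are assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_numeric(series: pd.Series) -> dict:
    """Summary statistics for a numeric variable."""
    return {
        "type": "NUM",
        "mean": series.mean(),
        "min": series.min(),
        "max": series.max(),
    }


def describe_categorical(series: pd.Series) -> dict:
    """Summary statistics for a non-numeric variable."""
    return {"type": "CAT", "distinct_count": series.nunique()}


def describe_variable(series: pd.Series) -> pd.Series:
    """Pick the description routine based on the dtype of the series."""
    # Note: is_numeric_dtype also returns True for boolean columns.
    if is_numeric_dtype(series):
        stats = describe_numeric(series)
    else:
        stats = describe_categorical(series)
    return pd.Series(stats, name=series.name)


age_stats = describe_variable(pd.Series([1.0, 2.0, 3.0], name="Age"))
sex_stats = describe_variable(pd.Series(["male", "female", "male"], name="Sex"))
print(age_stats)
print(sex_stats)
```

As in the function quoted above, each routine collects its results in a dictionary and returns them as a Series named after the variable.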

Correlations and a sample of the data under study

After the results of the analysis of the variables, pandas-profiling displays the Pearson and Spearman correlation matrices in the Correlations section.

Pearson correlation matrix

If necessary, you can set the threshold values used when computing correlations in the line of code that generates the report. By doing this, you can specify what strength of correlation you consider important for your analysis.
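For comparison, both matrices and a threshold filter can be sketched directly in pandas. The data is invented, and THRESHOLD is a hypothetical cut-off applied by hand here; it is not the name of a pandas-profiling parameter.

```python
import pandas as pd

# Invented numeric columns loosely modelled on the Titanic data.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 30.0],
})

pearson = df.corr(method="pearson")      # linear correlation
spearman = df.corr(method="spearman")    # rank correlation

# Hypothetical cut-off: keep only pairs whose absolute Pearson
# correlation exceeds it.
THRESHOLD = 0.9
strong_pairs = [
    (a, b)
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) > THRESHOLD
]

print(pearson.round(2))
print(spearman.round(2))
print(strong_pairs)
```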

Finally, in its Sample section, the pandas-profiling report shows, as an example, a piece of data taken from the beginning of the dataset. This approach can lead to unpleasant surprises, since the first few observations may form a sample that does not reflect the characteristics of the dataset as a whole.

The section containing a sample of the data under study

For this reason, I do not recommend paying attention to this last section. Instead, it is better to use the df.sample(5) command, which randomly selects 5 observations from the dataset.
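The difference between the two approaches is easy to see on a toy frame. In the sketch below, the random_state argument is an addition that makes the random selection reproducible; the seed value 42 is arbitrary.

```python
import pandas as pd

# Toy frame with 100 consecutively numbered rows.
df = pd.DataFrame({"PassengerId": range(1, 101)})

head_rows = df.head(5)                        # always the first five rows
random_rows = df.sample(5, random_state=42)   # five rows drawn at random

print(head_rows)
print(random_rows)
```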

Summary

To sum up, pandas-profiling gives the analyst some useful capabilities that come in handy when you need to get a rough idea of the data quickly or hand someone a report on exploratory data analysis. At the same time, the real work with the data, which takes its characteristics into account, is still done manually, just as it is without pandas-profiling.

If you want to see what the whole exploratory data analysis looks like in a single Jupyter notebook, take a look at this project of mine, rendered with nbviewer. And in this GitHub repository you can find the corresponding code.

Dear readers! Where do you start when you analyze a new dataset?

Source: www.hab.com
