Speeding up exploratory data analysis with the pandas-profiling library

The first step when you start working with a new dataset is to understand it. To do that you need, for example, to find out the ranges of values the variables take, their types, and the number of missing values.

The pandas library gives us many useful tools for exploratory data analysis (EDA). Before using them, though, you usually have to start with more general functions such as df.describe(). Keep in mind that the capabilities of such functions are limited, and that the initial steps of working with any dataset during EDA are very often much the same.


The author of the article we are publishing today says he is not a fan of repetitive work. While looking for tools to perform exploratory data analysis quickly and efficiently, he found the pandas-profiling library. Its output is presented not as a handful of individual indicators but as a fairly detailed HTML report containing most of what you might need to know about the data before you start working with it more closely.

Here we look at how to use the pandas-profiling library, with the Titanic dataset as an example.

Exploratory data analysis with pandas

I decided to try pandas-profiling on the Titanic dataset because of the variety of data types it contains and the presence of missing values. I believe pandas-profiling is especially interesting when the data has not yet been cleaned and needs further processing that depends on its characteristics. To do that processing well, you need to know where to start and what to pay attention to. This is where the capabilities of pandas-profiling come in handy.

First, we import the data and use pandas to obtain descriptive statistics:

# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np

# load the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')

# compute descriptive statistics
df.describe()

After running this piece of code, you get what is shown in the following figure.

Descriptive statistics obtained with standard pandas tools

Although there is a lot of useful information here, it does not cover everything that would be interesting to know about the data under study. For example, one might assume that the DataFrame has 891 rows. If that needs checking, another line of code is required to determine the size of the frame. These computations are not particularly resource-intensive, but repeating them all the time inevitably wastes time that would be better spent cleaning the data.
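As a rough illustration of the extra lines this involves, the sketch below checks the size, column types and missing-value counts with plain pandas. The tiny frame is made-up stand-in data, not the real Titanic file; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data (assumption: not the real file).
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0],
        "Sex": ["male", "female", "female", "male"],
        "Fare": [7.25, 71.28, 7.92, 53.10],
    }
)

n_rows, n_cols = df.shape   # size of the frame
column_types = df.dtypes    # type of each column
missing = df.isna().sum()   # number of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing)
```

Each check is a one-liner, but they have to be retyped for every new dataset, which is exactly the repetition the article complains about.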

Exploratory data analysis with pandas-profiling

Now let's do the same with pandas-profiling:

pandas_profiling.ProfileReport(df)

Running the line above generates a report with exploratory data analysis indicators. As written, the code displays the results inline, but you can also have it produce an HTML file that you can, for example, show to someone else.
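A minimal sketch of the HTML export might look like the following. It assumes that pandas-profiling is installed and that the report object offers a to_file method taking the output path as its first argument. The helper name save_profile_report and the file name titanic_report.html are hypothetical, chosen only for illustration.

```python
import pandas as pd


def save_profile_report(df: pd.DataFrame, path: str = "titanic_report.html") -> str:
    """Build a pandas-profiling report for df and write it to an HTML file."""
    # Imported inside the function so the helper can be defined even where
    # pandas-profiling is not installed (hypothetical convenience).
    import pandas_profiling

    profile = pandas_profiling.ProfileReport(df)
    # The output path is passed positionally because the keyword name
    # may differ between library versions (assumption).
    profile.to_file(path)
    return path
```

Calling save_profile_report(df) on the frame loaded earlier should leave an HTML file in the working directory that can be opened in any browser.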

The first part of the report is the Overview section, which gives basic information about the data (number of observations, number of variables, and so on). It also contains a list of warnings that tell the analyst what deserves special attention. These warnings can hint at where to focus your data-cleaning effort.

Overview section of the report

Exploratory Variable Analysis

Below the Overview section of the report you can find useful information about each variable. It includes, among other things, small charts describing the distribution of each variable.

Details for the numeric variable Age

As you can see from the previous example, pandas-profiling gives us several useful indicators, such as the percentage and count of missing values, as well as the descriptive statistics we have already seen. Since Age is a numeric variable, visualizing its distribution as a histogram lets us conclude that the distribution is skewed to the right.
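The right skew read off the histogram can also be checked numerically. The sketch below uses made-up ages rather than the real Titanic column. A positive skewness, together with a mean above the median, points to a longer right tail.

```python
import pandas as pd

# Made-up ages with a long tail of older passengers (illustrative only).
ages = pd.Series([2, 18, 19, 21, 22, 24, 25, 27, 30, 45, 62, 80], name="Age")

skewness = ages.skew()  # sample skewness; positive means a longer right tail
mean_age, median_age = ages.mean(), ages.median()

print(f"skewness={skewness:.2f}, mean={mean_age:.1f}, median={median_age:.1f}")
```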

For a categorical variable, the output differs slightly from what is shown for a numeric one.

Details for the categorical variable Sex

Namely, instead of the mean, minimum and maximum, pandas-profiling reports the number of classes. Since Sex is a binary variable, its values fall into two classes.
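The same class-level view can be approximated with plain pandas. This is only a sketch on made-up values, not a reproduction of how the report computes it.

```python
import pandas as pd

# Made-up values for a binary categorical column (illustrative only).
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

n_classes = sex.nunique()                  # number of distinct classes
counts = sex.value_counts()                # observations per class
shares = sex.value_counts(normalize=True)  # share of each class

print(n_classes)
print(counts)
print(shares)
```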

If you like digging into code as much as I do, you may be interested in how exactly pandas-profiling calculates these metrics. Since the library's code is open and available on GitHub, finding out is not hard. I am not a big fan of using black boxes in my projects, so I took a look at the library's source code. For example, here is the mechanism for processing numeric variables, represented by the function describe_numeric_1d:

def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).
    Also create histograms (mini an full) of its distribution.
    Parameters
    ----------
    series : Series
        The variable to describe.
    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)

Although this piece of code may look rather large and complex, it is actually easy to understand. The library's source code contains a function that determines the types of variables. If the library encounters a numeric variable, the function above computes the metrics we have been looking at. It uses standard pandas operations on Series objects, such as series.mean(). The results are stored in the stats dictionary. Histograms are generated with an adapted version of matplotlib.pyplot.hist. The adaptation is meant to ensure that the function can work with different kinds of datasets.
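To make the idea concrete, here is a simplified, hypothetical re-creation of that flow: a small dispatcher checks the variable type and, for numeric data, calls a trimmed-down summary function. The names describe_1d and summarize_numeric are inventions for this sketch, not the library's API. The histogram steps and the series.mad() and np.NaN calls are left out, since recent pandas and NumPy releases no longer provide them.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_numeric(series: pd.Series) -> pd.Series:
    """Trimmed-down numeric summary in the spirit of the function above."""
    clean = series.dropna()
    stats = {
        "mean": series.mean(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max(),
    }
    stats["range"] = stats["max"] - stats["min"]
    for q in (0.05, 0.25, 0.5, 0.75, 0.95):
        stats[f"{q:.0%}"] = clean.quantile(q)  # keys such as "25%"
    stats["iqr"] = stats["75%"] - stats["25%"]
    stats["skewness"] = series.skew()
    stats["n_zeros"] = int((series == 0).sum())
    stats["p_zeros"] = stats["n_zeros"] / len(series)
    return pd.Series(stats, name=series.name)


def describe_1d(series: pd.Series) -> pd.Series:
    """Pick a summary based on the variable type (hypothetical dispatcher)."""
    if is_numeric_dtype(series):
        return summarize_numeric(series)
    return series.value_counts()


# Made-up values, including one missing entry and one zero.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, 0.0], name="Age")
summary = describe_1d(ages)
print(summary)
```

Keeping the results in a dictionary and converting it to a Series at the end mirrors the structure of the quoted function, which makes it easy to add or drop individual metrics.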

Correlations and a sample of the data under study

After the results of the variable analysis, pandas-profiling shows the Pearson and Spearman correlation matrices in the Correlations section.

Pearson correlation matrix

If necessary, you can set the threshold values used when calculating correlation in the line of code that triggers report generation. This lets you specify what strength of correlation you consider important for your analysis.
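For comparison, both matrices can be computed directly with pandas. The sketch below uses made-up columns and a hypothetical cut-off of 0.9, applied by hand. It is not the library's own threshold mechanism.

```python
import pandas as pd

# Made-up columns (illustrative only).
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3     # monotonic but non-linear in x
df["z"] = [5, 3, 4, 1, 2]  # loosely decreasing in x

pearson = df.corr(method="pearson")    # linear association
spearman = df.corr(method="spearman")  # rank-based (monotonic) association

threshold = 0.9  # hypothetical cut-off for a "strong" correlation
strong = pearson.abs() >= threshold

print(pearson.round(3))
print(spearman.round(3))
print(strong)
```

Because y is a monotonic function of x, the Spearman coefficient for that pair is 1, while the Pearson coefficient stays below 1. This is why it is useful to look at both matrices.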

Finally, in the Sample section, the pandas-profiling report shows a piece of the data taken from the beginning of the dataset. This approach can lead to unpleasant surprises, because the first few observations may form a sample that does not reflect the characteristics of the whole dataset.

Section containing a sample of the data under study

For this reason, I do not recommend paying attention to this last section. It is better to use the command df.sample(5), which randomly selects 5 observations from the dataset.
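The difference is easy to see on data that happens to be sorted. In the made-up frame below, the first five rows all belong to one class, whereas a random draw is taken from the whole frame. The random_state argument is added here only to make the example repeatable.

```python
import pandas as pd

# Made-up frame sorted by class, so the first rows all look alike.
df = pd.DataFrame(
    {
        "Pclass": [1] * 10 + [2] * 10 + [3] * 10,
        "PassengerId": range(1, 31),
    }
)

first_rows = df.head(5)                     # what a "first rows" view shows
random_rows = df.sample(5, random_state=0)  # random draw from the whole frame

print(first_rows)
print(random_rows)
```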

Summary

To sum up, the pandas-profiling library gives the analyst some useful capabilities that come in handy when you need to get a quick, rough idea of the data or to hand someone a report on an exploratory analysis. The real work with the data, which takes its characteristics into account, is still done by hand, just as it is without pandas-profiling.

If you want to see what a complete exploratory data analysis looks like in a single Jupyter notebook, take a look at this project of mine, created using nbviewer. The corresponding code can be found in this GitHub repository.

Dear readers! Where do you start when analyzing a new dataset?


Source: www.habr.com
