Quick exploratory data analysis with the pandas-profiling library

The first step when you start working with a new dataset is to understand it. To do that you need to find out, for example, the ranges of values the variables take, their types, and the number of missing values.

The pandas library gives us many useful tools for exploratory data analysis (EDA). But before using them, you usually start with more general functions such as df.describe(). It should be noted, however, that what such functions offer is limited, and the initial steps of EDA are very often the same for every dataset.
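As a rough sketch of those routine first steps, the snippet below collects the type, the number of distinct values, and the number of missing values for each column by hand. The tiny DataFrame and the helper name quick_overview are made up for illustration; they are not part of the Titanic data or of pandas itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})


def quick_overview(df):
    """Return per-column dtype, distinct-value count, and missing-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),      # NaN is not counted as a distinct value
        "n_missing": df.isna().sum(),
    })


overview = quick_overview(toy)
print(overview)
```

Writing this once is easy; the point of the article is that retyping it for every new dataset gets tedious.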

The author of the article we are publishing today says he is not a fan of repetitive work. So, while looking for tools that make exploratory data analysis fast and efficient, he found the pandas-profiling library. Its output is not a handful of separate indicators but a fairly detailed HTML report containing most of what you might want to know about a dataset before working with it in earnest.

Here we look at the features of the pandas-profiling library, using the Titanic dataset as an example.

Exploratory data analysis with pandas

I decided to try pandas-profiling on the Titanic dataset because of the variety of data types it contains and the missing values in it. I believe pandas-profiling is most interesting when the data has not yet been cleaned and needs further processing that depends on its characteristics. To do that processing well, you need to know where to start and what to pay attention to. This is where pandas-profiling comes in handy.

First, we import the data and use pandas to obtain descriptive statistics:

# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np

# load the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')

# compute descriptive statistics
df.describe()

Running this snippet produces the output shown in the following figure.

Descriptive statistics obtained with standard pandas tools

Although there is a lot of useful information here, it does not contain everything worth knowing about the data under study. For example, one might assume that the DataFrame has 891 rows. To check that, another line of code is needed to determine the size of the frame. These computations are not particularly resource-intensive, but repeating them every time wastes time that would be better spent cleaning the data.
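The extra line in question is typically a look at the shape attribute. A minimal sketch follows, using a made-up three-row frame rather than the Titanic file:

```python
import pandas as pd

# Hypothetical toy frame; it only illustrates the call, not the real data.
toy = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")
```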

Exploratory data analysis with pandas-profiling

Now let's do the same with pandas-profiling:

pandas_profiling.ProfileReport(df)

Running the line of code above generates a report with exploratory data analysis indicators. The code shown displays the report inline, but you can also have it written to an HTML file that you can, for example, show to someone else.
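One way to get such a file is the report object's to_file method. The sketch below wraps it in a helper; the name save_profile and the default file name are my own, and the path is passed positionally because, as far as I know, the name of the keyword argument has differed between releases of the library.

```python
def save_profile(df, path="titanic_report.html"):
    """Write a profiling report for df to an HTML file and return the path.

    save_profile and the default file name are hypothetical, chosen for
    illustration only.
    """
    # Imported lazily so a script can still load where the package is missing.
    import pandas_profiling

    report = pandas_profiling.ProfileReport(df)
    report.to_file(path)  # path passed positionally on purpose
    return path
```

With the DataFrame loaded earlier, calling save_profile(df) should leave an HTML file in the working directory that opens in any browser.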

The first part of the report is the Overview section, which gives basic information about the data (number of observations, number of variables, and so on). It also contains a list of warnings that tell the analyst what to look at more closely. These warnings can hint at where to focus your data-cleaning efforts.

Overview section of the report

Exploratory variable analysis

Below the Overview section of the report you can find useful information about each variable. It includes, among other things, small charts describing the distribution of each variable.

About the Age numeric variable

As you can see from the previous example, pandas-profiling gives us several useful indicators, such as the percentage and number of missing values, along with the descriptive statistics we have already seen. Because Age is a numeric variable, visualizing its distribution as a histogram lets us conclude that the distribution is skewed to the right.
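The visual impression of a right skew can also be checked numerically. In the sketch below the ages are invented (they are not the Titanic column); a positive sample skewness and a mean above the median both point to a longer right tail.

```python
import pandas as pd

# Hypothetical ages: mostly young adults plus a few much older people.
ages = pd.Series([2, 18, 19, 21, 22, 24, 25, 27, 30, 36, 58, 71], name="Age")

skewness = ages.skew()                      # same call the library source uses
mean, median = ages.mean(), ages.median()

print(f"skewness={skewness:.2f}, mean={mean:.1f}, median={median:.1f}")
print("right-skewed" if skewness > 0 else "not right-skewed")
```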

For a categorical variable, the output differs slightly from what is shown for a numeric one.

About the Sex categorical variable

Namely, instead of the mean, minimum, and maximum, pandas-profiling reports the number of classes. Because Sex is a binary variable, its values are represented by two classes.
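In plain pandas, comparable class counts come from value_counts. This is only a sketch on made-up values, not the library's own routine for categorical variables:

```python
import pandas as pd

# Hypothetical values standing in for the Sex column.
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

counts = sex.value_counts()                 # absolute frequency of each class
shares = sex.value_counts(normalize=True)   # relative frequency of each class
n_classes = sex.nunique()                   # number of distinct classes

print(counts)
print(shares)
print(f"{n_classes} classes")
```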

If you like to examine code as I do, you may be curious how exactly pandas-profiling computes these metrics. Since the library's code is open and available on GitHub, finding out is not hard. Because I am not a big fan of using black boxes in my projects, I took a look at the library's source code. For example, this is how numeric variables are processed, in the function describe_numeric_1d:

def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).
    Also create histograms (mini an full) of its distribution.
    Parameters
    ----------
    series : Series
        The variable to describe.
    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)

Although this snippet may look long and complicated, it is actually easy to follow. The point is that the library's source contains a function that determines the type of each variable. If the library encounters a numeric variable, the function above computes the metrics we have been looking at. It uses standard pandas operations on Series objects, such as series.mean(). The results are stored in the stats dictionary. Histograms are built with an adapted version of matplotlib.pyplot.hist; the adaptation is meant to let the function work with different kinds of datasets.
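The dispatch-by-type idea can be sketched in a few lines. This is a deliberately simplified illustration, not the library's actual code: the function names describe_numeric, describe_categorical, and describe_1d and the "NUM"/"CAT" labels are my own, and the real implementation distinguishes more types and computes many more statistics.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_numeric(series):
    """A few summary statistics for a numeric column."""
    return {"type": "NUM", "mean": series.mean(), "std": series.std(),
            "min": series.min(), "max": series.max()}


def describe_categorical(series):
    """Class count and most frequent class for a non-numeric column."""
    counts = series.value_counts()
    return {"type": "CAT", "n_classes": int(counts.size), "top": counts.index[0]}


def describe_1d(series):
    """Pick a description routine based on the column's dtype."""
    describe = describe_numeric if is_numeric_dtype(series) else describe_categorical
    return pd.Series(describe(series), name=series.name)


# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]})
summaries = {col: describe_1d(toy[col]) for col in toy.columns}
print(summaries["Age"])
print(summaries["Sex"])
```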

Correlation indicators and a sample of the data

After the results of the variable analysis, pandas-profiling shows the Pearson and Spearman correlation matrices in the Correlations section.

Pearson correlation matrix

If needed, you can set the threshold values used in the correlation calculations in the line of code that triggers report generation. Doing so lets you specify what strength of correlation counts as significant for your analysis.
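To make the idea of a threshold concrete, here is a sketch that computes both matrices with plain pandas and then filters the Pearson matrix by hand. It does not reproduce pandas-profiling's own option; the toy values and the 0.8 cut-off are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names mimic Titanic, the values are invented.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Age": [22, 38, 26, 35, 27, 54],
})

numeric = toy.select_dtypes("number")       # correlations need numeric columns
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

THRESHOLD = 0.8  # hypothetical cut-off for a "strong" correlation

# Walk the upper triangle so each pair of columns is reported once.
strong = [
    (a, b, round(pearson.loc[a, b], 3))
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) >= THRESHOLD
]
print(strong)
```

In this toy frame only the pair Pclass and Fare clears the cut-off, and its coefficient is negative: cheaper tickets go with a higher class number.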

Finally, in the Sample section, the pandas-profiling report shows, as an example, a slice of data taken from the beginning of the dataset. This approach can lead to unpleasant surprises, since the first few observations may form a sample that does not reflect the characteristics of the whole dataset.

Section with a sample of the data under study

For that reason, I do not recommend paying attention to this last section. Instead, it is better to use the command df.sample(5), which selects 5 observations from the dataset at random.
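The difference between the two approaches is easy to see on a small frame. The data below is made up, and the random_state argument is my addition so that the draw is repeatable; the article's own call is simply df.sample(5).

```python
import pandas as pd

# Hypothetical toy frame with 20 rows.
toy = pd.DataFrame({"PassengerId": range(1, 21), "Pclass": [1, 2, 3, 3] * 5})

head_rows = toy.head(5)                       # always the first five rows
random_rows = toy.sample(5, random_state=0)   # five rows drawn without replacement

print(head_rows)
print(random_rows)
```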

Conclusions

To sum up, pandas-profiling gives the analyst some useful capabilities that come in handy when you need to quickly get a rough idea of the data or hand an exploratory analysis report to someone else. At the same time, the real work with the data, taking its characteristics into account, is still done by hand, just as it is without pandas-profiling.

If you want to see what a complete exploratory data analysis looks like in a single Jupyter notebook, take a look at this project of mine, rendered with nbviewer. The corresponding code can be found in this GitHub repository.

Dear readers! Where do you start when analyzing a new dataset?

Source: www.habr.com
