Speed up exploratory data analysis with the pandas-profiling library

The first step when you start working with a new dataset is to understand it. To do that you need, for example, to find out the ranges of values the variables take, their types, and the number of missing values.
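A minimal sketch of these first checks with plain pandas. The tiny DataFrame is made up for illustration and only mimics a few Titanic-like columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a freshly loaded dataset.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

dtypes = toy.dtypes                                        # type of each variable
missing = toy.isna().sum()                                 # missing values per column
ranges = toy.select_dtypes("number").agg(["min", "max"])   # value ranges of numeric columns

print(dtypes, missing, ranges, sep="\n\n")
```

Each of these is a separate call, which is exactly the kind of routine that gets repeated for every new dataset.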

The pandas library gives us many useful tools for exploratory data analysis (EDA). But before using them, you usually need to start with more general functions such as df.describe(). It should be noted, however, that the capabilities of such functions are limited, and the first steps of EDA tend to look very much the same for any dataset.

The author of the material we are publishing today says he is no fan of repetitive work. So, while looking for tools to perform exploratory data analysis quickly and efficiently, he came across the pandas-profiling library. Its output is not a handful of separate metrics but a fairly detailed HTML report containing most of what you may need to know about the data before you start working with it more closely.

Here we will look at how to use the pandas-profiling library, with the Titanic dataset as an example.

Exploratory data analysis with pandas

I decided to try pandas-profiling on the Titanic dataset because of the variety of data types it contains and because it has missing values. I believe pandas-profiling is especially interesting when the data has not been cleaned yet and needs further processing that depends on its characteristics. To do that processing well, you need to know where to start and what to pay attention to. This is where pandas-profiling comes in handy.

First, we import the data and use pandas to obtain descriptive statistics:

# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np

# load the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')

# compute descriptive statistics
df.describe()

After running this piece of code, you will get what is shown in the following figure.

Descriptive statistics obtained with standard pandas tools

Although there is a lot of useful information here, it does not contain everything that would be interesting to know about the data. For example, one might assume that the DataFrame has 891 rows. If that needs to be checked, another line of code is required to get the size of the frame. These computations are not resource-intensive, but repeating them all the time wastes time that would probably be better spent cleaning the data.
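The extra line in question is typically a look at the shape attribute. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical toy frame; on the real dataset the first number is the row count.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

n_rows, n_cols = toy.shape   # (rows, columns)
print(f"{n_rows} rows, {n_cols} columns")
```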

Exploratory data analysis with pandas-profiling

Now let's do the same using pandas-profiling:

pandas_profiling.ProfileReport(df)

Running the line of code above generates a report with exploratory data analysis metrics. The code shown above displays the results inline, but you can also have it produce an HTML file that you can, for example, show to someone else.
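A sketch of writing the report to an HTML file. The helper name save_profile_report and the default file name are made up for this example. The fallback import assumes that a newer installation ships the library under its later name, ydata_profiling.

```python
import pandas as pd


def save_profile_report(df: pd.DataFrame, path: str = "report.html") -> str:
    """Build a profiling report for df and write it to an HTML file."""
    # Imported lazily so the helper can be defined even if the package is missing.
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        # Assumption: the library is installed under its newer name instead.
        from ydata_profiling import ProfileReport

    profile = ProfileReport(df)
    profile.to_file(path)   # writes the HTML report to disk
    return path
```

Calling save_profile_report(df) on the Titanic frame should leave a report.html file next to the notebook, ready to be shared.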

The first part of the report is the Overview section, which gives basic information about the data (number of observations, number of variables, and so on). It also contains a list of warnings that tell the analyst what to pay special attention to. These warnings can hint at where to focus your data-cleaning efforts.
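To get a feel for what such a warning is based on, here is a rough plain-pandas sketch that flags columns with many missing values. It is not the library's own logic. The toy data and the 20% threshold are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the 20% threshold is an arbitrary choice for this sketch.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0, 54.0],
    "Cabin": [None, "C85", None, None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

missing_share = toy.isna().mean()              # fraction of missing values per column
flagged = missing_share[missing_share > 0.2]   # columns that deserve a closer look

for column, share in flagged.items():
    print(f"{column} has {share:.0%} missing values")
```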

Overview section of the report

Exploring the variables

Below the Overview section of the report you can find useful information about each variable. Among other things, it includes small charts describing the distribution of each variable.

About the Age numeric variable

As you can see from the previous example, pandas-profiling gives us several useful metrics, such as the percentage and number of missing values, along with the descriptive statistics we have already seen. Because Age is a numeric variable, visualizing its distribution as a histogram lets us conclude that the distribution is skewed to the right.
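Right skew can also be checked numerically rather than by eye. In the sketch below the ages are invented, not taken from the Titanic data. A positive skewness and a mean above the median both point to a longer right tail.

```python
import pandas as pd

# Hypothetical ages: most values are fairly low, a few are much higher.
ages = pd.Series([2, 18, 19, 21, 22, 24, 25, 28, 30, 36, 58, 71], name="Age")

skewness = ages.skew()                     # positive for a right-skewed distribution
mean, median = ages.mean(), ages.median()

print(f"skewness={skewness:.2f}, mean={mean:.1f}, median={median:.1f}")
```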

For a categorical variable, the output differs slightly from what is produced for a numeric one.

About the Sex categorical variable

Namely, instead of the mean, minimum and maximum, pandas-profiling computed the number of classes. Because Sex is a binary variable, its values are represented by two classes.
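The same kind of summary can be reproduced with plain pandas. The values below are made up and only show which calls are involved.

```python
import pandas as pd

# Hypothetical values of a binary categorical variable.
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

n_distinct = sex.nunique()                   # number of classes
counts = sex.value_counts()                  # observations per class
shares = sex.value_counts(normalize=True)    # the same, as fractions

print(n_distinct, counts, shares, sep="\n\n")
```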

If you like digging into code as much as I do, you may be curious how exactly pandas-profiling computes these metrics. Since the library's code is open and available on GitHub, finding out is not that hard. As I am no fan of black boxes in my projects, I took a look at the library's source code. For example, this is what the mechanism for processing numeric variables looks like, represented by the function describe_numeric_1d:

def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).
    Also create histograms (mini an full) of its distribution.
    Parameters
    ----------
    series : Series
        The variable to describe.
    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)

Although this piece of code may look rather large and complicated, it is actually very easy to understand. The library's source code contains a function that determines the types of the variables. If the library encounters a numeric variable, the function above computes the metrics we have been looking at. It relies on standard pandas operations for working with Series objects, such as series.mean(). The results are stored in the stats dictionary. Histograms are generated with an adapted version of matplotlib.pyplot.hist. The adaptation is meant to make sure the function can work with different kinds of datasets.
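To make the idea concrete, here is a stripped-down, self-contained variant of the same approach. It is a sketch, not the library's code: the name describe_numeric_sketch is made up, the histograms and the type tag are left out, and only a subset of the statistics is computed.

```python
import numpy as np
import pandas as pd


def describe_numeric_sketch(series: pd.Series) -> pd.Series:
    """Simplified stand-in for the library function; histograms are omitted."""
    stats = {
        "mean": series.mean(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max(),
    }
    stats["range"] = stats["max"] - stats["min"]

    # Drop missing values once and reuse the result for every quantile.
    no_na = series.dropna()
    for q in (0.05, 0.25, 0.5, 0.75, 0.95):
        stats[f"{q:.0%}"] = no_na.quantile(q)   # keys like "25%"
    stats["iqr"] = stats["75%"] - stats["25%"]

    stats["skewness"] = series.skew()
    stats["cv"] = stats["std"] / stats["mean"] if stats["mean"] else np.nan
    stats["n_zeros"] = int((series == 0).sum())
    stats["p_zeros"] = stats["n_zeros"] / len(series)
    return pd.Series(stats, name=series.name)


summary = describe_numeric_sketch(pd.Series([0, 1, 2, 3, 4], name="toy"))
print(summary)
```

Everything is collected in an ordinary dictionary and turned into a Series at the end, which is the pattern the library function follows as well.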

Correlation metrics and a sample of the data

After the results of the per-variable analysis, pandas-profiling shows Pearson and Spearman correlation matrices in the Correlations section.
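Both kinds of matrices can be computed directly with pandas. The sketch uses invented values in Titanic-like columns.

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Age": [22, 38, 26, 35, 27, 54],
})

pearson = toy.corr(method="pearson")     # strength of the linear relationship
spearman = toy.corr(method="spearman")   # rank-based, captures monotonic relationships

print(pearson.round(2), spearman.round(2), sep="\n\n")
```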

Pearson correlation matrix

If needed, you can set the thresholds used when computing correlations in the line of code that generates the report. That way you can specify what strength of correlation is considered important for your analysis.
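The article does not name the report parameter, so the sketch below only illustrates the idea of a threshold in plain pandas: it keeps the pairs of variables whose absolute Pearson correlation reaches a chosen cut-off. The data and the value 0.8 are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Age": [22, 38, 26, 35, 27, 54],
})

threshold = 0.8   # arbitrary cut-off chosen for this sketch

corr = toy.corr(method="pearson").abs()
strong_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]        # upper triangle only, no self-pairs
    if corr.loc[a, b] >= threshold
]

print(strong_pairs)
```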

Finally, in the Sample section, the pandas-profiling report shows a piece of data taken from the beginning of the dataset. This approach can lead to unpleasant surprises, since the first few observations may be a sample that does not reflect the characteristics of the dataset as a whole.

Section with a sample of the data being studied

For that reason, I do not recommend relying on this last section. Instead, it is better to use the command df.sample(5), which randomly selects 5 observations from the dataset.
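The difference between the two ways of peeking at the data, sketched on a made-up frame. The random_state argument is optional and is used here only to make the draw reproducible.

```python
import pandas as pd

# Hypothetical toy frame with 20 rows.
toy = pd.DataFrame({
    "PassengerId": range(1, 21),
    "Fare": [x * 1.5 for x in range(1, 21)],
})

first_rows = toy.head(5)                       # always the same first five rows
random_rows = toy.sample(5, random_state=42)   # five rows drawn at random, reproducibly

print(first_rows, random_rows, sep="\n\n")
```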

Summary

To sum up, pandas-profiling gives the analyst some useful capabilities that come in handy when you need to get a quick rough idea of the data or pass an exploratory analysis report on to someone else. At the same time, the real work with the data, which takes its specifics into account, is still done by hand, just as it is without pandas-profiling.

If you want to see what a complete exploratory data analysis looks like in a single Jupyter notebook, take a look at my project rendered with nbviewer. The corresponding code is available in the GitHub repository.

Dear readers! Where do you start when analyzing a new dataset?

Source: www.habr.com
