Speeding up exploratory data analysis with the pandas-profiling library

The first step when you start working with a new dataset is to understand it. To do that you need, for example, to find out the ranges of values the variables take, their types, and how many values are missing.

The pandas library gives us many useful tools for exploratory data analysis (EDA). Before using them, though, you usually start with general-purpose functions such as df.describe(). The capabilities of such functions are limited, and the initial stages of EDA tend to look very much alike from one dataset to the next.

The author of the article we are publishing today says he is not a fan of repetitive work. So, while looking for tools to perform exploratory data analysis quickly and efficiently, he found the pandas-profiling library. Its output is not a handful of individual metrics but a fairly detailed HTML report containing most of what you may need to know about the data before you start working with it closely.

Here we look at how to use the pandas-profiling library, with the Titanic dataset as an example.

Exploratory data analysis using pandas

I decided to try pandas-profiling on the Titanic dataset because of the variety of data types it contains and because it has missing values. I believe the pandas-profiling library is especially interesting when the data has not yet been cleaned and needs further processing that depends on its characteristics. To do that processing well, you need to know where to start and what to pay attention to. This is where the capabilities of pandas-profiling come in handy.

First, we import the data and use pandas to obtain descriptive statistics:

# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np

# load the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')

# compute descriptive statistics
df.describe()

After running this piece of code, you get what is shown in the following figure.

Descriptive statistics obtained using standard pandas tools

Although there is a lot of useful information here, it does not contain everything you would want to know about the data under study. For example, one might assume that the data frame, the DataFrame structure, has 891 rows. Checking this takes another line of code that determines the size of the frame. These computations are not particularly resource-intensive, but repeating them every time wastes time that would be better spent cleaning the data.
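The extra checks mentioned above usually amount to a few one-liners. Here is a minimal sketch of them, assuming a small made-up frame (`toy_df` and its values are hypothetical and only stand in for the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Sex": ["male", "female", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

n_rows, n_cols = toy_df.shape          # size of the frame (rows, columns)
dtypes = toy_df.dtypes                 # type of each variable
missing_counts = toy_df.isna().sum()   # number of missing values per column
missing_share = toy_df.isna().mean()   # fraction of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(missing_counts)
```

None of these calls is expensive, but they have to be typed again for every new dataset, which is exactly the repetition a generated report avoids.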

Exploratory data analysis using pandas-profiling

Now let's do the same thing using pandas-profiling:

pandas_profiling.ProfileReport(df)

Running the line of code above generates a report with exploratory data analysis metrics. As written, the code displays the report directly, but you can also have it produce an HTML file that you can, for example, show to someone else.

The first part of the report contains the Overview section, which gives basic information about the data (number of observations, number of variables, and so on). It also contains a list of warnings that tell the analyst what deserves special attention. These warnings can hint at where to focus your data-cleaning efforts.

The Overview section of the report

Exploratory analysis of variables

Below the Overview section of the report you can find useful information about each variable. Among other things, it includes small charts describing the distribution of each variable.

Information about the numeric variable Age

As the previous example shows, pandas-profiling gives us several useful indicators, such as the percentage and number of missing values, along with the descriptive statistics we have already seen. Because Age is a numeric variable, visualising its distribution as a histogram lets us conclude that the distribution is skewed to the right.
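The same indicators can be reproduced by hand with plain pandas. The sketch below assumes a short made-up series of ages (`ages` is hypothetical, not the Titanic column) and uses the sample skewness as a numeric counterpart to reading the histogram:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with a long right tail and two missing entries.
ages = pd.Series(
    [2, 18, 19, 21, 22, 24, 25, 28, 30, 35, 54, 71, np.nan, np.nan],
    name="Age",
)

n_missing = int(ages.isna().sum())   # count of missing values
p_missing = n_missing / len(ages)    # share of missing values
skewness = ages.skew()               # a positive value suggests a right-skewed distribution

print(f"missing: {n_missing} ({p_missing:.1%}), skewness: {skewness:.2f}")
```

A positive skewness together with a mean above the median is the usual numeric signature of the right skew that the histogram shows visually.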

When a categorical variable is analysed, the output differs slightly from what is produced for a numeric variable.

Information about the categorical variable Sex

Namely, instead of the mean, minimum and maximum, the pandas-profiling library found the number of classes. Because Sex is a binary variable, its values are represented by two classes.
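For comparison, here is a minimal sketch of how class counts for a categorical column can be obtained with plain pandas. The series `sex` is a made-up example, and this is not necessarily how the library computes them internally:

```python
import pandas as pd

# Hypothetical categorical column with one missing entry.
sex = pd.Series(["male", "female", "female", "male", "male", None], name="Sex")

n_classes = sex.nunique()            # number of distinct classes, missing values excluded
class_counts = sex.value_counts()    # observations per class, most frequent first
class_shares = sex.value_counts(normalize=True)  # the same as fractions of non-missing rows

print(n_classes)
print(class_counts)
```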

If you like digging into code as much as I do, you may be interested in how exactly the pandas-profiling library computes these metrics. Since the library's code is open and available on GitHub, finding out is not hard. I am not a big fan of using black boxes in my projects, so I took a look at the library's source code. For example, this is what the mechanism for processing numeric variables looks like, represented by the function describe_numeric_1d:

def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).
    Also create histograms (mini an full) of its distribution.
    Parameters
    ----------
    series : Series
        The variable to describe.
    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)

Although this piece of code may look large and complicated, it is actually very easy to understand. The library's source code contains a function that determines the types of variables. If the library encounters a numeric variable, the function above computes the metrics we have been looking at. It uses standard pandas operations for working with objects of type Series, such as series.mean(). The results are stored in the stats dictionary. The histograms are produced with an adapted version of matplotlib.pyplot.hist. The adaptation is meant to ensure that the function can work with different kinds of datasets.
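To see the idea in isolation, here is a trimmed-down sketch of the same approach that runs on current pandas. The function name `summarize_numeric` is hypothetical and not part of pandas-profiling, and the histograms are left out. Two calls from the listing above are replaced because they no longer exist in recent releases: Series.mad() was removed in pandas 2.0 and the np.NaN alias was removed in NumPy 2.0.

```python
import numpy as np
import pandas as pd


def summarize_numeric(series: pd.Series) -> pd.Series:
    """Hypothetical, simplified counterpart of describe_numeric_1d (no histograms)."""
    clean = series.dropna()  # computed once and reused for the quantiles
    stats = {
        "mean": series.mean(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max(),
    }
    stats["range"] = stats["max"] - stats["min"]
    for q in (0.05, 0.25, 0.5, 0.75, 0.95):
        stats[f"{q:.0%}"] = clean.quantile(q)  # keys such as "5%", "25%", ...
    stats["iqr"] = stats["75%"] - stats["25%"]
    # Mean absolute deviation computed by hand instead of the removed Series.mad().
    stats["mad"] = (clean - clean.mean()).abs().mean()
    stats["cv"] = stats["std"] / stats["mean"] if stats["mean"] else np.nan
    stats["n_zeros"] = int((series == 0).sum())
    stats["p_zeros"] = stats["n_zeros"] / len(series)
    return pd.Series(stats, name=series.name)


summary = summarize_numeric(pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, np.nan], name="toy"))
print(summary)
```

Collecting everything into a dictionary and returning a Series keeps the result easy to combine with the summaries of other columns.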

Correlation metrics and sample data

After the results of the variable analysis, pandas-profiling displays the Pearson and Spearman correlation matrices in the Correlations section.

Pearson correlation matrix

If necessary, you can set the threshold values used when computing correlations in the line of code that generates the report. This lets you specify how strong a correlation must be to count as significant for your analysis.

Finally, the Sample section of the pandas-profiling report shows, as an example, a slice of data taken from the beginning of the dataset. This approach can lead to unpleasant surprises, because the first few observations may form a sample that does not reflect the characteristics of the whole dataset.
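Both kinds of matrix can also be computed directly with pandas. The sketch below assumes a small made-up frame, and the cut-off `threshold = 0.9` is a hypothetical value chosen for illustration, not a setting taken from the library:

```python
import pandas as pd

# Hypothetical data: y grows monotonically but not linearly with x.
toy_df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
    "z": [5, 3, 4, 1, 2],
})

pearson = toy_df.corr(method="pearson")    # linear association
spearman = toy_df.corr(method="spearman")  # rank-based (monotonic) association

threshold = 0.9  # hypothetical cut-off for calling a correlation "strong"
strong_pairs = [
    (a, b)
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) >= threshold
]

print(pearson.round(3))
print(spearman.round(3))
print(strong_pairs)
```

Here the Spearman coefficient for x and y is higher than the Pearson one because the relationship is monotonic but not linear, which is one reason to look at both matrices.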

The section containing a sample of the data under study

For this reason, I do not recommend paying attention to this last section. It is better to use the command df.sample(5), which randomly selects 5 observations from the dataset.
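A small sketch of why the first rows can mislead, assuming a made-up frame that happens to be sorted by class (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame sorted by passenger class, so the first rows all look alike.
toy_df = pd.DataFrame({"Pclass": [1] * 5 + [2] * 5 + [3] * 10})

first_rows = toy_df.head(5)                     # what a "first N rows" preview shows
random_rows = toy_df.sample(5, random_state=0)  # random rows; the seed only makes the draw repeatable

print(first_rows["Pclass"].tolist())
print(random_rows["Pclass"].tolist())
```

The head of the frame contains a single class, while a random draw picks rows from anywhere in the data; random_state is optional and is fixed here only for reproducibility.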

Summary

To sum up, the pandas-profiling library gives the analyst some useful capabilities that come in handy when you need a quick, rough idea of the data or want to hand an exploratory analysis report to someone else. The real work with the data, taking its characteristics into account, is still done by hand, just as it is without pandas-profiling.

If you want to see what the whole exploratory data analysis looks like in a single Jupyter notebook, take a look at this project of mine, rendered with nbviewer. The corresponding code is available in this GitHub repository.

Dear readers! Where do you start when analysing new datasets?

Source: www.habr.com
