The first step when you start working with a new dataset is to understand it. To do that you need, for example, to find out the ranges of values the variables take, their types, and how many values are missing.
The pandas library gives us many useful tools for exploratory data analysis (EDA). But before using them, you usually start with more general functions such as df.describe(). It should be noted, however, that the capabilities of these functions are limited, and the first stages of working with any dataset during EDA tend to look very much alike.
The author of the article we are publishing today says he is not someone who enjoys repetitive work. So, while looking for tools to perform exploratory data analysis quickly and efficiently, he found the pandas-profiling library.
Here we look at how to use the pandas-profiling library, with the Titanic dataset as an example.
Exploratory data analysis using pandas
I decided to try pandas-profiling on the Titanic dataset because of the variety of data types it contains and because it has missing values. I believe pandas-profiling is especially interesting when the data has not yet been cleaned and needs further processing that depends on its characteristics. To do that processing well, you need to know where to start and what to pay attention to. This is where pandas-profiling comes in handy.
First, we import the data and use pandas to get descriptive statistics:
# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np
# load the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')
# compute descriptive statistics
df.describe()
After running this code snippet, you get what is shown in the following figure.
Descriptive statistics obtained with standard pandas tools
Although there is a lot of useful information here, it does not contain everything that would be interesting to know about the data under study. For example, one might guess that the DataFrame has 891 rows. If that needs to be checked, another line of code is required to determine the size of the frame. These computations are not especially resource-intensive, but repeating them every time wastes time that would be better spent cleaning the data.
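The extra checks mentioned above take only a few lines each. Here is a minimal sketch, using a small hand-made frame with Titanic-like column names as a stand-in for train.csv (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

n_rows, n_cols = df.shape   # size of the frame: (rows, columns)
column_types = df.dtypes    # data type of each column
missing = df.isna().sum()   # number of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing)
```

None of this is hard, but it is exactly the kind of boilerplate that gets retyped for every new dataset.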
Exploratory data analysis using pandas-profiling
Now let's do the same thing using pandas-profiling:
pandas_profiling.ProfileReport(df)
Running the line of code above generates a report with exploratory data analysis metrics. The code as shown displays the result directly, but you can also have it produce an HTML file that you can, for example, show to someone.
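One way to get that HTML file is the report object's to_file method. This is a minimal sketch, assuming pandas-profiling is installed. The helper name save_profile_report and the output file name are made up for illustration, and the keyword name of the to_file argument has changed between library versions, so the path is passed positionally here:

```python
def save_profile_report(df, path="titanic_report.html"):
    """Build a profile report for df and write it to an HTML file.

    Hypothetical helper, not part of pandas-profiling itself.
    """
    # Imported inside the function so this module still loads
    # on machines where the library is not installed.
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df)
    profile.to_file(path)  # positional: the keyword name differs across versions
    return path
```

Calling save_profile_report(df) on the frame loaded earlier would then leave the report next to your script, ready to be sent to someone.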
The first part of the report contains the Overview section, which gives basic information about the data (number of observations, number of variables, and so on). It also contains a list of warnings that tell the analyst what deserves special attention. These warnings can hint at where to focus your data-cleaning efforts.
Overview section of the report
Exploratory analysis of variables
Below the Overview section of the report you can find useful information about each variable. It includes, among other things, small charts describing the distribution of each variable.
About the numeric variable Age
As you can see from the previous example, pandas-profiling gives us several useful indicators, such as the percentage and number of missing values, as well as the descriptive statistics we have already seen. Since Age is a numeric variable, visualizing its distribution as a histogram lets us conclude that the distribution is skewed to the right.
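The same indicators can be checked by hand in plain pandas. This is a small sketch on invented ages (not the real Titanic column), where a positive skewness value corresponds to a right-skewed histogram:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, chosen so that a few large values stretch the right tail.
age = pd.Series(
    [18, 19, 21, 22, 24, 25, 28, 30, 45, 71, np.nan, np.nan],
    name="Age",
)

n_missing = age.isna().sum()   # number of missing values
p_missing = age.isna().mean()  # share of missing values (0..1)
skewness = age.skew()          # sample skewness; NaN values are skipped

print(n_missing, round(p_missing, 3), skewness > 0)
```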
When a categorical variable is analyzed, the output differs slightly from what is produced for a numeric variable.
About the categorical variable Sex
Namely, instead of the mean, minimum and maximum, pandas-profiling reports the number of classes. Since Sex is a binary variable, its values are represented by two classes.
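For comparison, the class counts for a categorical column are easy to reproduce in plain pandas. This is a sketch on a few invented values, not the real Sex column:

```python
import pandas as pd

# Hypothetical values for illustration only.
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

n_classes = sex.nunique()                  # number of distinct classes
counts = sex.value_counts()                # observations per class
shares = sex.value_counts(normalize=True)  # the same, as fractions of the total

print(n_classes)
print(counts)
print(shares)
```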
If you like digging into code as much as I do, you may be curious about how exactly pandas-profiling computes these metrics. Since the library's code is open and available on GitHub, finding out is not hard. I am not a big fan of black boxes in my projects, so I looked at the library's source code. For example, this is how numeric variables are processed, in the function describe_numeric_1d:
def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).
    Also create histograms (mini an full) of its distribution.
    Parameters
    ----------
    series : Series
        The variable to describe.
    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)
Although this piece of code may look large and complicated, it is actually very easy to understand. The point is that the library's source code contains a function that determines the types of variables. If the library encounters a numeric variable, the function above computes the metrics we have been looking at. It uses standard pandas operations for working with Series objects, such as series.mean(). The results are stored in the stats dictionary. The histograms are generated with an adapted version of matplotlib.pyplot.hist. The adaptation is meant to ensure that the function can work with different kinds of datasets.
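The listing above depends on library internals (base, histogram, mini_histogram) and on Series.mad() and np.NaN, which were removed in pandas 2.0 and NumPy 2.0 respectively, so it will not run on its own in a current environment. To experiment with the idea, here is a reduced sketch that keeps only a few of the statistics. The function name summarize_numeric is made up, and this is not the library's code:

```python
import numpy as np
import pandas as pd


def summarize_numeric(series):
    """Reduced, hypothetical re-creation of a few statistics from the listing."""
    clean = series.dropna()  # drop missing values once and reuse the result
    stats = {
        "mean": series.mean(),
        "std": series.std(),
        "min": series.min(),
        "max": series.max(),
    }
    stats["range"] = stats["max"] - stats["min"]
    for q in (0.05, 0.25, 0.5, 0.75, 0.95):
        # Keys become "5%", "25%", "50%", "75%", "95%"
        stats["{:.0%}".format(q)] = clean.quantile(q)
    stats["iqr"] = stats["75%"] - stats["25%"]
    # Mean absolute deviation written out by hand instead of Series.mad()
    stats["mad"] = (clean - clean.mean()).abs().mean()
    stats["cv"] = stats["std"] / stats["mean"] if stats["mean"] else np.nan
    stats["n_zeros"] = int((series == 0).sum())
    return pd.Series(stats, name=series.name)


summary = summarize_numeric(pd.Series([0.0, 1.0, 2.0, 3.0, np.nan], name="x"))
print(summary)
```

The structure mirrors the original: every metric is a single pandas call whose result is stored in a dictionary, and the dictionary is returned as a Series.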
Correlations and a sample of the data
After the variable analysis results, pandas-profiling shows the Pearson and Spearman correlation matrices in the Correlations section.
Pearson correlation matrix
If needed, you can set the threshold values used when computing correlations in the line of code that generates the report. By doing so, you specify what strength of correlation is considered important for your analysis.
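Outside the report, both kinds of matrix can be computed directly with DataFrame.corr. This is a minimal sketch on a toy frame with invented column names, where y grows monotonically with x but not linearly, so the two coefficients differ:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],   # monotonic in x, but not linear
    "label": list("abcde"),   # non-numeric column, excluded below
})

numeric = df.select_dtypes(include="number")  # keep numeric columns only
pearson = numeric.corr(method="pearson")      # linear correlation
spearman = numeric.corr(method="spearman")    # rank correlation

print(pearson)
print(spearman)
```

Spearman works on ranks, so it treats a perfectly monotonic relationship as perfect correlation, while Pearson only reaches 1 for a strictly linear one.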
Finally, in the Sample section, the pandas-profiling report shows a piece of data taken from the beginning of the dataset as an example. This approach can lead to unpleasant surprises, because the first few observations may form a sample that does not reflect the characteristics of the whole dataset.
Section containing a sample of the data under study
For that reason, I do not recommend paying attention to this last section. It is better to use df.sample(5), which randomly selects 5 observations from the dataset.
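The difference between the two approaches is easy to see on a toy frame. In this sketch the PassengerId values are invented, and random_state is fixed only so that the draw is repeatable:

```python
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({"PassengerId": range(1, 21)})

first_rows = df.head(5)                     # always the first five rows
random_rows = df.sample(5, random_state=0)  # five rows drawn at random, without replacement

print(first_rows)
print(random_rows)
```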
Summary
To sum up, the pandas-profiling library gives the analyst some useful capabilities that come in handy when you need to get a quick, rough idea of the data or hand someone an exploratory analysis report. At the same time, the real work with the data, taking its characteristics into account, is still done by hand, just as it is without pandas-profiling.
If you want to see what the whole exploratory data analysis looks like in a single Jupyter notebook, take a look at the author's notebook for this project.
Dear readers! Where do you start when analyzing new datasets?
Source: www.habr.com