Speeding up exploratory data analysis with the pandas-profiling library

The first step when you start working with a new dataset is to understand it. To do that, you need, for example, to find out the ranges of values the variables take, their types, and the number of missing values.

The pandas library gives us many useful tools for exploratory data analysis (EDA). But before using them, you usually need to start with more general functions such as df.describe(). It should be noted, however, that the capabilities of such functions are limited, and that the initial stages of EDA look very much alike from one dataset to the next.

The author of the article we are publishing today says he is not fond of performing repetitive actions. As a result, while looking for tools to carry out exploratory data analysis quickly and efficiently, he found the pandas-profiling library. Its results are presented not as individual indicators but as a fairly detailed HTML report containing most of the information about the dataset that you may need to know before you start working with it closely.

Here we will look at the features of the pandas-profiling library, using the Titanic dataset as an example.

Exploratory data analysis using pandas

I decided to try pandas-profiling on the Titanic dataset because of the variety of data types it contains and the presence of missing values. I believe the pandas-profiling library is especially interesting in cases where the data has not yet been cleaned and needs further processing that depends on its characteristics. To carry out such processing successfully, you need to know where to start and what to pay attention to. This is where the capabilities of pandas-profiling come in handy.

First, we load the data and use pandas to obtain descriptive statistics:

# import the required packages
import pandas as pd
import pandas_profiling
import numpy as np

# load the data
df = pd.read_csv('/Users/lukas/Downloads/titanic/train.csv')

# compute descriptive statistics
df.describe()

After running this piece of code, you will get what is shown in the following figure.

Descriptive statistics obtained using standard pandas tools

Although there is a lot of useful information here, it does not contain everything that would be interesting to know about the dataset under study. For example, one might assume that the data frame, the DataFrame structure, has 891 rows. If this needs to be checked, another line of code is required to determine the size of the frame. These computations are not particularly resource-intensive, but repeating them every time is bound to waste time that could probably be better spent cleaning the data.
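The extra checks mentioned above can be sketched in a few lines of plain pandas. The small table below is hypothetical toy data standing in for the Titanic file, so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic table (not the real file).
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

n_rows, n_cols = df.shape              # size of the frame: (rows, columns)
missing_per_column = df.isna().sum()   # number of missing values in each column
column_types = df.dtypes               # dtype of each column

print(n_rows, n_cols)
print(missing_per_column)
print(column_types)
```

Each of these is a separate call that has to be repeated for every new dataset, which is the routine a profiling report is meant to replace.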

Exploratory data analysis using pandas-profiling

Now let's do the same using pandas-profiling:

pandas_profiling.ProfileReport(df)

Running the line of code above produces a report with exploratory data analysis indicators. The code shown above displays the resulting report directly, but you can also have it generate an HTML file that you can show to someone, for example.
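A minimal sketch of writing the report to an HTML file might look like the following. The helper name save_profile_report and the file name are hypothetical. The report object does provide a to_file method, but the name of its argument has changed between releases of the library, so the path is passed positionally here and the call should be checked against the installed version.

```python
import pandas as pd


def save_profile_report(df: pd.DataFrame, path: str) -> None:
    """Write a pandas-profiling report for df to an HTML file at path.

    Hypothetical helper for illustration; not part of the library.
    """
    # Imported lazily so the helper can be defined even where the
    # library is not installed.
    import pandas_profiling

    profile = pandas_profiling.ProfileReport(df)
    # The path is passed positionally because the keyword name differs
    # between library versions (an assumption to verify locally).
    profile.to_file(path)


# Example call (the file name is a placeholder):
# save_profile_report(df, "titanic_report.html")
```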

The first part of the report contains an Overview section, which gives basic information about the data (number of observations, number of variables, and so on). It also contains a list of warnings that notify the analyst of things that deserve special attention. These warnings can give hints about where to focus your data-cleaning efforts.

Overview section of the report

Exploratory analysis of variables

Below the Overview section of the report you can find useful information about each variable. It includes, among other things, small charts describing the distribution of each variable.

Information about the Age variable

As you can see from the previous example, pandas-profiling gives us several useful indicators, such as the percentage and number of missing values, as well as the descriptive statistics we have already seen. Because Age is a numeric variable, visualizing its distribution as a histogram lets us conclude that the distribution is skewed to the right.
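The same indicators can be checked by hand in pandas. The sketch below uses a hypothetical handful of ages, not the Titanic column, and relies on two common signs of a right-skewed distribution: a positive skewness and a mean larger than the median.

```python
import pandas as pd

# Hypothetical ages with one missing value and a long right tail.
ages = pd.Series(
    [19, 21, 22, 22, 24, 25, None, 28, 35, 54, 71],
    name="Age",
)

n_missing = int(ages.isna().sum())   # number of missing values
p_missing = n_missing / len(ages)    # share of missing values
skewness = ages.skew()               # positive for a right-skewed sample

print(f"missing: {n_missing} ({p_missing:.1%})")
print(f"skewness: {skewness:.2f}")
print(f"mean: {ages.mean():.1f}, median: {ages.median():.1f}")
```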

For a categorical variable, the output differs slightly from what is shown for a numeric variable.

Information about the categorical variable Sex

Namely, instead of finding the mean, minimum and maximum, the pandas-profiling library found the number of classes. Because Sex is a binary variable, its values are represented by two classes.
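For comparison, a rough pandas equivalent of the class summary could look like this. The values are hypothetical toy data, and the sketch only shows the kind of figures involved, not how the library computes them.

```python
import pandas as pd

# Hypothetical values for a binary categorical variable.
sex = pd.Series(["male", "female", "female", "male", "male"], name="Sex")

n_classes = sex.nunique()                        # number of distinct classes
class_counts = sex.value_counts()                # observations per class
class_shares = sex.value_counts(normalize=True)  # share of each class

print(n_classes)
print(class_counts)
print(class_shares)
```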

If you like to examine code the way I do, you may be interested in how exactly the pandas-profiling library computes these indicators. Finding this out is not very hard, given that the library's code is open and available on GitHub. Since I am not a big fan of using black boxes in my projects, I took a look at the library's source code. For example, this is what the mechanism for processing numeric variables looks like, represented by the function describe_numeric_1d:

def describe_numeric_1d(series, **kwargs):
    """Compute summary statistics of a numerical (`TYPE_NUM`) variable (a Series).
    Also create histograms (mini an full) of its distribution.
    Parameters
    ----------
    series : Series
        The variable to describe.
    Returns
    -------
    Series
        The description of the variable as a Series with index being stats keys.
    """
    # Format a number as a percentage. For example 0.25 will be turned to 25%.
    _percentile_format = "{:.0%}"
    stats = dict()
    stats['type'] = base.TYPE_NUM
    stats['mean'] = series.mean()
    stats['std'] = series.std()
    stats['variance'] = series.var()
    stats['min'] = series.min()
    stats['max'] = series.max()
    stats['range'] = stats['max'] - stats['min']
    # To avoid to compute it several times
    _series_no_na = series.dropna()
    for percentile in np.array([0.05, 0.25, 0.5, 0.75, 0.95]):
        # The dropna() is a workaround for https://github.com/pydata/pandas/issues/13098
        stats[_percentile_format.format(percentile)] = _series_no_na.quantile(percentile)
    stats['iqr'] = stats['75%'] - stats['25%']
    stats['kurtosis'] = series.kurt()
    stats['skewness'] = series.skew()
    stats['sum'] = series.sum()
    stats['mad'] = series.mad()
    stats['cv'] = stats['std'] / stats['mean'] if stats['mean'] else np.NaN
    stats['n_zeros'] = (len(series) - np.count_nonzero(series))
    stats['p_zeros'] = stats['n_zeros'] * 1.0 / len(series)
    # Histograms
    stats['histogram'] = histogram(series, **kwargs)
    stats['mini_histogram'] = mini_histogram(series, **kwargs)
    return pd.Series(stats, name=series.name)

Although this piece of code may seem rather large and complicated, it is in fact very easy to understand. The point is that the library's source code contains a function that determines the types of variables. If the library encounters a numeric variable, the function above computes the indicators we have been looking at. This function uses standard pandas operations for working with Series objects, such as series.mean(). The results are stored in the stats dictionary. The histograms are generated using an adapted version of the matplotlib.pyplot.hist function. The adaptation is meant to ensure that the function can work with different types of datasets.
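The type-based dispatch described above can be illustrated with a much simplified sketch. The function describe_series and its keys are hypothetical and are not the library's code. One practical detail when reading the library's function today: Series.mad() was removed in pandas 2.0, so the sketch computes the mean absolute deviation by hand.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_series(series: pd.Series) -> pd.Series:
    """Simplified, hypothetical dispatcher; not the library's actual code."""
    stats = {"n_missing": int(series.isna().sum())}
    if is_numeric_dtype(series):
        # Numeric branch: a few of the statistics discussed in the text.
        stats["type"] = "NUM"
        stats["mean"] = series.mean()
        stats["min"] = series.min()
        stats["max"] = series.max()
        # Mean absolute deviation around the mean, computed by hand
        # because Series.mad() no longer exists in pandas 2.0 and later.
        stats["mad"] = (series - series.mean()).abs().mean()
    else:
        # Categorical branch: count the distinct classes instead.
        stats["type"] = "CAT"
        stats["distinct_count"] = series.nunique()
    return pd.Series(stats, name=series.name)


age_stats = describe_series(pd.Series([20.0, 30.0, None, 40.0], name="Age"))
sex_stats = describe_series(pd.Series(["male", "female", "male"], name="Sex"))
print(age_stats)
print(sex_stats)
```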

Correlation indicators and a sample of the data under study

After the results of the variable analysis, pandas-profiling displays Pearson and Spearman correlation matrices in the Correlations section.

Pearson correlation matrix

If necessary, you can set the threshold values used when computing correlations in the line of code that triggers report generation. By doing so, you can specify what strength of correlation is considered important for your analysis.
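To make the idea of a correlation threshold concrete, here is a sketch that computes both matrices with plain pandas and keeps only the strongly correlated pairs. The data and the cut-off of 0.8 are hypothetical, and the filtering is done by hand rather than through any option of the report.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

# Correlations are only defined for numeric columns.
numeric = df.select_dtypes(include="number")
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

threshold = 0.8  # hypothetical cut-off for a "strong" correlation
strong_pairs = [
    (a, b, pearson.loc[a, b])
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) >= threshold
]

print(pearson.round(2))
print(spearman.round(2))
print(strong_pairs)
```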

Finally, in the Sample section, the pandas-profiling report displays, as an example, a piece of data taken from the beginning of the dataset. This approach can lead to unpleasant surprises, since the first few observations may form a sample that does not reflect the characteristics of the whole dataset.

Section containing a sample of the data under study

As a result, I do not recommend paying attention to this last section. Instead, it is better to use the command df.sample(5), which randomly selects 5 observations from the dataset.
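The difference between the two approaches can be seen in a short sketch. The table is hypothetical, and random_state is set only so that the random draw is repeatable.

```python
import pandas as pd

# Hypothetical table with 20 rows in a fixed order.
df = pd.DataFrame({
    "PassengerId": range(1, 21),
    "Fare": [i * 2.5 for i in range(1, 21)],
})

first_rows = df.head(5)                       # always the first five rows
random_rows = df.sample(5, random_state=42)   # five rows drawn at random

print(first_rows)
print(random_rows)
```

If the rows happen to be sorted, for example by fare or by date, head() shows only one end of that ordering, while sample() draws from the whole table.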

Summary

To sum up, the pandas-profiling library gives the analyst some useful capabilities that come in handy when you need to quickly get a rough idea of the data or pass an exploratory analysis report on to someone. At the same time, the real work with the data, taking its characteristics into account, is still done by hand, just as it is without pandas-profiling.

If you want to see what a complete exploratory data analysis looks like in a single Jupyter notebook, take a look at this project of mine, created using nbviewer. And in this GitHub repository you can find the corresponding code.

Dear readers! Where do you start when analyzing a new dataset?

Source: www.habr.com
