People who enter the field of Data Science often have less than realistic expectations of what awaits them. Many think that they will now be writing cool neural networks, building a voice assistant like the one from Iron Man, or beating everyone on the financial markets.
But the work of a Data Scientist is driven by data, and one of the most important and time-consuming parts of it is processing the data before feeding it to a neural network or analyzing it in some way.
In this article, our team describes how you can process data quickly and easily, with step-by-step instructions and code. We tried to make the code flexible enough to be used with different datasets.
Many professionals may find nothing unusual in this article, but beginners will be able to learn something new, and anyone who has long dreamed of having a separate notebook for fast, structured data processing can copy the code and adapt it to their needs.
We have a dataset. What do we do next?
So, the standard first step: we need to understand what we are dealing with and get the overall picture. To do this, we use pandas to simply identify the different data types.
import pandas as pd  # import pandas
import numpy as np  # import numpy
df = pd.read_csv("AB_NYC_2019.csv")  # read the dataset and store it in the variable df
df.head(3)  # look at the first 3 rows to see what the values look like
df.info()  # show information about the columns
Let's look at the column values:
- Does the number of filled-in rows in each column match the total number of rows?
- What is the essence of the data in each column?
- Which column do we want to use as the target to make predictions for?
The answers to these questions will let you analyze the dataset and sketch a rough plan for your next steps.
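The first two questions can be answered in one table. The sketch below is only an illustration: the helper name column_overview and the toy values are our own assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype, non-null count,
    share of missing values and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "missing_pct": frame.isna().mean() * 100,
        "n_unique": frame.nunique(),
    })

# Toy data standing in for a real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "price": [100.0, 250.0, np.nan, 80.0],
    "room_type": ["Private room", "Entire home/apt", "Private room", None],
})
overview = column_overview(toy)
print(overview)
```

A column whose non_null count is lower than the number of rows has gaps, and a low n_unique hints at a categorical column.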
Also, for a deeper look at the values in each column, we can use the pandas describe() function. The drawback of this function is that by default it gives no information about columns with string values. We will deal with them later.
df.describe()
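If you want a first look at the string columns right away, describe() accepts an include argument. A minimal sketch on made-up toy data (the column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: one numeric and one string column (values are made up)
toy = pd.DataFrame({
    "price": [100.0, 250.0, np.nan, 80.0],
    "room_type": ["Private room", "Entire home/apt", "Private room", None],
})

# include="all" adds count / unique / top / freq rows for non-numeric columns
summary = toy.describe(include="all")
print(summary)
```

For the string column, the unique, top and freq rows show how many distinct values there are and which one is the most common.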
The magic of visualization
Let's look at where we have no values at all:
import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
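The heatmap gives the picture; the same information can also be read as numbers. A small sketch on toy data (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "reviews_per_month": [0.2, np.nan, np.nan, 1.5],
    "price": [100, 150, 80, 60],
})

# Number of missing values per column, worst columns first
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```

This is handy when the dataset has too many columns for the heatmap to stay readable.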
That was a quick look from above; now we move on to more interesting things.
Let's try to find and, if possible, remove columns that hold a single value in every row (they will not affect the result in any way):
df = df[[c for c in df.columns
         if len(df[c].unique()) > 1]]  # overwrite the dataset, keeping only the columns that have more than one unique value
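Before overwriting the dataset, it can be useful to see which columns would be removed. A sketch on toy data, assuming (as in the line above) that NaN counts as a value of its own:

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["NYC", "NYC", "NYC"],          # constant: carries no information
    "price": [100, 250, 80],
    "room_type": ["Private", "Entire", "Private"],
})

# Columns with at most one distinct value (NaN is counted as a value here)
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```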
Now we protect ourselves and the success of our project from duplicate rows (rows that contain the same information, in the same order, as one of the existing rows):
df.drop_duplicates(inplace=True)  # we do this if we consider it necessary
# in some projects you should not delete such data right from the start
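Since dropping duplicates is not always the right call, it may help to count them first and only then decide. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({
    "host": ["ann", "bob", "ann", "cat"],
    "price": [100, 80, 100, 60],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```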
We split the dataset in two: one part with qualitative values, the other with quantitative ones
Here we need to make a small clarification. If the rows with missing data in the qualitative and the quantitative data do not correlate strongly with each other, we will have to decide what to sacrifice: all rows with missing data, only some of them, or certain columns. If the rows do correlate, we have every right to split the dataset in two. Otherwise, you first need to deal with the rows whose missing data does not correlate between the qualitative and quantitative parts, and only then split the dataset in two.
df_numerical = df.select_dtypes(include = [np.number])
df_categorical = df.select_dtypes(exclude = [np.number])
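One possible way to check how well the missing rows of the two parts line up is to compare row-level "has a gap" masks. This is only a sketch on toy data; the names num_part, cat_part and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, np.nan, 60.0],
    "reviews": [5.0, np.nan, 3.0, 1.0, 2.0],
    "room_type": ["Private", None, "Entire", "Private", "Private"],
})
num_part = toy.select_dtypes(include=[np.number])
cat_part = toy.select_dtypes(exclude=[np.number])

# True for every row that has at least one missing value in that part
num_missing = num_part.isna().any(axis=1)
cat_missing = cat_part.isna().any(axis=1)

# Contingency table: how often the gaps coincide
overlap = pd.crosstab(num_missing, cat_missing,
                      rownames=["numeric missing"],
                      colnames=["categorical missing"])
print(overlap)

both = int((num_missing & cat_missing).sum())      # gaps in both parts
only_one = int((num_missing ^ cat_missing).sum())  # gaps in exactly one part
```

If only_one is large compared with both, dropping rows separately in each part will throw away different rows, which is exactly the situation the clarification above warns about.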
We do this to make it easier for ourselves to process these two different types of data; later we will see how much this simplifies our life.
Working with quantitative data
The first thing we should do is determine whether there are any "spy columns" in the quantitative data. We call them that because they present themselves as quantitative data but actually work as qualitative data.
How do we identify them? Of course, it all depends on the nature of the data you are analyzing, but in general such columns have few unique values (somewhere around 3-10 unique values).
print(df_numerical.nunique())
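The 3-10 rule of thumb can be turned into a quick filter. In the sketch below the threshold max_unique and the toy columns are assumptions; the resulting list is only a set of candidates that still needs a human look.

```python
import pandas as pd

toy_numerical = pd.DataFrame({
    "price": [100, 250, 80, 60, 120, 95, 300, 45, 70, 110, 130, 210],
    "floor": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
})

max_unique = 10  # assumed threshold, tune it for your data
unique_counts = toy_numerical.nunique()

# Numeric columns with suspiciously few distinct values
spy_candidates = unique_counts[unique_counts <= max_unique].index.tolist()
print(spy_candidates)
```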
Once we have identified the spy columns, we move them from the quantitative data to the qualitative data:
spy_columns = df_numerical[['column1', 'column2', 'column3']]  # select the spy columns and store them in a separate dataframe
df_numerical.drop(labels=['column1', 'column2', 'column3'], axis=1, inplace=True)  # cut these columns out of the quantitative data
df_categorical.insert(1, 'column1', spy_columns['column1'])  # add the first spy column to the qualitative data
df_categorical.insert(1, 'column2', spy_columns['column2'])  # add the second spy column to the qualitative data
df_categorical.insert(1, 'column3', spy_columns['column3'])  # add the third spy column to the qualitative data
Finally, we have completely separated the quantitative data from the qualitative data, and now we can work with it properly. The first thing is to understand where we have empty values (NaN, and in some cases 0 will be treated as an empty value).
for i in df_numerical.columns:
    print(i, df_numerical[i][df_numerical[i] == 0].count())  # number of zeros in each quantitative column
At this stage, it is important to understand in which columns zeros may indicate missing values: is it due to the way the data was collected? Or could it be related to the data values themselves? These questions have to be answered case by case.
So, if we do decide that data may be missing where there are zeros, we should replace the zeros with NaN to make this missing data easier to work with later:
df_numerical[["column 1", "column 2"]] = df_numerical[["column 1", "column 2"]].replace(0, np.nan)
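A toy illustration of the same step, where we assume that a price of 0 means "unknown" while 0 available days is a legitimate value (both column names and the interpretation are made up for the example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [100, 0, 80],
    "availability_365": [0, 365, 10],
})

zero_counts = (toy == 0).sum()  # zeros per column, before any replacement
print(zero_counts)

# Replace zeros only where we decided they stand for missing data
toy["price"] = toy["price"].replace(0, np.nan)
print(toy)
```

Note that the column becomes float after the replacement, because NaN is a floating-point value.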
Now let's see where we are missing data:
sns.heatmap(df_numerical.isnull(),yticklabels=False,cbar=False,cmap='viridis')  # you can also use df_numerical.info()
Here, the missing values inside the columns should be marked in yellow. And now the fun begins: how do we deal with these values? Should we delete the rows with these values, or the columns? Or fill these empty values with some others?
Here is an approximate scheme that can help you decide what, in principle, can be done with empty values:
0. Remove unnecessary columns
df_numerical.drop(labels=["column1","column2"], axis=1, inplace=True)
1. Is the number of empty values in this column greater than 50%?
print(df_numerical.isnull().sum() / df_numerical.shape[0] * 100)  # percentage of empty values per column
df_numerical.drop(labels=["column1","column2"], axis=1, inplace=True)  # drop a column if more than 50% of its values are empty
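Instead of typing the column names by hand, the list of columns above the threshold can be computed. A sketch on toy data; the 50% threshold comes from the step above, everything else is made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_empty": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "mostly_full": [1.0, 2.0, np.nan, 4.0],         # 25% missing
})

missing_pct = toy.isnull().mean() * 100
to_drop = missing_pct[missing_pct > 50].index.tolist()
trimmed = toy.drop(columns=to_drop)
print(to_drop)
```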
2. Delete the rows with empty values
df_numerical.dropna(inplace=True)  # drop the rows with empty values, if enough data will remain for training afterwards
3.1. Insert a random value
missing = df_numerical["column"].isna()  # mark the empty cells of the column
observed = df_numerical["column"].dropna().to_numpy()  # observed values to sample from
df_numerical.loc[missing, "column"] = np.random.choice(observed, size=missing.sum())  # insert random observed values into the empty cells
3.2. Insert a constant value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps insert values
imputer = SimpleImputer(strategy='constant', fill_value=0)  # insert a fixed value with SimpleImputer; 0 is a placeholder, put your own number here
df_numerical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # apply it to our table
df_numerical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)  # remove the columns with the old values
3.3. Insert the mean or the most frequent value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps insert values
imputer = SimpleImputer(strategy='mean', missing_values = np.nan)  # instead of mean you can also use most_frequent
df_numerical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # apply it to our table
df_numerical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)  # remove the columns with the old values
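To see what these two strategies do to the numbers, here is a small pandas-only sketch on toy data. It is not the SimpleImputer code above, just the same idea written with fillna so the result is easy to check by hand; the columns and values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 60.0],
    "floor": [1.0, 1.0, np.nan, 3.0],
})

# Mean: every gap gets the average of the observed values of its column
mean_filled = toy.fillna(toy.mean())

# Most frequent: every gap gets the mode of its column
# (mode() can return several rows on ties; we take the first one)
mode_filled = toy.fillna(toy.mode().iloc[0])

print(mean_filled)
print(mode_filled)
```

The mean is sensitive to outliers, so for skewed columns the median (toy.median()) may be a safer choice.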
3.4. Insert a value computed by another model
Sometimes values can be computed with regression models, using models from the sklearn library or other similar libraries. Our team will devote a separate article to how this can be done in the near future.
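Until then, here is a rough idea of what model-based filling looks like. To keep the sketch dependency-free we swap in a plain least-squares line from NumPy (np.polyfit) instead of an sklearn regressor; the columns size_m2 and price and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "size_m2": [20.0, 30.0, 40.0, 50.0, 60.0],
    "price": [40.0, 60.0, np.nan, 100.0, 120.0],
})

known = toy["price"].notna()

# Fit price = slope * size_m2 + intercept on the rows where price is known
slope, intercept = np.polyfit(
    toy.loc[known, "size_m2"].to_numpy(),
    toy.loc[known, "price"].to_numpy(),
    deg=1,
)

# Predict the missing prices from the fitted line
toy.loc[~known, "price"] = slope * toy.loc[~known, "size_m2"] + intercept
print(toy)
```

With real data the predictor columns must themselves be free of gaps, and the quality of the filled values is only as good as the model.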
So, for now, the story about quantitative data stops here. There are many other nuances about how to better prepare and preprocess data for different tasks, but the basics for quantitative data have been covered in this article, and it is time to return to the qualitative data, which we separated from the quantitative data several steps back. You can change this notebook however you like and adapt it to different tasks, so that data preprocessing goes very quickly!
Qualitative data
Basically, for qualitative data the One-hot-encoding method is used to convert it from a string (or object) into a number. Before moving on to this step, let's use the scheme and the code above to deal with the empty values.
df_categorical.nunique()
sns.heatmap(df_categorical.isnull(),yticklabels=False,cbar=False,cmap='viridis')
0. Remove unnecessary columns
df_categorical.drop(labels=["column1","column2"], axis=1, inplace=True)
1. Is the number of empty values in this column greater than 50%?
print(df_categorical.isnull().sum() / df_categorical.shape[0] * 100)  # percentage of empty values per column
df_categorical.drop(labels=["column1","column2"], axis=1, inplace=True)  # drop a column if more than 50% of its values are empty
2. Delete the rows with empty values
df_categorical.dropna(inplace=True)  # drop the rows with empty values, if enough data will remain for training afterwards
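3.1. Insert a random value
missing = df_categorical["column"].isna()  # mark the empty cells of the column
observed = df_categorical["column"].dropna().to_numpy()  # observed values to sample from
df_categorical.loc[missing, "column"] = np.random.choice(observed, size=missing.sum())  # insert random observed values into the empty cells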
3.2. Insert a constant value
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant', fill_value="<your value here>")
df_categorical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_categorical[['column1', 'column2', 'column3']])
df_categorical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)
So, we have finally dealt with the nulls in the qualitative data. Now it is time to one-hot encode the values in your dataset. This method is used very often so that your algorithm can learn from qualitative data.
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])  # one indicator column per category
    res = pd.concat([original_dataframe, dummies], axis=1)  # attach the indicator columns
    res = res.drop([feature_to_encode], axis=1)  # remove the original string column
    return res
features_to_encode = ["column1","column2","column3"]
for feature in features_to_encode:
    df_categorical = encode_and_bind(df_categorical, feature)
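To make the result of one-hot encoding concrete, here is the same idea on a three-row toy frame (the column names and values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Private", "Entire", "Private"],
    "price": [100, 250, 80],
})

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(toy[["room_type"]])
encoded = pd.concat([toy.drop(columns=["room_type"]), dummies], axis=1)
print(encoded)
```

Each row now has exactly one truthy indicator among the room_type_* columns. Keep in mind that a column with many distinct values produces just as many new columns.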
So, we have finally finished processing the qualitative and quantitative data separately; time to combine them back together
new_df = pd.concat([df_numerical,df_categorical], axis=1)
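One thing to watch here: pd.concat with axis=1 aligns rows by index. If different rows were dropped from the two parts earlier (for example by calling dropna() on each part separately), the combined frame gets new NaN values. The sketch below shows the effect on toy data and one possible way around it, keeping only the rows present in both parts; the frames are made up for illustration.

```python
import pandas as pd

# Row 1 was dropped from the numeric part, row 3 from the categorical part
num_part = pd.DataFrame({"price": [100.0, 80.0, 60.0]}, index=[0, 2, 3])
cat_part = pd.DataFrame({"room_Private": [1, 0, 1]}, index=[0, 1, 2])

naive = pd.concat([num_part, cat_part], axis=1)                  # union of indexes, gaps reappear
aligned = pd.concat([num_part, cat_part], axis=1, join="inner")  # only rows present in both parts

print(naive)
print(aligned)
```

Whether dropping the unmatched rows is acceptable depends on how much data is left, the same trade-off as in the row-deletion step.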
After we have combined the datasets into one, we can finally apply a data transformation using MinMaxScaler from the sklearn library. This brings our values into the range between 0 and 1, which will help when training a model later.
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
new_df = min_max_scaler.fit_transform(new_df)
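With its default settings, MinMaxScaler rescales each column as (x - min) / (max - min). The pandas-only sketch below reproduces that formula on toy data so the result can be checked by hand; the columns and values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [50.0, 100.0, 150.0],
    "reviews": [0.0, 5.0, 10.0],
})

# Column-wise min-max scaling: the smallest value maps to 0, the largest to 1
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Note that fit_transform returns a NumPy array, so the column names are lost; wrapping the result in pd.DataFrame(..., columns=...) brings them back if you need them.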
This data is now ready for anything: neural networks, standard ML algorithms, and so on!
In this article, we did not consider working with time-series data, because for such data you should use slightly different processing techniques, depending on your task. In the future, our team will devote a separate article to this topic, and we hope it will bring something interesting, new and useful into your life, just like this one.
Source: www.habr.com