People entering Data Science often have less-than-realistic expectations of what awaits them. Many think they will now write cool neural networks, build a voice assistant like the one from Iron Man, or beat everyone on the financial markets.
But the work of a Data Scientist is driven by data, and one of the most important and time-consuming parts is processing the data before feeding it into a neural network or analyzing it in some way.
In this article, our team describes how to process data quickly and easily, with step-by-step instructions and code. We tried to make the code flexible enough to be used for different datasets.
Many professionals may find nothing unusual in this article, but beginners will be able to learn something new, and anyone who has long dreamed of making a separate notebook for fast, structured data processing can copy the code and adapt it for themselves.
We got a dataset. What to do next?
So, the standard: we need to understand what we are dealing with, the overall picture. To do this, we use pandas to simply determine the different data types.
import pandas as pd  # import pandas
import numpy as np  # import numpy
df = pd.read_csv("AB_NYC_2019.csv")  # read the dataset and store it in the variable df
df.head(3)  # look at the first 3 rows to see what the values look like
df.info()  # show information about the columns
Let's look at the column values:
- Does the number of non-null rows in each column match the total number of rows?
- What is the nature of the data in each column?
- Which column do we want to target in order to make predictions for it?
The answers to these questions will let you analyze the dataset and roughly sketch a plan for your next steps.
Also, for a deeper look at the values in each column, we can use the pandas describe() function. The drawback of this function is that, by default, it gives no information about columns with string values. We will deal with them later.
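The first two questions can be answered in one small table. The following is only a sketch: the helper summarize_columns and the toy columns price and room_type are hypothetical names introduced for illustration, not part of the original notebook.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, share of missing values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notnull().sum(),
        "missing_pct": frame.isnull().mean() * 100,
    })


# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "price": [100.0, None, 250.0, 80.0],
    "room_type": ["Private", "Shared", None, None],
})

summary = summarize_columns(toy)
print(summary)
```

A column whose non_null count is lower than the number of rows has gaps that will need one of the treatments discussed later in the article.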
df.describe()
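As a side note, describe() can also be asked to cover non-numeric columns through its include argument. A minimal sketch on made-up data (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one string column.
toy = pd.DataFrame({
    "price": [100, 250, 80],
    "room_type": ["Private", "Shared", "Private"],
})

numeric_stats = toy.describe()           # default: numeric columns only
all_stats = toy.describe(include="all")  # adds count/unique/top/freq for strings

print(numeric_stats)
print(all_stats)
```

For string columns the interesting rows are unique, top (the most frequent value), and freq (how often it occurs); the numeric statistics are left as NaN for them.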
Magical visualization
Let's look at where we have no values at all:
import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
That was a quick look from above; now we move on to more interesting things.
Let's try to find and, if possible, remove columns that have only one value in all rows (they will not affect the result in any way):
# Overwrite the dataset, keeping only the columns that have more than one unique value
df = df[[c for c in list(df) if len(df[c].unique()) > 1]]
Now we protect ourselves and the success of our project from duplicate rows (rows that contain the same information, in the same order, as one of the existing rows):
df.drop_duplicates(inplace=True)  # Do this only if we consider it necessary.
# In some projects, such data should not be deleted at the very start.
We split the dataset in two: one part with qualitative values, the other with quantitative values
Here we need a small clarification: if the rows with missing data in the qualitative and quantitative parts are not strongly correlated with each other, we will have to decide what to sacrifice: all rows with missing data, only some of them, or certain columns. If the rows are correlated, we have every right to split the dataset in two. Otherwise, you first need to deal with the rows whose missing data does not correlate between the qualitative and quantitative parts, and only then split the dataset in two.
df_numerical = df.select_dtypes(include = [np.number])
df_categorical = df.select_dtypes(exclude = [np.number])
We do this to make it easier to process these two different kinds of data; later we will see how much simpler this makes our life.
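One rough way to judge whether the gaps in the two parts line up is to count rows that are incomplete in both parts versus in only one of them. This is just a sketch on made-up data; the toy columns and variable names are assumptions, and a simple overlap count is a stand-in for whatever correlation check suits your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns and one string column.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, np.nan],
    "reviews": [3.0, np.nan, 7.0, 1.0],
    "room_type": ["Private", None, "Shared", "Private"],
})

toy_numerical = toy.select_dtypes(include=[np.number])
toy_categorical = toy.select_dtypes(exclude=[np.number])

# True for rows that have at least one gap in the respective part.
num_missing = toy_numerical.isnull().any(axis=1)
cat_missing = toy_categorical.isnull().any(axis=1)

both = int((num_missing & cat_missing).sum())       # gaps in both parts
only_num = int((num_missing & ~cat_missing).sum())  # gaps only in numeric part
only_cat = int((~num_missing & cat_missing).sum())  # gaps only in string part

print(both, only_num, only_cat)
```

If most incomplete rows fall into the "both" bucket, dropping them costs little extra; large "only" buckets mean each part will lose different rows, which matters when the parts are joined again.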
Working with quantitative data
The first thing to do is determine whether there are "spy columns" in the quantitative data. We call them that because they present themselves as quantitative data but act as qualitative data.
How do we identify them? Of course, it all depends on the nature of the data you are analyzing, but in general such columns tend to have few unique values (in the region of 3-10 unique values).
print(df_numerical.nunique())
Once we have identified the spy columns, we move them from the quantitative data to the qualitative data:
spy_columns = df_numerical[['column1', 'column2', 'column3']]  # select the spy columns and store them in a separate dataframe
df_numerical.drop(labels=['column1', 'column2', 'column3'], axis=1, inplace=True)  # cut these columns out of the quantitative data
df_categorical.insert(1, 'column1', spy_columns['column1'])  # add the first spy column to the qualitative data
df_categorical.insert(1, 'column2', spy_columns['column2'])  # add the second spy column to the qualitative data
df_categorical.insert(1, 'column3', spy_columns['column3'])  # add the third spy column to the qualitative data
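The same move can be sketched without typing the column names by hand. Everything below is an assumption made for illustration: the threshold MAX_UNIQUE, the toy columns, and the conversion to strings. The conversion is there because pd.get_dummies, by default, only encodes object, string, and category columns, so a spy column left as integers would pass through one-hot encoding unchanged.

```python
import numpy as np
import pandas as pd

MAX_UNIQUE = 10  # assumed cut-off, taken from the "3-10 unique values" rule of thumb

# Hypothetical toy data: "floor" looks numeric but behaves like a category.
toy_numerical = pd.DataFrame({
    "price": np.arange(20, dtype=float),
    "floor": [1, 2, 3, 1] * 5,
})
toy_categorical = pd.DataFrame({"room_type": ["Private", "Shared"] * 10})

# Columns with few distinct values are treated as spy columns.
unique_counts = toy_numerical.nunique()
spy_names = unique_counts[unique_counts <= MAX_UNIQUE].index.tolist()

# Move them over as strings so that they are handled as categories later.
toy_categorical = pd.concat(
    [toy_categorical, toy_numerical[spy_names].astype(str)], axis=1
)
toy_numerical = toy_numerical.drop(columns=spy_names)

print(spy_names)
```

A fixed threshold will misfire on small datasets, where even a genuine measurement may have fewer than ten distinct values, so the resulting list is worth reviewing by eye.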
Finally, we have completely separated the quantitative data from the qualitative data and can now work with it properly. The first thing is to understand where we have empty values (NaN, and in some cases 0 will also be treated as an empty value).
for i in df_numerical.columns:
    print(i, df_numerical[i][df_numerical[i] == 0].count())  # number of zeros in each column
At this stage, it is important to understand in which columns zeros may indicate missing values: is it due to how the data was collected? Or could it be related to the data values themselves? These questions must be answered case by case.
So, if we still decide that we may be missing data where there are zeros, we should replace the zeros with NaN to make it easier to work with this missing data later:
df_numerical[["column 1", "column 2"]] = df_numerical[["column 1", "column 2"]].replace(0, np.nan)  # replace zeros with NaN
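A small worked example of this step, on made-up data. The choice that zero means "missing" for price but is a legitimate value for reviews is an assumption made for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a zero price is suspicious, zero reviews is plausible.
toy = pd.DataFrame({
    "price": [100.0, 0.0, 250.0],
    "reviews": [0, 5, 0],
})

zero_counts = (toy == 0).sum()  # how many zeros each column contains

# Only the columns where zero is judged to mean "no data" are converted.
zero_means_missing = ["price"]
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)

print(zero_counts)
print(toy)
```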
Now let's see where we are missing data:
sns.heatmap(df_numerical.isnull(),yticklabels=False,cbar=False,cmap='viridis')  # You can also use df_numerical.info()
Here the missing values inside the columns should be marked in yellow. And now the fun begins: how do we deal with these values? Should we delete the rows with these values, or the columns? Or fill these empty values with something else?
Here is an approximate scheme that can help you decide what can be done with empty values:
0. Remove unnecessary columns
df_numerical.drop(labels=["column1", "column2"], axis=1, inplace=True)
1. Is the share of empty values in this column greater than 50%?
print(df_numerical.isnull().sum() / df_numerical.shape[0] * 100)
df_numerical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # Delete a column if it has more than 50% empty values
2. Delete rows with empty values
df_numerical.dropna(inplace=True)  # Delete rows with empty values, if enough data remains for training afterwards
3.1. Inserting a random value
observed = df_numerical["column"].dropna().to_numpy()  # values that are actually present
missing = df_numerical["column"].isnull()  # mask of the empty cells
df_numerical.loc[missing, "column"] = np.random.choice(observed, size=missing.sum())  # fill the empty cells with random observed values
3.2. Inserting a constant value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which will help insert the values
imputer = SimpleImputer(strategy='constant', fill_value="<Your value here>")  # insert a specific value using SimpleImputer; for numeric columns it must be a number
df_numerical[["new_column1", 'new_column2', 'new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # Apply this to our table
df_numerical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # Remove the columns with the old values
3.3. Inserting the mean or the most frequent value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which will help insert the values
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)  # most_frequent can be used instead of mean
df_numerical[["new_column1", 'new_column2', 'new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # Apply this to our table
df_numerical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # Remove the columns with the old values
3.4. Inserting a value computed by another model
Sometimes values can be computed with regression models from the sklearn library or other similar libraries. Our team will devote a separate article to how this can be done in the near future.
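To give a rough idea of what model-based filling might look like, here is a minimal sketch that predicts a missing value from another column with sklearn's LinearRegression. It is not the method the authors have in mind for their follow-up article; the toy columns area and price, and the assumption that one column is a good predictor of the other, are invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: price is exactly 10 * area where it is known.
toy = pd.DataFrame({
    "area": [20.0, 30.0, 40.0, 50.0, 60.0],
    "price": [200.0, 300.0, np.nan, 500.0, np.nan],
})

known = toy[toy["price"].notnull()]   # rows usable for training
unknown = toy[toy["price"].isnull()]  # rows whose price must be estimated

# Fit on the complete rows, then predict the missing prices.
model = LinearRegression()
model.fit(known[["area"]], known["price"])
toy.loc[unknown.index, "price"] = model.predict(unknown[["area"]])

print(toy)
```

On real data the predictor columns must themselves be free of gaps in the rows being predicted, and the filled values carry the model's error with them.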
So, for now, the story about quantitative data pauses here: there are many other nuances in how best to prepare and preprocess data for different tasks, but the basics for quantitative data have been covered in this article. Now it is time to return to the qualitative data, which we separated from the quantitative data several steps back. You can change this notebook as you like, adapting it to different tasks, so that data preprocessing goes very quickly!
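The scheme for empty values in quantitative data can be condensed into one helper. This is only a sketch under stated assumptions: the function name handle_missing, its parameters, and the toy data are hypothetical, and it covers just the "drop mostly empty columns, then fill with the mean or a random observed value" branches.

```python
import numpy as np
import pandas as pd


def handle_missing(frame, max_missing_share=0.5, strategy="mean", seed=0):
    """Drop columns that are mostly empty, then fill the remaining gaps."""
    share = frame.isnull().mean()
    kept = frame.loc[:, share <= max_missing_share].copy()
    rng = np.random.default_rng(seed)
    for name in kept.columns:
        missing = kept[name].isnull()
        if not missing.any():
            continue
        if strategy == "mean":
            kept.loc[missing, name] = kept[name].mean()
        elif strategy == "random":
            # Sample replacements from the values that are actually present.
            observed = kept[name].dropna().to_numpy()
            kept.loc[missing, name] = rng.choice(observed, size=int(missing.sum()))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return kept


# Hypothetical toy data: one column is 75% empty, the other has a single gap.
toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "price": [100.0, np.nan, 300.0, 200.0],
})

filled_mean = handle_missing(toy, strategy="mean")
filled_random = handle_missing(toy, strategy="random")
print(filled_mean)
```

Returning a new frame instead of modifying the input makes it easy to compare strategies side by side before committing to one.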
Qualitative data
Basically, for qualitative data, the One-hot-encoding method is used to convert it from a string (or object) to a number. Before moving on to this point, let's use the scheme and code above to deal with empty values.
df_categorical.nunique()
sns.heatmap(df_categorical.isnull(),yticklabels=False,cbar=False,cmap='viridis')
0. Remove unnecessary columns
df_categorical.drop(labels=["column1", "column2"], axis=1, inplace=True)
1. Is the share of empty values in this column greater than 50%?
print(df_categorical.isnull().sum() / df_categorical.shape[0] * 100)
df_categorical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # Delete a column if it has more than 50% empty values
2. Delete rows with empty values
df_categorical.dropna(inplace=True)  # Delete rows with empty values, if enough data remains for training afterwards
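3.1. Inserting a random value
observed = df_categorical["column"].dropna().to_numpy()  # values that are actually present
missing = df_categorical["column"].isnull()  # mask of the empty cells
df_categorical.loc[missing, "column"] = np.random.choice(observed, size=missing.sum())  # fill the empty cells with random observed values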
3.2. Inserting a constant value
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant', fill_value="<Your value here>")
df_categorical[["new_column1", 'new_column2', 'new_column3']] = imputer.fit_transform(df_categorical[['column1', 'column2', 'column3']])
df_categorical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)
So, finally we have dealt with the nulls in the qualitative data. Now it is time to one-hot encode the values in your dataset. This method is very commonly used so that your algorithm can learn from qualitative data.
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])  # one-hot encode a single column
    res = pd.concat([original_dataframe, dummies], axis=1)  # append the new indicator columns
    res = res.drop([feature_to_encode], axis=1)  # remove the original column
    return res
features_to_encode = ["column1", "column2", "column3"]
for feature in features_to_encode:
    df_categorical = encode_and_bind(df_categorical, feature)
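To make the effect of one-hot encoding concrete, here is a small sketch on made-up data. It calls pd.get_dummies directly with its columns argument instead of the helper above; the toy columns are hypothetical, and dtype=int is chosen here so that the indicators are 0/1 integers rather than booleans.

```python
import pandas as pd

# Hypothetical toy data with two string columns.
toy = pd.DataFrame({
    "room_type": ["Private", "Shared", "Private"],
    "city": ["NYC", "NYC", "Boston"],
})

# Each distinct value becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["room_type", "city"], dtype=int)

print(encoded)
```

Every category turns into a separate column, so a feature with many distinct values can widen the table considerably.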
So, finally we have finished processing the separate qualitative and quantitative data: time to combine them back
new_df = pd.concat([df_numerical,df_categorical], axis=1)
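One thing worth checking at this point: pd.concat with axis=1 aligns rows by index, so if rows were dropped from one part but not the other, the combined frame gets NaN where a row is missing on one side. A minimal sketch with made-up frames (the names num and cat and their contents are hypothetical):

```python
import pandas as pd

# Hypothetical parts: row 1 was dropped from the numeric part only.
num = pd.DataFrame({"price": [100.0, 300.0]}, index=[0, 2])
cat = pd.DataFrame({"room_type_Private": [1, 0, 1]}, index=[0, 1, 2])

outer = pd.concat([num, cat], axis=1)                # keeps every index, fills NaN
inner = pd.concat([num, cat], axis=1, join="inner")  # keeps only shared indexes

print(outer)
print(inner)
```

Using join="inner" is one way to keep only the rows that survived in both parts; whether that is the right choice depends on how many rows it discards.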
After we have combined the datasets into one, we can finally apply a data transformation using MinMaxScaler from the sklearn library. This will put our values between 0 and 1, which will help when training the model later.
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
new_df = min_max_scaler.fit_transform(new_df)
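Note that fit_transform returns a NumPy array, so the column names are lost. The sketch below, on made-up data, wraps the result back into a DataFrame and compares it with the min-max formula (x - min) / (max - min) computed by hand; the toy columns are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with two numeric columns on different scales.
toy = pd.DataFrame({
    "price": [100.0, 200.0, 300.0],
    "reviews": [0.0, 5.0, 10.0],
})

scaler = MinMaxScaler()
# Rebuild a DataFrame so that column names and the index are preserved.
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# The same transformation written out by hand, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

When a separate test set is involved, the usual practice is to fit the scaler on the training data only and reuse it with transform on the rest.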
This data is now ready for anything: neural networks, standard ML algorithms, and so on!
In this article, we did not consider working with time-series data, since for such data you should use slightly different processing techniques, depending on your task. In the future, our team will devote a separate article to this topic, and we hope it will bring something interesting, new, and useful into your life, just like this one.
Source: www.habr.com