People entering the field of Data Science often have less than realistic expectations of what awaits them. Many think that they will now be writing cool neural networks, creating a voice assistant like the one in Iron Man, or beating everyone on the financial markets.
But the work of a Data Scientist is driven by data, and one of the most important and time-consuming parts of it is processing the data before it is fed into a neural network or analyzed in some particular way.
In this article, our team describes how you can process data quickly and easily with step-by-step instructions and code. We have tried to make the code flexible enough to be used with different datasets.
Many professionals may not find anything unusual in this article, but beginners will be able to learn something new, and anyone who has long dreamed of making a separate notebook for fast, structured data processing can copy the code and adapt it for themselves.
We have the dataset. What do we do next?
So, the standard first step: we need to understand what we are dealing with, the overall picture. To do this, we use pandas to get a simple overview of the different data types.
import pandas as pd  # import pandas
import numpy as np  # import numpy
df = pd.read_csv("AB_NYC_2019.csv")  # read the dataset and store it in the variable df
df.head(3)  # look at the first 3 rows to see what the values look like
df.info()  # show information about the columns
Let's look at the column values:
- Does the number of non-null entries in each column match the total number of rows?
- What is the nature of the data in each column?
- Which column do we want to target in order to make predictions for it?
The answers to these questions will let you analyze the dataset and roughly sketch a plan for your next steps.
Also, for a deeper look at the values in each column, we can use the pandas describe() function. Its drawback, however, is that by default it gives no information about columns with string values. We will deal with those later.
df.describe()
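As a small illustration of that limitation, the sketch below summarizes the numeric and the string columns separately. The toy DataFrame and its column names are assumptions made up for this example; they are not taken from the dataset above.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "price": [100, 150, 80, 150],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Private room"],
})

# Default call: only the numeric column "price" is summarized.
numeric_summary = toy.describe()

# Describing only the non-numeric columns yields count, unique, top and freq.
string_summary = toy.select_dtypes(exclude="number").describe()
print(string_summary)
```

For string columns, the summary reports how many distinct values there are and which one is the most frequent, which is often enough for a first look.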
Magic visualization
Let's look at where we have no values at all:
import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
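A heatmap shows where the gaps are, but not exactly how many there are. As a possible complement, the sketch below counts missing values per column. The toy DataFrame is an assumption made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps; column names are illustrative only.
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [100, 150, 80, 120],
})

# Number of missing values per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```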
That was a quick look from above; now we move on to more interesting things.
Let's try to find and, if possible, remove columns that hold a single value across all rows (they will not affect the result in any way):
# Overwrite the dataset, keeping only the columns that have more than one unique value
df = df[[c for c in list(df) if len(df[c].unique()) > 1]]
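To see what this filter does, here is a minimal sketch on made-up data. The column names are hypothetical. It uses `nunique(dropna=False)`, which, like `len(unique())`, counts NaN as a value of its own, so a column holding one value plus some gaps is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "city" is constant, "flag" has one value plus a gap.
toy = pd.DataFrame({
    "city": ["NYC", "NYC", "NYC"],
    "price": [100, 150, 80],
    "flag": [1.0, np.nan, 1.0],
})

# Keep only columns with more than one distinct value (NaN counts as a value).
kept = toy[[c for c in toy.columns if toy[c].nunique(dropna=False) > 1]]
print(list(kept.columns))
```

If a column with one value plus gaps should also be dropped, `nunique()` with its default `dropna=True` would be the variant to try.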
Now we protect ourselves and the success of our project from duplicate rows (rows that contain the same information, in the same order, as one of the existing rows):
df.drop_duplicates(inplace=True)  # Do this if we consider it necessary.
# In some projects, such data should not be removed right from the start.
We split the dataset in two: one part with qualitative values and the other with quantitative ones
A small clarification is needed here. If the rows with missing data in the qualitative part and in the quantitative part do not correlate strongly with each other, we will have to decide what to sacrifice: all rows with missing data, only some of them, or certain columns. If the rows do correlate, we have every right to split the dataset in two. Otherwise, you first need to deal with the rows whose missing data does not coincide between the qualitative and quantitative parts, and only then split the dataset in two.
df_numerical = df.select_dtypes(include = [np.number])
df_categorical = df.select_dtypes(exclude = [np.number])
We do this to make it easier to process these two different types of data; later we will see how much this simplifies our lives.
Working with quantitative data
The first thing we should do is determine whether there are "spy columns" in the quantitative data. We call them that because they present themselves as quantitative data but behave like qualitative data.
How do we identify them? Of course, it all depends on the nature of the data you are analyzing, but in general such columns tend to have few unique values (somewhere in the region of 3-10).
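One possible way to check how strongly the gaps in the two parts coincide is to compare row-level missingness masks. This is only a sketch on made-up data; the column names and the idea of using a simple overlap count are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, np.nan, 120.0],
    "reviews": [3.0, 4.0, 5.0, 1.0, 2.0],
    "room_type": ["Private", None, "Shared", "Shared", None],
})

# True for rows that have at least one gap in the respective part.
num_missing = toy.select_dtypes(include=[np.number]).isnull().any(axis=1)
cat_missing = toy.select_dtypes(exclude=[np.number]).isnull().any(axis=1)

both = int((num_missing & cat_missing).sum())    # gaps in both parts
either = int((num_missing | cat_missing).sum())  # gaps in at least one part
print(f"{both} of {either} incomplete rows have gaps in both parts")
```

When `both` is close to `either`, dropping incomplete rows in one part removes roughly the same rows as in the other, so splitting first is less risky.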
print(df_numerical.nunique())
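The output of `nunique()` can also be filtered to shortlist candidates automatically. In this sketch the toy data, the column names, and the threshold of 10 are all assumptions; the final decision still depends on what each column means.

```python
import pandas as pd

# Hypothetical toy data: "price" is truly numeric, "floor" has only 3 levels.
toy_numerical = pd.DataFrame({
    "price": list(range(50, 170, 10)),  # 12 distinct values
    "floor": [1, 2, 3] * 4,             # 3 distinct values
})

max_unique = 10  # assumed threshold, adjust to your data
unique_counts = toy_numerical.nunique()
spy_candidates = unique_counts[unique_counts <= max_unique].index.tolist()
print(spy_candidates)
```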
Once we have identified the spy columns, we move them from the quantitative data to the qualitative data:
spy_columns = df_numerical[['column1', 'column2', 'column3']]  # select the spy columns into a separate DataFrame
df_numerical.drop(labels=['column1', 'column2', 'column3'], axis=1, inplace=True)  # cut these columns out of the quantitative data
df_categorical.insert(1, 'column1', spy_columns['column1'])  # add the first spy column to the qualitative data
df_categorical.insert(1, 'column2', spy_columns['column2'])  # add the second spy column to the qualitative data
df_categorical.insert(1, 'column3', spy_columns['column3'])  # add the third spy column to the qualitative data
Finally, we have fully separated the quantitative data from the qualitative data and can now work with each properly. The first thing is to understand where we have empty values (NaN, and in some cases 0 will also be treated as an empty value).
for i in df_numerical.columns:
    print(i, df_numerical[i][df_numerical[i] == 0].count())  # number of zeros in each column
At this point it is important to understand in which columns zeros may indicate missing values: is this due to how the data was collected? Or could it be related to the data values themselves? These questions must be answered case by case.
So, if we do decide that data may be missing where there are zeros, we should replace the zeros with NaN to make it easier to work with this missing data later:
df_numerical[["column 1", "column 2"]] = df_numerical[["column 1", "column 2"]].replace(0, np.nan)
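A small worked example of this replacement, on made-up data, shows that only the chosen columns are touched. The column names and the choice of which columns treat zero as missing are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; a zero price or stay length is implausible,
# while zero reviews is a legitimate value.
toy_numerical = pd.DataFrame({
    "price": [100, 0, 80],
    "minimum_nights": [1, 0, 3],
    "number_of_reviews": [0, 5, 2],
})

zero_means_missing = ["price", "minimum_nights"]  # assumed, decided case by case
toy_numerical[zero_means_missing] = toy_numerical[zero_means_missing].replace(0, np.nan)
print(toy_numerical.isnull().sum())
```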
Now let's see where we are missing data:
sns.heatmap(df_numerical.isnull(), yticklabels=False, cbar=False, cmap='viridis')  # You can also use df_numerical.info()
Here the missing values inside the columns should be marked in yellow. And now the fun begins: how do we deal with these values? Should we delete the rows or the columns that contain them? Or fill these empty values with something else?
Here is a rough scheme that can help you decide what, in principle, can be done with empty values:
0. Remove unnecessary columns
df_numerical.drop(labels=["column1", "column2"], axis=1, inplace=True)
1. Is the share of empty values in this column greater than 50%?
print(df_numerical.isnull().sum() / df_numerical.shape[0] * 100)
df_numerical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # Drop a column if more than 50% of its values are empty
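Instead of reading the percentages and typing the column names by hand, the columns above the threshold can be selected programmatically. This is a minimal sketch on made-up data; the column names and the 0.5 cutoff are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" is 75% empty, "b" 25%, "c" exactly 50%.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, np.nan, np.nan, 4.0],
})

missing_share = toy.isnull().mean()  # fraction of empty values per column
to_drop = missing_share[missing_share > 0.5].index.tolist()
toy = toy.drop(columns=to_drop)
print(to_drop)
```

Note that a column at exactly 50% is kept here, because the comparison is strict.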
2. Delete the rows with empty values
df_numerical.dropna(inplace=True)  # Delete rows with empty values, if enough data remains for training afterwards
3.1. Inserting a random value
import random  # import random
observed = df_numerical["column"].dropna().tolist()  # values that are actually present in the column
df_numerical["column"] = df_numerical["column"].apply(lambda x: random.choice(observed) if pd.isna(x) else x)  # insert random observed values into the empty cells
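A vectorized variant of random imputation, with a fixed seed so the result is reproducible, might look like the sketch below. The toy column and the use of NumPy's `default_rng` are assumptions for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# Hypothetical toy column with two gaps.
toy = pd.DataFrame({"price": [100.0, np.nan, 80.0, np.nan, 120.0]})

missing_mask = toy["price"].isna()
observed = toy.loc[~missing_mask, "price"].to_numpy()

# Draw one observed value for every empty cell.
toy.loc[missing_mask, "price"] = rng.choice(observed, size=missing_mask.sum())
print(toy)
```

Sampling from the observed values keeps the filled column within the range and rough distribution of the real data.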
3.2. Inserting a constant value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps insert the values
imputer = SimpleImputer(strategy='constant', fill_value=0)  # insert a fixed value with SimpleImputer; replace 0 with your own number
df_numerical[["new_column1", 'new_column2', 'new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # apply this to our table
df_numerical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # remove the columns with the old values
3.3. Inserting the mean or the most frequent value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps insert the values
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)  # instead of mean you can also use most_frequent
df_numerical[["new_column1", 'new_column2', 'new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # apply this to our table
df_numerical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # remove the columns with the old values
3.4. Inserting a value computed by another model
Sometimes values can be computed with regression models from the sklearn library or other similar libraries. Our team will devote a separate article to how this can be done in the near future.
So, for now, the story about quantitative data is paused, because there are many other nuances about how best to prepare and preprocess data for different tasks, and the basics for quantitative data have been covered in this article. Now it is time to return to the qualitative data, which we left several steps behind the quantitative data. You can change this notebook as you like, adapting it to different tasks, so that data preprocessing goes very quickly!
Qualitative data
Basically, for qualitative data, the one-hot encoding method is used to convert it from a string (or object) into numbers. Before moving on to that step, let's use the scheme and code above to deal with empty values.
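Until then, here is a deliberately simple sketch of the idea: fit a regression on the rows where the value is known and predict it for the rows where it is missing. The toy data, the column names, and the choice of `LinearRegression` with a single predictor are assumptions for this example, not the team's method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where price is exactly 10 * area.
toy = pd.DataFrame({
    "area": [20.0, 30.0, 40.0, 50.0, 60.0],
    "price": [200.0, 300.0, np.nan, 500.0, np.nan],
})

known = toy[toy["price"].notna()]
unknown = toy[toy["price"].isna()]

# Train on complete rows, then predict the missing target values.
model = LinearRegression().fit(known[["area"]], known["price"])
toy.loc[unknown.index, "price"] = model.predict(unknown[["area"]])
print(toy)
```

On real data the predictors themselves must be free of gaps, and the quality of the fill depends on how well the other columns explain the missing one.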
df_categorical.nunique()
sns.heatmap(df_categorical.isnull(),yticklabels=False,cbar=False,cmap='viridis')
0. Remove unnecessary columns
df_categorical.drop(labels=["column1", "column2"], axis=1, inplace=True)
1. Is the share of empty values in this column greater than 50%?
print(df_categorical.isnull().sum() / df_categorical.shape[0] * 100)
df_categorical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # Drop a column if more than 50% of its values are empty
2. Delete the rows with empty values
df_categorical.dropna(inplace=True)  # Delete rows with empty values, if enough data remains for training afterwards
3.1. Inserting a random value
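import random
observed = df_categorical["column"].dropna().tolist()  # values that are actually present in the column
df_categorical["column"] = df_categorical["column"].apply(lambda x: random.choice(observed) if pd.isna(x) else x)  # insert random observed values into the empty cells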
3.2. Inserting a constant value
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant', fill_value="<Your value here>")
df_categorical[["new_column1", 'new_column2', 'new_column3']] = imputer.fit_transform(df_categorical[['column1', 'column2', 'column3']])
df_categorical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)
So, we have finally got the nulls in the qualitative data under control. Now it is time to one-hot encode the values in your dataset. This method is very often used so that your algorithm can learn from qualitative data.
def encode_and_bind(original_dataframe, feature_to_encode):
    # Build dummy columns for one feature, append them, and drop the original column
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res

features_to_encode = ["column1", "column2", "column3"]
for feature in features_to_encode:
    df_categorical = encode_and_bind(df_categorical, feature)
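To make the result of one-hot encoding concrete, here is a small sketch on made-up data. It uses the `columns` argument of `pd.get_dummies`, which encodes several columns in one call; the toy DataFrame and its column names are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    "room_type": ["Private", "Shared", "Private"],
    "borough": ["Bronx", "Queens", "Queens"],
})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["room_type", "borough"])
print(encoded)
```

Every distinct value becomes its own indicator column, so columns with many distinct values can widen the table considerably.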
So, we have finally finished processing the qualitative and quantitative data separately; it is time to combine them back
new_df = pd.concat([df_numerical,df_categorical], axis=1)
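One thing worth checking at this step: `pd.concat` with `axis=1` aligns rows by index, so if rows were deleted from only one of the two parts, the default outer join brings the gaps back. The sketch below, on made-up data with hypothetical column names, contrasts the default with `join="inner"`.

```python
import pandas as pd

# Hypothetical parts: row 1 was dropped from the quantitative part only.
num = pd.DataFrame({"price": [100.0, 80.0]}, index=[0, 2])
cat = pd.DataFrame({"room_Private": [1, 0, 1]}, index=[0, 1, 2])

outer = pd.concat([num, cat], axis=1)                # default: keeps all rows, adds NaN
inner = pd.concat([num, cat], axis=1, join="inner")  # keeps only rows present in both
print(outer)
print(inner)
```

Whether the inner join is appropriate depends on how many rows it discards; it is shown here only to illustrate the alignment behavior.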
After we have combined the datasets into one, we can finally apply a data transformation using MinMaxScaler from the sklearn library. This will bring our values into the range between 0 and 1, which will help when training a model later.
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
new_df = min_max_scaler.fit_transform(new_df)
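Note that `fit_transform` returns a NumPy array, so the column names are lost. If they are still needed, one option is to wrap the result back into a DataFrame, as in this sketch on made-up data (the column names are hypothetical).

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({"price": [50.0, 100.0, 150.0], "nights": [1.0, 3.0, 5.0]})

# Scale every column to [0, 1] and restore the original labels.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(scaled)
```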
This data is now ready for anything: neural networks, standard ML algorithms, and so on!
In this article we did not cover working with time series data, since for such data you should use somewhat different processing techniques, depending on your task. In the future, our team will devote a separate article to this topic, and we hope it will bring something interesting, new, and useful into your life, just like this one.
Source: www.habr.com