Often, people entering the field of Data Science have less-than-realistic expectations of what awaits them. Many think they will now write cool neural networks, build a voice assistant like the one from Iron Man, or beat everyone on the financial markets.
But the work of a data scientist is driven by data, and one of the most important and time-consuming parts of it is processing the data before feeding it into a neural network or analyzing it in some other way.
In this article, our team describes how you can process data quickly and easily, with step-by-step instructions and code. We tried to make the code flexible enough to be used on different datasets.
Many professionals may not find anything extraordinary in this article, but beginners will be able to learn something new, and anyone who has long wanted a separate notebook for fast, structured data processing can copy the code and adapt it for themselves.
We received a dataset. What do we do next?
So, the standard: we need to understand what we are dealing with, the overall picture. To do this, we use pandas to simply identify the different data types.
import pandas as pd  # import pandas
import numpy as np  # import numpy
df = pd.read_csv("AB_NYC_2019.csv")  # read the dataset into the variable df
df.head(3)  # look at the first 3 rows to see what the values look like
df.info()  # show information about the columns
Let's look at the values in the columns:
- Does the number of non-null values in each column match the total number of rows?
- What is the nature of the data in each column?
- Which column do we want to target in order to make predictions for it?
The answers to these questions will let you analyze the dataset and roughly draw up a plan for your next steps.
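The first two questions can be answered in one table. The sketch below is only an illustration: the helper name `column_overview` and the toy data are assumptions, not part of the original notebook.

```python
import pandas as pd

def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with dtype, non-null count,
    percentage of missing values, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "missing_pct": frame.isna().mean() * 100,
        "unique": frame.nunique(dropna=True),
    })

# Toy data standing in for the real dataset (made-up values).
toy = pd.DataFrame({
    "price": [100, 150, None, 80],
    "room_type": ["Private", "Shared", "Private", None],
    "id": [1, 2, 3, 4],
})
overview = column_overview(toy)
print(overview)
```

A column whose `non_null` count is lower than the number of rows contains missing values and will need attention later.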
Also, for a deeper look at the values in each column, we can use the pandas describe() function. The drawback of this function is that, by default, it gives no information about columns with string values. We will deal with them later.
df.describe()
Magic visualization
Let's see where we have no values at all:
import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
That was a quick look from above; now we move on to more interesting things.
Let's try to find and, if possible, remove columns that have only one value in all rows (they will not affect the result in any way):
# Overwrite the dataset, keeping only the columns that have more than one unique value
df = df[[c for c in df.columns if df[c].nunique(dropna=False) > 1]]
Now we protect ourselves and the success of our project from duplicate rows (rows that contain the same information in the same order as one of the existing rows):
# Do this only if you consider it necessary.
# In some projects such data should not be removed right at the start.
df.drop_duplicates(inplace=True)
We split the dataset into two: one with qualitative values and the other with quantitative ones.
Here we need to make a small clarification: if the rows with missing data in the qualitative and quantitative parts do not correlate much with each other, we will have to decide what to sacrifice: all rows with missing data, only some of them, or certain columns. If the rows do correlate, we have every right to split the dataset into two. Otherwise, you first need to deal with the rows whose missing data does not coincide across the qualitative and quantitative parts, and only then split the dataset into two.
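One way to check how well the missing rows coincide is to count rows that are incomplete in only one of the two parts. This is a minimal sketch on made-up data; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two numeric columns and one string column.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, np.nan],
    "reviews": [5.0, np.nan, 2.0, 7.0],
    "room_type": ["Private", None, "Shared", "Private"],
})

num_cols = toy.select_dtypes(include=[np.number]).columns
cat_cols = toy.select_dtypes(exclude=[np.number]).columns

# Boolean masks: does a row have at least one gap in the given part?
rows_missing_num = toy[num_cols].isna().any(axis=1)
rows_missing_cat = toy[cat_cols].isna().any(axis=1)

only_num = int((rows_missing_num & ~rows_missing_cat).sum())
only_cat = int((rows_missing_cat & ~rows_missing_num).sum())
both = int((rows_missing_num & rows_missing_cat).sum())
print(only_num, only_cat, both)
```

If `only_num` and `only_cat` are small compared with `both`, the gaps largely coincide and splitting the dataset is relatively safe.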
df_numerical = df.select_dtypes(include = [np.number])
df_categorical = df.select_dtypes(exclude = [np.number])
We do this to make it easier to process these two different types of data; later we will see how much this simplifies our life.
Working with quantitative data
The first thing to do is determine whether there are "spy columns" in the quantitative data. We call them that because they pass themselves off as quantitative data while actually behaving as qualitative data.
How do we identify them? Of course, it all depends on the nature of the data you are analyzing, but in general such columns tend to have few unique values (in the region of 3-10 unique values).
print(df_numerical.nunique())
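Instead of reading the printed counts by eye, the candidates can be listed automatically. This is only a sketch: the helper `spy_candidates` and the threshold of 10 are assumptions based on the rule of thumb above, and the result still needs a manual check.

```python
import pandas as pd

def spy_candidates(numeric_frame: pd.DataFrame, max_unique: int = 10) -> list:
    """Hypothetical helper: names of numeric columns with at most max_unique distinct values."""
    counts = numeric_frame.nunique()
    return counts[counts <= max_unique].index.tolist()

# Toy data (made-up): 'price' varies widely, 'rating_class' has only 4 levels.
toy = pd.DataFrame({
    "price": range(100, 120),
    "rating_class": [1, 2, 3, 4] * 5,
})
print(spy_candidates(toy))
```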
Once we have identified the spy columns, we move them from the quantitative data to the qualitative data:
# 'column1', 'column2', 'column3' are placeholders: replace them with your own spy columns
spy_columns = df_numerical[['column1', 'column2', 'column3']]  # copy the spy columns into a separate dataframe
df_numerical.drop(labels=['column1', 'column2', 'column3'], axis=1, inplace=True)  # cut these columns out of the quantitative data
df_categorical.insert(1, 'column1', spy_columns['column1'])  # add the first spy column to the qualitative data
df_categorical.insert(1, 'column2', spy_columns['column2'])  # add the second spy column to the qualitative data
df_categorical.insert(1, 'column3', spy_columns['column3'])  # add the third spy column to the qualitative data
Finally, we have fully separated the quantitative data from the qualitative data and can now work with it properly. The first thing is to understand where we have empty values (NaN, and in some cases 0 is treated as an empty value).
for i in df_numerical.columns:
    print(i, (df_numerical[i] == 0).sum())  # number of zeros in each column
At this point it is important to understand in which columns zeros may indicate missing values: is it due to how the data was collected? Or could it be related to the data values themselves? These questions must be answered on a case-by-case basis.
So, if we still think that data may be missing where there are zeros, we should replace the zeros with NaN to make it easier to work with the missing data later:
df_numerical[["column1", "column2"]] = df_numerical[["column1", "column2"]].replace(0, np.nan)  # placeholder column names
Now let's see where we are missing data:
sns.heatmap(df_numerical.isnull(),yticklabels=False,cbar=False,cmap='viridis')  # you can also use df_numerical.info()
Here, the missing values inside the columns should be marked in yellow. And now the fun begins: how do we deal with these values? Should we delete the rows or columns that contain them? Or fill the empty values with something else?
Here is a rough diagram that can help you decide what can, in principle, be done with empty values:
0. Remove unnecessary columns
df_numerical.drop(labels=["column1","column2"], axis=1, inplace=True)
1. Is the share of empty values in this column greater than 50%?
print(df_numerical.isnull().sum() / df_numerical.shape[0] * 100)
# Drop a column if more than 50% of its values are empty
df_numerical.drop(labels=["column1","column2"], axis=1, inplace=True)
2. Remove rows with empty values
# Drop rows with empty values, provided enough data remains for training afterwards
df_numerical.dropna(inplace=True)
3.1. Insert random values
# Fill the empty cells with values drawn at random from the non-empty cells of the same column
missing = df_numerical["column"].isnull()
observed = df_numerical.loc[~missing, "column"].to_numpy()
df_numerical.loc[missing, "column"] = np.random.choice(observed, size=missing.sum())
3.2. Insert a constant value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps insert the values
imputer = SimpleImputer(strategy='constant', fill_value=0)  # insert a fixed value; replace 0 with your own value
df_numerical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # apply it to our table
df_numerical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)  # remove the columns with the old values
3.3. Insert the mean or the most frequent value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps insert the values
imputer = SimpleImputer(strategy='mean', missing_values = np.nan)  # instead of mean you can also use most_frequent
df_numerical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # apply it to our table
df_numerical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)  # remove the columns with the old values
3.4. Insert a value computed by another model
Sometimes values can be computed with regression models, using models from the sklearn library or other similar libraries. Our team will devote a separate article to how this can be done in the near future.
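Until then, here is a minimal sketch of the idea, not the team's own method: a linear regression is fitted on the rows where the value is known and then predicts the gaps. The columns `area` and `price` and their values are made up for illustration, and `LinearRegression` is just one possible choice of model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): 'price' has gaps, 'area' is fully known.
toy = pd.DataFrame({
    "area": [20.0, 30.0, 40.0, 50.0, 60.0],
    "price": [200.0, 300.0, np.nan, 500.0, np.nan],
})

known = toy["price"].notna()

# Fit on the rows where the target is present...
model = LinearRegression().fit(toy.loc[known, ["area"]], toy.loc[known, "price"])

# ...and fill the gaps with the model's predictions.
toy.loc[~known, "price"] = model.predict(toy.loc[~known, ["area"]])
print(toy)
```

On real data the predictor columns must themselves be free of gaps, and the quality of the imputed values depends on how well the model fits.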
So, for now, the discussion of quantitative data will be interrupted, because there are many other nuances in how to better prepare and preprocess data for different tasks, and the basics for quantitative data have been covered in this article. Now it is time to return to the qualitative data, which we separated from the quantitative data several steps ago. You can modify this notebook however you like, adapting it to different tasks, so that data preprocessing goes very quickly!
Qualitative data
Basically, for qualitative data the one-hot encoding method is used to convert it from a string (or object) into a number. Before moving on to that step, let's use the diagram and code above to deal with empty values.
df_categorical.nunique()
sns.heatmap(df_categorical.isnull(),yticklabels=False,cbar=False,cmap='viridis')
0. Remove unnecessary columns
df_categorical.drop(labels=["column1","column2"], axis=1, inplace=True)
1. Is the share of empty values in this column greater than 50%?
print(df_categorical.isnull().sum() / df_categorical.shape[0] * 100)
# Drop a column if more than 50% of its values are empty
df_categorical.drop(labels=["column1","column2"], axis=1, inplace=True)
2. Remove rows with empty values
# Drop rows with empty values, provided enough data remains for training afterwards
df_categorical.dropna(inplace=True)
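3.1. Insert random values
# Fill the empty cells with values drawn at random from the non-empty cells of the same column
missing = df_categorical["column"].isnull()
observed = df_categorical.loc[~missing, "column"].to_numpy()
df_categorical.loc[missing, "column"] = np.random.choice(observed, size=missing.sum())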
3.2. Insert a constant value
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant', fill_value="<your value here>")
df_categorical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_categorical[['column1', 'column2', 'column3']])
df_categorical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)
So, we have finally dealt with the nulls in the qualitative data. Now it is time to perform one-hot encoding on the values in your dataset. This method is very often used to make sure your algorithm can learn from qualitative data.
def encode_and_bind(original_dataframe, feature_to_encode):
    # Build the dummy (0/1) columns for one feature and attach them to the dataframe
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)  # remove the original string column
    return res

features_to_encode = ["column1","column2","column3"]  # placeholder column names
for feature in features_to_encode:
    df_categorical = encode_and_bind(df_categorical, feature)
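To see what one-hot encoding produces, here is a small sketch on made-up data. It uses the `columns` parameter of `pd.get_dummies`, which encodes the listed columns and drops the originals in one call; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical): one string column and one numeric column.
toy = pd.DataFrame({
    "room_type": ["Private", "Shared", "Private"],
    "price": [100, 50, 120],
})

# Each distinct value of 'room_type' becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["room_type"])
print(encoded.columns.tolist())
```

The string column is replaced by one indicator column per category, named with the original column name as a prefix.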
So, we have finally finished processing the qualitative and quantitative data separately; time to combine them back together:
new_df = pd.concat([df_numerical,df_categorical], axis=1)
After we have combined the datasets into one, we can finally apply a data transformation using MinMaxScaler from the sklearn library. This will bring our values into the range between 0 and 1, which will help when training a model later.
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
new_df = min_max_scaler.fit_transform(new_df)  # note: fit_transform returns a NumPy array, so column names are lost
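With its default settings, MinMaxScaler rescales each column as (x - min) / (max - min). The sketch below checks this on a made-up single-column array; the numbers are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of values (hypothetical).
values = np.array([[10.0], [20.0], [40.0]])

scaled = MinMaxScaler().fit_transform(values)

# The same transformation written out by hand, column by column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
```

The smallest value in each column maps to 0 and the largest to 1, so a single extreme outlier can compress all other values into a narrow range.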
This data is now ready for anything: neural networks, standard ML algorithms, and so on!
In this article we did not cover working with time series data, because for such data you need slightly different processing techniques, depending on your task. In the future, our team will devote a separate article to this topic, and we hope it will bring something interesting, new, and useful into your life, just like this one.
Source: www.habr.com