People entering the field of Data Science often have less than realistic expectations of what awaits them. Many think they will immediately be writing cool neural networks, building a voice assistant like the one in Iron Man, or beating everyone on the financial markets.
But a Data Scientist's work is driven by data, and one of the most important and time-consuming parts of it is processing the data before feeding it to a neural network or analyzing it in some particular way.
In this article, our team describes how to process data quickly and easily, with step-by-step instructions and code. We tried to keep the code flexible enough to be reused for different datasets.
Many professionals may not find anything extraordinary in this article, but beginners can learn something new, and anyone who has long wanted a separate notebook for fast, structured data processing can copy the code and adapt it to their needs.
We got the dataset. What do we do next?
So, the standard first step: we need to understand what we are dealing with and get the big picture. To do this, we use pandas to simply determine the different data types.
import pandas as pd  # import pandas
import numpy as np  # import numpy
df = pd.read_csv("AB_NYC_2019.csv")  # read the dataset and store it in the variable df
df.head(3)  # look at the first 3 rows to see what the values look like
df.info()  # show information about the columns
Let's look at the column values:
- Does the number of rows in each column match the total number of rows?
- What is the nature of the data in each column?
- Which column do we want to target in order to make predictions for it?
The answers to these questions will let you analyze the dataset and roughly sketch a plan for your next steps.
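The first question in the list can be answered programmatically rather than by reading the `df.info()` output. The following is a minimal sketch on a made-up toy frame; the column names and values are assumptions for illustration only, not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "room_type": ["Private", "Shared", None, "Private"],
    "host_id": [1, 2, 3, 4],
})

total_rows = len(toy)
non_null = toy.count()                        # non-null values per column
incomplete = non_null[non_null < total_rows]  # columns with at least one gap

print("total rows:", total_rows)
print(incomplete)
```

Any column listed in `incomplete` has fewer filled rows than the dataset has rows, so it will need one of the missing-value treatments discussed later.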
Also, for a deeper look at the values in each column, we can use the pandas describe() function. Its drawback is that, by default, it gives no information about columns with string values. We will deal with those later.
df.describe()
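If a summary of the string columns is wanted right away, `describe()` can be pointed at the non-numeric columns. This is a small sketch on assumed toy data; the `exclude=[np.number]` argument mirrors the `select_dtypes` call used later in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are placeholders.
toy = pd.DataFrame({
    "price": [100, 80, 120],
    "room_type": ["Private", "Shared", "Private"],
})

numeric_summary = toy.describe()                   # numeric columns only (default)
text_summary = toy.describe(exclude=[np.number])   # everything that is not numeric

print(text_summary)
```

For non-numeric columns the summary reports `count`, `unique`, `top` (the most frequent value), and `freq` instead of mean and quartiles.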
Magic visualization
Let's look at where values are missing altogether:
import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
That was a quick look from above; now let's move on to more interesting things.
Let's try to find and, if possible, remove columns that hold a single value in every row (they cannot affect the result in any way):
df = df[[c for c in list(df) if len(df[c].unique()) > 1]]  # overwrite the dataset, keeping only the columns that have more than one unique value
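To see what this filter does, here is the same expression applied to a tiny made-up frame (the column names are assumptions). Note that `unique()` counts NaN as a value, so a column that is constant except for gaps is kept.

```python
import pandas as pd

# Hypothetical toy data: "city" is constant, "price" varies.
toy = pd.DataFrame({
    "city": ["NYC", "NYC", "NYC"],
    "price": [100, 80, 120],
})

kept = toy[[c for c in list(toy) if len(toy[c].unique()) > 1]]
print(list(kept.columns))
```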
Now we protect ourselves and the success of our project from duplicate rows (rows that contain the same information, in the same order, as one of the existing rows):
df.drop_duplicates(inplace=True)  # do this only if we consider it necessary
# In some projects such data should not be removed at the very start.
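Before dropping anything, it can help to count how many rows would actually be removed. A minimal sketch on assumed toy data:

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row.
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

n_dupes = int(toy.duplicated().sum())               # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```

`duplicated()` marks every repeat of an earlier row, so its sum is exactly the number of rows `drop_duplicates()` removes.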
We split the dataset into two: one with qualitative values and one with quantitative values
A small clarification is needed here. If the rows with missing data in the qualitative and quantitative parts are not strongly correlated with each other, we have to decide what to sacrifice: all rows with missing data, only some of them, or certain columns. If the rows are correlated, we are fully entitled to split the dataset into two. Otherwise, you first need to deal with the rows whose missing data does not line up between the qualitative and quantitative parts, and only then split the dataset into two.
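One rough way to check whether missing rows line up between the two kinds of data is to compare row-level missing flags for the numeric and non-numeric columns. The sketch below uses assumed toy data and hypothetical variable names; it is one possible check, not the authors' method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are placeholders.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, np.nan],
    "reviews": [5.0, np.nan, 3.0, 1.0],
    "room_type": ["Private", None, "Shared", "Private"],
})

# True where a row has at least one gap in the numeric / non-numeric columns.
num_missing = toy.select_dtypes(include=[np.number]).isnull().any(axis=1)
cat_missing = toy.select_dtypes(exclude=[np.number]).isnull().any(axis=1)

both = int((num_missing & cat_missing).sum())       # gaps in both parts
only_num = int((num_missing & ~cat_missing).sum())  # gaps only in numeric part
only_cat = int((~num_missing & cat_missing).sum())  # gaps only in qualitative part

print(both, only_num, only_cat)
```

If `only_num` and `only_cat` are small relative to `both`, the gaps mostly coincide and dropping rows in one part will not desynchronize the two halves much.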
df_numerical = df.select_dtypes(include = [np.number])
df_categorical = df.select_dtypes(exclude = [np.number])
We do this to make it easier to process these two different types of data; later we will see how much this simplifies our life.
Working with quantitative data
The first thing to do is determine whether there are "spy columns" in the quantitative data. We call them that because they pose as quantitative data but behave like qualitative data.
How do we identify them? It all depends on the nature of the data you are analyzing, but in general such columns tend to have few unique values (roughly 3-10 unique values).
print(df_numerical.nunique())
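The unique-value counts can also be filtered directly to produce a list of candidate spy columns. This sketch assumes toy data and a hypothetical threshold `max_unique`, taken from the upper end of the 3-10 range mentioned above; the candidates still need a manual look before being moved.

```python
import pandas as pd

# Hypothetical toy data: "floor" takes only three distinct values.
toy_numerical = pd.DataFrame({
    "price": list(range(50, 170, 10)),   # 12 distinct values
    "floor": [1, 2, 3] * 4,              # 3 distinct values
})

max_unique = 10  # assumed cut-off, tune per dataset
counts = toy_numerical.nunique()
candidates = counts[counts <= max_unique].index.tolist()

print(candidates)
```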
Once we have identified the spy columns, we move them from the quantitative data to the qualitative data:
spy_columns = df_numerical[['column1', 'column2', 'column3']]  # select the spy columns (placeholder names) and store them in a separate dataframe
df_numerical.drop(labels=['column1', 'column2', 'column3'], axis=1, inplace=True)  # cut these columns out of the quantitative data
df_categorical.insert(1, 'column1', spy_columns['column1'])  # add the first spy column to the qualitative data
df_categorical.insert(1, 'column2', spy_columns['column2'])  # add the second spy column to the qualitative data
df_categorical.insert(1, 'column3', spy_columns['column3'])  # add the third spy column to the qualitative data
Finally, we have fully separated the quantitative data from the qualitative data and can now work with it properly. The first step is to understand where we have empty values (NaN, and in some cases 0 will also be treated as an empty value).
for i in df_numerical.columns:
    print(i, (df_numerical[i] == 0).sum())  # number of zeros in each quantitative column
At this stage it is important to understand in which columns zeros may indicate missing values: is it related to how the data was collected, or could it be related to the data values themselves? These questions have to be answered case by case.
So, if we do decide that data may be missing where there are zeros, we should replace the zeros with NaN to make these missing values easier to work with later:
df_numerical[["column 1", "column 2"]] = df_numerical[["column 1", "column 2"]].replace(0, np.nan)  # placeholder column names
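A small worked example may make the effect clearer. The data and the choice of which column treats zero as missing are assumptions: here a price of 0 is taken to mean "unknown", while 0 reviews is a legitimate value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
toy_numerical = pd.DataFrame({
    "price": [100.0, 0.0, 80.0],
    "reviews": [0.0, 4.0, 0.0],
})

zero_means_missing = ["price"]  # assumed, decided case by case
toy_numerical[zero_means_missing] = (
    toy_numerical[zero_means_missing].replace(0, np.nan)
)

print(toy_numerical.isnull().sum())
```

Only the listed columns are touched, so genuine zeros elsewhere are preserved.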
Now let's see where we are missing data:
sns.heatmap(df_numerical.isnull(), yticklabels=False, cbar=False, cmap='viridis')  # you can also use df_numerical.info()
Here the missing values inside the columns should show up in yellow. And now the fun begins: how do we deal with these values? Should we delete the rows that contain them, or the columns? Or fill the empty values with something else?
Here is a rough scheme that can help you decide what can, in principle, be done with empty values:
0. Remove unnecessary columns
df_numerical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # placeholder column names
1. Is the share of empty values in this column greater than 50%?
print(df_numerical.isnull().sum() / df_numerical.shape[0] * 100)  # percentage of empty values per column
df_numerical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # drop a column if more than 50% of its values are empty
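Instead of reading the percentages and typing column names by hand, the 50% rule can be applied directly. This is a sketch on assumed toy data; the threshold is the one named in the step above, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "sqft" is 75% empty, "price" is 25% empty.
toy_numerical = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "sqft": [np.nan, np.nan, np.nan, 55.0],
})

missing_pct = toy_numerical.isnull().mean() * 100       # percent empty per column
too_sparse = missing_pct[missing_pct > 50].index.tolist()
reduced = toy_numerical.drop(columns=too_sparse)

print(too_sparse, list(reduced.columns))
```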
2. Delete rows with empty values
df_numerical.dropna(inplace=True)  # drop rows with empty values, if enough data remains for training afterwards
3.1. Insert a random value
import random  # import random
observed = df_numerical["column"].dropna().tolist()  # values actually present in the (placeholder) column
df_numerical["column"] = df_numerical["column"].apply(lambda x: random.choice(observed) if pd.isna(x) else x)  # put a random observed value into each empty cell
3.2. Insert a constant value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps insert values
imputer = SimpleImputer(strategy='constant', fill_value="<Your value here>")  # insert a fixed value; for numeric columns the placeholder must be replaced with a number
df_numerical[["new_column1", "new_column2", "new_column3"]] = imputer.fit_transform(df_numerical[["column1", "column2", "column3"]])  # apply it to our table
df_numerical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # remove the columns with the old values
3.3. Insert the mean or the most frequent value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps insert values
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)  # instead of mean you can also use most_frequent
df_numerical[["new_column1", "new_column2", "new_column3"]] = imputer.fit_transform(df_numerical[["column1", "column2", "column3"]])  # apply it to our table
df_numerical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # remove the columns with the old values
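To illustrate what mean imputation produces, here is a self-contained sketch on assumed toy data. It wraps the result back into a DataFrame with the original column names, which is a design choice of this example rather than something the article prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0],
    "reviews": [1.0, 3.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(toy),   # returns a NumPy array
    columns=toy.columns,
    index=toy.index,
)

print(imputed)
```

Each gap is replaced by the mean of the observed values in the same column (90 for `price`, 2 for `reviews` in this toy frame).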
3.4. Insert a value computed by another model
Sometimes values can be computed with regression models, using models from the sklearn library or other similar libraries. Our team will devote a separate article to how this can be done in the near future.
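As a rough illustration of the idea only, the sketch below fits a `LinearRegression` on the rows where the target column is known and predicts it for the rows where it is missing. The toy data, the column names, and the choice of a single predictor are all assumptions; it is not the approach the authors promise to describe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where price is exactly 2 * sqft.
toy = pd.DataFrame({
    "sqft": [30.0, 40.0, 50.0, 60.0, 70.0],
    "price": [60.0, 80.0, np.nan, 120.0, np.nan],
})

known = toy[toy["price"].notnull()]
unknown = toy[toy["price"].isnull()]

# Fit on rows with a known price, predict the missing ones.
model = LinearRegression().fit(known[["sqft"]], known["price"])
toy.loc[unknown.index, "price"] = model.predict(unknown[["sqft"]])

print(toy)
```

The predictor columns must themselves be free of gaps for this to work, so in practice this step usually comes after the simpler treatments above.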
So, for now, the story about quantitative data pauses here. There are many more nuances to preparing and preprocessing data well for different tasks, but the basics for quantitative data have been covered in this article, and it is time to return to the qualitative data, which we left several steps behind the quantitative data. You can change this notebook however you like and adapt it to different tasks, so that data preprocessing goes very quickly!
Qualitative data
Essentially, for qualitative data the One-hot-encoding method is used to convert it from a string (or object) into a number. Before moving on to that, let's use the scheme and code above to deal with empty values.
df_categorical.nunique()
sns.heatmap(df_categorical.isnull(),yticklabels=False,cbar=False,cmap='viridis')
0. Remove unnecessary columns
df_categorical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # placeholder column names
1. Is the share of empty values in this column greater than 50%?
print(df_categorical.isnull().sum() / df_categorical.shape[0] * 100)  # percentage of empty values per column
df_categorical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # drop a column if more than 50% of its values are empty
2. Delete rows with empty values
df_categorical.dropna(inplace=True)  # drop rows with empty values, if enough data remains for training afterwards
3.1. Insert a random value
import random
observed = df_categorical["column"].dropna().tolist()  # values actually present in the (placeholder) column
df_categorical["column"] = df_categorical["column"].apply(lambda x: random.choice(observed) if pd.isna(x) else x)  # put a random observed value into each empty cell
3.2. Insert a constant value
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant', fill_value="<Your value here>")  # placeholder: the value to put into empty cells
df_categorical[["new_column1", "new_column2", "new_column3"]] = imputer.fit_transform(df_categorical[["column1", "column2", "column3"]])
df_categorical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # remove the columns with the old values
So, we finally have the nulls in the qualitative data under control. Now it is time to apply one-hot encoding to the values in your dataset. This method is used very often so that your algorithm can learn from qualitative data.
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])  # one indicator column per category
    res = pd.concat([original_dataframe, dummies], axis=1)  # attach the indicator columns
    res = res.drop([feature_to_encode], axis=1)  # remove the original string column
    return res
features_to_encode = ["column1", "column2", "column3"]  # placeholder column names
for feature in features_to_encode:
    df_categorical = encode_and_bind(df_categorical, feature)
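For comparison, `pd.get_dummies` can also encode several columns of a frame in one call through its `columns` argument, leaving the other columns untouched. The sketch below uses assumed toy data and is offered as an alternative, not as the article's method.

```python
import pandas as pd

# Hypothetical toy data; column names are placeholders.
toy = pd.DataFrame({
    "room_type": ["Private", "Shared", "Private"],
    "host_id": [1, 2, 3],
})

# Encode only the listed columns; the rest are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["room_type"])

print(list(encoded.columns))
```

Each category becomes its own indicator column named `<column>_<category>`, and the original string column is dropped.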
So, we have finally finished processing the qualitative and quantitative data separately; it is time to combine them again:
new_df = pd.concat([df_numerical,df_categorical], axis=1)
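One thing to watch when recombining: `pd.concat(..., axis=1)` aligns rows by index. If rows were dropped from only one of the two frames, the default outer join brings those rows back with NaN values. The sketch below shows this on assumed toy frames and uses `join="inner"` as one possible way to keep only rows present in both.

```python
import pandas as pd

# Hypothetical toy frames: row 1 was dropped from the numeric part only.
num = pd.DataFrame({"price": [100.0, 80.0]}, index=[0, 2])
cat = pd.DataFrame({"is_private": [1, 0, 1]}, index=[0, 1, 2])

outer = pd.concat([num, cat], axis=1)                 # default join="outer"
inner = pd.concat([num, cat], axis=1, join="inner")   # rows present in both

print(outer)
print(inner)
```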
Now that we have combined the datasets into one, we can finally transform the data with MinMaxScaler from the sklearn library. This scales our values to the range from 0 to 1, which will help when training a model later.
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
new_df = min_max_scaler.fit_transform(new_df)
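Note that `fit_transform` returns a NumPy array, so the column names are lost. If they are needed afterwards, the result can be wrapped back into a DataFrame, as in this sketch on assumed toy data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data.
toy = pd.DataFrame({
    "price": [50.0, 100.0, 150.0],
    "reviews": [0.0, 5.0, 10.0],
})

scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy),   # NumPy array scaled column by column
    columns=toy.columns,
    index=toy.index,
)

print(scaled)
```

Each column is scaled independently, so its minimum maps to 0 and its maximum to 1.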
This data is now ready for anything: neural networks, standard ML algorithms, and so on!
In this article we did not cover working with time-series data, since such data calls for somewhat different processing techniques, depending on your task. In the future our team will devote a separate article to this topic, and we hope it will bring something interesting, new, and useful into your life, just like this one.
Source: www.habr.com