People entering Data Science often have less-than-realistic expectations of what awaits them. Many think they will now write cool neural networks, build a voice assistant like the one in Iron Man, or beat everyone on the financial markets.
But a Data Scientist's work is driven by data, and one of the most important and time-consuming parts of it is processing the data before it is fed to a neural network or analyzed in some way.
In this article, our team describes how to process data quickly and easily, with step-by-step instructions and code. We tried to make the code flexible enough to be reused for different datasets.
Many professionals may find nothing extraordinary in this article, but beginners can learn something new, and anyone who has long dreamed of having a separate notebook for fast, structured data processing can copy the code and adapt it to their own needs.
We got the dataset. What next?
So, the standard routine: we need to understand what we are dealing with and see the overall picture. To do this, we use pandas to simply determine the different data types.
import pandas as pd  # import pandas
import numpy as np  # import numpy
df = pd.read_csv("AB_NYC_2019.csv")  # read the dataset and store it in the variable df
df.head(3)  # look at the first 3 rows to see what the values look like
df.info()  # show information about the columns
Let's look at the column values:
- Does the number of entries in each column match the total number of rows?
- What is the meaning of the data in each column?
- Which column do we want to use as the target to make predictions for?
The answers to these questions let you analyze the dataset and roughly sketch a plan for your next steps.
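The first question in the list above can be checked directly. The following is a minimal sketch on a small made-up table (the `toy` frame and its column names are assumptions for illustration, not part of the original dataset); it compares the non-missing count of each column with the total row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "price": [100.0, 150.0, np.nan, 80.0],
    "room_type": ["private", None, "shared", "private"],
    "reviews": [3, 0, 12, 7],
})

total_rows = len(toy)
# DataFrame.count() returns the number of non-missing values per column.
non_null = toy.count()
# True where a column has a value in every row.
is_complete = non_null == total_rows

summary = pd.DataFrame({"non_null": non_null, "complete": is_complete})
print(summary)
```

Columns flagged as incomplete are the ones that will need one of the missing-value strategies discussed later.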
For a deeper look at the values in each column, we can use the pandas describe() function. Its drawback is that, by default, it gives no information about columns with string values. We will deal with them later.
df.describe()
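If a quick summary of the string columns is wanted at this stage, `describe()` accepts `include` and `exclude` arguments. The sketch below is one possible way to use them, shown on made-up data (the `toy` frame is an assumption, not the article's dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one string column.
toy = pd.DataFrame({
    "price": [100.0, 150.0, 80.0, 80.0],
    "room_type": ["private", "shared", "private", "private"],
})

# Excluding numeric dtypes leaves only the non-numeric columns;
# for them describe() reports count, unique, top and freq.
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```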
Magic visualization
Let's look at where values are missing altogether:
import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
That was a quick look from above; now we move on to more interesting things.
Let's try to find and, where possible, remove columns that hold a single value in all rows (they will not affect the result in any way):
# Overwrite the dataset, keeping only the columns
# that have more than one unique value
df = df[[c for c in list(df) if len(df[c].unique()) > 1]]
Now we protect ourselves and the success of our project from duplicate rows (rows that contain the same information, in the same order, as an existing row):
df.drop_duplicates(inplace=True)  # Do this if you consider it necessary.
# In some projects such data should not be removed right at the start.
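Since removing duplicates is not always desirable, it can help to count them first. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
toy = pd.DataFrame({
    "host": ["a", "b", "a", "c"],
    "price": [100, 80, 100, 60],
})

# duplicated() marks every later repeat of an earlier row as True.
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())
print("duplicate rows:", n_dups)

# Non-destructive variant: returns a new frame and leaves toy untouched.
deduplicated = toy.drop_duplicates()
```

Looking at `toy[dup_mask]` before dropping makes it easier to judge whether the repeats are errors or legitimate records.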
We split the dataset in two: one part with qualitative values, the other with quantitative ones
A small clarification is needed here: if the rows with missing data in the qualitative part and in the quantitative part do not correlate strongly with each other, we have to decide what to sacrifice: all rows with missing data, only some of them, or certain columns. If the rows do correlate, we have every right to split the dataset in two. Otherwise, first deal with the rows whose missing data does not line up between the qualitative and quantitative parts, and only then split the dataset in two.
df_numerical = df.select_dtypes(include = [np.number])
df_categorical = df.select_dtypes(exclude = [np.number])
We do this to make it easier to process these two different types of data; later we will see how much easier it makes our life.
Working with quantitative data
The first thing to do is determine whether there are "spy columns" in the quantitative data. We call them that because they pose as quantitative data but behave as qualitative data.
How do we identify them? Of course, it all depends on the nature of the data you are analyzing, but in general such columns tend to have few unique values (in the range of 3-10 unique values).
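Whether missing values in the two halves line up row by row, as the clarification about splitting requires, can be checked roughly as follows. This is a sketch on made-up data (the `toy` frame and its columns are assumptions), not the authors' procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both numeric and text columns.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, np.nan, 120.0],
    "reviews": [3.0, np.nan, 12.0, 7.0, 1.0],
    "room_type": ["private", None, "shared", "shared", "private"],
    "city": ["NYC", None, "NYC", "NYC", "NYC"],
})

num_part = toy.select_dtypes(include=[np.number])
cat_part = toy.select_dtypes(exclude=[np.number])

# One boolean per row: is anything missing in that half?
num_missing = num_part.isnull().any(axis=1)
cat_missing = cat_part.isnull().any(axis=1)

# Cross-tabulate the two flags; the off-diagonal cells count rows
# that are incomplete in only one of the halves.
overlap = pd.crosstab(num_missing, cat_missing,
                      rownames=["numeric missing"],
                      colnames=["categorical missing"])
print(overlap)

# Share of rows where both halves agree (both complete or both incomplete).
agreement = float((num_missing == cat_missing).mean())
print("agreement:", agreement)
```

A high agreement suggests the halves can be cleaned independently; a low one means dropping rows in one half will silently misalign it with the other.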
print(df_numerical.nunique())
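The unique-value counts can also be filtered automatically to get a list of candidates. In this sketch the cut-off `MAX_UNIQUE` and the toy columns are assumptions chosen for illustration; the right threshold depends on the data.

```python
import pandas as pd

# Hypothetical numeric data: "floor" and "is_superhost" are really categories.
toy_numerical = pd.DataFrame({
    "price": [100, 150, 80, 95, 210, 60, 75, 130],
    "floor": [1, 2, 1, 3, 2, 1, 3, 2],
    "is_superhost": [0, 1, 0, 0, 1, 1, 0, 1],
})

MAX_UNIQUE = 3  # hypothetical cut-off; the text suggests roughly 3-10

unique_counts = toy_numerical.nunique()
# Keep the names of columns whose number of distinct values is small.
spy_candidates = unique_counts[unique_counts <= MAX_UNIQUE].index.tolist()
print(spy_candidates)
```

The result is only a list of candidates: a column with few distinct values may still be genuinely numeric, so each one deserves a manual look.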
Once we have identified the spy columns, we move them from the quantitative data to the qualitative data:
spy_columns = df_numerical[['column1', 'column2', 'column3']]  # select the spy columns and store them in a separate dataframe
df_numerical.drop(labels=['column1', 'column2', 'column3'], axis=1, inplace=True)  # cut these columns out of the quantitative data
df_categorical.insert(1, 'column1', spy_columns['column1'])  # add the first spy column to the qualitative data
df_categorical.insert(1, 'column2', spy_columns['column2'])  # add the second spy column to the qualitative data
df_categorical.insert(1, 'column3', spy_columns['column3'])  # add the third spy column to the qualitative data
׊×× ×¡××£, ××ך ××Öž×× ××֞ך ××€×עש×××× ×§×××Ö·× ××××Ö·×××××¢ ××Ö·×× ×€×× ×§×××Ö·××××Ö·×××××¢ ××Ö·×× ××× ×××Š× ××ך ×§×¢× ×¢× ×ַך××¢×× ××× ×¢×¡ ךע××. ×עך עךש×עך ××Ö·× ××× ×Š× ×€Ö¿×ַךש×××× ××× ××ך ××Öž×× ×××××ק ×××Ö·××עס (× ×Ö·×, ××× ××× ×¢×××¢××¢ ק×ַסעס 0 ×××¢× ×××× ×× ××¢× ×××¢× ××× ×××××ק ×××Ö·××עס).
for i in df_numerical.columns:
    print(i, df[i][df[i]==0].count())
At this stage it is important to understand in which columns zeros may stand for missing values: is it a consequence of how the data was collected? Or does it have to do with the data values themselves? These questions have to be answered case by case.
So, if we decide that data may indeed be missing where there are zeros, we should replace the zeros with NaN to make this lost data easier to work with later:
df_numerical[["column1", "column2"]] = df_numerical[["column1", "column2"]].replace(0, np.nan)
Now let's see where data is missing:
sns.heatmap(df_numerical.isnull(),yticklabels=False,cbar=False,cmap='viridis')  # you can also use df_numerical.info()
Here the missing values inside the columns should be marked in yellow. And now the fun begins: how do we deal with these values? Delete the rows that contain them, or the columns? Or fill the empty values with something else?
Here is an approximate scheme that can help you decide what can, in principle, be done with empty values:
0. Remove unnecessary columns
df_numerical.drop(labels=["column1","column2"], axis=1, inplace=True)
1. Is the share of empty values in the column greater than 50%?
print(df_numerical.isnull().sum() / df_numerical.shape[0] * 100)
df_numerical.drop(labels=["column1","column2"], axis=1, inplace=True)  # drop a column if more than 50% of its values are empty
2. Delete rows with empty values
df_numerical.dropna(inplace=True)  # drop rows with empty values, if enough data remains for training afterwards
3.1. Insert a random value
missing = df_numerical["column"].isnull()  # rows where the value is empty
df_numerical.loc[missing, "column"] = np.random.choice(df_numerical["column"].dropna().to_numpy(), size=missing.sum())  # fill the empty cells with random values drawn from the existing ones
3.2. Insert a constant value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps to insert values
imputer = SimpleImputer(strategy='constant', fill_value=0)  # put your own value here (it must be numeric for numeric columns)
df_numerical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # apply it to our table
df_numerical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)  # remove the columns with the old values
3.3. Insert the mean or the most frequent value
from sklearn.impute import SimpleImputer  # import SimpleImputer, which helps to insert values
imputer = SimpleImputer(strategy='mean', missing_values = np.nan)  # most_frequent can be used instead of mean
df_numerical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_numerical[['column1', 'column2', 'column3']])  # apply it to our table
df_numerical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)  # remove the columns with the old values
3.4. Insert a value computed by another model
Sometimes values can be computed with regression models, using models from the sklearn library or other similar libraries. Our team will devote a separate article to how this can be done in the near future.
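Until that article appears, option 3.4 can be illustrated with a very small sketch. It assumes a made-up table in which `price` depends on `area`, and it uses scikit-learn's `LinearRegression`; the column names and the choice of model are assumptions for illustration, not the authors' method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: price is missing in two rows.
toy = pd.DataFrame({
    "area": [30.0, 45.0, 60.0, 75.0, 90.0, 50.0],
    "price": [60.0, 90.0, 120.0, np.nan, 180.0, np.nan],
})

known = toy[toy["price"].notnull()]    # rows usable for fitting
unknown = toy[toy["price"].isnull()]   # rows whose price must be filled

# Fit price as a linear function of area on the complete rows only.
model = LinearRegression()
model.fit(known[["area"]], known["price"])

# Write the predictions into the empty cells.
toy.loc[unknown.index, "price"] = model.predict(unknown[["area"]])
print(toy)
```

In real data the predictor columns must themselves be free of missing values, and a model-based fill makes the filled column look more strongly related to its predictors than it really is, which is worth keeping in mind during later analysis.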
For now, the story about quantitative data stops here, because there are many other nuances in how to best prepare and preprocess data for different tasks, and the basics for quantitative data have been covered in this article. It is time to return to the qualitative data that we separated from the quantitative data a few steps back. You can change this notebook as you like, adapting it to different tasks, so that data preprocessing goes very quickly!
Qualitative data
For qualitative data, the one-hot encoding method is generally used to convert it from a string (or object) into a number. Before moving on to that, let's use the scheme and code above to deal with empty values.
df_categorical.nunique()
sns.heatmap(df_categorical.isnull(),yticklabels=False,cbar=False,cmap='viridis')
0. Remove unnecessary columns
df_categorical.drop(labels=["column1","column2"], axis=1, inplace=True)
1. Is the share of empty values in the column greater than 50%?
print(df_categorical.isnull().sum() / df_categorical.shape[0] * 100)
df_categorical.drop(labels=["column1","column2"], axis=1, inplace=True)  # drop a column if more than 50% of its values are empty
2. Delete rows with empty values
df_categorical.dropna(inplace=True)  # drop rows with empty values, if enough data remains for training afterwards
3.1. Insert a random value
missing = df_categorical["column"].isnull()  # rows where the value is empty
df_categorical.loc[missing, "column"] = np.random.choice(df_categorical["column"].dropna().to_numpy(), size=missing.sum())  # fill the empty cells with random values drawn from the existing ones
3.2. Insert a constant value
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant', fill_value="<your value here>")
df_categorical[["new_column1",'new_column2','new_column3']] = imputer.fit_transform(df_categorical[['column1', 'column2', 'column3']])
df_categorical.drop(labels = ["column1","column2","column3"], axis = 1, inplace = True)
So, we are finally down to zero nulls in the qualitative data. Now it is time to apply one-hot encoding to the values in your dataset. This method is very often used so that your algorithm can learn from qualitative data.
def encode_and_bind(original_dataframe, feature_to_encode):
    # build the indicator columns for one feature
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    # attach them to the dataframe and remove the original column
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res

features_to_encode = ["column1","column2","column3"]
for feature in features_to_encode:
    df_categorical = encode_and_bind(df_categorical, feature)
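For comparison, `pd.get_dummies` can also encode several columns in one call through its `columns` argument. A minimal sketch on made-up data (the `toy_categorical` frame and its column names are assumptions):

```python
import pandas as pd

# Hypothetical qualitative data with two string columns.
toy_categorical = pd.DataFrame({
    "room_type": ["private", "shared", "private"],
    "borough": ["Brooklyn", "Queens", "Brooklyn"],
})

# Each listed column is replaced by one indicator column per category,
# named "<column>_<category>".
encoded = pd.get_dummies(toy_categorical, columns=["room_type", "borough"])
print(encoded.columns.tolist())
```

Listing the columns explicitly also forces encoding of columns that hold numbers, which is relevant for "spy columns" moved over from the quantitative data.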
So, we have finally finished processing the qualitative and quantitative data separately. Time to combine them back
new_df = pd.concat([df_numerical,df_categorical], axis=1)
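One thing to watch when combining: `pd.concat(..., axis=1)` aligns rows by index, so if rows were dropped from only one half, the result contains new NaN values. The sketch below shows this on made-up frames (`num` and `cat` are assumptions for illustration) together with the `join="inner"` option.

```python
import pandas as pd

# Hypothetical halves: row 1 was dropped from the numeric part only.
num = pd.DataFrame({"price": [100.0, 80.0]}, index=[0, 2])
cat = pd.DataFrame({"room_private": [1, 0, 1]}, index=[0, 1, 2])

# Default join="outer": every index label is kept, gaps become NaN.
outer = pd.concat([num, cat], axis=1)

# join="inner": only rows present in both halves are kept.
inner = pd.concat([num, cat], axis=1, join="inner")

print(outer)
print(inner)
```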
After combining the datasets into one, we can finally apply a data transformation using MinMaxScaler from the sklearn library. This will put our values between 0 and 1, which will help when training a model later.
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
new_df = min_max_scaler.fit_transform(new_df)
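Note that `fit_transform` returns a NumPy array, so the column names are lost. If they are needed afterwards, the result can be wrapped back into a DataFrame; a minimal sketch on made-up data (the `toy` frame is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical combined data.
toy = pd.DataFrame({
    "price": [50.0, 100.0, 150.0],
    "reviews": [0.0, 5.0, 20.0],
})

scaler = MinMaxScaler()
# Rebuild a DataFrame around the scaled array to keep labels and index.
scaled = pd.DataFrame(scaler.fit_transform(toy),
                      columns=toy.columns, index=toy.index)
print(scaled)
```

Each column is rescaled independently, so its minimum maps to 0 and its maximum to 1.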
The data is now ready for anything: neural networks, standard ML algorithms, and so on!
In this article we did not cover working with time-series data, since such data calls for somewhat different processing techniques, depending on your task. In the future, our team will devote a separate article to this topic, and we hope it will bring something interesting, new, and useful into your life, just like this one.
Source: www.habr.com