A cheat-sheet notebook for fast data preprocessing

People entering the field of Data Science often have less than realistic expectations of what awaits them. Many think they will now write cool neural networks, build a voice assistant like the one in Iron Man, or beat everyone in the financial markets.
But a Data Scientist's work is driven by data, and one of the most important and time-consuming parts of it is processing the data before feeding it into a neural network or analyzing it in some other way.

In this article our team describes how you can process data quickly and easily, with step-by-step instructions and code. We tried to keep the code flexible enough to be reused for different datasets.

Many professionals may not find anything extraordinary in this article, but beginners can learn something new, and anyone who has long dreamed of having a separate notebook for fast, structured data processing can copy the code and format it to their liking, or download the finished notebook from GitHub.

We have received the dataset. What do we do next?

So, the standard first step: we need to understand what we are dealing with and get the overall picture. To do this, we use pandas to simply determine the different data types.

import pandas as pd  # import pandas
import numpy as np   # import numpy
df = pd.read_csv("AB_NYC_2019.csv")  # read the dataset and store it in the variable df

df.head(3)  # look at the first 3 rows to see what the values look like

df.info()  # show information about the columns

Let's look at the column values:

  1. Does the number of non-empty entries in each column match the total number of rows?
  2. What kind of data does each column hold?
  3. Which column do we want to target, that is, make predictions for?

The answers to these questions will let you analyze the dataset and roughly sketch a plan for your next steps.

For a deeper look at the values in each column, we can also use the pandas describe() function. Its drawback is that, by default, it gives no information about columns with string values. We will deal with those later.

df.describe()
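The limitation applies to the default call only. As a small sketch on a made-up three-row frame (the column names `city` and `price` are invented for illustration and are not taken from the dataset above), passing `exclude=[np.number]` asks `describe()` to summarize the non-numeric columns instead:

```python
import numpy as np
import pandas as pd

# Toy data invented for this illustration.
toy = pd.DataFrame({
    "city": ["Brooklyn", "Manhattan", "Brooklyn"],
    "price": [149, 225, 89],
})

# Summarize only the non-numeric columns: count, unique, top, freq.
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```

For string columns the summary reports the number of values, the number of distinct values, the most frequent value and how often it occurs.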

The magic of visualization

Let's see where we have no values at all:

import seaborn as sns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

That was a quick look from above; now let's move on to more interesting things.

Let's try to find and, where possible, remove the columns that hold a single value in all rows (they cannot affect the result in any way):

# Overwrite the dataset, keeping only the columns that have more than one unique value
df = df[[c for c
        in list(df)
        if len(df[c].unique()) > 1]]
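As a quick check of what this filter does, here is a minimal sketch on an invented frame. It uses `DataFrame.nunique(dropna=False)` as an equivalent, vectorized way to count distinct values per column; the column names are made up for the example.

```python
import pandas as pd

# Toy data invented for this illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "country": ["US", "US", "US"],  # the same value in every row
    "price": [10.0, 12.5, 10.0],
})

# nunique(dropna=False) counts NaN as a value, just like len(Series.unique()).
varying = toy.nunique(dropna=False) > 1
toy_reduced = toy.loc[:, varying]
print(list(toy_reduced.columns))
```

Only the constant `country` column is dropped; `price` stays because it has two distinct values.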

Now let's protect ourselves and the success of our project from duplicate rows (rows that contain the same information, in the same order, as an already existing row):

df.drop_duplicates(inplace=True)  # Do this only if we consider it necessary.
                                  # In some projects such rows should not be removed right at the start.
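Before dropping anything it can help to see how many rows would actually go. A minimal sketch on invented data, using `DataFrame.duplicated()`:

```python
import pandas as pd

# Toy data invented for this illustration: rows 0 and 2 are identical.
toy = pd.DataFrame({"host": ["a", "b", "a"], "price": [10, 20, 10]})

# duplicated() marks every repeat of an earlier row (the first copy is not marked).
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

If the count is unexpectedly large, that is usually a reason to look at how the data was collected before deleting rows.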

Splitting the dataset in two: one part with qualitative values, the other with quantitative values

A small clarification is needed here. If the rows with missing values in the qualitative data and in the quantitative data largely coincide, we have every right to split the dataset in two. If they do not, we first have to decide what we are willing to sacrifice (all rows with missing values, only some of them, or certain columns), deal with the rows whose gaps do not coincide, and only then split the dataset in two.
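One possible way to check how far the gaps coincide is to flag, row by row, whether anything is missing on the numeric side and on the non-numeric side, and then compare the two flags. This is only a sketch of one interpretation of the check; the toy frame and the variable names are invented.

```python
import numpy as np
import pandas as pd

# Toy data invented for this illustration.
toy = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, np.nan],
    "name": ["a", None, "c", "d"],
})

# Row-level flags: is anything missing among numeric / non-numeric columns?
num_missing = toy.select_dtypes(include=[np.number]).isnull().any(axis=1)
cat_missing = toy.select_dtypes(exclude=[np.number]).isnull().any(axis=1)

# 2x2 table showing how the two flags co-occur.
overlap = pd.crosstab(num_missing, cat_missing,
                      rownames=["numeric missing"],
                      colnames=["categorical missing"])
print(overlap)

both = int((num_missing & cat_missing).sum())      # gaps on both sides
only_one = int((num_missing ^ cat_missing).sum())  # gaps on one side only
```

A large `only_one` count relative to `both` suggests the gaps do not coincide and need to be handled before the split.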

df_numerical = df.select_dtypes(include = [np.number])
df_categorical = df.select_dtypes(exclude = [np.number])

We do this to make it easier to process these two different types of data; later we will see how much it simplifies our life.

Working with quantitative data

The first thing to do is to find out whether there are "spy columns" in the quantitative data. We call them that because they present themselves as quantitative data but behave like qualitative data.

How can we spot them? It all depends on the nature of the data you are analyzing, of course, but in general such columns tend to have few unique values (somewhere around 3 to 10).

print(df_numerical.nunique())
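The same check can be turned into a small helper that lists the candidates. This is a sketch under the rule of thumb above: the function name `find_spy_candidates` and the default threshold of 10 are our own illustrative choices, and the result still needs a human look, since a column with few distinct values can be genuinely quantitative.

```python
import numpy as np
import pandas as pd

def find_spy_candidates(numeric_df, max_unique=10):
    """Return names of numeric columns with at most `max_unique` distinct values."""
    counts = numeric_df.nunique()
    return counts[counts <= max_unique].index.tolist()

# Toy data invented for this illustration.
toy = pd.DataFrame({
    "price": np.arange(20, dtype=float),  # 20 distinct values
    "room_code": [0, 1, 2, 1] * 5,        # 3 distinct values
})
candidates = find_spy_candidates(toy)
print(candidates)
```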

Once we have identified the spy columns, we move them from the quantitative data to the qualitative data:

spy_names = ['column1', 'column2', 'column3']  # placeholder names: put your spy columns here
spy_columns = df_numerical[spy_names]  # select the spy columns and store them in a separate dataframe
df_numerical.drop(labels=spy_names, axis=1, inplace=True)  # cut these columns out of the quantitative data
df_categorical.insert(1, 'column1', spy_columns['column1'])  # add the first spy column to the qualitative data
df_categorical.insert(1, 'column2', spy_columns['column2'])  # add the second spy column to the qualitative data
df_categorical.insert(1, 'column3', spy_columns['column3'])  # add the third spy column to the qualitative data

At last we have fully separated the quantitative data from the qualitative data and can work with each properly. The first thing is to understand where we have empty values (NaN, and in some cases 0 also has to be treated as an empty value).

for i in df_numerical.columns:
    print(i, (df_numerical[i] == 0).sum())  # number of zeros in each quantitative column

At this stage it is important to understand in which columns a zero may indicate a missing value: is it a consequence of how the data was collected, or can zero be a legitimate value for that field? These questions have to be answered case by case.

If we do decide that the zeros may stand for missing data, we should replace them with NaN, which makes this lost data easier to work with later:

df_numerical[["column1", "column2"]] = df_numerical[["column1", "column2"]].replace(0, np.nan)  # placeholder column names

Now let's see where we are missing data:

sns.heatmap(df_numerical.isnull(), yticklabels=False, cbar=False, cmap='viridis')  # df_numerical.info() can be used as well

In this plot the missing values inside the columns are marked in yellow. And now the fun begins: how do we deal with them? Should we delete the rows or the columns that contain them? Or fill the empty values with something else?

Here is a rough scheme that can help you decide what can, in principle, be done with empty values:

0. Remove unnecessary columns

df_numerical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # placeholder column names

1. Is the share of empty values in this column greater than 50%?

print(df_numerical.isnull().sum() / df_numerical.shape[0] * 100)

df_numerical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # drop a column if more than 50% of its values are empty
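Instead of reading the percentages and typing the column names by hand, the selection can be derived from the same calculation. A minimal sketch on invented data; the 0.5 cutoff mirrors the 50% rule above:

```python
import numpy as np
import pandas as pd

# Toy data invented for this illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],        # 25% missing
})

missing_share = toy.isnull().mean()  # fraction of NaN per column
too_sparse = missing_share[missing_share > 0.5].index.tolist()
toy_kept = toy.drop(columns=too_sparse)
print(too_sparse, list(toy_kept.columns))
```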

2. Delete rows with empty values

df_numerical.dropna(inplace=True)  # drop rows with empty values, provided enough data is left for training afterwards
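Whether "enough data is left" can be estimated before committing to the deletion. A small sketch on invented data that compares row counts without modifying the frame:

```python
import numpy as np
import pandas as pd

# Toy data invented for this illustration.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
})

rows_before = len(toy)
rows_after = len(toy.dropna())  # dropna() returns a new frame; toy is unchanged
share_kept = rows_after / rows_before
print(rows_before, rows_after, share_kept)
```

How large `share_kept` has to be is a judgment call that depends on the task and the size of the dataset.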

3.1. Insert a random value

col = "column"  # placeholder column name
missing = df_numerical[col].isna()  # mask of the empty cells
# fill the empty cells with random draws from the values observed in the same column
df_numerical.loc[missing, col] = np.random.choice(df_numerical[col].dropna().to_numpy(), size=missing.sum())
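Note that `fillna()` does not accept a function, and that comparing with `!= np.nan` is always true, so the empty cells have to be selected with `isna()`. A reproducible variant of random filling could look like the sketch below; the helper name `fill_with_random_observed` and the `seed` parameter are our own illustrative choices.

```python
import numpy as np
import pandas as pd

def fill_with_random_observed(series, seed=0):
    """Fill NaN entries with random draws from the series' observed values."""
    rng = np.random.default_rng(seed)
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

# Toy data invented for this illustration.
toy = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled = fill_with_random_observed(toy)
print(filled.tolist())
```

Fixing the seed makes the filled values the same on every run, which helps when comparing models trained on the processed data.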

3.2. Insert a constant value

from sklearn.impute import SimpleImputer  # import SimpleImputer, which will help us insert the values
imputer = SimpleImputer(strategy='constant', fill_value=0)  # 0 is a placeholder: put your own numeric value here
df_numerical[["new_column1", "new_column2", "new_column3"]] = imputer.fit_transform(df_numerical[["column1", "column2", "column3"]])  # apply it to our table
df_numerical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # remove the columns with the old values

3.3. Insert the mean or the most frequent value

from sklearn.impute import SimpleImputer  # import SimpleImputer, which will help us insert the values
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)  # most_frequent can be used instead of mean
df_numerical[["new_column1", "new_column2", "new_column3"]] = imputer.fit_transform(df_numerical[["column1", "column2", "column3"]])  # apply it to our table
df_numerical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)  # remove the columns with the old values
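For comparison, the same two strategies can be written with pandas alone. This is not the SimpleImputer call from the article but a plain-pandas stand-in, shown on invented data:

```python
import numpy as np
import pandas as pd

# Toy data invented for this illustration.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [2.0, 2.0, np.nan]})

# Mean strategy: each NaN is replaced by the mean of its own column.
mean_filled = toy.fillna(toy.mean())

# Most-frequent strategy: mode() may return several rows on ties,
# so take the first row (the smallest of the tied values).
mode_filled = toy.fillna(toy.mode().iloc[0])

print(mean_filled)
print(mode_filled)
```

In column `a` the values 1.0 and 3.0 each occur once, so the most-frequent strategy falls back to the smaller one.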

3.4. Insert a value computed by another model

Sometimes the missing values can be computed with regression models from the sklearn library or from other similar libraries. Our team will devote a separate article to how this can be done in the near future.
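Until then, the idea can be illustrated with a deliberately tiny sketch. It uses a straight-line fit from `np.polyfit` as a stand-in for a regression model (it is not an sklearn model), on invented data where the relationship is exactly linear; real data would need a proper model and validation.

```python
import numpy as np
import pandas as pd

# Toy data invented for this illustration: y = 2x + 1 on the known rows.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [3.0, 5.0, 7.0, np.nan],
})

known = toy["y"].notna()

# Fit y = slope * x + intercept on the rows where y is known.
slope, intercept = np.polyfit(
    toy.loc[known, "x"].to_numpy(),
    toy.loc[known, "y"].to_numpy(),
    deg=1,
)

# Predict y only for the rows where it is missing.
toy.loc[~known, "y"] = slope * toy.loc[~known, "x"] + intercept
print(toy)
```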

For now, this is where we pause the story about quantitative data. There are many more nuances in how to prepare and preprocess data for different tasks, but the basics for quantitative data have been covered, and it is time to return to the qualitative data that we separated from the quantitative data a few steps back. You can change this notebook as you like and adapt it to different tasks, so that data preprocessing goes really fast!

Qualitative data

For qualitative data, the usual approach is one-hot encoding, which converts the values from strings (or objects) to numbers. Before we get to that, let's use the scheme and the code above to deal with empty values.

df_categorical.nunique()

sns.heatmap(df_categorical.isnull(),yticklabels=False,cbar=False,cmap='viridis')

0. Remove unnecessary columns

df_categorical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # placeholder column names

1. Is the share of empty values in this column greater than 50%?

print(df_categorical.isnull().sum() / df_categorical.shape[0] * 100)

df_categorical.drop(labels=["column1", "column2"], axis=1, inplace=True)  # drop a column if more than
                                                                          # 50% of its values are empty

2. Delete rows with empty values

df_categorical.dropna(inplace=True)  # drop rows with empty values,
                                     # provided enough data is left for training afterwards

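3.1. Insert a random value

col = "column"  # placeholder column name
missing = df_categorical[col].isna()  # mask of the empty cells
# fill the empty cells with random draws from the values observed in the same column
df_categorical.loc[missing, col] = np.random.choice(df_categorical[col].dropna().to_numpy(), size=missing.sum())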

3.2. Insert a constant value

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant', fill_value="<your value here>")
df_categorical[["new_column1", "new_column2", "new_column3"]] = imputer.fit_transform(df_categorical[["column1", "column2", "column3"]])
df_categorical.drop(labels=["column1", "column2", "column3"], axis=1, inplace=True)

So, we have finally dealt with the empty values in the qualitative data. Now it is time to apply one-hot encoding to the values in your dataset. This method is used very often so that your algorithm can learn from qualitative data.

def encode_and_bind(original_dataframe, feature_to_encode):
    # One-hot encode a single column and replace it with the resulting dummy columns.
    # Passing the column as a Series encodes it whatever its dtype (numeric spy columns too).
    dummies = pd.get_dummies(original_dataframe[feature_to_encode], prefix=feature_to_encode)
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res

features_to_encode = ["column1", "column2", "column3"]  # placeholder column names
for feature in features_to_encode:
    df_categorical = encode_and_bind(df_categorical, feature)
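To see what the encoding produces, here is a minimal sketch on invented data. It uses the `columns=` argument of `pd.get_dummies`, which encodes several columns in one call and drops the originals; the column names are made up for the example.

```python
import pandas as pd

# Toy data invented for this illustration.
toy = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "S"]})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["color", "size"])
print(sorted(encoded.columns))
```

Each new column is named after the original column and the value it indicates, for example `color_red`.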

So, we have finally finished processing the qualitative and the quantitative data separately; it is time to combine them back together

new_df = pd.concat([df_numerical,df_categorical], axis=1)
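One thing to watch for at this step: `pd.concat(..., axis=1)` aligns rows by index and performs an outer join by default. If rows were dropped from only one of the two parts, the combined frame gets NaN in the rows the other part no longer has. A small sketch on invented data, with `join="inner"` as one possible way to keep only the rows present in both parts:

```python
import numpy as np
import pandas as pd

# Toy data invented for this illustration.
num = pd.DataFrame({"price": [10.0, np.nan, 30.0]}).dropna()  # rows 0 and 2 remain
cat = pd.DataFrame({"is_flat": [1, 0, 1]})                    # rows 0, 1 and 2

outer = pd.concat([num, cat], axis=1)                # default: union of the indexes
inner = pd.concat([num, cat], axis=1, join="inner")  # only rows present in both
print(len(outer), len(inner))
```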

Now that we have combined the datasets into one, we can finally transform the data with MinMaxScaler from the sklearn library. It rescales our values to the range from 0 to 1, which will help when training a model later on.

from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
new_df = min_max_scaler.fit_transform(new_df)
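Keep in mind that `fit_transform` returns a NumPy array, so the column names are lost at this point. The transformation itself is simply (x - min) / (max - min) per column; the sketch below applies that formula with pandas on invented data, which keeps the result as a DataFrame. It is a plain-pandas illustration of the same rescaling, not a replacement for the sklearn class.

```python
import pandas as pd

# Toy data invented for this illustration.
toy = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [1.0, 2.0, 3.0]})

# (x - min) / (max - min), column by column.
# Assumes no column is constant; otherwise the denominator would be zero.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```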

This data is now ready for anything: neural networks, standard ML algorithms and so on!

In this article we did not cover time series data, because such data calls for somewhat different processing techniques, depending on your task. Our team will devote a separate article to this topic in the future, and we hope it will bring something interesting, new and useful into your life, just like this one.

Source: www.habr.com
