Your first step in Data Science. Titanic

A short introduction

I believe we would get more done if we were given step-by-step instructions telling us what to do and how to do it. I remember moments in my own life when I couldn't start something simply because it was hard to understand where to begin. Perhaps you once saw the words "Data Science" on the internet and decided that this was far from you, and that the people who do it are somewhere out there, in another world. No, they are right here. And perhaps it is thanks to people from this field that an article showed up in your feed. There are plenty of courses that will help you get comfortable with this craft, but here I will help you take the first step.

So, are you ready? Let me say right away that you will need to know Python 3, because that is what I use here. I also recommend installing Jupyter Notebook in advance, or looking into how to use Google Colab.

Step one

Kaggle is your main helper here. In principle you can manage without it, but I will talk about that in another article. It is a platform that hosts Data Science competitions. In every such competition, even in the early stages, you gain an enormous amount of experience in solving problems of different kinds, in development, and in teamwork, which matters a lot these days.

We will take our task from there. It is called "Titanic". The condition is this: predict whether each individual passenger survives. Generally speaking, the job of a person working in DS is to collect data, process it, train a model, make predictions, and so on. On Kaggle we are allowed to skip the data collection stage - the data is provided on the platform. We only need to download it and we can begin!

You can do this as follows: the Data tab on the competition page contains the data files.

We have downloaded the data, prepared our Jupyter notebook, and...

Step two

How do we load this data now?

First, let's import the libraries we need:

import pandas as pd
import numpy as np

Pandas lets us load .csv files for further processing.

NumPy is needed to represent our data table as a matrix of numbers.
Let's go. We take the file train.csv and load it:

dataset = pd.read_csv('train.csv')

We will refer to the train.csv data through the dataset variable. Let's see what is in it:

dataset.head()

The head() method lets us look at the first few rows of a dataframe.
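Beyond head(), a few other pandas calls give a quick overview of a freshly loaded table. The sketch below uses a tiny made-up frame (the rows are illustrative, not taken from train.csv) so that it runs without the Kaggle files:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table with a few Titanic-style columns (values are made up)
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0, 1, 1, 0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
})

n_rows, n_cols = toy.shape             # number of rows and columns
missing_per_column = toy.isna().sum()  # how many NaN values each column has
first_two = toy.head(2)                # head() accepts the number of rows to show

print(n_rows, n_cols)
print(missing_per_column)
print(first_two)
```

Checking the count of missing values early is useful here, because it shows which columns will need filling later on.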

The Survived column holds exactly our outcomes, which are known in this dataframe. The task is to predict the Survived column for the data in test.csv. That data stores information about other Titanic passengers, for whom we, as solvers of the problem, do not know the outcome.

So let's split our table into dependent and independent data. It's simple. Dependent data is the data that depends on the independent data, that is, the outcomes. Independent data is the data that influences the outcome.

For example, suppose we have the following data set:

"Did Vova study computer science? No.
Vova got a 2 in computer science."

The grade in computer science (a 2 is a failing grade on the five-point scale) depends on the answer to the question: did Vova study computer science? Clear? Let's move on, we are already closer to the goal!

The conventional variable for independent data is X. For dependent data, y.

We do the following:

X = dataset.iloc[:, 2:]   # independent data: every column from position 2 onward
y = dataset.iloc[:, 1:2]  # dependent data: column 1, Survived

What is this? With the indexer iloc[:, 2:] we tell Python: I want variable X to hold the data starting from the second column (inclusive, and counting from zero). In the second line we say that we want the data in column number one, Survived.

[a:b, c:d] is the general form of what goes in the brackets. If you leave any of these values out, the default is used. For example, we can write [:, :d] and get all columns of the dataframe except those from number d onward. The values a and b select rows, but we need all of them, so we leave them at the default.
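The slicing rules are easier to see on a small table. The following is a minimal sketch on a made-up frame (the column layout mimics train.csv, the values are invented):

```python
import pandas as pd

# Hypothetical frame: column 0 is an id, column 1 the target, the rest are features
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Age': [22.0, 38.0, 26.0],
})

features = toy.iloc[:, 2:]      # all rows, columns from position 2 to the end
target = toy.iloc[:, 1:2]       # all rows, column 1 only (the stop index 2 is excluded)
first_rows = toy.iloc[0:2, :2]  # rows 0 and 1, columns 0 and 1

print(list(features.columns))   # ['Pclass', 'Age']
print(list(target.columns))     # ['Survived']
print(first_rows.shape)         # (2, 2)
```

Writing 1:2 rather than just 1 keeps the result a one-column dataframe instead of a Series.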

Let's see what we got:

X.head()

y.head()

To keep this short lesson simple, we will remove the columns that need special handling or do not affect survival at all. They contain data of type str.

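# Columns that are free text or would need extra preprocessing
cols_to_drop = ['Name', 'Ticket', 'Cabin', 'Embarked']
X = X.drop(columns=cols_to_drop)  # returns a new frame instead of modifying a slice in place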

Great! Let's move on to the next step.

Step three

Here we need to encode our data so that the machine can better understand how it affects the outcome. We won't encode everything, only the str data we kept: the "Sex" column. How do we want to encode it? Let's represent a person's sex as a vector with one position per category, for example: 10 - male, 01 - female.

First, let's convert our tables to NumPy arrays:

X = np.array(X)
y = np.array(y).ravel()  # 1-D label vector, the shape scikit-learn classifiers expect

And now, take a look:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

sklearn is a great library that lets us do the full cycle of Data Science work. It contains a large number of interesting machine learning models and also lets us prepare data.

OneHotEncoder lets us encode a person's sex in the representation we described. Two classes are created: male and female. If the person is male, 1 is written in the "male" column and 0 in the "female" column, and vice versa. Note that scikit-learn orders the categories alphabetically, so in the output the "female" column comes first and the "male" column second.

After OneHotEncoder() comes [1], which means we want to encode column number 1 (counting from zero).
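To see exactly which output column means what, one can run the encoder on a tiny made-up column. scikit-learn sorts the categories, so `female` becomes the first output column and `male` the second. A minimal sketch (the `sex` array is invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single made-up "Sex" column; the encoder expects a 2-D array
sex = np.array([['male'], ['female'], ['female']])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(sex).toarray()  # default output is sparse, so densify it

print(encoder.categories_[0])  # categories in the order of the output columns
print(encoded)
```

The `categories_` attribute is the reliable way to check the column order rather than assuming it.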

Great. Let's go further!

As a rule, some of the data is left empty (that is, NaN - not a number). For example, we have information about a person: name and sex, but no information about age. In this case we will apply the following method: compute the arithmetic mean of each column, and if a value is missing in a column, fill the gap with that column's mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)
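What SimpleImputer with strategy='mean' computes can be reproduced by hand in NumPy. The following is a minimal sketch on made-up numbers, filling each gap with the mean of its own column:

```python
import numpy as np

# Made-up feature matrix: columns are Age and Fare, NaN marks a missing value
data = np.array([
    [22.0, 7.25],
    [np.nan, 71.0],
    [26.0, np.nan],
    [36.0, 8.0],
])

column_means = np.nanmean(data, axis=0)                # per-column mean that ignores NaN
filled = np.where(np.isnan(data), column_means, data)  # put the mean wherever a value is missing

print(column_means)
print(filled)
```

Each column is handled separately, so a missing age is replaced by the mean age and a missing fare by the mean fare.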

Now let's take into account that some features can be much larger than others. Some values lie in the interval [0, 1], while others run into the hundreds or thousands. To remove this spread and make the computations better behaved, we will scale the data. After standardization each scaled column has zero mean and unit variance, so most values fall within about three units of zero. For this we will use StandardScaler.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X[:, 2:] = sc.fit_transform(X[:, 2:])
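Standardization itself is a one-line formula: subtract the column mean and divide by the column standard deviation. The sketch below applies it by hand to made-up fares. Note that this rescales the data but does not clip it, so an extreme value can still end up above 3:

```python
import numpy as np

# Made-up fares with one very large value
fares = np.array([7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
                  13.0, 14.0, 15.0, 16.0, 17.0, 500.0])

mean = fares.mean()
std = fares.std()              # population standard deviation (ddof=0), as StandardScaler uses
scaled = (fares - mean) / std  # z-score: subtract the mean, divide by the standard deviation

print(scaled.mean(), scaled.std())  # approximately 0 and 1
print(scaled.max())                 # the outlier stays far from the rest
```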

Our data now looks like this: the two one-hot columns for Sex come first, followed by the scaled numeric columns.

Great. We are already close to our goal!

Step four

Let's train our first model! The sklearn library offers a huge number of interesting things. For this problem I used the Gradient Boosting Classifier model. We use a classifier because our task is a classification task: each prediction must be assigned to 1 (survived) or 0 (did not survive).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150)
gbc.fit(X, y)

The fit method tells Python: let the model look for dependencies between X and y.

Less than a second, and the model is ready.
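Before sending anything to Kaggle, it can help to estimate locally how well a model does on rows it has not seen. This is not part of the original walkthrough, just one possible check. The sketch below uses synthetic stand-in data rather than the Titanic set, and the names `X_demo`, `y_demo` and `model` are hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Titanic set): the label depends on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Hold back a quarter of the rows to check the model on data it has not seen
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)  # fraction of correct predictions
print(val_accuracy)
```

The same pattern could be applied to X and y from this article by splitting them before calling fit.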

How do we apply it? Let's see!

Step five. Conclusion

Now we need to load the table with our test data, for which we have to make the prediction. We will perform on this table all the same operations we did for X.

X_test = pd.read_csv('test.csv', index_col=0)

# Drop the same columns that were removed from the training data
cols_to_drop = ['Name', 'Ticket', 'Cabin', 'Embarked']
X_test = X_test.drop(columns=cols_to_drop)

X_test = np.array(X_test)

# Reuse the transformers already fitted on the training data (ct, imputer, sc),
# so the test set is encoded, imputed and scaled exactly like X
X_test = np.array(ct.transform(X_test))
X_test = imputer.transform(X_test)
X_test[:, 2:] = sc.transform(X_test[:, 2:])
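One point worth illustrating: preprocessing statistics are usually learned from the training data and then applied unchanged to the test data, so that the same raw value maps to the same scaled value in both sets. The following is a minimal NumPy sketch of that idea, with made-up ages:

```python
import numpy as np

# Made-up ages for the training and test passengers
train_age = np.array([20.0, 30.0, 40.0, 50.0])
test_age = np.array([30.0, 60.0])

# Learn the statistics from the training data only
train_mean = train_age.mean()
train_std = train_age.std()

# Apply the same shift and scale to both sets
train_scaled = (train_age - train_mean) / train_std
test_scaled = (test_age - train_mean) / train_std

print(train_scaled)
print(test_scaled)
```

Here the age 30 gets the same scaled value whether it appears in the training or the test set, which would not hold if the test set were scaled with its own mean and deviation.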

Now let's apply our model!

gbc_predict = gbc.predict(X_test)

That's it. We have made a prediction. Now it needs to be written to a csv file and sent to the site.

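# fmt='%d' writes 0/1 as integers; comments='' keeps the header from being prefixed with '# '
np.savetxt('my_gbc_predict.csv', gbc_predict, fmt='%d', delimiter=",", header='Survived', comments='')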
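The Titanic competition expects a submission with two columns, PassengerId and Survived. One way to build such a file is with a pandas dataframe. The sketch below uses stand-in values: in the real notebook, `passenger_ids` would come from test.csv and `predictions` from gbc.predict(X_test):

```python
import io
import numpy as np
import pandas as pd

# Stand-ins for the real objects: ids as they would come from test.csv,
# and predictions as they would come from gbc.predict(X_test)
passenger_ids = [892, 893, 894]
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': predictions})

buffer = io.StringIO()  # write to memory here; pass a file name instead to save to disk
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False matters, because otherwise pandas adds an extra unnamed index column to the file.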

Done. We have a file containing a prediction for each passenger. All that remains is to upload this solution to the site and get the prediction scored. Such a primitive solution gives us not only 74% correct answers on the public leaderboard, but also a certain boost in Data Science. The most curious can write to me in private messages at any time and ask questions. Thanks, everyone!

Source: www.habr.com
