Your First Step in Data Science. Titanic

A short introduction

I believe we could get more done if we were given step-by-step instructions telling us what to do and how to do it. I remember moments in my own life when I could not get started on something simply because it was hard to understand where to begin. Perhaps, some time ago, you saw the words "Data Science" on the Internet and decided that you were far from all this, and that the people who do it live somewhere out there, in another world. They don't: they are right here. And it may well be thanks to people from this field that an article showed up in your feed. There are plenty of courses that will help you get comfortable with this craft, but here I will help you take the first step.

Well, are you ready? Let me say right away that you will need to know Python 3, because that is what I will use here. I also advise you to set up Jupyter Notebook in advance, or to look at how to use Google Colab.

Step one


Kaggle is your key helper in this matter. In principle, you can do without it, but I will talk about that in another article. It is a platform that hosts Data Science competitions. In each such competition, at the early stages you will gain an enormous amount of experience in solving problems of various kinds, in development, and in working in a team, which matters a lot these days.

We will take our task from there. It is called "Titanic". The task is this: predict whether each individual person will survive. Generally speaking, the job of a person working in DS is collecting data, processing it, training a model, making predictions, and so on. On Kaggle we are allowed to skip the data collection stage: the data is already provided on the platform. We just need to download it and we can get started!

You can do this as follows: the Data tab of the competition page contains the files that hold the data.

We have downloaded the data, prepared our Jupyter notebook, and...

Step two

How do we load this data now?

First, let's import the necessary libraries:

import pandas as pd
import numpy as np

Pandas will let us load .csv files for further processing.

Numpy is needed to represent our data table as a matrix of numbers.

Moving on. Let's take the train.csv file and load it:

dataset = pd.read_csv('train.csv')

We will refer to the data loaded from train.csv through the dataset variable. Let's see what is in there:

dataset.head()


The head() function lets us look at the first few rows of the dataframe.
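Beyond head(), a couple of other pandas calls give a quick picture of a freshly loaded table. The sketch below uses a tiny made-up DataFrame (the name toy and all of its values are assumptions for illustration, not the real train.csv) and shows head(), shape, and a per-column count of missing values.

```python
import numpy as np
import pandas as pd

# A tiny made-up table standing in for train.csv (values are illustrative only).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

first_rows = toy.head(2)               # only the first two rows
n_rows, n_cols = toy.shape             # (number of rows, number of columns)
missing_per_column = toy.isna().sum()  # how many NaN values each column holds

print(first_rows)
print(n_rows, n_cols)
print(missing_per_column)
```

Running the same three calls on dataset should show which columns of the real table contain gaps before any cleaning is done.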

The Survived column holds exactly our results, which are known in this dataframe. For the task itself, we need to predict the Survived column for the data in test.csv. That file stores information about other Titanic passengers for whom we, the ones solving the task, do not know the outcome.

So, let's split our table into dependent and independent data. Everything is simple here. Dependent data is the data that depends on the independent data, that is, the outcome. Independent data is the data that influences the outcome.

For example, suppose we have the following data:

"Vova studied computer science: no.
Vova got a 2 (a failing grade) in computer science."

The computer science grade depends on the answer to the question: did Vova study computer science? Is that clear? Let's move on, we are already close to the goal!

The conventional variable name for independent data is X. For dependent data it is y.

We do the following:

X = dataset.iloc[ : , 2 : ]
y = dataset.iloc[ : , 1 : 2 ]

What is this? With iloc[:, 2:] we tell Python: I want the variable X to hold the data starting from the second column (inclusive, and given that counting starts from zero). In the second line we say that we want y to hold the data from the first column, again counting from zero.

[a:b, c:d] is the general form of what we write in the brackets. If you leave any of these variables out, it takes its default value. That is, we can write [:, :d] and we will get all the columns of the dataframe except those from number d onward. The variables a and b select the rows, but we need all of them, so we leave them at the default.
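To make the [a:b, c:d] notation concrete, here is a minimal sketch on a made-up three-row table. The names toy, X_toy, y_toy, and first_two are assumptions for illustration; only the iloc slicing itself mirrors the text above.

```python
import pandas as pd

# Made-up table with the same leading columns as train.csv.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

X_toy = toy.iloc[:, 2:]       # all rows, columns from position 2 onward
y_toy = toy.iloc[:, 1:2]      # all rows, only the column at position 1
first_two = toy.iloc[:2, :2]  # rows 0 and 1, columns 0 and 1 (end index excluded)

print(list(X_toy.columns))
print(list(y_toy.columns))
print(first_two.shape)
```

Note that slicing with 1:2 rather than a bare 1 keeps y_toy as a one-column DataFrame instead of turning it into a Series.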

Let's see what we got:

X.head()


y.head()


To keep this small tutorial simple, we will remove the columns that need special handling or do not affect survival at all. They contain data of type str.

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X.drop(count, inplace=True, axis=1)
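For comparison, the same column removal can be written without inplace=True, which returns a new DataFrame and leaves the original untouched. This is only a sketch on a made-up table; toy, columns_to_drop, and toy_clean are hypothetical names, not part of the tutorial's code.

```python
import pandas as pd

# Made-up two-row table with a mix of useful and text-only columns.
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Ticket": ["T1", "T2"],
})

columns_to_drop = ["Name", "Ticket"]

# drop() without inplace returns a new DataFrame; toy itself is not modified.
toy_clean = toy.drop(columns=columns_to_drop)

print(list(toy_clean.columns))
print(list(toy.columns))
```

Assigning the result to a new name can make it easier to compare the table before and after the cleanup.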

Great! Let's move on to the next step.

Step three

Here we need to encode our data so that the machine better understands how this data affects the outcome. We will not encode everything, only the str data that we kept: the "Sex" column. How do we want to encode it? Let's represent a person's sex as a vector: 10 for male, 01 for female.

First, let's convert our tables into NumPy matrices:

X = np.array(X)
y = np.array(y)

And now let's look:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

The sklearn library is a great library that lets us do the full cycle of work in Data Science. It contains a large number of interesting machine learning models and also lets us prepare the data.

OneHotEncoder lets us encode a person's sex in the representation we described. Two classes are created: male and female. If the person is a man, 1 is written in the "male" column and 0 in the "female" column. Note that OneHotEncoder sorts the categories alphabetically, so in the resulting array the "female" column comes first and the "male" column second.

After OneHotEncoder() there is [1]: this means we want to encode column number 1 (counting from zero).
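The encoding step can be checked on a small example. The sketch below builds a made-up three-row array (toy, ct_toy, encoded, and categories are hypothetical names) and applies the same ColumnTransformer configuration to column 1, then reads back the category order the encoder learned.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows with the columns Pclass, Sex, Age (dtype=object because of the strings).
toy = np.array([[3, "male", 22.0],
                [1, "female", 38.0],
                [3, "female", 26.0]], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],  # one-hot encode column 1 only
    remainder="passthrough",                           # keep the other columns as they are
)
encoded = np.array(ct_toy.fit_transform(toy))

# The learned categories, in the order of the new one-hot columns.
categories = ct_toy.named_transformers_["encoder"].categories_[0]

print(list(categories))
print(encoded)
```

The encoded columns are placed first in the output, followed by the untouched columns, which is why the later scaling step starts from column 2.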

Great. Let's move on!

As a rule, some of the data turns out to be left blank (that is, NaN, "not a number"). For example, there is information about a person: name and sex. But there is no data on age. In this case, we apply the following method: we compute the arithmetic mean of each column and, if a value is missing in that column, we fill the gap with the mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)
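A worked example makes the mean-filling rule easy to verify by hand. The array below is made up for illustration (toy, imputer_toy, and filled are hypothetical names); each NaN should be replaced by the mean of the known values in its own column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Two made-up columns with one gap each.
toy = np.array([[22.0, 7.25],
                [np.nan, 71.0],
                [26.0, np.nan]])

imputer_toy = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer_toy.fit_transform(toy)

# Column 0: mean of 22 and 26 is 24. Column 1: mean of 7.25 and 71 is 39.125.
print(filled)
```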

Now let's take into account that the data can be very spread out. Some values lie in the interval [0, 1], while others go into the hundreds and thousands. To remove this spread and make the computer's calculations more accurate, we will scale the data. StandardScaler rescales each column to zero mean and unit variance, so most values typically end up within about three units of zero.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X[:, 2:] = sc.fit_transform(X[:, 2:])
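To see what the scaling does, here is a minimal sketch on a made-up array whose two columns differ by a factor of one hundred (toy, scaler_toy, and scaled are hypothetical names). After StandardScaler, each column should have a mean of about zero and a standard deviation of about one, regardless of its original range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

scaler_toy = StandardScaler()
scaled = scaler_toy.fit_transform(toy)  # subtract column mean, divide by column std

print(scaled)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

Because both columns follow the same pattern, they become identical after scaling, which shows that the original units no longer matter.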

Now our data is entirely numeric, with the gaps filled in and the values scaled.

Excellent. We are already close to our goal!

Step four

Let's train our first model! The sklearn library offers a huge number of interesting things. I applied the Gradient Boosting Classifier model to this task. We use a classifier because our task is a classification task: each prediction must be assigned to 1 (survived) or 0 (did not survive).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150)
gbc.fit(X, y.ravel())  # ravel() flattens y from shape (n, 1) to (n,), the shape sklearn expects

The fit function tells Python: let the model look for dependencies between X and y.

Less than a second, and the model is ready.
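Before submitting anything, it can be useful to get a rough local estimate of how well a model like this predicts. The sketch below is not part of the original workflow: it uses synthetic data from make_classification and a hold-out split, and every name in it (X_demo, y_demo, model, val_accuracy, and so on) is an assumption for illustration. Only the classifier settings are copied from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the prepared Titanic table.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Hold back a quarter of the rows to check the model on data it has not seen.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = GradientBoostingClassifier(
    learning_rate=0.5, max_depth=5, n_estimators=150, random_state=0
)
model.fit(X_fit, y_fit)                   # learn from the training part
val_predictions = model.predict(X_val)    # array of 0/1 labels
val_accuracy = model.score(X_val, y_val)  # share of correct predictions

print(val_predictions[:10])
print(val_accuracy)
```

The same pattern could be applied to X and y from the tutorial by splitting them before calling fit, at the cost of training on fewer rows.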


How do we apply it? We will see right now!

Step five. Conclusion

Now we need to load the table with our test data, for which we have to make the prediction. With this table we will perform all the same steps that we did for X.

X_test = pd.read_csv('test.csv', index_col=0)

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X_test.drop(count, inplace=True, axis=1)

X_test = np.array(X_test)

# Reuse the encoder, imputer, and scaler already fitted on the training data,
# so the test set is transformed in exactly the same way as X.
X_test = np.array(ct.transform(X_test))

X_test = imputer.transform(X_test)

X_test[:, 2:] = sc.transform(X_test[:, 2:])

Let's apply our model now!

gbc_predict = gbc.predict(X_test)

That's it. We have made the prediction. Now it needs to be written to a csv file with the PassengerId and Survived columns that Kaggle expects, and submitted to the site.

submission = pd.DataFrame({'PassengerId': pd.read_csv('test.csv')['PassengerId'], 'Survived': gbc_predict})
submission.to_csv('my_gbc_predict.csv', index=False)

Done. We now have a file containing a prediction for each passenger. All that remains is to upload this solution to the site and get the prediction scored. Such a primitive solution gives not only 74% correct answers on the public leaderboard, but also a certain push into Data Science. The most curious readers can write to me in private messages at any time and ask a question. Thanks, everyone!

Source: www.hab.com
