Your first step in Data Science. Titanic

A short introduction

I believe we could do more things if we were given step-by-step instructions telling us what to do and how to do it. I myself remember moments in my life when I could not start something simply because it was hard to understand where to begin. Perhaps you once saw the words "Data Science" on the Internet and decided that you were far from all this, and that the people who do it are somewhere out there, in another world. No, they are right here. And perhaps it is thanks to people from this field that this article appeared in your feed. There are plenty of courses that will help you get comfortable with this craft; here I will help you take the first step.

Well, are you ready? Let me say right away that you will need to know Python 3, since that is what I will be using here. I also advise you to install Jupyter Notebook in advance or to look at how to use Google Colab.

Step one


Kaggle is your key assistant in this matter. In principle you can do without it, but I will talk about that in another article. It is a platform that hosts Data Science competitions. In each such competition, in the early stages you will gain an unreal amount of experience in solving problems of various kinds, development experience, and experience of working in a team, which matters in our time.

We will take our task from there. It is called "Titanic". The condition is this: predict whether each individual person will survive. Generally speaking, the job of someone working in DS is to collect data, process it, train a model, make predictions, and so on. On Kaggle we are allowed to skip the data collection step: the data is provided on the platform. We just need to download it and we can get started!

You can do this as follows: the Data tab contains the files with the data.


We have downloaded the data, prepared our Jupyter notebooks and...

Step two

How do we load this data now?

First, let's import the necessary libraries:

import pandas as pd
import numpy as np

Pandas will let us load .csv files for further processing.

Numpy is needed to represent our data table as a matrix of numbers.

Moving on. Let's take the train.csv file and load it:

dataset = pd.read_csv('train.csv')

We will refer to our train.csv data sample through the dataset variable. Let's see what is in it:

dataset.head()


The head() function lets us look at the first few rows of a dataframe.

The Survived column is exactly our result, which is known in this dataframe. According to the task, we need to predict the Survived column for the test.csv data. That data holds information about other Titanic passengers for whom we, the ones solving the problem, do not know the outcome.

So, let's split our table into dependent and independent data. It is all simple here. Dependent data is the data that depends on the independent data; it is the outcome. Independent data is the data that influences the outcome.

For example, we have the following data set:

"Did Vova study computer science? No."
"Vova got a 2 (a failing grade) in computer science."

The grade in computer science depends on the answer to the question: did Vova study computer science? Is that clear? Let's move on, we are already closer to the goal!

The traditional variable name for independent data is X. For dependent data it is y.

We do the following:

X = dataset.iloc[ : , 2 : ]
y = dataset.iloc[ : , 1 : 2 ]

What is going on? With iloc[:, 2:] we tell Python: I want the variable X to hold the data starting from column 2 (inclusive, with counting starting from zero). In the second line we say that we want the data from column 1.

[a:b, c:d] is the construct we use inside the brackets. If you leave any of these variables out, the defaults are used. For example, we can write [:, :d] and get all the columns of the dataframe except those from number d onward. The variables a and b select rows, but we need all of them, so we leave those at their defaults.
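To make the slicing rules concrete, here is a small sketch on a toy dataframe. The column names mimic train.csv, but the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are invented for this example.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# All rows, columns from position 2 to the end (the features).
X_toy = toy.iloc[:, 2:]

# All rows, only the column at position 1. The slice 1:2 keeps a DataFrame;
# a plain integer (toy.iloc[:, 1]) would return a Series instead.
y_toy = toy.iloc[:, 1:2]

# All rows, columns before position 2 (the end of a slice is exclusive).
first_two = toy.iloc[:, :2]

print(list(X_toy.columns))
print(list(y_toy.columns))
print(list(first_two.columns))
```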

Let's see what we have:

X.head()


y.head()


To simplify this small lesson, we will remove the columns that either need special care or do not affect survival at all. They contain data of type str.

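count = ['Name', 'Ticket', 'Cabin', 'Embarked']
# X is a slice of dataset, so assign the result instead of dropping in place;
# this avoids pandas' SettingWithCopyWarning
X = X.drop(columns=count)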
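If you are not sure which columns hold text, one possible way to find them is to check each column's dtype before dropping anything. This is only a sketch on a made-up toy dataframe; the name non_numeric is my own.

```python
import pandas as pd

# Toy dataframe with invented values, shaped like a slice of the Titanic data.
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Name": ["Passenger A", "Passenger B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
})

# Collect the columns whose dtype is not numeric.
non_numeric = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]
print(non_numeric)

# Drop one of them without touching the original dataframe.
trimmed = toy.drop(columns=["Name"])
print(list(trimmed.columns))
```

Here "Sex" shows up as text too, but we keep it because it gets encoded in the next step.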

Super! Let's move on to the next step.

Step three

Here we need to encode our data so that the machine better understands how this data affects the result. We will not encode everything, only the str data that we kept: the "Sex" column. How do we want to encode it? Let's represent a person's sex as a vector, for example 10 for male and 01 for female.

First, let's convert our tables into NumPy matrices:

X = np.array(X)
y = np.array(y)

And now let's do the encoding:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

The sklearn library is a really cool library that lets us do complete work in Data Science. It contains a large number of interesting machine learning models and also lets us do data preparation.

OneHotEncoder lets us encode a person's sex in the representation we described. Two classes will be created: male and female. If the person is a man, a 1 is written in the "male" column and a 0 in the "female" column. Note that OneHotEncoder orders the categories alphabetically, so in the output the "female" column comes first.

The [1] after OneHotEncoder() means that we want to encode column number 1 (counting from zero).
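Here is a small sketch of what the encoder produces on three toy rows laid out like our X at this point (Pclass, Sex, Age). The values are made up, and the names toy_X, toy_ct and categories are only for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows in the layout [Pclass, Sex, Age]; the values are invented.
toy_X = np.array([[3, "male", 22.0],
                  [1, "female", 38.0],
                  [3, "female", 26.0]], dtype=object)

toy_ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],
    remainder="passthrough",
)
toy_encoded = np.array(toy_ct.fit_transform(toy_X))

# Categories learned by the encoder, in the order of the output columns.
categories = toy_ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(toy_encoded)
```

The encoded columns are placed first and the untouched columns follow, so every later column index shifts by one.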

Super. Let's move on!

As a rule, some of the data turns out to be empty (that is, NaN, "not a number"). For example, there is information about a person, such as name and sex, but nothing about age. In this case we will use the following method: we compute the arithmetic mean of each column and, wherever a value is missing in that column, we fill the gap with the column's mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)
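To see what the imputer actually does, here is a tiny sketch on a made-up 3x2 matrix with one gap in each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented values: think of the columns as Age and Fare.
toy = np.array([[22.0, 7.25],
                [np.nan, 71.0],
                [26.0, np.nan]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = toy_imputer.fit_transform(toy)

# Per-column means that were used as fill values.
print(toy_imputer.statistics_)
print(filled)
```

Each gap is filled with the mean of its own column, computed from the values that are present.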

Now let's take into account that our data can be very spread out. Some values lie in the interval [0, 1], while others can run into the hundreds and thousands. To remove this spread and make the computer's calculations more accurate, we will scale the data. StandardScaler rescales each column to zero mean and unit variance, so most values end up within roughly three units of zero (it does not impose a hard limit). For this we will use the StandardScaler class.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X[:, 2:] = sc.fit_transform(X[:, 2:])
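A quick sketch of the effect of scaling, on made-up numbers where one column is hundreds of times larger than the other:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values with very different ranges per column.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 900.0]])

scaled = StandardScaler().fit_transform(toy)

# After scaling, every column has mean 0 and standard deviation 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```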

Now our data looks like this:


Great. We are already close to our goal!

Step four

Let's train our first model! In the sklearn library we can find a large number of interesting things. I used the Gradient Boosting Classifier model for this problem. We use a classifier because our task is a classification task: each prediction must be assigned to 1 (survived) or 0 (did not survive).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150)
# y has shape (n, 1); ravel() flattens it to the 1-D array that fit expects
gbc.fit(X, y.ravel())

The fit function tells Python: let the model look for dependencies between X and y.

Less than a second later, the model is ready.
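The article goes straight to the Kaggle score, but if you want a rough idea of the model's quality before submitting, one common option is to hold back part of the training data. The sketch below is my own addition and runs on synthetic data from make_classification rather than on the Titanic table. The 80/20 split and the random_state values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared X and y (7 features, like ours).
X_demo, y_demo = make_classification(n_samples=300, n_features=7,
                                     random_state=0)

# Keep 20% of the rows aside; the model never sees them during training.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(learning_rate=0.5, max_depth=5,
                                   n_estimators=150, random_state=0)
model.fit(X_tr, y_tr)

# Share of correct answers on the held-back rows.
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_accuracy:.2f}")
```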


How do we use it? We will see right now!

Step five. Conclusion

Now we need to load the table with our test data, for which we need to make predictions. With this table we will perform all the same actions that we did for X.

X_test = pd.read_csv('test.csv', index_col=0)

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X_test.drop(count, inplace=True, axis=1)

X_test = np.array(X_test)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X_test = np.array(ct.fit_transform(X_test))

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_test)
X_test = imputer.transform(X_test)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_test[:, 2:] = sc.fit_transform(X_test[:, 2:])
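The code above fits new transformers on the test table. A common alternative, shown here only as a sketch on made-up numbers, is to reuse the objects already fitted on the training data and call transform on the test data, so both tables are filled and scaled with the same statistics. The names imputer_demo and scaler_demo are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Invented training and test matrices with a few gaps.
train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [5.0, 30.0]])
test = np.array([[np.nan, 20.0],
                 [3.0, 40.0]])

# Fit on the training data only.
imputer_demo = SimpleImputer(strategy="mean").fit(train)
scaler_demo = StandardScaler().fit(imputer_demo.transform(train))

# Apply the already fitted objects to the test data (no refitting).
test_prepared = scaler_demo.transform(imputer_demo.transform(test))
print(test_prepared)
```

The gap in the test table is filled with the training mean (3.0), not with a mean computed from the test table itself.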

Let's use our model now!

gbc_predict = gbc.predict(X_test)

That's it. We have made a prediction. Now it needs to be written to a csv file and sent to the site.

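# fmt='%d' writes the 0/1 labels as integers; comments='' stops NumPy
# from prefixing the header line with '# '
np.savetxt('my_gbc_predict.csv', gbc_predict, fmt='%d', delimiter=",", header='Survived', comments='')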
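The Titanic competition expects a file with two columns, PassengerId and Survived. If you kept the passenger IDs around (for example, by saving the index of the test dataframe before converting it to a NumPy array), a submission file could be sketched like this. The IDs, the predictions, and the file name my_gbc_submission.csv are placeholders of mine.

```python
import numpy as np
import pandas as pd

# Placeholder values; in practice these would be the saved test index
# and the output of gbc.predict(X_test).
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

# index=False keeps pandas from writing an extra unnamed column.
submission.to_csv("my_gbc_submission.csv", index=False)
print(submission)
```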

Done. We now have a file with a prediction for every passenger. All that remains is to upload these answers to the site and get a score for the prediction. Even such a primitive solution gives not only 74% correct answers on the public leaderboard, but also a little push into Data Science. The most curious can send me a private message at any time and ask a question. Thanks to everyone!

Source: www.habr.com
