Your first step in Data Science. Titanic

A short introduction

I believe we could do more if we were given step-by-step instructions that tell us what to do and how to do it. I remember moments in my own life when I could not start something simply because it was hard to see where to begin. Perhaps you once saw the words "Data Science" on the Internet and decided that you were far from all this, and that the people who do it were somewhere out there, in another world. No, they are right here. And perhaps it is thanks to people from this field that an article appeared in your feed. There are plenty of courses that will help you get used to this craft, but here I will help you take the first step.

Well, are you ready? Let me say right away that you need to know Python 3, since that is what I use here. I also advise you to install Jupyter Notebook in advance or to see how to use Google Colab.

Step One

Kaggle is your helper here. In principle you can do without it, but I will talk about that in another article. Kaggle is a platform that hosts Data Science competitions. In each of these competitions, already in the early stages, you gain a tremendous amount of experience in solving problems of various kinds, development experience, and experience of working in a team, which matters a great deal these days.

We will take our task from there. It is called "Titanic". The condition is: predict whether each individual passenger survives. Generally speaking, the work of a person in DS consists of collecting data, processing it, training a model, making predictions, and so on. On Kaggle we are allowed to skip the data collection stage, because the data is provided on the platform. We only need to download it and we can start!

You can do this as follows:

The Data tab contains the data files.

We have downloaded the data, prepared our Jupyter notebook and...

Step Two

How do we load this data now?

First, let's import the necessary libraries:

import pandas as pd
import numpy as np

Pandas lets us load .csv files for further processing.

NumPy is needed to represent our data table as a matrix of numbers.
Let's move on. Take the train.csv file and load it:

dataset = pd.read_csv('train.csv')

We will refer to our train.csv data through the dataset variable. Let's see what it contains:

dataset.head()

The head() function lets us look at the first few rows of a dataframe.
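If you do not have train.csv at hand yet, here is a minimal sketch of the same two calls on a made-up sample. The three rows and the names sample_csv, sample and first_rows are my own illustration, not part of the Kaggle data.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like the first columns of train.csv.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age\n"
    "1,0,3,Passenger A,male,22\n"
    "2,1,1,Passenger B,female,38\n"
    "3,1,3,Passenger C,female,26\n"
)

sample = pd.read_csv(sample_csv)  # read_csv accepts a path or a file-like object
first_rows = sample.head(2)       # head(n) returns the first n rows (5 by default)
print(first_rows)
```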

The Survived column is exactly our result, which is known in this dataframe. The task asks us to predict the Survived column for the test.csv data. That file stores information about other Titanic passengers whose outcome we, the people solving the problem, do not know.

So let's split our table into dependent and independent data. It is simple. Dependent data is the data that depends on the independent data, that is, the outcome. Independent data is the data that influences the outcome.

For example, we have the following data set:

"Vova studied computer science: no.
Vova received a 2 in computer science."

The computer science grade depends on the answer to the question: did Vova study computer science? Is that clear? Let's move on, we are already closer to the goal!

The traditional name for the independent data is X. For the dependent data it is y.

We do the following:

X = dataset.iloc[ : , 2 : ]
y = dataset.iloc[ : , 1 : 2 ]

What is this? With iloc[:, 2:] we tell Python: I want variable X to hold the data starting from the second column (inclusive, and counting from zero). In the second line we say that we want y to hold the data from the first column.

[a:b, c:d] is the construct we put inside the square brackets. If you leave any of the variables out, its default is used. That is, we can write [:, :d] and get all the columns of the dataframe except those from number d onward. The variables a and b select rows, but we need all of them, so we leave them at the default.
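The slicing rules above can be tried on a toy table. This is only a sketch: the frame df and its values are made up, and the variable names are mine.

```python
import pandas as pd

# Toy frame with the same leading columns as train.csv (values are made up).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

features = df.iloc[:, 2:]         # all rows, columns from position 2 to the end
target = df.iloc[:, 1:2]          # all rows, only column 1 (the end index 2 is excluded)
leading = df.iloc[:, :2]          # all rows, columns 0 and 1
first_two_rows = df.iloc[0:2, :]  # rows 0 and 1, every column

print(list(features.columns))
print(list(target.columns))
```

Note that 1:2 keeps the result a one-column table, while a plain 1 would return a one-dimensional Series.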

Let's see what we got:

X.head()

y.head()

To keep this small lesson simple, we will remove the columns that need special care or do not affect survival at all. They contain data of type str.

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X = X.drop(columns=count)  # returns a new frame instead of mutating a slice of dataset

Great! Let's move on to the next step.

Step Three

Here we need to encode our data so that the machine better understands how this data affects the result. We will not encode everything, only the str data that we kept: the "Sex" column. How do we want to encode it? Let's represent a person's sex as a two-element vector with a single 1 marking the category. With the encoder used below, which orders categories alphabetically, that gives 10 for female and 01 for male.

First, let's convert our tables into NumPy arrays:

X = np.array(X)
y = np.array(y).ravel()  # flatten the (n, 1) column to the 1-D shape that fit() expects

And now take a look:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

The sklearn library is a really cool library that lets us do full-fledged Data Science work. It contains many interesting machine learning models and also lets us prepare the data.

OneHotEncoder lets us encode a person's sex in the representation described above. Two classes are created: male and female. If the person is a man, 1 is written in the "male" column and 0 in the "female" column, respectively.

After OneHotEncoder() there is [1]. This means that we want to encode column number 1 (counting from zero).
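To see what the encoder produces, here is a small sketch on three made-up rows of the form [Pclass, Sex, Age]. The names rows, ct_demo, encoded and categories are mine. OneHotEncoder orders the categories alphabetically, so the first new column is "female" and the second is "male".

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: [Pclass, Sex, Age]; the Sex column sits at index 1.
rows = np.array([[3, "male", 22.0],
                 [1, "female", 38.0],
                 [3, "female", 26.0]], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],  # encode only column 1
    remainder="passthrough",                           # keep the other columns as they are
)
encoded = np.array(ct_demo.fit_transform(rows))

# The encoded columns come first, followed by the untouched columns.
categories = list(ct_demo.named_transformers_["encoder"].categories_[0])
print(categories)
print(encoded)
```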

Super. Let's move on!

As a rule, some of the data is left empty (that is, NaN, "not a number"). For example, there is information about a person: their name and sex. But there is no information about their age. In this case we use the following method: we find the arithmetic mean of each column and, if a value is missing in that column, we fill the gap with the mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)
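A tiny sketch of the same idea on made-up numbers. The array values and the names imputer_demo and filled are mine; the two columns stand for age and fare.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up columns: [Age, Fare]; one age is missing.
values = np.array([[20.0, 10.0],
                   [np.nan, 30.0],
                   [40.0, 50.0]])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer_demo.fit_transform(values)  # learn the column means, then fill the gaps

print(imputer_demo.statistics_)  # the per-column means that were learned
print(filled)
```

The missing age is replaced by the mean of the two known ages, and the fare column is left unchanged because it has no gaps.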

Now let's take into account that the data can vary greatly in magnitude. Some values lie in the interval [0, 1], while others go into the hundreds and thousands. To remove this spread and make the computer's calculations more accurate, we will scale the data. StandardScaler rescales each column to zero mean and unit variance, so most values end up within about three units of zero. To do this, we use the StandardScaler class.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X[:, 2:] = sc.fit_transform(X[:, 2:])
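Here is a short sketch of what the scaling does to a single column. The fare values are made up, and the names fares, scaler_demo, scaled and manual are mine.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up fares with a very wide spread.
fares = np.array([[7.25], [71.28], [8.05], [512.33]])

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(fares)

# The same thing by hand: subtract the column mean, divide by the column standard deviation.
manual = (fares - fares.mean()) / fares.std()

print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

After scaling, the column has a mean of about 0 and a standard deviation of about 1, whatever its original units were.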

Our data is now fully numeric: two one-hot columns for sex followed by the scaled features.

Great. We are already close to our goal!

Step Four

Let's train our first model! We can find many interesting things in the sklearn library. I applied the Gradient Boosting Classifier model to this problem. We use a classifier because ours is a classification task: each prediction must be assigned to 1 (survived) or 0 (did not survive).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150)
gbc.fit(X, y)

The fit function tells Python: let the model look for dependencies between X and y.
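To make fit and predict less abstract, here is a sketch on synthetic data in which the label depends only on the first feature. Everything here (X_toy, y_toy, model, and the smaller hyperparameters) is my own illustration and not the Titanic setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)  # the label is 1 when the first feature is positive

model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
model.fit(X_toy, y_toy)                # learn the dependency between X_toy and y_toy

preds = model.predict(X_toy[:5])       # one class label (0 or 1) per row
train_accuracy = model.score(X_toy, y_toy)
print(preds, train_accuracy)
```

Accuracy on the training rows only shows that the dependency was picked up; it says nothing about how the model does on unseen passengers.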

Less than a second passes and the model is ready.

How do we apply it? We will see now!

Step Five. Conclusion

Now we need to load the table with our test data, for which we have to make the prediction. We will do with this table everything we did for X.

X_test = pd.read_csv('test.csv', index_col=0)

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X_test.drop(count, inplace=True, axis=1)

X_test = np.array(X_test)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X_test = np.array(ct.fit_transform(X_test))

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_test)
X_test = imputer.transform(X_test)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_test[:, 2:] = sc.fit_transform(X_test[:, 2:])
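The code above fits each transformer again on the test table. A common alternative, which is not what this article does, is to reuse the objects already fitted on the training data so that both tables are filled and scaled with the same statistics. A minimal sketch on made-up numbers (train_part, test_part and the other names are mine):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up columns: [Age, Fare].
train_part = np.array([[20.0, 10.0], [np.nan, 30.0], [40.0, 50.0]])
test_part = np.array([[np.nan, 20.0], [60.0, 40.0]])

# Learn the means and the scale from the training rows only.
imputer_fitted = SimpleImputer(strategy="mean").fit(train_part)
scaler_fitted = StandardScaler().fit(imputer_fitted.transform(train_part))

# Apply the already fitted objects to the test rows: transform, not fit_transform.
test_ready = scaler_fitted.transform(imputer_fitted.transform(test_part))
print(test_ready)
```

Here the missing test age is filled with the training mean, so after scaling it becomes exactly 0.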

Let's apply our model now!

gbc_predict = gbc.predict(X_test)

That's it. We have made a prediction. Now it needs to be written to a csv file and submitted on the site.

np.savetxt('my_gbc_predict.csv', gbc_predict, fmt='%d', delimiter=",", header='Survived', comments='')
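Assuming the competition expects a submission with both a PassengerId and a Survived column, the file could also be written with pandas. This is only a sketch: passenger_ids and predictions below are made-up stand-ins for the IDs from test.csv and for gbc_predict, and the demo file goes to a temporary directory.

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Stand-ins for the real objects: IDs as read from test.csv and the model output.
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

out_path = os.path.join(tempfile.gettempdir(), "my_gbc_predict_demo.csv")
submission.to_csv(out_path, index=False)  # index=False keeps the row index out of the file

with open(out_path) as f:
    print(f.read())
```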

Done. We have a file with a prediction for each passenger. All that remains is to upload the solution to the site and get the prediction scored. Such a primitive solution gives not only 74% correct answers on the public leaderboard, but also some momentum in Data Science. The most curious can write to me in private messages at any time and ask a question. Thanks to everyone!

Source: www.habr.com
