Your First Step in Data Science. Titanic

A short introduction

I believe we could do much more if we were given step-by-step instructions telling us what to do and how to do it. I myself remember moments in my life when I could not get started on something because it was simply hard to understand where to begin. Perhaps you once saw the words "Data Science" on the Internet and decided that you were far from all this, and that the people who do it are somewhere out there, in another world. No, they are right here. And perhaps it is thanks to people from this field that an article showed up in your feed. There are plenty of courses that will help you get comfortable with this craft; here I will help you take the first step.

So, are you ready? Let me say right away that you will need to know Python 3, since that is what I will be using. I also recommend installing Jupyter Notebook in advance, or looking at how to use Google Colab.

Step one

Kaggle is our main helper here. In principle you can do without it, but I will cover that in another article. It is a platform that hosts Data Science competitions. In each such competition you will, at the early stages, gain invaluable experience in solving a wide range of problems, development experience, and experience of working in a team, which matters a great deal these days.

We will take our task from there. It is called "Titanic". The condition is this: predict whether each individual passenger survives. Generally speaking, the job of a person working in DS is to collect data, process it, train a model, make predictions, and so on. On Kaggle we are allowed to skip the data collection stage - the data is provided on the platform. We just need to download it and we can get started!

You can do this as follows: open the competition's Data tab, which contains the data files.

We have downloaded the data, prepared our Jupyter notebooks and...

Step two

How do we load this data now?

First, let's import the necessary libraries:

import pandas as pd
import numpy as np

Pandas will let us load .csv files for further processing.

NumPy is needed to represent our data table as a matrix of numbers.

Moving on. Let's take the train.csv file and load it:

dataset = pd.read_csv('train.csv')

We will refer to our train.csv data through the variable dataset. Let's see what is in there:

dataset.head()

The head() function lets us look at the first few rows of a DataFrame.
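As a minimal sketch of what head() returns, here is a tiny made-up table instead of the real train.csv. The three rows and their values are invented purely for illustration; only the column names follow the Titanic file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; NOT the real Kaggle data.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""

toy = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 by default) as a new DataFrame.
first_two = toy.head(2)
print(first_two)
print(toy.shape)  # (number of rows, number of columns)
```

Calling head() without an argument on the real dataset shows its first five rows in the same way.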

The Survived column holds our results, which are known for this DataFrame. For the task itself, we need to predict the Survived column for the test.csv data. That file stores information about other Titanic passengers, for whom we, the ones solving the problem, do not know the outcome.

So, let's split our table into dependent and independent data. It is simple. Dependent data is the data that depends on the independent data, that is, the outcome. Independent data is the data that influences the outcome.

For example, suppose we have the following data:

"Did Vova study computer science? No.
Vova got a 2 in computer science (a failing grade on the five-point scale)."

The grade in computer science depends on the answer to the question: did Vova study computer science? Is that clear? Let's move on, we are getting closer to the goal!

The conventional variable name for independent data is X. For dependent data, y.

We do the following:

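# Independent data: all rows, every column from index 2 (Pclass) onward
X = dataset.iloc[:, 2:]
# Dependent data: all rows, column 1 only (Survived), kept as a one-column DataFrame
y = dataset.iloc[:, 1:2]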

What is this? With iloc[:, 2:] we tell Python: I want the variable X to hold the data starting from the column with index 2 (inclusive, counting from zero). In the second line we say that we want only the data in the column with index 1.

[a:b, c:d] is the construct we use inside the square brackets. If you leave some of these values out, the defaults are used. That is, we can write [:, :d] and get every column in the DataFrame except those from number d onward. The values a and b select rows, but we need all of the rows, so we leave them at their defaults.
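The slicing rules above can be tried out on a small table. This is only a sketch: the DataFrame toy and its values are made up, and only the first few column names mirror train.csv.

```python
import pandas as pd

# Hypothetical toy table with the same leading columns as train.csv.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

X_toy = toy.iloc[:, 2:]       # all rows, columns 2, 3, ... (Pclass, Sex)
y_toy = toy.iloc[:, 1:2]      # all rows, column 1 only; still a DataFrame
first_cols = toy.iloc[:, :2]  # all rows, columns 0 and 1 (stop index 2 is excluded)
top_rows = toy.iloc[0:2, :]   # rows 0 and 1, all columns

print(list(X_toy.columns))
print(y_toy.shape)
```

Note that the stop index is exclusive, which is why 1:2 selects exactly one column.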

Let's see what we got:

X.head()

y.head()

To keep this short lesson simple, we will remove the columns that need special handling or that do not affect survival at all. They contain data of type str.

# Text columns we will not use in this lesson
count = ['Name', 'Ticket', 'Cabin', 'Embarked']
# drop returns a new DataFrame without these columns
X = X.drop(columns=count)

Great! Let's move on to the next step.

Step three

Here we need to encode our data so that the machine can better understand how it affects the outcome. We will not encode everything, only the str data we kept: the "Sex" column. How do we want to encode it? Let's represent a person's sex as a two-element vector with one position per category. OneHotEncoder orders the categories alphabetically, so female becomes 10 and male becomes 01.

First, let's convert our tables to NumPy matrices:

X = np.array(X)
y = np.array(y).ravel()  # flatten the (n, 1) column into the 1-D array that fit() expects

And now take a look:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

sklearn is a great library that lets us do full-fledged Data Science work. It contains a large number of interesting machine learning models and also lets us prepare the data.

OneHotEncoder lets us encode a person's sex in the representation we described. Two classes are created: male and female. If the person is male, 1 is written in the "male" column and 0 in the "female" column.

After OneHotEncoder() comes [1] - this means that we want to encode the column with index 1 (counting from zero).
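To see the encoding on something small, here is a sketch with three made-up rows laid out as [Pclass, Sex, Age]. The names X_toy and ct_toy are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: [Pclass, Sex, Age]; values are invented.
X_toy = np.array([
    [3, "male", 22.0],
    [1, "female", 38.0],
    [3, "female", 26.0],
], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],  # encode column index 1
    remainder="passthrough",                           # keep the other columns
)
encoded = ct_toy.fit_transform(X_toy)

# Depending on how sparse the result is, it may come back as a sparse matrix.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()
encoded = np.asarray(encoded, dtype=float)

# Categories are listed in the order of the new columns.
print(ct_toy.named_transformers_["encoder"].categories_)
print(encoded)
```

The encoded columns come first and the untouched columns follow, so column indices shift after this step.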

Great. Let's keep going!

It often happens that some data is left empty (that is, NaN - not a number). For example, we have information about a person: their name and sex, but nothing about their age. In this case we will use the following method: we compute the arithmetic mean of each column and, if a value is missing in a column, fill the gap with that column's mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)
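A small sketch of mean imputation, using a made-up two-column array (the Age and Fare labels are just for orientation, and imputer_toy is a hypothetical name):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical columns: [Age, Fare]; np.nan marks a missing value.
X_toy = np.array([
    [20.0, 10.0],
    [np.nan, 30.0],
    [40.0, np.nan],
])

imputer_toy = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer_toy.fit_transform(X_toy)

# statistics_ holds the per-column means computed from the non-missing values.
print(imputer_toy.statistics_)
print(X_filled)
```

Each gap is replaced by the mean of its own column, so the other columns do not influence the fill value.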

Now let's take into account that features can be on very different scales. Some values lie in the interval [0, 1], while others can run into the hundreds or thousands. To remove this spread and make the model's calculations better behaved, we will standardize the data. After standardization each column has zero mean and unit variance, so most values end up within about three units of zero. For this we will use StandardScaler.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X[:, 2:] = sc.fit_transform(X[:, 2:])

Now our data looks like this:

[Image: the transformed feature matrix X]
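What standardization does can be checked on a few made-up numbers. In this sketch the two columns are on very different scales; the array and the name sc_toy are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales: [Age, Fare].
X_toy = np.array([
    [20.0, 100.0],
    [30.0, 200.0],
    [40.0, 300.0],
])

sc_toy = StandardScaler()
X_scaled = sc_toy.fit_transform(X_toy)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns carry the same values here, because they differ only in scale and offset.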

Great. We are already close to our goal!

Step four

Let's train our first model! The sklearn library offers a large number of interesting things. I applied the Gradient Boosting Classifier model to this problem. We use a classifier because our task is a classification task: each prediction must be either 1 (survived) or 0 (did not survive).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150)
gbc.fit(X, y)

The fit function tells Python: let the model look for dependencies between X and y.
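One way to get a rough idea of how well a fitted model generalizes, before uploading anything, is to hold back part of the labelled data and score the model on it. The sketch below is not part of the original walkthrough: it uses synthetic data from make_classification as a stand-in for the prepared Titanic features, and all the *_toy and *_val names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features (assumption for illustration).
X_toy, y_toy = make_classification(n_samples=300, n_features=7, random_state=0)

# Keep 25% of the labelled rows aside for checking the model.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

gbc_toy = GradientBoostingClassifier(
    learning_rate=0.5, max_depth=5, n_estimators=150, random_state=0
)
gbc_toy.fit(X_fit, y_fit)          # learn the mapping from features to labels
val_pred = gbc_toy.predict(X_val)  # predicted classes, each 0 or 1

print(accuracy_score(y_val, val_pred))  # share of correct predictions
```

The same split could be applied to the real X and y; the score on the held-back rows is only an estimate of what the leaderboard will report.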

Less than a second, and the model is ready.

How do we apply it? We will see right now!

Step five. Conclusion

Now we need to load the table with our test data, for which we have to make a prediction. We will take this table through all the same steps that we used for X.

# PassengerId becomes the index, so the remaining columns line up with X
X_test = pd.read_csv('test.csv', index_col=0)

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X_test.drop(count, inplace=True, axis=1)

X_test = np.array(X_test)

# Reuse the objects already fitted on the training data: call transform, not fit,
# so the test table is encoded, imputed and scaled exactly the way X was
X_test = np.array(ct.transform(X_test))

X_test = imputer.transform(X_test)

X_test[:, 2:] = sc.transform(X_test[:, 2:])
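Encoders, imputers and scalers learn their statistics from the data they are fitted on. A common convention is to fit them on the training table only and then call transform on the test table, so that both tables are expressed in the same units. The sketch below, on made-up one-column arrays with hypothetical names, shows how the same raw value is scaled differently when the scaler is refitted on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-column "train" and "test" sets.
train_toy = np.array([[10.0], [20.0], [30.0]])
test_toy = np.array([[20.0], [40.0]])

scaler = StandardScaler()
scaler.fit(train_toy)                     # mean and std come from the training data only
test_shared = scaler.transform(test_toy)  # test values in the training data's units

# Refitting takes the mean and std from the test data instead.
test_refit = StandardScaler().fit_transform(test_toy)

print(test_shared.ravel())
print(test_refit.ravel())
```

The raw value 20 equals the training mean, so it maps to 0 with the shared scaler but to -1 after refitting; a model trained on the first convention would read the second one incorrectly.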

Let's apply our model now!

gbc_predict = gbc.predict(X_test)

That's it. We have made a prediction. Now it needs to be written to a csv file and sent to the site.

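# Kaggle expects two columns, PassengerId and Survived, with a plain header row
passenger_ids = pd.read_csv('test.csv')['PassengerId']
pd.DataFrame({'PassengerId': passenger_ids, 'Survived': gbc_predict}).to_csv('my_gbc_predict.csv', index=False)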

Done. We now have a file containing a prediction for each passenger. All that remains is to upload this solution to the site and get a score for the prediction. Such a primitive solution gives not only 74% correct answers on the public leaderboard, but also some momentum in Data Science. The most curious can write to me in private messages at any time and ask a question. Thanks to everyone!

source: www.habr.com
