Your first step in Data Science. Titanic

A short introductory word

I believe we could do many things if we were given step-by-step instructions telling us what to do and how to do it. I remember moments in my own life when I could not start something simply because it was hard to understand where to begin. Perhaps you once saw the words "Data Science" on the Internet and decided that you were far from it, and that the people who do it are somewhere out there, in another world. No, they are right here. And perhaps it is thanks to people from this field that an article showed up in your feed. There are plenty of courses that will help you get used to this craft, but here I will help you take the first step.

Well, are you ready? Let me say right away that you will need to know Python 3, since that is what I will be using here. I also advise you to install Jupyter Notebook in advance or to look at how to use Google Colab.

Step one


Kaggle is your most important helper here. Strictly speaking, you can do without it, but I will talk about that in another article. Kaggle is a platform that hosts Data Science competitions. In every such competition, already in the early stages, you will gain an unreal amount of experience in solving all kinds of problems, in development, and in working in a team, which matters a lot these days.

We will take our task from there. It is called "Titanic". The condition is this: predict whether each individual passenger survives. In general, the job of a person working in DS is to collect data, process it, train a model, make predictions, and so on. On Kaggle we are allowed to skip the data collection stage: the data is already provided on the platform. We only need to download it and we can start!

You can do this as follows:

The Data tab contains the data files.


We have downloaded the data, prepared our Jupyter notebooks, and...

Step two

Now, how do we load this data?

First, let's import the libraries we need:

import pandas as pd
import numpy as np

Pandas lets us load .csv files for further processing.

NumPy is needed to represent our data table as a matrix of numbers.
Moving on. Let's take the train.csv file and load it:

dataset = pd.read_csv('train.csv')

From now on we refer to our train.csv data through the dataset variable. Let's see what is in it:

dataset.head()


The head() function lets us look at the first few rows of the dataframe.
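As a small illustration of what head() and a couple of related calls return, here is a sketch on a tiny made-up table. The three rows and the names toy and csv_text are invented for this example; they are not the real train.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like a few train.csv columns.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

first_two = toy.head(2)                 # only the first two rows
n_rows, n_cols = toy.shape              # (rows, columns)
missing_age = toy["Age"].isna().sum()   # number of empty Age cells

print(first_two)
print(n_rows, n_cols, missing_age)
```

Calling head() without an argument returns the first five rows, which is usually enough to check that the file was read the way you expected.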

The Survived column holds exactly our outcomes, which are known for this dataframe. For the task itself, we need to predict the Survived column for the test.csv data. That file stores information about other Titanic passengers whose outcome we, the ones solving the problem, do not know.

So, let's split our table into dependent and independent data. Everything is simple here. Dependent data is the data that depends on the independent data, that is, the outcome. Independent data is the data that influences the outcome.

For example, we have the following data set:

"Did Vova study computer science? No.
Vova got a 2 in computer science."

The grade in computer science depends on the answer to the question: did Vova study computer science? Clear? Let's move on, we are getting closer to the goal!

The conventional variable name for independent data is X. For dependent data it is y.

We do the following:

X = dataset.iloc[ : , 2 : ]
y = dataset.iloc[ : , 1 : 2 ]

What is this? With iloc[:, 2:] we tell Python: put into the variable X the data starting from column number 2 (inclusive, and counting from zero). In the second line we say that we want only the data from column number 1, which is Survived.

[a:b, c:d] is the general form of what we write in the brackets. If you leave a bound out, its default is used. For instance, we can write [:, :d] and get all columns of the dataframe except those from number d onward. The bounds a and b select rows, but we need all of them, so we leave that part at its default.
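The slicing rules above are easy to check on a toy frame. The sketch below uses invented column names (id, target, f1, f2) purely for illustration.

```python
import pandas as pd

# Toy frame with four columns; names and values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "target": [0, 1, 1],
    "f1": [10, 20, 30],
    "f2": [0.1, 0.2, 0.3],
})

features = toy.iloc[:, 2:]      # all rows, columns from position 2 to the end
target = toy.iloc[:, 1:2]       # all rows, column 1 only, still a DataFrame
first_rows = toy.iloc[0:2, :2]  # rows 0 and 1, columns 0 and 1

print(list(features.columns))   # ['f1', 'f2']
print(target.shape)             # (3, 1)
print(first_rows.shape)         # (2, 2)
```

Note that the slice 1:2 keeps the result two-dimensional, with shape (3, 1), whereas toy.iloc[:, 1] would give a one-dimensional Series.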

Let's see what we got:

X.head()


y.head()


To keep this small tutorial simple, we will drop the columns that need special handling or that do not affect survival at all. They contain data of type str.

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X = X.drop(count, axis=1)  # returns a new frame; avoids modifying a slice of dataset in place

Great! Let's move on to the next step.

Step three

Here we need to encode our data so that the machine can better understand how it affects the result. We will not encode everything, only the str data that we kept: the "Sex" column. How do we want to encode it? Let's represent a person's sex as a vector with one position per category. With the encoder used below, which sorts the categories alphabetically, 10 means female and 01 means male.

First, let's convert our tables into NumPy arrays:

X = np.array(X)
y = np.array(y)

And now for the encoding itself:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

The sklearn library is a really cool library that lets us do complete Data Science work. It contains a large number of interesting machine learning models and also lets us do data preparation.

OneHotEncoder lets us encode a person's sex in the representation we described. Two categories will be created: male and female. If the person is male, a 1 is written in the "male" column and a 0 in the "female" column, and the other way round for a woman.

After OneHotEncoder() there is [1]. This means that we want to encode column number 1 (counting from zero).
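To see what the encoder actually produces, here is a minimal sketch on three invented rows of the form [Pclass, Sex, Age]. The names toy, ct_toy and encoded are made up for the example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows: [Pclass, Sex, Age]; the values are invented.
toy = np.array([
    [3, "male", 22.0],
    [1, "female", 38.0],
    [2, "male", 30.0],
], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [1])],  # encode column 1 only
    remainder="passthrough",                           # keep the other columns
)
encoded = np.asarray(ct_toy.fit_transform(toy))

# Categories are sorted, so the first new column is "female", the second "male".
categories = list(ct_toy.named_transformers_["encoder"].categories_[0])
print(categories)
print(encoded)
```

The encoded columns are placed first and the untouched columns follow, so the one text column turns into two numeric ones at positions 0 and 1.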

Super. Let's keep going!

As a rule, some of the data turns out to be left empty (that is, NaN, not a number). For example, there is information about a person, such as their name and sex, but nothing about their age. In this case we will use the following method: we compute the arithmetic mean of each column, and if a value is missing in a column, we fill the gap with that column's mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)
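Here is a small sketch of what mean imputation does, on an invented two-column array (think of it as age and fare). The names toy, imputer_toy and filled are assumptions for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented values; one age is missing.
toy = np.array([
    [22.0, 7.25],
    [np.nan, 71.0],
    [30.0, 8.0],
])

imputer_toy = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer_toy.fit_transform(toy)

print(imputer_toy.statistics_)  # the per-column means learned by fit
print(filled)                   # the NaN is replaced by its column's mean
```

The mean is computed per column and ignores the missing cells, so the gap in the first column is filled with the average of 22 and 30.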

Now let's take into account that the data can differ a lot in scale. Some values lie in the interval [0, 1], while others run into the hundreds and thousands. To remove this spread and make the computer's calculations more accurate, we will scale the data. After scaling, every column has mean 0 and standard deviation 1, so most values end up no larger than about three in absolute value. To do this, we will use StandardScaler.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X[:, 2:] = sc.fit_transform(X[:, 2:])
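To get a feeling for what the scaler does, here is a sketch on an invented column of fares with one large outlier. The numbers are made up; the point is only that the scaled column gets mean 0 and standard deviation 1, while a strong outlier can still land well above three.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 99 ordinary fares and one very expensive ticket (invented values).
fares = np.append(np.full(99, 10.0), 500.0).reshape(-1, 1)

scaler_toy = StandardScaler()
scaled = scaler_toy.fit_transform(fares)

print(scaled.mean())  # very close to 0
print(scaled.std())   # very close to 1
print(scaled.max())   # the outlier stays far from the rest
```

So "no larger than three" is a rule of thumb for typical values, not a hard limit.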

Now our data is fully numeric: two one-hot columns for sex, followed by the scaled Pclass, Age, SibSp, Parch and Fare values.

Great. We are already close to our goal!

Step four

Let's train our first model! The sklearn library offers a huge number of interesting things. I applied the Gradient Boosting Classifier model to this problem. We use a classifier because our task is a classification task: each prediction must be assigned either 1 (survived) or 0 (did not survive).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150)
gbc.fit(X, y.ravel())  # ravel() flattens y from shape (n, 1) to (n,), which fit expects

The fit function tells Python: let the model look for dependencies between X and y.

Less than a second, and the model is ready.
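Before uploading anything, it can be useful to estimate locally how well such a model does by holding back part of the labeled data. The sketch below is not part of the original solution: it uses synthetic data from make_classification as a stand-in for the prepared Titanic arrays, and the split size and random_state values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared X and y (300 rows, 6 features).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Keep 25% of the labeled rows aside for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

model = GradientBoostingClassifier(
    learning_rate=0.5, max_depth=5, n_estimators=150, random_state=0
)
model.fit(X_tr, y_tr)

# Share of correct answers on rows the model has not seen.
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_accuracy:.2f}")
```

With the real data you would split X and y the same way; the number you get is only an estimate and may differ from the score on the site.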


How do we apply it? We will see right now!

Step five. Conclusion

Now we need to load the table with our test data, for which we have to make predictions. We will put this table through the same preparation steps that we used for X.

X_test = pd.read_csv('test.csv', index_col=0)

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X_test.drop(count, inplace=True, axis=1)

X_test = np.array(X_test)

# Reuse the transformers already fitted on the training data (ct, imputer, sc)
# and only call transform(), so the test set is encoded, filled and scaled
# exactly the way X was. Fitting new ones on test.csv would use different
# means and scales than the model was trained on.
X_test = np.array(ct.transform(X_test))
X_test = imputer.transform(X_test)
X_test[:, 2:] = sc.transform(X_test[:, 2:])
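One way to make sure the training and test tables always go through identical preparation is to bundle the steps into a scikit-learn Pipeline. This is an alternative sketch, not the approach used in this article; the toy tables, the column grouping and all variable names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy tables with the six columns kept in this tutorial.
train_toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 54.0, 27.0, 30.0],
    "SibSp": [1, 1, 0, 0, 0, 0],
    "Parch": [0, 0, 0, 0, 2, 0],
    "Fare": [7.25, 71.28, 7.92, 51.86, 11.13, 8.05],
})
y_toy = np.array([0, 1, 1, 0, 1, 0])
test_toy = pd.DataFrame({
    "Pclass": [3, 2], "Sex": ["male", "female"], "Age": [np.nan, 47.0],
    "SibSp": [0, 1], "Parch": [0, 0], "Fare": [np.nan, 7.0],
})

numeric = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
preprocess = ColumnTransformer(transformers=[
    ("sex", OneHotEncoder(), ["Sex"]),
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("model", GradientBoostingClassifier(random_state=0)),
])

pipe.fit(train_toy, y_toy)           # every step is fitted on training data only
toy_pred = pipe.predict(test_toy)    # test rows are only transformed, then predicted
print(toy_pred)
```

The design choice here is that fit happens once, on the training table, and predict reuses the learned categories, means and scales for any new table.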

Let's apply our model now!

gbc_predict = gbc.predict(X_test)

That's it. We have made a prediction. Now it needs to be written to a csv file and submitted on the site.

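np.savetxt('my_gbc_predict.csv', gbc_predict, fmt='%d', delimiter=",", header='Survived', comments='')  # fmt writes plain 0/1; comments='' keeps the header free of the default "# " prefix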
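The Titanic competition's sample submission has two columns, PassengerId and Survived, so the passenger ids need to end up in the file as well. Below is a sketch of one way to build such a table with pandas. The two ids and predictions are invented stand-ins; with the real data, the ids could be taken from the PassengerId column of test.csv and the predictions from gbc_predict.

```python
import io
import numpy as np
import pandas as pd

# Stand-ins for the real objects: two invented ids and predictions.
passenger_ids = pd.Series([892, 893], name="PassengerId")
predictions = np.array([0, 1])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

# Written to an in-memory buffer here; pass a file name instead to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Setting index=False keeps pandas from adding its own row numbers as an extra first column.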

Done. We now have a file with predictions for every passenger. All that remains is to upload these solutions to the site and get a score for the prediction. Such a primitive solution gives not only 74% correct answers on the public leaderboard, but also some momentum in Data Science. The most curious can write to me in private messages at any time and ask a question. Thanks, everyone!

Source: www.habr.com
