Your first step in Data Science. Titanic

A few words of introduction

I believe we could all do more if we were given step-by-step instructions telling us what to do and how to do it. I remember moments in my own life when I could not get something started simply because it was hard to understand where to begin. Perhaps you once saw the words "Data Science" somewhere on the Internet and decided it was far away from you, that the people who do it live somewhere out there, in another world. No, they're right here. And perhaps it is thanks to people from this field that this article landed in your feed. There are plenty of courses to help you master this craft, but here I'll help you take the first step.

Well, are you ready? I'll say right away that you will need to know Python 3, since that is what I use here. I also advise you to install Jupyter Notebook in advance or to look at how to use Google Colab.

Step One


Kaggle will be a big help here. In principle, you can do without it, but I will talk about that in another article. It is a platform that hosts Data Science competitions. In the early stages, every such competition gives you an enormous amount of experience in solving problems of various kinds, in development, and in teamwork, which matters a great deal these days.

We will take our task from there. It is called "Titanic", and the condition is as follows: predict, for each individual passenger, whether they survived. In general, the work of a person involved in DS consists of data collection, processing, model training, forecasting, and so on. On Kaggle we are allowed to skip the data-collection stage: the data is already provided on the platform. We just need to download it and we can get started!

You can do this as follows:

The Data tab of the competition page contains the files with the data.


We have downloaded the data, prepared our Jupyter notebook, and…

Step Two

How do we load this data now?

First, we import the necessary libraries:

import pandas as pd
import numpy as np

Pandas will allow us to load .csv files for further processing.

NumPy is needed to represent our data table as a matrix of numbers.
Let's move on. We take the train.csv file and load it:

dataset = pd.read_csv('train.csv')

We will refer to our train.csv dataset via the dataset variable. Let's see what's in there:

dataset.head()


The head() function lets us view the first few rows of the dataframe.

The Survived column holds our results, which are known in this dataframe. To answer the task question, we need to predict the Survived column for the data in test.csv. That data contains information about other Titanic passengers whose outcome we, while solving the problem, do not know.

So, let's split our table into dependent and independent data. It's simple here: the dependent data is the outcome, the thing we want to predict; the independent data is the data that influences that outcome.

For example, we have the following dataset:

"Vova studied computer science - no.
Vova received a 2 (a failing grade) in computer science."

The grade in computer science depends on the answer to the question: did Vova study computer science? Clear? Moving on, we are getting closer to the goal!

The traditional variable name for the independent data is X; for the dependent data, y.

We do the following:

X = dataset.iloc[:, 2:]
y = dataset.iloc[:, 1:2]

What is this? With iloc[:, 2:] we tell Python: put into the variable X the data starting from column number 2 (inclusive, counting from zero). In the second line we say that we want y to hold the data of column number 1.

[a:b, c:d] is the general form of what we put in the brackets. If you do not specify a bound, the default is used. That is, we can write [:, :d] and we will get all the columns in the dataframe except those from number d onward. The bounds a and b select rows, but we need all of them, so we leave them at the default.
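
To make the slicing concrete, here is a tiny sketch on a made-up dataframe (the toy variable below is just an illustration, not part of the Titanic data):

import pandas as pd

toy = pd.DataFrame({'PassengerId': [1, 2, 3],
                    'Survived': [0, 1, 1],
                    'Pclass': [3, 1, 2],
                    'Age': [22, 38, 26]})

toy.iloc[:, 2:]    # all rows, columns 2 and onward: Pclass and Age
toy.iloc[:, 1:2]   # all rows, only column 1: Survived (still a dataframe, not a series)
toy.iloc[0:2, :]   # only the first two rows, all columns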

Let's see what we got:

X.head()


y.head()


To keep this little lesson simple, we will remove the columns that require special handling or do not affect survival at all. They contain data of type str.

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X.drop(count, inplace=True, axis=1)

Super! Let's go to the next step.

Step Three

Here we need to encode our data so that the machine better understands how it affects the result. We will not encode everything, only the str data that we kept: the "Sex" column. How do we want to encode it? Let's represent a person's sex as a pair of indicator values: one column for male and one for female.

To begin with, let's convert our tables into NumPy arrays:

X = np.array(X)
y = np.array(y)

And now we look:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))

The sklearn library is a great library that lets us do the full range of Data Science work. It contains a large number of interesting machine learning models and also lets us do data preparation.

OneHotEncoder lets us encode a person's sex in exactly the representation we described. Two columns will be created: male and female. If the person is a man, a 1 is written in the "male" column and a 0 in the "female" column, and the other way around for a woman.

The [1] after OneHotEncoder() means that we want to encode column number 1 (counting from zero), which is the Sex column.
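
If you want to see the encoding in isolation, here is a small, self-contained sketch with made-up values (note that OneHotEncoder orders the resulting columns alphabetically, so "female" comes first and "male" second):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

sex = np.array([['male'], ['female'], ['male']])
enc = OneHotEncoder()
print(enc.fit_transform(sex).toarray())
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
# first column = 'female', second column = 'male'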

Super. We are moving even further!

It regularly happens that some data is left unfilled (that is, NaN - not a number). For example, we have information about a person: their name and sex, but no information about their age. In this case we apply the following method: for each column we compute the arithmetic mean, and wherever a value is missing in that column, we fill the gap with that column's mean.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)
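
As a quick illustration of what the imputer does, here is a toy column of ages with one gap (the numbers are made up):

import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[22.0], [np.nan], [30.0]])
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imp.fit_transform(ages))
# [[22.]
#  [26.]   <- the gap is filled with the column mean, (22 + 30) / 2 = 26
#  [30.]]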

Now let's take into account that the data can have a very wide spread. Some values lie in the interval [0, 1], while others run into the hundreds and thousands. To remove this spread and make the computations better behaved, we will scale the data so that most values end up within a few units of zero. To do this, we use StandardScaler.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X[:, 2:] = sc.fit_transform(X[:, 2:])
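
Under the hood, StandardScaler subtracts each column's mean and divides by its standard deviation, so the transformed column has zero mean and unit variance. A tiny sketch with made-up fares:

import numpy as np
from sklearn.preprocessing import StandardScaler

fares = np.array([[10.0], [20.0], [30.0]])
print(StandardScaler().fit_transform(fares))
# [[-1.22474487]
#  [ 0.        ]
#  [ 1.22474487]]
# each value becomes (x - mean) / std; here mean = 20 and std is about 8.165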

Now our data looks like this:

(screenshot: the encoded and scaled feature matrix)

Great. We are close to our goal!

Step Four

Let's train our first model! The sklearn library offers a huge number of interesting things. I applied the GradientBoostingClassifier model to this problem. A classifier is used because our task is a classification task: each prediction has to be assigned to class 1 (survived) or class 0 (did not survive).

from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate=0.5, max_depth=5, n_estimators=150)
gbc.fit(X, y.ravel())  # ravel() flattens y from an (n, 1) column into the 1D array the classifier expects

The fit function tells Python: let the model look for dependencies between X and y.

Less than a second and the model is ready.
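
If you are curious how well the model fits the data it was trained on, you can ask for its training accuracy (the exact number will vary from run to run, and a high score here does not guarantee a good score on the test data):

print(gbc.score(X, y.ravel()))   # share of training rows predicted correctly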


How to apply it? Now we'll see!

Step Five. Conclusion

Now we need to load the table with the test data for which we want to make a prediction. With this table we do all the same actions that we did for X.

X_test = pd.read_csv('test.csv', index_col=0)

count = ['Name', 'Ticket', 'Cabin', 'Embarked']
X_test.drop(count, inplace=True, axis=1)

X_test = np.array(X_test)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])],
                       remainder='passthrough')
X_test = np.array(ct.fit_transform(X_test))

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_test)
X_test = imputer.transform(X_test)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_test[:, 2:] = sc.fit_transform(X_test[:, 2:])

Let's apply our model!

gbc_predict = gbc.predict(X_test)

That's it. We have made a forecast. Now it needs to be written to a CSV file and sent to the site.

np.savetxt('my_gbc_predict.csv', gbc_predict, fmt='%d', delimiter=",", header='Survived', comments='')  # fmt='%d' writes integer labels; comments='' keeps '#' off the header
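
Keep in mind that the Kaggle submission format for this competition expects two columns: PassengerId and Survived. A minimal sketch of how such a file could be assembled with pandas, assuming test.csv is still in the working directory:

import pandas as pd

passenger_ids = pd.read_csv('test.csv')['PassengerId']
submission = pd.DataFrame({'PassengerId': passenger_ids,
                           'Survived': gbc_predict.astype(int)})
submission.to_csv('my_gbc_predict.csv', index=False)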

Done. We received a file containing a prediction for each passenger. All that remains is to upload the solution to the site and get the forecast scored. Even such a basic solution gives 74% correct answers on the public leaderboard, and also a nice boost in Data Science. The most curious can write to me in private messages at any time and ask a question. Thanks to all!

Source: habr.com
