Chewing on logistic regression

In this article, we will walk through the theory behind transforming the linear regression function into the inverse logit transformation (in other words, the logistic response function). Then, using the arsenal of the maximum likelihood method, in accordance with the logistic regression model, we will derive the loss function Logistic Loss, or, in other words, we will define the function with which the parameters of the weight vector $w$ are selected in the logistic regression model.

Outline of the article:

  1. Let's recap the straight-line relationship between two variables
  2. Reveal the need to transform the linear regression function $f(w,x_i)$ into the logistic response function $p_{i+}$
  3. Carry out the transformations and derive the logistic response function
  4. Try to understand why the least squares method is a poor choice for selecting the parameters $w$, and what the Logistic Loss function offers instead
  5. Use the maximum likelihood method to derive the function for selecting the weights $w$:

    5.1. Case 1: the Logistic Loss function for objects with class labels 0 and 1:

    $L_{log}(X,y,w) = -\sum\limits_{i=1}^{n}\bigl(y_i \ln p_{i+} + (1-y_i)\ln(1-p_{i+})\bigr) \rightarrow \min$

    5.2. Case 2: the Logistic Loss function for objects with class labels -1 and +1:

    $L_{log}(X,y,w) = \sum\limits_{i=1}^{n}\ln\bigl(1+e^{-y_iw^Tx_i}\bigr) \rightarrow \min$


The article is replete with simple examples in which all calculations are easy to do in your head or on paper; in some cases a calculator may be required. So get ready 🙂

This article is intended primarily for data scientists with a beginner-level knowledge of the basics of machine learning.

The article also provides the code for drawing the graphs and performing the calculations. All code is written in python-2.7. Let me explain up front about the "freshness" of the version used: it is one of the requirements of the well-known Yandex course on the equally well-known online education platform Coursera, and, as you can guess, the material is based on that course.

01. Straight Line Dependency

It is quite reasonable to ask the question: what do a straight-line dependence and logistic regression have to do with each other?

Everything is simple! Logistic regression is one of the models that belong to the linear classifiers. In simple words, the task of a linear classifier is to predict the target values $y$ from the variables (regressors) $x$. It is assumed that the relationship between the features $x$ and the target values $y$ is linear. Hence the name of the classifier: linear. Very roughly speaking, the logistic regression model is based on the assumption that there is a linear relationship between the features $x$ and the target values $y$. There it is, the connection.

So, here is the first example, and it is, as you might guess, about the straight-line dependence of the quantities under study. In the process of preparing the article, I came across an example that has long since set many people's teeth on edge: the dependence of current on voltage ("Applied Regression Analysis", N. Draper, G. Smith). We will consider it here as well.

In accordance with Ohm's law:

$I = \frac{U}{R}$, where $I$ is the current, $U$ is the voltage, $R$ is the resistance.

If we did not know Ohm's law, we could find the dependence empirically by varying $U$ and measuring $I$ while keeping $R$ fixed. We would then see that the graph of $I$ as a function of $U$ gives a more or less straight line through the origin. We say "more or less" because, although the relationship is in fact exact, our measurements may contain small errors, so the points on the graph may not fall exactly on the line, but will be randomly scattered around it.

Graph 1 "Dependence Chewing on logistic regression from Chewing on logistic regressionΒ»


Chart drawing code

import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

import random

R = 13.75

x_line = np.arange(0,220,1)
y_line = []
for i in x_line:
    y_line.append(i/R)
    
y_dot = []
for i in y_line:
    y_dot.append(i+random.uniform(-0.9,0.9))


fig, axes = plt.subplots(figsize = (14,6), dpi = 80)
plt.plot(x_line,y_line,color = 'purple',lw = 3, label = 'I = U/R')
plt.scatter(x_line,y_dot,color = 'red', label = 'Actual results')
plt.xlabel('U', size = 16)   # voltage is on the x-axis
plt.ylabel('I', size = 16)   # current is on the y-axis
plt.legend(prop = {'size': 14})
plt.show()

02. The need for transformations of the linear regression equation

Let's consider another example. Imagine that we work in a bank and we are faced with the task of determining the probability of repayment of a loan by a borrower, depending on some factors. To simplify the task, we will consider only two factors: the monthly salary of the borrower and the monthly payment to repay the loan.

The task is quite artificial, but this example lets us understand why the linear regression function alone is not enough, and what transformations of that function we need to carry out.

Let's go back to the example. It is understood that the higher the salary, the more the borrower will be able to pay each month to repay the loan. At the same time, for a certain salary range, this dependence will be fairly linear. For example, let's take the salary range from 60.000R to 200.000R and assume that, within this range, the dependence of the monthly payment on the salary is linear. Suppose that, for the specified salary range, it was found that the ratio of salary to payment cannot fall below 3, and that the borrower must still have 5.000R left in reserve. Only in this case will we assume that the borrower will repay the loan to the bank. Then the linear regression equation takes the form:

$f(w,x_i) = w_0 + w_1x_{i1} + w_2x_{i2}$

where $w_0 = -5000$, $w_1 = 1$, $w_2 = -3$, $x_{i1}$ is the salary of the $i$-th borrower, $x_{i2}$ is the loan payment of the $i$-th borrower.

Substituting a borrower's salary and loan payment, together with the fixed parameters $w$, into the equation $f(w,x_i)$, we can decide whether to grant or deny the loan.

Looking ahead, we note that, with the given parameters $w$, the linear regression function, when used inside the logistic response function, will produce large values that complicate the computation of the loan repayment probabilities. Therefore, it is proposed to scale our coefficients down, say, by a factor of 25.000. This rescaling of the coefficients will not change the decision to issue a loan. Let's remember this point for the future; and now, to make it even clearer what we are talking about, let's consider the situation with three potential borrowers.
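As a quick sanity check of that claim, here is a minimal sketch (the borrower figures are the ones from Table 1 below; the helper f and the variable names are mine) showing that dividing all coefficients by the same positive constant changes the magnitude of $f(w,x_i)$ but never its sign, and therefore never the decision:

def f(w_0, w_1, w_2, salary, payment):
    return w_0 + w_1 * salary + w_2 * payment

# Dividing every coefficient by the same positive constant (here 25 000)
# rescales f(w,x) but cannot flip its sign, so the approve/refuse decision is unchanged.
borrowers = [(120000, 3000), (180000, 50000), (210000, 70000)]
r = 25000.0

for salary, payment in borrowers:
    raw = f(-5000.0, 1.0, -3.0, salary, payment)
    scaled = f(-5000.0 / r, 1.0 / r, -3.0 / r, salary, payment)
    print('raw = %10.1f  scaled = %6.2f  same decision: %s'
          % (raw, scaled, (raw > 0) == (scaled > 0)))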

Table 1 Potential Borrowers

The borrower | Salary   | Payment | f(w,x) | Decision
Vasya        | 120.000R | 3.000R  | 4.24   | Approved
Fedya        | 180.000R | 50.000R | 1.00   | Approved
Lesha        | 210.000R | 70.000R | -0.20  | Refusal

Code for generating a table

import pandas as pd

r = 25000.0
w_0 = -5000.0/r
w_1 = 1.0/r
w_2 = -3.0/r

data = {'The borrower':np.array(['Vasya', 'Fedya', 'Lesha']), 
        'Salary':np.array([120000,180000,210000]),
       'Payment':np.array([3000,50000,70000])}

df = pd.DataFrame(data)

df['f(w,x)'] = w_0 + df['Salary']*w_1 + df['Payment']*w_2

decision = []
for i in df['f(w,x)']:
    if i > 0:
        dec = 'Approved'
        decision.append(dec)
    else:
        dec = 'Refusal'
        decision.append(dec)
        
df['Decision'] = decision

df[['The borrower', 'Salary', 'Payment', 'f(w,x)', 'Decision']]

In accordance with the data in the table, Vasya, with a salary of 120.000R, wants a loan that he would repay at 3.000R per month. We determined that, for the loan to be approved, Vasya's salary must exceed three times the payment, with another 5.000R left over. Vasya satisfies this requirement: $120.000 - 3 \cdot 3.000 - 5.000 = 106.000 > 0$; he even has 106.000R to spare. Although when calculating $f(w,x_i)$ we reduced the coefficients $w$ by a factor of 25.000, the result is the same: the loan can be approved. Fedya will also receive a loan, but Lesha, despite the fact that he earns the most, will have to moderate his appetites.

Let's draw a graph for this case.

Chart 2 "Classification of Borrowers"


Chart drawing code

salary = np.arange(60000,240000,20000)
payment = (-w_0-w_1*salary)/w_2


fig, axes = plt.subplots(figsize = (14,6), dpi = 80)
plt.plot(salary, payment, color = 'grey', lw = 2, label = '$f(w,x_i)=w_0 + w_1x_{i1} + w_2x_{i2}$')
plt.plot(df[df['Decision'] == 'Approved']['Salary'], df[df['Decision'] == 'Approved']['Payment'], 
         'o', color ='green', markersize = 12, label = 'Decision - Loan approved')
plt.plot(df[df['Decision'] == 'Refusal']['Salary'], df[df['Decision'] == 'Refusal']['Payment'], 
         's', color = 'red', markersize = 12, label = 'Decision - Loan refusal')
plt.xlabel('Salary', size = 16)
plt.ylabel('Payment', size = 16)
plt.legend(prop = {'size': 14})
plt.show()

So, our straight line, constructed in accordance with the function $f(w,x_i) = w_0 + w_1x_{i1} + w_2x_{i2}$, separates the "bad" borrowers from the "good" ones. The borrowers whose desires do not match their capabilities are above the line (Lesha); those who, according to the parameters of our model, are able to repay the loan are below the line (Vasya and Fedya). In other words, our line divides the borrowers into two classes. We denote them as follows: to class $+1$ we assign those borrowers who are most likely to repay the loan; to class $0$, or $-1$, we assign those borrowers who most likely will not be able to repay the loan.

Let's summarize the conclusions from this simple example. Take a point $(x_1, x_2)$ and, substituting its coordinates into the corresponding equation of the straight line $f(w,x_i) = w_0 + w_1x_{i1} + w_2x_{i2}$, consider three options (a small code sketch follows this list):

  1. If the point is below the line and we assign it to class $+1$, then the value of the function $f(w,x_i)$ will be positive, anywhere from $0$ to $+\infty$. So we can assume that the probability of repaying the loan lies within $(0.5, 1]$. The larger the value of the function, the higher the probability.
  2. If the point is above the line and we assign it to class $0$ or $-1$, then the value of the function will be negative, anywhere from $0$ to $-\infty$. Then we will assume that the probability of debt repayment lies within $[0, 0.5)$, and the greater the absolute value of the function, the higher our confidence.
  3. The point lies on the line, on the boundary between the two classes. In this case, the value of the function $f(w,x_i)$ will be equal to $0$, and the probability of repaying the loan is $0.5$.
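Here is a minimal sketch of those three cases, using the scaled coefficients from the borrower example above (the third point is one I picked so that it lies almost exactly on the separating line):

# The sign of f(w,x) gives the class, and its magnitude - the confidence.
w_0, w_1, w_2 = -5000.0 / 25000, 1.0 / 25000, -3.0 / 25000

def f(salary, payment):
    return w_0 + w_1 * salary + w_2 * payment

points = [(120000, 3000),          # well below the line
          (210000, 70000),         # above the line
          (120000, 115000.0 / 3)]  # (almost) on the line

for salary, payment in points:
    value = f(salary, payment)
    if abs(value) < 1e-6:
        label = 'boundary, probability 0.5'
    elif value > 0:
        label = 'class +1'
    else:
        label = 'class 0 (or -1)'
    print('f = %8.5f  ->  %s' % (value, label))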

Now imagine that we have not two factors but dozens, and not three borrowers but thousands. Then, instead of a straight line, we will have an $m$-dimensional plane, and the coefficients $w$ will not be taken from the ceiling but derived according to all the rules, on the basis of accumulated data about borrowers who have or have not repaid their loans. Indeed, note that we are now selecting borrowers with already known coefficients $w$. In fact, the task of the logistic regression model is precisely to determine the parameters $w$ for which the value of the loss function Logistic Loss tends to a minimum. But how the vector $w$ is calculated, we will find out in section 5 of the article. In the meantime, we return to the promised land, to our banker and his three clients.

Thanks to the function $f(w,x_i)$ we know who can be given a loan and who should be refused. But you cannot go to the director with such information, because they wanted to get from us the probability of loan repayment for each borrower. What to do? The answer is simple: we need to somehow transform the function $f(w,x_i)$, whose values lie in the range $(-\infty, +\infty)$, into a function whose values lie in the range $[0, 1]$. And such a function exists; it is called the logistic response function, or inverse logit transformation. Meet it:

$p_{i+} = \frac{1}{1+e^{-f(w,x_i)}} = \frac{1}{1+e^{-w^Tx_i}}$

Let's see, step by step, how the logistic response function is obtained. Note that we will walk in the opposite direction, i.e. we will assume that we know a probability value, which lies between $0$ and $1$, and then we will "unwind" this value over the entire range of numbers from $-\infty$ to $+\infty$.

03. Derive the logistic response function

Step 1. Convert the probability values to the range $[0, +\infty)$

While transforming the function $f(w,x_i)$ into the logistic response function $p_{i+}$, we will leave our credit analyst alone and instead pay a visit to the bookmakers. No, of course, we will not place bets; all that interests us there is the meaning of an expression such as "odds of 4 to 1". The odds, familiar to all bettors, are the ratio of "successes" to "failures". In terms of probabilities, the odds are the probability of an event occurring divided by the probability that the event will not occur. Let's write down the formula for the odds of an event occurring, $odds$:

$odds = \frac{p_+}{1-p_+}$

where $p_+$ is the probability of the event occurring and $(1-p_+)$ is the probability of the event NOT occurring

For example, if the probability that a young, strong and frisky horse nicknamed "Veterok" will beat an old and flabby mare named "Matilda" at the races is equal to $0.8$, then the odds of success for Veterok will be $4$ to $1$: $\frac{0.8}{1-0.8} = 4$. Conversely, knowing the odds, it will not be difficult for us to calculate the probability $p_+$:

$p_+ = \frac{4}{4+1} = 0.8$

Thus, we have learned to "translate" probability into odds, which take values from $0$ to $+\infty$. Let's take one more step and learn to "translate" the probability onto the entire number line, from $-\infty$ to $+\infty$.
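A tiny sketch of this first step, probability to odds and back, using the numbers from the horse-racing example (the helper names are mine):

# Step 1 in code: probability -> odds and back again.
def proba_to_odds(p):
    return p / (1.0 - p)

def odds_to_proba(odds):
    return odds / (1.0 + odds)

p = 0.8
odds = proba_to_odds(p)
print('odds  = %.1f to 1' % odds)            # 4.0 to 1
print('proba = %.1f' % odds_to_proba(odds))  # back to 0.8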

Step 2. Convert the probability values to the range $(-\infty, +\infty)$

This step is very simple: we take the logarithm of the odds to the base of Euler's number $e$ and get:

$f(w,x_i) = \ln(odds) = \ln\frac{p_+}{1-p_+}$

Now we know that if $p_+ = 0.8$, then calculating the value of $\ln(odds)$ will be very simple and, moreover, it must be positive: $\ln\frac{0.8}{1-0.8} = \ln(4) \approx 1.386 > 0$. And so it is.

For the sake of curiosity, let's check that if $p_+ = 0.2$, then we expect to see a negative value of $\ln(odds)$. We check: $\ln\frac{0.2}{1-0.2} = \ln(0.25) \approx -1.386 < 0$. That's right.

Now we know how to translate a probability value from $[0, 1]$ onto the entire number line, from $-\infty$ to $+\infty$. In the next step, we will do the opposite.

In the meantime, we note that, in accordance with the rules of logarithms, knowing the value of the function $f(w,x_i)$, we can calculate the odds:

$odds = e^{f(w,x_i)}$

This method of determining the odds will be useful to us in the next step.

Step 3. We derive a formula for determining $p_+$

So, we have learned, knowing $p_+$, to find the value of the function $f(w,x_i)$. However, in fact, we need exactly the opposite: knowing the value of $f(w,x_i)$, to find $p_+$. To do this, let's turn to such a concept as the inverse odds function, according to which:

$p_+ = \frac{odds}{1+odds}$

In this article we will not derive the above formula, but we will check it on the numbers from the example above. We know that with odds of 4 to 1 ($odds = 4$), the probability of the event occurring is 0.8 ($p_+ = 0.8$). Let's make the substitution: $p_+ = \frac{4}{1+4} = 0.8$. This is consistent with our earlier calculations. We move on.

In the last step, we deduced that $odds = e^{f(w,x_i)}$, which means we can make a substitution in the inverse odds function. We get:

$p_+ = \frac{e^{f(w,x_i)}}{1+e^{f(w,x_i)}}$

Divide both the numerator and the denominator by $e^{f(w,x_i)}$, then:

$p_+ = \frac{1}{1+e^{-f(w,x_i)}} = \frac{1}{1+e^{-w^Tx_i}}$

Just in case, to make sure that we didn't make a mistake anywhere, let's do one more small check. In step 2, starting from $p_+ = 0.8$, we determined that $f(w,x_i) = \ln(4) \approx 1.386$. Then, substituting this value into the logistic response function, we expect to get $p_+ = 0.8$. Substitute and get: $p_+ = \frac{1}{1+e^{-1.386}} \approx 0.8$
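The same round-trip check in code, from a probability to the log-odds and back through the just-derived logistic response function (a minimal sketch):

import math

# p -> odds -> f = ln(odds) -> back to p through 1 / (1 + e^(-f))
p = 0.8
odds = p / (1 - p)                   # 4.0
f = math.log(odds)                   # ~ 1.386
p_back = 1 / (1 + math.exp(-f))
print('f = %.3f, recovered p = %.3f' % (f, p_back))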

Congratulations, dear reader, we have just derived and tested the logistic response function. Let's look at the graph of the function.

Graph 3 "Logistic response function"


Chart drawing code

import math

def logit (f):
    return 1/(1+math.exp(-f))

f = np.arange(-7,7,0.05)
p = []

for i in f:
    p.append(logit(i))

fig, axes = plt.subplots(figsize = (14,6), dpi = 80)
plt.plot(f, p, color = 'grey', label = '$ 1 / (1+e^{-w^Tx_i})$')
plt.xlabel('$f(w,x_i) = w^Tx_i$', size = 16)
plt.ylabel('$p_{i+}$', size = 16)
plt.legend(prop = {'size': 14})
plt.show()

In the literature you can also find this function under the name sigmoid function. The graph clearly shows that the main change in the probability of an object belonging to a class occurs over a relatively small range of $f(w,x_i)$, somewhere from $-4$ to $+4$.
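A quick numerical illustration of that remark (the sigmoid helper below simply repeats the logit function from the chart code):

import math

def sigmoid(f):
    return 1 / (1 + math.exp(-f))

# Outside roughly [-4, 4] the probability is already very close to 0 or 1.
for f_val in [-7, -4, -1, 0, 1, 4, 7]:
    print('f = %2d  ->  p = %.3f' % (f_val, sigmoid(f_val)))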

I propose to return to our credit analyst and help him calculate the probability of loan repayment, otherwise he risks being left without a bonus 🙂

Table 2 Potential Borrowers

The borrower | Salary   | Payment | f(w,x) | Decision | Probability
Vasya        | 120.000R | 3.000R  | 4.24   | Approved | 0.99
Fedya        | 180.000R | 50.000R | 1.00   | Approved | 0.73
Lesha        | 210.000R | 70.000R | -0.20  | Refusal  | 0.45

Code for generating a table

proba = []
for i in df['f(w,x)']:
    proba.append(round(logit(i),2))
    
df['Probability'] = proba

df[['The borrower', 'Salary', 'Payment', 'f(w,x)', 'Decision', 'Probability']]

So, we have determined the probability of repayment of the loan. In general, this seems to be true.

Indeed, the probability that Vasya, with a salary of 120.000R, will be able to pay 3.000R to the bank every month is close to 100%. By the way, we must understand that the bank may also issue a loan to Lesha if its policy provides, for example, for lending to clients with a probability of loan repayment above, say, 0.3. It is just that in this case the bank will have to set aside a larger reserve for possible losses.

It should also be noted that the ratio of salary to payment of at least 3, with a margin of 5.000R, was taken from the ceiling. Therefore, we could not use the weight vector $w$ in its original form; we had to greatly reduce the coefficients, and in this case we divided each coefficient by 25.000, that is, in effect, we adjusted the result. But this was done on purpose, to simplify the understanding of the material at the initial stage. In real life, we will not need to invent and adjust the coefficients, but to find them. In the following sections of the article, we will derive the equations with which the parameters $w$ are selected.

04. The method of least squares for determining the vector of weights $w$ in the logistic response function

We already know such a method for selecting the weight vector $w$ as the least squares method (OLS), so why not use it in binary classification problems as well? Indeed, nothing prevents us from using OLS, but in classification problems this method gives results that are less accurate than Logistic Loss. There is a theoretical justification for this. Let's first look at one simple example.

Suppose that our models (one using MSE and one using Logistic Loss) have already begun selecting the weight vector $w$ and that we stopped the computation at some step. It does not matter whether it is in the middle, at the end or at the beginning; the main thing is that we already have some values of the weight vector, and let's say that at this step the weight vectors $w$ of the two models do not differ. Then we take the obtained weights and substitute them into the logistic response function ($p_{i+} = \frac{1}{1+e^{-w^Tx_i}}$) for some object that belongs to class $+1$. Let us examine two cases: when, in accordance with the selected weight vector, our model is badly mistaken, and, vice versa, when the model is strongly confident that the object belongs to class $+1$. Let's see what penalties will be "issued" when using MSE and Logistic Loss.

Code for calculating penalties depending on the loss function used

# the object's class
y = 1
# the probability of assigning the object to the class according to the parameters w
proba_1 = 0.01

MSE_1 = (y - proba_1)**2
print 'MSE penalty for a gross error =', MSE_1

# a function that computes f(w,x) from a known probability of assigning the object to class +1 (f(w,x)=ln(odds+))
def f_w_x(proba):
    return math.log(proba/(1-proba))

LogLoss_1 = math.log(1+math.exp(-y*f_w_x(proba_1)))
print 'Log Loss penalty for a gross error =', LogLoss_1

proba_2 = 0.99

MSE_2 = (y - proba_2)**2
LogLoss_2 = math.log(1+math.exp(-y*f_w_x(proba_2)))

print '**************************************************************'
print 'MSE penalty with strong confidence =', MSE_2
print 'Log Loss penalty with strong confidence =', LogLoss_2

The gross error case: the model assigns the object to class $+1$ with a probability of 0.01

The penalty when using MSE will be:
$(1 - 0.01)^2 = 0.9801$

The penalty when using Logistic Loss will be:
$\ln\bigl(1 + e^{-\ln\frac{0.01}{0.99}}\bigr) = \ln(1 + 99) \approx 4.6$

The strong confidence case: the model assigns the object to class $+1$ with a probability of 0.99

The penalty when using MSE will be:
$(1 - 0.99)^2 = 0.0001$

The penalty when using Logistic Loss will be:
$\ln\bigl(1 + e^{-\ln\frac{0.99}{0.01}}\bigr) = \ln\bigl(1 + \frac{1}{99}\bigr) \approx 0.01$

This example illustrates well that in the case of a gross error the Logistic Loss function penalizes the model much more heavily than MSE. Let's now understand the theoretical background for using the Logistic Loss function in classification problems.

05. Maximum likelihood method and logistic regression

As promised at the beginning, the article is replete with simple examples. So here is another example, with our old acquaintances, the bank's borrowers: Vasya, Fedya and Lesha.

Just to be on the safe side, before developing the example, let me remind you that in real life we deal with training samples of thousands or millions of objects with tens or hundreds of features. Here, however, the numbers are chosen so that they easily fit in the head of a novice data scientist.

Let's go back to the example. Imagine that the director of the bank decided to issue a loan to everyone in need, despite the fact that the algorithm suggested not issuing one to Lesha. And now enough time has passed, and we know which of the three heroes repaid the loan and which did not. As expected: Vasya and Fedya repaid the loan, but Lesha did not. Now imagine that this result will be a new training sample for us and, at the same time, that we seem to have lost all the data on the factors affecting the probability of repaying the loan (the borrower's salary, the monthly payment). Then, intuitively, we can assume that every third borrower does not repay the loan to the bank, or, in other words, that the probability of repayment of the loan by the next borrower is $p = \frac{2}{3}$. This intuitive assumption has a theoretical confirmation and is based on the maximum likelihood method, often referred to in the literature as the maximum likelihood principle.

First, let's get acquainted with the conceptual apparatus.

The sampling likelihood is the probability of obtaining exactly such a sample, exactly such observations/results, i.e. the product of the probabilities of obtaining each of the results of the sample (for example, that Vasya's and Fedya's loans are repaid and Lesha's loan is not repaid, all at the same time).

The likelihood function relates the likelihood of the sample to the values of the distribution parameters.

In our case, the training sample is a generalized Bernoulli scheme, in which the random variable takes only two values: $1$ or $0$. Therefore, the likelihood of the sample can be written as a likelihood function of the parameter $p$ in the following way:

$P(y_1 = 1, y_2 = 1, y_3 = 0; p) = p \cdot p \cdot (1 - p) = p^2(1 - p)$

The above expression can be interpreted as follows. The joint probability that Vasya and Fedya repay the loan is $p \cdot p$, the probability that Lesha does NOT repay the loan is $(1 - p)$ (since his outcome was NOT repayment of the loan); therefore, the joint probability of all three events is $p^2(1 - p)$.

The maximum likelihood method is a method for estimating an unknown parameter by maximizing the likelihood function. In our case, we need to find the value of $p$ at which $P = p^2(1-p)$ reaches its maximum.

Where does the idea actually come from - to look for the value of the unknown parameter at which the likelihood function reaches its maximum? The origins of the idea stem from the notion that the sample is the only source of knowledge available to us about the population. Everything we know about the population is represented in the sample. Therefore, all we can say is that the sample is the most accurate reflection of the population available to us. Therefore, we need to find such a parameter at which the existing sample becomes the most probable.

Obviously, we are dealing with an optimization problem in which we need to find the extremum point of a function. To find the extremum point, we consider the first-order condition, that is, we equate the derivative of the function to zero and solve the equation with respect to the desired parameter. However, finding the derivative of a product of a large number of factors can be a protracted affair; to avoid this, there is a special trick: switching to the logarithm of the likelihood function. Why is such a transition possible? Note that we are not looking for the extremum of the function $P$ itself, but for the extremum point, that is, the value of the unknown parameter $p$ at which $P$ reaches its maximum. When passing to the logarithm, the extremum point does not change (although the extremum value itself will differ), since the logarithm is a monotonic function.

In accordance with the above, let's continue developing our example with the loans of Vasya, Fedya and Lesha. To start, let's take the logarithm of the likelihood function:

$\ln P = 2\ln p + \ln(1 - p)$

Now we can easily differentiate this expression with respect to $p$:

$\frac{\partial \ln P}{\partial p} = \frac{2}{p} - \frac{1}{1 - p}$

And finally, consider the first-order condition - we equate the derivative of the function to zero:

$\frac{2}{p} - \frac{1}{1 - p} = 0 \;\Rightarrow\; 2(1 - p) = p \;\Rightarrow\; p = \frac{2}{3}$

Thus, our intuitive estimate of the probability of loan repayment, $p = \frac{2}{3}$, has received a theoretical justification.
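If you prefer to double-check numerically rather than by differentiation, here is a minimal brute-force sketch confirming the same answer:

import numpy as np

# Grid search over p in (0, 1): the likelihood p^2 * (1 - p) peaks at p = 2/3.
ps = np.arange(0.001, 1.0, 0.001)
L = ps ** 2 * (1 - ps)
print('argmax p ~ %.3f' % ps[np.argmax(L)])     # ~ 0.667 = 2/3
print('max likelihood ~ %.4f' % L.max())        # ~ 0.1481 = 4/27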

Great, but what do we do with this information now? If we assume that every third borrower does not return the money to the bank, then the bank will inevitably go bankrupt. That would be so, but only because, when assessing the probability of loan repayment as $\frac{2}{3}$, we did not take into account the factors affecting repayment: the borrower's salary and the size of the monthly payment. Recall that earlier we calculated the probability of loan repayment for each client taking these very factors into account. It is logical that the probabilities we obtained differ from the constant value of $\frac{2}{3}$.

Let's determine the likelihood of the samples:

Code for Calculating Likelihoods of Samples

from functools import reduce

def likelihood(y,p):
    line_true_proba = []
    for i in range(len(y)):
        ltp_i = p[i]**y[i]*(1-p[i])**(1-y[i])
        line_true_proba.append(ltp_i)
    # the likelihood of the sample is the product of the individual probabilities
    return reduce(lambda a, b: a*b, line_true_proba)


y = [1.0,1.0,0.0]
p_log_response = df['Probability']
const = 2.0/3.0
p_const = [const, const, const]


print 'Sample likelihood with the constant value p=2/3:', round(likelihood(y,p_const),3)

print '****************************************************************************************************'

print 'Sample likelihood with the computed values of p:', round(likelihood(y,p_log_response),3)

Sample likelihood at the constant value $p = \frac{2}{3}$:

$\frac{2}{3} \cdot \frac{2}{3} \cdot \bigl(1 - \frac{2}{3}\bigr) = \frac{4}{27} \approx 0.148$

Sample likelihood when the probability of loan repayment is calculated taking the factors into account:

$0.99 \cdot 0.73 \cdot (1 - 0.45) \approx 0.397$

The likelihood of the sample with probabilities calculated from the factors turned out to be higher than the likelihood with the constant probability value. What does this tell us? It tells us that knowledge of the factors made it possible to select the probability of loan repayment for each client more accurately. Therefore, when issuing the next loan, it would be more correct to use the model proposed at the end of section 3 of the article for assessing the probability of debt repayment.

But then, if we want to maximize the sample likelihood function, why not use some algorithm that produces probabilities for Vasya, Fedya and Lesha of, say, 0.99, 0.99 and 0.01 respectively? Perhaps such an algorithm would perform well on the training sample, since it would bring the value of the sample likelihood closer to $1$, but, firstly, such an algorithm would most likely have difficulties with generalization, and secondly, it would definitely not be linear. And since methods of dealing with overfitting (equally weak generalization ability) are clearly not part of the plan of this article, let's go over the second point in more detail. To do this, it is enough to answer a simple question: can the probability of repaying the loan be the same for Vasya and Fedya, given the factors known to us? From the point of view of sound logic, of course not. Vasya will give 2.5% of his salary per month to repay the loan, while Fedya will give almost 27.8%. Also, on Chart 2 "Classification of Borrowers" we see that Vasya is much farther from the line separating the classes than Fedya. Finally, we know that the function $f(w,x_i)$ takes different values for Vasya and Fedya: 4.24 for Vasya and 1.0 for Fedya. Now, if Fedya, for example, earned an order of magnitude more or asked for a smaller loan, the probabilities of repayment for Vasya and Fedya would be similar. In other words, you cannot fool a linear relationship. And if we had really calculated the coefficients $w$, rather than taking them from the ceiling, we could safely say that our values of $f(w,x_i)$ allow us to best estimate the probability of loan repayment by each borrower; but since we agreed to assume that the coefficients $w$ were determined in accordance with all the rules, we will assume exactly that: our coefficients allow us to give a better estimate of the probability 🙂

However, we digress. In this section we need to understand how the weight vector $w$, which is needed to assess the probability of loan repayment for each borrower, is determined.

Let's briefly summarize the arsenal we use in the search for the coefficients $w$:

1. We assume that the relationship between the target variable (the predicted value) and the factors influencing the result is linear. For this reason, a linear regression function of the form $f(w,x_i) = w_0 + w_1x_{i1} + ... + w_mx_{im}$ is used, whose line divides the objects (clients) into the classes $+1$ and $0$ or $-1$ (clients able to repay the loan and clients not able to). In our case, the equation has the form $f(w,x_i) = w_0 + w_1x_{i1} + w_2x_{i2}$.

2. We use the inverse logit function of the form $p_{i+} = \frac{1}{1+e^{-w^Tx_i}}$ to determine the probability that an object belongs to class $+1$.

3. We treat our training sample as a realization of a generalized Bernoulli scheme, that is, for each object a random variable is generated that takes the value 1 with probability $p_{i+}$ (its own for each object) and the value 0 with probability $(1-p_{i+})$.

4. We know that we need to maximize the sample likelihood function, taking the accepted factors into account, so as to make the available sample the most plausible. In other words, we need to select parameters under which the sample is most likely. In our case, the selected parameter is the probability of loan repayment $p_{i+}$, which in turn depends on the unknown coefficients $w$. So we need to find the vector of weights $w$ at which the likelihood of the sample is maximal.

5. We know that to maximize the sample likelihood function we can use the maximum likelihood method. And we know all the tricky tricks for working with this method.

Quite a multi-step scheme we end up with 🙂

And now recall that at the very beginning of the article we wanted to derive two kinds of Logistic Loss function, depending on how the object classes are labeled. It so happens that in classification problems with two classes, the classes are labeled either as $0$ and $1$, or as $-1$ and $+1$. Depending on the labeling, the output will be the corresponding loss function.

Case 1. Classification of objects into $0$ and $1$

Earlier, when determining the likelihood of the sample in which the probability of debt repayment by a borrower was calculated on the basis of the factors and the given coefficients $w$, we applied the formula:

$p_i = p_{i+}^{y_i}(1-p_{i+})^{(1-y_i)}$

In fact, $p_{i+}$ is the value of the logistic response function $\frac{1}{1+e^{-w^Tx_i}}$ for a given weight vector $w$.

Then nothing prevents us from writing the likelihood function of the sample as follows:

$P(X; w) = \prod\limits_{i=1}^{n} p_{i+}^{y_i}(1-p_{i+})^{(1-y_i)}$

It sometimes happens that it is hard for some novice analysts to immediately understand how this function works. Let's look at 4 short examples that will make everything clear:

1. If $y_i = 1$ (i.e., according to the training sample, the object belongs to class 1) and our algorithm $p_{i+} = \frac{1}{1+e^{-w^Tx_i}}$ estimates the probability of assigning the object to class $1$ as 0.9, then this piece of the sample likelihood will be calculated as follows:

$0.9^1 \cdot (1 - 0.9)^{1-1} = 0.9$

2. If $y_i = 1$, $p_{i+} = 0.1$, then the calculation will be:

$0.1^1 \cdot (1 - 0.1)^{1-1} = 0.1$

3. If $y_i = 0$, $p_{i+} = 0.1$, then the calculation will be:

$0.1^0 \cdot (1 - 0.1)^{1-0} = 0.9$

4. If $y_i = 0$, $p_{i+} = 0.9$, then the calculation will be:

$0.9^0 \cdot (1 - 0.9)^{1-0} = 0.1$

It is obvious that the likelihood function will be maximized in cases 1 and 3, or, in general, when the probabilities of assigning objects to class $1$ are guessed correctly.

Since, when determining the probability of assigning an object to class $1$, it is only the coefficients $w$ that we do not know, we will look for them. As mentioned above, this is an optimization problem, in which we first need to find the derivative of the likelihood function with respect to the weight vector $w$. However, it makes sense to simplify the task beforehand: we will look for the derivative of the logarithm of the likelihood function.

$L_{log}(X,y,w) = -\ln P(X; w) = -\sum\limits_{i=1}^{n}\bigl(y_i\ln p_{i+} + (1-y_i)\ln(1-p_{i+})\bigr) \rightarrow \min$

Why, after taking the logarithm, did we change the sign in the logistic error function from "$+$" to "$-$"? Everything is simple: since in problems of assessing the quality of a model it is customary to minimize the value of a function, we multiplied the right-hand side of the expression by $-1$ and, accordingly, instead of maximizing, we now minimize the function.

So, right before your eyes, the loss function has just been derived: Logistic Loss for a training sample with the two classes $0$ and $1$.
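As a minimal sketch of this formula in action, here it is evaluated on our three borrowers, using the repayment probabilities computed in section 03 (0.99, 0.73 and 0.45; the variable names are mine):

import math

y_true = [1.0, 1.0, 0.0]      # Vasya and Fedya repaid the loan, Lesha did not
p_plus = [0.99, 0.73, 0.45]   # the model's probabilities of repayment

# Logistic Loss for classes 0 and 1
log_loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_plus))
print('Logistic Loss = %.3f' % log_loss)   # ~ 0.92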

Now, to find the coefficients, we only need to find the derivative of the logistic error function and then, using numerical optimization methods such as gradient descent or stochastic gradient descent, find the optimal coefficients $w$. But, given the already considerable length of the article, it is suggested that you carry out the differentiation on your own; or perhaps this will be the topic of the next article, with a lot of arithmetic but without such detailed examples.
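For the impatient, here is a rough sketch of that idea (it is not part of this article's derivation): plain gradient descent on the Logistic Loss above, using the known fact that its gradient with respect to $w$ is $\sum_i (p_{i+} - y_i)x_i$. The feature scaling by 25.000 and the fixed number of steps are my choices; on such a tiny, perfectly separable sample the weights would otherwise keep growing without bound.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Constant term plus the rescaled salary and payment of Vasya, Fedya and Lesha
X = np.array([[1.0, 120000 / 25000.0, 3000 / 25000.0],
              [1.0, 180000 / 25000.0, 50000 / 25000.0],
              [1.0, 210000 / 25000.0, 70000 / 25000.0]])
y = np.array([1.0, 1.0, 0.0])

w = np.zeros(X.shape[1])
learning_rate = 0.01
for step in range(10000):
    p = sigmoid(X.dot(w))
    grad = X.T.dot(p - y)       # gradient of the Logistic Loss with respect to w
    w -= learning_rate * grad

print('w = %s' % w)
print('p = %s' % sigmoid(X.dot(w)))  # probabilities for Vasya, Fedya, Lesha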

Case 2. Classification of objects into $-1$ and $+1$

The approach here will be the same as with the classes $0$ and $1$, but the path to the Logistic Loss function itself will be more ornate. Let's get started. For the likelihood function we will use the "if ... then ..." operator. That is, if the $i$-th object belongs to class $+1$, then to calculate the likelihood of the sample we use the probability $p_{i+}$; if the object belongs to class $-1$, then we substitute $(1-p_{i+})$ into the likelihood. This is what the likelihood function looks like:

$P(X; w) = \prod\limits_{i=1}^{n}\begin{cases} p_{i+}, & y_i = +1 \\ 1 - p_{i+}, & y_i = -1 \end{cases}$

Let's spell out, on our fingers, how it works. Consider 4 cases:

1. If $y_i = +1$ and $p_{i+} = 0.9$, then $0.9$ will "go into" the likelihood of the sample

2. If $y_i = +1$ and $p_{i+} = 0.1$, then $0.1$ will "go into" the likelihood of the sample

3. If $y_i = -1$ and $p_{i+} = 0.1$, then $(1 - 0.1) = 0.9$ will "go into" the likelihood of the sample

4. If $y_i = -1$ and $p_{i+} = 0.9$, then $(1 - 0.9) = 0.1$ will "go into" the likelihood of the sample

Obviously, in cases 1 and 3, when the probabilities were determined correctly by the algorithm, the likelihood function is maximized, which is exactly what we wanted. However, this approach is rather cumbersome, and next we will consider a more compact notation. But first, let's take the logarithm of the likelihood function with a change of sign, since from now on we will be minimizing it.

$L_{log}(X,y,w) = -\sum\limits_{i=1}^{n}\begin{cases} \ln p_{i+}, & y_i = +1 \\ \ln(1-p_{i+}), & y_i = -1 \end{cases}$

Substitute the expression $\frac{1}{1+e^{-w^Tx_i}}$ for $p_{i+}$:

$L_{log}(X,y,w) = -\sum\limits_{i=1}^{n}\begin{cases} \ln\frac{1}{1+e^{-w^Tx_i}}, & y_i = +1 \\ \ln\Bigl(1-\frac{1}{1+e^{-w^Tx_i}}\Bigr), & y_i = -1 \end{cases}$

Simplify the term for class $-1$ under the logarithm using simple arithmetic tricks ($1-\frac{1}{1+e^{-w^Tx_i}} = \frac{e^{-w^Tx_i}}{1+e^{-w^Tx_i}} = \frac{1}{1+e^{w^Tx_i}}$) and get:

$L_{log}(X,y,w) = -\sum\limits_{i=1}^{n}\begin{cases} \ln\frac{1}{1+e^{-w^Tx_i}}, & y_i = +1 \\ \ln\frac{1}{1+e^{w^Tx_i}}, & y_i = -1 \end{cases}$

And now it is time to get rid of the "if ... then ..." operator. Note that when an object $x_i$ belongs to class $+1$, then in the expression under the logarithm, in the denominator, $e$ is raised to the power $-w^Tx_i$; if the object belongs to class $-1$, then $e$ is raised to the power $+w^Tx_i$. Therefore, the notation of the exponent can be simplified by combining both cases into one: $-y_iw^Tx_i$. Then the logistic error function takes the form:

$L_{log}(X,y,w) = -\sum\limits_{i=1}^{n}\ln\frac{1}{1+e^{-y_iw^Tx_i}}$

In accordance with the rules of logarithms, we flip the fraction under the logarithm, which brings out a "$-$" (minus) sign that cancels the one in front, and get:

$L_{log}(X,y,w) = \sum\limits_{i=1}^{n}\ln\bigl(1+e^{-y_iw^Tx_i}\bigr) \rightarrow \min$

Here it is, the Logistic Loss function that is used for a training sample whose objects are labeled with the classes $-1$ and $+1$.
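As a final sanity check, here is a small sketch confirming numerically that the two forms we derived agree: for labels $0/1$ the loss is $-(y\ln p + (1-y)\ln(1-p))$, for labels $-1/+1$ it is $\ln(1+e^{-y\,w^Tx})$, with $p = \frac{1}{1+e^{-w^Tx}}$:

import math

for f in [-2.0, -0.5, 0.0, 1.0, 3.0]:      # f stands for w^T x
    p = 1.0 / (1.0 + math.exp(-f))
    for y01, ypm in [(1.0, 1.0), (0.0, -1.0)]:
        loss_01 = -(y01 * math.log(p) + (1 - y01) * math.log(1 - p))
        loss_pm = math.log(1 + math.exp(-ypm * f))
        print('f = %4.1f  y = %2d  loss(0/1) = %.6f  loss(-1/+1) = %.6f'
              % (f, int(ypm), loss_01, loss_pm))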

Well, at this point I take my leave and we conclude the article.

The author's previous work: "Bringing the linear regression equation into matrix form"

Auxiliary materials

1. Literature

1) Applied regression analysis / N. Draper, G. Smith - 2nd ed. - M .: Finance and statistics, 1986 (translated from English)

2) Probability theory and mathematical statistics / V.E. Gmurman - 9th ed. - M .: Higher School, 2003

3) Probability theory / N.I. Chernova - Novosibirsk: Novosibirsk State University, 2007

4) Business analytics: from data to knowledge / Paklin N. B., Oreshkov V. I. - 2nd ed. - St. Petersburg: Peter, 2013

5) Data Science Data science from scratch / Joel Gras - St. Petersburg: BHV Petersburg, 2017

6) Practical statistics for Data Science specialists / P. Bruce, E. Bruce - St. Petersburg: BHV Petersburg, 2018

2. Lectures, courses (video)

1) The essence of the maximum likelihood method, Boris Demeshev

2) Maximum likelihood method in the continuous case, Boris Demeshev

3) logistic regression. Open course ODS, Yury Kashnitsky

4) Lecture 4, Evgeny Sokolov (from 47 minutes of video)

5) Logistic regression, Vyacheslav Vorontsov

3. Internet sources

1) Linear classification and regression models

2) How to Understand Logistic Regression Easily

3) Logistic error function

4) Independent tests and the Bernoulli formula

5) Ballad of MMP

6) Maximum likelihood method

7) Formulas and properties of logarithms

8) Why the number $e$?

9) Linear classifier

Source: habr.com
