In this article, we will analyze the theoretical derivation of the transformation of the linear regression function into the inverse logit transform (in other words, the logistic response function). Then, armed with the maximum likelihood method, we will derive the Logistic Loss function in accordance with the logistic regression model; in other words, we will define the function with which the parameters of the weight vector $w$ are selected in the logistic regression model.
Outline of the article:
- Let's recap the straight-line relationship between two variables
- Identify the need to transform the linear regression function into the logistic response function
- Carry out the transformations and derive the logistic response function
- Try to understand why the least squares method is a poor choice for selecting the weight parameters
- Use the maximum likelihood method to derive the parameter-selection function, in two cases:
  - Case 1: the Logistic Loss function for objects with class labels 0 and 1
  - Case 2: the Logistic Loss function for objects with class labels -1 and +1
The article is replete with simple examples in which all the calculations are easy to do in your head or on paper; in some cases a calculator may be needed. So get ready :)
This article is intended primarily for data scientists with an initial level of knowledge of the basics of machine learning.
The article will also provide code for drawing the graphs and performing the calculations. All code is written in Python 2.7. Let me explain in advance about the "novelty" of the version used: it is one of the requirements of the well-known course from Yandex on the equally well-known online education platform Coursera, on which this material is based.
01. Straight Line Dependency
It is quite reasonable to ask: what do straight-line dependence and logistic regression have to do with each other?
Everything is simple! Logistic regression is one of the models that belong to the linear classifiers. In simple words, the task of a linear classifier is to predict the target values from the values of the features (regressors). It is assumed that the dependence between the features and the target values is linear; hence the name of the classifier: linear. Very roughly generalizing, the logistic regression model is based on the assumption that there is a linear relationship between the features and the target values. There it is, the connection.
Cue the first example, and it is, fittingly, about a straight-line dependence between the quantities under study. In the course of preparing this article, I came across an example that has already set many people's teeth on edge: the dependence of current on voltage ("Applied Regression Analysis", N. Draper, G. Smith). We will consider it here too.
In accordance with Ohm's law: $I = U/R$, where $I$ is the current, $U$ is the voltage, and $R$ is the resistance.
If we didn't know Ohm's law, we could find the dependence empirically by varying $U$ and measuring $I$ while keeping $R$ fixed. Then we would see that the graph of $I$ against $U$ gives a more or less straight line through the origin. We say "more or less" because, although the relationship is in fact exact, our measurements may contain small errors, so the points on the graph may not fall exactly on the line but will be randomly scattered around it.
Graph 1 "Dependence from Β»
Chart drawing code
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import random

R = 13.75  # fixed resistance

# theoretical line I = U/R
x_line = np.arange(0, 220, 1)
y_line = []
for i in x_line:
    y_line.append(i / R)

# "measured" points: the line plus small random errors
y_dot = []
for i in y_line:
    y_dot.append(i + random.uniform(-0.9, 0.9))

fig, axes = plt.subplots(figsize=(14, 6), dpi=80)
plt.plot(x_line, y_line, color='purple', lw=3, label='I = U/R')
plt.scatter(x_line, y_dot, color='red', label='Actual results')
plt.xlabel('U', size=16)
plt.ylabel('I', size=16)
plt.legend(prop={'size': 14})
plt.show()
02. The need for transformations of the linear regression equation
Let's consider another example. Imagine that we work at a bank and face the task of determining the probability of loan repayment by a borrower, depending on certain factors. To simplify the task, we will consider only two factors: the borrower's monthly salary and the monthly loan payment.
The task is very artificial, but this example lets us understand why linear regression alone is not enough, and also find out what transformations of the function we need to carry out.
Let's go back to the example. It is understood that the higher the salary, the more the borrower can put toward the monthly loan payment. At the same time, for a certain salary range this dependence will be fairly linear. For example, let's take the salary range from 60.000R to 200.000R and assume that within it the dependence of the monthly payment on the salary is linear. Suppose that, for the specified range, it was revealed that the salary-to-payment ratio cannot fall below 3, and the borrower must still have 5.000R in reserve. Only in this case will we assume that the borrower will repay the loan to the bank. Then the linear regression equation takes the form:

$$f(w, x_i) = w_0 + w_1 x_{i1} + w_2 x_{i2}$$

where $w_0 = -5000$, $w_1 = 1$, $w_2 = -3$; $x_{i1}$ is the salary of the $i$-th borrower, and $x_{i2}$ is the loan payment of the $i$-th borrower.
By substituting a borrower's salary and loan payment into the equation with the fixed parameters $w$, we can decide whether to grant or deny the loan.
Looking ahead, we note that, with the given parameters $w$, the linear regression function, applied inside the logistic response function, will produce values so large that they make it difficult to compute the repayment probabilities. Therefore it is proposed to scale our coefficients down, let's say, by a factor of 25.000. This scaling of the coefficients will not change the decision to issue a loan. Let's remember this point for the future; a quick numeric illustration of the problem follows below. And then, to make it even clearer what we are talking about, we will consider the situation with three potential borrowers.
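A quick sketch of my own (not code from the article) showing the numerical problem with the raw, unscaled coefficients, using the logistic response function that will be derived in section 03:

import math

# raw (unscaled) coefficients: f = -5000 + 1*salary - 3*payment
f_vasya = -5000.0 + 120000.0 - 3*3000.0   # 106000.0
f_lesha = -5000.0 + 210000.0 - 3*70000.0  # -5000.0

# logistic response function p = 1/(1 + e^(-f)) breaks down numerically on such values:
print 1 / (1 + math.exp(-f_vasya))   # exp underflows to 0.0, so p is exactly 1.0
# 1 / (1 + math.exp(-f_lesha))       # raises OverflowError: math.exp(5000) is too large

# after dividing the coefficients by 25.000 everything is well-behaved:
r = 25000.0
print 1 / (1 + math.exp(-f_vasya/r))  # ~0.986
print 1 / (1 + math.exp(-f_lesha/r))  # ~0.450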
Table 1 "Potential Borrowers"
Code for generating a table
import pandas as pd

r = 25000.0  # scaling factor for the coefficients
w_0 = -5000.0/r
w_1 = 1.0/r
w_2 = -3.0/r

data = {'The borrower': np.array(['Vasya', 'Fedya', 'Lesha']),
        'Salary': np.array([120000, 180000, 210000]),
        'Payment': np.array([3000, 50000, 70000])}

df = pd.DataFrame(data)
df['f(w,x)'] = w_0 + df['Salary']*w_1 + df['Payment']*w_2

# the sign of f(w,x) determines the decision
decision = []
for i in df['f(w,x)']:
    if i > 0:
        decision.append('Approved')
    else:
        decision.append('Refusal')
df['Decision'] = decision
df[['The borrower', 'Salary', 'Payment', 'f(w,x)', 'Decision']]
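For reference, the code above should produce the following table (f(w,x) rounded to two decimal places):

The borrower | Salary | Payment | f(w,x) | Decision
Vasya | 120.000 | 3.000 | 4.24 | Approved
Fedya | 180.000 | 50.000 | 1.00 | Approved
Lesha | 210.000 | 70.000 | -0.20 | Refusal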
In accordance with the data in the table, Vasya, with a salary of 120.000R, wants a loan that he would repay at 3.000R per month. We determined that, for the loan to be approved, Vasya's salary must exceed three times the payment, with 5.000R still left over. Vasya satisfies this requirement: $120000 - 3 \cdot 3000 - 5000 = 106000 > 0$; he even has 106.000R to spare. Although we reduced the coefficients by a factor of 25.000 in the calculation, the result is the same: the loan can be approved. Fedya will also receive a loan, but Lesha, despite receiving the most, will have to moderate his appetite.
Let's draw a graph for this case.
Chart 2 "Classification of Borrowers"
Chart drawing code
# decision boundary: w_0 + w_1*salary + w_2*payment = 0
salary = np.arange(60000, 240000, 20000)
payment = (-w_0 - w_1*salary)/w_2
fig, axes = plt.subplots(figsize = (14,6), dpi = 80)
plt.plot(salary, payment, color = 'grey', lw = 2, label = '$f(w,x_i)=w_0 + w_1x_{i1} + w_2x_{i2}$')
plt.plot(df[df['Decision'] == 'Approved']['Salary'], df[df['Decision'] == 'Approved']['Payment'],
'o', color ='green', markersize = 12, label = 'Decision - Loan approved')
plt.plot(df[df['Decision'] == 'Refusal']['Salary'], df[df['Decision'] == 'Refusal']['Payment'],
's', color = 'red', markersize = 12, label = 'Decision - Loan refusal')
plt.xlabel('Salary', size = 16)
plt.ylabel('Payment', size = 16)
plt.legend(prop = {'size': 14})
plt.show()
So, our straight line, constructed in accordance with the function $f(w,x)$, separates "bad" borrowers from "good" ones. Those borrowers whose desires do not match their capabilities are above the line (Lesha); those who, according to the parameters of our model, are able to repay the loan are below it (Vasya and Fedya). Put differently, our line divides the borrowers into two classes. We denote them as follows: the class $+1$ contains those borrowers who will most likely repay the loan; the class $0$ (or $-1$) contains those who most likely will not.
Let's summarize the conclusions from this simple example. Take a point and, substituting its coordinates into the equation of the straight line $f(w,x)$, consider three options:
- If the point is below the line and we assign it to the class $+1$, then the value of the function will be positive, anywhere from $0$ to $+\infty$. We can then assume that the probability of repaying the loan lies within $(0.5, 1]$: the larger the value of the function, the higher the probability.
- If the point is above the line and we assign it to the class $0$ (or $-1$), then the value of the function will be negative, from $0$ to $-\infty$. We will then assume that the probability of debt repayment lies within $[0, 0.5)$: the larger the absolute value of the function, the higher our confidence.
- The point lies on the line, on the boundary between the two classes. In this case the value of the function equals $0$, and the probability of repaying the loan equals $0.5$.
Now imagine that we have not two factors but dozens, and not three borrowers but thousands. Then instead of a straight line we will have an $m$-dimensional hyperplane, and the coefficients $w$ will not be pulled out of thin air but derived according to all the rules, on the basis of accumulated data about borrowers who have or have not repaid their loans. Indeed, note that we are now selecting borrowers with already-known coefficients $w$. In fact, the task of the logistic regression model is precisely to determine the parameters $w$ for which the value of the loss function Logistic Loss tends to the minimum. But we will find out how the vector $w$ is calculated in section 05 of the article. In the meantime, we return to the promised land: to our banker and his three clients.
Thanks to the function $f(w,x)$ we know who can be given a loan and who should be refused. But you can't go to the director with such information, because they wanted to get from us the probability of loan repayment for each borrower. What to do? The answer is simple: we need to somehow transform the function $f(w,x)$, whose values lie in the range $(-\infty, +\infty)$, into a function whose values lie in the range $[0, 1]$. Such a function exists; it is called the logistic response function, or the inverse-logit transformation. Meet:

$$p_{i+} = \frac{1}{1 + e^{-f(w,x_i)}}$$
Let's see, step by step, how the logistic response function is obtained. Note that we will walk in the opposite direction: we will assume that we know a probability value lying between $0$ and $1$, and then we will "unwind" this value over the entire range of numbers from $-\infty$ to $+\infty$.
03. Derive the logistic response function
Step 1. Convert the probability values to the range $[0, +\infty)$
During the transformation of the function $f(w,x)$ into the logistic response function, we will leave our credit analyst alone and instead take a walk past the bookmakers. No, of course we will not place bets; all that interests us there is the meaning of an expression such as "the odds are 4 to 1". The odds, familiar to all bettors, are the ratio of "successes" to "failures". In terms of probabilities, the odds are the probability of an event occurring divided by the probability that the event does not occur. Let's write the formula for the odds of an event occurring:

$$odds_+ = \frac{p_+}{1 - p_+}$$

where $p_+$ is the probability of the event occurring, and $(1 - p_+)$ is the probability of the event NOT occurring.
For example, if the probability that a young, strong and frisky horse nicknamed "Veterok" will beat an old and flabby mare named "Matilda" at the races equals $0.8$, then the odds of success for Veterok will be $\frac{0.8}{1 - 0.8} = 4$ to $1$, and vice versa: knowing the odds, it will not be difficult for us to calculate the probability $p_+$: $p_+ = \frac{odds_+}{1 + odds_+} = \frac{4}{1 + 4} = 0.8$.
Thus, we have learned to "translate" probability into odds, which take values from $0$ to $+\infty$. Let's take one more step and learn to "translate" probability onto the entire number line, from $-\infty$ to $+\infty$.
Step 2. Convert the probability values to the range $(-\infty, +\infty)$
This step is very simple: we take the logarithm of the odds to the base of Euler's number $e$ and get:

$$f(w,x) = \ln(odds_+)$$
Now we know that if $odds_+ = 4$, then calculating the value of $f(w,x)$ is very simple, and moreover, it must be positive: $f(w,x) = \ln(4) \approx 1.39 > 0$. And so it is.
For the sake of curiosity, we check that if the odds are $1$ to $4$ ($odds_+ = 0.25$), then we expect to see a negative value of $f(w,x)$. We check: $f(w,x) = \ln(0.25) \approx -1.39 < 0$. That's right.
Now we know how to translate a probability value from $0$ to $1$ onto the entire number line, from $-\infty$ to $+\infty$. In the next step we will do the opposite.
In the meantime, note that, in accordance with the rules of the logarithm, knowing the value of the function $f(w,x)$, we can calculate the odds:

$$odds_+ = e^{f(w,x)}$$
This method of determining the odds will be useful to us in the next step.
Step 3. Derive a formula for determining $p_+$
So, we have learned how, knowing $odds_+$, to find the value of the function $f(w,x)$. However, in fact we need exactly the opposite: knowing the value $f(w,x)$, to find $p_+$. To do this, we turn to the concept of the inverse odds function, according to which:

$$p_+ = \frac{odds_+}{1 + odds_+}$$
In this article we will not derive the above formula, but we will check it on the numbers from the example above. We know that with odds of $4$ to $1$ ($odds_+ = 4$), the probability of the event occurring is $0.8$ ($p_+ = 0.8$). Let's make the substitution: $p_+ = \frac{4}{1 + 4} = 0.8$. This is consistent with our earlier calculations. We move on.
In the previous step we deduced that $odds_+ = e^{f(w,x)}$, which means we can make a substitution in the inverse odds function. We get:

$$p_+ = \frac{e^{f(w,x)}}{1 + e^{f(w,x)}}$$
Divide both the numerator and the denominator by $e^{f(w,x)}$, and then:

$$p_+ = \frac{1}{1 + e^{-f(w,x)}}$$
Just in case, to make sure we haven't made a mistake anywhere, let's do one more small check. In step 2 we determined that for $odds_+ = 4$ we have $f(w,x) = \ln(4) \approx 1.39$. Then, substituting this value into the logistic response function, we expect to get $p_+ = 0.8$. Substitute and get: $p_+ = \frac{1}{1 + e^{-1.39}} \approx 0.8$.
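The whole round trip can also be written as a short code sketch (the variable names here are my own):

import math

p = 0.8                          # step 0: the known probability
odds = p / (1 - p)               # step 1: probability -> odds, gives 4.0
f = math.log(odds)               # step 2: odds -> the whole number line, ~1.39
p_back = 1 / (1 + math.exp(-f))  # step 3: back to probability
print odds, f, p_back            # 4.0 1.386... 0.8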
Congratulations, dear reader, we have just derived and tested the logistic response function. Let's look at the graph of the function.
Graph 3 "Logistic response function"
Chart drawing code
import math

def logit(f):
    return 1 / (1 + math.exp(-f))

f = np.arange(-7, 7, 0.05)
p = []
for i in f:
    p.append(logit(i))

fig, axes = plt.subplots(figsize=(14, 6), dpi=80)
plt.plot(f, p, color='grey', label='$1/(1+e^{-w^Tx_i})$')
plt.xlabel('$f(w,x_i) = w^Tx_i$', size=16)
plt.ylabel('$p_{i+}$', size=16)
plt.legend(prop={'size': 14})
plt.show()
In the literature you can also find this function under the name sigmoid function. The graph clearly shows that the main change in the probability of an object belonging to a class occurs within a relatively small range of $f(w,x)$, somewhere from $-4$ to $+4$.
I propose to return to our credit analyst and help him calculate the probability of loan repayment; otherwise he risks being left without a bonus :)
Table 2 "Potential Borrowers"
Code for generating a table
proba = []
for i in df['f(w,x)']:
    proba.append(round(logit(i), 2))
df['Probability'] = proba
df[['The borrower', 'Salary', 'Payment', 'f(w,x)', 'Decision', 'Probability']]
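For reference, the code above should extend Table 1 with a Probability column: Vasya 0.99, Fedya 0.73, Lesha 0.45.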
So, we have determined the probability of loan repayment. On the whole, this looks plausible.
Indeed, the probability that Vasya, with a salary of 120.000R, will be able to pay 3.000R to the bank every month is close to 100%. By the way, we must understand that the bank may also issue a loan to Lesha if the bank's policy allows, for example, lending to clients with a repayment probability above, say, 0.3. In that case the bank simply sets aside a larger reserve for possible losses.
It should also be noted that the salary-to-payment ratio of at least 3, with a 5.000R margin, was pulled out of thin air. That is why we could not use the weight vector in its original form $(-5000, 1, -3)$: we needed to greatly reduce the coefficients, and in this case we divided each coefficient by 25.000, that is, in effect, we adjusted the result. But this was done deliberately, to simplify understanding of the material at the initial stage. In real life we will not need to invent and adjust coefficients but to find them. In the following sections of the article we will derive the equations by which the parameters $w$ are selected.
04. The method of least squares in determining the vector of weights in the logistic response function
We already know the least squares method (OLS) as a way of selecting the weight vector $w$, so why not use it in binary classification problems too? Indeed, nothing prevents us from using OLS; it is just that in classification problems this method gives less accurate results than Logistic Loss. There is a theoretical justification for this. Let's first look at one simple example.
Suppose our models (using MSE and Logistic Loss) have already begun selecting the weight vector $w$ and we stopped the calculation at some step. It does not matter whether it is in the middle, at the end or at the beginning; the main thing is that we already have some values of the weight vector, and let's say that at this step the weight vectors of the two models do not differ. Then we take the obtained weights and substitute them into the logistic response function ($p_{i+} = \frac{1}{1 + e^{-w^Tx_i}}$) for some object that belongs to the class $+1$. We will study two cases: when, in accordance with the selected weight vector, our model is badly mistaken, and conversely, when the model is strongly confident that the object belongs to the class $+1$. Let's see what penalties are "issued" when using OLS and Logistic Loss.
Code for calculating penalties depending on the loss function used
# object class
y = 1
# probability of assigning the object to the class, according to the parameters w
proba_1 = 0.01

MSE_1 = (y - proba_1)**2
print 'MSE penalty for a gross error =', MSE_1

# a function to compute f(w,x) from a known probability of the object
# belonging to class +1 (f(w,x) = ln(odds+))
def f_w_x(proba):
    return math.log(proba/(1-proba))

LogLoss_1 = math.log(1+math.exp(-y*f_w_x(proba_1)))
print 'Log Loss penalty for a gross error =', LogLoss_1

proba_2 = 0.99
MSE_2 = (y - proba_2)**2
LogLoss_2 = math.log(1+math.exp(-y*f_w_x(proba_2)))

print '**************************************************************'
print 'MSE penalty for strong confidence =', MSE_2
print 'Log Loss penalty for strong confidence =', LogLoss_2
The gross-error case: the model assigns the object to the class $+1$ with a probability of $0.01$.
- The OLS penalty will be: $(1 - 0.01)^2 = 0.9801$
- The Logistic Loss penalty will be: $\ln\left(1 + e^{-\ln(0.01/0.99)}\right) = \ln(100) \approx 4.61$
The strong-confidence case: the model assigns the object to the class $+1$ with a probability of $0.99$.
- The OLS penalty will be: $(1 - 0.99)^2 = 0.0001$
- The Logistic Loss penalty will be: $\ln\left(1 + e^{-\ln(0.99/0.01)}\right) = \ln(1 + 1/99) \approx 0.01$
This example illustrates well that, in the case of a gross error, the Log Loss function penalizes the model much more heavily than MSE. Let's now understand the theoretical background of using the Log Loss function in classification problems.
05. Maximum likelihood method and logistic regression
As promised at the beginning, the article is replete with simple examples. Here comes another one, with our old guests, the bank borrowers: Vasya, Fedya and Lesha.
Just to be on the safe side, before developing the example, let me remind you that in real life we deal with training samples of thousands or millions of objects with tens or hundreds of features. Here, however, the numbers are chosen so that they easily fit in the head of a novice data scientist.
Let's go back to the example. Imagine that the bank director decided to issue a loan to everyone who applied, even though the algorithm suggested not issuing one to Lesha. Enough time has passed, and we now know which of the three heroes repaid the loan and who did not. As expected: Vasya and Fedya repaid the loan, but Lesha did not. Now imagine that this outcome is a new training sample for us, and at the same time we seem to have lost all the data on the factors affecting the probability of repayment (the borrower's salary, the monthly payment). Then, intuitively, we can assume that every third borrower does not repay the loan, or in other words, that the probability of repayment by the next borrower is $p = 2/3$. This intuitive assumption has theoretical confirmation and is based on the maximum likelihood method, often referred to in the literature as the maximum likelihood principle.
First, let's get acquainted with the conceptual apparatus.
The sample likelihood is the probability of obtaining exactly such a sample, exactly such observations/results; i.e., the product of the probabilities of obtaining each of the results of the sample (for example, that Vasya's and Fedya's loans are repaid and Lesha's is not repaid, all at the same time).
The likelihood function relates the sample likelihood to the values of the distribution parameters.
In our case, the training sample is a generalized Bernoulli scheme in which the random variable takes only two values: $1$ or $0$. Therefore, the sample likelihood can be written as a likelihood function of the parameter $p$ as follows:

$$L(p) = p \cdot p \cdot (1 - p) = p^2 (1 - p)$$
The above entry can be interpreted as follows. The joint probability that Vasya and Fedya repay the loan is $p \cdot p = p^2$; the probability that Lesha does NOT repay the loan is $1 - p$ (since it was NOT a repayment); therefore, the joint probability of all three events is $p^2(1 - p)$.
The maximum likelihood method is a method for estimating an unknown parameter by maximizing the likelihood function. In our case, we need to find the value $p$ at which $L(p) = p^2(1 - p)$ reaches its maximum.
Where does the idea actually come from - to look for the value of the unknown parameter at which the likelihood function reaches its maximum? The origins of the idea stem from the notion that the sample is the only source of knowledge available to us about the population. Everything we know about the population is represented in the sample. Therefore, all we can say is that the sample is the most accurate reflection of the population available to us. Therefore, we need to find such a parameter at which the existing sample becomes the most probable.
Obviously, we are dealing with an optimization problem in which we need to find the extremum point of a function. To find the extremum point, we consider the first-order condition: equate the derivative of the function to zero and solve the equation with respect to the desired parameter. However, searching for the derivative of a product of a large number of factors can turn into a drawn-out affair; to avoid this, there is a special trick: the transition to the logarithm of the likelihood function. Why is such a transition possible? Note that we are looking not for the extremum of the function itself but for the extremum point, that is, the value of the unknown parameter $p$ at which $L(p)$ reaches its maximum. When passing to the logarithm, the extremum point does not change (although the extremum itself will differ), since the logarithm is a monotonic function.
In accordance with the above, let's continue developing our example with the loans of Vasya, Fedya and Lesha. To start, let's take the logarithm of the likelihood function:

$$\ln L(p) = 2\ln p + \ln(1 - p)$$
Now we can easily differentiate this expression with respect to $p$:

$$\frac{\partial \ln L(p)}{\partial p} = \frac{2}{p} - \frac{1}{1 - p}$$
And finally, consider the first-order condition: we equate the derivative to zero:

$$\frac{2}{p} - \frac{1}{1 - p} = 0 \;\Longrightarrow\; 2(1 - p) = p \;\Longrightarrow\; p = \frac{2}{3}$$
Thus, our intuitive estimate of the probability of loan repayment, $p = 2/3$, has been theoretically justified.
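A quick numeric sanity check (a sketch of my own, not from the course): evaluate $L(p) = p^2(1-p)$ on a dense grid and find the argmax.

import numpy as np

p_grid = np.arange(0.001, 1.0, 0.001)
L = p_grid**2 * (1 - p_grid)  # likelihood of our sample: two repayments, one default
print p_grid[np.argmax(L)]    # ~0.667, i.e. p = 2/3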
Great, but what do we do with this information now? If we assume that every third borrower does not return the money, the bank will inevitably go bankrupt. True, but only when assessing the repayment probability as the constant $2/3$ did we fail to take into account the factors affecting repayment: the borrower's salary and the size of the monthly payment. Recall that earlier we calculated the repayment probability for each client taking exactly these factors into account. It is logical that the probabilities we obtained differ from the constant $2/3$.
Let's determine the likelihood of the sample in both cases:
Code for Calculating Likelihoods of Samples
from functools import reduce

def likelihood(y, p):
    line_true_proba = []
    for i in range(len(y)):
        ltp_i = p[i]**y[i] * (1-p[i])**(1-y[i])
        line_true_proba.append(ltp_i)
    return reduce(lambda a, b: a*b, line_true_proba)

y = [1.0, 1.0, 0.0]

# probabilities computed with the logistic response function
p_log_response = df['Probability']

# constant probability p = 2/3
const = 2.0/3.0
p_const = [const, const, const]

print 'Sample likelihood with the constant value p = 2/3:', round(likelihood(y, p_const), 3)
print '****************************************************************************************************'
print 'Sample likelihood with the computed value of p:', round(likelihood(y, p_log_response), 3)
The sample likelihood with the constant value $p = 2/3$: $\frac{2}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} \approx 0.148$

The sample likelihood when the repayment probability is calculated taking the factors into account: $0.99 \cdot 0.73 \cdot (1 - 0.45) \approx 0.398$
The likelihood of the sample with probabilities calculated from the factors turned out to be higher than the likelihood with a constant probability. What does this tell us? It tells us that knowledge of the factors made it possible to select the repayment probability for each client more accurately. Therefore, when issuing the next loan, it would be more correct to use the model proposed at the end of section 03 for estimating the probability of debt repayment.
But then, if we want to maximize the sample likelihood function, why not use some algorithm that produces probabilities for Vasya, Fedya and Lesha of, say, 0.99, 0.99 and 0.01 respectively? Perhaps such an algorithm would perform well on the training sample, since it would bring the sample likelihood closer to $1$, but, firstly, such an algorithm would most likely have difficulties with generalization, and secondly, it would definitely not be linear. And since methods of dealing with overfitting (equally: weak generalization) are clearly not part of the plan of this article, let's go over the second point in more detail. To do so, it is enough to answer a simple question: can the repayment probabilities for Vasya and Fedya be the same, given the factors known to us? From the standpoint of sound logic, of course not. Vasya will give 2.5% of his salary per month to repay the loan, while Fedya gives almost 27.8%. Also, on Chart 2 "Classification of Borrowers" we see that Vasya is much further from the line separating the classes than Fedya. Finally, we know that the function $f(w,x)$ takes different values for Vasya and Fedya: 4.24 for Vasya and 1.0 for Fedya. Now, if Fedya, for example, earned an order of magnitude more or asked for a smaller loan, then the repayment probabilities for Vasya and Fedya would be similar. In other words, a linear dependence cannot be deceived. And if we had really calculated the coefficients $w$, rather than taking them out of thin air, we could safely say that our values of $w$ give the best estimate of the repayment probability for each borrower; but since we agreed to assume that the coefficients were determined according to all the rules, we will assume just that: our coefficients allow us to give the best estimate of the probability :)
However, we digress. In this section we need to understand how the weight vector $w$, which is needed to estimate each borrower's repayment probability, is determined.
Let's briefly summarize the arsenal we bring to the search for the coefficients $w$:
1. We assume that the relationship between the target variable (the predicted value) and the factors influencing the result is linear. For this reason a linear regression function of the form $f(w,x) = w^Tx$ is used, whose line (hyperplane) divides the objects (clients) into the classes $+1$ and $0$ (or $-1$): clients able and unable to repay the loan. In our case the equation has the form $f(w,x) = w_0 + w_1x_1 + w_2x_2$.
2. We use the inverse logit function of the form $p_{i+} = \frac{1}{1 + e^{-w^Tx_i}}$ to determine the probability that an object belongs to the class $+1$.
3. We consider our training sample as a realization of a generalized Bernoulli scheme: for each object a random variable is generated which, with probability $p$ (its own for each object), takes the value 1 and, with probability $1 - p$, the value 0.
4. We know that we need to maximize the sample likelihood function, taking the accepted factors into account, so as to make the available sample the most plausible. In other words, we need to select the parameters under which the sample is most likely. In our case the selected parameter is the repayment probability $p$, which in turn depends on the unknown coefficients $w$. So we need to find the weight vector $w$ at which the sample likelihood is maximal.
5. We know that to maximize the sample likelihood function we can use the maximum likelihood method. And we know all the tricky tricks for working with this method.
Quite a multi-move scheme, as you can see :)
And now recall that at the very beginning of the article we wanted to derive two kinds of the Logistic Loss function, depending on how the object classes are labeled. It so happens that in two-class classification problems the classes are labeled either as $1$ and $0$, or as $-1$ and $+1$. Depending on the labeling, the output will be the corresponding loss function.
Case 1. Classification of objects into $0$ and $1$
Earlier, when determining the sample likelihood, in which a borrower's repayment probability was calculated from the factors and the given coefficients $w$, we applied the formula:

$$p_{i+} = \frac{1}{1 + e^{-w^Tx_i}}$$

Here $p_{i+}$ is in fact the value of the logistic response function for the given weight vector $w$.
Then nothing prevents us from writing the sample likelihood function as follows:

$$L(w) = \prod_{i=1}^{n} p_{i+}^{y_i} (1 - p_{i+})^{1 - y_i}$$
It sometimes happens that novice analysts find it difficult to immediately understand how this function works. Let's look at 4 short examples that will make everything clear:
1. If $y_i = 1$ (i.e., according to the training sample the object belongs to class $+1$), and our algorithm estimates the probability of assigning the object to class $+1$ as $p_{i+} = 0.9$, then this piece of the sample likelihood is calculated as: $0.9^1 \cdot (1 - 0.9)^0 = 0.9$
2. If $y_i = 1$, $p_{i+} = 0.1$, then the calculation gives: $0.1^1 \cdot (1 - 0.1)^0 = 0.1$
3. If $y_i = 0$, $p_{i+} = 0.1$, then the calculation gives: $0.1^0 \cdot (1 - 0.1)^1 = 0.9$
4. If $y_i = 0$, $p_{i+} = 0.9$, then the calculation gives: $0.9^0 \cdot (1 - 0.9)^1 = 0.1$
It is obvious that the likelihood function is maximized in cases 1 and 3, i.e., in general, when the probabilities of assigning objects to the class $+1$ are guessed correctly.
Since, when determining the probability of assigning an object to the class $+1$, only the coefficients $w$ are unknown, we will look for them. As mentioned above, this is an optimization problem in which we first need to find the derivative of the likelihood function with respect to the weight vector $w$. However, it makes sense to simplify the task first: we will look for the derivative of the logarithm of the likelihood function:

$$L_{log}(w) = -\sum_{i=1}^{n} \left( y_i \ln p_{i+} + (1 - y_i)\ln(1 - p_{i+}) \right) \rightarrow \min_w$$
Why, after taking the logarithm, did we change the sign from $+$ to $-$ in the logistic error function? Everything is simple: since in model-quality assessment problems it is customary to minimize the value of a function, we multiplied the right-hand side of the expression by $-1$, and accordingly, instead of maximizing, we now minimize the function.
And that is it: right before your eyes, the loss function Logistic Loss has just been derived for a training sample with two classes, $0$ and $1$.
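A small sketch of my own (using the probabilities from Table 2, not code from the course) computing this form of the loss for our three borrowers:

import math

y = [1, 1, 0]           # Vasya and Fedya repaid the loan, Lesha did not
p = [0.99, 0.73, 0.45]  # probabilities from Table 2

log_loss_01 = -sum(y_i*math.log(p_i) + (1 - y_i)*math.log(1 - p_i)
                   for y_i, p_i in zip(y, p))
print log_loss_01  # ~0.92; note this equals -ln(0.398), minus the log of the sample likelihood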
Now, to find the coefficients, we only need to find the derivative of the logistic error function and then, using numerical optimization methods such as gradient descent or stochastic gradient descent, find the optimal coefficients $w$. But, given the already large volume of the article, you are invited to carry out the differentiation on your own; or perhaps this will be the topic of the next article, with a lot of arithmetic but without such detailed examples.
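If you would like to peek ahead, here is a minimal sketch of such a search (my own illustration, not code from the course). It uses the standard gradient of the loss derived above, $\nabla_w L_{log} = \sum_i (p_{i+} - y_i)x_i$, which you can verify by differentiating yourself:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the three clients: a column of ones for w_0, then salary and payment scaled by 25.000
X = np.array([[1.0, 120000.0/25000, 3000.0/25000],
              [1.0, 180000.0/25000, 50000.0/25000],
              [1.0, 210000.0/25000, 70000.0/25000]])
y = np.array([1.0, 1.0, 0.0])  # Vasya and Fedya repaid the loan, Lesha did not

w = np.zeros(3)  # initial weights
eta = 0.01       # learning rate, small enough for this sample
for _ in range(3000):
    grad = X.T.dot(sigmoid(X.dot(w)) - y)  # gradient of Logistic Loss (classes 0 and 1)
    w -= eta * grad
print w  # the learned weights; their sign pattern should match the hand-picked (-0.2, 1, -3)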
Case 2. Classification of objects into $-1$ and $+1$
The approach here will be the same as with the classes $0$ and $1$, but the path to deriving the Logistic Loss function will be more ornate. Let's get started. This time the likelihood function will use an "if... then..." operator. That is, if the $i$-th object belongs to the class $+1$, then to calculate the sample likelihood we use the probability $p_{i+}$; if the object belongs to the class $-1$, then we substitute $1 - p_{i+}$ into the likelihood. This is what the likelihood function looks like:

$$L(w) = \prod_{i=1}^{n} \begin{cases} p_{i+}, & y_i = +1 \\ 1 - p_{i+}, & y_i = -1 \end{cases}$$
Let's spell out, on our fingers, how this works. Consider 4 cases:
1. If $y_i = +1$ and $p_{i+} = 0.9$, then $0.9$ "goes" into the sample likelihood
2. If $y_i = +1$ and $p_{i+} = 0.1$, then $0.1$ "goes" into the sample likelihood
3. If $y_i = -1$ and $p_{i+} = 0.1$, then $1 - 0.1 = 0.9$ "goes" into the sample likelihood
4. If $y_i = -1$ and $p_{i+} = 0.9$, then $1 - 0.9 = 0.1$ "goes" into the sample likelihood
Obviously, in cases 1 and 3, where the probabilities were determined correctly by the algorithm, the likelihood function is maximized, which is exactly what we wanted. However, this approach is rather cumbersome, and below we will derive a more compact notation. But first, let's take the logarithm of the likelihood function, with a change of sign, since we will now minimize it:

$$L_{log}(w) = -\sum_{i=1}^{n} \ln \begin{cases} p_{i+}, & y_i = +1 \\ 1 - p_{i+}, & y_i = -1 \end{cases}$$
Substitute the expression $\frac{1}{1 + e^{-w^Tx_i}}$ in place of $p_{i+}$:

$$L_{log}(w) = -\sum_{i=1}^{n} \ln \begin{cases} \dfrac{1}{1 + e^{-w^Tx_i}}, & y_i = +1 \\[4pt] 1 - \dfrac{1}{1 + e^{-w^Tx_i}}, & y_i = -1 \end{cases}$$
Simplify the second term under the logarithm using simple arithmetic tricks ($1 - \frac{1}{1 + e^{-w^Tx_i}} = \frac{e^{-w^Tx_i}}{1 + e^{-w^Tx_i}} = \frac{1}{1 + e^{w^Tx_i}}$) and get:

$$L_{log}(w) = -\sum_{i=1}^{n} \ln \begin{cases} \dfrac{1}{1 + e^{-w^Tx_i}}, & y_i = +1 \\[4pt] \dfrac{1}{1 + e^{w^Tx_i}}, & y_i = -1 \end{cases}$$
And now it's time to get rid of the "if... then..." operator. Note that when an object belongs to the class $+1$, $e$ in the denominator under the logarithm is raised to the power $-w^Tx_i$; if the object belongs to the class $-1$, then $e$ is raised to the power $+w^Tx_i$. Therefore the notation of the power can be simplified by combining both cases into one: $-y_i w^Tx_i$. Then the logistic error function takes the form:

$$L_{log}(w) = -\sum_{i=1}^{n} \ln \frac{1}{1 + e^{-y_i w^Tx_i}}$$
In accordance with the rules of the logarithm, we flip the fraction and take the "$-$" (minus) sign out of the logarithm, getting:

$$L_{log}(w) = \sum_{i=1}^{n} \ln \left( 1 + e^{-y_i w^Tx_i} \right)$$
Here it is, the loss function Logistic Loss, used in training samples with objects labeled with the classes $-1$ and $+1$.
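As a final sanity check (again a sketch of my own), we can verify numerically that the two forms of Logistic Loss agree: the $0/1$ form from Case 1 and the $-1/+1$ form just derived give the same number on the same sample:

import math

p = [0.99, 0.73, 0.45]  # probabilities from Table 2
y01 = [1, 1, 0]         # the labels in the 0/1 convention
ypm = [1, 1, -1]        # the same labels in the -1/+1 convention

loss_01 = -sum(a*math.log(b) + (1 - a)*math.log(1 - b) for a, b in zip(y01, p))

# recover f(w,x) from the probability: f = ln(p/(1-p)), then apply ln(1 + e^(-y*f))
loss_pm = sum(math.log(1 + math.exp(-a*math.log(b/(1 - b)))) for a, b in zip(ypm, p))

print loss_01, loss_pm  # both ~0.92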
Well, at this point I take my leave and we conclude the article.
Auxiliary materials
1. Literature
1) Applied Regression Analysis / N. Draper, G. Smith - 2nd ed. - M.: Finance and Statistics, 1986 (translated from English)
2) Probability Theory and Mathematical Statistics / V.E. Gmurman - 9th ed. - M.: Higher School, 2003
3) Probability Theory / N.I. Chernova - Novosibirsk: Novosibirsk State University, 2007
4) Business Analytics: From Data to Knowledge / Paklin N.B., Oreshkov V.I. - 2nd ed. - St. Petersburg: Piter, 2013
5) Data Science from Scratch / Joel Grus - St. Petersburg: BHV Petersburg, 2017
6) Practical Statistics for Data Scientists / P. Bruce, A. Bruce - St. Petersburg: BHV Petersburg, 2018
Source: habr.com