The article discusses several ways to determine the mathematical equation of a simple (pair) regression line.
All methods of solving the equation considered here are based on the least squares method. We denote the methods as follows:
- Analytical solution
- Gradient descent
- Stochastic gradient descent
For each way of solving the equation, the article presents several functions, which mainly fall into two groups: those written without using the NumPy library and those that use NumPy for the calculations. It is believed that skillful use of NumPy reduces the cost of computation.
All code in this article is written in Python 2.7 using Jupyter Notebook. The source code and the sample data file are available at
The article is aimed both at beginners and at those who have already gradually begun to master a very extensive area of artificial intelligence - machine learning.
Let's use a very simple example to illustrate the material.
Example Conditions
We have five values that characterize the dependence of y on x (Table No. 1):
Table No. 1 "Conditions of the example"
We will assume that x is the month of the year and y is the revenue in that month. In other words, revenue depends on the month of the year, and x is the only feature on which revenue depends.
The example is so-so, both in terms of the conditional dependence of revenue on the month of the year and in terms of the number of values: there are very few of them. However, such a simplification will make it possible to explain, in simple terms, the material that beginners are absorbing, although not always with ease. And the simplicity of the numbers will allow those who wish to solve the example on "paper" without significant effort.
Suppose that the dependence given in the example can be approximated quite well by the mathematical equation of a simple (pair) regression line of the form:

\begin{equation*}
y = a + bx
\end{equation*}

where x is the month in which the revenue was received, y is the revenue corresponding to that month, and a and b are the regression coefficients of the estimated line.
Note that the coefficient b is often referred to as the slope or gradient of the estimated line; it is the amount by which y changes when x changes by one unit.
Obviously, our task in the example is to select coefficients a and b in the equation for which the deviations of our estimated monthly revenue values from the true answers, i.e. the values presented in the sample, are minimal.
Least squares method
According to the least squares method, the deviation should be calculated by squaring it. This technique avoids the mutual cancellation of deviations that have opposite signs. For example, if in one case the deviation is +5 (plus five) and in the other -5 (minus five), then the sum of the deviations cancels out and equals 0 (zero). Instead of squaring the deviation one could take its absolute value, and then all deviations would be positive and would accumulate. We will not dwell on this point in detail, but simply note that, for convenience of calculation, it is customary to square the deviation.
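As a quick numerical illustration of the point above (the numbers are chosen purely for the example):

\begin{equation*}
(+5) + (-5) = 0, \qquad (+5)^2 + (-5)^2 = 50, \qquad |+5| + |-5| = 10
\end{equation*}

The raw deviations cancel each other out, while both the squares and the absolute values accumulate.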
This is how the formula looks with the help of which we will determine the smallest sum of squared deviations (errors):

\begin{equation*}
ERR(a,b) = \sum\limits_{i=1}^{n}\left(f(x_i) - y_i\right)^2 = \sum\limits_{i=1}^{n}\left(a + bx_i - y_i\right)^2 \to \min
\end{equation*}

where f(x_i) = a + bx_i is the function approximating the true answers (that is, the revenue we calculate),
y_i are the true answers (the revenue provided in the sample),
i is the sample index (the number of the month in which the deviation is determined),
n is the number of observations in the sample.
Let's differentiate the function, write out the first-order partial derivative equations, and then be ready to move on to the analytical solution. But first, let's take a short digression about what differentiation is and recall the geometric meaning of the derivative.
Differentiation
Differentiation is the operation of finding the derivative of a function.
What is the derivative for? The derivative of a function characterizes the rate of change of the function and shows us its direction. If the derivative at a given point is positive, then the function is increasing; otherwise, the function is decreasing. And the greater the absolute value of the derivative, the higher the rate of change of the function values and the steeper the slope of the function graph.
For example, in a Cartesian coordinate system, a derivative equal to +25 at the point M(0,0) means that at this point, when x is shifted to the right by one conventional unit, y increases by 25 conventional units. On the graph, this looks like a fairly steep rise in values from that point.
Another example. A derivative value of -0.1 means that when x is shifted by one conventional unit, y decreases by only 0.1 conventional unit. At the same time, on the graph of the function, we can observe a barely noticeable downward slope. Drawing an analogy with a mountain, it is as if we are very slowly descending a gentle slope, unlike the previous example, where we had to climb very steep peaks :)
Thus, after differentiating the function with respect to the coefficients a and b, we obtain the first-order partial derivative equations. After defining the equations, we get a system of two equations, by solving which we can choose values of the coefficients a and b at which the values of the corresponding derivatives at the given points change by a very, very small amount, and in the case of the analytical solution do not change at all. In other words, the error function at the found coefficients reaches a minimum, since the values of the partial derivatives at these points are equal to zero.
So, according to the rules of differentiation, the first-order partial derivative equation with respect to the coefficient a takes the form:

\begin{equation*}
2na + 2b\sum\limits_{i=1}^{n}x_i - 2\sum\limits_{i=1}^{n}y_i = 0
\end{equation*}

The first-order partial derivative equation with respect to b takes the form:

\begin{equation*}
2\sum\limits_{i=1}^{n}x_i\left(a + bx_i - y_i\right) = 0
\end{equation*}
As a result, we got a system of equations that has a fairly simple analytical solution:
\begin{equation*}
\begin{cases}
na + b\sum\limits_{i=1}^{n}x_i - \sum\limits_{i=1}^{n}y_i = 0 \\
\sum\limits_{i=1}^{n}x_i\left(a + bx_i - y_i\right) = 0
\end{cases}
\end{equation*}
Before solving the equations, let's load the data, check that it loaded correctly, and format it.
Loading and formatting data
Note that, since for the analytical solution, and later for gradient and stochastic gradient descent, we will use the code in two variations - with the NumPy library and without it - we need the data formatted accordingly (see code).
Data loading and processing code
# import all the libraries we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import pylab as pl
import random
# display plots inside Jupyter
%matplotlib inline
# set the plot size
from pylab import rcParams
rcParams['figure.figsize'] = 12, 6
# turn off Anaconda warnings
import warnings
warnings.simplefilter('ignore')
# load the values
table_zero = pd.read_csv('data_example.txt', header=0, sep='\t')
# look at the table info and at the table itself
print table_zero.info()
print '********************************************'
print table_zero
print '********************************************'
# prepare the data without using NumPy
x_us = []
[x_us.append(float(i)) for i in table_zero['x']]
print x_us
print type(x_us)
print '********************************************'
y_us = []
[y_us.append(float(i)) for i in table_zero['y']]
print y_us
print type(y_us)
print '********************************************'
# prepare the data using NumPy
x_np = table_zero[['x']].values
print x_np
print type(x_np)
print x_np.shape
print '********************************************'
y_np = table_zero[['y']].values
print y_np
print type(y_np)
print y_np.shape
print '********************************************'
Visualization
Now that we have, firstly, loaded the data, secondly, checked that it loaded correctly and, finally, formatted it, let's carry out the first visualization. The pairplot method of the Seaborn library is often used for this. In our example, because of the limited amount of data, there is no point in using Seaborn. We will use the usual Matplotlib library and look only at the scatter plot.
Scatterplot Code
print 'Chart No. 1 "Dependence of revenue on the month of the year"'
plt.plot(x_us,y_us,'o',color='green',markersize=16)
plt.xlabel('$Months$', size=16)
plt.ylabel('$Sales$', size=16)
plt.show()
Chart No. 1 "Dependence of revenue on the month of the year"
Analytical solution
Let's use the most common Python tools and solve the system of equations:
\begin{equation*}
\begin{cases}
na + b\sum\limits_{i=1}^{n}x_i - \sum\limits_{i=1}^{n}y_i = 0 \\
\sum\limits_{i=1}^{n}x_i\left(a + bx_i - y_i\right) = 0
\end{cases}
\end{equation*}
According to Cramer's rule, we find the overall determinant, as well as the determinants for a and for b; then, dividing the determinant for a by the overall determinant, we find the coefficient a, and similarly we find the coefficient b.
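In the notation of the system above, the determinants that the function below computes can be written out as a small sketch of the algebra:

\begin{equation*}
\Delta = \begin{vmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{vmatrix}, \qquad
\Delta_a = \begin{vmatrix} \sum y_i & \sum x_i \\ \sum x_iy_i & \sum x_i^2 \end{vmatrix}, \qquad
\Delta_b = \begin{vmatrix} n & \sum y_i \\ \sum x_i & \sum x_iy_i \end{vmatrix}, \qquad
a = \frac{\Delta_a}{\Delta}, \quad b = \frac{\Delta_b}{\Delta}
\end{equation*}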
Analytical solution code
# define a function to calculate the coefficients a and b using Cramer's rule
def Kramer_method (x,y):
    # sum of the values (all months)
    sx = sum(x)
    # sum of the true answers (revenue for the whole period)
    sy = sum(y)
    # sum of the products of the values and the true answers
    list_xy = []
    [list_xy.append(x[i]*y[i]) for i in range(len(x))]
    sxy = sum(list_xy)
    # sum of the squared values
    list_x_sq = []
    [list_x_sq.append(x[i]**2) for i in range(len(x))]
    sx_sq = sum(list_x_sq)
    # number of values
    n = len(x)
    # overall determinant
    det = sx_sq*n - sx*sx
    # determinant for a
    det_a = sx_sq*sy - sx*sxy
    # the desired parameter a
    a = (det_a / det)
    # determinant for b
    det_b = sxy*n - sy*sx
    # the desired parameter b
    b = (det_b / det)
    # control values (check)
    check1 = (n*a + b*sx - sy)
    check2 = (a*sx + b*sx_sq - sxy)
    return [round(a,4), round(b,4)]

# run the function and record the answers
ab_us = Kramer_method(x_us,y_us)
a_us = ab_us[0]
b_us = ab_us[1]
print '\033[1m' + '\033[4m' + "Optimal values of the coefficients a and b:" + '\033[0m'
print 'a =', a_us
print 'b =', b_us
print

# define a function to calculate the sum of squared errors
def errors_sq_Kramer_method(answers,x,y):
    list_errors_sq = []
    for i in range(len(x)):
        err = (answers[0] + answers[1]*x[i] - y[i])**2
        list_errors_sq.append(err)
    return sum(list_errors_sq)

# run the function and record the error value
error_sq = errors_sq_Kramer_method(ab_us,x_us,y_us)
print '\033[1m' + '\033[4m' + "Sum of squared deviations" + '\033[0m'
print error_sq
print

# measure the calculation time
# print '\033[1m' + '\033[4m' + "Execution time for calculating the sum of squared deviations:" + '\033[0m'
# %timeit error_sq = errors_sq_Kramer_method(ab_us,x_us,y_us)
Here's what we got:
So, the values of the coefficients have been found, and the sum of squared deviations has been determined. Let's draw a straight line on the scatter plot in accordance with the found coefficients.
Regression line code
# define a function to form an array of calculated revenue values
def sales_count(ab,x,y):
    line_answers = []
    [line_answers.append(ab[0]+ab[1]*x[i]) for i in range(len(x))]
    return line_answers

# plot the charts
print 'Chart No. 2 "Correct and calculated answers"'
plt.plot(x_us,y_us,'o',color='green',markersize=16, label = '$True$ $answers$')
plt.plot(x_us, sales_count(ab_us,x_us,y_us), color='red',lw=4,
         label='$Function: a + bx,$ $where$ $a='+str(round(ab_us[0],2))+',$ $b='+str(round(ab_us[1],2))+'$')
plt.xlabel('$Months$', size=16)
plt.ylabel('$Sales$', size=16)
plt.legend(loc=1, prop={'size': 16})
plt.show()
Chart No. 2 "Correct and calculated answers"
You can also look at the deviation chart for each month. In our case, we will not derive any significant practical value from it, but it will satisfy our curiosity about how well the simple linear regression equation characterizes the dependence of revenue on the month of the year.
Deviation chart code
# define a function to form an array of deviations in percent
def error_per_month(ab,x,y):
    sales_c = sales_count(ab,x,y)
    errors_percent = []
    for i in range(len(x)):
        errors_percent.append(100*(sales_c[i]-y[i])/y[i])
    return errors_percent

# plot the chart
print 'Chart No. 3 "Monthly deviations, %"'
plt.gca().bar(x_us, error_per_month(ab_us,x_us,y_us), color='brown')
plt.xlabel('Months', size=16)
plt.ylabel('Calculation error, %', size=16)
plt.show()
Chart No. 3 "Deviations,%"
Not perfect, but we did our job.
Let's write a function that uses the NumPy library to determine the coefficients a and b. More precisely, we will write two functions: one using the pseudo-inverse matrix (not recommended in practice, since the process is computationally expensive and unstable), the other using a matrix equation.
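In matrix notation, both functions below solve the same least squares normal equation; a brief sketch, with w denoting the vector of coefficients (a, b) and X the feature matrix with a column of ones:

\begin{equation*}
X^T X w = X^T y, \qquad w = (X^T X)^{-1} X^T y
\end{equation*}

The first function builds the (pseudo-)inverse explicitly, while the second asks the solver to handle the system X^T X w = X^T y directly.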
Analytic Solution Code (NumPy)
# first, add a column with a constant value of 1.
# This column is needed so that we do not have to handle the coefficient a separately
vector_1 = np.ones((x_np.shape[0],1))
x_np = table_zero[['x']].values # just in case, restore the vector x_np to its original format
x_np = np.hstack((vector_1,x_np))
# check that everything was done correctly
print vector_1[0:3]
print x_np[0:3]
print '***************************************'
print

# write a function that determines the values of the coefficients a and b using the pseudo-inverse matrix
def pseudoinverse_matrix(X, y):
    # cast the feature matrix to an explicit matrix format
    X = np.matrix(X)
    # compute the transposed matrix
    XT = X.T
    # compute the square matrix
    XTX = XT*X
    # compute the pseudo-inverse matrix
    inv = np.linalg.pinv(XTX)
    # cast the answers matrix to an explicit matrix format
    y = np.matrix(y)
    # find the weight vector
    return (inv*XT)*y

# run the function
ab_np = pseudoinverse_matrix(x_np, y_np)
print ab_np
print '***************************************'
print

# write a function that uses a matrix equation for the solution
def matrix_equation(X,y):
    a = np.dot(X.T, X)
    b = np.dot(X.T, y)
    return np.linalg.solve(a, b)

# run the function
ab_np = matrix_equation(x_np,y_np)
print ab_np
Let's compare the time it took to determine the coefficients a and b by the three methods presented.
Code for measuring the calculation time
print '\033[1m' + '\033[4m' + "Execution time for calculating the coefficients without using the NumPy library:" + '\033[0m'
%timeit ab_us = Kramer_method(x_us,y_us)
print '***************************************'
print
print '\033[1m' + '\033[4m' + "Execution time for calculating the coefficients using the pseudo-inverse matrix:" + '\033[0m'
%timeit ab_np = pseudoinverse_matrix(x_np, y_np)
print '***************************************'
print
print '\033[1m' + '\033[4m' + "Execution time for calculating the coefficients using the matrix equation:" + '\033[0m'
%timeit ab_np = matrix_equation(x_np, y_np)
On a small amount of data, the "self-written" function that finds the coefficients by Cramer's rule comes out ahead.
Now we can move on to other ways of finding the coefficients a and b.
Gradient descent
First, let's define what a gradient is. Put simply, the gradient is a vector that indicates the direction of the steepest growth of a function. By analogy with climbing a mountain, the gradient points to where the steepest climb to the top of the mountain is. Developing the mountain example, remember that we actually need the steepest descent in order to reach the lowland, that is, the minimum, as quickly as possible: the place where the function neither increases nor decreases. At this point the derivative equals zero. Therefore, we need not the gradient, but the anti-gradient. To find the anti-gradient, you just need to multiply the gradient by -1 (minus one).
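In our notation, the gradient of the error function is simply the vector of the two first-order partial derivatives written out earlier, and the anti-gradient is the same vector taken with the opposite sign:

\begin{equation*}
\nabla ERR(a,b) = \left(\frac{\partial ERR}{\partial a},\ \frac{\partial ERR}{\partial b}\right), \qquad
-\nabla ERR(a,b) = \left(-\frac{\partial ERR}{\partial a},\ -\frac{\partial ERR}{\partial b}\right)
\end{equation*}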
Let us note that a function can have several minima, and having descended into one of them using the algorithm proposed below, we will not be able to find another minimum, which may be lower than the one found. Relax, we are not in danger! In our case we are dealing with a single minimum, since our function on the graph is an ordinary parabola. And as we all should know very well from the school mathematics course, a parabola has only one minimum.
After we have figured out why we need the gradient, and also that the gradient is a directed segment, that is, a vector with given coordinates (which are exactly the coefficients a and b), we can implement gradient descent.
Before starting, I suggest reading just a few sentences about the descent algorithm:
- We determine the coordinates of the coefficients a and b pseudo-randomly. In our example, we will define the coefficients near zero. This is a common practice, but each case may have its own practice.
- From the coordinate a we subtract the value of the first-order partial derivative at the point a. So, if the derivative is positive, the function is increasing; therefore, by subtracting the value of the derivative, we move in the direction opposite to growth, that is, in the direction of descent. If the derivative is negative, the function at this point is decreasing and, by subtracting the value of the derivative, we again move in the direction of descent.
- We carry out a similar operation with the coordinate b: we subtract the value of the partial derivative at the point b.
- In order not to jump over the minimum and not fly off into deep space, it is necessary to set the step size in the direction of descent. In general, one could write an entire article on how to set the step correctly and how to change it during descent in order to reduce the cost of calculation. But now we have a slightly different task, and we will establish the step size by the scientific method of "poking", or, as they say among the people, empirically.
- Once we have subtracted the values of the derivatives from the given coordinates a and b, we get new coordinates a and b. We take the next step (subtraction) from the already calculated coordinates, and so the cycle runs again and again until the required convergence is reached (the update rule is written out just below).
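Written out in the notation used earlier, and with the 1/n scaling that the code below applies (the constant factor 2 from differentiation is folded into the step size λ here), one step of the descent looks roughly like this:

\begin{equation*}
a := a - \lambda \cdot \frac{1}{n}\left(na + b\sum\limits_{i=1}^{n}x_i - \sum\limits_{i=1}^{n}y_i\right), \qquad
b := b - \lambda \cdot \frac{1}{n}\left(a\sum\limits_{i=1}^{n}x_i + b\sum\limits_{i=1}^{n}x_i^2 - \sum\limits_{i=1}^{n}x_iy_i\right)
\end{equation*}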
All! Now we are ready to go in search of the deepest gorge of the Mariana Trench. Let's get started.
Gradient Descent Code
# write a gradient descent function without using the NumPy library.
# The function takes as input the x and y value ranges, the step size (default=0.1)
# and the allowed tolerance
def gradient_descent_usual(x_us,y_us,l=0.1,tolerance=0.000000000001):
    # sum of the values (all months)
    sx = sum(x_us)
    # sum of the true answers (revenue for the whole period)
    sy = sum(y_us)
    # sum of the products of the values and the true answers
    list_xy = []
    [list_xy.append(x_us[i]*y_us[i]) for i in range(len(x_us))]
    sxy = sum(list_xy)
    # sum of the squared values
    list_x_sq = []
    [list_x_sq.append(x_us[i]**2) for i in range(len(x_us))]
    sx_sq = sum(list_x_sq)
    # number of values
    num = len(x_us)
    # initial values of the coefficients, determined pseudo-randomly
    a = float(random.uniform(-0.5, 0.5))
    b = float(random.uniform(-0.5, 0.5))
    # create an array of errors; we use the values 1 and 0 to start
    # and remove these starting values after the descent is complete
    errors = [1,0]
    # start the descent loop
    # the loop runs until the difference between the last and the previous
    # sum-of-squares error is less than the tolerance
    while abs(errors[-1]-errors[-2]) > tolerance:
        a_step = a - l*(num*a + b*sx - sy)/num
        b_step = b - l*(a*sx + b*sx_sq - sxy)/num
        a = a_step
        b = b_step
        ab = [a,b]
        errors.append(errors_sq_Kramer_method(ab,x_us,y_us))
    return (ab),(errors[2:])

# record the array of values
list_parametres_gradient_descence = gradient_descent_usual(x_us,y_us,l=0.1,tolerance=0.000000000001)
print '\033[1m' + '\033[4m' + "Values of the coefficients a and b:" + '\033[0m'
print 'a =', round(list_parametres_gradient_descence[0][0],3)
print 'b =', round(list_parametres_gradient_descence[0][1],3)
print
print '\033[1m' + '\033[4m' + "Sum of squared deviations:" + '\033[0m'
print round(list_parametres_gradient_descence[1][-1],3)
print
print '\033[1m' + '\033[4m' + "Number of iterations in gradient descent:" + '\033[0m'
print len(list_parametres_gradient_descence[1])
print
We dived to the very bottom of the Mariana Trench and there found the very same values of the coefficients a and b, which is in fact what we should have expected.
Let's make another dive, only this time our deep-sea vehicle will be filled with other technology, namely the NumPy library.
Gradient Descent Code (NumPy)
# before defining the gradient descent function that uses the NumPy library,
# let's write a function that calculates the sum of squared deviations, also using NumPy
def error_square_numpy(ab,x_np,y_np):
    y_pred = np.dot(x_np,ab)
    error = y_pred - y_np
    return sum((error)**2)

# write a gradient descent function using the NumPy library.
# The function takes as input the x and y value ranges, the step size (default=0.1)
# and the allowed tolerance
def gradient_descent_numpy(x_np,y_np,l=0.1,tolerance=0.000000000001):
    # sum of the values (all months)
    sx = float(sum(x_np[:,1]))
    # sum of the true answers (revenue for the whole period)
    sy = float(sum(y_np))
    # sum of the products of the values and the true answers
    sxy = x_np*y_np
    sxy = float(sum(sxy[:,1]))
    # sum of the squared values
    sx_sq = float(sum(x_np[:,1]**2))
    # number of values
    num = float(x_np.shape[0])
    # initial values of the coefficients, determined pseudo-randomly
    a = float(random.uniform(-0.5, 0.5))
    b = float(random.uniform(-0.5, 0.5))
    # create an array of errors; we use the values 1 and 0 to start
    # and remove these starting values after the descent is complete
    errors = [1,0]
    # start the descent loop
    # the loop runs until the difference between the last and the previous
    # sum-of-squares error is less than the tolerance
    while abs(errors[-1]-errors[-2]) > tolerance:
        a_step = a - l*(num*a + b*sx - sy)/num
        b_step = b - l*(a*sx + b*sx_sq - sxy)/num
        a = a_step
        b = b_step
        ab = np.array([[a],[b]])
        errors.append(error_square_numpy(ab,x_np,y_np))
    return (ab),(errors[2:])

# record the array of values
list_parametres_gradient_descence = gradient_descent_numpy(x_np,y_np,l=0.1,tolerance=0.000000000001)
print '\033[1m' + '\033[4m' + "Values of the coefficients a and b:" + '\033[0m'
print 'a =', round(list_parametres_gradient_descence[0][0],3)
print 'b =', round(list_parametres_gradient_descence[0][1],3)
print
print '\033[1m' + '\033[4m' + "Sum of squared deviations:" + '\033[0m'
print round(list_parametres_gradient_descence[1][-1],3)
print
print '\033[1m' + '\033[4m' + "Number of iterations in gradient descent:" + '\033[0m'
print len(list_parametres_gradient_descence[1])
print
The values of the coefficients a and b are unchanged.
Let's look at how the error changed during gradient descent, that is, how the sum of the squared deviations changed with each step.
Code for the sum of squared deviations plot
print 'Chart No. 4 "Sum of squared deviations step by step"'
plt.plot(range(len(list_parametres_gradient_descence[1])), list_parametres_gradient_descence[1], color='red', lw=3)
plt.xlabel('Steps (Iteration)', size=16)
plt.ylabel('Sum of squared deviations', size=16)
plt.show()
Chart #4 "Sum of Squared Deviations in Gradient Descent"
On the graph, we see that the error decreases with each step, and after a certain number of iterations, we observe an almost horizontal line.
Finally, let's evaluate the difference in code execution time:
Code for timing gradient descent computation
print '\033[1m' + '\033[4m' + "Execution time of gradient descent without using the NumPy library:" + '\033[0m'
%timeit list_parametres_gradient_descence = gradient_descent_usual(x_us,y_us,l=0.1,tolerance=0.000000000001)
print '***************************************'
print
print '\033[1m' + '\033[4m' + "Execution time of gradient descent using the NumPy library:" + '\033[0m'
%timeit list_parametres_gradient_descence = gradient_descent_numpy(x_np,y_np,l=0.1,tolerance=0.000000000001)
Perhaps we are doing something wrong, but once again the simple "self-written" function that does not use the NumPy library beats the calculation time of the function that uses NumPy.
But we are not standing still; we are moving on to another exciting way of solving the simple linear regression equation. Meet it!
Stochastic Gradient Descent
In order to quickly understand how stochastic gradient descent works, it is better to define its differences from ordinary gradient descent. In the case of gradient descent, the derivative equations for a and b used the sums of the values of all the features and true answers available in the sample (that is, the sums of all x_i and y_i). In stochastic gradient descent, we will not use all the values in the sample; instead, we will pseudo-randomly choose a so-called sample index and use its values.
For example, if the index turns out to be the number 3 (three), then we take the values x_3 and y_3, substitute them into the derivative equations and determine new coordinates. Then, having determined the coordinates, we again pseudo-randomly determine a sample index, substitute the values corresponding to that index into the partial derivative equations, and determine the coordinates a and b in a new way, and so on until convergence is reached. At first glance, it may not seem like this can work at all, but it does. True, it is worth noting that the error does not decrease with every step, but a downward trend is certainly there.
What are the advantages of stochastic gradient descent over conventional gradient descent? If our sample size is very large and is measured in tens of thousands of values, then it is much easier to process, say, a random thousand of them, than the entire sample. This is where stochastic gradient descent kicks in. In our case, of course, we will not notice a big difference.
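For a single pseudo-randomly chosen index k, one stochastic step can be sketched as follows (again with the constant factor 2 folded into the step size λ, matching the code below):

\begin{equation*}
a := a - \lambda\left(a + bx_k - y_k\right), \qquad b := b - \lambda\, x_k\left(a + bx_k - y_k\right)
\end{equation*}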
Let's look at the code.
Code for Stochastic Gradient Descent
# define the stochastic gradient step function
def stoch_grad_step_usual(vector_init, x_us, ind, y_us, l):
    # select the x value that corresponds to the random value of the parameter ind
    # (see the stoch_grad_descent_usual function)
    x = x_us[ind]
    # calculate the y value (revenue) that corresponds to the selected x value
    y_pred = vector_init[0] + vector_init[1]*x_us[ind]
    # calculate the error of the estimated revenue relative to the one in the sample
    error = y_pred - y_us[ind]
    # determine the first coordinate of the gradient ab
    grad_a = error
    # determine the second coordinate of ab
    grad_b = x_us[ind]*error
    # calculate the new vector of coefficients
    vector_new = [vector_init[0]-l*grad_a, vector_init[1]-l*grad_b]
    return vector_new

# define the stochastic gradient descent function
def stoch_grad_descent_usual(x_us, y_us, l=0.1, steps = 800):
    # set the initial values of the coefficients before the descent starts
    vector_init = [float(random.uniform(-0.5, 0.5)), float(random.uniform(-0.5, 0.5))]
    errors = []
    # start the descent loop
    # the loop runs for a fixed number of steps (steps)
    for i in range(steps):
        ind = random.choice(range(len(x_us)))
        new_vector = stoch_grad_step_usual(vector_init, x_us, ind, y_us, l)
        vector_init = new_vector
        errors.append(errors_sq_Kramer_method(vector_init,x_us,y_us))
    return (vector_init),(errors)

# record the array of values
list_parametres_stoch_gradient_descence = stoch_grad_descent_usual(x_us, y_us, l=0.1, steps = 800)
print '\033[1m' + '\033[4m' + "Values of the coefficients a and b:" + '\033[0m'
print 'a =', round(list_parametres_stoch_gradient_descence[0][0],3)
print 'b =', round(list_parametres_stoch_gradient_descence[0][1],3)
print
print '\033[1m' + '\033[4m' + "Sum of squared deviations:" + '\033[0m'
print round(list_parametres_stoch_gradient_descence[1][-1],3)
print
print '\033[1m' + '\033[4m' + "Number of iterations in stochastic gradient descent:" + '\033[0m'
print len(list_parametres_stoch_gradient_descence[1])
We look carefully at the coefficients and catch ourselves asking "How come?". We got different values of the coefficients a and b. Maybe stochastic gradient descent found more optimal parameters for the equation? Unfortunately, no. It is enough to look at the sum of squared deviations to see that with the new values of the coefficients the error is larger. We are in no hurry to despair. Let's plot how the error changed.
Code for plotting the sum of squared deviations in stochastic gradient descent
print 'Chart No. 5 "Sum of squared deviations step by step"'
plt.plot(range(len(list_parametres_stoch_gradient_descence[1])), list_parametres_stoch_gradient_descence[1], color='red', lw=2)
plt.xlabel('Steps (Iteration)', size=16)
plt.ylabel('Sum of squared deviations', size=16)
plt.show()
Chart #5 "Sum of Squared Deviations in Stochastic Gradient Descent"
After looking at the chart, everything falls into place, and now we will fix everything.
So what happened? The following happened. When we randomly select a month, it is for the selected month that our algorithm seeks to reduce the error in the revenue calculation. Then we select another month and repeat the calculation, but now we reduce the error for the second selected month. Now remember that in our data the first two months deviate significantly from the simple linear regression line. This means that whenever either of these two months is chosen, by reducing the error for it, our algorithm seriously increases the error over the entire sample. So what to do? The answer is simple: we need to reduce the descent step. Indeed, by reducing the descent step, the error will stop "jumping" up and down. Or rather, the "jumping" of the error will not stop, but it will not happen so quickly :) Let's check.
Code to run SGD with a smaller step
# run the function with the step reduced 100-fold and the number of steps increased accordingly
list_parametres_stoch_gradient_descence = stoch_grad_descent_usual(x_us, y_us, l=0.001, steps = 80000)
print '\033[1m' + '\033[4m' + "Values of the coefficients a and b:" + '\033[0m'
print 'a =', round(list_parametres_stoch_gradient_descence[0][0],3)
print 'b =', round(list_parametres_stoch_gradient_descence[0][1],3)
print
print '\033[1m' + '\033[4m' + "Sum of squared deviations:" + '\033[0m'
print round(list_parametres_stoch_gradient_descence[1][-1],3)
print
print '\033[1m' + '\033[4m' + "Number of iterations in stochastic gradient descent:" + '\033[0m'
print len(list_parametres_stoch_gradient_descence[1])

print 'Chart No. 6 "Sum of squared deviations step by step"'
plt.plot(range(len(list_parametres_stoch_gradient_descence[1])), list_parametres_stoch_gradient_descence[1], color='red', lw=2)
plt.xlabel('Steps (Iteration)', size=16)
plt.ylabel('Sum of squared deviations', size=16)
plt.show()
Chart #6 "Sum of squared deviations for stochastic gradient descent (80k steps)"
The coefficients have improved but are still not ideal. Hypothetically, this can be corrected as follows: we select, for example, from the last 1000 iterations the values of the coefficients with which the minimum error was achieved. True, for this we would have to record the values of the coefficients themselves. We will not do this here, but rather pay attention to the chart. It looks smooth and the error seems to decrease evenly. Actually, this is not the case. Let's look at the first 1000 iterations and compare them with the last ones.
Code for SGD chart (first 1000 steps)
print 'Chart No. 7 "Sum of squared deviations step by step. First 1000 iterations"'
plt.plot(range(len(list_parametres_stoch_gradient_descence[1][:1000])),
         list_parametres_stoch_gradient_descence[1][:1000], color='red', lw=2)
plt.xlabel('Steps (Iteration)', size=16)
plt.ylabel('Sum of squared deviations', size=16)
plt.show()

print 'Chart No. 8 "Sum of squared deviations step by step. Last 1000 iterations"'
plt.plot(range(len(list_parametres_stoch_gradient_descence[1][-1000:])),
         list_parametres_stoch_gradient_descence[1][-1000:], color='red', lw=2)
plt.xlabel('Steps (Iteration)', size=16)
plt.ylabel('Sum of squared deviations', size=16)
plt.show()
Chart No. 7 "Sum of squared deviations of SGD (first 1000 steps)"
Chart #8 "Sum of squared deviations of SGD (last 1000 steps)"
At the very beginning of the descent, we observe a fairly uniform and steep decrease in the error. In the last iterations, we see that the error circles around the value of 1.475 and at some moments even equals this optimal value, but then it still goes up... I repeat, you could record the values of the coefficients a and b and then choose those for which the error is minimal. However, we had a bigger problem: we had to take 80 thousand steps (see code) to get values close to optimal. And this already contradicts the idea of saving computation time with stochastic gradient descent relative to gradient descent. What can be corrected and improved? It is not hard to notice that in the first iterations we are confidently going down and, therefore, we should keep a large step in the first iterations and reduce the step as we move forward. We will not do this in this article - it has already dragged on. Those who wish can think for themselves how to do it; it is not difficult :) (one possible sketch is given just below).
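For those who want to experiment, here is one possible sketch (an illustration only, not used in the article's measurements): the step shrinks as the iterations go on, and along the way we keep the coefficients with the smallest error seen so far. It reuses the functions stoch_grad_step_usual and errors_sq_Kramer_method defined above; the decay schedule l0/(1 + i*decay) and the parameter values are arbitrary assumptions, not recommendations.

import random

# a hypothetical variant of stoch_grad_descent_usual with a decaying step
# and with tracking of the best coefficients seen during the descent
def stoch_grad_descent_decay(x_us, y_us, l0=0.1, decay=0.001, steps=80000):
    # pseudo-random initial coefficients, as in the original function
    vector_init = [float(random.uniform(-0.5, 0.5)), float(random.uniform(-0.5, 0.5))]
    best_vector = list(vector_init)
    best_error = errors_sq_Kramer_method(vector_init, x_us, y_us)
    errors = []
    for i in range(steps):
        # the step gets smaller as the iteration number grows
        l = l0 / (1.0 + i*decay)
        ind = random.choice(range(len(x_us)))
        vector_init = stoch_grad_step_usual(vector_init, x_us, ind, y_us, l)
        err = errors_sq_Kramer_method(vector_init, x_us, y_us)
        errors.append(err)
        # remember the coefficients with the smallest error seen so far
        if err < best_error:
            best_error = err
            best_vector = list(vector_init)
    return (best_vector), (errors)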
Now let's perform stochastic gradient descent using the library NumPy (and let's not trip over the rocks we've identified earlier)
Code for Stochastic Gradient Descent (NumPy)
# first, let's write the gradient step function
def stoch_grad_step_numpy(vector_init, X, ind, y, l):
    x = X[ind]
    y_pred = np.dot(x,vector_init)
    err = y_pred - y[ind]
    grad_a = err
    grad_b = x[1]*err
    return vector_init - l*np.array([grad_a, grad_b])

# define the stochastic gradient descent function
def stoch_grad_descent_numpy(X, y, l=0.1, steps = 800):
    vector_init = np.array([[np.random.randint(X.shape[0])], [np.random.randint(X.shape[0])]])
    errors = []
    for i in range(steps):
        ind = np.random.randint(X.shape[0])
        new_vector = stoch_grad_step_numpy(vector_init, X, ind, y, l)
        vector_init = new_vector
        errors.append(error_square_numpy(vector_init,X,y))
    return (vector_init), (errors)

# record the array of values
list_parametres_stoch_gradient_descence = stoch_grad_descent_numpy(x_np, y_np, l=0.001, steps = 80000)
print '\033[1m' + '\033[4m' + "Values of the coefficients a and b:" + '\033[0m'
print 'a =', round(list_parametres_stoch_gradient_descence[0][0],3)
print 'b =', round(list_parametres_stoch_gradient_descence[0][1],3)
print
print '\033[1m' + '\033[4m' + "Sum of squared deviations:" + '\033[0m'
print round(list_parametres_stoch_gradient_descence[1][-1],3)
print
print '\033[1m' + '\033[4m' + "Number of iterations in stochastic gradient descent:" + '\033[0m'
print len(list_parametres_stoch_gradient_descence[1])
print
The values turned out to be almost the same as in the descent without NumPy, which is logical.
Let's find out how much time stochastic gradient descents took us.
Code to determine SGD calculation time (80k steps)
print '\033[1m' + '\033[4m' + "Execution time of stochastic gradient descent without using the NumPy library:" + '\033[0m'
%timeit list_parametres_stoch_gradient_descence = stoch_grad_descent_usual(x_us, y_us, l=0.001, steps = 80000)
print '***************************************'
print
print '\033[1m' + '\033[4m' + "Execution time of stochastic gradient descent using the NumPy library:" + '\033[0m'
%timeit list_parametres_stoch_gradient_descence = stoch_grad_descent_numpy(x_np, y_np, l=0.001, steps = 80000)
The deeper into the forest, the darker it gets: once again, the "self-written" function shows the best result. All this suggests that there must be even more subtle ways of using the NumPy library that really do speed up computation. We will not get to know them in this article. Something to think about at your leisure :)
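As a small hint of what such "more subtle ways" might look like, here is a hypothetical micro-optimization (not used in the measurements above): the error function from the gradient descent section iterates over a NumPy array with Python's built-in sum, and replacing that with fully vectorized NumPy calls is usually noticeably faster.

# a hypothetical, more vectorized variant of error_square_numpy:
# np.sum processes the whole array at C speed instead of iterating in Python
def error_square_numpy_vec(ab, x_np, y_np):
    error = np.dot(x_np, ab) - y_np    # residuals for all observations at once
    return float(np.sum(error**2))     # vectorized sum of squares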
Let's summarize
Before summarizing, I would like to answer a question that most likely arose for our dear reader. Why, in fact, such "torments" with descents, why do we need to walk up and down the mountain (mostly down) to find the treasured lowland, if we have such a powerful and simple device in our hands, in the form of an analytical solution that instantly teleports us to the right place?
The answer to this question lies on the surface. We have now analyzed a very simple example in which the true answer y depends on a single feature x. You do not see this often in life, so let's imagine that we have 2, 30, 50 or more features. Add to this thousands, or even tens of thousands, of values for each feature. In this case, the analytical solution may not pass the test and may fail. In turn, gradient descent and its variations will slowly but surely bring us closer to the goal - the minimum of the function. And do not worry about speed: we will probably still look at ways that allow us to set and adjust the step length (that is, the speed).
And now for a short summary.
Firstly, I hope that the material presented in the article will help beginner "data scientists" in understanding how to solve simple (and not only) linear regression equations.
Second, we looked at several ways to solve the equation. Now, depending on the situation, we can choose the one that is best suited for the task at hand.
Thirdly, we saw the power of additional settings, namely the step length of gradient descent. This parameter must not be neglected. As noted above, in order to reduce the cost of performing calculations, the step length should be changed during the descent.
Fourth, in our case, the "self-written" functions showed the best calculation time. This is probably due to not the most professional use of the capabilities of the NumPy library. But be that as it may, the following conclusion suggests itself: on the one hand, sometimes it is worth questioning established opinions, and on the other hand, it is not always worth complicating everything; on the contrary, sometimes a simpler way of solving a problem is more effective. And since our goal was to analyze three approaches to solving a simple linear regression equation, the use of "self-written" functions was quite enough for us.
Source: habr.com