No, well, of course I'm not serious. There is a limit to how far the subject can be simplified. But for the first steps — grasping the basic concepts and quickly "getting into" the topic — it may be acceptable. As for how to title this material properly (options: "Machine learning for dummies", "Data analysis from the cradle", "Algorithms for the littlest ones"), we will discuss that at the end.
Down to business. I wrote several small applications in MS Excel to visualize the processes that occur in various machine learning methods during data analysis. Seeing is believing, after all, as the bearers of the culture that produced most of these methods say (though not all of the methods: the most powerful one, the "support vector machine", or SVM, is an invention of our compatriot Vladimir Vapnik, Moscow Institute of Management. 1963, no less! Now, however, he teaches and works in the USA).
1. Clustering by k-means
Problems of this kind belong to "unsupervised learning": we need to split the source data into a number of categories known in advance, but we are not given any "correct answers"; we must extract them from the data itself. The fundamental classical problem of finding subspecies of iris flowers (Ronald Fisher, 1936!), which is considered the first landmark of this field of knowledge, is exactly of this nature.
The method is quite simple. We have a set of objects represented as vectors (sets of N numbers). In the iris problem, these are sets of 4 numbers characterizing the flower: the length and width of the outer and inner perianth segments, respectively.
Next, the cluster centers are chosen arbitrarily (or not arbitrarily, see below), and the distance from each object to each center is computed. At a given iteration step, each object is marked as belonging to the nearest center. Then the center of each cluster is moved to the arithmetic mean of the coordinates of its members (by analogy with physics, this is also called the "center of mass"), and the procedure is repeated.
The process converges fairly quickly. In two dimensions, the pictures look like this:
1. Initial random distribution of points on the plane and the number of clusters
2. Setting the centers of clusters and assigning points to their clusters
3. Moving the cluster centers and recalculating the point assignments until the centers stabilize. The trajectory of each cluster center's movement to its final position is visible.
At any time, you can set new cluster centers (without generating a new distribution of points!) and see that the partitioning process is not always unambiguous. Mathematically, this means that for the function being optimized (the sum of squared distances from the points to the centers of their clusters) we find a local rather than a global minimum. This problem can be overcome either by a non-random choice of the initial cluster centers, or by enumerating candidate centers (it is sometimes advantageous to place them exactly at one of the points, since then at least we are guaranteed not to get empty clusters). In any case, a finite set always has an infimum.
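The loop described above fits in a few lines of code. Here is a minimal sketch in Python with NumPy (the function name, the stopping test, and the initialization at k distinct data points are my own choices for illustration, not taken from the spreadsheet):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Assign each point to its nearest center, then move every center to
    the arithmetic mean of its members; repeat until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # Start the centers at k distinct data points, which at least
    # guarantees that no cluster is empty on the first pass.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center (n_points x k matrix).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # the centers have stabilized
            break
        centers = new_centers
    return centers, labels
```

Initializing at actual data points is exactly the trick mentioned above: it rules out empty clusters at the start, although it still does not guarantee a global minimum.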
A description of the method is available on Wikipedia.
2. Polynomial approximation and data splitting. Overfitting
The remarkable scientist and popularizer of data science K. V. Vorontsov briefly describes machine learning as "the science of drawing curves through points." In this example, we will find a pattern in the data using the least squares method.
The technique of splitting the initial data into "training" and "control" sets is shown, along with the phenomenon of overfitting, i.e. over-adjustment to the data. With a correct approximation, we get some error on the training data and a slightly larger error on the control data. With an incorrect one, we get a precise fit to the training data and a huge error on the control data.
(It is a well-known fact that through N points one can draw a single curve of degree N−1, and in the general case this does not give the desired result.)
1. Set the initial distribution
2. We divide the points into "training" and "control" in the ratio of 70 to 30.
3. We draw an approximating curve through the training points and see the error it gives on the control data
4. We draw an exact curve through the training points and see a monstrous error on the control data (and zero on the training data, but what's the point?)
Of course, the simplest version is shown here, with a single split into "training" and "control" subsets; in the general case, the split is performed repeatedly for the best tuning of the coefficients.
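Steps 2–4 can be sketched as follows (a hypothetical illustration with NumPy rather than Excel; the underlying noisy quadratic, the noise level, and the 70/30 split are assumptions for the demo, not the spreadsheet's data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy samples of an assumed underlying quadratic.
x = np.linspace(-3, 3, 30)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=x.size)

# Split the points into "training" and "control" in the ratio of 70 to 30.
idx = rng.permutation(x.size)
train, control = idx[:21], idx[21:]

def control_error(degree):
    """Fit a polynomial of the given degree by least squares on the training
    points; return the mean squared error on the control points."""
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[control])
    return np.mean((pred - y[control]) ** 2)

print(control_error(2))   # modest error: the model matches the data
print(control_error(20))  # huge error: the degree-20 curve "memorizes" the 21 training points
```

The degree-20 fit is exactly the "curve of degree N−1 through N points" from the parenthetical above: near-zero training error, monstrous control error.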
3. Gradient descent and error dynamics
Here we have a 4-dimensional case and linear regression. The linear regression coefficients are determined step by step by the gradient descent method; initially, all coefficients are zero. A separate graph shows the dynamics of the error decreasing as the coefficients are tuned ever more finely. All four 2D projections can be viewed.
If we set the gradient descent step too large, we will overshoot the minimum each time and reach the result in a greater number of steps, although in the end we will still arrive (unless we overdo the step size entirely, in which case the algorithm "falls apart"). And the graph of the error as a function of the iteration number will be not smooth but "jerky".
1. Generate data, set the gradient descent step
2. With a correct choice of the gradient descent step, we come to the minimum smoothly and quickly enough
3. If the gradient descent step is chosen too large, we overshoot the minimum, the error graph is "jerky", and convergence takes a greater number of steps
and
4. With a completely wrong choice of the gradient descent step, we move away from the minimum
(To reproduce the process at the gradient descent step values shown in the pictures, check the "reference data" box.)
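The whole experiment can be reproduced in a few lines (a sketch on synthetic data: the true coefficients, the noise level, and the specific step values below are my assumptions, not the spreadsheet's):

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic 4-feature regression problem with known coefficients.
true_w = np.array([1.5, -2.0, 0.7, 3.0])
X = rng.normal(size=(200, 4))
y = X @ true_w + rng.normal(scale=0.1, size=200)

def descend(step, n_steps=200):
    """Gradient descent on the mean squared error; coefficients start at zero.
    Returns the final coefficients and the error at every iteration."""
    w = np.zeros(4)
    errors = []
    for _ in range(n_steps):
        resid = X @ w - y
        errors.append(np.mean(resid ** 2))          # error before this update
        w -= step * (2.0 / len(y)) * X.T @ resid    # step along the MSE gradient
    return w, errors

w1, e1 = descend(0.05)  # well-chosen step: the error falls smoothly
w2, e2 = descend(0.6)   # too large: overshoots the minimum along the steepest
                        # directions at each step, though it still converges
w3, e3 = descend(1.5)   # far too large: every step increases the error,
                        # and we move away from the minimum
```

Plotting each `errors` list against the iteration number reproduces the three error-dynamics pictures: smooth descent, "jerky" descent, and divergence.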
Dear community: is such a simplification and manner of presenting the material acceptable? Is it worth translating the article into English?
Source: habr.com