Neural networks. Where is it all heading

The article consists of two parts:

  1. A brief overview of several neural network architectures for object detection and image segmentation, with links to the resources I found most understandable. I tried to choose video explanations, preferably in Russian.
  2. An attempt to understand where neural network architectures, and the technologies built on them, are heading.


Figure 1 - Understanding the architecture of neural networks is not easy

It all started with making two demo apps for classifying and detecting objects on an Android phone:

  • back-end demo: the data is processed on a server and the results are sent to the phone. Image classification of three kinds of bears: brown, black, and teddy.
  • front-end demo: the data is processed on the phone itself. Object detection of three kinds of objects: hazelnut, fig, and date.

There is a difference between the tasks of image classification, object detection in an image, and image segmentation. Therefore, it became necessary to find out which neural network architectures detect objects in images and which ones can segment them. Here are the architectures I looked at, with the resource links I found most understandable:

  • A series of architectures based on R-CNN (Regions with Convolutional Neural Network features): R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN. To detect an object in an image, bounding boxes are proposed by the Region Proposal Network (RPN) mechanism; initially, the slower Selective Search mechanism was used instead of RPN. The proposed regions are then fed into a regular convolutional neural network for classification. The R-CNN architecture contains explicit "for" loops over the proposed regions, up to 2000 passes in total through the internal AlexNet network, and explicit "for" loops slow down image processing. With each new version of the architecture, the number of explicit loops (passes through the internal network) decreases, and dozens of other changes are made to increase speed and, in Mask R-CNN, to replace object detection with object segmentation.
  • YOLO (You Only Look Once) is the first neural network to recognize objects in real time on mobile devices. Its distinctive feature: objects are detected in a single pass (it is enough to look at the image once), that is, there are no explicit "for" loops in the YOLO architecture, which makes the network fast. An analogy: NumPy matrix operations also have no explicit "for" loops; they are implemented at lower levels of the architecture in the C programming language. YOLO uses a grid of predefined windows, and to keep the same object from being detected several times, the window overlap measure IoU (Intersection over Union) is used (a minimal IoU sketch is given right after this list). The architecture works over a wide range of conditions and is highly robust: a model can be trained on photographs and still perform well on hand-drawn paintings.
  • SSD (Single Shot MultiBox Detector) - the most successful "hacks" of the YOLO architecture are reused (for example, non-maximum suppression; a sketch of it also follows this list) and new ones are added to make the network faster and more accurate. Distinctive feature: objects are detected in a single pass using a given grid of windows (default boxes) over a pyramid of images. The image pyramid is encoded in the convolution tensors by successive convolution and pooling operations (the max-pooling operation reduces the spatial dimension). In this way, both large and small objects are detected in one pass of the network.
  • Mobile SSD (MobileNetV2 + SSD) is a combination of two neural network architectures. The first network, MobileNetV2, runs fast and increases recognition accuracy; it replaces VGG-16, which was used in the original paper. The second network, SSD, determines the location of objects in the image.
  • SqueezeNet – a very small but accurate neural network. By itself it does not solve the object detection problem, but it can be used in combination with other architectures and on mobile devices. Its distinguishing feature: the data is first squeezed by 1x1 convolution filters and then expanded by a mix of 1x1 and 3x3 convolution filters. One such squeeze-expand iteration is called a "Fire Module" (a Keras sketch of it follows this list).
  • DeepLab (Semantic Image Segmentation with Deep Convolutional Nets) - segmentation of objects in the image. The distinctive feature of the architecture is dilated (atrous) convolution, which preserves spatial resolution (a short snippet follows this list). It is followed by a post-processing stage that uses a graphical probabilistic model (a conditional random field), which removes small noise in the segmentation and improves the quality of the segmented image. Behind the formidable name "graphical probabilistic model" is an ordinary Gaussian filter approximated by five points.
  • I tried to figure out how RefineDet (Single-Shot Refinement Neural Network for Object Detection) works, but understood little.
  • I also looked at how the attention mechanism works: video1, video2, video3. A distinctive feature of "attention" architectures is the automatic selection of regions of increased attention in the image (RoI, Regions of Interest) by a neural network called the Attention Unit. Regions of increased attention are similar to bounding boxes but, unlike them, are not fixed in the image and may have blurry borders. Features are then extracted from these regions and fed into recurrent neural networks with LSTM, GRU, or vanilla RNN architectures. Recurrent neural networks can analyze the relationships between features in a sequence; they were originally used for translating text into other languages, and now for translating images to text and text to images.
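
The IoU overlap measure mentioned in the YOLO and SSD items above is simple enough to write out. Below is a minimal sketch in plain Python; the (x1, y1, x2, y2) corner format of the boxes is my assumption for illustration, not something fixed by the architectures themselves.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2).

    The corner-coordinate format is an assumption of this sketch; real
    detectors also use (center_x, center_y, width, height) and other encodings.
    """
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143: small overlap
```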
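
Non-maximum suppression, named in the SSD item as one of the borrowed "hacks", also fits in a few lines. This is a plain greedy version for illustration (it reuses the iou() helper from the previous sketch); production code would normally call a built-in such as tf.image.non_max_suppression instead.

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    # Box indices sorted by score, best first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept one above the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```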
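
The SqueezeNet "Fire Module" is essentially two convolution stages, so it can be sketched directly in Keras. The filter counts below are illustrative assumptions; in SqueezeNet itself they differ from module to module.

```python
import tensorflow as tf
from tensorflow.keras import layers


def fire_module(x, squeeze_filters=16, expand_filters=64):
    """Squeeze with 1x1 convolutions, then expand with parallel 1x1 and 3x3 convolutions."""
    squeezed = layers.Conv2D(squeeze_filters, (1, 1), activation="relu")(x)
    expand_1x1 = layers.Conv2D(expand_filters, (1, 1), activation="relu")(squeezed)
    expand_3x3 = layers.Conv2D(expand_filters, (3, 3), padding="same", activation="relu")(squeezed)
    return layers.Concatenate()([expand_1x1, expand_3x3])


inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(96, (7, 7), strides=2, activation="relu")(inputs)
x = fire_module(x)
model = tf.keras.Model(inputs, x)
model.summary()  # the squeeze stage keeps the parameter count small
```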
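
The dilated (atrous) convolution at the heart of DeepLab is exposed as a parameter of an ordinary convolution layer. A minimal Keras illustration, with sizes chosen only for the example:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A 3x3 convolution with dilation_rate=2 "sees" a 5x5 neighborhood without
# adding parameters; with padding="same" and stride 1 the spatial resolution
# is preserved, which is the point of atrous convolution in DeepLab.
inputs = tf.keras.Input(shape=(256, 256, 3))
outputs = layers.Conv2D(64, (3, 3), dilation_rate=2, padding="same", activation="relu")(inputs)
print(outputs.shape)  # (None, 256, 256, 64): same spatial size as the input
```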

As I explored these architectures, I realized that I didn't understand anything. And it's not that my own neural network has problems with the attention mechanism. Creating all these architectures looks like one huge hackathon where the authors compete in hacks. A hack is a quick solution to a difficult software problem. That is, there is no visible, understandable logical connection between all these architectures. All that unites them is a set of the most successful hacks that they borrow from each other, plus the common operation of error backpropagation. No systems thinking! It is not clear what to change or how to optimize the existing achievements.

Because of this lack of logical connection, the hacks are extremely difficult to remember and apply in practice. This is fragmented knowledge. At best, a few interesting and unexpected moments stick, but most of what was understood (and what wasn't) disappears from memory within a few days. It will be good if in a week at least the name of the architecture is still remembered. Yet hours and even days of working time were spent reading articles and watching overview videos!


Figure 2 - Zoo of neural networks

Most authors of scientific papers, in my personal opinion, do everything possible to make sure that even this fragmented knowledge is not understood by the reader. But ten-line sentences full of participial phrases, with formulas pulled out of thin air, are a topic for a separate article (the "publish or perish" problem).

For this reason, it became necessary to systematize the information about neural networks and thus improve the quality of understanding and memorization. So the main theme of this analysis of individual technologies and architectures of artificial neural networks became the following task: to find out where it is all heading, rather than how any particular neural network works in isolation.

Where is it all heading. Main results:

  • The number of machine learning startups has fallen sharply over the last two years. A possible reason: "neural networks are no longer something new."
  • Anyone will be able to create a working neural network for a simple problem. To do this, they take a ready-made model from a "model zoo" and retrain its last layer (transfer learning) on ready-made data from Google Dataset Search or from the 25 thousand Kaggle datasets, in a free cloud Jupyter notebook (a minimal transfer-learning sketch is given after this list).
  • Major producers of neural networks have begun to create "model zoos". Using them, you can quickly build a commercial application: TF Hub for TensorFlow, MMDetection for PyTorch, Detectron for Caffe2, chainer-modelzoo for Chainer, and others.
  • Neural networks now run in real time on mobile devices, at 10 to 50 frames per second.
  • Neural networks are used in phones (TF Lite), in browsers (TF.js), and in household devices (IoT, Internet of Things), especially in phones that already support neural networks at the hardware level (neural accelerators).
  • “Every device, garment, and perhaps even food will have an IPv6 address and communicate with each other” (Sebastian Thrun).
  • Since 2015, the growth in the number of machine learning publications has exceeded Moore's law (doubling every two years). Apparently, neural networks will be needed just to analyze the articles.
  • The following technologies are gaining popularity:
    • PyTorch – popularity is growing rapidly and seems to be overtaking TensorFlow.
    • AutoML (automatic hyperparameter selection) - popularity is growing slowly.
    • Trading a little accuracy for computation speed: fuzzy logic, boosting algorithms, approximate computing, quantization (converting neural network weights to integers), neural accelerators (a quantization sketch is given after this list).
    • Translating images to text and text to images.
    • Creating 3D objects from video, now in real time.
    • The main thing in DL is lots of data, but collecting and labeling it is not easy. Therefore, labeling automation (automatic annotation) for neural networks by means of neural networks is developing.
  • With neural networks, computer science has suddenly become an experimental science, and a reproducibility crisis has arisen.
  • IT money and the popularity of neural networks appeared at the same time that computing became a market value. The economy is turning from a gold-and-currency one into a gold-currency-computing one. See my article on econophysics and the reasons for the emergence of IT money.
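
To illustrate the "take a model from a model zoo and retrain the last layer" workflow from the list above, here is a minimal transfer-learning sketch with TensorFlow Hub. The module URL and the three-class bear example are assumptions made for illustration; any feature-vector module and labeled dataset would do.

```python
import tensorflow as tf
import tensorflow_hub as hub

# A pre-trained MobileNetV2 feature extractor from TF Hub, kept frozen,
# plus a new trainable classification head for, say, three kinds of bears.
feature_extractor = hub.KerasLayer(
    "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4",
    input_shape=(224, 224, 3),
    trainable=False,  # only the new last layer is trained (transfer learning)
)

model = tf.keras.Sequential([
    feature_extractor,
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # data from Kaggle or Google Dataset Search
```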
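
For the quantization item: TensorFlow Lite can apply post-training quantization while exporting a model for a phone. The snippet below uses a tiny stand-in Keras model only to keep the example self-contained.

```python
import tensorflow as tf

# A tiny stand-in model; in practice this would be your trained network.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Post-training quantization during conversion to TF Lite: weights are stored
# as 8-bit integers, making the model smaller and faster at a small cost in accuracy.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```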

Gradually, a new ML/DL (Machine Learning & Deep Learning) programming methodology is emerging, based on representing a program as a set of trained neural network models.


Figure 3 - ML/DL as a new programming methodology

However, no "theory of neural networks" has appeared yet within which one could think and work systematically. What is now called "theory" is actually a set of experimental, heuristic algorithms.

Links to my resources (and others):

Thank you for your attention!

Source: habr.com
