Genesis?). Reflections on the nature of the mind. Part II

A word about processes, or we are all a bit contramotes (the Strugatskys' beings who move backward through time).

A continuation of my thoughts on intelligence, both natural and artificial (AI). Part One is here.


A tricky question: does a person live in the “now”? Well, when we walk down the street and directly contemplate the world around us, we seem to act more or less in real time... Yet in fact, by the time what we see has passed through the usual mechanisms of recognition/classification, all of it is recent, but still the past. So does a person live in the past?

For example: you are walking down the street and see a dog. Or a car. Either way, if we are talking about this very moment, that information is already outdated. If we operated only on data that had passed through all our cognitive mechanisms (and the brain is far from the fastest calculator!), we simply would not keep up with the world! The dog would attack or run away before your wish to scratch it behind the ear could be fulfilled, and the car would hit you or pass by, even though it was exactly the one you wanted to catch.

But thank God it does not work that way, and here is why: the brain operates differently. The unit of perception is not an object, or even a set of objects, but a process. The dog is running, toward you or away from you. Or it is not running but lying down, say. The car is either stationary (parked) or moving in some direction. In every case you perceive a process that extends over time and, accordingly, has some development into the future. When I say that we perceive events as unfolding in time, this is not a figure of speech. Try an experiment: take a dozen photographs (i.e., snapshots of reality) and describe what you see. Here are several people in a room having an argument; here is a person walking down the street; here one sits watching TV; here another reads a book. These are all processes extended in time! You perceive a snapshot as something that has duration. You cannot do otherwise, because that is how the brain works: it is trained to recognize processes, not isolated objects in a scene. Just as it recognizes not eyes-nose-mouth, but the face as a whole (hello, convolutional neural networks).

The world consists of processes, not objects. If I ask you what an apple is, most adults will say it is a fruit, and children that it is food. But both are process descriptions: the first means that the apple grows on a tree and serves the tree for reproduction, the second that it is edible. Neither is tied to the direct characteristics of an apple: shape, color, size... Characteristics allow identification, but they do not tell you what something is used for, or where it fits into the outside world, i.e., they do not define the processes.

If we take a typical debate about the nature of time, the classic postulates will be the immutability of the past (time travel aside), the primacy of the present (there is only the moment... 😉), and a future that does not yet exist and can therefore be changed. As far as objective reality is concerned, this may well be so. But a person lives in his own subjective model of the world, and there everything is almost the opposite!

The past is not nearly as immutable as we would like. Constantly receiving new information, a person rebuilds the past to eliminate contradictions (you thought Pyotr Stepanych was at the symposium, and here he is coming out of a strip club... so he never went to any symposium at all, the entertainer...). Meanwhile, your subjective future is in many respects a constant (whatever happens, on Friday I have beer and football!). Moreover, having a specific goal in the future, you not only build a chain of processes in reverse order (to become the director of a large company you need a diploma from a prestigious university; for that you must first enroll; for that you need to do well on the Unified State Exam, so go do your homework!), but quite likely this process will also take you into the past (didn't we have friends or acquaintances who have since risen in the world, acquired connections, and could help the child get into a university?). Isn't that contramotion? 😉

However, I digress a little. The main thing I wanted to focus on is processes. I am deeply convinced that a prospective AI should be trained not on photos, nor even on videos as such. A convolutional network has at least two levels, and in effect these are two different networks: one is trained to find certain graphical patterns in the raw image; the second works on the output of the first, i.e. on already processed and prepared information. To interact successfully with the world, an AI needs the same thing: at some level (by no means the first) there must be a network that receives as input a map of processes unrolled in time. The concepts of “beginning” and “end”, “movement”, “transformation”, “merging” and “splitting” are what this network must learn to work with.
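
To make this concrete, here is a minimal sketch (PyTorch; the dimensions, layer choices and the GRU as the temporal tier are my own illustrative assumptions, not a reference design). A per-frame convolutional encoder plays the role of the first network; a second network consumes the sequence of its outputs, i.e. the scene unrolled in time rather than a snapshot:

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """First tier: finds spatial patterns in a raw frame."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):                     # x: (B, 3, H, W)
        return self.proj(self.conv(x).flatten(1))

class ProcessNetwork(nn.Module):
    """Second tier: works on the time-unrolled feature map, so that
    'beginning', 'end', 'movement' can be learned as temporal patterns."""
    def __init__(self, feat_dim=128, hidden=256, n_processes=10):
        super().__init__()
        self.encoder = FrameEncoder(feat_dim)
        self.temporal = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_processes)

    def forward(self, clip):                  # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(B, T, -1)
        out, _ = self.temporal(feats)         # (B, T, hidden)
        return self.head(out[:, -1])          # classify the whole process

model = ProcessNetwork()
logits = model(torch.randn(2, 16, 3, 64, 64))   # two clips of 16 frames each
```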

I am fairly sure that those who work on game AI, such as AlphaGo, understand this one way or another. The approaches there may differ somewhat, but the essence is the same: the current situation on the board (together with the development of the last few moves) is analyzed for “what is going on in general”, and moves are selected according to how well what is happening matches what should be happening.
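
For reference, the published AlphaGo-family networks really do see a short history rather than a single position: the board input is a stack of feature planes that includes the stone configurations from the last several moves. A toy version of that encoding (numpy; the plane layout and sizes here are simplified assumptions):

```python
import numpy as np

def encode_state(history, k=8, size=19):
    """history: list of (size, size) boards, oldest first.
    Returns (k, size, size): the last k positions, zero-padded
    if the game is shorter than k moves -- a short 'process',
    not a snapshot."""
    planes = np.zeros((k, size, size), dtype=np.float32)
    recent = history[-k:]
    for i, board in enumerate(recent):
        planes[k - len(recent) + i] = board
    return planes

history = [np.random.randint(0, 2, (19, 19)) for _ in range(5)]
x = encode_state(history)   # shape (8, 19, 19)
```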

It is very difficult to talk about strategy or behavior when the input is a raw picture from sensors. Conversely, as practice shows, a prepared vector containing a complete breakdown of the current state of the field in a game with complete information (essentially a complete picture of the world) is an entirely tractable input. So if the first convolutional levels have identified objects, and the next levels analyze those objects in dynamics, identifying processes (familiar from training, for example) that complement the data obtained earlier, then it seems possible to work with this...

Questions for experts:

How realistic is it, given current developments in neural networks, to do approximately the following:

At the input: say, a continuous video signal, possibly stereo. Optionally, with several degrees of freedom (the ability to rotate the camera, arbitrarily or according to a pattern). If necessary, the video signal can be supplemented or replaced by any other method of spatial perception, from sonar to lidar.

Strictly speaking, the input could be any realtime stream - speech or text, even currency quotes - but... for the process under consideration it is easier for me to rely on the only specimen of mind available to me for direct study: my own! ) And in this “specimen” the sensory channel is beyond competition!

At the output:

  1. A depth map (if the camera is static) or a map of the surrounding space (dynamic camera, lidar, etc.);

    Why it is needed: if we want the real spatial arrangement of objects in order to assess their interaction. The image from the camera is only a two-dimensional projection of a higher-dimensional space, so additional transformations are required.
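
For the stereo case this step is well-trodden; a minimal sketch with OpenCV's semi-global block matching (the file names and calibration constants below are placeholders):

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,   # must be divisible by 16
    blockSize=7,
)
# compute() returns fixed-point disparity scaled by 16
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# depth is inversely proportional to disparity (rectified cameras assumed)
focal_length, baseline = 700.0, 0.12     # hypothetical calibration values
depth = focal_length * baseline / np.maximum(disparity, 1e-3)
```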

  2. Isolation of individual objects (taking into account the depth/space map, and not only, or not so much, the visible contours);
  3. Identification of moving objects (speed/acceleration, construction/prediction of trajectories(?));
  4. Hierarchical classification of objects according to any extracted characteristics (shape/dimensions/color/nuances of movement/component parts(?)). That is, essentially, extracting metrics for Hilbert spaces.

    About the hierarchy: perhaps the word “hierarchical” is not entirely apt here. I wanted to emphasize the ability to choose metrics at any moment such that the Hamming distance between them lets us treat two different sets of metrics as one concept - the way “red car” and “blue bus” should be generalized into the concept “vehicle”, for example.
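
A toy illustration of that idea with made-up feature vectors: the weight vector is the choice of metric, and under a metric that ignores color, “red car” and “blue bus” collapse into a single concept:

```python
import numpy as np

# features: [color_r, color_g, color_b, has_wheels, self_propelled, carries_people]
red_car   = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 1.0])
blue_bus  = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
red_apple = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

def distance(a, b, w):
    """Weighted metric: choosing w *is* choosing the metric."""
    return np.sqrt(np.sum(w * (a - b) ** 2))

w_vehicle = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])   # ignore color
print(distance(red_car, blue_bus, w_vehicle))    # 0.0  -> one concept: "vehicle"
print(distance(red_car, red_apple, w_vehicle))   # ~1.73 -> a different concept
```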

Important: if possible, the system is not pretrained. That is, some basic machinery can be built in (for example, a first-layer convolutional network for extracting contours/geometry), but it must learn to isolate objects, and later to recognize them, on its own.

  5. And finally: constructing a sweep in time (for now, at this stage, over the directly observed period) based on points 1 and 4, i.e. the spatial map together with the extracted metrics, in order to run the analysis of points 2-4 on it and identify processes/events (which are essentially the changes of point 3 over time) and their cluster classification (point 4). A minimal sketch of such a sweep follows below.
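
The sketch promised above (plain Python; the detection format is invented): per-frame object detections are unrolled into tracks, and elementary process labels - “appeared”, “moving”, “still”, “disappeared” - are read off the tracks. This is the raw material the next network tier would then classify:

```python
import numpy as np

def sweep(frames, eps=0.05):
    """frames: list of {object_id: (x, y)} dicts, one per time step."""
    tracks, events = {}, []
    for t, frame in enumerate(frames):
        for oid, pos in frame.items():
            if oid not in tracks:
                events.append((t, oid, "appeared"))
                tracks[oid] = []
            tracks[oid].append((t, np.array(pos)))
        for oid in tracks:                     # objects seen before but gone now
            if oid not in frame and tracks[oid][-1][0] == t - 1:
                events.append((t, oid, "disappeared"))
    for oid, pts in tracks.items():            # elementary motion labels
        for (t0, p0), (t1, p1) in zip(pts, pts[1:]):
            v = np.linalg.norm(p1 - p0) / (t1 - t0)
            events.append((t1, oid, "moving" if v > eps else "still"))
    return sorted(events)

frames = [{1: (0.0, 0.0)},
          {1: (0.1, 0.0)},
          {1: (0.1, 0.0), 2: (1.0, 1.0)},
          {2: (1.0, 1.1)}]
for event in sweep(frames):
    print(event)
```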

Once again: from the sensor image we first extract a description of the world in a more prepared form, marked up according to the extracted features and divided not into pixels but into objects. Then we unroll this world of objects in time, and the resulting “picture of the world” is fed to the input of the next network, which works with it the same way the previous layers worked with the sensory image. Where the contours of objects used to be extracted, the “contours” of ongoing processes will now be extracted. The relative position of objects in space is analogous to the cause-and-effect relationships of processes in time... Something like that.

Presumably, after this the system should be able to recognize processes from their parts (just as it can recognize an image from a single fragment, or continue a text the way a language model does), and consequently predict them both forward and backward in time, extending the model of point 5 indefinitely in both directions. Also, presumably, having a notion of the constituent processes, the system could assemble several related local processes into larger, global ones - and, as a consequence, identify implicit, hidden processes that are an integral part of the recognized global ones but are not directly perceived.
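
Recognizing a process from a fragment can be shown with the simplest possible mechanism - matching the fragment against windows of stored example processes (the “library” below is invented). The best match immediately yields both a backward and a forward extension of the fragment:

```python
import numpy as np

library = {                                   # stored processes: position over time
    "falling":  np.array([9.0, 8.0, 6.0, 3.0, 0.0]),
    "bouncing": np.array([0.0, 3.0, 5.0, 3.0, 0.0]),
}

def match(fragment):
    """Nearest window over all stored processes."""
    best = (np.inf, None, None)
    for name, proc in library.items():
        for s in range(len(proc) - len(fragment) + 1):
            d = np.linalg.norm(proc[s:s + len(fragment)] - fragment)
            if d < best[0]:
                best = (d, name, s)
    return best

fragment = np.array([8.1, 6.2, 2.9])          # an observed 3-step fragment
d, name, s = match(fragment)
print(name)                                   # "falling"
print("past:", library[name][:s])             # prediction backward in time
print("future:", library[name][s + len(fragment):])   # prediction forward
```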

And one last thing: given a fixed state of the system in the future (where only the significant elements of the Hilbert metrics are fixed, and the remaining, inessential values are left to free interpretation) - is the network capable of “thinking out” the rest?

That is: if this were an image in which only two unrelated fragments were given, could a network trained on a suitable sample complete a “consistent” whole image? The sample here is analogous to similar time intervals from experience, the fragments are the current and the specified states, and the result is a consistent “story” connecting the one to the other...
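
One way to make “a consistent story between two fixed fragments” concrete: if the learned model of the world were a simple Markov chain over discrete situations (the states and transition matrix below are invented), the most probable path with both endpoints fixed falls out of a small Viterbi-style pass:

```python
import numpy as np

states = ["home", "street", "bus", "office"]
P = np.array([                # P[i, j]: learned probability of i -> j per step
    [0.5, 0.5, 0.0, 0.0],
    [0.1, 0.4, 0.4, 0.1],
    [0.0, 0.1, 0.5, 0.4],
    [0.1, 0.1, 0.0, 0.8],
])

def bridge(start, end, steps):
    """Most probable state sequence of `steps` transitions from start to end."""
    logP = np.log(P + 1e-12)
    score = np.full(len(states), -np.inf)
    score[start] = 0.0
    back = []
    for _ in range(steps):
        cand = score[:, None] + logP          # (from, to) log-probabilities
        back.append(cand.argmax(axis=0))      # best predecessor for each state
        score = cand.max(axis=0)
    path = [end]
    for b in reversed(back):
        path.append(b[path[-1]])
    return [states[i] for i in reversed(path)]

print(bridge(start=0, end=3, steps=3))        # ['home', 'street', 'bus', 'office']
```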

It seems to me that this would already be quite a substantial basis for further experiments:

  • inclusion of one’s own actions in the “story”, where possible/necessary;
  • priority of “natural” cause-and-effect patterns over uncontrolled stochastic outliers (the roulette problem);
  • some version of curiosity, i.e. active exploration of patterns through action... etc.

PS I fully admit that I have just reinvented the wheel and that knowledgeable people have long been applying these principles in practice. 😉 In that case, please “poke my nose” into the relevant work. And it would be absolutely wonderful if there were a detailed description of the fundamental problems of this approach, or a justification of why it cannot work in principle.

PPS I am aware that the text is rough and that the ideas jump around, but I really wanted to put these questions (the “questions for experts” section) to a couple of people, and that is hard to do without at least some write-up. The previous text (re-reading it now, I realized it was very hard to follow) served its purpose: I got several discussions that were valuable to me... I hope it works this time too! 😉

Source: habr.com
