Globals are treasure swords for storing data. Sparse arrays. Part 3

Globals are treasure swords for storing data. Sparse arrays. Part 3In the previous parts1, 2) we talked about globals as trees, in this one we will consider globals as sparse arrays.

sparse array is a kind of array in which most of the values ​​take on the same value.

In practice, there are often such huge sparse arrays that it makes no sense to occupy memory with the same elements. Therefore, it makes sense to implement sparse arrays in such a way that memory is not spent on storing identical values.
In some programming languages, sparse arrays are part of the language itself, for example in J, MATLAB. Other programming languages ​​have special libraries that allow you to implement them. For C++ - Own and more

Globals are good candidates for implementing sparse arrays because:

  1. They store the values ​​of only defined nodes and do not store the values ​​of undefined ones;
  2. The interface for accessing a node's value is extremely similar to how many programming languages ​​implement access to an element of a multidimensional array.
    Set ^a(1, 2, 3)=5
    Write ^a(1, 2, 3)

  3. Global is a fairly low-level structure for storing data, therefore, they have outstanding speed characteristics (from hundreds of thousands to tens of millions of transactions per second, depending on the hardware, see below). 1)

Since the global is a persistent structure, it makes sense to make sparse arrays on them when it is known in advance that the amount of RAM will not be enough.

One of the properties of sparse array implementations is to return some default value if the access is to an undefined cell.

This can be done using the function $GET in COS. In this example, a 3-dimensional array is considered.

SET a = $GET(^a(x,y,z), defValue)

What tasks require sparse arrays and how can globals help out?

Adjacency (connectivity) matrix

Such matrices are used to represent graphs:

Globals are treasure swords for storing data. Sparse arrays. Part 3

Obviously, the larger the graph, the more zeros will be in the matrix. If, for example, we take a graph of a social network and represent it in the form of a similar matrix, then almost all of it will consist of zeros, i.e. will be a sparse array.

Set ^m(id1, id2) = 1 
Set ^m(id1, id3) = 1 
Set ^m(id1, id4) = 1 
Set ^m(id1) = 3 
Set ^m(id2, id4) = 1 
Set ^m(id2, id5) = 1 
Set ^m(id2) = 2
....

In this example, we save in the global ^m the connectivity matrix, as well as the number of edges at each node (who is friends with whom and the number of friends).

If the number of elements in the graph is not more than 29 million (this number is taken as the product of 8 * maximum line size), that is, an even more economical way to store such matrices is bit strings, since large gaps are optimized in a special way in their implementation.

Bit strings are manipulated by the function $bit.

; установка бита
SET $BIT(rowID, positionID) = 1
; получение бита
Write $BIT(rowID, positionID)

State Machine Jump Table

Since the transition graph of a finite automaton is an ordinary graph, then the transition table of a finite automaton is the same adjacency matrix that was discussed above.

Cellular automata

Globals are treasure swords for storing data. Sparse arrays. Part 3

The most famous cellular automaton game "Life", which, due to its rules (when a cell has many neighbors, it dies) is a sparse array.

Stephen Wolfram, believes that cellular automata are new field of science. In 2002, he published A New Kind of Science, a 1280-page book, where he argues broadly that advances in cellular automata are not isolated, but are very stable and of great importance for all areas of science.

It is proved that any algorithm that can be implemented on a computer can be implemented using a cellular automaton. Cellular automata are used to model dynamic environments and systems, to solve algorithmic problems, and for other purposes.

If we have a huge field and we need to record all the intermediate states of a cellular automaton, then it is quite reasonable to use globals.

Cartography

The first thing that comes to my mind when it comes to using sparse arrays is mapping tasks.

As a rule, there is a lot of empty space on the cards. If the map is represented as large pixels, then 71% of the Earth's pixels will be occupied by the ocean. Sparse array. And if you apply only the works of human hands, then there will be more than 95% of empty space.

Of course, no one stores maps as raster arrays, a vector representation is used.
But what are vector maps? This is a kind of frame and polylines and polygons consisting of points.
Actually a database of points and connections between them.

One of the most ambitious mapping missions is the Gaia telescope's mapping mission to our galaxy. Figuratively speaking, our galaxy, like the entire universe, is a continuous sparse array: vast expanses of emptiness, in which there are rare small dots - stars. Empty space 99,999999…….%. To store the map of our galaxy, a database on globals - Caché was chosen.

I don't know the exact structure of globals in this project, I can assume that it is something similar to:

Set ^galaxy(b, l, d) = 1; Номер звезды по каталогу, если есть
Set ^galaxy(b, l, d, "name") = "Sun"
Set ^galaxy(b, l, d, "type") = "normal" ; варианты blackhole, quazar, red_dwarf и т.д.
Set ^galaxy(b, l, d, "weight") = 14E50
Set ^galaxy(b, l, d, "planetes") = 7
Set ^galaxy(b, l, d, "planetes", 1) = "Mercury"
Set ^galaxy(b, l, d, "planetes", 1, weight) = 1E20
...

Where b, l, d are galactic coordinates latitude, longitude and distance from the sun.

The flexible structure of the globals allows you to save any desired characteristics of stars and planets, since the bases on the globals are schema-less.

To store a map of our universe, Caché was chosen not only because of its flexibility, but also because of its ability to save a stream of data very quickly, while creating index globals along the way for quick searches.

If we return to the Earth, then cartographic projects were created on globals OpenStreetMap XAPI and a fork of OpenStreetMap - FOSM.

Recently on Hackathon Caché geospatial indexes were implemented Geospatial. We are waiting for articles with implementation details from the authors.

Implementation of spatial indexes on the global in OpenStreetMap XAPI

Pictures taken from this presentation.

The whole globe is divided into squares, then sub-squares, and sub-squares into sub-sub-squares, and so on. In general, we get a hierarchical structure for storing which globals are created.

Globals are treasure swords for storing data. Sparse arrays. Part 3

At any time, we can almost instantly request the desired square or clear it, while all sub-squares will also be returned or cleared.

Such a scheme on globals can be implemented in several ways.

Option 1:

Set ^m(a, b, a, c, d, a, b,c, d, a, b, a, c, d, a, b,c, d, a, 1) = idПервойТочки
Set ^m(a, b, a, c, d, a, b,c, d, a, b, a, c, d, a, b,c, d, a, 2) = idВторойТочки
...

Option 2:

Set ^m('abacdabcdabacdabcda', 1) = idПервойТочки
Set ^m('abacdabcdabacdabcda', 2) = idВторойТочки
...

In both cases, it is not difficult to request points located in a square of any level on COS/M. It will be somewhat easier to clear square pieces of space of any level in the first option, but this is rarely necessary.

An example of one of the lower level squares:

Globals are treasure swords for storing data. Sparse arrays. Part 3

And here are some globals from the XAPI project: index representation on globals:

Globals are treasure swords for storing data. Sparse arrays. Part 3

Global ^way used to store points polylines (roads, shallow rivers, etc.) and polygons (closed areas: buildings, forests, etc.).

A rough classification of the use of sparse arrays on globals.

  1. We store the coordinates of certain objects and their states (mapping, cellular automata)
  2. We store sparse matrices.

For case 2) when requesting a specific coordinate where the element is not assigned a value, we should get the value of the element of the sparse array by default.

Bonuses that we get when storing multidimensional matrices in globals

Fast deletion and/or selection of pieces of space, multiples of strings, planes, cubes, etc. For cases where integer indexes are used, it can be useful to be able to quickly remove and / or select pieces of space that are multiples of strings, planes, cubes, etc.

Team Kill we can delete both a single element and a line, and even an entire plane. Thanks to the properties of globals, this is very fast - thousands of times faster than element-by-element deletion.

The figure shows a three-dimensional array in the global ^a and different types of deletions.

Globals are treasure swords for storing data. Sparse arrays. Part 3

To select pieces of space by known indices, you can use the command Go.

Selecting a matrix column into the Column variable:

; Зададим трёхмерный разреженный массив 3x3x3
Set ^a(0,0,0)=1,^a(2,2,0)=1,^a(2,0,1)=1,^a(0,2,1)=1,^a(2,2,2)=1,^a(2,1,2)=1
Merge Column = ^a(2,2)
; Выведем переменную Column
Zwrite Column

Conclusion:

Column(0)=1
Column(2)=1

What is interesting in the Column variable, we also got a sparse array, which must also be accessed through $GET, because default values ​​are not stored in it.

Sampling of chunks of space can also be done through a small program using the function $Order. This is especially convenient on spaces whose indices are not quantized (cartography).

Conclusion

The present times set new ambitious tasks. Graphs can have billions of vertices, maps can have billions of points, and someone might even want to run their own universe on cellular automata (1, 2).

When the amount of data of sparse arrays can no longer fit into RAM, and you need to work with them, then it is worth considering the possibility of implementing such projects on globals and COS.

Thank you for your attention! We are waiting for your questions and wishes in the comments.

Disclaimer: This article and my comments on it are my opinion and do not represent the official position of InterSystems Corporation.

Source: habr.com

Add a comment