Efficient search for functional dependencies in databases

Searching for functional dependencies in data is used in many areas of data analysis: database management, data cleansing, database reverse engineering, and data exploration. We have already published an article about the dependencies themselves by Anastasia Birillo and Nikita Bobrov. This time Anastasia, a graduate of this year's Computer Science Center, shares how that research developed into the work she defended at the center.


Choosing a problem

While studying at the CS Center, I began to study databases in depth, namely the discovery of functional and differential dependencies. The topic was related to my coursework at the university, so while working on the coursework I started reading papers about various kinds of dependencies in databases. I wrote a survey of this area, one of my first articles in English, and submitted it to the SEIM-2017 conference. I was very glad to learn that it was accepted, and decided to dig deeper into the topic. The concept itself is not new: it came into use back in the 1990s, but it is still applied in many areas.

In my second semester at the center, I started a research project on improving algorithms for discovering functional dependencies. I worked on it together with Nikita Bobrov, a graduate student at St. Petersburg State University, within JetBrains Research.

Computational complexity of searching for functional dependencies

The main problem is computational complexity. The number of possible minimal, non-trivial dependencies is bounded from above by $n \cdot \binom{n-1}{\lfloor (n-1)/2 \rfloor}$, where $n$ is the number of table attributes: each attribute can serve as a right-hand side, and the minimal left-hand sides form an antichain in the subset lattice of the remaining $n-1$ attributes. The running time of the algorithms depends not only on the number of attributes but also on the number of rows. In the 1990s, FD discovery algorithms on a typical desktop PC could take several hours to process datasets with up to 20 attributes and tens of thousands of rows. Modern algorithms running on multi-core processors handle datasets with hundreds of attributes (up to 200) and hundreds of thousands of rows in roughly the same time. However, this is still not enough: such running times are unacceptable for most real applications. Therefore, we developed approaches to speed up existing algorithms.

Caching schemes for partition intersection

In the first part of the work, we developed caching schemes for the class of algorithms based on the partition intersection method. A partition for an attribute is a set of lists, where each list contains the numbers of the rows that share the same value of that attribute. Each such list is called a cluster. Many modern algorithms use partitions to determine whether a dependency holds, relying on the following lemma: a dependency $X \rightarrow A$ holds if and only if $|\pi(X)| = |\pi(X \cup \{A\})|$. Here $\pi(X)$ denotes the partition over the attribute set $X$, and $|\pi|$ is the partition size, i.e. the number of clusters in it. When a dependency is violated, partition-based algorithms add extra attributes to its left-hand side and then recompute the partition by intersecting partitions; in the literature this operation is called specialization. We noticed that partitions for dependencies that will only start to hold after several rounds of specialization can be actively reused, which can significantly reduce the running time of the algorithms, since the intersection operation is expensive.
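The partition machinery described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `partition`, `intersect`, and `holds` are hypothetical helper names, and the lemma check simply compares partition sizes as stated.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices by their values on the given attributes.

    Returns the partition as a list of clusters (lists of row indices).
    """
    clusters = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attrs)
        clusters[key].append(i)
    return list(clusters.values())

def intersect(p1, p2, n_rows):
    """Partition product: rows land in the same cluster of the result
    iff they share a cluster in both p1 and p2. This is the expensive
    specialization step the caching scheme tries to avoid repeating."""
    cluster_of = [0] * n_rows
    for cid, cluster in enumerate(p1):
        for r in cluster:
            cluster_of[r] = cid
    out = defaultdict(list)
    for cid, cluster in enumerate(p2):
        for r in cluster:
            out[(cluster_of[r], cid)].append(r)
    return list(out.values())

def holds(rows, lhs, rhs):
    """FD lhs -> rhs holds iff |pi(lhs)| == |pi(lhs + rhs)| (the lemma)."""
    return len(partition(rows, lhs)) == len(partition(rows, lhs + rhs))

# Toy relation with attributes 0, 1, 2.
rows = [
    ("a", 1, "x"),
    ("a", 1, "y"),
    ("b", 2, "x"),
]
print(holds(rows, [0], [1]))  # True: attribute 0 determines attribute 1
print(holds(rows, [0], [2]))  # False: "a" maps to both "x" and "y"
```

In real implementations, `partition(rows, lhs + rhs)` would be obtained by intersecting cached partitions rather than rescanning the table, which is exactly why reusing partitions pays off.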

Therefore, we proposed a heuristic based on Shannon entropy and Gini impurity, as well as our own metric, which we called inverse entropy. It is a slight modification of Shannon entropy that grows as the uniqueness of the dataset grows. The proposed heuristic is as follows:

$$\text{cache } \pi(X) \iff u(\pi(X)) \cdot k_{\mathrm{key}} \cdot k_{\mathrm{cache}} \ge \mathrm{med}\bigl(u(\pi(A_1)), \dots, u(\pi(A_n))\bigr)$$

Here $u(\pi(X))$ is the degree of uniqueness of the newly computed partition $\pi(X)$, and $\mathrm{med}(\cdot)$ is the median of the uniqueness degrees of the individual attributes. All three metrics described above were tried as the uniqueness metric. Note also the two modifiers in the heuristic. The first indicates how close the current partition is to a primary key and lets us preferentially cache partitions that are far from a candidate key. The second tracks cache occupancy and thereby encourages adding partitions to the cache while free space remains. Successfully solving this problem sped up the PYRO algorithm by 10-40%, depending on the dataset. It is worth noting that PYRO is the most successful algorithm in this area.
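As an illustration, the standard definitions of Shannon entropy and Gini impurity over a partition can be written down directly; the `should_cache` rule below is only a schematic rendering of the heuristic with hypothetical modifier formulas (the exact forms in the paper, including the inverse-entropy metric, differ).

```python
import math

def shannon_entropy(clusters, n_rows):
    """Shannon entropy of a partition; grows as values become more unique."""
    return -sum((len(c) / n_rows) * math.log2(len(c) / n_rows)
                for c in clusters)

def gini(clusters, n_rows):
    """Gini impurity of a partition; also grows with uniqueness."""
    return 1.0 - sum((len(c) / n_rows) ** 2 for c in clusters)

def should_cache(u_new, u_median, n_clusters, n_rows, cache_used, cache_cap):
    """Illustrative cache-admission rule (not the paper's exact formula):
    admit partitions whose damped uniqueness exceeds the median of the
    single-attribute uniqueness values. The first modifier shrinks near
    a candidate key (|pi| close to n_rows); the second shrinks as the
    cache fills, so caching is easiest while space is free."""
    key_modifier = 1.0 - n_clusters / n_rows       # ~0 near a candidate key
    cache_modifier = 1.0 - cache_used / cache_cap  # ~1 while cache is empty
    return u_new * key_modifier * cache_modifier >= u_median
```

For example, a fully unique column over four rows (`[[0], [1], [2], [3]]`) has entropy 2.0 and Gini impurity 0.75, while a constant column has both equal to 0.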

In the figure below, you can see the results of applying the proposed heuristic compared to the basic coin toss caching approach. The x-axis is logarithmic.

[Figure: running time of PYRO with the proposed caching heuristic vs. the coin-toss baseline]

Alternative way to store partitions

We then proposed an alternative way to store partitions. A partition is a set of clusters, each storing the numbers of the tuples that share values of certain attributes. These clusters can contain long runs of consecutive tuple numbers, for example when the data in the table is ordered. We therefore proposed a compression scheme for partitions, namely interval-based storage of the values in partition clusters:

$$\pi(X) = \{\underbrace{1, 2, 3, 4, 5}_{\text{first interval}}, \underbrace{7, 8}_{\text{second interval}}, 10\}$$
$$\downarrow \text{compression}$$
$$\pi(X) = \{\underbrace{\$, 1, 5}_{\text{first interval}}, \underbrace{7, 8}_{\text{second interval}}, 10\}$$
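The interval encoding above can be sketched as follows. This is a simplified illustration with assumed details: `compress` is a hypothetical name, `"$"` is the interval sentinel, and the minimum run length worth compressing is a guessed parameter.

```python
def compress(cluster, min_run=3):
    """Replace runs of consecutive row numbers with ('$', first, last).

    The '$' sentinel marks an interval; runs shorter than min_run are
    not worth compressing (they would grow, not shrink) and are kept
    as-is. Assumes the cluster is sorted, as with ordered table data.
    """
    out, i = [], 0
    while i < len(cluster):
        j = i
        # Extend the run while the numbers stay consecutive.
        while j + 1 < len(cluster) and cluster[j + 1] == cluster[j] + 1:
            j += 1
        if j - i + 1 >= min_run:
            out.extend(["$", cluster[i], cluster[j]])
        else:
            out.extend(cluster[i:j + 1])
        i = j + 1
    return out

print(compress([1, 2, 3, 4, 5, 7, 8, 10]))  # ['$', 1, 5, 7, 8, 10]
```

Note that the short run 7, 8 is left uncompressed, matching the example above: encoding a two-element run as an interval triple would take more space, not less.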

This method reduced memory consumption during the operation of the TANE algorithm by 1 to 25%. TANE is a classic FD discovery algorithm that uses partitions in the course of its work. TANE was chosen for this experiment because implementing interval storage in it was much easier than in, say, PYRO, which made it a convenient testbed for assessing whether the proposed approach works. The results are shown in the figure below. The x-axis is logarithmic.

[Figure: memory consumption of TANE with and without interval compression]

The ADBIS-2019 conference

Based on the results of this research, in September 2019 I presented the paper Smart Caching for Efficient Functional Dependency Discovery at the 23rd European Conference on Advances in Databases and Information Systems (ADBIS-2019). During the talk, the work was commended by Bernhard Thalheim, a prominent figure in the database field. The research results formed the basis of my master's thesis in mathematics at St. Petersburg State University, during which both proposed approaches (caching and compression) were implemented in both algorithms, TANE and PYRO. The results showed that the approaches are universal: with both of them, both algorithms exhibited a significant reduction in memory consumption as well as in running time.

Source: habr.com
