Introduction to functional dependencies

In this article, we will talk about functional dependencies in databases - what they are, where they are used, and what algorithms exist to find them.

We will consider functional dependencies in the context of relational databases. Roughly speaking, such databases store information in the form of tables. Below we use informal terms that are not fully interchangeable with those of strict relational theory: the table itself we will call a relation, its columns attributes (their set is the relation schema), and the values of a row restricted to a subset of attributes a tuple.

Patient | Gender | Doctor
--------|--------|-------
Benson  | M      | Morgan
Ellis   | …      | Robin
Graham  | …      | Robin

For example, in the table above, (Benson, M, Morgan) is a tuple over the attributes (Patient, Gender, Doctor).
More formally, this is written as follows: t[Patient, Gender, Doctor] = (Benson, M, Morgan).
Now we can introduce the concept of a functional dependency (FD):

Definition 1. A relation R satisfies the FD X → Y (where X, Y ⊆ R) if and only if for any tuples t₁, t₂ ∈ R the following holds: if t₁[X] = t₂[X], then t₁[Y] = t₂[Y]. In this case X (the determinant, or defining set of attributes) is said to functionally determine Y (the dependent set).

In other words, the presence of an FD X → Y means that if two tuples in R agree on the attributes of X, then they also agree on the attributes of Y.
Now let's take it step by step. Consider the attributes Patient and Gender, for which we want to know whether there are dependencies between them. For such a set of attributes, the following dependencies may exist:

  1. Patient → Gender
  2. Gender → Patient

According to the definition above, for the first dependency to hold, each unique value of the column Patient must correspond to exactly one value of the column Gender. For the example table this is true. However, it does not work in the opposite direction: the second dependency is not fulfilled, and the attribute Gender is not a determinant for Patient. Similarly, if we take the dependency Doctor → Patient, we can see that it is violated: the value Robin of the attribute Doctor corresponds to several different Patient values, Ellis and Graham.

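
This check is easy to state in code. Below is a minimal Python sketch of Definition 1 over a toy table; the rows mirror the example above, but the genders of Ellis and Graham are an assumption made here for illustration:

```python
from itertools import combinations

# Hypothetical rows mirroring the example table: (Patient, Gender, Doctor).
rows = [
    ("Benson", "M", "Morgan"),
    ("Ellis",  "F", "Robin"),
    ("Graham", "F", "Robin"),
]

def fd_holds(rows, lhs, rhs):
    """Check X -> Y per Definition 1: any two tuples that agree on the
    attribute indices in `lhs` must also agree on those in `rhs`."""
    for t1, t2 in combinations(rows, 2):
        if all(t1[i] == t2[i] for i in lhs) and not all(t1[i] == t2[i] for i in rhs):
            return False
    return True

# Patient -> Gender holds; Doctor -> Patient does not (Robin treats two patients).
print(fd_holds(rows, lhs=[0], rhs=[1]))  # True
print(fd_holds(rows, lhs=[2], rhs=[0]))  # False
```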

Thus, functional dependencies let us identify the relationships that exist between sets of table attributes. From here on we will consider only the most interesting dependencies X → Y, namely those that are:

  • non-trivial, that is, the right side of the dependency is not a subset of the left side (Y ⊄ X);
  • minimal, that is, there is no dependency Z → Y with Z ⊂ X that also holds.

The dependencies considered so far were exact, that is, they allowed no violations in the table. Besides them, there are dependencies that permit some inconsistency between tuple values. Such dependencies form a separate class called approximate dependencies, and they are allowed to be violated on a certain number of tuples. This number is controlled by the maximum error threshold e_max. For example, an error threshold e_max = 0.01 may mean that the dependency may be violated on at most 1% of the available tuples over the considered set of attributes; that is, for 1000 records, at most 10 tuples may violate the FD. We will consider a slightly different metric, based on pairwise comparison of tuple values. For a dependency X → Y on a relation r it is computed as follows:

e(X → Y) = |{(t₁, t₂) ∈ r × r : t₁[X] = t₂[X] ∧ t₁[Y] ≠ t₂[Y]}| / (|r|² − |r|)

Let's calculate the error for Doctor → Patient from the example above. We have two tuples whose values differ on the attribute Patient but coincide on Doctor: t₁[Doctor, Patient] = (Robin, Ellis) and t₂[Doctor, Patient] = (Robin, Graham). Following the definition of the error, we must take into account all conflicting pairs, which means there are two of them: (t₁, t₂) and its inverse (t₂, t₁). Substituting into the formula, we get:

e(Doctor → Patient) = 2 / (|r|² − |r|)
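
This pairwise error metric can be sketched in Python on a hypothetical three-row table (so the result is 2 violating ordered pairs out of |r|² − |r| = 6; the genders of Ellis and Graham are assumed):

```python
# Hypothetical three-row table: (Patient, Gender, Doctor).
rows = [
    ("Benson", "M", "Morgan"),
    ("Ellis",  "F", "Robin"),
    ("Graham", "F", "Robin"),
]

def fd_error(rows, lhs, rhs):
    """Share of ordered tuple pairs violating X -> Y:
    e = |{(t1, t2) : t1[X] = t2[X] and t1[Y] != t2[Y]}| / (|r|^2 - |r|)."""
    n = len(rows)
    violations = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            t1, t2 = rows[i], rows[j]
            if all(t1[k] == t2[k] for k in lhs) and any(t1[k] != t2[k] for k in rhs):
                violations += 1
    return violations / (n * n - n)

# Doctor -> Patient: the two Robin rows yield 2 ordered violating pairs out of 6.
print(fd_error(rows, lhs=[2], rhs=[0]))  # 2/6 ≈ 0.3333
```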

And now let's try to answer the question: "Why all of this?". In fact, FDs come in different kinds. The first kind is dependencies defined by the administrator at the database design stage. There are usually few of them, they are exact, and their main application is data normalization and relation schema design.

The second kind is dependencies representing "hidden" data and previously unknown relationships between attributes. Such dependencies were not considered at design time and are discovered in an already existing data set, so that conclusions about the stored information can later be drawn from the set of identified FDs. It is with such dependencies that we work here. A whole field of data mining deals with them, with various search techniques and algorithms built on top of them. Let's figure out how functional dependencies (exact or approximate) found in data can be useful.


Today, data cleaning stands out among the main areas of application of dependencies. It involves developing processes for identifying "dirty data" and then correcting it. Typical examples of "dirty data" are duplicates, data errors or typos, missing values, outdated data, extra spaces, and the like.


For example, suppose we have a table and a set of FDs that must hold. Data cleaning in this case involves changing the data in such a way that the FDs become true. At the same time, the number of modifications should be minimal (there are algorithms for this procedure, which we will not focus on in this article). Below is an example of such a data transformation. On the left is the initial relation, in which the required FDs obviously do not hold (a violation of one of them is highlighted in red). On the right is the updated relation, with green cells showing the changed values. After this procedure, the required dependencies hold.


Another popular area of application is database design. Here it is worth recalling normal forms and normalization. Normalization is the process of bringing a relation into conformity with a certain set of requirements, each of which is defined by its normal form in its own way. We will not list the requirements of the various normal forms (this is done in any introductory database textbook); we only note that each of them uses the concept of functional dependencies in its own way. After all, FDs are inherently integrity constraints that are taken into account when designing a database (in this context, the related notions of keys and superkeys are themselves defined through FDs).

Consider their application to the four normal forms in the figure below. Recall that Boyce-Codd normal form is stricter than the third normal form but less strict than the fourth. We do not consider the latter here, since its formulation requires an understanding of multivalued dependencies, which are not of interest in this article.


Another area where dependencies have found use is reducing the dimensionality of the feature space in tasks such as building a naive Bayes classifier, extracting significant features, and reparametrizing a regression model. In the original articles this problem is called determining feature redundancy and feature relevancy [5, 6], and it is solved with active use of database concepts. With the appearance of such works, we can say that today there is demand for solutions that combine the database, analytics, and the above optimization problems in a single tool [7, 8, 9].

There are many algorithms, both modern and not so modern, for discovering FDs in a data set. They can be divided into three groups:

  • Lattice traversal algorithms
  • Algorithms based on the search for consistent values ​​(Difference- and agree-set algorithms)
  • Algorithms based on pairwise comparisons (Dependency induction algorithms)

A brief description of each type of algorithm:

  • Lattice traversal algorithms walk the lattice of attribute sets level by level, validating candidate FDs (typically with the help of partitions) and pruning the search space using minimality.
  • Difference- and agree-set algorithms compute, for pairs of tuples, the sets of attributes on which they agree or differ, and derive valid FDs from these sets.
  • Dependency induction algorithms start from a general set of dependencies and successively specialize it by pairwise tuple comparisons.

More details about this classification can be found in [4]. Examples of algorithms of each type:

  • Lattice traversal: TANE [1], FUN, FD_Mine, DFD
  • Difference- and agree-set: Dep-Miner, FastFDs
  • Dependency induction: FDEP

Currently, new algorithms are emerging that combine several approaches to FD discovery at once. Examples of such algorithms are Pyro [2] and HyFD [3]; an analysis of how they work is planned for the following articles of this series. In this article, we will only analyze the basic concepts and the lemma that are necessary to understand the techniques for identifying dependencies.

Let's start with something simple: difference- and agree-sets, used by the second type of algorithms. An agree-set is the set of attributes on which a pair of tuples have matching values; a difference-set, on the contrary, is the set of attributes on which they differ. Note that in this case only the left-hand side of a dependency is considered.

Another important concept, already encountered above, is the algebraic lattice. Since many modern algorithms operate on this concept, we need an idea of what it is.

To introduce the concept of a lattice, we must first define a partially ordered set (poset for short).

Definition 2. A set S is said to be partially ordered by a binary relation ⩽ if the following properties hold for any a, b, c ∈ S:

  1. Reflexivity, i.e. a ⩽ a
  2. Antisymmetry, that is, if a ⩽ b and b ⩽ a, then a = b
  3. Transitivity, that is, for a ⩽ b and b ⩽ c, it follows that a ⩽ c


Such a relation is called a relation of (non-strict) partial order, and the set itself is called a partially ordered set. Formal notation: ⟨S, ⩽⟩.

As the simplest example of a partially ordered set, we can take the set of all natural numbers N with the usual order relation ⩽. It is easy to check that all the necessary axioms are satisfied.

A more meaningful example: consider the set of all subsets of {1, 2, 3}, ordered by the inclusion relation ⊆. Indeed, this relation satisfies all the partial order conditions, so ⟨P({1, 2, 3}), ⊆⟩ is a partially ordered set. The figure below shows the structure of this set: if one element can be reached from another by following the arrows, the two are related by the order.


We need two more simple definitions from mathematics: the supremum and the infimum.

Definition 3. Let ⟨S, ⩽⟩ be a poset and A ⊆ S. An upper bound of A is an element u ∈ S such that ∀x ∈ A: x ⩽ u. Let U be the set of all upper bounds of A. If U has a least element, it is called the supremum and denoted sup A.

The notion of the infimum (greatest lower bound) is introduced similarly.

Definition 4. Let ⟨S, ⩽⟩ be a poset and A ⊆ S. A lower bound of A is an element l ∈ S such that ∀x ∈ A: l ⩽ x. Let L be the set of all lower bounds of A. If L has a greatest element, it is called the infimum and denoted inf A.

As an example, consider the partially ordered set ⟨P({1, 2, 3}), ⊆⟩ above and find a supremum and an infimum in it: for instance, for A = {{1}, {2}} we have sup A = {1, 2} and inf A = ∅.

Now we can formulate the definition of an algebraic lattice.

Definition 5. Let ⟨P, ⩽⟩ be a poset in which every two-element subset has a supremum and an infimum. Then P is called an algebraic lattice. In this case, sup{x, y} is written as x ∨ y, and inf{x, y} as x ∧ y.

Let's check that our working example ⟨P({1, 2, 3}), ⊆⟩ is a lattice. Indeed, for any a, b ∈ P({1, 2, 3}): a ∨ b = a ∪ b and a ∧ b = a ∩ b. For example, consider the sets {1, 2} and {1, 3}: their intersection {1} is the infimum, and their union {1, 2, 3} is the supremum.
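
This check can be replayed in code; a small Python sketch over the powerset of {1, 2, 3}:

```python
from itertools import chain, combinations

# Powerset of {1, 2, 3}, ordered by inclusion.
base = {1, 2, 3}
powerset = [frozenset(c) for c in
            chain.from_iterable(combinations(base, k) for k in range(len(base) + 1))]

def sup(a, b):
    return a | b  # the least upper bound under inclusion is the union

def inf(a, b):
    return a & b  # the greatest lower bound under inclusion is the intersection

a, b = frozenset({1, 2}), frozenset({1, 3})
print(sorted(inf(a, b)))  # [1]
print(sorted(sup(a, b)))  # [1, 2, 3]

# Brute-force check that sup(a, b) really is below every upper bound.
upper_bounds = [s for s in powerset if a <= s and b <= s]
assert all(sup(a, b) <= u for u in upper_bounds)
```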

In FD discovery algorithms, the search space is often represented as a lattice, where the singleton sets (that is, the first level of the search lattice, where the left-hand side of a dependency consists of one attribute) correspond to the individual attributes of the original relation.
First, dependencies of the form ∅ → attribute are checked; such a dependency holds exactly for attributes that are constant over the whole table (no determinant is needed for them, hence the empty left side). The algorithms then move up the lattice. Note that the whole lattice need not be traversed: if a maximum size of the left-hand side is given as input, the algorithm will not go beyond the level of that size.

The figure below shows how the algebraic lattice can be used in FD discovery. Here each edge (X, XY) represents a dependency X → Y. For example, suppose we have passed the first level and know that the dependency A → B holds (shown as a green edge between the vertices A and B). Then, as we move up the lattice, we do not have to check the dependency A, C → B, since it would no longer be minimal. Similarly, we would skip it if C → B held.

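
A naive sketch of such a level-wise traversal follows; validation here is done by directly scanning the rows rather than via partitions, and the attribute indices, toy rows, and the max_lhs cutoff are illustrative assumptions:

```python
from itertools import combinations

def discover_fds(rows, n_attrs, max_lhs=2):
    """Level-wise walk over the lattice of left-hand sides: at each level,
    validate candidates X -> A, skipping those that cannot be minimal."""
    def holds(lhs, rhs):
        seen = {}
        for t in rows:
            key = tuple(t[i] for i in lhs)
            if key in seen and seen[key] != t[rhs]:
                return False
            seen.setdefault(key, t[rhs])
        return True

    found = []  # valid FDs as (frozenset of lhs indices, rhs index)
    for level in range(1, max_lhs + 1):
        for lhs in map(frozenset, combinations(range(n_attrs), level)):
            for rhs in set(range(n_attrs)) - lhs:
                # Pruning: if some Z ⊂ X with Z -> A is known, X -> A is not minimal.
                if any(z < lhs and a == rhs for z, a in found):
                    continue
                if holds(sorted(lhs), rhs):
                    found.append((lhs, rhs))
    return found

rows = [("Benson", "M", "Morgan"), ("Ellis", "F", "Robin"), ("Graham", "F", "Robin")]
print(discover_fds(rows, n_attrs=3))
```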

In addition, as a rule, all modern FD discovery algorithms use a data structure called a partition (in the original source, a stripped partition [1]). It is formally defined as follows:

Definition 6. Let X ⊆ R be a set of attributes of a relation r. A cluster is a set of indices of tuples of r that have the same value on X, i.e. c(t) = {i | tᵢ[X] = t[X]}. A partition is the set of such clusters, excluding clusters of size one:

π(X) = {c(t) | t ∈ r ∧ |c(t)| > 1}

In simple terms, a partition for an attribute X is a set of lists, where each list contains the numbers of rows with the same values of X. In modern literature, the structure representing partitions is called a position list index (PLI). Clusters of size one are excluded to compress the PLI: such a cluster contains only the number of a row with a unique value, and it can always be restored when needed.

Consider an example. Let's return to the table with patients and build the partitions for the columns Patient and Gender (a new column has appeared on the left, containing the row numbers of the table):


In this case, according to the definition, the partition for the column Patient is actually empty, since all of its values are unique and single clusters are excluded from the partition.
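
A stripped partition is easy to build in code; a sketch with hypothetical rows mirroring the example table (the genders of Ellis and Graham are assumed):

```python
from collections import defaultdict

def stripped_partition(rows, attrs):
    """Group row indices by their values on `attrs`; drop singleton clusters."""
    clusters = defaultdict(list)
    for i, t in enumerate(rows):
        clusters[tuple(t[a] for a in attrs)].append(i)
    return [c for c in clusters.values() if len(c) > 1]

rows = [
    ("Benson", "M", "Morgan"),
    ("Ellis",  "F", "Robin"),
    ("Graham", "F", "Robin"),
]
print(stripped_partition(rows, [0]))  # Patient: all values unique -> []
print(stripped_partition(rows, [1]))  # Gender: rows 1 and 2 share "F" -> [[1, 2]]
```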

Partitions can also be built over several attributes, and there are two ways to do it: walk over the table and build the partition over all the needed attributes at once, or build it via the intersection of partitions over subsets of those attributes. FD discovery algorithms use the second option.

In simple words, to obtain a partition by the columns ABC, one can take the partitions for AC and B (or any other pair of disjoint subsets covering ABC) and intersect them. The operation of intersecting two partitions selects the maximal sets of rows that appear together in a cluster of the first partition and in a cluster of the second.

Let's look at an example:


In the first case, we got an empty partition; indeed, looking closely at the table, there are no rows with identical values on both attributes. If we slightly modify the table (the case on the right), we already get a non-empty intersection: rows 1 and 2 indeed contain the same values on the attributes Gender and Doctor.
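
The intersection operation over stripped partitions can be sketched as follows (cluster lists in the style built above; the input partitions here are hypothetical):

```python
def intersect_partitions(p1, p2):
    """Product of two stripped partitions: rows end up in the same cluster
    iff they share a cluster in both p1 and p2."""
    cluster_of = {}  # row index -> cluster id in p1
    for cid, cluster in enumerate(p1):
        for i in cluster:
            cluster_of[i] = cid
    result = []
    for cluster in p2:
        groups = {}
        for i in cluster:
            if i in cluster_of:  # rows stripped from p1 can never pair up
                groups.setdefault(cluster_of[i], []).append(i)
        result.extend(g for g in groups.values() if len(g) > 1)
    return result

# Gender clusters [[1, 2]] and Doctor clusters [[1, 2]] intersect to [[1, 2]].
print(intersect_partitions([[1, 2]], [[1, 2]]))  # [[1, 2]]
# A cluster survives only where both partitions keep its rows together.
print(intersect_partitions([[0, 1, 2]], [[1, 2], [3, 4]]))  # [[1, 2]]
```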

Next, we need such a concept as the size of the partition. Formally:

size(π(X)) = |π(X)|, the number of clusters in π(X)

Simply put, the partition size is the number of clusters included in the partition (remember that single clusters are not included in the partition!).


Now we can state one of the key lemmas, which, given the partitions, lets us determine whether a dependency holds:

Lemma 1. A dependency A, B → C holds if and only if

|π(AB)| = |π(AB) · π(C)|

According to the lemma, checking whether a dependency holds takes four steps:

  1. Calculate partition for left side of dependency
  2. Calculate the partition for the right side of the dependency
  3. Calculate the product of the first and second steps
  4. Compare the partition sizes obtained in the first and third steps

Below is an example of checking whether a dependency holds using this lemma:

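
The four steps can be sketched in Python. Note that this sketch uses full partitions (singleton clusters kept), so that comparing cluster counts is an exact test; the three-row table is the hypothetical one from the earlier sketches:

```python
from collections import defaultdict

def partition(rows, attrs):
    """Full (unstripped) partition: clusters of row indices equal on attrs."""
    clusters = defaultdict(list)
    for i, t in enumerate(rows):
        clusters[tuple(t[a] for a in attrs)].append(i)
    return list(clusters.values())

def fd_holds_by_lemma(rows, lhs, rhs):
    # Step 1: partition for the left-hand side.
    p_lhs = partition(rows, lhs)
    # Steps 2-3: the product of pi(lhs) and pi(rhs) equals the partition
    # over lhs and rhs together, so it can be built directly.
    p_prod = partition(rows, lhs + rhs)
    # Step 4: the FD holds iff refining by rhs did not split any cluster.
    return len(p_lhs) == len(p_prod)

rows = [
    ("Benson", "M", "Morgan"),
    ("Ellis",  "F", "Robin"),
    ("Graham", "F", "Robin"),
]
print(fd_holds_by_lemma(rows, [0], [1]))  # Patient -> Gender: True
print(fd_holds_by_lemma(rows, [2], [0]))  # Doctor -> Patient: False
```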

In this article, we analyzed the concepts of functional dependency and approximate functional dependency, examined where they are used and which FD discovery algorithms exist, and also covered in detail the basic but important concepts that are actively used in modern FD discovery algorithms.

References:

  1. Huhtala Y. et al. TANE: An efficient algorithm for discovering functional and approximate dependencies // The Computer Journal. - 1999. - Vol. 42. - No. 2. - pp. 100-111.
  2. Kruse S., Naumann F. Efficient discovery of approximate dependencies // Proceedings of the VLDB Endowment. - 2018. - Vol. 11. - No. 7. - pp. 759-772.
  3. Papenbrock T., Naumann F. A hybrid approach to functional dependency discovery // Proceedings of the 2016 International Conference on Management of Data. - ACM, 2016. - pp. 821-833.
  4. Papenbrock T. et al. Functional dependency discovery: An experimental evaluation of seven algorithms // Proceedings of the VLDB Endowment. - 2015. - Vol. 8. - No. 10. - pp. 1082-1093.
  5. Kumar A. et al. To join or not to join?: Thinking twice about joins before feature selection // Proceedings of the 2016 International Conference on Management of Data. - ACM, 2016. - pp. 19-34.
  6. Abo Khamis M. et al. In-database learning with sparse tensors // Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. - ACM, 2018. - pp. 325-340.
  7. Hellerstein J. M. et al. The MADlib analytics library: or MAD skills, the SQL // Proceedings of the VLDB Endowment. - 2012. - Vol. 5. - No. 12. - pp. 1700-1711.
  8. Qin C., Rusu F. Speculative approximations for terascale distributed gradient descent optimization // Proceedings of the Fourth Workshop on Data Analytics in the Cloud. - ACM, 2015. - p. 1.
  9. Meng X. et al. MLlib: Machine learning in Apache Spark // The Journal of Machine Learning Research. - 2016. - Vol. 17. - No. 1. - pp. 1235-1241.

Article authors: Anastasia Birillo, researcher at JetBrains Research and CS student, and Nikita Bobrov, researcher at JetBrains Research

Source: habr.com
