In the previous article I described the concept and implementation of a database built around functions rather than tables and fields as in relational databases. It provided many examples showing the advantages of this approach over the classical one. Many considered them insufficiently convincing.
In this article, I will show how this concept allows you to quickly and conveniently balance writing and reading to the database without any change in the logic of work. Similar functionality was tried to be implemented in modern commercial DBMS (in particular, Oracle and Microsoft SQL Server). At the end of the article, I will show what they did, to put it mildly, not very well.
Description
As before, for a better understanding, I will start the description with examples. Suppose we need to implement logic that will return a list of departments with the number of employees in them and their total salary.
In a functional database, this would look like this:
CLASS Department ‘Отдел’;
name ‘Наименование’ = DATA STRING[100] (Department);
CLASS Employee ‘Сотрудник’;
department ‘Отдел’ = DATA Department (Employee);
salary ‘Зарплата’ = DATA NUMERIC[10,2] (Employee);
countEmployees ‘Кол-во сотрудников’ (Department d) =
GROUP SUM 1 IF department(Employee e) = d;
salarySum ‘Суммарная зарплата’ (Department d) =
GROUP SUM salary(Employee e) IF department(e) = d;
The complexity of executing this query in any DBMS will be equivalent to O(number of employees), because this calculation needs to scan the entire table of employees and then group them by department. There will also be some small (we believe that there are much more employees than departments) supplement, depending on the chosen plan O(log number of employees) or O(number of departments) for grouping, etc.
It is clear that the execution overhead may be different in different DBMS, but the complexity will not change in any way.
In the proposed implementation, the functional DBMS will form one subquery that will calculate the desired values for the department, and then JOIN with the department table to get the name. However, for each function, when declaring, it is possible to set a special marker MATERIALIZED. The system will automatically create a corresponding field for each such function. When the value of the function changes, the value of the field will also change in the same transaction. When calling this function, the already precalculated field will be called.
In particular, if you set MATERIALIZED for functions countEmployees и salarySum, then two fields will be added to the table with the list of departments, in which the number of employees and their total salary will be stored. With any change in employees, their salaries or department affiliation, the system will automatically change the values of these fields. The above query will refer directly to these fields and will be executed in O(number of departments).
What are the restrictions? Only one thing: such a function must have a finite number of input values for which its value is defined. Otherwise, it will be impossible to build a table that stores all its values, since there cannot be a table with an infinite number of rows.
Example:
employeesCount ‘Количество сотрудников с зарплатой > N’ (Department d, NUMERIC[10,2] N) =
GROUP SUM salary(Employee e) IF department(e) = d AND salary(e) > N;
This function is defined for an infinite number of values of N (for example, any negative value is fine). Therefore, you cannot put MATERIALIZED on it. So this is a logical and not a technical limitation (i.e. not because we couldn't implement it). Otherwise, there are no restrictions. You can use groupings, sorting, AND and OR, PARTITION, recursion, etc.
For example, in problem 2.2 of the previous article, you can put MATERIALIZED on both functions:
bought 'Купил' (Customer c, Product p, INTEGER y) =
GROUP SUM sum(Detail d) IF
customer(order(d)) = c AND
product(d) = p AND
extractYear(date(order(d))) = y MATERIALIZED;
rating 'Рейтинг' (Customer c, Product p, INTEGER y) =
PARTITION SUM 1 ORDER DESC bought(c, p, y), p BY c, y MATERIALIZED;
SELECT contactName(Customer c), name(Product p) WHERE rating(c, p, 1997) < 3;
The system itself will create one table with type keys Customer, Product и INTEGER, will add two fields to it and will update the field values in them with any changes. With further calls to these functions, they will not be calculated, but the values from the corresponding fields will be read.
Using this mechanism, you can, for example, get rid of recursions (CTE) in queries. In particular, consider the groups that form a tree using the child/parent relationship (each group has a link to its parent):
parent = DATA Group (Group);
In a functional database, recursion logic can be defined as follows:
level (Group child, Group parent) = RECURSION 1l IF child IS Group AND parent == child
STEP 2l IF parent == parent($parent);
isParent (Group child, Group parent) = TRUE IF level(child, parent) MATERIALIZED;
Since for the function isParent is set to MATERIALIZED, then a table with two keys (groups) will be created under it, in which the field isParent will only be true if the first key is a child of the second. The number of entries in this table will be equal to the number of groups multiplied by the average depth of the tree. If it is necessary, for example, to count the number of descendants of a certain group, then you can refer to this function:
childrenCount (Group g) = GROUP SUM 1 IF isParent(Group child, g);
There will be no CTE in the SQL query. Instead, there will be a simple GROUP BY.
Using this mechanism, you can also easily denormalize the database if necessary:
CLASS Order 'Заказ';
date 'Дата' = DATA DATE (Order);
CLASS OrderDetail 'Строка заказа';
order 'Заказ' = DATA Order (OrderDetail);
date 'Дата' (OrderDetail d) = date(order(d)) MATERIALIZED INDEXED;
When calling a function data for the order line, reading from the table with the order lines of the field on which there is an index will go. When changing the order date, the system will automatically recalculate the denormalized date in the line itself.
Advantages
What is all this mechanism for? In classical DBMS, without rewriting queries, the developer or DBA can only change indexes, define statistics and tell the query planner how to execute them (and only commercial DBMS have HINTs). No matter how hard they try, they will not be able to complete the first request in the article for O (number of departments) without changing requests and adding triggers. In the proposed scheme, at the development stage, you can not think about the data storage structure and what aggregations to use. All this can be easily changed on the fly already directly in operation.
In practice, it looks like this. Some people develop directly logic based on the task at hand. They do not understand algorithms and their complexity, nor execution plans, nor types of joins, nor any other technical component. These people are more business analysts than developers. Then, it all goes into testing or operation. Logging of long-running queries is enabled. When a long request is detected, then other people (more technical - essentially DBA) make a decision to enable MATERIALIZED on some intermediate function. This slows down the write a little (because an additional field in the transaction needs to be updated). However, not only this request is greatly accelerated, but also all others that use this function. At the same time, making a decision about which function to materialize is relatively easy. Two main parameters: the number of possible input values (that is how many records will be in the corresponding table), and how often it is used in other functions.
Analogues
Modern commercial DBMSs have similar mechanisms: MATERIALIZED VIEW with FAST REFRESH (Oracle) and INDEXED VIEW (Microsoft SQL Server). In PostgreSQL, MATERIALIZED VIEW cannot be updated in a transaction, but only on request (and even with very strict restrictions), so we do not consider it. But they have several problems that significantly limit their use.
First, you can only enable materialization if you have already created a regular VIEW. Otherwise, you will have to rewrite the rest of the requests to access the newly created view in order to use this materialization. Or leave everything as it is, but it will be at least inefficient if there is already certain pre-calculated data, but many queries do not always use them, but calculate them again.
Secondly, they have a huge number of restrictions:
Oracle
5.3.8.4 General Restrictions on Fast Refresh
The defining query of the materialized view is restricted as follows:
The materialized view must not contain references to non-repeating expressions like SYSDATE and ROWNUM.
The materialized view must not contain references to RAW or LONGRAW data types.
It cannot contain SELECT list subqueries.
It cannot contain analytical functions (for example, RANK) in the SELECT clause
It cannot reference a table on which an XMLIndex index is defined.
It cannot contain MODEL clause
It cannot contain HAVING clause with a subquery.
It cannot contain nested queries that have ANY, ALL, Or NOTEXISTS.
It cannot contain [START WITH …] CONNECT BY clause
It cannot contain multiple detail tables at different sites.
ONCOMMIT materialized views cannot have remote detail tables.
Nested materialized views must have a join or aggregate.
Materialized join views and materialized aggregate views with a GROUPBY clause cannot select from an index-organized table.
5.3.8.5 Restrictions on Fast Refresh on Materialized Views with Joins Only
Defining queries for materialized views with joins only and no aggregates have the following restrictions on fast refresh:
Rowids of all the tables in the FROM list must appear in the SELECT list of the query.
Materialized view logs must exist with rowids for all the base tables in the FROM list of the query.
You cannot create a fast refreshable materialized view from multiple tables with simple joins that include an object type column in the SELECT statement.
Also, the refresh method you choose will not be optimally efficient if:
The defining query uses an outer join that behaves like an inner join. If the defining query contains such a join, consider rewriting the defining query to contain an inner join.
The SELECT list of the materialized view contains expressions on columns from multiple tables.
5.3.8.6 Restrictions on Fast Refresh on Materialized Views with Aggregates
Defining queries for materialized views with aggregates or joins have the following restrictions on fast refresh:
Fast refresh is supported for both ONCOMMIT and ONDEMAND materialized views, however the following restrictions apply:
All tables in the materialized view must have materialized view logs, and the materialized view logs must:
Contain all columns from the table referenced in the materialized view.
Specify with ROWID and INCLUDINGNEWVALUES.
Specify the SEQUENCE clause if the table is expected to have a mix of inserts/direct-loads, deletes, and updates.
Only SUM, COUNT, AVG, STDDEV, VARIANCE, MIN and MAX are supported for fast refresh.
COUNT(*) must be specified.
Aggregate functions must occur only as the outermost part of the expression. That is, aggregates such as AVG(AVG(x)) or AVG(x)+ AVG(x) are not allowed.
For each aggregate such as AVG(expr), the corresponding COUNT(expr) must be present. Oracle recommends that SUM(expr) be specified.
If VARIANCE(expr) or STDDEV(expr) is specified, COUNT(expr) and SUM(expr) must be specified. Oracle recommends that SUM(expr *expr) be specified.
The SELECT column in the defining query cannot be a complex expression with columns from multiple base tables. A possible workaround to this is to use a nested materialized view.
The SELECT list must contain all GROUPBY columns.
The materialized view is not based on one or more remote tables.
If you use the CHAR data type in the filter columns of a materialized view log, the character sets of the master site and the materialized view must be the same.
If the materialized view has one of the following, then fast refresh is supported only on conventional DML inserts and direct loads.
materialized views with MIN or MAX aggregates
Materialized views which have SUM(expr) shoe no COUNT(expr)
materialized views without COUNT(*)
Such a materialized view is called an insert-only materialized view.
A materialized view with MAX or MIN is fast refreshable after delete or mixed DML statements if it does not have a WHERE clause
The max/min fast refresh after delete or mixed DML does not have the same behavior as the insert-only case. It deletes and recomputes the max/min values for the affected groups. You need to be aware of its performance impact.
Materialized views with named views or subqueries in the FROM clause can be quickly refreshed provided the views can be completely merged. For information on which views will merge, see Oracle Database SQL Language Reference.
If there are no outer joins, you may have arbitrary selections and joins in the WHERE clause
Materialized aggregate views with outer joins are fast refreshable after conventional DML and direct loads, provided only the outer table has been modified. Also, unique constraints must exist on the join columns of the inner join table. If there are outer joins, all the joins must be connected by ANDs and must use the equality (=) operator.
For materialized views with CUBE, ROLLUP, grouping sets, or concatenation of them, the following restrictions apply:
The SELECT list should contain grouping distinguisher that can either be a GROUPING_ID function on all GROUPBY expressions or GROUPING functions one for each GROUPBY expression. For example, if the GROUPBY clause of the materialized view is "GROUPBYCUBE(a, b)", then the SELECT list should contain either "GROUPING_ID(a, b)»Or«GROUPING(a)ANDGROUPING(b)» for the materialized view to be fast refreshable.
GROUPBY should not result in any duplicate groupings. For example, "GROUP BY a, ROLLUP(a, b)" is not fast refreshable because it results in duplicate groupings "(a), (a, b), AND (a)«.
5.3.8.7 Restrictions on Fast Refresh on Materialized Views with UNION ALL
materialized views with the UNIONALL set operator support the REFRESHFAST option if the following conditions are satisfied:
The defining query must have the UNIONALL operator at the top level.
The UNIONALL operator cannot be embedded inside a subquery, with one exception: The UNIONALL can be in a subquery in the FROM clause provided the defining query is of the form SELECT * FROM (view or subquery with UNIONALL) as in the following example:
CREATE VIEW view_with_unionall AS (SELECT c.rowid crid, c.cust_id, 2 umarker FROM customers c WHERE c.cust_last_name = 'Smith' UNION ALL SELECT c.rowid crid, c.cust_id, 3 umarker FROM customers c WHERE c.cust_last_name = 'Jones'); CREATE MATERIALIZED VIEW unionall_inside_view_mv REFRESH FAST ON DEMAND AS SELECT * FROM view_with_unionall;
Note that the view view_with_unionall satisfies the requirements for fast refresh.
Each query block in the UNIONALL query must satisfy the requirements of a fast refreshable materialized view with aggregates or a fast refreshable materialized view with joins.
The appropriate materialized view logs must be created on the tables as required for the corresponding type of fast refreshable materialized view.
Note that the Oracle Database also allows the special case of a single table materialized view with joins only provided the ROWID column has been included in the SELECT list and in the materialized view log. This is shown in the defining query of the view view_with_unionall.
The SELECT list of each query must include a UNIONALL marker, and the UNIONALL column must have a distinct constant numeric or string value in each UNIONALL branch. Further, the marker column must appear in the same ordinal position in the SELECT list of each query block. see "UNION ALL Marker and Query Rewrite» for more information regarding UNIONALL markers.
Some features such as outer joins, insert-only aggregate materialized view queries and remote tables are not supported for materialized views with UNIONALL. Note, however, that materialized views used in replication, which do not contain joins or aggregates, can be fast refreshed when UNIONALL or remote tables are used.
The compatibility initialization parameter must be set to 9.2.0 or higher to create a fast refreshable materialized view with UNIONALL.
I don’t want to offend Oracle fans, but judging by their list of restrictions, it seems that this mechanism was written not in the general case, using some kind of model, but thousands of Indians, where everyone was given to write their own branch, and each of them did what they could, then and did. Using this mechanism for real logic is like walking through a minefield. At any time, you can get a mine by hitting one of the non-obvious restrictions. How it works is also a separate issue, but it is outside the scope of this article.
Microsoft SQL Server
Additional requirements
In addition to the SET options and deterministic function requirements, the following requirements must be met:
The user that executes CREATE INDEX must be the owner of the view.
When you create the index, the IGNORE_DUP_KEY option must be set to OFF (the default setting).
Tables must be referenced by two-part names, schedule.tablename in the view definition.
User-defined functions referenced in the view must be created by using the WITH SCHEMABINDING option.
Any user-defined functions referenced in the view must be referenced by two-part names, ..
The data access property of a user-defined function must be NO SQL, and external access property must be NO.
Common language runtime (CLR) functions can appear in the select list of the view, but cannot be part of the definition of the clustered index key. CLR functions cannot appear in the WHERE clause of the view or the ON clause of a JOIN operation in the view.
CLR functions and methods of CLR user-defined types used in the view definition must have the properties set as shown in the following table.
Property
Footnotes
DETERMINISTIC = TRUE
Must be declared explicitly as an attribute of the Microsoft .NET Framework method.
PRECISE = TRUE
Must be declared explicitly as an attribute of the .NET Framework method.
DATA ACCESS = NO SQL
Determined by setting DataAccess attribute to DataAccessKind.None and SystemDataAccess attribute to SystemDataAccessKind.None.
EXTERNAL ACCESS = NO
This property defaults to NO for CLR routines.
The view must be created by using the WITH SCHEMABINDING option.
The view must reference only base tables that are in the same database as the view. The view cannot reference other views.
The SELECT statement in the view definition must not contain the following Transact-SQL elements:
1 The indexed view can contain float columns; however, such columns cannot be included in the clustered index key.
If GROUP BY is present, the VIEW definition must contain COUNT_BIG(*) and must not contain HAVING. Thesis GROUP BY restrictions are applicable only to the indexed view definition. A query can use an indexed view in its execution plan even if it does not satisfy these GROUP BY limitations.
If the view definition contains a GROUP BY clause, the key of the unique clustered index can reference only the columns specified in the GROUP BY clause
It can be seen here that the Indians were not attracted, since they decided to do it according to the “we will do little, but well” scheme. That is, they have more mines on the field, but their location is more transparent. This limitation is the most frustrating:
The view must reference only base tables that are in the same database as the view. The view cannot reference other views.
In our terminology, this means that a function cannot call another materialized function. It cuts the whole ideology in the bud.
Also, this restriction (and further down the text) greatly reduces the use cases:
The SELECT statement in the view definition must not contain the following Transact-SQL elements:
OUTER JOINS, UNION, ORDER BY and others are prohibited. Perhaps it was easier to specify what can be used than what cannot. The list would probably be much smaller.
Summing up: a huge set of restrictions in each (I note commercial) DBMS vs none (with the exception of one logical, not technical) in the LGPL technology. However, it should be noted that it is somewhat more difficult to implement this mechanism in relational logic than in the described functional one.
implementation
How it works? PostgreSQL is used as a "virtual machine". Inside there is a complex algorithm that builds queries. Here source. And there is not just a large set of heuristics with a bunch of if's. So, if you have a couple of months to study, then you can try to understand the architecture.
Does it work effectively? Quite efficient. Unfortunately, this is hard to prove. I can only say that if you look at the thousands of requests that are in large applications, then on average they are more efficient than a good developer. An excellent SQL programmer can write any query more efficiently, but with a thousand queries, he simply will not have the motivation or time to do it. The only thing that I can now cite as proof of effectiveness is that several projects work on the basis of the platform built on this DBMS ERP systems, which have thousands of different MATERIALIZED functions, with a thousand users and terabyte databases with hundreds of millions of records, running on a normal two-processor server. However, anyone can check / refute the effectiveness by downloading platform and PostgreSQL turned on logging SQL queries and trying to change the logic and data there.
In the following articles, I will also talk about how you can set restrictions on functions, work with change sessions, and much more.