Overview of Agile DWH Design Methodologies

Building a data warehouse is a long and serious undertaking.

Much in the life of a project depends on how well the object model and the database structure are thought through at the start.

The generally accepted approach has been, and remains, some combination of the star schema with the third normal form - usually along the lines of: source data in 3NF, data marts as a star. This approach is time-tested and backed by plenty of research, and it is the first (and sometimes the only) thing that comes to mind for an experienced DWH specialist thinking about what an analytical warehouse should look like.

On the other hand, business in general and customer requirements in particular tend to change rapidly, while data grows both “in depth” and “in breadth”. And this is where the star's main drawback shows up - limited flexibility.

And if in your quiet and comfortable life as a DWH developer all of a sudden:

  • a task appeared along the lines of “quickly build at least something, and then we'll see”;
  • a rapidly evolving project turned up, with new sources being connected and the business model reworked at least once a week;
  • a customer appeared who has no idea what the system should look like or what functions it should ultimately perform, but is ready to experiment and to converge on the desired result through successive refinements;
  • the project manager dropped by with the good news: “And now we have agile!”.

Or if you are simply curious about other ways to build a warehouse - welcome under the cut!


What does "flexibility" mean?

To begin with, let's define what properties a system must have in order to be called “flexible”.

It is worth noting separately that the properties described here refer to the system itself, not to the process of its development. So if you wanted to read about Agile as a development methodology, you are better off with other articles; on Habr alone there are plenty of good materials, from overviews to practical and problem-focused pieces.

This does not mean that the development process and the structure of the data warehouse are completely unrelated. On the whole, Agile development of a flexible warehouse should be much easier. In practice, however, Agile development of a classic Kimball-style DWH, or a Data Vault built by waterfall, is more common than a happy coincidence of flexibility in both of its forms on one project.

So, what properties should a flexible warehouse have? There are three:

  1. Early delivery and fast iteration - ideally, the first business result (for example, the first working reports) should appear as early as possible, that is, even before the entire system has been designed and implemented. Each subsequent change should also take as little time as possible.
  2. Iterative refinement - each subsequent change should ideally not affect functionality that already works. This point often becomes the biggest nightmare on large projects: sooner or later individual objects acquire so many dependencies that it becomes easier to duplicate the logic in a copy next door than to add a field to an existing table. And if it surprises you that analyzing the impact of a change on existing objects can take longer than the change itself, you most likely have not worked with large data warehouses in banking or telecom.
  3. Constant adaptation to changing business requirements - the overall object structure should be designed not merely with possible expansion in mind, but with the expectation that the direction of the next expansion could not even have been dreamed of at the design stage.

And yes, compliance with all these requirements in one system is possible (of course, in certain cases and with some reservations).

Below I will review two of the most popular agile design methodologies for a DWH - the Anchor Model and Data Vault. Left out of scope are such fine techniques as, say, EAV, 6NF (in its pure form) and everything related to NoSQL solutions - not because they are somehow worse, and not even because the article would otherwise threaten to reach the length of an average dissertation. They simply belong to a slightly different class of solutions: either techniques you can apply in specific cases regardless of the overall architecture of your project (like EAV), or globally different paradigms of information storage (such as graph databases and other NoSQL options).

Problems of the “classical” approach and their solutions in flexible methodologies

By the “classic” approach I mean the good old star schema (regardless of the specific implementation of the underlying layers - may the adherents of Kimball, Inmon and CDM forgive me).

1. Rigid relationship cardinality

This model is based on a clear division of data into dimensions and facts. And that, damn it, is logical - after all, data analysis in the overwhelming majority of cases comes down to analyzing numerical indicators (facts) across certain slices (dimensions).

Relationships between objects are implemented as links between tables via foreign keys. This looks quite natural, but it immediately leads to the first limitation of flexibility - a rigidly defined relationship cardinality.

This means that at table design time you must specify, for each pair of related objects, whether they can relate as many-to-many or only as one-to-many, and “in which direction”. This directly determines which table gets the primary key and which gets the foreign key. Changing that decision when new requirements arrive will most likely mean reworking the database.

For example, when designing the “cash receipt” object you, relying on the sworn assurances of the sales department, allowed one promotion to apply to several receipt line items (but not vice versa):
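A rough DDL sketch of that decision (table and column names here are illustrative, not taken from the article):

```sql
-- Illustrative sketch: the 1-to-many cardinality is baked into the schema.
-- One promo can apply to many receipt line items,
-- but each line item can reference at most one promo.
CREATE TABLE promo (
    promo_id   BIGINT PRIMARY KEY,
    promo_name VARCHAR(200)
);

CREATE TABLE receipt_item (
    receipt_item_id BIGINT PRIMARY KEY,
    receipt_id      BIGINT NOT NULL,
    product_id      BIGINT NOT NULL,
    amount          DECIMAL(12,2),
    promo_id        BIGINT REFERENCES promo (promo_id)  -- the hard-wired assumption
);
```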

And after a while, colleagues introduced a new marketing strategy in which several promotions can apply to the same line item at the same time. Now you have to rework the tables, extracting the relationship into a separate object.

(All derived objects that join promo to receipt now also need to be reworked.)

Links in Data Vault and Anchor Model

Avoiding such a situation turns out to be quite simple: you don't have to trust the sales department - it is enough to store all relationships in separate tables from the start and treat them as many-to-many.

This approach was proposed by Dan Linstedt as part of the Data Vault paradigm and fully embraced by Lars Rönnbäck in the Anchor Model.

As a result, we get the first distinctive feature of flexible methodologies:

Relationships between objects are not stored as attributes of parent entities but form a separate type of object.

In Data Vault such tables are called Links, and in the Anchor Model, Ties. At first glance they look very similar, although their differences do not end with the name (more on that below). In both architectures, link tables can connect any number of entities (not necessarily two).

This seemingly redundant structure provides essential flexibility for future changes. It is tolerant not only to changing the cardinality of existing relationships, but also to adding new ones: if a receipt line item now also needs a reference to the cashier who rang it up, the new relationship is simply an add-on over the existing tables that affects no existing objects or processes.
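A minimal sketch of the same relationship pulled out into a separate link table, continuing the made-up receipt/promo example from above:

```sql
-- Illustrative sketch of a relationship stored as a separate table
-- (a Link in Data Vault terms, a Tie in the Anchor Model).
CREATE TABLE link_receipt_item_promo (
    link_id         BIGINT PRIMARY KEY,    -- could also be a hash of the two keys
    receipt_item_id BIGINT NOT NULL REFERENCES receipt_item (receipt_item_id),
    promo_id        BIGINT NOT NULL REFERENCES promo (promo_id),
    record_source   VARCHAR(50),
    load_ts         TIMESTAMP
);

-- Adding a cashier later is just another link table on top,
-- e.g. link_receipt_item_cashier, with no change to existing tables.
```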


2. Data duplication

The second problem solved by the flexible architectures is less obvious and inherent primarily to SCD2 dimensions (slowly changing dimensions of type 2), although not only to them.

In a classic warehouse, a dimension is usually a table containing a surrogate key (as the PK) plus a set of business keys and attributes in separate columns.


If the dimension supports versioning, version validity boundaries are added to the standard set of fields, and each source row produces multiple versions in the warehouse (one per change of any versioned attribute).

If a dimension contains at least one versioned attribute that changes frequently, the number of versions will be impressive (even if the other attributes are not versioned or never change), and with several such attributes the number of versions grows rapidly with their count. Such a dimension can occupy a significant amount of disk space, although most of the data it stores is simply duplicated values of unchanged attributes copied from other rows.
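For illustration, a minimal SCD2 dimension sketch (hypothetical names and columns) where one volatile attribute drives most of the versions while the stable ones get copied into every new row:

```sql
-- Illustrative SCD2 dimension: every change of any versioned attribute
-- creates a full new row, duplicating all the unchanged values.
CREATE TABLE dim_customer (
    customer_sk BIGINT PRIMARY KEY,     -- surrogate key
    customer_bk VARCHAR(50) NOT NULL,   -- business key from the source
    full_name   VARCHAR(200),
    birth_date  DATE,                   -- never changes, yet copied into every version
    region      VARCHAR(100),           -- changes rarely
    category    VARCHAR(50),            -- changes often and drives most of the versions
    valid_from  TIMESTAMP NOT NULL,
    valid_to    TIMESTAMP NOT NULL
);
```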


Denormalization is also often used here: some attributes are intentionally stored as values rather than as references to a lookup table or another dimension. This speeds up data access by reducing the number of joins needed when querying a dimension.

Typically this results in the same information being stored in several places at once. For example, a customer's region of residence and category may simultaneously live in the “Customer” dimension, in the “Purchase”, “Delivery” and “Call Center Contacts” fact tables, and in the “Customer - Customer Manager” link table.

In general, the above applies to ordinary (non-versioned) dimensions as well, but in versioned ones it takes on a different scale: the appearance of a new version of an object (especially a retroactive one) leads not just to updating all related tables but to a cascading appearance of new versions of related objects - when Table 1 is used to build Table 2, and Table 2 to build Table 3, and so on. Even if not a single attribute of Table 1 participates in building Table 3 (and other attributes of Table 2 obtained from other sources do), versioning that chain will at the very least add overhead, and at worst produce extra versions in Table 3, which has “nothing to do with it”, and further down the chain.


3. Nonlinear complexity of refinement

Meanwhile, each new data mart built on top of another one increases the number of places where data can “diverge” when changes are made to the ETL. This, in turn, increases the complexity (and duration) of each subsequent change.

If your ETL processes change rarely, you can live with this paradigm: just make sure every new change is correctly propagated to all related objects. If changes happen frequently, the probability of accidentally “missing” a few links rises significantly.

If we also keep in mind that “versioned” ETL is noticeably more complex than “non-versioned” ETL, avoiding mistakes while frequently reworking this whole machinery becomes quite difficult.

Storing objects and attributes in Data Vault and Anchor Model

The approach proposed by the authors of flexible architectures can be formulated as follows:

You need to separate what changes from what stays the same - that is, store keys separately from attributes.

Do not confuse a non-versioned attribute with an unchanging one, though: the former does not keep a history of its changes but can still change (for example, when an input error is corrected or new data arrives); the latter never changes at all.

Data Vault and the Anchor Model differ on what exactly can be considered unchanging.

From the Data Vault point of view, the entire set of keys can be considered unchanging: natural keys (the organization's TIN, the product code in the source system, etc.) as well as surrogate ones. The remaining attributes can then be grouped by source and/or frequency of change, with a separate table kept for each group, each with its own independent set of versions.

In the Anchor Model paradigm, only the entity's surrogate key is considered unchanging. Everything else (including natural keys) is just a special case of attributes. By default, all attributes are independent of one another, so a separate table must be created for every attribute.

In Data Vault, tables containing entity keys are called Hubs. Hubs always contain a fixed set of fields:

  • The entity's natural keys
  • A surrogate key
  • A reference to the source (record source)
  • The load timestamp

Entries in Hubs never change and have no versions. Outwardly, hubs resemble the ID-map tables some systems use to generate surrogates; however, in Data Vault it is recommended to use a hash of the set of business keys as the surrogate rather than an integer sequence. This simplifies loading links and attributes from the sources (no need to join to the hub to obtain a surrogate - just compute the hash of the natural key), but it can cause other problems (for example, collisions, case sensitivity and non-printable characters in string keys), so it is not universally accepted.
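A minimal Hub sketch, assuming illustrative names and the hash-based surrogate described above (exact column naming conventions vary between Data Vault implementations):

```sql
-- Illustrative Data Vault Hub: keys only, no versions.
CREATE TABLE hub_customer (
    customer_hk   CHAR(32) PRIMARY KEY,   -- e.g. an MD5 hash of the business key
    customer_bk   VARCHAR(50) NOT NULL,   -- natural/business key from the source
    record_source VARCHAR(50) NOT NULL,   -- where the key was first seen
    load_ts       TIMESTAMP   NOT NULL    -- when it was first loaded
);

-- The surrogate can be computed at load time without joining to the hub,
-- e.g. SELECT MD5(UPPER(TRIM(customer_bk))) ... (dialect-dependent).
```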

All other entity attributes are stored in dedicated tables called Satellites. One hub can have several satellites storing different sets of attributes.


Attributes are distributed among satellites by the principle of “changing together”: one satellite may store non-versioned attributes (say, date of birth and SNILS for an individual), another rarely changing versioned ones (last name and passport number), a third frequently changing ones (delivery address, category, date of the last order, and so on). Versioning happens at the level of an individual satellite rather than the entity as a whole, so it makes sense to distribute attributes so that the overlap of versions within one satellite is minimal (which reduces the total number of stored versions).

Also, to streamline the loading process, attributes obtained from different sources are often placed in separate satellites.
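Continuing the same illustrative example, here are two satellites for one hub, split by how their attributes change (names and groupings are assumptions made for the sketch):

```sql
-- Effectively non-versioned attributes.
CREATE TABLE sat_customer_static (
    customer_hk   CHAR(32) REFERENCES hub_customer (customer_hk),
    birth_date    DATE,
    tax_id        VARCHAR(20),
    record_source VARCHAR(50),
    load_ts       TIMESTAMP,
    PRIMARY KEY (customer_hk, load_ts)
);

-- Frequently changing, versioned attributes: new rows appear here
-- without touching the hub or the other satellite.
CREATE TABLE sat_customer_profile (
    customer_hk     CHAR(32) REFERENCES hub_customer (customer_hk),
    delivery_addr   VARCHAR(300),
    category        VARCHAR(50),
    last_order_date DATE,
    record_source   VARCHAR(50),
    load_ts         TIMESTAMP,
    PRIMARY KEY (customer_hk, load_ts)
);
```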

Satellites reference the Hub via a foreign key (which corresponds to one-to-many cardinality). This means that multiple attribute values (for example, several contact phone numbers for one customer) are supported by this architecture “by default”.

In the Anchor Model, tables that store keys are called Anchors. They hold only:

  • A surrogate key
  • A reference to the source
  • The load timestamp

From the Anchor Model's point of view, natural keys are ordinary attributes. This may seem harder to grasp, but it gives much more room for identifying an object.
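A minimal Anchor Model sketch under the same illustrative naming, with the natural key demoted to an ordinary attribute table:

```sql
-- The anchor holds only the surrogate key plus loading metadata.
CREATE TABLE anchor_customer (
    customer_id   BIGINT PRIMARY KEY,
    record_source VARCHAR(50),
    load_ts       TIMESTAMP
);

-- The natural key is just another attribute, one table per source if needed.
CREATE TABLE attr_customer_source_key (
    customer_id   BIGINT REFERENCES anchor_customer (customer_id),
    source_key    VARCHAR(50),
    record_source VARCHAR(50),
    load_ts       TIMESTAMP,
    PRIMARY KEY (customer_id, load_ts)
);

-- Every other attribute gets its own table; versions are per attribute.
CREATE TABLE attr_customer_name (
    customer_id BIGINT REFERENCES anchor_customer (customer_id),
    full_name   VARCHAR(200),
    load_ts     TIMESTAMP,
    PRIMARY KEY (customer_id, load_ts)
);
```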


Suppose, for example, that data about the same entity can come from different systems, each using its own natural key. In Data Vault this can lead to rather cumbersome constructions of several hubs (one per source plus a merged master version), whereas in the Anchor Model the natural key of each source goes into its own attribute and can be used during loading independently of all the others.

But here lies an insidious catch: if attributes from different systems are combined into one entity, there are most likely some merge rules by which the system decides that records from different sources correspond to the same instance of the entity.

In Data Vault, these rules will most likely govern the formation of the “surrogate hub” of the master entity and will not affect the Hubs that store the sources' natural keys and their original attributes. If the merge rules change at some point (or the attributes used for merging are updated), it is enough to re-form the surrogate hubs.

In the Anchor Model, such an entity will most likely be stored in a single anchor. This means that all attributes, whichever source they come from, are bound to the same surrogate. Splitting apart erroneously merged records and, more generally, keeping the merge up to date can be considerably harder in such a system, especially if the rules are complex and change often and the same attribute can arrive from different sources (although it is certainly possible, since every attribute version keeps a reference to its origin).

In any case, if your system is supposed to provide deduplication, record merging and other MDM functionality, pay particularly close attention to how the flexible methodologies store natural keys. The bulkier Data Vault design may suddenly turn out to be the safer one with respect to merge errors.

The Anchor Model also provides an additional object type called a Knot - in effect a special, degenerate kind of anchor that can hold only one attribute. Knots are meant for storing flat reference lists (gender, marital status, customer service category, and the like). Unlike an Anchor, a Knot has no associated attribute tables; its single attribute (the name) is stored in the same table as the key. Knots are linked to Anchors via Tie tables in the same way Anchors are linked to each other.
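A sketch of a Knot and a Tie wired to the illustrative customer anchor from above, following the description in this section (the names are made up):

```sql
-- A Knot: the key and its single attribute (the name) live in one table.
CREATE TABLE knot_marital_status (
    marital_status_id   SMALLINT PRIMARY KEY,
    marital_status_name VARCHAR(50) NOT NULL
);

-- A Tie connects the knot to an anchor, just like anchor-to-anchor ties.
CREATE TABLE tie_customer_marital_status (
    customer_id       BIGINT   REFERENCES anchor_customer (customer_id),
    marital_status_id SMALLINT REFERENCES knot_marital_status (marital_status_id),
    load_ts           TIMESTAMP,
    PRIMARY KEY (customer_id, load_ts)
);
```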

There is no consensus on using Knots. For example, Nikolay Golov, who actively promotes the Anchor Model in Russia, believes (not without reason) that you can never guarantee a given reference list will stay static and single-level forever, so it is better to use a full-fledged Anchor for every object from the start.

Another important difference between Data Vault and the Anchor Model is whether relationships can carry attributes:

In Data Vault, Links are full-fledged objects just like Hubs and can have attributes of their own. In the Anchor Model, Ties serve only to connect Anchors and cannot have attributes. This difference leads to significantly different approaches to modeling facts, which we discuss next.

Fact storage

So far we have mostly talked about modeling dimensions. With facts, things are a little less clear-cut.

In Data Vault, the typical object for storing facts is a Link, with the actual measures placed in its satellites.

The approach looks intuitive: it gives easy access to the analyzed measures and generally resembles a traditional fact table (only the measures live not in the table itself but in an “adjacent” one). But there are pitfalls: one of the typical model changes - extending the fact's key - requires adding a new foreign key to the Link. That, in turn, “breaks” modularity and potentially forces changes in other objects.
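A rough sketch of a Data Vault-style fact with hypothetical names: the Link captures the participating keys and a satellite carries the measures. Extending the fact key would mean adding a column to the Link itself:

```sql
-- The sale itself is a Link between hubs.
CREATE TABLE link_sale (
    sale_hk       CHAR(32) PRIMARY KEY,    -- e.g. a hash of the participating keys
    customer_hk   CHAR(32) REFERENCES hub_customer (customer_hk),
    product_hk    CHAR(32),                -- would reference hub_product in a fuller model
    store_hk      CHAR(32),                -- extending the fact key = adding a column here
    record_source VARCHAR(50),
    load_ts       TIMESTAMP
);

-- The numeric measures hang off the Link in a satellite.
CREATE TABLE sat_sale_measures (
    sale_hk  CHAR(32) REFERENCES link_sale (sale_hk),
    quantity DECIMAL(12,3),
    amount   DECIMAL(12,2),
    load_ts  TIMESTAMP,
    PRIMARY KEY (sale_hk, load_ts)
);
```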

In the Anchor Model, a Tie cannot have attributes of its own, so this approach will not work: absolutely every attribute and measure must be attached to a specific anchor. The conclusion is simple - each fact also needs its own anchor. For some things we are used to treating as facts this looks natural: a purchase maps neatly onto an “order” or “receipt” object, a site visit onto a session, and so on. But there are facts for which such a natural “carrier object” is harder to find - for example, warehouse stock balances at the start of each day.

Accordingly, extending a fact's key causes no modularity problems in the Anchor Model (it is enough to add a new Tie to the corresponding Anchor), but designing the model for facts is less straightforward: “artificial” Anchors may appear that map onto the business object model in a far from obvious way.
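For example, a daily stock balance might get an “artificial” anchor of its own, roughly like this (an illustrative sketch, not a canonical pattern):

```sql
-- The balance snapshot gets its own anchor...
CREATE TABLE anchor_stock_balance (
    stock_balance_id BIGINT PRIMARY KEY,
    load_ts          TIMESTAMP
);

-- ...its measure is an attribute of that anchor...
CREATE TABLE attr_stock_balance_qty (
    stock_balance_id BIGINT REFERENCES anchor_stock_balance (stock_balance_id),
    quantity         DECIMAL(14,3),
    load_ts          TIMESTAMP,
    PRIMARY KEY (stock_balance_id, load_ts)
);

-- ...and ties to the product and warehouse anchors, plus a date attribute,
-- would follow the same pattern (tie_stock_balance_product, etc.).
```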

How flexibility is achieved

In both cases the resulting construction contains significantly more tables than a traditional dimension, yet it can take up significantly less disk space for the same set of versioned attributes. There is no magic here, of course: it is all normalization. By spreading attributes across Satellites (in Data Vault) or separate tables (in the Anchor Model), we reduce (or entirely eliminate) the duplication of some attributes' values when other attributes change.

For Data Vault the gain depends on how the attributes are distributed among the Satellites; for the Anchor Model it is almost directly proportional to the average number of versions per dimension object.

Space savings, however, are an important but not the main advantage of storing attributes separately. Together with the separate storage of relationships, this approach turns the warehouse into a modular design. That means adding individual attributes, or entire new subject areas, looks like building on top of the existing set of objects without changing them. And this is exactly what makes the described methodologies flexible.

It also resembles the shift from one-off craftsmanship to mass production: where in the traditional approach every table in the model is unique and needs individual attention, in the flexible methodologies the model is assembled from standardized “parts”. On the one hand, there are more tables, and loading and querying look more involved. On the other hand, they become uniform, which means they can be automated and driven by metadata. The question “how are we going to store this?”, whose answer could take up a significant share of the design work for a change, now simply does not arise (and neither does the question of a model change's impact on running processes).

This does not mean analysts are no longer needed in such a system - someone still has to work out the set of objects and attributes and figure out where to load it all from. But the amount of work, as well as the probability and cost of error, drop significantly, both at the analysis stage and during ETL development, much of which can be reduced to editing metadata.

Dark side

All of the above makes both approaches genuinely flexible, production-friendly and well suited to iterative refinement. Of course, there is also a fly in the ointment, which you have probably already guessed.

The data decomposition underlying the modularity of flexible architectures increases the number of tables and, accordingly, the join overhead at query time. In a classic warehouse a single SELECT is enough to get all the attributes of a dimension; a flexible architecture requires a whole series of joins. And while for canned reports all those joins can be written once in advance, analysts who are used to writing SQL by hand will suffer doubly.
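To make that concrete, here is roughly what “give me a customer with just two attributes” looks like against the illustrative Anchor-style tables sketched earlier (latest versions only):

```sql
-- One join per attribute, each restricted to its latest version.
SELECT a.customer_id,
       n.full_name,
       k.source_key
FROM anchor_customer a
LEFT JOIN attr_customer_name n
       ON n.customer_id = a.customer_id
      AND n.load_ts = (SELECT MAX(load_ts)
                       FROM attr_customer_name
                       WHERE customer_id = a.customer_id)
LEFT JOIN attr_customer_source_key k
       ON k.customer_id = a.customer_id
      AND k.load_ts = (SELECT MAX(load_ts)
                       FROM attr_customer_source_key
                       WHERE customer_id = a.customer_id);
```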

Several factors soften this picture:

When working with a large dimension, it is almost never the case that all of its attributes are used at once. That means there may be fewer joins than a first glance at the model suggests. In Data Vault you can also take the expected frequency of joint use into account when assigning attributes to satellites. The Hubs or Anchors themselves are needed mainly for generating and mapping surrogates at load time and are rarely used in queries (this is especially true of Anchors).

All joins are on keys. In addition, the more “compact” way of storing data reduces scan overhead where a scan is unavoidable (for example, when filtering on an attribute value). As a result, querying the normalized database with a pile of joins can even turn out faster than scanning one heavy dimension with many versions per row.

For example, this article contains a detailed comparative performance test of the Anchor Model against selecting from a single table.

A lot depends on the engine. Many modern platforms have internal mechanisms for optimizing joins. For example, MS SQL and Oracle can “skip” joins to tables whose data is not used anywhere except in other joins and does not affect the final result (table/join elimination), and, in the experience of colleagues at Avito, the MPP database Vertica proved an excellent engine for the Anchor Model after some manual tuning of query plans. On the other hand, storing an Anchor Model on, say, ClickHouse, with its limited join support, does not look like a good idea yet.

In addition, both architectures have special techniques that make data access easier (both in terms of query performance and for end users), for example Point-in-Time tables in Data Vault or special table functions in the Anchor Model.
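As an illustration of the first of these, here is a minimal Point-in-Time helper over the example hub and satellites from earlier (this layout is one common variant, assumed for the sketch rather than prescribed):

```sql
-- For each customer and snapshot date, pre-resolve which satellite version
-- was in effect, so downstream queries join satellites by equality
-- instead of date-range conditions.
CREATE TABLE pit_customer (
    customer_hk         CHAR(32) REFERENCES hub_customer (customer_hk),
    snapshot_date       DATE,
    sat_static_load_ts  TIMESTAMP,   -- effective version in sat_customer_static
    sat_profile_load_ts TIMESTAMP,   -- effective version in sat_customer_profile
    PRIMARY KEY (customer_hk, snapshot_date)
);
```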

Summary

The key property of the flexible architectures considered here is the modularity of their “construction”.

This property allows:

  • After some initial setup - deploying the metadata and writing the basic ETL algorithms - quickly deliver the first result to the customer, say a couple of reports built on data from just a few source objects. There is no need to think through the entire object model (even at a high level) beforehand.
  • A data model can start working (and being useful) with just two or three objects and then grow gradually (Nikolay offered a beautiful comparison of the Anchor Model with mycelium in this respect).
  • Most changes, including expanding the subject area and connecting new sources, do not affect existing functionality and carry no risk of breaking something that already works.
  • Thanks to the decomposition into standard elements, ETL processes in such systems all look alike, so writing them lends itself to templating and, ultimately, automation (a minimal sketch of the metadata-driven idea follows this list).
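As a minimal illustration of the metadata-driven idea (the registry layout here is entirely hypothetical), a loader could generate its INSERT or MERGE statements from a mapping table like this:

```sql
-- A registry mapping source columns to target satellite/attribute tables;
-- a code generator reads it and produces the actual load statements.
CREATE TABLE etl_attribute_mapping (
    source_table  VARCHAR(100),
    source_column VARCHAR(100),
    target_table  VARCHAR(100),   -- satellite or attribute table
    target_column VARCHAR(100),
    key_table     VARCHAR(100),   -- the hub or anchor the target hangs off
    is_versioned  CHAR(1)
);

INSERT INTO etl_attribute_mapping VALUES
('crm_clients', 'client_name', 'attr_customer_name', 'full_name',
 'anchor_customer', 'Y');
```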

The price of this flexibility is performance. That does not mean acceptable performance is unattainable on such models - it just often takes more effort and attention to detail to reach the desired metrics.

Appendices

Data Vault entity types

[Diagram: Data Vault entity types - Hub, Link, Satellite]

More about Data Vault:
Dan Linstedt's website
All about Data Vault in Russian
About Data Vault on Habré

Anchor Model entity types

[Diagram: Anchor Model entity types - Anchor, Attribute, Tie, Knot]

More about Anchor Model:

The Anchor Model creators' website
An article about the experience of implementing the Anchor Model in Avito

Summary table with common features and differences between the considered approaches:

Common to both approaches: relationships are stored as separate many-to-many tables, keys are separated from attributes, and the resulting structure is modular and metadata-friendly. The main differences:

| Aspect | Data Vault | Anchor Model |
|---|---|---|
| Key tables | Hub (surrogate + natural keys) | Anchor (surrogate key only) |
| Attributes | Satellites, grouped by source / change rate | A separate table per attribute |
| Relationships | Link; can carry its own attributes (in satellites) | Tie; cannot carry attributes |
| Flat reference lists | no dedicated type | Knot |
| Facts | Link with measure satellites | A dedicated Anchor per fact |

Source: habr.com
