More developers need to know this about databases

Translator's note: Jaana Dogan is an experienced engineer at Google who currently works on observability for the company's production services written in Go. In this article, which has become very popular with English-speaking readers, she collects 17 important technical points about DBMSs (and sometimes distributed systems in general) that developers of large or demanding applications would do well to keep in mind.


The vast majority of computer systems keep track of some state and therefore need a data storage system. I accumulated my knowledge of databases over a long time, making design mistakes along the way that led to data loss and outages. In systems that process large volumes of information, databases sit at the heart of the architecture and are a key element of the design. Even though close attention is paid to how the database behaves, the problems that application developers try to anticipate are often just the tip of the iceberg. In this series of articles, I share some ideas that will be useful for developers who do not specialize in this area.

  1. You are in luck if 99.999% of the time the network is not the source of the problem.
  2. ACID means many different things.
  3. Each database has its own mechanisms for ensuring consistency and isolation.
  4. Optimistic locking helps when holding an exclusive lock is not feasible.
  5. There are other anomalies besides dirty reads and data loss.
  6. The database and the application do not always agree on the course of action.
  7. Sharding at the application level can be moved outside the application.
  8. Auto-incrementing can be dangerous.
  9. Stale data can be useful and does not require locking.
  10. Clock skew happens with any time source.
  11. Latency has many meanings.
  12. Performance requirements should be evaluated for a specific transaction.
  13. Nested transactions can be dangerous.
  14. Transactions should not be tied to application state.
  15. Query planners can tell you a lot about databases.
  16. Online migration is difficult, but possible.
  17. Significant database growth brings unpredictability.

I would like to thank Emmanuel Odeke, Rein Henrichs, and others for their feedback on an early version of this article.

You're in luck if 99.999% of the time the network isn't the source of the problem

It remains an open question how reliable modern networking is and how often systems go down because of network failures. The available data is limited, and research is often dominated by large organizations with specialized networks, equipment, and personnel.

Google reports 99.999% availability for Spanner (its globally distributed database) and attributes only 7.6% of the issues to the network, while crediting its dedicated network as the "mainstay" of that high availability. A 2014 study by Bailis and Kingsbury challenges one of the "fallacies of distributed computing" formulated by Peter Deutsch in 1994: is the network really reliable?

Comprehensive studies outside the giant companies, conducted across the wider Internet, simply do not exist. Nor do the major players publish enough data on what share of their customers' problems are network-related. We are well aware of outages in the network stacks of large cloud providers that can take a whole chunk of the Internet down for hours, simply because those are the high-profile events that affect many people and companies. Network outages can be a source of problems in many more cases, even if not all of them get the spotlight. Cloud customers also rarely learn the causes of their problems. In the event of a failure, it is almost impossible to attribute it to a network error on the provider's side; for customers, third-party services are black boxes. It is impossible to assess the impact without being a major service provider yourself.

Given what the big players report about their systems, it is safe to say you are in luck if network issues account for only a small percentage of your potential downtime. Network communications still suffer from mundane things like hardware failures, topology changes, administrative configuration changes, and power outages. Recently, I was surprised to learn that shark bites have been added to the list of possible problems (yes, you read that right).

ACID means many different things

The acronym ACID stands for Atomicity, Consistency, Isolation, Durability. These transaction properties are intended to guarantee validity in the face of crashes, errors, hardware failures, and so on. Without ACID or similar guarantees, it would be hard for application developers to draw a line between their area of responsibility and what the database is responsible for. Most relational transactional databases try to be ACID compliant, but newer approaches such as NoSQL have spawned many databases without ACID transactions because they are expensive to implement.

When I was just entering the industry, my tech lead mused about how relevant the concept of ACID really is. To be fair, ACID is considered a loose description rather than a strict implementation standard. Today I find it useful mostly because it raises a particular category of questions (and suggests a range of possible solutions).

Not every DBMS is ACID compliant, and the databases that do support ACID interpret the requirements differently. One reason implementations vary is the number of trade-offs that have to be made to satisfy the ACID requirements. Vendors can present their databases as ACID compliant, yet the interpretation of edge cases can differ dramatically, as can the handling of "unlikely" events. At a minimum, developers can understand at a high level how their database is implemented, so they have a good idea of its failure modes and design trade-offs.

The debate over how ACID compliant MongoDB is continues even after the release of version 4. For a long time MongoDB did not support journaling, and by default data was committed to disk no more than once every 60 seconds. Imagine the following scenario: an application performs two writes (w1 and w2). MongoDB saves w1 successfully, but w2 is lost due to a hardware failure.

Diagram illustrating the scenario: MongoDB crashes before it can write the data to disk

Committing to disk is expensive, and by avoiding frequent commits developers improve write performance at the expense of durability. Today MongoDB supports journaling, but dirty writes can still affect data integrity, because journal entries are flushed every 100 ms by default. That is, a similar scenario is still possible for the journal and the changes it records, although the risk is much lower.
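
For writes that must not be lost the way w2 was, the application can ask MongoDB to acknowledge a write only once it has reached the on-disk journal. Below is a minimal sketch, assuming a recent version of the official Go driver (mongo-driver) where writeconcern.Journaled is available; the connection string, database, and collection names are placeholders:

package main

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/writeconcern"
)

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Journaled write concern: the write is acknowledged only after it reaches
	// the on-disk journal, trading write latency for durability.
	coll := client.Database("app").Collection("events",
		options.Collection().SetWriteConcern(writeconcern.Journaled()))

	if _, err := coll.InsertOne(ctx, bson.M{"event": "w1"}); err != nil {
		log.Fatal(err)
	}
}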

Each database has its own mechanisms for ensuring consistency and isolation

Of the ACID requirements, consistency and isolation have the largest number of different implementations, because the range of trade-offs is wider. Consistency and isolation are rather costly features: they require coordination and increase contention in order to keep the data consistent. The problem becomes significantly harder when the database must be scaled horizontally across several data centers (especially if they are located in different geographic regions). Achieving a high level of consistency is then very difficult, because it reduces availability in the presence of network partitions. For a more general treatment of this phenomenon, see the CAP theorem. It is also worth noting that applications can tolerate small inconsistencies, and programmers can understand the nuances of the problem well enough to implement extra logic in the application to handle inconsistencies without relying heavily on the database.

DBMSs often provide different isolation levels, and application developers can choose the most cost-effective one for their needs. Lower isolation increases speed but also increases the risk of data races. Higher isolation reduces that risk but slows things down and can lead to contention that drags the database down to the point where it falls over.

An overview of existing concurrency models and how they relate to each other

The SQL standard defines only four isolation levels, although in theory and in practice there are many more. Jepsen.io offers an excellent overview of existing concurrency models. For example, Google Spanner guarantees external serializability with clock synchronization, and although this is a stricter isolation level, it is not among the standard ones.

The SQL standard mentions the following isolation levels:

  • Serializable (the strictest and most costly level): serializable execution produces the same effect as some serial execution of the transactions. Serial execution means that each transaction starts only after the previous one has fully completed. Note that the Serializable level is often implemented as so-called snapshot isolation (for example, in Oracle) due to differences in interpretation, even though snapshot isolation itself does not appear in the SQL standard.
  • Repeatable reads: uncommitted records in the current transaction are visible to that transaction, but changes made by other transactions (such as newly inserted rows) are not.
  • Read committed: uncommitted data is not visible to other transactions. Transactions see only committed data, and phantom reads can occur: if another transaction inserts and commits new rows, the current transaction will see them when it queries again.
  • Read uncommitted (the least strict and least costly level): dirty reads are allowed, so transactions can see uncommitted changes made by other transactions. In practice, this level can be useful for rough estimates, for example approximate COUNT(*) queries on a table.

The Serializable level minimizes the risk of data races but is the most expensive to implement and puts the most contention on the system. Other isolation levels are cheaper, but data races become more likely. Some DBMSs let you set a custom isolation level, while others have strong preferences and do not support every level.
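
In Go's database/sql, for instance, the isolation level can be requested per transaction, and the driver reports an error if the database cannot honor it. A minimal sketch, assuming a Postgres driver and a placeholder DSN:

package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Request the strictest level; sql.LevelReadCommitted, sql.LevelRepeatableRead,
	// and others are also available, mirroring the levels listed above.
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()

	// ... reads and writes here run under serializable isolation ...

	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}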

Support for isolation levels is often advertised in a particular DBMS, but only a careful study of its behavior allows you to clarify what is actually happening.

An overview of concurrency anomalies at different isolation levels in different DBMSs

Martin Kleppmann's Hermitage project compares different isolation levels, describes concurrency anomalies, and checks whether a database can actually uphold a given isolation level. His research shows how differently database developers think about isolation levels.

Optimistic locking helps when holding an exclusive lock is not feasible

Locking can be very expensive, not only because it increases contention in the database but also because it requires a persistent connection from the application servers to the database. Network partitions can exacerbate exclusive-locking situations and lead to deadlocks that are hard to identify and fix. In cases where holding an exclusive lock is not appropriate, optimistic locking helps.

Optimistic locking is a method in which, when reading a row, you record its version, checksum, or last-modified timestamp. You can then check, atomically, that the version has not changed before writing the record back:

UPDATE products
SET name = 'Telegraph receiver', version = 2
WHERE id = 1 AND version = 1

In this case, the update of the products table affects no rows if another operation has already modified this row. If no other operation touched the row, exactly one row is updated and we can conclude that the update succeeded.
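
From application code, the outcome of such an optimistic update can be checked via the number of affected rows. A sketch in Go with database/sql, matching the query above (Postgres-style placeholders, simplified error handling):

import (
	"context"
	"database/sql"
	"errors"
)

func renameProduct(ctx context.Context, db *sql.DB) error {
	res, err := db.ExecContext(ctx,
		`UPDATE products
		 SET name = $1, version = version + 1
		 WHERE id = $2 AND version = $3`,
		"Telegraph receiver", 1, 1)
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		// Someone else changed the row after we read version 1:
		// re-read the row and retry, or report the conflict to the caller.
		return errors.New("optimistic lock conflict")
	}
	return nil
}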

There are other anomalies besides dirty reads and data loss.

When it comes to data consistency, the focus is on the possible occurrence of race conditions that can lead to dirty reads and data loss. However, data anomalies are not limited to this.

One example of such an anomaly is write skew. Write skews are difficult to detect because we usually don't actively look for them. They involve neither dirty reads nor data loss, but violations of the logical constraints imposed on the data.

For example, consider a monitoring application that requires one agent to be on-call at all times:

BEGIN tx1;                      BEGIN tx2;
SELECT COUNT(*)
FROM operators
WHERE oncall = true;
0                               SELECT COUNT(*)
                                FROM operators
                                WHERE oncall = TRUE;
                                0
UPDATE operators                UPDATE operators
SET oncall = TRUE               SET oncall = TRUE
WHERE userId = 4;               WHERE userId = 2;
COMMIT tx1;                     COMMIT tx2;

In the situation above, write skew occurs if both transactions commit successfully. Although there were no dirty reads and no data was lost, data integrity has been violated: two people are now considered on-call at the same time.

Serializable isolation, schema design, or database constraints can help prevent write skew. Developers need to be able to spot such anomalies during development in order to avoid them in production. At the same time, write skews are extremely hard to find in a code base, especially in large systems where different teams build features on top of the same tables without coordinating how they access the data.
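
As a sketch of the first option, the check-then-update from the example can be run under serializable isolation: in PostgreSQL, for instance, one of two such concurrent transactions fails with a serialization error at commit and can simply be retried. The helper below assumes Go's database/sql and the operators table from the example:

import (
	"context"
	"database/sql"
)

func takeOnCall(ctx context.Context, db *sql.DB, userID int) error {
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var n int
	if err := tx.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM operators WHERE oncall = true`).Scan(&n); err != nil {
		return err
	}
	if n == 0 {
		if _, err := tx.ExecContext(ctx,
			`UPDATE operators SET oncall = true WHERE userId = $1`, userID); err != nil {
			return err
		}
	}
	// Under serializable isolation the database notices that the two transactions'
	// reads and writes overlap: the losing one fails at Commit with a serialization
	// error, and the caller should retry the whole function.
	return tx.Commit()
}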

The database and the application do not always agree on the course of action

One of the key features of databases is that they guarantee an order of execution, but that order can be opaque to the application developer. Databases execute transactions in the order they receive them, not in the order the programmer intended. The order in which transactions execute is hard to predict, especially in busy concurrent systems.

During development, especially when working with non-blocking libraries, poor style and poor readability can lead developers to believe that transactions are executed sequentially, when in fact they can reach the database in any order.

At first glance, T1 and T2 in the program below are called sequentially, but if these functions are non-blocking and immediately return promises, the actual order is determined by when the requests reach the database:

result1 = T1() // real results are promises
result2 = T2()

If atomicity is required (that is, either all operations complete or all are aborted) and the order matters, then operations T1 and T2 should be performed within a single transaction.
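
A sketch of what that could look like with Go's database/sql: both statements (stand-ins for whatever T1 and T2 actually do) run inside one transaction, so they are applied in order and either both commit or both roll back.

func runT1AndT2(ctx context.Context, db *sql.DB, queryT1, queryT2 string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // harmless once Commit has succeeded

	if _, err := tx.ExecContext(ctx, queryT1); err != nil { // whatever T1 does
		return err
	}
	if _, err := tx.ExecContext(ctx, queryT2); err != nil { // whatever T2 does
		return err
	}
	return tx.Commit() // both changes become visible atomically, in order
}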

Sharding at the application level can be moved outside the application

Sharding is a way of partitioning a database horizontally. Some databases can partition data horizontally automatically, while others cannot, or are not very good at it. When data architects/developers can predict exactly how the data will be accessed, they can build horizontal partitions in user space instead of delegating this work to the database. This is called application-level sharding.

Alas, this name often creates the misconception that sharding resides in application services. In fact, it can be implemented as a separate layer in front of the database. Depending on data growth and schema iterations, sharding requirements can become quite complex. For some strategies, it can be useful to be able to iterate without having to redeploy application servers.

An example architecture in which the application servers are separated from the sharding service

Moving sharding into a separate service expands your options for trying different sharding strategies without redeploying the application. Vitess is an example of such an application-level sharding system. It provides horizontal sharding for MySQL and lets clients connect to it over the MySQL protocol; the data is segmented across MySQL nodes that know nothing about each other.
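
At its core, an application-level sharding layer is a routing function from a key to a shard. A minimal hash-based sketch in Go (real systems such as Vitess add topology discovery, resharding, and query rewriting on top):

import (
	"database/sql"
	"hash/fnv"
)

// shardFor picks the shard that holds a given user's data. Changing the number
// of shards remaps keys, which is why production systems usually hash into a
// fixed set of buckets and then map buckets to shards.
func shardFor(userID string, shards []*sql.DB) *sql.DB {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return shards[int(h.Sum32())%len(shards)]
}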

Auto-incrementing can be dangerous

AUTOINCREMENT is a common way to generate primary keys. It is not unusual for a database to be used as an ID generator or to contain tables created specifically for generating identifiers. There are several reasons why generating primary keys by auto-increment is less than ideal:

  • In a distributed database, auto-incrementing is a big problem: a global lock is needed to generate the ID. A UUID is an alternative that requires no coordination between database nodes. Auto-incrementing with locks can introduce contention and significantly degrade insert performance in distributed setups. Some DBMSs (such as MySQL) may require special configuration and extra care to get master-master replication right, and a configuration mistake can lead to write failures.
  • Some databases have partitioning algorithms based on primary keys. Sequential IDs can lead to unpredictable hotspots and increased load on some partitions while others are idle.
  • A primary key is the fastest way to access a row in a database. If better identifiers are available, sequential IDs can turn the most significant column in the table into a meaningless value. So, whenever possible, choose a globally unique natural primary key (e.g., a username).

Before deciding on an approach, consider the impact of auto-incremented IDs and UUIDs on indexing, partitioning, and sharding.
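
As a sketch, generating a key without any coordination between nodes can be as simple as the following, assuming the github.com/google/uuid package and a hypothetical users table:

import "github.com/google/uuid"

// A random (version 4) UUID needs no global lock and no coordination between
// database nodes. Note that random keys scatter inserts across the index,
// which hurts locality compared to sequential IDs; that is one of the
// trade-offs to weigh.
func createUser(ctx context.Context, db *sql.DB, username string) (string, error) {
	id := uuid.NewString()
	_, err := db.ExecContext(ctx,
		`INSERT INTO users (id, username) VALUES ($1, $2)`, id, username)
	return id, err
}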

Stale data can be useful and does not require locking

Multiversion Concurrency Control (MVCC) enables many of the consistency features briefly discussed above. Some databases (for example, Postgres and Spanner) use MVCC to serve transactions a snapshot, an older version of the database. Transactions against a snapshot can still be serialized for consistency. Reading from an old snapshot means reading stale data.

Reading slightly outdated data can be useful, for example, when generating analytics from the data or calculating approximate aggregated values.

The first advantage of working with stale data is low latency (especially if the database is distributed across different geographies). The second is that read-only transactions are lock-free. This is a significant advantage for applications that do a lot of reading, as long as they can handle stale data.

The application server reads data from a local replica that is 5 seconds stale, even though the latest version is available on the other side of the Pacific Ocean

DBMSs automatically clean up old versions and, in some cases, let you do so on demand. For example, Postgres lets users run VACUUM on demand and also performs it periodically in the background, while Spanner runs a garbage collector to get rid of snapshots older than one hour.
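
As a sketch of a stale read, the Cloud Spanner Go client lets a read-only operation declare how much staleness it tolerates; the database path and query below are placeholders:

import (
	"context"
	"time"

	"cloud.google.com/go/spanner"
)

func readStaleArticles(ctx context.Context, client *spanner.Client) error {
	// Read from a snapshot that is exactly 5 seconds old: no locks are taken
	// and a nearby replica can usually serve the read.
	ro := client.Single().WithTimestampBound(spanner.ExactStaleness(5 * time.Second))
	iter := ro.Query(ctx, spanner.Statement{SQL: `SELECT id, title FROM articles`})
	defer iter.Stop()
	// ... iterate with iter.Next() and scan the rows ...
	return nil
}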

Clock skew happens with any time source

The best kept secret in computer science is that all time APIs lie. Our machines do not know the exact current time. They contain quartz crystals whose oscillations are used to keep time, but quartz is not accurate enough and drifts ahead of or behind the true time. The drift can reach 20 seconds per day, so the time on our computers has to be periodically synchronized over the network.

NTP servers are used for synchronization, but the synchronization process itself is subject to network delays. Even synchronizing with an NTP server in the same data center takes time, and working with a public NTP server can introduce even more skew.

Atomic clocks and their GPS counterparts are better references for the current time, but they are expensive and require complex setup, so they cannot be installed on every machine. Data centers therefore use a layered approach: atomic and/or GPS clocks provide the exact time and broadcast it to the rest of the machines through secondary servers. This means every machine ends up with some offset from the true time.

The situation is aggravated by the fact that applications and databases often run on different machines (if not in different data centers). Thus, the time differs not only across database nodes spread over different machines; it also differs on the application server.

Google's TrueTime takes a different approach. Most people assume Google's progress here comes down to simply switching to atomic and GPS clocks, but that is only part of the picture. Here is how TrueTime works:

  • TrueTime uses two different sources: GPS and atomic clocks. These clocks have uncorrelated failure modes (for details, see page 5 here; translator's note), so using both improves reliability.
  • TrueTime has an unusual API: it returns time as an interval that accounts for measurement error and uncertainty, so the real moment lies somewhere between the lower and upper bounds. Spanner, Google's distributed database, can simply wait until it is safe to say that the current time has passed a given point. This adds some latency, especially when the uncertainty reported by the masters is high, but it guarantees correctness even in a globally distributed setting.

Spanner components use TrueTime: TT.now() returns an interval, so Spanner can simply sleep until it is certain that the current time has passed a particular timestamp

Lower accuracy in determining the current time means longer Spanner operations and lower performance. That is why it is important to keep the clocks as accurate as possible, even though a perfectly accurate clock is impossible.
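
The idea can be modeled with a toy interval-based clock. This is a conceptual sketch, not Google's actual API, and the uncertainty bound is an arbitrary illustrative value:

import "time"

// Interval models a TrueTime-style answer: the true time lies somewhere inside it.
type Interval struct {
	Earliest, Latest time.Time
}

// now widens the local clock by an assumed uncertainty bound.
func now() Interval {
	t := time.Now()
	const eps = 5 * time.Millisecond // illustrative uncertainty, not a real figure
	return Interval{Earliest: t.Add(-eps), Latest: t.Add(eps)}
}

// commitWait blocks until ts is definitely in the past for every observer,
// mirroring the "wait out the uncertainty" idea described above.
func commitWait(ts time.Time) {
	for !now().Earliest.After(ts) {
		time.Sleep(time.Millisecond)
	}
}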

Latency has many meanings

Ask a dozen experts what latency is and you will probably get different answers. In a DBMS context there is database latency, which differs from the latency the client perceives: the client observes the sum of the network latency and the database latency. Being able to tell the two apart is critical when debugging growing latency problems, so when collecting and displaying metrics, always try to track both.

Performance requirements should be evaluated for a specific transaction

Sometimes the performance characteristics and limitations of a DBMS are specified in terms of write/read throughput and latency. That gives a general idea of the key parameters of the system, but when evaluating a new DBMS a much more thorough approach is to evaluate critical operations separately (per query and/or per transaction). Examples:

  • Write throughput and latency when inserting a new row into table X (with 50 million rows), with the constraints in place and rows added to the related tables.
  • Latency when querying the friends of friends of a given user, with an average of 500 friends per user.
  • Latency when fetching the top 100 entries of a user's timeline when the user follows 500 other users posting X entries per hour.

Evaluation and experimentation can cover such critical cases until you are confident that the database meets the performance requirements. Apply the same breakdown when collecting latency metrics and defining SLOs.

Keep cardinality in mind when gathering per-operation metrics, and rely on logging, event collection, or distributed tracing when you need high-cardinality debugging data. The article "Want to Debug Latency?" covers latency-debugging methodologies.
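
One way to get per-operation latency without exploding cardinality is a histogram labeled by operation name. A sketch assuming the Prometheus Go client (metric and label names are made up):

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// One label per critical operation keeps the metric cardinality bounded while
// still letting each query or transaction be checked against its own SLO.
var opLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "db_operation_duration_seconds",
		Help: "Latency of critical database operations.",
	},
	[]string{"operation"},
)

func init() { prometheus.MustRegister(opLatency) }

func timed(op string, f func() error) error {
	start := time.Now()
	err := f()
	opLatency.WithLabelValues(op).Observe(time.Since(start).Seconds())
	return err
}

A call site would then wrap each critical operation, for example timed("insert_article", func() error { ... }).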

Nested transactions can be dangerous

Not every DBMS supports nested transactions, but when it does, such transactions can lead to unexpected errors that are not always easy to detect (that is, it may not even be obvious that there is an anomaly).

Nested transactions can be avoided by using client libraries that detect and bypass them. If they cannot be avoided, take special care in implementing them to avoid unexpected situations where committed transactions are accidentally aborted because of a nested one.

Encapsulating transactions in different layers can lead to nested transactions appearing unexpectedly, and, from a readability standpoint, can make the author's intent hard to follow. Take a look at the following program:

with newTransaction():
   Accounts.create("609-543-222")
   with newTransaction():
       Accounts.create("775-988-322")
       throw Rollback();

What will be the output of the above code? Will it roll back both transactions, or just the inner one? What happens if we rely on multiple layers of libraries that encapsulate transaction creation from us? Will we be able to identify and improve such cases?

Imagine a data layer in which several operations (for example, newAccount) are already implemented inside their own transactions. What happens if you run them as part of higher-level business logic that runs inside its own transaction? What would the isolation and consistency properties be in that case?

function newAccount(id string) {
  with newTransaction():
      Accounts.create(id)
}

Instead of chasing answers to these endless questions, it is better to avoid nested transactions. Your data layer can still perform high-level operations without creating its own transactions. The business logic can then start a transaction, run operations within it, and commit or abort it.

function newAccount(id string) {
   Accounts.create(id)
}
// In main application:
with newTransaction():
   // Read some data from database for configuration.
   // Generate an ID from the ID service.
   Accounts.create(id)
   Uploads.create(id) // create upload queue for the user.

Transactions should not be tied to application state

Sometimes it is tempting to use application state in transactions to change certain values or tweak query parameters. The critical nuance is scope. Clients often retry transactions when there are network problems, and if a transaction depends on state that some other part of the application mutates, it may pick up the wrong value depending on how a data race plays out. Transactions must take data races in the application into account.

var seq int64
with newTransaction():
    newSeq := atomic.Increment(&seq)
    Entries.query(newSeq)
    // Other operations...

The transaction above increments the sequence number on every execution, regardless of the final outcome. If the commit fails because of a network problem, the retry will run with a different sequence number.
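
A safer sketch derives the sequence number inside the transaction, from the database itself, so a retry cannot observe application state mutated by a previous attempt (the entries table and its columns are hypothetical):

import (
	"context"
	"database/sql"
)

func appendEntry(ctx context.Context, db *sql.DB, payload string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// The next sequence number is computed from the data itself, so a retried
	// transaction cannot see a counter that a previous failed attempt already
	// incremented. A unique constraint on seq (plus a retry on conflict) guards
	// against two concurrent writers picking the same value.
	var newSeq int64
	if err := tx.QueryRowContext(ctx,
		`SELECT COALESCE(MAX(seq), 0) + 1 FROM entries`).Scan(&newSeq); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO entries (seq, payload) VALUES ($1, $2)`, newSeq, payload); err != nil {
		return err
	}
	return tx.Commit()
}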

Query planners can tell you a lot about a database

Query planners determine how a query will be executed against the database. They also analyze queries and optimize them before running them. A planner can only produce estimates from the signals available to it. For example, what is the best way to run the following query?

SELECT * FROM articles where author = "rakyll" order by title;

Results can be retrieved in two ways:

  • Full table scan: You can look at each entry in the table and return articles with the same author name, and then order them.
  • Index scan: You can use an index to find matching IDs, get those rows, and then order them.

The query planner's job is to determine which strategy is best, but it has limited predictive power, which can lead to poor decisions. DBAs and developers can use it to diagnose and tune inefficient queries. Newer versions of a DBMS may tune the planner differently, so comparing plans can help when upgrading the database if the new version causes performance problems. Slow query logs, latency issue reports, and runtime statistics help identify the queries that need optimization.
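
The planner's decision can be inspected directly with EXPLAIN. A sketch that runs it through Go's database/sql, assuming PostgreSQL syntax:

import (
	"context"
	"database/sql"
	"fmt"
)

func explainQuery(ctx context.Context, db *sql.DB) error {
	// EXPLAIN returns the chosen plan as rows of text;
	// EXPLAIN ANALYZE would also execute the query and report actual timings.
	rows, err := db.QueryContext(ctx,
		`EXPLAIN SELECT * FROM articles WHERE author = 'rakyll' ORDER BY title`)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			return err
		}
		fmt.Println(line) // shows, e.g., whether a sequential scan or an index scan was picked
	}
	return rows.Err()
}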

Some metrics reported by the query planner can be noisy (especially estimates of latency or CPU time). A good complement to planners are tools for tracing and following the execution path; they help diagnose such problems (alas, not every DBMS provides them).

Online migration is difficult but possible

Online migration, "live" migration, or real-time migration means moving from one database to another without downtime and without compromising data integrity. Live migration is easier when moving within the same DBMS/engine, and much harder when moving to a new DBMS with different performance characteristics and schema requirements.

There are various online migration models. Here is one of them:

  • Enable dual writes to both databases (see the sketch below). At this stage the new database does not have all the data; it only receives new writes. Once you are sure of this, you can move on to the next step.
  • Enable reads from both databases.
  • Make the new database the primary target for reads and writes.
  • Stop writing to the old database while continuing to read from it. At this point the new database is still missing some data, which must be copied over from the old one.
  • Make the old database read-only and copy the missing data into the new database. Once the migration is complete, switch all paths to the new database, then stop the old one and remove it from the system.

For more details, I recommend the article describing Stripe's migration strategy, which is based on this model.
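
Here is a minimal sketch of the dual-write step from the list above, assuming Go's database/sql and a hypothetical users table; a real migration would also queue failed mirror writes for reconciliation:

import (
	"context"
	"database/sql"
	"log"
)

func saveUser(ctx context.Context, oldDB, newDB *sql.DB, id, name string) error {
	// The old database is still the source of truth: its failure fails the request.
	if _, err := oldDB.ExecContext(ctx,
		`INSERT INTO users (id, name) VALUES ($1, $2)`, id, name); err != nil {
		return err
	}
	// The new database only mirrors writes at this stage. A failed mirror write
	// is logged for later reconciliation (for example by a backfill job) rather
	// than failing the user-facing request.
	if _, err := newDB.ExecContext(ctx,
		`INSERT INTO users (id, name) VALUES ($1, $2)`, id, name); err != nil {
		log.Printf("mirror write to new database failed: %v", err)
	}
	return nil
}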

Significant database growth brings unpredictability

The growth of a database leads to unpredictable problems associated with its scale. The better we know the internals of a database, the better we can predict how it will scale, yet some things remain impossible to foresee.

As the database grows, earlier assumptions and expectations about data volume and network bandwidth requirements may become obsolete. That is when major schema overhauls, large-scale operational improvements, rethinking deployments, or migrating to another DBMS come up as ways to head off looming problems.

But do not assume that deep knowledge of your current database's internals is all you need. New scale brings new unknowns: unpredictable hotspots, uneven data distribution, unexpected bandwidth and hardware problems, ever-growing traffic, and new network partitions will force you to reconsider your database, your data model, your deployment model, and the size of your deployment.

...

When I first thought about publishing this article, my original list had five more items. Then a huge number of new ideas arrived about what else could be covered. So the article touches on the least obvious problems, the ones that deserve the most attention. That does not mean the topic is exhausted, and I may return to it in future materials or amend this one.

Source: habr.com
