The data dichotomy: rethinking the relationship between data and services

If you've come across the whole microservices story without any context, you'd be forgiven for thinking it's a little strange. Splitting an application into fragments connected over a network inevitably means the resulting distributed system has to handle complex failure modes.

Although this approach means splitting an application into many independent services, the end goal is far more than just running those services on different machines. What matters here is the interaction with the outside world, which is also distributed in its essence: not in a technical sense, but in the sense of an ecosystem made up of many people, teams, and programs, each of which has to do its own job in one way or another.

Companies, for example, are sets of distributed systems that together contribute to some goal. We ignored this fact for decades, trying to achieve unification through FTP transfers or enterprise integration tools while focusing on our own isolated goals. With the advent of services, everything changed. Services helped us look beyond the horizon and see a world of interdependent programs working together. But to work successfully in it, we have to recognize and design for two fundamentally different worlds: the external world, where we live in an ecosystem of many other services, and our personal, internal world, where we rule alone.

This distributed world is different from the one we grew up in and are used to. The principles behind traditional monolithic architecture simply don't hold up here. So getting these systems right is about more than drawing a cool whiteboard diagram or building a slick proof of concept: the point is that such a system has to keep working successfully for a long time. Luckily, services have been around for quite some time, even if they used to look different. The lessons of SOA are still relevant, even when spiced up with Docker, Kubernetes, and slightly shabby hipster beards.

So today we're going to take a look at how the rules have changed, why we need to rethink our approach to services and the data they pass to each other, and why we'll need a completely different toolkit for that.

Encapsulation won't always be your friend

Microservices can work independently of each other, and it is precisely this property that gives them their greatest value. The same property lets services scale and grow: not so much in terms of scaling to quadrillions of users or petabytes of data (although they can help there too), but in terms of people, as teams and organizations keep growing.

However, independence is a double-edged sword. A service on its own can be built and released easily and naturally. But if a feature implemented inside one service requires another service to be involved, we end up having to change both services at almost the same time. In a monolith this is easy: you make the change and ship the release. Synchronizing changes across independent services is much harder, and the coordination between teams and release cycles destroys agility.

The standard approach is simply to avoid these annoying end-to-end changes by cleanly dividing functionality between services. A single sign-on (SSO) service is a good example: it has a well-defined role that sets it apart from other services. This clear separation means that even as the requirements of the services around it change rapidly, the SSO service itself is unlikely to change. It lives within a strictly bounded context.

The problem is that in the real world business services can't keep up such a clean separation of roles all the time. Business services inevitably rely on data produced by other business services. If you are an online retailer, order flow, the product catalog, and customer information become requirements for many of your services, and each of those services needs access to this data in order to work.

Most business services use the same data stream, so their work is invariably intertwined.

This brings us to an important point: while services work well for infrastructure components that operate largely in isolation, most business services end up far more tightly intertwined.

Data dichotomy

Service-oriented approaches have been around for a while, but there is still little guidance on how to share large amounts of data between services.

The main problem is that data and services are inseparable. On the one hand, encapsulation encourages us to hide data so that services stay decoupled from one another, which makes it easier for them to grow and change. On the other hand, we need to be able to freely split, combine, and query shared data, and to get going with it right away, as easily as in any other information system.

Information systems, however, have little to do with encapsulation; in fact, quite the opposite. Databases do everything they can to expose the data they store, offering a powerful declarative interface that lets you shape the data however you like. That is exactly what you need for exploratory work, but not for managing the growing complexity of an ever-evolving service.

And here a dilemma arises. A contradiction. A dichotomy. Information systems are about exposing data; services are about hiding it.

These two forces are fundamental. They underpin much of our work, constantly vying for supremacy in the systems we build.

As service systems grow and evolve, the consequences of the data dichotomy show themselves in different ways. Either the service interface grows to offer more and more features until it starts to look like a very fancy homegrown database, or we get frustrated and implement some way of extracting or moving whole datasets in bulk from service to service.

Creating something that looks like a fancy homegrown database, in turn, leads to a whole host of problems. We won't go into detail about why a shared database is dangerous; suffice it to say that it represents significant, costly engineering and operational difficulties for the company that tries to use it.

Worse, data volume amplifies problems with service boundaries: the more shared data sits inside a service, the more complex its interface becomes and the harder it is to combine datasets coming from different services.

The alternative approach of extracting and moving whole datasets has its own problems. A common way of doing this is simply to pull out the entire dataset and then keep a local copy of it in each consuming service.

The problem is that different services interpret the data they consume differently. The copy is always at hand; it gets modified and processed locally, and pretty soon it no longer has much in common with the data at the source.

The more mutable the copies, the more the data will vary over time.

Even worse, such data is difficult to correct in retrospect (this is where MDM, master data management, really comes in handy). In fact, some of the intractable technology challenges businesses face stem from heterogeneous data proliferating from application to application.

To solve this shared data problem, we need to think differently: shared data should become a first-class object in the architectures we build. Pat Helland calls such data "data on the outside", and this distinction is crucial. We need encapsulation so that we don't expose a service's internals, but we also need to make it easy for services to access shared data so they can do their jobs correctly.

The problem is that none of today's approaches offers a good solution for working with data on the outside: not service interfaces, not messaging, and not a shared database. Service interfaces are ill-suited to exchanging data at any real scale. Messaging moves data but keeps no history of it, so the data becomes corrupted over time. Shared databases concentrate too much in one place, which holds progress back. We inevitably get stuck in a cycle of data failure:

Data failure cycle

Streams: a decentralized approach to data and services

Ideally, we need to change the way services work with shared data. Right now every approach runs into the dichotomy described above, and there is no magic dust we can generously sprinkle on it to make it disappear. We can, however, rethink the problem and reach a compromise.

That compromise involves a certain degree of centralization. We can use a distributed log, because it gives us reliable, scalable streams. We want services to be able to join and operate on these shared streams, but we want to avoid complex, centralized God Services that do the processing for everyone. So the best option is to build stream processing into each consuming service. That lets services combine datasets from different sources and work with them in whatever way they need.

One way to achieve this is to use a streaming platform. There are many options, but here we will look at Kafka, since its stateful stream processing lets us solve the problem effectively.
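
As a minimal sketch of what such a consuming service might look like (the topic name "orders", the plain String payloads, and the broker address are all illustrative assumptions), a service built with Kafka Streams can subscribe to a shared stream and materialize its own private, local view of it:

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;

    public class OrderViewService {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Each consuming service has its own application id, and therefore its own private state.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-view-service");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();

            // Read the shared "orders" topic as a table: the latest state of every order
            // is materialized into a local store owned entirely by this service.
            KTable<String, String> orders = builder.table(
                    "orders",
                    Materialized.as("orders-local-view"));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

The shared topic stays untouched; everything the service derives from it lives in its own local store.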

Using a distributed log lets us follow the well-trodden path of messaging and build an event-driven architecture. This approach is considered to give better scaling and decoupling than request-response, because it hands control of the flow to the receiver rather than the sender. As with everything in life, though, there is a price to pay, and here it is the broker. For large systems the trade-off is worth it (which you can't say about your average web application).

If the broker provides a distributed log rather than a traditional messaging system, you can take advantage of extra features. The transport can scale almost as linearly as a distributed file system, and data can be kept in the log for a long time, so we get not just messaging but also a repository of information: scalable storage without the fear of mutable shared state.
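
For illustration, a shared topic can be created so that the log itself acts as that repository; the topic name, partition count, and replication factor below are arbitrary assumptions, and retention.ms = -1 simply tells Kafka to keep the records indefinitely:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateSharedStream {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // A shared "orders" topic: partitioned for scalability and retained
                // indefinitely, so the log doubles as a long-term repository of events.
                NewTopic orders = new NewTopic("orders", 12, (short) 3)
                        .configs(Map.of("retention.ms", "-1"));

                admin.createTopics(Collections.singletonList(orders)).all().get();
            }
        }
    }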

You can then use stateful stream processing to add declarative, database-like tooling to your consuming services. This is an important idea: while the data lives in shared streams that all services can access, the aggregation and processing a service performs on it are private, isolated within its own strictly bounded context.
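
Continuing the earlier sketch, the service can then query the state it has materialized as if it were a small private database; the store name "orders-local-view" is the hypothetical one used above, and the streams.store call shown here assumes a reasonably recent Kafka Streams version:

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class LocalOrderLookup {

        private final KafkaStreams streams;

        public LocalOrderLookup(KafkaStreams streams) {
            this.streams = streams;
        }

        // Looks up an order in this service's private, materialized view of the shared
        // stream ("orders-local-view" from the earlier sketch). The read is entirely
        // local: no call is made to any other service.
        public String findOrder(String orderId) {
            ReadOnlyKeyValueStore<String, String> store = streams.store(
                    StoreQueryParameters.fromNameAndType(
                            "orders-local-view",
                            QueryableStoreTypes.keyValueStore()));
            return store.get(orderId);
        }
    }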

Resolve the data dichotomy by sharing an immutable stream of state, then add the processing back into each service using stateful stream processing.

So if your service needs to work with orders, a product catalog, or stock levels, it has full access to them: you alone decide which data to combine, where to process it, and how it should change over time. Even though the data is shared, working with it is completely decentralized. It happens inside each service, in a world where everything runs by your rules.
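
As a rough example of what that decentralized processing might look like, the sketch below joins a hypothetical orders stream with a product-catalog table inside a single service; the topic names and String payloads are again simplifications:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class OrderEnrichment {

        // Combining shared streams inside one service: each order event is enriched
        // with the latest product catalog entry for the same key. Topic names and the
        // plain String payloads are simplifications; default serdes are assumed.
        public static StreamsBuilder buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();

            KStream<String, String> orders = builder.stream("orders");          // keyed by productId
            KTable<String, String> catalog = builder.table("product-catalog");  // keyed by productId

            orders
                .join(catalog, (order, product) -> order + " | " + product)     // enrich each order
                .to("orders-enriched");                                         // private output owned by this service

            return builder;
        }
    }

The join, the enrichment logic, and the output topic belong to this service alone; other services are free to process the same shared streams in completely different ways.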

Share data without compromising it. Encapsulate the function, not the data source, in every service that needs it.

Sometimes data really does need to be moved in bulk: a service may need a local historical dataset in the database engine of its choice. The trick is that, when necessary, a copy can always be regenerated from the source by replaying the distributed log. Connectors in Kafka do a great job of this.

So, the approach discussed today has several advantages:

  • Data is shared as streams that can be kept in the log for a long time, while the machinery for working with that shared data is built into each individual bounded context, which lets services get going easily and quickly. In this way the data dichotomy can be balanced.
  • Data coming from different services can easily be combined into new datasets. This simplifies working with shared data and removes the need to maintain local datasets in a database.
  • Stateful stream processing only caches data; the shared logs remain the source of truth, so the problem of data corruption over time is far less acute.
  • At their core, services are data-driven, which means that despite the constant growth in data volumes, services can still respond quickly to business events.
  • Scalability issues fall on the broker, not the services. This greatly reduces the complexity of writing services, since there is no need to think about scalability.
  • Adding new services does not require changing old ones, so connecting new services becomes easier.

As you can see, this is about much more than just REST. We now have a set of tools that lets us work with shared data in a decentralized way.

Not every aspect has been covered in today's article. We still need to figure out how to balance the request-response and event-driven paradigms, but we will deal with that next time. There are topics worth getting to know better, for example why stateful stream processing is so useful; we will talk about that in the third article. And there are other powerful constructs we can draw on, such as Exactly Once Processing, which is a game changer for distributed business systems because it provides XA-style transactional guarantees in a scalable form; that will be discussed in the fourth article. Finally, we will need to go over the details of implementing these principles.

For now, just remember this: the data dichotomy is a force we face whenever we build business services. The trick is to turn the problem on its head and start treating shared data as first-class objects. Stateful stream processing offers a unique compromise here. It avoids the centralized "God Components" that hold back progress, and it takes the agility, scalability, and resiliency of data streaming pipelines and builds them into each service. We can then focus on a shared stream of events that any service can connect to in order to work with its data. This makes services more scalable, interchangeable, and autonomous, so they will not only look good on whiteboards and in proofs of concept, but will keep working and evolving for decades.

Source: habr.com
