Operational analytics in a microservice architecture: helping and hinting Postgres FDW

Microservice architecture, like everything in this world, has its pros and cons. Some processes become easier with it, others harder. And for the sake of speed of change and better scalability, you have to make sacrifices. One of them is the complexity of analytics. If in a monolith all operational analytics can be reduced to SQL queries against an analytical replica, then in a multi-service architecture each service has its own database, and it seems that a single query won't be enough (or will it?). For those interested in how we solved the problem of operational analytics in our company and how we learned to live with this solution, read on.

My name is Pavel Sivash. At DomClick I work on the team responsible for maintaining the analytical data warehouse. Our work can loosely be classified as data engineering, but in fact the range of tasks is much wider: standard data-engineering ETL/ELT, support and adaptation of data analysis tools, and development of tools of our own. In particular, for operational reporting we decided to "pretend" that we have a monolith and give analysts a single database containing all the data they need.

We considered different options. It was possible to build a full-fledged warehouse, and we even tried, but, to be honest, we couldn't reconcile the fairly frequent changes in business logic with the rather slow process of building a warehouse and making changes to it (if anyone has managed it, write in the comments how). We could have told the analysts: "Guys, learn Python and go to the analytical replicas yourselves," but that is an extra hiring requirement, and it seemed this should be avoided if possible. So we decided to try FDW (Foreign Data Wrapper) technology: in essence it is the standard dblink from the SQL standard, but with a much more convenient interface. We built a solution on top of it, it eventually took root, and we settled on it. Its details are a topic for a separate article, maybe more than one, because there is a lot to talk about: from database schema synchronization to access control and anonymization of personal data. Note also that this solution is not a replacement for real analytical databases and warehouses; it only solves a specific problem.

At the top level it looks like this:

[architecture diagram]
There is a PostgreSQL database where users can store their working data, and, most importantly, analytical replicas of all services are connected to this database via FDW. This makes it possible to write a single query spanning several databases, and it doesn't matter what they are: PostgreSQL, MySQL, MongoDB or something else (a file, an API; if there is no suitable wrapper, you can write your own). Well, everything seems great! Time to call it a day?
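
For those who have not wired this up before, here is a minimal sketch of connecting one service's analytical replica via postgres_fdw. The server name, host, schema and credentials below are illustrative placeholders, not our actual configuration:

-- Connect a service's analytical replica to the central database
-- (all names and credentials are placeholders).
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER service_replica
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'replica.service.internal', port '5432', dbname 'service_db');

CREATE USER MAPPING FOR analyst
    SERVER service_replica
    OPTIONS (user 'readonly_user', password 'secret');

-- Mirror the remote tables into a local schema for analysts to query.
CREATE SCHEMA IF NOT EXISTS fdw_schema;
IMPORT FOREIGN SCHEMA public
    FROM SERVER service_replica
    INTO fdw_schema;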

If it had all ended that quickly and simply, this article probably would not exist.

It is important to understand how Postgres handles queries to remote servers. This seems logical, but people often overlook it: Postgres splits the query into parts that are executed independently on the remote servers, collects the resulting data, and performs the final calculations itself, so the execution speed depends heavily on how the query is written. Note also that when the data arrives from a remote server, it no longer has any indexes and there is nothing to help the planner; only we ourselves can help by giving it hints. That is what I want to talk about in more detail.

A simple query and its plan

To show how Postgres queries a 6 million row table on a remote server, let's look at a simple plan.

explain analyze verbose  
SELECT count(1)
FROM fdw_schema.table;

Aggregate  (cost=418383.23..418383.24 rows=1 width=8) (actual time=3857.198..3857.198 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..402376.14 rows=6402838 width=0) (actual time=4.874..3256.511 rows=6406868 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Remote SQL: SELECT NULL FROM fdw_schema.table
Planning time: 0.986 ms
Execution time: 3857.436 ms

The VERBOSE option lets you see the query that will be sent to the remote server and whose results we will receive for further processing (the Remote SQL line).

Let's go a little further and add several filters to our query: one on a boolean field, one on a timestamp falling within an interval, and one on a jsonb field.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta->>'source' = 'test';

Aggregate  (cost=577487.69..577487.70 rows=1 width=8) (actual time=27473.818..27473.819 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..577469.21 rows=7390 width=0) (actual time=31.369..25372.466 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND (("table".meta ->> 'source'::text) = 'test'::text) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 5046843
        Remote SQL: SELECT created_dt, is_active, meta FROM fdw_schema.table
Planning time: 0.665 ms
Execution time: 27474.118 ms

Here lies the point you need to pay attention to when writing queries: the filters were not pushed down to the remote server, which means that in order to execute the query Postgres pulls all 6 million rows just to filter them locally (the Filter line) and perform the aggregation afterwards. The key to success is to write the query so that the filters are sent to the remote machine, and we receive and aggregate only the necessary rows.

That's some booleanshit

With boolean fields, everything is simple. In the original query the problem was the IS operator. If we replace it with =, we get the following result:

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table
WHERE is_active = True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta->>'source' = 'test';

Aggregate  (cost=508010.14..508010.15 rows=1 width=8) (actual time=19064.314..19064.314 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.00..507988.44 rows=8679 width=0) (actual time=33.035..18951.278 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: ((("table".meta ->> 'source'::text) = 'test'::text) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 3567989
        Remote SQL: SELECT created_dt, meta FROM fdw_schema.table WHERE (is_active)
Planning time: 0.834 ms
Execution time: 19064.534 ms

As you can see, the filter was pushed down to the remote server, and the execution time dropped from 27 to 19 seconds.

Note that the IS operator differs from the = operator in that it can handle the NULL value. It means that IS NOT TRUE keeps both False and Null values in the filter, whereas != True keeps only False values. Therefore, when replacing the IS NOT operator, you should pass two conditions combined with OR to the filter, for example, WHERE (col != True) OR (col IS NULL).
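
As a concrete illustration, a pushdown-friendly equivalent of is_active IS NOT TRUE on the same table might look like this (a sketch; whether this exact combination is shipped to your remote server is worth verifying with EXPLAIN VERBOSE):

SELECT count(1)
FROM fdw_schema.table
-- Keeps both False and Null rows, matching IS NOT TRUE semantics,
-- while using operators that can be pushed to the remote server.
WHERE (is_active != True) OR (is_active IS NULL);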

With booleans figured out, let's move on. For now, let's return the boolean filter to its original form so we can consider the effect of the other changes independently.

timestamptz? Who knows

In general, you often have to experiment with how to write a query that involves remote servers correctly, and only then look for an explanation of why it behaves that way; very little information about this can be found on the Internet. In our experiments we found that a filter with a fixed date is pushed to the remote server without a hitch, but when we want to set the date dynamically, for example with now() or CURRENT_DATE, this does not happen. In our example we added a filter so that the created_dt column contains data for exactly one month in the past (BETWEEN CURRENT_DATE - INTERVAL '7 month' AND CURRENT_DATE - INTERVAL '6 month'). What did we do in this case?

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt >= (SELECT CURRENT_DATE::timestamptz - INTERVAL '7 month') 
AND created_dt < (SELECT CURRENT_DATE::timestamptz - INTERVAL '6 month')
AND meta->>'source' = 'test';

Aggregate  (cost=306875.17..306875.18 rows=1 width=8) (actual time=4789.114..4789.115 rows=1 loops=1)
  Output: count(1)
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '7 mons'::interval)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.02..306874.86 rows=105 width=0) (actual time=23.475..4681.419 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND (("table".meta ->> 'source'::text) = 'test'::text))
        Rows Removed by Filter: 76934
        Remote SQL: SELECT is_active, meta FROM fdw_schema.table WHERE ((created_dt >= $1::timestamp with time zone)) AND ((created_dt < $2::timestamp with time zone))
Planning time: 0.703 ms
Execution time: 4789.379 ms

We hinted the planner to compute the date in advance in a subquery and pass the already prepared parameter to the filter. And this hint gave us an excellent result: the query became almost 6 times faster!

Again, it is important to be careful here: the data type in the subquery must be the same as the type of the field we filter on, otherwise the planner will decide that, since the types differ, it must first fetch all the data and filter it locally.
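
This is exactly why the subqueries above cast CURRENT_DATE to timestamptz. A sketch of the pitfall, assuming created_dt is a timestamptz column as the plans above suggest:

-- date - interval yields timestamp WITHOUT time zone; the type differs
-- from the timestamptz column, so the filter is applied locally:
AND created_dt >= (SELECT CURRENT_DATE - INTERVAL '7 month')

-- Casting to timestamptz first makes the types match,
-- so the filter is pushed down to the remote server:
AND created_dt >= (SELECT CURRENT_DATE::timestamptz - INTERVAL '7 month')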

Let's return the date filter to its original form.

Freddy vs. jsonb

By now the boolean and date filters had already sped up our query considerably, but there was one more data type left. The battle with filtering by it, to be honest, is not over yet, although there are successes here too. So, here is how we managed to pass a filter on a jsonb field to the remote server.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active is True
AND created_dt BETWEEN CURRENT_DATE - INTERVAL '7 month' 
AND CURRENT_DATE - INTERVAL '6 month'
AND meta @> '{"source":"test"}'::jsonb;

Aggregate  (cost=245463.60..245463.61 rows=1 width=8) (actual time=6727.589..6727.590 rows=1 loops=1)
  Output: count(1)
  ->  Foreign Scan on fdw_schema."table"  (cost=1100.00..245459.90 rows=1478 width=0) (actual time=16.213..6634.794 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Filter: (("table".is_active IS TRUE) AND ("table".created_dt >= (('now'::cstring)::date - '7 mons'::interval)) AND ("table".created_dt <= ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)))
        Rows Removed by Filter: 619961
        Remote SQL: SELECT created_dt, is_active FROM fdw_schema.table WHERE ((meta @> '{"source": "test"}'::jsonb))
Planning time: 0.747 ms
Execution time: 6727.815 ms

Instead of the filtering operators, you have to use the operator that checks for containment of one jsonb within another (@>). 7 seconds instead of the original 27. So far this is the only successful option for pushing filters on jsonb down to a remote server, but there is an important limitation to keep in mind: we use version 9.6 of the database, and by the end of April we plan to finish the final tests and move to version 12. Once we update, we'll write about how it affected things, because there are a lot of changes we have high hopes for: jsonpath, the new CTE behavior, pushdown (available since version 10). I really want to try it soon.
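
For clarity, here are the two forms of the same condition side by side; both select the same rows when source is a top-level string key, but, as the plans show, only the containment form was pushed down in our setup:

-- Extraction operator: evaluated locally, all rows are fetched first.
WHERE meta->>'source' = 'test'

-- Containment operator: pushed down to the remote server.
WHERE meta @> '{"source":"test"}'::jsonb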

Finish him

We checked how each change affects the query speed individually. Let's now see what happens when all three filters are written correctly.

explain analyze verbose
SELECT count(1)
FROM fdw_schema.table 
WHERE is_active = True
AND created_dt >= (SELECT CURRENT_DATE::timestamptz - INTERVAL '7 month') 
AND created_dt < (SELECT CURRENT_DATE::timestamptz - INTERVAL '6 month')
AND meta @> '{"source":"test"}'::jsonb;

Aggregate  (cost=322041.51..322041.52 rows=1 width=8) (actual time=2278.867..2278.867 rows=1 loops=1)
  Output: count(1)
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '7 mons'::interval)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1)
          Output: ((('now'::cstring)::date)::timestamp with time zone - '6 mons'::interval)
  ->  Foreign Scan on fdw_schema."table"  (cost=100.02..322041.41 rows=25 width=0) (actual time=8.597..2153.809 rows=1360025 loops=1)
        Output: "table".id, "table".is_active, "table".meta, "table".created_dt
        Remote SQL: SELECT NULL FROM fdw_schema.table WHERE (is_active) AND ((created_dt >= $1::timestamp with time zone)) AND ((created_dt < $2::timestamp with time zone)) AND ((meta @> '{"source": "test"}'::jsonb))
Planning time: 0.820 ms
Execution time: 2279.087 ms

Yes, the query looks more complicated, and that is the price we pay, but the execution time is 2 seconds, more than 10 times faster! And we are talking about a simple query over a relatively small data set. On real queries we saw speedups of up to several hundred times.

To sum up: if you use PostgreSQL with FDW, always check that all the filters are sent to the remote server, and you will be happy... at least until you get to joins between tables from different servers. But that's a story for another article.

Thank you for your attention! I'd love to hear questions, comments, and stories about your experiences in the comments.

Source: habr.com
