The story of an SQL investigation

Last December I received an interesting bug report from the VWO support team: the load time of one of the analytics reports for a large corporate client was prohibitively long. And since this is my area of responsibility, I immediately focused on the problem.

Background

To give some context, I'll tell you a little about VWO. It's a platform for running various targeted campaigns on your websites: A/B experiments, visitor and conversion tracking, sales funnel analysis, heatmaps, and session recording playback.

But the real heart of the platform is reporting. All of the functions above are interconnected, and for corporate clients the huge amount of information would be simply useless without a powerful platform presenting it as analytics.

Using the platform, you can run an arbitrary query over a large data set. Here is a simple example:

Show all clicks on "abc.com" FROM <date d1> TO <date d2> for people who used Chrome OR (were in Europe AND used an iPhone)

Note the boolean operators. They are available to clients in the query interface so they can build arbitrarily complex queries to retrieve segments.
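
For illustration, here's a hedged sketch of how such a segment could translate to SQL. The table and column names (clicks, browser, region, device) are invented for the example and are not VWO's actual schema:

SELECT *
FROM clicks
WHERE domain = 'abc.com'
  AND clicked_at >= DATE '2018-11-19'   -- <date d1>
  AND clicked_at <  DATE '2018-12-19'   -- <date d2>
  AND (browser = 'Chrome'
       OR (region = 'Europe' AND device = 'iPhone'));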

Slow query

The client in question was trying to do something that intuitively should work fast:

Show all session records for users visiting any page with url containing "/jobs"

This site had a huge amount of traffic, and we stored over a million unique URLs for it alone. The client wanted to find a fairly simple URL pattern related to their business model.

Preliminary investigation

Let's see what's going on in the database. Below is the original slow SQL query:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND sessions.referrer_id = recordings_urls.id 
    AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   ) 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0 ;

And here are the timings:

Planning time: 1.480 ms
Execution time: 1431924.650 ms (roughly 24 minutes)
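
Timings like these come straight from the EXPLAIN ANALYZE output (we'll come back to a saved plan from it later). A trivial, self-contained example of getting them:

EXPLAIN ANALYZE SELECT count(*) FROM generate_series(1, 100000);
-- (plan nodes omitted; the last two lines of the output report
--  the planning time and the execution time)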

The query scanned about 150 thousand rows. The query planner showed a couple of interesting details, but no obvious bottlenecks.

Let's dig into the query. You can see that it JOINs three tables:

  1. sessions: session information such as browser, user agent, country, and so on.
  2. recording_data: recorded URLs, pages, and visit durations.
  3. urls: to avoid duplicating extremely long URLs, we store them in a separate table and reference them by id.

Also note that all of our tables are already partitioned by account_id, which rules out the situation where one especially large account causes problems for everyone else.
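
To make the queries below easier to follow, here is a rough sketch of what such a per-account schema could look like. The column types are my assumptions inferred from the queries and the plan, not the actual DDL:

CREATE SCHEMA acc_12345;  -- one schema per account; 12345 is a made-up account_id

CREATE TABLE acc_12345.urls (
    id  text PRIMARY KEY,  -- referenced by sessions.referrer_id and recording_data.urls
    url text NOT NULL      -- the full (possibly very long) URL
);

CREATE TABLE acc_12345.sessions (
    usp_id      bigint PRIMARY KEY,              -- session identifier (type assumed)
    referrer_id text REFERENCES acc_12345.urls   -- plus browser, user agent, country, etc.
);

CREATE TABLE acc_12345.recording_data (
    usp_id       bigint REFERENCES acc_12345.sessions,  -- joins to sessions
    urls         text[],            -- ids of the URLs visited during the recording
    r_time       timestamptz,       -- recording time
    duration     double precision,  -- the plan shows duration cast to double precision
    num_of_pages integer
);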

Looking for clues

On closer inspection we can see that something about this particular query is off. It's worth taking a look at this line:

urls && array(
	select id from acc_{account_id}.urls 
	where url  ILIKE  '%enterprise_customer.com/jobs%'
)::text[]

My first thought was that the ILIKE over all these long URLs (we have over 1.4 million unique URLs collected for this account) might be what's dragging performance down.

But no, that's not it!

SELECT id FROM urls WHERE url ILIKE '%enterprise_customer.com/jobs%';
  id
--------
 ...
(198661 rows)

Time: 5231.765 ms

The pattern-matching query by itself takes only 5 seconds. Pattern matching over a million unique URLs is clearly not the problem.

The next suspects on the list were the multiple JOINs. Maybe overusing them caused the slowdown? JOINs are usually the most obvious candidates for performance problems, but I didn't believe our case was typical.

analytics_db=# SELECT
    count(*)
FROM
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data_0 as recording_data,
    acc_{account_id}.sessions_0 as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND sessions.referrer_id = recordings_urls.id
    AND r_time > to_timestamp(1542585600)
    AND r_time < to_timestamp(1545177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
-------
  8086
(1 row)

Time: 147.851 ms

And that wasn't our case either: the JOINs turned out to be quite fast.

Narrowing down the suspects

I was ready to start rewriting the query to squeeze out any possible performance improvement. The team and I worked out two main ideas:

  • Use EXISTS for the URL subquery: we wanted to check once more whether the subquery over urls was the problem. One way to do that is simply to use EXISTS. EXISTS can greatly improve performance, since it terminates as soon as it finds a single row matching the condition.

SELECT
	count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls,
    acc_{account_id}.recording_data as recording_data,
    acc_{account_id}.sessions as sessions
WHERE
    recording_data.usp_id = sessions.usp_id
    AND  (  1 = 1  )
    AND sessions.referrer_id = recordings_urls.id
    AND  (exists(select id from acc_{account_id}.urls where url  ILIKE '%enterprise_customer.com/jobs%'))
    AND r_time > to_timestamp(1547585600)
    AND r_time < to_timestamp(1549177599)
    AND recording_data.duration >=5
    AND recording_data.num_of_pages > 0 ;
 count
-------
 32519
(1 row)
Time: 1636.637 ms

Well, yes. The subquery, wrapped in EXISTS, makes everything super fast. Which leads to the next logical question: why are the query with the JOINs and the subquery each fast individually, but terribly slow together?

  • Move the subquery into a CTE: if the subquery is fast on its own, we can simply compute the fast result first and then hand it to the main query.

WITH matching_urls AS (
    select id::text from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%'
)

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions,
    matching_urls
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id
    AND (urls && array(SELECT id from matching_urls)::text[])
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545107599)
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0;

But it was still very slow.

We find the culprit

All this time, one small detail kept flashing before my eyes, and I kept brushing it aside. Since there was nothing else left, I decided to take a look at it too. I'm talking about the && operator. While EXISTS merely improved performance, && was the only common factor remaining in every version of the slow query.

Looking at the documentation, we see that && is the array overlap operator: it is used to find out whether two arrays have any elements in common.
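
A minimal illustration with made-up values:

SELECT ARRAY[1, 2, 3] && ARRAY[3, 4, 5];  -- true: both arrays contain 3
SELECT ARRAY[1, 2] && ARRAY[3, 4];        -- false: no elements in common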

In the original query, that's this condition:

AND  (  urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[]   )

Which means: do a wildcard search over all of our URLs, then intersect the result with the URLs of the recordings. The naming is a bit confusing: urls here does not refer to the table containing all the URLs, but to the urls column of the recording_data table.

With growing suspicion of &&, I looked for confirmation in the query plan generated by EXPLAIN ANALYZE (I already had a saved plan, but I'm usually more comfortable experimenting in SQL than trying to decipher the opaque output of query planners).

Filter: ((urls && ($0)::text[]) AND (r_time > '2018-12-17 12:17:23+00'::timestamp with time zone) AND (r_time < '2018-12-18 23:59:59+00'::timestamp with time zone) AND (duration >= '5'::double precision) AND (num_of_pages > 0))
                           Rows Removed by Filter: 52710

The plan contained several filter lines with && alone, which meant this operation was not only expensive but also executed multiple times.

I tested this by isolating the condition:

SELECT 1
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data_30 as recording_data_30, 
    acc_{account_id}.sessions_30 as sessions_30 
WHERE 
	urls &&  array(select id from acc_{account_id}.urls where url  ILIKE  '%enterprise_customer.com/jobs%')::text[];

This query was slow. Since the JOINs are fast and the subqueries are fast, the only thing left was the && operator.

And that is indeed the key operation: we always have to scan the entire main urls table to search for the pattern, and we always have to compute the intersection. We can't search the recording URLs directly, because the urls column holds only ids referring to urls.

On the way to a solution

&& is slow because both sets are huge. The operation would be relatively fast if I replaced urls with, say, { "http://google.com/", "http://wingify.com/" }.
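
You can get a feel for this with synthetic data (a toy illustration unrelated to our schema): the overlap check effectively compares elements pairwise, so two large arrays are expensive, while a tiny array on one side is near-instant:

-- slow: both arrays have ~20k elements and only overlap near the end of the first
SELECT (SELECT array_agg(g) FROM generate_series(1, 20000) AS g)
    && (SELECT array_agg(g) FROM generate_series(19999, 40000) AS g);

-- fast: one side is a two-element array
SELECT (SELECT array_agg(g) FROM generate_series(1, 20000) AS g)
    && ARRAY[19999, 20000];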

I started looking for a way to do a set intersection in Postgres without using &&, but without much success.

In the end, we decided to solve the problem in isolation: give me all the urls rows whose url matches the pattern. Without the additional conditions, it is:

SELECT urls.url
FROM 
	acc_{account_id}.urls as urls,
	(SELECT unnest(recording_data.urls) AS id) AS unrolled_urls  -- recording_data comes from the outer query's context
WHERE
	urls.id = unrolled_urls.id AND
	urls.url  ILIKE  '%jobs%'

Instead of JOIN syntax, I simply used a subquery and unnested the recording_data.urls array so that the condition can be applied directly in WHERE.
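
For reference, unnest() expands an array into a set of rows, which is what makes a row-level WHERE condition possible. A toy example with made-up values:

SELECT unnest(ARRAY['id_1', 'id_2', 'id_3']) AS id;
  id
------
 id_1
 id_2
 id_3
(3 rows)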

The key point here is that && is used to check whether a given recording contains a matching URL. Squinting a little, you can see this operation as stepping through the elements of an array (or the rows of a table) and stopping as soon as the condition (a match) is met. Remind you of anything? Yep, EXISTS.

Since recording_data.urls can be referenced from outside the subquery's context, we can fall back on our old friend EXISTS and wrap the subquery in it.

Putting it all together, we get the final optimized query:

SELECT 
    count(*) 
FROM 
    acc_{account_id}.urls as recordings_urls, 
    acc_{account_id}.recording_data as recording_data, 
    acc_{account_id}.sessions as sessions 
WHERE 
    recording_data.usp_id = sessions.usp_id 
    AND  (  1 = 1  )  
    AND sessions.referrer_id = recordings_urls.id 
    AND r_time > to_timestamp(1542585600) 
    AND r_time < to_timestamp(1545177599) 
    AND recording_data.duration >=5 
    AND recording_data.num_of_pages > 0
    AND EXISTS(
        SELECT urls.url
        FROM 
            acc_{account_id}.urls as urls,
            unnest(recording_data.urls) AS unrolled_urls(rec_url_id)
        WHERE
            urls.id = unrolled_urls.rec_url_id AND
            urls.url  ILIKE  '%enterprise_customer.com/jobs%'
    );

And the final run time:

Time: 1898.717 ms

Time to celebrate?!

Not so fast! First we had to verify correctness. I was extremely suspicious of the EXISTS optimization, since it changes the logic to terminate early. We had to be sure we hadn't introduced a subtle error into the query.

A simple check was to run count(*) for both the slow and the fast query over a large number of different data sets. Then, for a small subset of the data, I manually verified that all the results were correct.

All the tests came back consistently positive. We had fixed it!

Lessons Learned

There are many lessons to be learned from this story:

  1. Query plans don't tell the whole story, but they can give hints.
  2. The prime suspects are not always the real culprits.
  3. Slow queries can be broken down to isolate the bottlenecks.
  4. Not all optimizations are reductive in nature.
  5. Using EXISTS, where possible, can lead to dramatic performance gains.

Conclusion

We went from a query time of ~24 minutes down to 2 seconds, a pretty serious performance boost! Although this article is long, all the experiments happened in a single day, and by our estimates took 1.5 to 2 hours of optimization and testing.

SQL is a wonderful language if you don't fear it, but try to learn and use it. With a good understanding of how SQL queries are executed, how the database builds query plans, how indexes work, and simply the size of the data you're dealing with, you can be very successful at query optimization. Just as important, though, is to keep trying different approaches and slowly break the problem down to find the bottlenecks.

The best part about achieving these results is the visible speed improvement: a report that previously wouldn't even load now loads almost instantly.

Special thanks to my teammates Aditya Mishra, Aditya Gaur and Varun Malhotra for the brainstorming, and to Dinkar Pandir for finding an important error in our final query before we finally said goodbye to it!

Source: habr.com
