Using all the features of indexes in PostgreSQL

In the Postgres world, indexes are essential for efficient navigation of a table's storage (called the "heap"). Postgres does not maintain any clustering of the heap, and its MVCC architecture means that many versions of the same tuple can accumulate there. Being able to create and maintain efficient indexes to support applications is therefore very important.

Here are some tips for optimizing and improving the use of indexes.

Note: the queries shown below work on an unmodified pagila sample database.

Using Covering Indexes

Let's look at a query that extracts the email addresses of inactive customers. The table customer has a column active, and the query is simple:

pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                        QUERY PLAN
-----------------------------------------------------------
 Seq Scan on customer  (cost=0.00..16.49 rows=15 width=32)
   Filter: (active = 0)
(2 rows)

The query performs a full sequential scan of the table customer. Let's create an index on the column active:

pagila=# CREATE INDEX idx_cust1 ON customer(active);
CREATE INDEX
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using idx_cust1 on customer  (cost=0.28..12.29 rows=15 width=32)
   Index Cond: (active = 0)
(2 rows)

That helped: the sequential scan turned into an index scan. This means Postgres will scan the index idx_cust1 and then look up the table heap to read the values of the other columns the query needs (in this case, the column email).

Covering indexes were introduced in PostgreSQL 11. They allow you to include one or more additional columns in the index itself: their values are stored in the index along with the key columns.

If we use this feature to store the email value inside the index, Postgres will not need to look up the table's heap for the value of email. Let's see if this works:

pagila=# CREATE INDEX idx_cust2 ON customer(active) INCLUDE (email);
CREATE INDEX
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Index Only Scan using idx_cust2 on customer  (cost=0.28..12.29 rows=15 width=32)
   Index Cond: (active = 0)
(2 rows)

"Index Only Scan" tells us that the query now needs only the index, which avoids all the disk I/O required to read the table heap.

Covering indexes are currently available only for B-trees. Keep in mind, however, that the maintenance effort for such an index will be higher.
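One way to gauge that extra cost is to compare the on-disk size of the two indexes with pg_relation_size (a small sketch, using the idx_cust1 and idx_cust2 indexes created above):

```sql
-- Compare the on-disk size of the plain index and the covering index.
SELECT pg_size_pretty(pg_relation_size('idx_cust1')) AS plain_index,
       pg_size_pretty(pg_relation_size('idx_cust2')) AS covering_index;
```

The covering index will be larger, since every entry also carries the included email value.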

Using Partial Indexes

Partial indexes index only a subset of the rows in a table. This keeps the index small and makes scans over it faster.

Let's say we want a list of the email addresses of our customers in California. The query looks like this:

SELECT c.email FROM customer c
JOIN address a ON c.address_id = a.address_id
WHERE a.district = 'California';

which has a query plan that involves scanning both of the joined tables:

pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=15.65..32.22 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=15.54..15.54 rows=9 width=4)
         ->  Seq Scan on address a  (cost=0.00..15.54 rows=9 width=4)
               Filter: (district = 'California'::text)
(6 rows)

Let's see what an ordinary index gives us:

pagila=# CREATE INDEX idx_address1 ON address(district);
CREATE INDEX
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Hash Join  (cost=12.98..29.55 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=12.87..12.87 rows=9 width=4)
         ->  Bitmap Heap Scan on address a  (cost=4.34..12.87 rows=9 width=4)
               Recheck Cond: (district = 'California'::text)
               ->  Bitmap Index Scan on idx_address1  (cost=0.00..4.34 rows=9 width=0)
                     Index Cond: (district = 'California'::text)
(8 rows)

The sequential scan of address has been replaced by a scan of the index idx_address1, followed by a scan of the address heap.

Since this is a frequent query that we want to optimize, we can use a partial index, which indexes only those address rows whose district is 'California':

pagila=# CREATE INDEX idx_address2 ON address(address_id) WHERE district='California';
CREATE INDEX
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Join  (cost=12.38..28.96 rows=9 width=32)
   Hash Cond: (c.address_id = a.address_id)
   ->  Seq Scan on customer c  (cost=0.00..14.99 rows=599 width=34)
   ->  Hash  (cost=12.27..12.27 rows=9 width=4)
         ->  Index Only Scan using idx_address2 on address a  (cost=0.14..12.27 rows=9 width=4)
(5 rows)

Now the query only reads idx_address2 and does not touch the table address.

Using Multi-Value Indexes

Some columns you want to index may not hold a scalar data type. Column types like jsonb, arrays, and tsvector hold composite or multiple values. If you need to index such columns, you usually have to search through all of the individual values stored in them.

Let's try to find the titles of all films that include behind-the-scenes footage. The table film has a text-array column called special_features. If a film has this special feature, the array contains the element Behind The Scenes. To find all such films, we need to select every row where "Behind The Scenes" occurs among the values of the special_features array:

SELECT title FROM film WHERE special_features @> '{"Behind The Scenes"}';

The containment operator @> checks whether the right-hand side is a subset of the left-hand side.
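You can see the operator's semantics on plain array literals, independent of any table:

```sql
-- The right-hand array must be fully contained in the left-hand one.
SELECT ARRAY['a','b','c'] @> ARRAY['a'];      -- true
SELECT ARRAY['a','b','c'] @> ARRAY['a','x'];  -- false: 'x' is missing
```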

The query plan:

pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on film  (cost=0.00..67.50 rows=5 width=15)
   Filter: (special_features @> '{"Behind The Scenes"}'::text[])
(2 rows)

This does a full heap scan, with a cost of 67.

Let's see if a regular B-tree index helps us:

pagila=# CREATE INDEX idx_film1 ON film(special_features);
CREATE INDEX
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on film  (cost=0.00..67.50 rows=5 width=15)
   Filter: (special_features @> '{"Behind The Scenes"}'::text[])
(2 rows)

The index was not even considered: a B-tree index knows nothing about the individual elements inside the values it indexes.

What we need here is a GIN index:

pagila=# CREATE INDEX idx_film2 ON film USING GIN(special_features);
CREATE INDEX
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
                                QUERY PLAN
---------------------------------------------------------------------------
 Bitmap Heap Scan on film  (cost=8.04..23.58 rows=5 width=15)
   Recheck Cond: (special_features @> '{"Behind The Scenes"}'::text[])
   ->  Bitmap Index Scan on idx_film2  (cost=0.00..8.04 rows=5 width=0)
         Index Cond: (special_features @> '{"Behind The Scenes"}'::text[])
(4 rows)

The GIN index can match individual values against the indexed composite values, which here cuts the query plan's cost by more than half.
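GIN indexes work the same way for jsonb columns. A hypothetical sketch (the events table and payload column are illustrative, not part of pagila):

```sql
-- Hypothetical example: index a jsonb column and query it with @>.
CREATE TABLE events (id int, payload jsonb);
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- Can use the GIN index to find rows whose payload contains this key/value:
SELECT id FROM events WHERE payload @> '{"type": "login"}';
```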

Getting rid of duplicate indexes

Indexes accumulate over time, and sometimes a new index ends up with the same definition as an existing one. You can use the catalog view pg_indexes to get human-readable SQL definitions of indexes; it also makes it easy to find identical definitions:

 SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
    FROM pg_indexes
GROUP BY defn
  HAVING count(*) > 1;

And here's the result when run on the stock pagila database:

pagila=#   SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
pagila-#     FROM pg_indexes
pagila-# GROUP BY defn
pagila-#   HAVING count(*) > 1;
                                indexes                                 |                                defn
------------------------------------------------------------------------+------------------------------------------------------------------
 {payment_p2017_01_customer_id_idx,idx_fk_payment_p2017_01_customer_id} | CREATE INDEX  ON public.payment_p2017_01 USING btree (customer_id
 {payment_p2017_02_customer_id_idx,idx_fk_payment_p2017_02_customer_id} | CREATE INDEX  ON public.payment_p2017_02 USING btree (customer_id
 {payment_p2017_03_customer_id_idx,idx_fk_payment_p2017_03_customer_id} | CREATE INDEX  ON public.payment_p2017_03 USING btree (customer_id
 {idx_fk_payment_p2017_04_customer_id,payment_p2017_04_customer_id_idx} | CREATE INDEX  ON public.payment_p2017_04 USING btree (customer_id
 {payment_p2017_05_customer_id_idx,idx_fk_payment_p2017_05_customer_id} | CREATE INDEX  ON public.payment_p2017_05 USING btree (customer_id
 {idx_fk_payment_p2017_06_customer_id,payment_p2017_06_customer_id_idx} | CREATE INDEX  ON public.payment_p2017_06 USING btree (customer_id
(6 rows)
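Once you have confirmed that two indexes are truly identical, and that the one you drop does not back a constraint, one of them can simply be removed, for example:

```sql
-- Drop one of the duplicates found above; CONCURRENTLY avoids
-- blocking writes on the partition while the index is removed.
DROP INDEX CONCURRENTLY payment_p2017_01_customer_id_idx;
```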

Superset Indexes

It can happen that you accumulate many indexes, one of which indexes a superset of the columns indexed by other indexes. This may or may not be desirable: the superset index can enable index-only scans, which is good, but it may take up too much space, or the query it was meant to optimize may no longer be used.

If you need to automate the detection of such indexes, the table pg_index in the pg_catalog schema is a good starting point.
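A starting point might look like this sketch, which treats an index whose key columns are a leading prefix of another index's key columns as a candidate subset. It ignores expression indexes, partial-index predicates, and uniqueness, so treat the output only as hints for manual review:

```sql
-- Sketch: find pairs of indexes on the same table where one key list is
-- a strict leading prefix of the other. indkey is an int2vector of
-- column numbers whose text form is space-separated (e.g. '2 5').
SELECT i1.indexrelid::regclass AS subset_index,
       i2.indexrelid::regclass AS superset_index
FROM   pg_index i1
JOIN   pg_index i2 ON i1.indrelid = i2.indrelid
                  AND i1.indexrelid <> i2.indexrelid
WHERE  i2.indkey::text LIKE i1.indkey::text || ' %';
```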

Unused indexes

As the applications that use a database evolve, so do their queries, and indexes added earlier may no longer be used by any query. Each time an index is scanned, the statistics collector records it, and the system catalog view pg_stat_user_indexes exposes this as idx_scan, a cumulative counter. Tracking this value over a period of time (say, a month) gives a good idea of which indexes are unused and could be dropped.

Here is a query to get the current scan counts of all indexes in the schema 'public':

SELECT relname, indexrelname, idx_scan
FROM   pg_catalog.pg_stat_user_indexes
WHERE  schemaname = 'public';

with output like this:

pagila=# SELECT relname, indexrelname, idx_scan
pagila-# FROM   pg_catalog.pg_stat_user_indexes
pagila-# WHERE  schemaname = 'public'
pagila-# LIMIT  10;
    relname    |    indexrelname    | idx_scan
---------------+--------------------+----------
 customer      | customer_pkey      |    32093
 actor         | actor_pkey         |     5462
 address       | address_pkey       |      660
 category      | category_pkey      |     1000
 city          | city_pkey          |      609
 country       | country_pkey       |      604
 film_actor    | film_actor_pkey    |        0
 film_category | film_category_pkey |        0
 film          | film_pkey          |    11043
 inventory     | inventory_pkey     |    16048
(10 rows)
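If an index's idx_scan count stays at zero over a long observation window, it is likely safe to drop, after checking that it does not enforce a constraint. For instance, film_actor_pkey above shows zero scans but backs a primary key and must be kept; a plain unused index could be removed like this (the index name here is hypothetical):

```sql
-- Remove a confirmed-unused index without blocking writes.
DROP INDEX CONCURRENTLY idx_no_longer_used;  -- hypothetical name
```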

Rebuilding indexes with fewer locks

Indexes sometimes need to be rebuilt: they can become bloated, for example, and rebuilding can speed up scans. Indexes can also become corrupted, and changing an index's parameters may require rebuilding it too.
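On PostgreSQL 12 and later, an index can be rebuilt without blocking writes using REINDEX ... CONCURRENTLY; on earlier versions, the usual workaround is to build a replacement concurrently and swap it in. A sketch, using the idx_address1 index from earlier:

```sql
-- PostgreSQL 12+: rebuild in place without blocking writes.
REINDEX INDEX CONCURRENTLY idx_address1;

-- Pre-12 workaround (sketch): build a copy, drop the old one, rename.
CREATE INDEX CONCURRENTLY idx_address1_new ON address(district);
DROP INDEX CONCURRENTLY idx_address1;
ALTER INDEX idx_address1_new RENAME TO idx_address1;
```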

Enable parallel index creation

PostgreSQL 11 introduced parallel B-tree index builds: several parallel workers can be used to speed up index creation. Make sure, however, that these configuration parameters are set appropriately:

SET max_parallel_workers = 32;
SET max_parallel_maintenance_workers = 16;

The default values are rather small; ideally, these numbers should grow with the number of processor cores. Read more in the documentation.
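The degree of parallelism can also be influenced per table via the parallel_workers storage parameter, which index builds take into account. A sketch:

```sql
-- Let builds on this table use up to 8 workers, then reset afterwards.
ALTER TABLE film SET (parallel_workers = 8);
-- ... run CREATE INDEX / REINDEX here ...
ALTER TABLE film RESET (parallel_workers);
```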

Background index creation

You can create an index in the background using the CONCURRENTLY option of the CREATE INDEX command:

pagila=# CREATE INDEX CONCURRENTLY idx_address1 ON address(district);
CREATE INDEX

This index build differs from a regular one in that it does not take a lock that blocks writes to the table. On the other hand, it takes longer and consumes more resources.
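One caveat: if a concurrent build fails or is cancelled, it leaves behind an INVALID index that still takes space and slows writes. Such leftovers can be found and removed like this:

```sql
-- List invalid indexes left behind by failed concurrent builds.
SELECT indexrelid::regclass
FROM   pg_index
WHERE  NOT indisvalid;

-- Then drop the leftover (use whatever name the query above reports).
DROP INDEX CONCURRENTLY some_invalid_index;  -- hypothetical name
```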

Postgres offers a great deal of flexibility for creating indexes, handling special cases, and managing the database as your application grows explosively. We hope these tips help you keep your queries fast and your database ready to scale.

Source: habr.com
