In the Postgres world, indexes are essential for efficiently navigating a table's data storage (called the "heap"). Postgres has no support for clustered tables, and its MVCC architecture means the heap accumulates multiple versions of the same tuple. That makes the ability to create and maintain efficient indexes essential for supporting applications.
Here are some tips for optimizing and improving the use of indexes.
Note: the queries shown below were run on an unmodified pagila sample database.
Using Covering Indexes
Let's look at a query that extracts the email addresses of inactive users. The customer table has a column active, and the query is simple:
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
QUERY PLAN
-----------------------------------------------------------
Seq Scan on customer (cost=0.00..16.49 rows=15 width=32)
Filter: (active = 0)
(2 rows)
The query performs a sequential scan of the whole customer table. Let's create an index on the active column:
pagila=# CREATE INDEX idx_cust1 ON customer(active);
CREATE INDEX
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using idx_cust1 on customer (cost=0.28..12.29 rows=15 width=32)
Index Cond: (active = 0)
(2 rows)
That helped: the sequential scan turned into an index scan. This means Postgres will scan the index idx_cust1 and then continue to the table's heap to read the values of the other columns the query needs (in this case, the email column).
Covering indexes were introduced in PostgreSQL 11. They allow one or more additional columns to be included in the index itself, with their values stored alongside the index data.
If we used this feature to store the email value inside the index, Postgres would not need to look up email in the table's heap. Let's see if this works:
pagila=# CREATE INDEX idx_cust2 ON customer(active) INCLUDE (email);
CREATE INDEX
pagila=# EXPLAIN SELECT email FROM customer WHERE active=0;
QUERY PLAN
----------------------------------------------------------------------------------
Index Only Scan using idx_cust2 on customer (cost=0.28..12.29 rows=15 width=32)
Index Cond: (active = 0)
(2 rows)
"Index Only Scan" tells us that the query is now satisfied by the index alone, which avoids all disk I/O for reading the table's heap.
Covering indexes are currently available only for B-tree indexes, and they do make the index larger and more expensive to maintain.
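One caveat worth knowing: an Index Only Scan can still visit the heap for pages that are not marked all-visible, so its benefit depends on VACUUM keeping the visibility map current. EXPLAIN ANALYZE reports these visits as "Heap Fetches". A quick way to check, assuming the same pagila database:

VACUUM ANALYZE customer;  -- refresh the visibility map and statistics
EXPLAIN (ANALYZE, BUFFERS) SELECT email FROM customer WHERE active = 0;

If "Heap Fetches" stays at or near zero, the query really is served from the index alone.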
Using Partial Indexes
Partial indexes index only a subset of a table's rows. This keeps indexes smaller and makes scans over them faster.
Let's say we want a list of the email addresses of our customers in California. The query looks like this:
SELECT c.email FROM customer c
JOIN address a ON c.address_id = a.address_id
WHERE a.district = 'California';
which has a query plan that involves scanning both the tables that are joined:
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
QUERY PLAN
----------------------------------------------------------------------
Hash Join (cost=15.65..32.22 rows=9 width=32)
Hash Cond: (c.address_id = a.address_id)
-> Seq Scan on customer c (cost=0.00..14.99 rows=599 width=34)
-> Hash (cost=15.54..15.54 rows=9 width=4)
-> Seq Scan on address a (cost=0.00..15.54 rows=9 width=4)
Filter: (district = 'California'::text)
(6 rows)
Let's see what an ordinary index gives us:
pagila=# CREATE INDEX idx_address1 ON address(district);
CREATE INDEX
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
QUERY PLAN
---------------------------------------------------------------------------------------
Hash Join (cost=12.98..29.55 rows=9 width=32)
Hash Cond: (c.address_id = a.address_id)
-> Seq Scan on customer c (cost=0.00..14.99 rows=599 width=34)
-> Hash (cost=12.87..12.87 rows=9 width=4)
-> Bitmap Heap Scan on address a (cost=4.34..12.87 rows=9 width=4)
Recheck Cond: (district = 'California'::text)
-> Bitmap Index Scan on idx_address1 (cost=0.00..4.34 rows=9 width=0)
Index Cond: (district = 'California'::text)
(8 rows)
The sequential scan of address has been replaced by a scan of the index idx_address1 followed by reads of the address heap.
Since this is a frequent query that needs optimizing, we can use a partial index, which indexes only the rows whose address has the district "California":
pagila=# CREATE INDEX idx_address2 ON address(address_id) WHERE district='California';
CREATE INDEX
pagila=# EXPLAIN SELECT c.email FROM customer c
pagila-# JOIN address a ON c.address_id = a.address_id
pagila-# WHERE a.district = 'California';
QUERY PLAN
------------------------------------------------------------------------------------------------
Hash Join (cost=12.38..28.96 rows=9 width=32)
Hash Cond: (c.address_id = a.address_id)
-> Seq Scan on customer c (cost=0.00..14.99 rows=599 width=34)
-> Hash (cost=12.27..12.27 rows=9 width=4)
-> Index Only Scan using idx_address2 on address a (cost=0.14..12.27 rows=9 width=4)
(5 rows)
Now the query reads only idx_address2 and does not touch the address table at all.
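The two techniques can also be combined. As a sketch (the index name here is hypothetical), a partial covering index would let the inactive-customers query from the first section run as an index-only scan over a very small index:

CREATE INDEX idx_cust_inactive ON customer(active) INCLUDE (email) WHERE active = 0;

The index stores only the inactive rows, and the included email column means the heap never has to be consulted for visible rows.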
Using Multi-Value Indexes
Some columns that need indexing do not hold a scalar data type. Column types like jsonb, arrays, and tsvector hold composite or multiple values, and indexing such columns usually means being able to search through the individual values inside them.
Let's try to find the titles of all films that include behind-the-scenes footage. The film table has a text-array column called special_features. If a film has this "special feature", the array contains the element Behind The Scenes. To find all such films, we need to select every row whose special_features array contains "Behind The Scenes":
SELECT title FROM film WHERE special_features @> '{"Behind The Scenes"}';
The containment operator @> checks whether the right-hand side is a subset of the left-hand side.
The query plan:
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on film (cost=0.00..67.50 rows=5 width=15)
Filter: (special_features @> '{"Behind The Scenes"}'::text[])
(2 rows)
which calls for a full scan of the heap at a cost of 67.
Let's see if a regular B-tree index helps us:
pagila=# CREATE INDEX idx_film1 ON film(special_features);
CREATE INDEX
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on film (cost=0.00..67.50 rows=5 width=15)
Filter: (special_features @> '{"Behind The Scenes"}'::text[])
(2 rows)
The index was not even considered: a B-tree index is unaware of the individual elements inside the values it indexes.
We need a GIN index.
pagila=# CREATE INDEX idx_film2 ON film USING GIN(special_features);
CREATE INDEX
pagila=# EXPLAIN SELECT title FROM film
pagila-# WHERE special_features @> '{"Behind The Scenes"}';
QUERY PLAN
---------------------------------------------------------------------------
Bitmap Heap Scan on film (cost=8.04..23.58 rows=5 width=15)
Recheck Cond: (special_features @> '{"Behind The Scenes"}'::text[])
-> Bitmap Index Scan on idx_film2 (cost=0.00..8.04 rows=5 width=0)
Index Cond: (special_features @> '{"Behind The Scenes"}'::text[])
(4 rows)
The GIN index supports matching single values against the indexed composite values, and the resulting query plan costs less than half of the original.
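The same approach works for jsonb columns. Here is a sketch with a hypothetical events table (not part of pagila); the jsonb_path_ops operator class builds a smaller index than the default, at the cost of supporting only containment (@>) queries:

CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
CREATE INDEX idx_events_payload ON events USING GIN (payload jsonb_path_ops);
-- finds rows whose payload contains the given key/value pair
SELECT id FROM events WHERE payload @> '{"type": "login"}';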
Getting rid of duplicate indexes
Indexes accumulate over time, and sometimes a new index ends up with the same definition as an existing one. The catalog view pg_indexes provides human-readable SQL definitions of indexes, and it also makes identical definitions easy to spot:
SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
FROM pg_indexes
GROUP BY defn
HAVING count(*) > 1;
And here's the result when run on the stock pagila database:
pagila=# SELECT array_agg(indexname) AS indexes, replace(indexdef, indexname, '') AS defn
pagila-# FROM pg_indexes
pagila-# GROUP BY defn
pagila-# HAVING count(*) > 1;
indexes | defn
------------------------------------------------------------------------+------------------------------------------------------------------
{payment_p2017_01_customer_id_idx,idx_fk_payment_p2017_01_customer_id} | CREATE INDEX ON public.payment_p2017_01 USING btree (customer_id
{payment_p2017_02_customer_id_idx,idx_fk_payment_p2017_02_customer_id} | CREATE INDEX ON public.payment_p2017_02 USING btree (customer_id
{payment_p2017_03_customer_id_idx,idx_fk_payment_p2017_03_customer_id} | CREATE INDEX ON public.payment_p2017_03 USING btree (customer_id
{idx_fk_payment_p2017_04_customer_id,payment_p2017_04_customer_id_idx} | CREATE INDEX ON public.payment_p2017_04 USING btree (customer_id
{payment_p2017_05_customer_id_idx,idx_fk_payment_p2017_05_customer_id} | CREATE INDEX ON public.payment_p2017_05 USING btree (customer_id
{idx_fk_payment_p2017_06_customer_id,payment_p2017_06_customer_id_idx} | CREATE INDEX ON public.payment_p2017_06 USING btree (customer_id
(6 rows)
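Once duplicates are confirmed, one index of each pair can be dropped. A sketch for the first pair in the output above; before dropping, verify that the index does not back a constraint (check pg_constraint) and is not attached to a partitioned parent index:

DROP INDEX CONCURRENTLY payment_p2017_01_customer_id_idx;

Using CONCURRENTLY avoids blocking writes on the table while the index is removed.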
Superset Indexes
You can end up with many indexes, one of which indexes a superset of the columns that other indexes cover. This may or may not be desirable: the superset index can enable index-only scans, which is good, but it may take up too much space, or the query it was meant to optimize may no longer be in use.
If you need to automate the detection of such indexes, the pg_catalog tables are a good starting point.
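One possible starting point is pg_index, whose indkey column lists the indexed columns of each index. The sketch below reports indexes whose key columns are a leading prefix of another index's key columns on the same table; treat the results only as candidates for review:

SELECT ai.indexrelid::regclass AS superset_index,
       bi.indexrelid::regclass AS candidate_subset
FROM pg_index ai
JOIN pg_index bi
  ON ai.indrelid = bi.indrelid
 AND ai.indexrelid <> bi.indexrelid
-- is bi's column list a leading prefix of ai's?
WHERE (string_to_array(ai.indkey::text, ' '))[1:bi.indnatts]
      = string_to_array(bi.indkey::text, ' ');

This is deliberately simplified: it ignores index types, uniqueness, partial predicates, expression indexes, and INCLUDE columns, all of which should be checked before dropping anything.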
Unused indexes
As the applications that use a database evolve, so do their queries, and indexes added earlier may no longer be used by any query. Each time an index is scanned, the statistics manager notes it, and the system catalog view pg_stat_user_indexes exposes this as idx_scan, a cumulative counter. Tracking this value over a period of time (say, a month) gives a good idea of which indexes are unused and could be dropped.
Here is a query to get the current scan counts of all indexes in the "public" schema:
SELECT relname, indexrelname, idx_scan
FROM pg_catalog.pg_stat_user_indexes
WHERE schemaname = 'public';
with output like this:
pagila=# SELECT relname, indexrelname, idx_scan
pagila-# FROM pg_catalog.pg_stat_user_indexes
pagila-# WHERE schemaname = 'public'
pagila-# LIMIT 10;
relname | indexrelname | idx_scan
---------------+--------------------+----------
customer | customer_pkey | 32093
actor | actor_pkey | 5462
address | address_pkey | 660
category | category_pkey | 1000
city | city_pkey | 609
country | country_pkey | 604
film_actor | film_actor_pkey | 0
film_category | film_category_pkey | 0
film | film_pkey | 11043
inventory | inventory_pkey | 16048
(10 rows)
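After watching idx_scan over a long enough window, unused candidates can be dropped, with one caveat: indexes that back primary keys or unique constraints (such as film_actor_pkey above, despite its zero count) must stay, since they enforce the constraint. A sketch with a hypothetical index name:

-- idx_some_unused is a placeholder for an index whose idx_scan stayed at 0
DROP INDEX CONCURRENTLY idx_some_unused;

Note that idx_scan is cumulative, so a zero only means "never used since statistics were last reset", not "never useful".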
Rebuilding indexes with fewer locks
Indexes sometimes need to be rebuilt: they can become bloated, and rebuilding an index can speed up scans over it. Indexes can also become corrupted, and changing an index's parameters may require a rebuild.
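In PostgreSQL 12 and later, REINDEX INDEX CONCURRENTLY rebuilds an index in one step without blocking writes. On earlier versions, the usual low-lock pattern is to build a replacement concurrently and swap it in. A sketch, reusing the idx_address1 index from earlier:

CREATE INDEX CONCURRENTLY idx_address1_new ON address(district);
DROP INDEX CONCURRENTLY idx_address1;
ALTER INDEX idx_address1_new RENAME TO idx_address1;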
Enable parallel index creation
PostgreSQL 11 added parallel B-tree index builds: several parallel workers can be used to speed up index creation. Make sure these configuration options are set appropriately:
SET max_parallel_workers = 32;
SET max_parallel_maintenance_workers = 16;
The default values are quite small; ideally, these numbers should grow with the number of processor cores.
Background index creation
You can create an index in the background using the CONCURRENTLY option of the CREATE INDEX command:
pagila=# CREATE INDEX CONCURRENTLY idx_address1 ON address(district);
CREATE INDEX
This way of building an index differs from the regular one in that it does not take a lock that blocks writes on the table. On the other hand, it takes longer and consumes more resources.
Postgres offers a lot of flexibility in creating indexes, ways to handle special cases, and ways to manage your database when your application grows explosively. We hope these tips help you keep your queries fast and your database ready to scale.
Source: habr.com