Happy Party or a couple of lines-memories of acquaintance with partitioning in PostgreSQL10

Preface or how the idea of ​​sectioning came about

The story starts here: Do you remember how it all began. Everything was for the first time and again. After almost all the resources for optimizing the query, at that time, were exhausted, the question arose - what's next? Thus, the idea of ​​partitioning was born.

Happy Party or a couple of lines-memories of acquaintance with partitioning in PostgreSQL10

Lyrical digression:
Precisely 'at that moment', because as it turned out, there were untapped reserves of optimization. Thanks asmm and habru!

So, how else can you make the customer as if happy, at the same time pump your own skills?

If you simplify everything as much as possible, then, there are only two ways to radically improve something in the speed of the database:
1) Extensive way - we increase resources, change the configuration;
2) Intensive way - query optimization

Since, I repeat, at that time it was no longer clear what else to change in the request to speed up, the path was chosen - table design changes.

So - the main question arises - what and how will we change?

Initial conditions

First, there is such an ERD (shown conditionally simplified):
Happy Party or a couple of lines-memories of acquaintance with partitioning in PostgreSQL10
Main Features:

  1. many-to-many relationships
  2. table already has a potential partition key

Original request:

SELECT
            p."PARAMETER_ID" as  parameter_id,
            pc."PC_NAME" AS pc_name,
            pc."CUSTOMER_PARTNUMBER" AS customer_partnumber,
            w."LASERMARK" AS lasermark,
            w."LOTID" AS lotid,
            w."REPORTED_VALUE" AS reported_value,
            w."LOWER_SPEC_LIMIT" AS lower_spec_limit,
            w."UPPER_SPEC_LIMIT" AS upper_spec_limit,
            p."TYPE_CALCUL" AS type_calcul,
            s."SHIPMENT_NAME" AS shipment_name,
            s."SHIPMENT_DATE" AS shipment_date,
            extract(year from s."SHIPMENT_DATE") AS year,
            extract(month from s."SHIPMENT_DATE") as month,
            s."REPORT_NAME" AS report_name,
            p."SPARAM_NAME" AS SPARAM_name,
            p."CUSTOMERPARAM_NAME" AS customerparam_name
        FROM data w INNER JOIN shipment s ON s."SHIPMENT_ID" = w."SHIPMENT_ID"
             INNER JOIN parameters p ON p."PARAMETER_ID" = w."PARAMETER_ID"
             INNER JOIN shipment_pc sp ON s."SHIPMENT_ID" = sp."SHIPMENT_ID"
             INNER JOIN pc pc ON pc."PC_ID" = sp."PC_ID"
             INNER JOIN ( SELECT w2."LASERMARK" , MAX(s2."SHIPMENT_DATE") AS "SHIPMENT_DATE"
                          FROM shipment s2 INNER JOIN data w2 ON s2."SHIPMENT_ID" = w2."SHIPMENT_ID" 
                          GROUP BY w2."LASERMARK"
                         ) md ON md."SHIPMENT_DATE" = s."SHIPMENT_DATE" AND md."LASERMARK" = w."LASERMARK"
        WHERE 
             s."SHIPMENT_DATE" >= '2018-07-01' AND s."SHIPMENT_DATE" <= '2018-09-30' ;

Results of execution on the test database:
Cost : 502 997.55
execution time: 505 seconds.

What do we see? Regular request, by time slice.
We make the simplest logical assumption: if there is a sample of a time slice, will it help us? That's right - partitioning.

What to section?

At first glance, the choice is obvious - declarative partitioning of the "shipment" table by the key "SHIPMENT_DATE" (running a lot ahead - as a result, the production turned out a little wrong).

How to section?

This question is also not too difficult. Fortunately, in PostgreSQL 10, there is now a human partitioning mechanism.
So:

  1. Saving a dump of the original table - pg_dump source_table
  2. Delete the original table - drop table source_table
  3. Create a parent table with range partitioning − create table source_table
  4. Create sections - create table source_table, create index
  5. Import the dump created in step 1 − pg_restore

Partitioning scripts

For simplicity and convenience, steps 2,3,4 have been combined in one script.

So:
Saving a dump of the source table

pg_dump postgres --file=/dump/shipment.dmp --format=c --table=shipment --verbose > /dump/shipment.log 2>&1

Delete original table + Create parent table with range partitioning + Create partitions

--create_partition_shipment.sql
do language plpgsql $$
declare 
rec_shipment_date RECORD ;
partition_name varchar;
index_name varchar;
current_year varchar ;
current_month varchar ;
begin_year varchar ;
begin_month varchar ;
next_year varchar ;
next_month varchar ;
first_flag boolean ;
i integer ;
begin
  RAISE NOTICE 'CREATE TEMPORARY TABLE FOR SHIPMENT_DATE';
  CREATE TEMP TABLE tmp_shipment_date as select distinct "SHIPMENT_DATE" from shipment order by "SHIPMENT_DATE" ;

  RAISE NOTICE 'DROP TABLE shipment';
  drop table shipment cascade ;
  
  CREATE TABLE public.shipment
  (
    "SHIPMENT_ID" integer NOT NULL DEFAULT nextval('shipment_shipment_id_seq'::regclass),
    "SHIPMENT_NAME" character varying(30) COLLATE pg_catalog."default",
    "SHIPMENT_DATE" timestamp without time zone,
    "REPORT_NAME" character varying(40) COLLATE pg_catalog."default"
  )
  PARTITION BY RANGE ("SHIPMENT_DATE")
  WITH (
      OIDS = FALSE
  )
  TABLESPACE pg_default;

  RAISE NOTICE 'CREATE PARTITIONS FOR TABLE shipment';

  current_year:='0';
  current_month:='0';

  begin_year := '0' ;
  begin_month := '0'  ;
  next_year := '0' ;
  next_month := '0'  ;

  FOR rec_shipment_date IN SELECT * FROM tmp_shipment_date LOOP
      
      RAISE NOTICE 'SHIPMENT_DATE=%',rec_shipment_date."SHIPMENT_DATE";
      
      current_year := date_part('year' ,rec_shipment_date."SHIPMENT_DATE");
      current_month := date_part('month' ,rec_shipment_date."SHIPMENT_DATE") ; 

      IF to_number(current_month,'99') < 10 THEN
        current_month := '0'||current_month ; 
      END IF ;

      --Init borders
      IF   begin_year = '0' THEN
       first_flag := true ; --first time flag
       begin_year := current_year ;
       begin_month := current_month ;   
   
        IF current_month = '12' THEN
          next_year := date_part('year' ,rec_shipment_date."SHIPMENT_DATE" + interval '1 year') ;
        ELSE
          next_year := current_year ;
        END IF;
     
       next_month := date_part('month' ,rec_shipment_date."SHIPMENT_DATE" + interval '1 month') ;

      END IF;

      -- Check current date into borders NOT for First time
      IF to_date( current_year||'.'||current_month, 'YYYY.MM') >= to_date( begin_year||'.'||begin_month, 'YYYY.MM') AND 
         to_date( current_year||'.'||current_month, 'YYYY.MM') < to_date( next_year||'.'||next_month, 'YYYY.MM') AND 
         NOT first_flag 
      THEN
         CONTINUE ; 
      ELSE
       --NEW borders only for second and after time 
       begin_year := current_year ;
       begin_month := current_month ;   
   
        IF current_month = '12' THEN
          next_year := date_part('year' ,rec_shipment_date."SHIPMENT_DATE" + interval '1 year') ;
        ELSE
          next_year := current_year ;
        END IF;
     
       next_month := date_part('month' ,rec_shipment_date."SHIPMENT_DATE" + interval '1 month') ;

      END IF;      

      partition_name := 'shipment_shipment_date_'||begin_year||'-'||begin_month||'-01-'|| next_year||'-'||next_month||'-01'  ;
 
     EXECUTE format('CREATE TABLE ' || quote_ident(partition_name) || ' PARTITION OF shipment FOR VALUES FROM ( %L ) TO ( %L )  ' , current_year||'-'||current_month||'-01' , next_year||'-'||next_month||'-01'  ) ; 

      index_name := partition_name||'_shipment_id_idx';
      RAISE NOTICE 'INDEX NAME =%',index_name;
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("SHIPMENT_ID") TABLESPACE pg_default ' ) ; 

      --Drop first time flag
      first_flag := false ;
   
  END LOOP;

end
$$;

Importing a dump

pg_restore -d postgres --data-only --format=c --table=shipment --verbose  shipment.dmp > /tmp/data_dump/shipment_restore.log 2>&1

Checking partitioning results

What do we have as a result? The full text of the execution plan is large and boring, so it is quite possible to limit ourselves to the final figures.

It was

Cost: 502 997.55
execution time: 505 seconds.

It became

Cost: 77 872.36
execution time: 79 seconds.

Quite a good result. Reduced cost and lead time. Thus, the use of partitioning gives the expected effect and, in general, no surprises.

Please the customer

The test results were presented to the customer for consideration. And after getting acquainted, they were given a somewhat unexpected verdict: "Great, partition the table "data"".

Yes, but we have examined a completely different “shipment” table, the “data” table does not have the “SHIPMENT_DATE” field.

Not a problem, add, change. The main thing is that the customer is satisfied with what happens as a result, the details of the implementation are not particularly important.

Partitioning the main table "data"

In general, there were no particular difficulties. Although, the partitioning algorithm, of course, has changed somewhat.

Adding the "SHIPMENT_DATA" column to the "data" table

psql -h хост -U база -d юзер
=> ALTER TABLE data ADD COLUMN "SHIPMENT_DATE" timestamp without time zone ;

Fill in the values ​​of the "SHIPMENT_DATA" column in the "data" table with the values ​​of the column of the same name from the "shipment" table

-----------------------------
--update_data.sql
--updating for altered table "data" to values of "shipment_data" from the table "shipment"
--version 1.0
do language plpgsql $$
declare 
rec_shipment_data RECORD ;
shipment_date timestamp without time zone ; 
row_count integer ;
total_rows integer ;
begin

  select count(*) into total_rows from shipment ; 
  RAISE NOTICE 'Total %',total_rows;
  row_count:= 0 ;

  FOR rec_shipment_data IN SELECT * FROM shipment LOOP

   update data set "SHIPMENT_DATE" = rec_shipment_data."SHIPMENT_DATE" where "SHIPMENT_ID" = rec_shipment_data."SHIPMENT_ID";
   
   row_count:=  row_count +1 ;
   RAISE NOTICE 'row count = % , from %',row_count,total_rows;
  END LOOP;

end
$$;

Saving a dump of the "data" table

pg_dump postgres --file=/dump/data.dmp --format=c --table=data --verbose > /dump/data.log 2>&1</source

Recreate the partitioned table "data"

--create_partition_data.sql
--create partitions for the table "wafer data" by range column "shipment_data" with one month duration
--version 1.0
do language plpgsql $$
declare 
rec_shipment_date RECORD ;
partition_name varchar;
index_name varchar;
current_year varchar ;
current_month varchar ;
begin_year varchar ;
begin_month varchar ;
next_year varchar ;
next_month varchar ;
first_flag boolean ;
i integer ;

begin

  RAISE NOTICE 'CREATE TEMPORARY TABLE FOR SHIPMENT_DATE';
  CREATE TEMP TABLE tmp_shipment_date as select distinct "SHIPMENT_DATE" from shipment order by "SHIPMENT_DATE" ;


  RAISE NOTICE 'DROP TABLE data';
  drop table data cascade ;


  RAISE NOTICE 'CREATE PARTITIONED TABLE data';
  
  CREATE TABLE public.data
  (
    "RUN_ID" integer,
    "LASERMARK" character varying(20) COLLATE pg_catalog."default" NOT NULL,
    "LOTID" character varying(80) COLLATE pg_catalog."default",
    "SHIPMENT_ID" integer NOT NULL,
    "PARAMETER_ID" integer NOT NULL,
    "INTERNAL_VALUE" character varying(75) COLLATE pg_catalog."default",
    "REPORTED_VALUE" character varying(75) COLLATE pg_catalog."default",
    "LOWER_SPEC_LIMIT" numeric,
    "UPPER_SPEC_LIMIT" numeric , 
    "SHIPMENT_DATE" timestamp without time zone
  )
  PARTITION BY RANGE ("SHIPMENT_DATE")
  WITH (
    OIDS = FALSE
  )
  TABLESPACE pg_default ;


  RAISE NOTICE 'CREATE PARTITIONS FOR TABLE data';

  current_year:='0';
  current_month:='0';

  begin_year := '0' ;
  begin_month := '0'  ;
  next_year := '0' ;
  next_month := '0'  ;
  i := 1;

  FOR rec_shipment_date IN SELECT * FROM tmp_shipment_date LOOP
      
      RAISE NOTICE 'SHIPMENT_DATE=%',rec_shipment_date."SHIPMENT_DATE";
      
      current_year := date_part('year' ,rec_shipment_date."SHIPMENT_DATE");
      current_month := date_part('month' ,rec_shipment_date."SHIPMENT_DATE") ; 

      --Init borders
      IF   begin_year = '0' THEN
       RAISE NOTICE '***Init borders';
       first_flag := true ; --first time flag
       begin_year := current_year ;
       begin_month := current_month ;   
   
        IF current_month = '12' THEN
          next_year := date_part('year' ,rec_shipment_date."SHIPMENT_DATE" + interval '1 year') ;
        ELSE
          next_year := current_year ;
        END IF;
     
       next_month := date_part('month' ,rec_shipment_date."SHIPMENT_DATE" + interval '1 month') ;

      END IF;

--      RAISE NOTICE 'current_year=% , current_month=% ',current_year,current_month;
--      RAISE NOTICE 'begin_year=% , begin_month=% ',begin_year,begin_month;
--      RAISE NOTICE 'next_year=% , next_month=% ',next_year,next_month;

      -- Check current date into borders NOT for First time

      RAISE NOTICE 'Current data = %',to_char( to_date( current_year||'.'||current_month, 'YYYY.MM'), 'YYYY.MM');
      RAISE NOTICE 'Begin data = %',to_char( to_date( begin_year||'.'||begin_month, 'YYYY.MM'), 'YYYY.MM');
      RAISE NOTICE 'Next data = %',to_char( to_date( next_year||'.'||next_month, 'YYYY.MM'), 'YYYY.MM');

      IF to_date( current_year||'.'||current_month, 'YYYY.MM') >= to_date( begin_year||'.'||begin_month, 'YYYY.MM') AND 
         to_date( current_year||'.'||current_month, 'YYYY.MM') < to_date( next_year||'.'||next_month, 'YYYY.MM') AND 
         NOT first_flag 
      THEN
         RAISE NOTICE '***CONTINUE';
         CONTINUE ; 
      ELSE
       --NEW borders only for second and after time 
       RAISE NOTICE '***NEW BORDERS';
       begin_year := current_year ;
       begin_month := current_month ;   
   
        IF current_month = '12' THEN
          next_year := date_part('year' ,rec_shipment_date."SHIPMENT_DATE" + interval '1 year') ;
        ELSE
          next_year := current_year ;
        END IF;
     
       next_month := date_part('month' ,rec_shipment_date."SHIPMENT_DATE" + interval '1 month') ;


      END IF;      

      IF to_number(current_month,'99') < 10 THEN
        current_month := '0'||current_month ; 
      END IF ;

      IF to_number(begin_month,'99') < 10 THEN
        begin_month := '0'||begin_month ; 
      END IF ;

      IF to_number(next_month,'99') < 10 THEN
        next_month := '0'||next_month ; 
      END IF ;

      RAISE NOTICE 'current_year=% , current_month=% ',current_year,current_month;
      RAISE NOTICE 'begin_year=% , begin_month=% ',begin_year,begin_month;
      RAISE NOTICE 'next_year=% , next_month=% ',next_year,next_month;

      partition_name := 'data_'||begin_year||begin_month||'01_'||next_year||next_month||'01'  ;

      RAISE NOTICE 'PARTITION NUMBER % , TABLE NAME =%',i , partition_name;
      
      EXECUTE format('CREATE TABLE ' || quote_ident(partition_name) || ' PARTITION OF data FOR VALUES FROM ( %L ) TO ( %L )  ' , begin_year||'-'||begin_month||'-01' , next_year||'-'||next_month||'-01'  ) ; 

      index_name := partition_name||'_shipment_id_parameter_id_idx';
      RAISE NOTICE 'INDEX NAME =%',index_name;
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("SHIPMENT_ID", "PARAMETER_ID") TABLESPACE pg_default ' ) ; 

      index_name := partition_name||'_lasermark_idx';
      RAISE NOTICE 'INDEX NAME =%',index_name;
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("LASERMARK" COLLATE pg_catalog."default") TABLESPACE pg_default ' ) ; 

      index_name := partition_name||'_shipment_id_idx';
      RAISE NOTICE 'INDEX NAME =%',index_name;
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("SHIPMENT_ID") TABLESPACE pg_default ' ) ; 

      index_name := partition_name||'_parameter_id_idx';
      RAISE NOTICE 'INDEX NAME =%',index_name;
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("PARAMETER_ID") TABLESPACE pg_default ' ) ; 

      index_name := partition_name||'_shipment_date_idx';
      RAISE NOTICE 'INDEX NAME =%',index_name;
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("SHIPMENT_DATE") TABLESPACE pg_default ' ) ; 

      --Drop first time flag
      first_flag := false ;

  END LOOP;
end
$$;

Download the dump created in step 3.

pg_restore -h хост -юзер -d база --data-only --format=c --table=data --verbose  data.dmp > data_restore.log 2>&1

Create a separate section for old data

---------------------------------------------------
--create_partition_for_old_dates.sql
--create partitions for keeping old dates 
--version 1.0
do language plpgsql $$
declare 
rec_shipment_date RECORD ;
partition_name varchar;
index_name varchar;

begin

      SELECT min("SHIPMENT_DATE") AS min_date INTO rec_shipment_date from data ;

      RAISE NOTICE 'Old date is %',rec_shipment_date.min_date ;

      partition_name := 'data_old_dates'  ;

      RAISE NOTICE 'PARTITION NAME IS %',partition_name;

      EXECUTE format('CREATE TABLE ' || quote_ident(partition_name) || ' PARTITION OF data FOR VALUES FROM ( %L ) TO ( %L )  ' , '1900-01-01' , 
              to_char( rec_shipment_date.min_date,'YYYY')||'-'||to_char(rec_shipment_date.min_date,'MM')||'-01'  ) ; 

      index_name := partition_name||'_shipment_id_parameter_id_idx';
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("SHIPMENT_ID", "PARAMETER_ID") TABLESPACE pg_default ' ) ; 

      index_name := partition_name||'_lasermark_idx';
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("LASERMARK" COLLATE pg_catalog."default") TABLESPACE pg_default ' ) ; 

      index_name := partition_name||'_shipment_id_idx';
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("SHIPMENT_ID") TABLESPACE pg_default ' ) ; 

      index_name := partition_name||'_parameter_id_idx';
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("PARAMETER_ID") TABLESPACE pg_default ' ) ; 

      index_name := partition_name||'_shipment_date_idx';
      EXECUTE format('CREATE INDEX ' || quote_ident(index_name) || ' ON '|| quote_ident(partition_name) ||' USING btree ("SHIPMENT_DATE") TABLESPACE pg_default ' ) ; 

end
$$;

Final results:

It was
Cost: 502 997.55
execution time: 505 seconds.

It became
Cost: 68 533.70
execution time: 69 seconds

Worthy, quite worthy. And given that along the way I managed to more or less master the partitioning mechanism in PostgreSQL 10 - An excellent result.

Lyrical digression

And you can do it even better - YES, YOU CAN!To do this, you need to use MATERIALIZED VIEW.
CREATE MATERIALIZED VIEW LASERMARK_VIEW

CREATE MATERIALIZED VIEW LASERMARK_VIEW 
AS
SELECT w."LASERMARK" , MAX(s."SHIPMENT_DATE") AS "SHIPMENT_DATE"
FROM shipment s INNER JOIN data w ON s."SHIPMENT_ID" = w."SHIPMENT_ID" 
GROUP BY w."LASERMARK" ;

CREATE INDEX lasermark_vw_shipment_date_ind on lasermark_view USING btree ("SHIPMENT_DATE") TABLESPACE pg_default;
analyze lasermark_view ;

Rewrite the query again:
Query using materialized view

SELECT
            p."PARAMETER_ID" as  parameter_id,
            pc."PC_NAME" AS pc_name,
            pc."CUSTOMER_PARTNUMBER" AS customer_partnumber,
            w."LASERMARK" AS lasermark,
            w."LOTID" AS lotid,
            w."REPORTED_VALUE" AS reported_value,
            w."LOWER_SPEC_LIMIT" AS lower_spec_limit,
            w."UPPER_SPEC_LIMIT" AS upper_spec_limit,
            p."TYPE_CALCUL" AS type_calcul,
            s."SHIPMENT_NAME" AS shipment_name,
            s."SHIPMENT_DATE" AS shipment_date,
            extract(year from s."SHIPMENT_DATE") AS year,
            extract(month from s."SHIPMENT_DATE") as month,
            s."REPORT_NAME" AS report_name,
            p."STC_NAME" AS STC_name,
            p."CUSTOMERPARAM_NAME" AS customerparam_name
        FROM data w INNER JOIN shipment s ON s."SHIPMENT_ID" = w."SHIPMENT_ID"
             INNER JOIN parameters p ON p."PARAMETER_ID" = w."PARAMETER_ID"
             INNER JOIN shipment_pc sp ON s."SHIPMENT_ID" = sp."SHIPMENT_ID"
             INNER JOIN pc pc ON pc."PC_ID" = sp."PC_ID"
             INNER JOIN LASERMARK_VIEW md ON md."SHIPMENT_DATE" = s."SHIPMENT_DATE" AND md."LASERMARK" = w."LASERMARK"
        WHERE 
              s."SHIPMENT_DATE" >= '2018-07-01' AND s."SHIPMENT_DATE" <= '2018-09-30';

And we get another result:
It was
Cost: 502 997.55
execution time: 505 seconds

It became
Cost: 42 481.16
execution time: 43 seconds.

Although, of course, such a promising result is deceptive, the ideas need to be refreshed. So the total time to receive data is not much help. But as an experiment, it is quite interesting.

Actually, as it turned out, thanks again asmm and Habru!- the query can be further improved.

Afterword

So the customer is satisfied. AND necessary take advantage of the situation.

New task: What can you think of to deepen and expand?

And then I remember - guys, but we don’t have monitoring of our PostgreSQL databases.

Hand on heart, there is still some kind of monitoring in the form of Cloud Watch on AWS. But what is the use of this monitoring for the DBA? In general, almost none.

If you have a chance to do something useful and interesting for yourself, you can’t not take advantage of this chance ...
FOR

Happy Party or a couple of lines-memories of acquaintance with partitioning in PostgreSQL10

And so we come to the most interesting part:

December 3, 2018.
Deciding to start researching the available PostgreSQL query performance monitoring capabilities.

But that's a completely different story.

To be continued…

Source: habr.com

Add a comment