Cleaning clone records from a table without a PK

There are situations when, through an oversight, exact clones of already existing records make it into a table that has no primary key or any other unique index.


For example, values of a time-series metric are being written into PostgreSQL via a COPY stream, then a sudden failure happens, and part of the exact same data arrives again.

How to rid the database of unnecessary clones?

When a PK doesn't help

The simplest approach is to prevent the situation from arising in the first place, for example, by adding that very PRIMARY KEY. But this is not always possible without increasing the amount of stored data.

For example, if the precision of the source system is higher than the precision of the field in the database:

metric   | ts                  | data
--------------------------------------------------
cpu.busy | 2019-12-20 00:00:00 | {"value" : 12.34}
cpu.busy | 2019-12-20 00:00:01 | {"value" : 10}
cpu.busy | 2019-12-20 00:00:01 | {"value" : 11.2}
cpu.busy | 2019-12-20 00:00:03 | {"value" : 15.7}

Notice? The reading that should have had ts = 00:00:02 was written to the database with a ts one second earlier, yet it is still perfectly valid from the application's point of view (after all, the data values are different!).

Of course, you could declare PK(metric, ts), but then you would get insertion conflicts on perfectly valid data.

You could declare PK(metric, ts, data), but that would greatly inflate the index, and you would never actually use it for lookups.

Therefore, the most sensible option is to create a regular non-unique index on (metric, ts) and deal with problems after the fact, if they ever arise.
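
As a minimal sketch of that option (the table and index names here are purely illustrative, they are not taken from any real schema):

-- hypothetical table matching the metric example above
CREATE TABLE metric_values(
  metric text
, ts     timestamp
, data   json
);

-- regular non-unique index: fast lookups by (metric, ts), no insertion conflicts
CREATE INDEX ix_metric_values_metric_ts
  ON metric_values(metric, ts);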

"The Clone War has begun"

There was some kind of accident, and now we have to destroy the clone records from the table.


Let's simulate the original data:

CREATE TABLE tbl(k text, v integer);

INSERT INTO tbl
VALUES
  ('a', 1)
, ('a', 3)
, ('b', 2)
, ('b', 2) -- oops!
, ('c', 3)
, ('c', 3) -- oops!!
, ('c', 3) -- oops!!
, ('d', 4)
, ('e', 5)
;

Here our hand slipped on Ctrl+V three times, and now...

First, let's keep in mind that our table can be really big, so once we have found all the clones, it is better for us to literally "point a finger" at the specific records to delete, without searching for them all over again.

And there is such a way: addressing rows by ctid, the physical identifier of a specific record.
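
The ctid is a system column available on every row, so you can see it by simply selecting it alongside the data. On our test table it looks something like this:

SELECT ctid, * FROM tbl;

ctid  | k | v
------+---+---
(0,1) | a | 1
(0,2) | a | 3
(0,3) | b | 2
...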

That is, first of all we need to collect the ctids of the records grouped by the full content of the table row. The simplest option is to cast the entire row to text:

SELECT
  T::text
, array_agg(ctid) ctids
FROM
  tbl T
GROUP BY
  1;

t     | ctids
---------------------------------
(e,5) | {"(0,9)"}
(d,4) | {"(0,8)"}
(c,3) | {"(0,5)","(0,6)","(0,7)"}
(b,2) | {"(0,3)","(0,4)"}
(a,3) | {"(0,2)"}
(a,1) | {"(0,1)"}

Is it possible to skip the cast? In principle, yes, in most cases, right up until the table starts using fields of a type that has no equality operator:

CREATE TABLE tbl(k text, v integer, x point);
SELECT
  array_agg(ctid) ctids
FROM
  tbl T
GROUP BY
  T;
-- ERROR:  could not identify an equality operator for type tbl

Yeah, we can see right away: if there is more than one entry in the array, all of them beyond the first are clones. Let's keep just those:

SELECT
  unnest(ctids[2:])
FROM
  (
    SELECT
      array_agg(ctid) ctids
    FROM
      tbl T
    GROUP BY
      T::text
  ) T;

unnest
------
(0,6)
(0,7)
(0,4)

For those who like to write it shorter, the same thing can be written like this:

SELECT
  unnest((array_agg(ctid))[2:])
FROM
  tbl T
GROUP BY
  T::text;

Since the serialized row value itself is of no interest to us, we simply dropped it from the columns returned by the subquery.

The only thing left is to force DELETE to use the set we received:

DELETE FROM
  tbl
WHERE
  ctid = ANY(ARRAY(
    SELECT
      unnest(ctids[2:])
    FROM
      (
        SELECT
          array_agg(ctid) ctids
        FROM
          tbl T
        GROUP BY
          T::text
      ) T
  )::tid[]);

Let's check ourselves:

[query plan: look at explain.tensor.ru]

Yes, everything is correct: our 3 records were found with a single Seq Scan over the entire table, and the Delete node located the target rows in a single pass using a Tid Scan:

->  Tid Scan on tbl (actual time=0.050..0.051 rows=3 loops=1)
      TID Cond: (ctid = ANY ($0))

If you cleared a lot of records, don't forget to run VACUUM ANALYZE.
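
For our test table that is simply:

VACUUM ANALYZE tbl; -- lets the space of the deleted tuples be reused and refreshes planner statistics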

Let's check this on a larger table with more duplicates:

TRUNCATE TABLE tbl;

INSERT INTO tbl
SELECT
  chr(ascii('a'::text) + (random() * 26)::integer) k -- a..z
, (random() * 100)::integer v -- 0..99
FROM
  generate_series(1, 10000) i;

[query plan: look at explain.tensor.ru]
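
Running the exact same DELETE as above on this data works just as well. As a quick sanity check afterwards (an extra verification step, not part of the walkthrough above), you can make sure no duplicated rows remain:

SELECT
  k, v, count(*)
FROM
  tbl
GROUP BY
  k, v
HAVING
  count(*) > 1; -- should return no rows once every extra copy has been deleted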

So the method works, but it must be applied with some caution: for every record being deleted, one data page is read in the Tid Scan and one more in the Delete node itself.
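
If you want to see those page reads for yourself before cleaning up for real, you can wrap the same DELETE in EXPLAIN. Keep in mind that EXPLAIN ANALYZE actually executes the statement, so run it inside a transaction and roll back if you only want the numbers (just a sketch; the buffer counts will depend on your data):

BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM
  tbl
WHERE
  ctid = ANY(ARRAY(
    SELECT
      unnest(ctids[2:])
    FROM
      (
        SELECT
          array_agg(ctid) ctids
        FROM
          tbl T
        GROUP BY
          T::text
      ) T
  )::tid[]);

ROLLBACK; -- nothing is deleted, we only wanted the plan and the buffer counts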

Source: habr.com
