Sometimes, through an oversight, exact clones of already existing records end up in a table that has no primary key or other unique index. For example, the values of a time-series metric are being written via a PostgreSQL COPY stream, a sudden failure occurs, and then part of the very same data arrives all over again.
How do we rid the database of the unnecessary clones?
When a PK is no help
The simplest approach is to prevent the situation from arising at all, for example, by declaring a suitable PRIMARY KEY. But that is not always possible without increasing the volume of stored data.
For example, when the precision of the source system is higher than the precision of the field in the database:
metric | ts | data
--------------------------------------------------
cpu.busy | 2019-12-20 00:00:00 | {"value" : 12.34}
cpu.busy | 2019-12-20 00:00:01 | {"value" : 10}
cpu.busy | 2019-12-20 00:00:01 | {"value" : 11.2}
cpu.busy | 2019-12-20 00:00:03 | {"value" : 15.7}
Noticed? Instead of 00:00:02, the reading was written to the database with a ts one second earlier, yet it remains perfectly valid from the application's point of view (after all, the data values are different!).
Of course, we could declare PK(metric, ts), but then we would get insertion conflicts on perfectly valid data.
We could declare PK(metric, ts, data) instead, but that would greatly inflate the index while we would never actually query by it.
Therefore, the most reasonable option is to create a regular non-unique index on (metric, ts) and deal with problems after the fact, if they do arise.
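As a sketch of this setup (the table and index names here are illustrative, not from the original schema):
CREATE TABLE metric_values(
  metric text
, ts timestamp
, data json
);
-- a plain non-unique index: fully identical rows insert without conflicts
CREATE INDEX ix_metric_values_metric_ts ON metric_values(metric, ts);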
"The Clone War has begun"
An accident of some kind has happened, and now we have to destroy the clone records in the table.
Let's simulate the original data:
CREATE TABLE tbl(k text, v integer);
INSERT INTO tbl
VALUES
('a', 1)
, ('a', 3)
, ('b', 2)
, ('b', 2) -- oops!
, ('c', 3)
, ('c', 3) -- oops!!
, ('c', 3) -- oops!!
, ('d', 4)
, ('e', 5)
;
Here our hand slipped three times on Ctrl+V, and now...
First, keep in mind that our table can be quite large, so after we find all the clones, it is advisable to literally "point a finger" at the specific records to delete, without searching for them all over again.
And there is such a way: addressing by ctid, the physical identifier of a specific row version in PostgreSQL.
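A quick illustration, assuming the freshly inserted test data above (output abbreviated): ctid is a system column available in every table, holding the (page, tuple) address of the row version.
SELECT ctid, * FROM tbl;
--  ctid  | k | v
-- -------+---+---
--  (0,1) | a | 1
--  (0,2) | a | 3
-- ...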
So, first of all, we need to collect the ctids of the records grouped by the full content of the table row. The simplest option is to cast the entire row to text:
SELECT
T::text
, array_agg(ctid) ctids
FROM
tbl T
GROUP BY
1;
t | ctids
---------------------------------
(e,5) | {"(0,9)"}
(d,4) | {"(0,8)"}
(c,3) | {"(0,5)","(0,6)","(0,7)"}
(b,2) | {"(0,3)","(0,4)"}
(a,3) | {"(0,2)"}
(a,1) | {"(0,1)"}
Is it possible to avoid the cast? In principle, yes, in most cases. That holds right up until the table contains fields of a type with no equality operator:
CREATE TABLE tbl(k text, v integer, x point);
SELECT
array_agg(ctid) ctids
FROM
tbl T
GROUP BY
T;
-- ERROR: could not identify an equality operator for type tbl
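The text cast sidesteps this restriction, since every type has a textual representation; a sketch against the same point-bearing table:
SELECT
  array_agg(ctid) ctids
FROM
  tbl T
GROUP BY
  T::text; -- the whole row is compared via its text form, so point is fine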
Now we can immediately see that wherever an array holds more than one entry, everything after the first element is a clone (PostgreSQL arrays are 1-based, so the slice ctids[2:] takes the second element onward). Let's keep just those:
SELECT
unnest(ctids[2:])
FROM
(
SELECT
array_agg(ctid) ctids
FROM
tbl T
GROUP BY
T::text
) T;
unnest
------
(0,6)
(0,7)
(0,4)
For those who like it shorter, the same can be written like this:
SELECT
unnest((array_agg(ctid))[2:])
FROM
tbl T
GROUP BY
T::text;
Since the value of the serialized row itself is of no interest to us, we simply dropped it from the columns returned by the subquery.
All that remains is to make DELETE use the set we obtained:
DELETE FROM
tbl
WHERE
ctid = ANY(ARRAY(
SELECT
unnest(ctids[2:])
FROM
(
SELECT
array_agg(ctid) ctids
FROM
tbl T
GROUP BY
T::text
) T
)::tid[]);
Let's check ourselves:
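To obtain the plan ourselves, we can wrap the statement in EXPLAIN; a sketch, keeping in mind that EXPLAIN ANALYZE really executes the deletion:
EXPLAIN (ANALYZE, BUFFERS) -- careful: this actually performs the DELETE
DELETE FROM
  tbl
WHERE
  ctid = ANY(ARRAY(
    SELECT
      unnest(ctids[2:])
    FROM
      (
        SELECT
          array_agg(ctid) ctids
        FROM
          tbl T
        GROUP BY
          T::text
      ) T
  )::tid[]);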
Yes, everything is correct: our 3 records were found by a single Seq Scan over the entire table, and the Delete node then located the rows in a single pass using a Tid Scan:
-> Tid Scan on tbl (actual time=0.050..0.051 rows=3 loops=1)
TID Cond: (ctid = ANY ($0))
What if you are cleaning out a lot of records? Let's check on a larger table with many more duplicates:
TRUNCATE TABLE tbl;
INSERT INTO tbl
SELECT
chr(ascii('a'::text) + (random() * 25)::integer) k -- a..z
, (random() * 100)::integer v -- 0..99
FROM
generate_series(1, 10000) i;
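A quick sanity check before and after (a hypothetical helper query, not from the original article): the number of rows minus the number of distinct (k, v) pairs gives the clone count.
SELECT count(*) - count(DISTINCT (k, v)) AS clones FROM tbl;
-- run the DELETE from above, then repeat:
SELECT count(*) - count(DISTINCT (k, v)) AS clones FROM tbl; -- expect 0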
So the method works, but it must be applied with some caution: for every record deleted there is one read of its data page in the Tid Scan and another in the Delete node.
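A further operational note beyond the original text: after a bulk cleanup like this, the dead tuples still occupy space until vacuum runs, so it may be worth following up with:
VACUUM (ANALYZE) tbl; -- make the freed space reusable and refresh planner statistics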
Source: habr.com