PostgreSQL Antipatterns: Passing Sets and Selects to SQL

From time to time, a developer needs to pass a set of parameters, or even an entire result set, into a query "at the entrance". Sometimes very strange solutions to this problem come up.
Let's approach it "from the opposite direction" and see how not to do it, why, and how it can be done better.

Direct "insertion" of values ​​in the request body

It usually looks something like this:

query = "SELECT * FROM tbl WHERE id = " + value

... or like this:

query = "SELECT * FROM tbl WHERE id = :param".format(param=value)

Enough has already been said, written, and even drawn about this method.

Almost always it is a direct path to SQL injection and extra load on the business logic, which is forced to "glue together" your query string.

This approach can be partially justified only when you have to use partitioning in PostgreSQL versions 10 and below for a more efficient plan: in those versions the list of partitions to scan is determined without taking the passed parameters into account, based only on the query body.

$n arguments

Using $n placeholder parameters is good: it allows the use of PREPARED STATEMENTS, which reduces the load both on the business logic (the query string is formed and transmitted only once) and on the database server (no re-parsing and planning is required for each instance of the query).
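
For example, here is a minimal sketch of the same idea as a server-side prepared statement (the table tbl and its id column are just placeholders):

PREPARE get_by_id(integer) AS     -- parsed and planned once
  SELECT * FROM tbl WHERE id = $1;
EXECUTE get_by_id(42);            -- executed as many times as needed
DEALLOCATE get_by_id;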

Variable number of arguments

Problems arise when we want to pass a number of arguments that is not known in advance:

... id IN ($1, $2, $3, ...) -- $1 : 2, $2 : 3, $3 : 5, ...

If we leave the query in this form, it will save us from potential injections, but it will still require gluing/parsing the query for every distinct number of arguments. That is already better than doing it on every call, but we can do without it entirely.

It is enough to pass a single parameter containing the serialized representation of an array:

... id = ANY($1::integer[]) -- $1 : '{2,3,5,8,13}'

The only difference is the need to explicitly cast the argument to the desired array type. But this does not cause problems, since we already know in advance what we are querying.
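
Combined with a prepared statement, this might look like the following sketch (again, tbl and its id column are placeholders):

PREPARE get_by_ids(integer[]) AS
  SELECT * FROM tbl WHERE id = ANY($1);
EXECUTE get_by_ids('{2,3,5,8,13}'); -- one parameter, any number of values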

Sample transfer (matrix)

Usually these are various ways of passing a data set for insertion into the database "in a single query":

INSERT INTO tbl(k, v) VALUES($1,$2),($3,$4),...

In addition to the query "re-gluing" problems described above, this can also lead to out-of-memory conditions and a server crash. The reason is simple: PG reserves additional memory for the arguments, and the number of records in the set is limited only by the wishes of the business logic application. In especially clinical cases one could see "numbered" arguments beyond $9000: do not do it this way.

Let's rewrite the query, applying "two-level" serialization:

INSERT INTO tbl
SELECT
  unnest[1]::text k
, unnest[2]::integer v
FROM (
  SELECT
    unnest($1::text[])::text[] -- $1 : '{"{a,1}","{b,2}","{c,3}","{d,4}"}'
) T;

Yes, in the case of "complex" values inside the array, they need to be enclosed in quotes.
Clearly, a set with any number of fields can be "expanded" in the same way.
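
As an illustrative sketch, here is the same technique with three fields (the table tbl3 and its columns are hypothetical):

INSERT INTO tbl3(k, v, note)
SELECT
  unnest[1]::text
, unnest[2]::integer
, unnest[3]::text
FROM (
  SELECT
    unnest($1::text[])::text[] -- $1 : '{"{a,1,first}","{b,2,second}"}'
) T;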

unnest, unnest, …

From time to time there are variants where, instead of an "array of arrays", several "arrays of columns" are passed, which I mentioned in the previous article:

SELECT
  unnest($1::text[]) k
, unnest($2::integer[]) v;

With this method, if you make a mistake when generating the lists of values for the different columns, it is very easy to get completely unexpected results, which also depend on the server version:

-- $1 : '{a,b,c}', $2 : '{1,2}'
-- PostgreSQL 9.4
k | v
-----
a | 1
b | 2
c | 1
a | 2
b | 1
c | 2
-- PostgreSQL 11
k | v
-----
a | 1
b | 2
c |
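
If you do need parallel column arrays, note that since PostgreSQL 9.4 the multi-argument form of unnest in the FROM clause behaves more predictably: arrays of different lengths are padded with NULLs instead of being multiplied:

SELECT
  *
FROM
  unnest($1::text[], $2::integer[]) T(k, v);
-- $1 : '{a,b,c}', $2 : '{1,2}' gives (a,1), (b,2), (c,NULL)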

JSON

Starting with version 9.3, PostgreSQL has full-fledged functions for working with the json type. So if your input parameters are defined in the browser, you can form a json object for the SQL query right there:

SELECT
  key k
, value v
FROM
  json_each($1::json); -- '{"a":1,"b":2,"c":3,"d":4}'
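
If the values have to end up in typed columns, a possible sketch uses json_each_text, which returns the values as text (a target table tbl(k text, v integer) is assumed here):

INSERT INTO tbl(k, v)
SELECT
  key
, value::integer
FROM
  json_each_text($1::json); -- $1 : '{"a":1,"b":2,"c":3,"d":4}'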

For earlier versions, the same approach can be used with each(hstore), but correctly "packing" complex objects into hstore with proper escaping can cause problems.
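
For reference, a sketch of that hstore variant (it assumes the hstore extension is installed; the values come back as text):

-- CREATE EXTENSION IF NOT EXISTS hstore;
SELECT
  key   k
, value v
FROM
  each($1::hstore); -- $1 : '"a"=>"1","b"=>"2","c"=>"3"'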

json_populate_recordset

If you know in advance that the data from the "input" json array will be used to fill some table, you can save a lot on "dereferencing" fields and casting them to the desired types by using the json_populate_recordset function:

SELECT
  *
FROM
  json_populate_recordset(
    NULL::pg_class
  , $1::json -- $1 : '[{"relname":"pg_class","oid":1262},{"relname":"pg_namespace","oid":2615}]'
  );
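
In practice this can be combined directly with an INSERT; a hypothetical sketch with a target table tbl(k text, v integer):

INSERT INTO tbl
SELECT
  *
FROM
  json_populate_recordset(
    NULL::tbl
  , $1::json -- $1 : '[{"k":"a","v":1},{"k":"b","v":2}]'
  );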

json_to_recordset

And this function will simply "expand" the passed array of objects into a rowset, without relying on the format of any table:

SELECT
  *
FROM
  json_to_recordset($1::json) T(k text, v integer);
-- $1 : '[{"k":"a","v":1},{"k":"b","v":2}]'
k | v
-----
a | 1
b | 2

TEMPORARY TABLE

But if the amount of data in the transmitted set is very large, pushing it into a single serialized parameter is difficult and sometimes impossible, since it requires a large one-time memory allocation. For example, you may collect a large batch of event data from an external system over a long, long time, and then want to process it in one go on the database side.

In this case, the best solution would be to use temporary tables:

CREATE TEMPORARY TABLE tbl(k text, v integer);
...
INSERT INTO tbl(k, v) VALUES($1, $2); -- repeat many, many times
...
-- here we do something useful with the whole table at once
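
A sketch of the same flow with explicit cleanup (the table and column names are illustrative; ON COMMIT DROP requires all the steps to run in one transaction over one connection):

BEGIN;
CREATE TEMPORARY TABLE tbl_events(k text, v integer) ON COMMIT DROP;
INSERT INTO tbl_events(k, v) VALUES('a', 1), ('b', 2); -- repeated by the application
SELECT k, sum(v) FROM tbl_events GROUP BY k;           -- the final one-shot processing
COMMIT; -- the temporary table disappears here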

The method is good for infrequent transfers of large volumes of data.
In terms of describing the structure of its data, a temporary table differs from a "regular" one by just one attribute in the pg_class system table, and in pg_type, pg_depend, pg_attribute, pg_attrdef, ... it does not differ at all.

Therefore, in web systems with a large number of short-lived connections, such a table will generate new system catalog records for each of them, which are deleted when the connection to the database is closed. Eventually, uncontrolled use of TEMP TABLE leads to "bloating" of the tables in pg_catalog and slows down many operations that use them.
Of course, this can be mitigated by periodically running VACUUM FULL on the system catalog tables.

Session Variables

Suppose the processing of the data from the previous case is too complex for a single SQL query, but we want to perform it quite often. That is, we want to use procedural processing in a DO block, but passing the data through temporary tables would be too expensive.

We also cannot use $n parameters to pass values into an anonymous block. Session variables and the current_setting function will help us out of this situation.

Prior to version 9.2, you had to pre-configure a special namespace via custom_variable_classes for "your own" session variables. In current versions, you can write something like this:

SET my.val = '{1,2,3}';
DO $$
DECLARE
  id integer;
BEGIN
  FOR id IN (SELECT unnest(current_setting('my.val')::integer[])) LOOP
    RAISE NOTICE 'id : %', id;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
-- NOTICE:  id : 1
-- NOTICE:  id : 2
-- NOTICE:  id : 3
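
The same value can also be set without a separate SET statement, via the set_config function (its third argument chooses between a session-level and a transaction-local setting):

SELECT set_config('my.val', '{1,2,3}', false); -- false = session-level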

There are other solutions available in other supported procedural languages.

Know more ways? Share in the comments!

Source: habr.com
