The scientific poke method, or how to choose a DBMS configuration using benchmarks and an optimization algorithm

Hello.

I decided to share my find: the fruit of some thought, trial, and error.
By and large, it is no great discovery, of course; all of this should have been known long ago to those who do applied statistical data processing and optimization of systems in general, not necessarily of DBMSs.
And yes, they do know it and write interesting articles about their research; an example (UPD: in the comments, readers pointed to a very interesting project: ottertune).
On the other hand, offhand I don't see this approach widely mentioned or spread on the internet among IT specialists and DBAs.

So, to the point.

Suppose we have a task: to tune a certain service system to serve some kind of work.

About this work it is known what it is, how the quality of this work is measured, and what the criterion for measuring that quality is.

Let's also assume that it is more or less known and understood exactly how the work is performed in (or with) this service system.

"More or less" means that it is possible to prepare (or obtain somewhere) some tool, utility, or service with which a test load can be synthesized and applied to the system, a load adequate enough to what will run in production, under conditions adequate enough to production operation.

Let's also assume that the set of tuning parameters of this service system is known, the parameters that can be used to configure the system with respect to the performance of its work.

And the problem is that there is no sufficiently complete understanding of this service system, the kind that would allow one to expertly configure the system for the future load on the given platform and obtain the required performance of the system.

Well, yes, that is almost always the case.

What can be done here?

The first thing that comes to mind is to look into the documentation of the system, understand what the acceptable ranges of the tuning parameter values are, and then, for example, use the coordinate descent method to select values of the system parameters in tests.

That is, give the system some configuration, in the form of a specific set of values of its tuning parameters.

Apply a test load to it, with that very tool, the load generator.
And look at the value of the response, i.e. the metric of the system's quality.

The second thought is likely to be the conclusion that this takes a very long time.

That is: if there are many tuning parameters, if the ranges of their values to be tried are wide, and if each individual load test takes a lot of time, then yes, all of this can take an unacceptably long time.
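To put rough, purely illustrative numbers on it: with, say, 14 tuning parameters, 5 trial values per parameter and about ten minutes per load test, even a single pass of coordinate descent already means 14 × 5 = 70 tests, i.e. roughly 12 hours of machine time, and coordinate descent rarely settles after a single pass.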

Well, here is what one can notice and keep in mind.

The set of values of the tuning parameters of the service system can be viewed as a vector: a sequence of some values.

Each such vector, other things being equal (in the sense that those other things are not affected by this vector), corresponds to a completely definite value of the metric: the indicator of the quality of the system's work under the test load.

That is:

Let us denote the system configuration vector as V = (v1, v2, …, vN), where N is the number of configuration parameters of the system.

And the value of the metric corresponding to a given V we denote as M(V); we then get a function: M = f(V).
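To make the abstraction concrete, here is a minimal sketch in R, with made-up parameter names and a stub instead of a real measurement:

# the configuration is just a vector of parameter values (names and numbers are illustrative)
V = c(log_buffer_mb = 64, redo_group_size_mb = 256, redo_group_count = 4)

# and the quality metric is a function of that vector: M = f(V);
# a real f() would apply the configuration, run the load test and return the measured metric (e.g. tps)
f = function(V) {
 return (NA_real_)  # placeholder
}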

Well, then everything immediately comes down, in my case, to algorithms for finding the extremum of a function, almost forgotten since my student days.

But here an organizational and practical question arises: which algorithm to use?

  1. In the sense: the one that requires writing as little code by hand as possible.
  2. And the one that works, i.e. finds an extremum (if there is one), at least faster than coordinate descent.

The first point hints that we should look towards environments in which such algorithms are already implemented and are available, in some form, ready to be used in code.
Well, I know of python and cran-r.

The second point means that you need to read up on the algorithms themselves: what they are, what requirements they have, and what the particulars of their operation are.

And what useful side results they may give, either directly from the algorithm itself,

or what can be obtained from the results of the algorithm's run.

A lot depends on the input conditions.

For example, if for some reason you need to get a result faster, then you should look towards gradient descent algorithms and pick one of them.

Or, if time is not so important, you can, for example, use stochastic optimization methods, such as a genetic algorithm.
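For orientation only, here is a minimal sketch of what the faster, gradient-style route could look like in R; start_vector, lower_bounds and upper_bounds are hypothetical placeholders, and evaluate is the fitness function described later in the text:

# optim() minimizes, so the metric returned by the fitness function is negated;
# L-BFGS-B is a gradient-style method that supports box constraints on the parameters
res = optim(
 par    = start_vector,                    # hypothetical starting configuration vector
 fn     = function(v) { -evaluate(v) },    # maximize tps == minimize -tps
 method = "L-BFGS-B",
 lower  = lower_bounds,                    # hypothetical vectors with the allowed ranges
 upper  = upper_bounds
)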

I propose to look at how this approach works for selecting a system configuration, using a genetic algorithm, in the following, so to speak, laboratory work.

The initial data:

  1. Let the service system be: oracle xe 18c.
  2. Let it serve transactional activity, and the goal be: to get the highest possible throughput of the DBMS, in transactions/sec.
  3. Transactions can be very different in the nature of their work with data and in the context of that work.
    Let's agree that these are transactions that do not process large amounts of tabular data.
    In the sense that they do not generate particularly large amounts of undo data and do not process large percentages of rows of large tables.

These are transactions that change one row in a more or less large table, with a small number of indexes on that table.

In this scenario, the transaction-processing performance of the DBMS will, with one caveat, be determined by the quality of processing by the database's redo subsystem.

The caveat: if we are talking specifically about the DBMS settings.

Because, in general, there may also be, for example, transactional locks between SQL sessions, caused by the design of the users' work with the tabular data and/or of the table model.

These, of course, have a depressing effect on the tps metric, and they are a factor exogenous to the DBMS: well, that is simply how we designed the table model and the work with the data in it, so that locks occur.

Therefore, for the purity of the experiment, we exclude this factor; below I will explain how.

  1. Suppose, for definiteness, that 100% of the SQL commands submitted to the DBMS are DML commands.
    Let the characteristics of the users' work with the DBMS be the same across tests.
    Namely: the number of SQL sessions, the tabular data, and how the SQL sessions work with it.
  2. The DBMS works in FORCE LOGGING, ARCHIVELOG modes. Flashback-database mode is off, at the database level.
  3. Redo logs: located in a separate file system, on a separate "disk";
    The rest of the physical components of the database: in another, separate file system, on a separate "disk":

More on the physical layout of the laboratory database components

SQL> select status||' '||name from v$controlfile;
 /db/u14/oradata/XE/control01.ctl
SQL> select GROUP#||' '||MEMBER from v$logfile;
1 /db/u02/oradata/XE/redo01_01.log
2 /db/u02/oradata/XE/redo02_01.log
SQL> select FILE_ID||' '||TABLESPACE_NAME||' '||round(BYTES/1024/1024,2)||' '||FILE_NAME as col from dba_data_files;
4 UNDOTBS1 2208 /db/u14/oradata/XE/undotbs1_01.dbf
2 SLOB 128 /db/u14/oradata/XE/slob01.dbf
7 USERS 5 /db/u14/oradata/XE/users01.dbf
1 SYSTEM 860 /db/u14/oradata/XE/system01.dbf
3 SYSAUX 550 /db/u14/oradata/XE/sysaux01.dbf
5 MONITOR 128 /db/u14/oradata/XE/monitor.dbf
SQL> !cat /proc/mounts | egrep "/db/u[0-2]"
/dev/vda1 /db/u14 ext4 rw,noatime,nodiratime,data=ordered 0 0
/dev/mapper/vgsys-ora_redo /db/u02 xfs rw,noatime,nodiratime,attr2,nobarrier,inode64,logbsize=256k,noquota 0 0

Initially, for this kind of transactional load on the DBMS, I wanted to use the SLOB utility.
It has a wonderful property; I will quote the author:

At the heart of SLOB is the “SLOB method.” The SLOB Method aims to test platforms
without application content. One cannot drive maximum hardware performance
using application code that is, for example, bound by application locking or even
sharing Oracle Database blocks. That's right—there is overhead when sharing data
in data blocks! But SLOB—in its default deployment—is immune to such contention.

This declaration holds up: it is indeed so.
It is convenient to adjust the degree of concurrency of the SQL sessions: that is the -t key of the runit.sh launch utility from SLOB.
The percentage of DML commands, among the SQL statements that each SQL session sends to the DBMS, is controlled by the UPDATE_PCT parameter.
Separately and very conveniently: SLOB itself, before and after the load session, prepares a statspack or AWR snapshot (whichever it is configured to prepare).

However, it turned out that SLOB does not support SQL sessions with a duration of less than 30 seconds.
Therefore, I first coded my own, no-frills version of the loader, and then it stayed in use.

I will clarify, for the sake of clarity, what the loader does and how.
Essentially, the loader looks like this:

Worker code

function dotx()
{
local v_period="$2"
[ -z "$v_period" ] && v_period="0"
source "/home/oracle/testingredotracе/config.conf"

$ORACLE_HOME/bin/sqlplus -S system/${v_system_pwd} << __EOF__
whenever sqlerror exit failure
set verify off
set echo off
set feedback off

define wnum="$1"
define period="$v_period"
set appinfo worker_&&wnum

declare
 v_upto number;
 v_key  number;
 v_tots number;
 v_cts  number;
begin
 select max(col1) into v_upto from system.testtab_&&wnum;
 SELECT (( SYSDATE - DATE '1970-01-01' ) * 86400 ) into v_cts FROM DUAL;
 v_tots := &&period + v_cts;
 while v_cts <= v_tots
 loop
  v_key:=abs(mod(dbms_random.random,v_upto));
  if v_key=0 then
   v_key:=1;
  end if;
  update system.testtab_&&wnum t
  set t.object_name=translate(dbms_random.string('a', 120), 'abcXYZ', '158249')
  where t.col1=v_key
  ;
  commit;
  SELECT (( SYSDATE - DATE '1970-01-01' ) * 86400 ) into v_cts FROM DUAL;
 end loop;
end;
/

exit
__EOF__
}
export -f dotx

Workers are launched like this:

Running Workers

echo "starting test, duration: ${TEST_DURATION}" >> "$v_logfile"
for((i=1;i<="$SQLSESS_COUNT";i++))
do
 echo "sql-session: ${i}" >> "$v_logfile"
 dotx "$i" "${TEST_DURATION}" &
done
echo "waiting..." >> "$v_logfile"
wait

And tables for workers are prepared like this:

Creating tables

function createtable() {
source "/home/oracle/testingredotracе/config.conf"
$ORACLE_HOME/bin/sqlplus -S system/${v_system_pwd} << __EOF__
whenever sqlerror continue
set verify off
set echo off
set feedback off

define wnum="$1"
define ts_name="slob"

begin
 execute immediate 'drop table system.testtab_&&wnum';
exception when others then null;
end;
/

create table system.testtab_&&wnum tablespace &&ts_name as
select rownum as col1, t.*
from sys.dba_objects t
where rownum<1000
;
create index testtab_&&wnum._idx on system.testtab_&&wnum (col1);
--alter table system.testtab_&&wnum nologging;
--alter index system.testtab_&&wnum._idx nologging;
exit
__EOF__
}
export -f createtable

seq 1 1 "$SQLSESS_COUNT" | xargs -n 1 -P 4 -I {} -t bash -c "createtable "{}"" | tee -a "$v_logfile"
echo "createtable done" >> "$v_logfile"

That is, for each worker (practically: a separate SQL session to the DBMS), a separate table is created, and the worker works with it.

This achieves the absence of transactional locks between the workers' SQL sessions.
Each worker does the same thing, with its own table; the tables are all identical.
All workers perform their work for the same amount of time.
Moreover, for long enough that, for example, a log switch happens more than once.
And, accordingly, with the costs and effects associated with that.
In my case, the duration of the workers' run was set to 8 minutes.

A piece of a statspack report describing the operation of the DBMS under load

Database    DB Id    Instance     Inst Num  Startup Time   Release     RAC
~~~~~~~~ ----------- ------------ -------- --------------- ----------- ---
          2929910313 XE                  1 07-Sep-20 23:12 18.0.0.0.0  NO

Host Name             Platform                CPUs Cores Sockets   Memory (G)
~~~~ ---------------- ---------------------- ----- ----- ------- ------------
     billing.izhevsk1 Linux x86 64-bit           2     2       1         15.6

Snapshot       Snap Id     Snap Time      Sessions Curs/Sess Comment
~~~~~~~~    ---------- ------------------ -------- --------- ------------------
Begin Snap:       1630 07-Sep-20 23:12:27       55        .7
  End Snap:       1631 07-Sep-20 23:20:29       62        .6
   Elapsed:       8.03 (mins) Av Act Sess:       8.4
   DB time:      67.31 (mins)      DB CPU:      15.01 (mins)

Cache Sizes            Begin        End
~~~~~~~~~~~       ---------- ----------
    Buffer Cache:     1,392M              Std Block Size:         8K
     Shared Pool:       288M                  Log Buffer:   103,424K

Load Profile              Per Second    Per Transaction    Per Exec    Per Call
~~~~~~~~~~~~      ------------------  ----------------- ----------- -----------
      DB time(s):                8.4                0.0        0.00        0.20
       DB CPU(s):                1.9                0.0        0.00        0.04
       Redo size:        7,685,765.6              978.4
   Logical reads:           60,447.0                7.7
   Block changes:           47,167.3                6.0
  Physical reads:                8.3                0.0
 Physical writes:              253.4                0.0
      User calls:               42.6                0.0
          Parses:               23.2                0.0
     Hard parses:                1.2                0.0
W/A MB processed:                1.0                0.0
          Logons:                0.5                0.0
        Executes:           15,756.5                2.0
       Rollbacks:                0.0                0.0
    Transactions:            7,855.1

Returning to the setup of the laboratory work.
We will, other things being equal, vary the values of the following parameters of the laboratory DBMS:

  1. Size of the database redo log groups. Value range: [32, 1024] MB;
  2. Number of redo log groups in the database. Value range: [2, 32];
  3. log_archive_max_processes. Value range: [1, 8];
  4. commit_logging. Two values are allowed: batch|immediate;
  5. commit_wait. Two values are allowed: wait|nowait;
  6. log_buffer. Value range: [2, 128] MB;
  7. log_checkpoint_timeout. Value range: [60, 1200] seconds;
  8. db_writer_processes. Value range: [1, 4];
  9. undo_retention. Value range: [30, 300] seconds;
  10. transactions_per_rollback_segment. Value range: [1, 8];
  11. disk_asynch_io. Two values are allowed: true|false;
  12. filesystemio_options. The following values are allowed: none|setall|directIO|asynch;
  13. db_block_checking. The following values are allowed: OFF|LOW|MEDIUM|FULL;
  14. db_block_checksum. The following values are allowed: OFF|TYPICAL|FULL.
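To make the code below easier to cross-reference (the GA works with a plain numeric vector, so the order of the components matters), here is the mapping between vector positions and parameters as it is used in the fitness function further below; the lower/upper numbers are simply the bounds passed to GA::ga later in the text, and for the string-valued parameters they are arbitrary numeric ranges that the fitness function decodes back:

# position of each tuning parameter in the numeric vector, with the GA search bounds
gene_map = data.frame(
 idx   = 1:14,
 param = c("redo group size (MB)", "redo group count", "log_archive_max_processes",
           "commit_logging", "commit_wait", "log_buffer (MB)",
           "log_checkpoint_timeout", "db_writer_processes", "undo_retention",
           "transactions_per_rollback_segment", "disk_asynch_io",
           "filesystemio_options", "db_block_checking", "db_block_checksum"),
 lower = c(32, 2, 1, 1, 1, 2, 60, 1, 30, 1, 0, 0, 0, 0),
 upper = c(1024, 32, 8, 10, 10, 128, 800, 4, 300, 8, 10, 40, 40, 30)
)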

A person with experience in maintaining Oracle databases can, of course, already say right now which of the listed parameters should be set to which of their allowed values in order to get higher DBMS throughput for the work with data that is performed by the application code shown above.

But.

The point of the laboratory work is to show that the optimization algorithm itself will clarify this for us, and relatively quickly.

For us, it only remains to look into the documentation of the system being tuned, just enough to find out which parameters to change and within what ranges.
And also to write the code that will connect the chosen optimization algorithm to the system being tuned.

So, now about the code.
As mentioned above about cran-r: all manipulations with the system being tuned are orchestrated in the form of an R script.

The task itself, the analysis and the selection of system state vectors by the value of the metric, is handled by the GA package (documentation).
The package, in this case, is not entirely suitable, in the sense that it expects vectors (chromosomes, in the package's terms) to be given as strings of real numbers with a fractional part.

And my vector, made of the values of the tuning parameters, is 14 values: integers and string values.

The problem, of course, is easily circumvented by assigning certain specific numbers to the string values.
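A minimal illustration of this workaround: a string-valued parameter gets an arbitrary numeric range on its vector component, and the decoding back is done by thresholds (this is exactly the trick the evaluate() function below uses; here shown for commit_wait only, as a standalone example):

# hypothetical standalone example: component range [1,10], upper half decodes to NOWAIT
decode_commit_wait = function(g) {
 if ( round(g) > 5 ) { "NOWAIT" } else { "WAIT" }
}
decode_commit_wait(7.3)   # "NOWAIT"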

Thus, in the end, the main piece of the R script looks like this:

Call GA::ga

cat( "", file=v_logfile, sep="\n", append=F)

pSize = 10
elitism_value=1
pmutation_coef=0.8
pcrossover_coef=0.1
iterations=50

gam=GA::ga(type="real-valued", fitness=evaluate,
lower=c(32,2, 1,1,1,2,60,1,30,1,0,0, 0,0), upper=c(1024,32, 8,10,10,128,800,4,300,8,10,40, 40,30),
popSize=pSize,
pcrossover = pcrossover_coef,
pmutation = pmutation_coef,
maxiter=iterations,
run=4,
keepBest=T)
cat( "GA-session is done" , file=v_logfile, sep="\n", append=T)
gam@solution

Here, the lower and upper arguments of the ga subroutine essentially set the region of the search space within which the search will be performed for the vector (or vectors) for which the maximum value of the fitness function is obtained.

The ga subroutine performs the search by maximizing the fitness function.

Well, so it turns out that, in this case, the fitness function, interpreting the vector as a set of values for the corresponding DBMS parameters, has to obtain a metric from the DBMS.

That is: how many transactions per second the DBMS processes, with the given DBMS configuration and the given load on the DBMS.

That is, on each evaluation, the following multi-step pass has to be performed inside the fitness function:

  1. Processing of the input vector of numbers: converting it into values for the DBMS parameters.
  2. An attempt to create the given number of redo groups, of the given size. The attempt may fail.
    The log groups that already existed in the DBMS, in whatever number and of whatever size, must be removed, for the purity of the experiment.
  3. If the previous step succeeded: setting the values of the database configuration parameters (again: there may be a failure).
  4. If the previous step succeeded: stopping the DBMS and starting it again, so that the newly set parameter values take effect (again: there may be a failure).
  5. If the previous step succeeded: performing the load test and obtaining the metric from the DBMS.
  6. Returning the DBMS to its original state, i.e. removing the additional log groups and bringing the original DBMS configuration back into operation.

Fitness function code

evaluate=function(p_par) {
v_module="evaluate"
v_metric=0
opn=NULL
opn$rg_size=round(p_par[1],digit=0)
opn$rg_count=round(p_par[2],digit=0)
opn$log_archive_max_processes=round(p_par[3],digit=0)
opn$commit_logging="BATCH"
if ( round(p_par[4],digit=0) > 5 ) {
 opn$commit_logging="IMMEDIATE"
}
opn$commit_logging=paste("'", opn$commit_logging, "'",sep="")

opn$commit_wait="WAIT"
if ( round(p_par[5],digit=0) > 5 ) {
 opn$commit_wait="NOWAIT"
}
opn$commit_wait=paste("'", opn$commit_wait, "'",sep="")

opn$log_buffer=paste(round(p_par[6],digit=0),"m",sep="")
opn$log_checkpoint_timeout=round(p_par[7],digit=0)
opn$db_writer_processes=round(p_par[8],digit=0)
opn$undo_retention=round(p_par[9],digit=0)
opn$transactions_per_rollback_segment=round(p_par[10],digit=0)
opn$disk_asynch_io="true"
if ( round(p_par[11],digit=0) > 5 ) {
 opn$disk_asynch_io="false"
} 

opn$filesystemio_options="none"
if ( round(p_par[12],digit=0) > 10 && round(p_par[12],digit=0) <= 20 ) {
 opn$filesystemio_options="setall"
}
if ( round(p_par[12],digit=0) > 20 && round(p_par[12],digit=0) <= 30 ) {
 opn$filesystemio_options="directIO"
}
if ( round(p_par[12],digit=0) > 30 ) {
 opn$filesystemio_options="asynch"
}

opn$db_block_checking="OFF"
if ( round(p_par[13],digit=0) > 10 && round(p_par[13],digit=0) <= 20 ) {
 opn$db_block_checking="LOW"
}
if ( round(p_par[13],digit=0) > 20 && round(p_par[13],digit=0) <= 30 ) {
 opn$db_block_checking="MEDIUM"
}
if ( round(p_par[13],digit=0) > 30 ) {
 opn$db_block_checking="FULL"
}

opn$db_block_checksum="OFF"
if ( round(p_par[14],digit=0) > 10 && round(p_par[14],digit=0) <= 20 ) {
 opn$db_block_checksum="TYPICAL"
}
if ( round(p_par[14],digit=0) > 20 ) {
 opn$db_block_checksum="FULL"
}

v_vector=paste(round(p_par[1],digit=0),round(p_par[2],digit=0),round(p_par[3],digit=0),round(p_par[4],digit=0),round(p_par[5],digit=0),round(p_par[6],digit=0),round(p_par[7],digit=0),round(p_par[8],digit=0),round(p_par[9],digit=0),round(p_par[10],digit=0),round(p_par[11],digit=0),round(p_par[12],digit=0),round(p_par[13],digit=0),round(p_par[14],digit=0),sep=";")
cat( paste(v_module," try to evaluate vector: ", v_vector,sep="") , file=v_logfile, sep="\n", append=T)

rc=make_additional_rgroups(opn)
if ( rc!=0 ) {
 cat( paste(v_module,"make_additional_rgroups failed",sep="") , file=v_logfile, sep="\n", append=T)
 return (0)
}

v_rc=0
rc=set_db_parameter("log_archive_max_processes", opn$log_archive_max_processes)
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("commit_logging", opn$commit_logging )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("commit_wait", opn$commit_wait )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("log_buffer", opn$log_buffer )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("log_checkpoint_timeout", opn$log_checkpoint_timeout )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("db_writer_processes", opn$db_writer_processes )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("undo_retention", opn$undo_retention )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("transactions_per_rollback_segment", opn$transactions_per_rollback_segment )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("disk_asynch_io", opn$disk_asynch_io )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("filesystemio_options", opn$filesystemio_options )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("db_block_checking", opn$db_block_checking )
if ( rc != 0 ) {  v_rc=1 }
rc=set_db_parameter("db_block_checksum", opn$db_block_checksum )
if ( rc != 0 ) {  v_rc=1 }

if ( v_rc!=0 ) {  # v_rc accumulates failures of the set_db_parameter() calls above
 cat( paste(v_module," can not startup db with that vector of settings",sep="") , file=v_logfile, sep="\n", append=T)
 rc=stop_db("immediate")
 rc=create_spfile()
 rc=start_db("")
 rc=remove_additional_rgroups(opn)
 return (0)
}

rc=stop_db("immediate")
rc=start_db("")
if ( rc!=0 ) {
 cat( paste(v_module," can not startup db with that vector of settings",sep="") , file=v_logfile, sep="\n", append=T)
 rc=stop_db("abort")
 rc=create_spfile()
 rc=start_db("")
 rc=remove_additional_rgroups(opn)
 return (0)
}

rc=run_test()
v_metric=getmetric()

rc=stop_db("immediate")
rc=create_spfile()
rc=start_db("")
rc=remove_additional_rgroups(opn)

cat( paste("result: ",v_metric," ",v_vector,sep="") , file=v_logfile, sep="\n", append=T)
return (v_metric)
}
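The helpers run_test() and getmetric() are not shown here. Purely as an illustration of what getmetric() could look like, assuming the tps figure is taken from the Load Profile section of a statspack report like the one quoted earlier (the report path and the parsing are my assumptions; the author's version may just as well query v$sysstat instead):

# hypothetical sketch: pull the per-second transaction rate out of a statspack report
getmetric_sketch = function(report_file="/tmp/statspack_report.lst") {
 lines = readLines(report_file)
 tx_line = grep("Transactions:", lines, value=TRUE)[1]           # "Transactions:  7,855.1" line
 if ( is.na(tx_line) ) { return (0) }
 m = regmatches(tx_line, regexpr("[0-9,]+\\.[0-9]+", tx_line))   # first number = "Per Second" column
 return (as.numeric(gsub(",", "", m)))
}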

Thus, all the work is performed in the fitness function.

The ga subroutine processes the vectors or, more precisely, the chromosomes.
And what is most important for us in that processing is the selection of chromosomes with genes for which the fitness function produces large values.

This, in essence, is the process of searching for the optimal vector of chromosomes in an N-dimensional search space.

There is a very clear and detailed explanation, with examples of R code, of how the genetic algorithm works.

I would like to highlight two technical points.

Auxiliary calls from the evaluate function, for example the stop/start of the DBMS or setting the value of a DBMS parameter, are performed via the cran-r function system2,

with the help of which some bash script or command is then invoked.

For example:

set_db_parameter

set_db_parameter=function(p1, p2) {
v_module="set_db_parameter"
v_cmd="/home/oracle/testingredotracе/set_db_parameter.sh"
v_args=paste(p1," ",p2,sep="")

x=system2(v_cmd, args=v_args, stdout=T, stderr=T, wait=T)
if ( length(attributes(x)) > 0 ) {
 cat(paste(v_module," failed with: ",attributes(x)$status," ",v_cmd," ",v_args,sep=""), file=v_logfile, sep="\n", append=T)
 return (attributes(x)$status)
}
else {
 cat(paste(v_module," ok: ",v_cmd," ",v_args,sep=""), file=v_logfile, sep="\n", append=T)
 return (0)
}
}

The second point is the line in the evaluate function that saves the specific value of the metric and its corresponding tuning vector to the log file:

cat( paste("result: ",v_metric," ",v_vector,sep="") , file=v_logfile, sep="\n", append=T)

This is important because, from this data array, it will be possible to obtain additional information about which components of the tuning vector have a greater or lesser influence on the value of the metric.

That is, it will be possible to carry out an attribute-importance analysis.
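The code further below reads this data from /tmp/data1.dat; just as an illustration of how the logged lines could be turned into such a dataset (the column names are the ones used further below, the parsing itself is my assumption):

# hypothetical sketch: convert the "result: <metric> <v1;...;v14>" log lines
# into a semicolon-separated file like the one processed below
log_lines = readLines(v_logfile)
res_lines = grep("^result: ", log_lines, value=TRUE)
parts     = strsplit(sub("^result: ", "", res_lines), " ")
header    = paste(c("metric","rgsize","rgcount","lamp","cmtl","cmtw","lgbffr","lct","dbwrp",
                    "undo_retention","tprs","disk_async_io","filesystemio_options",
                    "db_block_checking","db_block_checksum"), collapse=";")
rows      = sapply(parts, function(p) paste(p[1], p[2], sep=";"))
writeLines(c(header, rows), "/tmp/data1.dat")
# note: read.table() below is called with dec=",", so the decimal separator
# of the metric values may need to be adjusted accordingly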

So, what can come out of this.

If you sort the tests in ascending order of the metric and plot them, the picture is as follows:

[Graph: metric (tps) values of the tests, sorted in ascending order]

Some data corresponding to the extreme values of the metric:
[Screenshot: tuning vectors and metric values for the best and worst test results]
Here, regarding the screenshot with the results, I will clarify: the values of the tuning vector are given in terms of the fitness function code, not in terms of the numbered list of parameters and value ranges given above in the text.

Well, whether ~8 thousand tps is a lot or a little is a separate question.
For the purposes of the laboratory work this figure is not important; what is important is the dynamics, how this value changes.

The dynamics are good.
Obviously, at least one factor that significantly affects the value of the metric was captured by the ga algorithm as it worked through the chromosome vectors.
Judging by the rather vigorous dynamics of the curve's values, there is at least one more factor that, although much weaker, still has an influence.

This is where attribute-importance analysis is needed, to understand which attributes (in this case: which components of the tuning vector) influence the value of the metric, and how strongly.
And, from this information, to understand which factors were affected by the changes in the significant attributes.

Attribute-importance analysis can be performed in many ways.

For these purposes I like the randomForest algorithm from the R package of the same name (documentation).
randomForest, as I understand its operation in general and its approach to assessing attribute importance in particular, builds a model of how the response variable depends on the attributes.

In our case, the response variable is the metric obtained from the DBMS in the load tests: tps.
And the attributes are the components of the tuning vector.

So randomForest assesses the importance of each model attribute with two numbers: %IncMSE, which shows how the presence or absence of this attribute in the model changes the MSE quality of the model (Mean Squared Error);

and IncNodePurity, a number that reflects how well, based on the values of this attribute, the dataset of observations can be split so that one part contains data with one value of the metric being explained and the other part with a different value.
Well, that is: how much of a classifying attribute it is (the most intelligible Russian-language explanation of random forest that I have seen is here).

No-frills R code for processing a dataset with the load test results:

x=NULL
v_data_file=paste('/tmp/data1.dat',sep="")
x=read.table(v_data_file, header = TRUE, sep = ";", dec=",", quote = "\"'", stringsAsFactors=FALSE)
colnames(x)=c('metric','rgsize','rgcount','lamp','cmtl','cmtw','lgbffr','lct','dbwrp','undo_retention','tprs','disk_async_io','filesystemio_options','db_block_checking','db_block_checksum')

idxTrain=sample(nrow(x),as.integer(nrow(x)*0.7))
idxNotTrain=which(! 1:nrow(x) %in% idxTrain )
TrainDS=x[idxTrain,]
ValidateDS=x[idxNotTrain,]

library(randomForest)
#mtry=as.integer( sqrt(dim(x)[2]-1) )
rf=randomForest(metric ~ ., data=TrainDS, ntree=40, mtry=3, replace=T, nodesize=2, importance=T, do.trace=10, localImp=F)
ValidateDS$predicted=predict(rf, newdata=ValidateDS[,colnames(ValidateDS)!="metric"], type="response")
sum((ValidateDS$metric-ValidateDS$predicted)^2)
rf$importance
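As a small follow-up, the same importance figures can be looked at graphically, and the rough shape of the dependence of the metric on an individual attribute can be inspected too; both functions come with the randomForest package, and the attribute name used here is just one of the columns defined above:

# plot %IncMSE and IncNodePurity for all attributes of the fitted model
randomForest::varImpPlot(rf)

# partial dependence of the metric on one component of the tuning vector,
# e.g. the commit_wait component (the "cmtw" column above)
randomForest::partialPlot(rf, pred.data=TrainDS, x.var="cmtw")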

You can select the hyperparameters of the algorithm directly by hand and, judging by the quality of the model, choose a model that makes more accurate predictions on the validation dataset.
You can write some function for this work (by the way, again using some optimization algorithm).

You can use the R package caret; that is not the important point.
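If caret is used, a minimal sketch (under the assumption of the same TrainDS as above) might look like this; mtry is the only hyperparameter caret tunes for method="rf" out of the box:

library(caret)
set.seed(1)
# 5-fold cross-validation over a small grid of mtry values; ntree is passed through to randomForest
ctrl = trainControl(method="cv", number=5)
fit  = train(metric ~ ., data=TrainDS,
             method="rf", ntree=40,
             tuneGrid=data.frame(mtry=2:6),
             trControl=ctrl)
fit$bestTune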

As a result, in this case, we get the following picture of the degree of importance of the attributes:

[Figure: randomForest attribute importance (%IncMSE and IncNodePurity) for the components of the tuning vector]

Well. So now we can start on the global reflections:

  1. It turns out that the most significant parameter, under these testing conditions, is commit_wait.
    Technically, it sets the execution mode of the I/O operation that writes redo data from the DBMS log buffer to the current log group: synchronous or asynchronous.
    The nowait value, which results in an almost vertical, multiple increase in the value of the tps metric, enables this asynchronous I/O mode for writing into the redo groups.
    Whether or not one should do this in a production database is a separate question. Here I limit myself to the statement: it is a significant factor.
  2. It is logical that the size of the DBMS log buffer turns out to be a significant factor.
    The smaller the log buffer, the lower its buffering capacity, the more often it overflows and/or a free area cannot be allocated in it for a portion of new redo data.
    Which means: delays related to allocating space in the log buffer and/or to dumping redo data from it into the redo groups.
    These delays, of course, should and do affect the transaction throughput of the DBMS.
  3. The db_block_checksum parameter: well, this too is understandable in general; transaction processing produces dirty blocks in the DBMS buffer cache.
    With datablock checksum checking enabled, the database has to process these: calculate the checksums over the datablock body and compare them with what is written in the datablock header: matches / does not match.
    Such work, again, cannot help but delay data processing, so the parameter, and the mechanism that this parameter controls, turn out to be significant.
    Which is why the vendor offers, in the documentation for this parameter, different values for it and notes that, yes, there will be an impact, but you can choose among the different values, down to "off", with their different impact.

Well, the global conclusion.

The approach, in general, turns out to be quite workable.

At early stages of load testing of a service system, it quite allows one to select an optimal (for that load) configuration of the system, without delving very deeply into the specifics of tuning the system for the load.

But it does not exclude that entirely: at least at the level of understanding the "adjustment knobs" and the allowed ranges of rotation of those knobs, you still need to know the system.

Further on, the approach can relatively quickly find the optimal system configuration.
And, based on the results of the testing, it is possible to obtain information about the nature of the relationship between the quality metric of the system and the values of the system's settings.

Which, of course, should contribute to the emergence of that very deep understanding of the system and of its operation, at least under the given load.

In practice, this is an exchange of the costs of understanding the system being tuned for the costs of preparing such testing of the system.

Separately, I note: in this approach, the degree to which the testing of the system is adequate to the operating conditions it will have in production is critically important.

Thank you for your attention and time.

Source: habr.com
