Some aspects of MS SQL Server monitoring. Guidelines for Setting Trace Flags

foreword

Quite often, users, developers and administrators of the MS SQL Server DBMS encounter performance problems of the database or the DBMS as a whole, so MS SQL Server monitoring is very relevant.
This article is an addition to the article Using Zabbix to Monitor MS SQL Server Database and it will cover some aspects of monitoring MS SQL Server, in particular: how to quickly determine which resources are missing, as well as recommendations for setting trace flags.
For the following scripts to work, you need to create an inf schema in the desired database as follows:
Creating an inf schema

use <имя_Π‘Π”>;
go
create schema inf;

Method for detecting lack of RAM

The first indicator of the lack of RAM is the case when an instance of MS SQL Server eats up all the RAM allocated to it.
To do this, we will create the following representation of inf.vRAM:
Creating the inf.vRAM view

CREATE view [inf].[vRAM] as
select a.[TotalAvailOSRam_Mb]						--сколько свободно ΠžΠ—Π£ Π½Π° сСрвСрС Π² ΠœΠ‘
		 , a.[RAM_Avail_Percent]					--ΠΏΡ€ΠΎΡ†Π΅Π½Ρ‚ свободного ΠžΠ—Π£ Π½Π° сСрвСрС
		 , a.[Server_physical_memory_Mb]				--сколько всСго ΠžΠ—Π£ Π½Π° сСрвСрС Π² ΠœΠ‘
		 , a.[SQL_server_committed_target_Mb]			--сколько всСго ΠžΠ—Π£ Π²Ρ‹Π΄Π΅Π»Π΅Π½ΠΎ ΠΏΠΎΠ΄ MS SQL Server Π² ΠœΠ‘
		 , a.[SQL_server_physical_memory_in_use_Mb] 		--сколько всСго ΠžΠ—Π£ потрСбляСт MS SQL Server Π² Π΄Π°Π½Π½Ρ‹ΠΉ ΠΌΠΎΠΌΠ΅Π½Ρ‚ Π²Ρ€Π΅ΠΌΠ΅Π½ΠΈ Π² ΠœΠ‘
		 , a.[SQL_RAM_Avail_Percent]				--ΠΏΠΎΡ†Π΅Π½Ρ‚ свободного ΠžΠ—Π£ для MS SQL Server ΠΎΡ‚Π½ΠΎΡΠΈΡ‚Π΅Π»ΡŒΠ½ΠΎ всСго Π²Ρ‹Π΄Π΅Π»Π΅Π½Π½ΠΎΠ³ΠΎ ΠžΠ—Π£ для MS SQL Server
		 , a.[StateMemorySQL]						--достаточно Π»ΠΈ ΠžΠ—Π£ для MS SQL Server
		 , a.[SQL_RAM_Reserve_Percent]				--ΠΏΡ€ΠΎΡ†Π΅Π½Ρ‚ Π²Ρ‹Π΄Π΅Π»Π΅Π½Π½ΠΎΠΉ ΠžΠ—Π£ для MS SQL Server ΠΎΡ‚Π½ΠΎΡΠΈΡ‚Π΅Π»ΡŒΠ½ΠΎ всСго ΠžΠ—Π£ сСрвСра
		 --достаточно Π»ΠΈ ΠžΠ—Π£ для сСрвСра
		, (case when a.[RAM_Avail_Percent]<10 and a.[RAM_Avail_Percent]>5 and a.[TotalAvailOSRam_Mb]<8192 then 'Warning' when a.[RAM_Avail_Percent]<=5 and a.[TotalAvailOSRam_Mb]<2048 then 'Danger' else 'Normal' end) as [StateMemoryServer]
	from
	(
		select cast(a0.available_physical_memory_kb/1024.0 as int) as TotalAvailOSRam_Mb
			 , cast((a0.available_physical_memory_kb/casT(a0.total_physical_memory_kb as float))*100 as numeric(5,2)) as [RAM_Avail_Percent]
			 , a0.system_low_memory_signal_state
			 , ceiling(b.physical_memory_kb/1024.0) as [Server_physical_memory_Mb]
			 , ceiling(b.committed_target_kb/1024.0) as [SQL_server_committed_target_Mb]
			 , ceiling(a.physical_memory_in_use_kb/1024.0) as [SQL_server_physical_memory_in_use_Mb]
			 , cast(((b.committed_target_kb-a.physical_memory_in_use_kb)/casT(b.committed_target_kb as float))*100 as numeric(5,2)) as [SQL_RAM_Avail_Percent]
			 , cast((b.committed_target_kb/casT(a0.total_physical_memory_kb as float))*100 as numeric(5,2)) as [SQL_RAM_Reserve_Percent]
			 , (case when (ceiling(b.committed_target_kb/1024.0)-1024)<ceiling(a.physical_memory_in_use_kb/1024.0) then 'Warning' else 'Normal' end) as [StateMemorySQL]
		from sys.dm_os_sys_memory as a0
		cross join sys.dm_os_process_memory as a
		cross join sys.dm_os_sys_info as b
		cross join sys.dm_os_sys_memory as v
	) as a;

Then you can determine that an instance of MS SQL Server consumes all the memory allocated to it by the following query:

select  SQL_server_physical_memory_in_use_Mb,  SQL_server_committed_target_Mb
from [inf].[vRAM];

If SQL_server_physical_memory_in_use_Mb is consistently greater than or equal to SQL_server_committed_target_Mb, then the wait statistics should be checked.
To determine the lack of RAM through wait statistics, let's create the inf.vWaits view:
Creating the inf.vWaits View

CREATE view [inf].[vWaits] as
WITH [Waits] AS
    (SELECT
        [wait_type], --имя Ρ‚ΠΈΠΏΠ° оТидания
        [wait_time_ms] / 1000.0 AS [WaitS],--ΠžΠ±Ρ‰Π΅Π΅ врСмя оТидания Π΄Π°Π½Π½ΠΎΠ³ΠΎ Ρ‚ΠΈΠΏΠ° Π² миллисСкундах. Π­Ρ‚ΠΎ врСмя Π²ΠΊΠ»ΡŽΡ‡Π°Π΅Ρ‚ signal_wait_time_ms
        ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],--ΠžΠ±Ρ‰Π΅Π΅ врСмя оТидания Π΄Π°Π½Π½ΠΎΠ³ΠΎ Ρ‚ΠΈΠΏΠ° Π² миллисСкундах Π±Π΅Π· signal_wait_time_ms
        [signal_wait_time_ms] / 1000.0 AS [SignalS],--Π Π°Π·Π½ΠΈΡ†Π° ΠΌΠ΅ΠΆΠ΄Ρƒ Π²Ρ€Π΅ΠΌΠ΅Π½Π΅ΠΌ сигнализации ΠΎΠΆΠΈΠ΄Π°ΡŽΡ‰Π΅Π³ΠΎ ΠΏΠΎΡ‚ΠΎΠΊΠ° ΠΈ Π²Ρ€Π΅ΠΌΠ΅Π½Π΅ΠΌ Π½Π°Ρ‡Π°Π»Π° Π΅Π³ΠΎ выполнСния
        [waiting_tasks_count] AS [WaitCount],--Число ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΠΉ Π΄Π°Π½Π½ΠΎΠ³ΠΎ Ρ‚ΠΈΠΏΠ°. Π­Ρ‚ΠΎΡ‚ счСтчик наращиваСтся ΠΊΠ°ΠΆΠ΄Ρ‹ΠΉ Ρ€Π°Π· ΠΏΡ€ΠΈ Π½Π°Ρ‡Π°Π»Π΅ оТидания
        100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
        ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
    FROM sys.dm_os_wait_stats
    WHERE [waiting_tasks_count]>0
		and [wait_type] NOT IN (
        N'BROKER_EVENTHANDLER',         N'BROKER_RECEIVE_WAITFOR',
        N'BROKER_TASK_STOP',            N'BROKER_TO_FLUSH',
        N'BROKER_TRANSMITTER',          N'CHECKPOINT_QUEUE',
        N'CHKPT',                       N'CLR_AUTO_EVENT',
        N'CLR_MANUAL_EVENT',            N'CLR_SEMAPHORE',
        N'DBMIRROR_DBM_EVENT',          N'DBMIRROR_EVENTS_QUEUE',
        N'DBMIRROR_WORKER_QUEUE',       N'DBMIRRORING_CMD',
        N'DIRTY_PAGE_POLL',             N'DISPATCHER_QUEUE_SEMAPHORE',
        N'EXECSYNC',                    N'FSAGENT',
        N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX',
        N'HADR_CLUSAPI_CALL',           N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
        N'HADR_LOGCAPTURE_WAIT',        N'HADR_NOTIFICATION_DEQUEUE',
        N'HADR_TIMER_TASK',             N'HADR_WORK_QUEUE',
        N'KSOURCE_WAKEUP',              N'LAZYWRITER_SLEEP',
        N'LOGMGR_QUEUE',                N'ONDEMAND_TASK_QUEUE',
        N'PWAIT_ALL_COMPONENTS_INITIALIZED',
        N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP',
        N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP',
        N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE',
        N'SERVER_IDLE_CHECK',           N'SLEEP_BPOOL_FLUSH',
        N'SLEEP_DBSTARTUP',             N'SLEEP_DCOMSTARTUP',
        N'SLEEP_MASTERDBREADY',         N'SLEEP_MASTERMDREADY',
        N'SLEEP_MASTERUPGRADED',        N'SLEEP_MSDBSTARTUP',
        N'SLEEP_SYSTEMTASK',            N'SLEEP_TASK',
        N'SLEEP_TEMPDBSTARTUP',         N'SNI_HTTP_ACCEPT',
        N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH',
        N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        N'SQLTRACE_WAIT_ENTRIES',       N'WAIT_FOR_RESULTS',
        N'WAITFOR',                     N'WAITFOR_TASKSHUTDOWN',
        N'WAIT_XTP_HOST_WAIT',          N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG',
        N'WAIT_XTP_CKPT_CLOSE',         N'XE_DISPATCHER_JOIN',
        N'XE_DISPATCHER_WAIT',          N'XE_TIMER_EVENT')
    )
, ress as (
	SELECT
	    [W1].[wait_type] AS [WaitType],
	    CAST ([W1].[WaitS] AS DECIMAL (16, 2)) AS [Wait_S],--ΠžΠ±Ρ‰Π΅Π΅ врСмя оТидания Π΄Π°Π½Π½ΠΎΠ³ΠΎ Ρ‚ΠΈΠΏΠ° Π² миллисСкундах. Π­Ρ‚ΠΎ врСмя Π²ΠΊΠ»ΡŽΡ‡Π°Π΅Ρ‚ signal_wait_time_ms
	    CAST ([W1].[ResourceS] AS DECIMAL (16, 2)) AS [Resource_S],--ΠžΠ±Ρ‰Π΅Π΅ врСмя оТидания Π΄Π°Π½Π½ΠΎΠ³ΠΎ Ρ‚ΠΈΠΏΠ° Π² миллисСкундах Π±Π΅Π· signal_wait_time_ms
	    CAST ([W1].[SignalS] AS DECIMAL (16, 2)) AS [Signal_S],--Π Π°Π·Π½ΠΈΡ†Π° ΠΌΠ΅ΠΆΠ΄Ρƒ Π²Ρ€Π΅ΠΌΠ΅Π½Π΅ΠΌ сигнализации ΠΎΠΆΠΈΠ΄Π°ΡŽΡ‰Π΅Π³ΠΎ ΠΏΠΎΡ‚ΠΎΠΊΠ° ΠΈ Π²Ρ€Π΅ΠΌΠ΅Π½Π΅ΠΌ Π½Π°Ρ‡Π°Π»Π° Π΅Π³ΠΎ выполнСния
	    [W1].[WaitCount] AS [WaitCount],--Число ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΠΉ Π΄Π°Π½Π½ΠΎΠ³ΠΎ Ρ‚ΠΈΠΏΠ°. Π­Ρ‚ΠΎΡ‚ счСтчик наращиваСтся ΠΊΠ°ΠΆΠ΄Ρ‹ΠΉ Ρ€Π°Π· ΠΏΡ€ΠΈ Π½Π°Ρ‡Π°Π»Π΅ оТидания
	    CAST ([W1].[Percentage] AS DECIMAL (5, 2)) AS [Percentage],
	    CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (16, 4)) AS [AvgWait_S],
	    CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (16, 4)) AS [AvgRes_S],
	    CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (16, 4)) AS [AvgSig_S]
	FROM [Waits] AS [W1]
	INNER JOIN [Waits] AS [W2]
	    ON [W2].[RowNum] <= [W1].[RowNum]
	GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS],
	    [W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage]
	HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95 -- percentage threshold
)
SELECT [WaitType]
      ,MAX([Wait_S]) as [Wait_S]
      ,MAX([Resource_S]) as [Resource_S]
      ,MAX([Signal_S]) as [Signal_S]
      ,MAX([WaitCount]) as [WaitCount]
      ,MAX([Percentage]) as [Percentage]
      ,MAX([AvgWait_S]) as [AvgWait_S]
      ,MAX([AvgRes_S]) as [AvgRes_S]
      ,MAX([AvgSig_S]) as [AvgSig_S]
  FROM ress
  group by [WaitType];

In this case, you can determine the lack of RAM with the following query:

SELECT [Percentage]
      ,[AvgWait_S]
  FROM [inf].[vWaits]
  where [WaitType] in (
    'PAGEIOLATCH_XX',
    'RESOURCE_SEMAPHORE',
    'RESOURCE_SEMAPHORE_QUERY_COMPILE'
  );

Here you need to pay attention to the indicators Percentage and AvgWait_S. If they are significant in their totality, then there is a very high probability that there is not enough RAM for the MS SQL Server instance. Significant values ​​are determined individually for each system. However, you can start with the following: Percentage>=1 and AvgWait_S>=0.005.
To output indicators to a monitoring system (for example, Zabbix), you can create the following two queries:

  1. how many types of waits are occupied by RAM in percentage (the sum of all such types of waits):
    select coalesce(sum([Percentage]), 0.00) as [Percentage]
    from [inf].[vWaits]
           where [WaitType] in (
               'PAGEIOLATCH_XX',
               'RESOURCE_SEMAPHORE',
                'RESOURCE_SEMAPHORE_QUERY_COMPILE'
      );
    
  2. how many RAM wait types take in milliseconds (the maximum value of all average delays for all such wait types):
    select coalesce(max([AvgWait_S])*1000, 0.00) as [AvgWait_MS]
    from [inf].[vWaits]
           where [WaitType] in (
               'PAGEIOLATCH_XX',
               'RESOURCE_SEMAPHORE',
                'RESOURCE_SEMAPHORE_QUERY_COMPILE'
      );
    

Based on the dynamics of the obtained values ​​for these two indicators, we can conclude whether there is enough RAM for an instance of MS SQL Server.

CPU Overload Detection Method

To identify the lack of processor time, it is enough to use the sys.dm_os_schedulers system view. Here, if the runnable_tasks_count is constantly greater than 1, then there is a high probability that the number of cores is not enough for the MS SQL Server instance.
To output an indicator to a monitoring system (for example, Zabbix), you can create the following query:

select max([runnable_tasks_count]) as [runnable_tasks_count]
from sys.dm_os_schedulers
where scheduler_id<255;

Based on the dynamics of the obtained values ​​for this indicator, we can conclude whether there is enough processor time (the number of CPU cores) for an instance of MS SQL Server.
However, it is important to keep in mind the fact that requests themselves can request multiple threads at once. And sometimes the optimizer cannot correctly estimate the complexity of the query itself. Then the request may be allocated too many threads that cannot be processed at the same time at the given time. And this also causes a type of wait associated with a lack of processor time, and growth of the queue for schedulers that use specific CPU cores, i.e. the runnable_tasks_count indicator will grow in such conditions.
In this case, before increasing the number of CPU cores, it is necessary to correctly configure the parallelism properties of the MS SQL Server instance itself, and from the 2016 version, correctly configure the parallelism properties of the required databases:
Some aspects of MS SQL Server monitoring. Guidelines for Setting Trace Flags

Some aspects of MS SQL Server monitoring. Guidelines for Setting Trace Flags
Here you should pay attention to the following parameters:

  1. Max Degree of Parallelism - sets the maximum number of threads that can be allocated to each request (the default is 0 - limited only by the operating system itself and the edition of MS SQL Server)
  2. Cost Threshold for Parallelism - estimated cost of parallelism (default is 5)
  3. Max DOP - sets the maximum number of threads that can be allocated to each query at the database level (but not more than the value of the "Max Degree of Parallelism" property) (default is 0 - limited only by the operating system itself and the edition of MS SQL Server, as well as the restriction on the "Max Degree of Parallelism" property of the entire instance of MS SQL Server)

Here it is impossible to give an equally good recipe for all cases, i.e. you need to analyze heavy queries.
From my own experience, I recommend the following algorithm of actions for OLTP systems for setting up parallelism properties:

  1. first disable parallelism by setting the instance-wide Max Degree of Parallelism to 1
  2. analyze the heaviest requests and select the optimal number of threads for them
  3. set the Max Degree of Parallelism to the selected optimal number of threads obtained from step 2, and for specific databases set the Max DOP value obtained from step 2 for each database
  4. analyze the heaviest requests and identify the negative effect of multithreading. If it is, then increase the Cost Threshold for Parallelism.
    For systems such as 1C, Microsoft CRM and Microsoft NAV, in most cases, prohibiting multithreading is suitable

Also, if there is a Standard edition, then in most cases the prohibition of multithreading is suitable due to the fact that this edition is limited in the number of CPU cores.
For OLAP systems, the algorithm described above is not suitable.
From my own experience, I recommend the following algorithm of actions for OLAP systems for setting up parallelism properties:

  1. analyze the heaviest requests and select the optimal number of threads for them
  2. set the Max Degree of Parallelism to the selected optimal number of threads obtained from step 1, and for specific databases set the Max DOP value obtained from step 1 for each database
  3. analyze the heaviest queries and identify the negative effect of limiting concurrency. If it is, then either lower the Cost Threshold for Parallelism value, or repeat steps 1-2 of this algorithm

That is, for OLTP systems we go from single-threading to multi-threading, and for OLAP-systems, on the contrary, we go from multi-threading to single-threading. Thus, you can choose the optimal parallelism settings for both a specific database and the entire instance of MS SQL Server.
It is also important to understand that the settings of the parallelism properties need to be changed over time, based on the results of monitoring the performance of MS SQL Server.

Guidelines for Setting Trace Flags

From my own experience and the experience of my colleagues, for optimal performance, I recommend setting the following trace flags at the run level of the MS SQL Server service for versions 2008-2016:

  1. 610 - Reduced logging of inserts into indexed tables. Can help with inserts into tables with many records and many transactions, with frequent long WRITELOG waits for changes in indexes
  2. 1117 - If a file in a filegroup meets the autogrowth threshold requirements, all files in the filegroup grow
  3. 1118 - Forces all objects to be located in different extents (prohibition of mixed extents), which minimizes the need to scan the SGAM page, which is used to track mixed extents
  4. 1224 - Disables lock escalation based on the number of locks. However, excessive memory usage can trigger lock escalation
  5. 2371 - Changes the fixed automatic statistics update threshold to the dynamic automatic statistics update threshold. Important for updating query plans for large tables, where an incorrect count of records results in erroneous execution plans
  6. 3226 - Suppresses backup success messages in the error log
  7. 4199 - Includes changes to the query optimizer released in CUs and SQL Server Service Packs
  8. 6532-6534 - Includes performance improvements for query operations on spatial data types
  9. 8048 - Converts NUMA partitioned memory objects to CPU partitioned ones
  10. 8780 - Enables additional time allocation for query planning. Some requests without this flag may be rejected because they don't have a query plan (very rare bug)
  11. 8780 - 9389 - Enables additional dynamic grant memory buffer for batch mode statements, which allows the batch mode operator to request additional memory and avoid moving data to tempdb if additional memory is available

Also prior to 2016, it is useful to enable trace flag 2301, which enables enhanced decision support optimizations and thus helps in choosing more correct query plans. However, as of version 2016, it often has a negative effect on quite long overall query execution times.
Also, for systems with a lot of indexes (for example, for 1C databases), I recommend enabling trace flag 2330, which disables collection of index usage, which generally has a positive effect on the system.
For more information about trace flags, see here
From the link above, it's also important to consider versions and builds of MS SQL Server, as for newer versions, some trace flags are enabled by default or have no effect.
You can turn the trace flag on and off with the DBCC TRACEON and DBCC TRACEOFF commands, respectively. For more details see here
You can get the status of the trace flags using the DBCC TRACESTATUS command: more
In order for trace flags to be included in the autostart of the MS SQL Server service, you must go to SQL Server Configuration Manager and add these trace flags via -T in the service properties:
Some aspects of MS SQL Server monitoring. Guidelines for Setting Trace Flags

Results

In this article, some aspects of monitoring MS SQL Server were analyzed, with the help of which you can quickly identify the lack of RAM and free CPU time, as well as a number of other less obvious problems. The most commonly used trace flags have been reviewed.

Sources:

Β» SQL Server wait statistics
Β» SQL Server wait statistics or please tell me where it hurts
Β» System view sys.dm_os_schedulers
Β» Using Zabbix to Monitor MS SQL Server Database
Β» SQL Lifestyle
Β» Trace Flags
Β» sql.ru

Source: habr.com

Add a comment