foreword
Quite often, users, developers and administrators of the MS SQL Server DBMS encounter performance problems of the database or the DBMS as a whole, so MS SQL Server monitoring is very relevant.
This article is an addition to the article
For the following scripts to work, you need to create an inf schema in the desired database as follows:
Creating an inf schema
use <ΠΈΠΌΡ_ΠΠ>;
go
create schema inf;
Method for detecting lack of RAM
The first indicator of the lack of RAM is the case when an instance of MS SQL Server eats up all the RAM allocated to it.
To do this, we will create the following representation of inf.vRAM:
Creating the inf.vRAM view
CREATE view [inf].[vRAM] as
select a.[TotalAvailOSRam_Mb] --ΡΠΊΠΎΠ»ΡΠΊΠΎ ΡΠ²ΠΎΠ±ΠΎΠ΄Π½ΠΎ ΠΠΠ£ Π½Π° ΡΠ΅ΡΠ²Π΅ΡΠ΅ Π² ΠΠ
, a.[RAM_Avail_Percent] --ΠΏΡΠΎΡΠ΅Π½Ρ ΡΠ²ΠΎΠ±ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΠΠΠ£ Π½Π° ΡΠ΅ΡΠ²Π΅ΡΠ΅
, a.[Server_physical_memory_Mb] --ΡΠΊΠΎΠ»ΡΠΊΠΎ Π²ΡΠ΅Π³ΠΎ ΠΠΠ£ Π½Π° ΡΠ΅ΡΠ²Π΅ΡΠ΅ Π² ΠΠ
, a.[SQL_server_committed_target_Mb] --ΡΠΊΠΎΠ»ΡΠΊΠΎ Π²ΡΠ΅Π³ΠΎ ΠΠΠ£ Π²ΡΠ΄Π΅Π»Π΅Π½ΠΎ ΠΏΠΎΠ΄ MS SQL Server Π² ΠΠ
, a.[SQL_server_physical_memory_in_use_Mb] --ΡΠΊΠΎΠ»ΡΠΊΠΎ Π²ΡΠ΅Π³ΠΎ ΠΠΠ£ ΠΏΠΎΡΡΠ΅Π±Π»ΡΠ΅Ρ MS SQL Server Π² Π΄Π°Π½Π½ΡΠΉ ΠΌΠΎΠΌΠ΅Π½Ρ Π²ΡΠ΅ΠΌΠ΅Π½ΠΈ Π² ΠΠ
, a.[SQL_RAM_Avail_Percent] --ΠΏΠΎΡΠ΅Π½Ρ ΡΠ²ΠΎΠ±ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΠΠΠ£ Π΄Π»Ρ MS SQL Server ΠΎΡΠ½ΠΎΡΠΈΡΠ΅Π»ΡΠ½ΠΎ Π²ΡΠ΅Π³ΠΎ Π²ΡΠ΄Π΅Π»Π΅Π½Π½ΠΎΠ³ΠΎ ΠΠΠ£ Π΄Π»Ρ MS SQL Server
, a.[StateMemorySQL] --Π΄ΠΎΡΡΠ°ΡΠΎΡΠ½ΠΎ Π»ΠΈ ΠΠΠ£ Π΄Π»Ρ MS SQL Server
, a.[SQL_RAM_Reserve_Percent] --ΠΏΡΠΎΡΠ΅Π½Ρ Π²ΡΠ΄Π΅Π»Π΅Π½Π½ΠΎΠΉ ΠΠΠ£ Π΄Π»Ρ MS SQL Server ΠΎΡΠ½ΠΎΡΠΈΡΠ΅Π»ΡΠ½ΠΎ Π²ΡΠ΅Π³ΠΎ ΠΠΠ£ ΡΠ΅ΡΠ²Π΅ΡΠ°
--Π΄ΠΎΡΡΠ°ΡΠΎΡΠ½ΠΎ Π»ΠΈ ΠΠΠ£ Π΄Π»Ρ ΡΠ΅ΡΠ²Π΅ΡΠ°
, (case when a.[RAM_Avail_Percent]<10 and a.[RAM_Avail_Percent]>5 and a.[TotalAvailOSRam_Mb]<8192 then 'Warning' when a.[RAM_Avail_Percent]<=5 and a.[TotalAvailOSRam_Mb]<2048 then 'Danger' else 'Normal' end) as [StateMemoryServer]
from
(
select cast(a0.available_physical_memory_kb/1024.0 as int) as TotalAvailOSRam_Mb
, cast((a0.available_physical_memory_kb/casT(a0.total_physical_memory_kb as float))*100 as numeric(5,2)) as [RAM_Avail_Percent]
, a0.system_low_memory_signal_state
, ceiling(b.physical_memory_kb/1024.0) as [Server_physical_memory_Mb]
, ceiling(b.committed_target_kb/1024.0) as [SQL_server_committed_target_Mb]
, ceiling(a.physical_memory_in_use_kb/1024.0) as [SQL_server_physical_memory_in_use_Mb]
, cast(((b.committed_target_kb-a.physical_memory_in_use_kb)/casT(b.committed_target_kb as float))*100 as numeric(5,2)) as [SQL_RAM_Avail_Percent]
, cast((b.committed_target_kb/casT(a0.total_physical_memory_kb as float))*100 as numeric(5,2)) as [SQL_RAM_Reserve_Percent]
, (case when (ceiling(b.committed_target_kb/1024.0)-1024)<ceiling(a.physical_memory_in_use_kb/1024.0) then 'Warning' else 'Normal' end) as [StateMemorySQL]
from sys.dm_os_sys_memory as a0
cross join sys.dm_os_process_memory as a
cross join sys.dm_os_sys_info as b
cross join sys.dm_os_sys_memory as v
) as a;
Then you can determine that an instance of MS SQL Server consumes all the memory allocated to it by the following query:
select SQL_server_physical_memory_in_use_Mb, SQL_server_committed_target_Mb
from [inf].[vRAM];
If SQL_server_physical_memory_in_use_Mb is consistently greater than or equal to SQL_server_committed_target_Mb, then the wait statistics should be checked.
To determine the lack of RAM through wait statistics, let's create the inf.vWaits view:
Creating the inf.vWaits View
CREATE view [inf].[vWaits] as
WITH [Waits] AS
(SELECT
[wait_type], --ΠΈΠΌΡ ΡΠΈΠΏΠ° ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΡ
[wait_time_ms] / 1000.0 AS [WaitS],--ΠΠ±ΡΠ΅Π΅ Π²ΡΠ΅ΠΌΡ ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΡ Π΄Π°Π½Π½ΠΎΠ³ΠΎ ΡΠΈΠΏΠ° Π² ΠΌΠΈΠ»Π»ΠΈΡΠ΅ΠΊΡΠ½Π΄Π°Ρ
. ΠΡΠΎ Π²ΡΠ΅ΠΌΡ Π²ΠΊΠ»ΡΡΠ°Π΅Ρ signal_wait_time_ms
([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],--ΠΠ±ΡΠ΅Π΅ Π²ΡΠ΅ΠΌΡ ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΡ Π΄Π°Π½Π½ΠΎΠ³ΠΎ ΡΠΈΠΏΠ° Π² ΠΌΠΈΠ»Π»ΠΈΡΠ΅ΠΊΡΠ½Π΄Π°Ρ
Π±Π΅Π· signal_wait_time_ms
[signal_wait_time_ms] / 1000.0 AS [SignalS],--Π Π°Π·Π½ΠΈΡΠ° ΠΌΠ΅ΠΆΠ΄Ρ Π²ΡΠ΅ΠΌΠ΅Π½Π΅ΠΌ ΡΠΈΠ³Π½Π°Π»ΠΈΠ·Π°ΡΠΈΠΈ ΠΎΠΆΠΈΠ΄Π°ΡΡΠ΅Π³ΠΎ ΠΏΠΎΡΠΎΠΊΠ° ΠΈ Π²ΡΠ΅ΠΌΠ΅Π½Π΅ΠΌ Π½Π°ΡΠ°Π»Π° Π΅Π³ΠΎ Π²ΡΠΏΠΎΠ»Π½Π΅Π½ΠΈΡ
[waiting_tasks_count] AS [WaitCount],--Π§ΠΈΡΠ»ΠΎ ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΠΉ Π΄Π°Π½Π½ΠΎΠ³ΠΎ ΡΠΈΠΏΠ°. ΠΡΠΎΡ ΡΡΠ΅ΡΡΠΈΠΊ Π½Π°ΡΠ°ΡΠΈΠ²Π°Π΅ΡΡΡ ΠΊΠ°ΠΆΠ΄ΡΠΉ ΡΠ°Π· ΠΏΡΠΈ Π½Π°ΡΠ°Π»Π΅ ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΡ
100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
FROM sys.dm_os_wait_stats
WHERE [waiting_tasks_count]>0
and [wait_type] NOT IN (
N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR',
N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH',
N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE',
N'CHKPT', N'CLR_AUTO_EVENT',
N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE',
N'DBMIRROR_DBM_EVENT', N'DBMIRROR_EVENTS_QUEUE',
N'DBMIRROR_WORKER_QUEUE', N'DBMIRRORING_CMD',
N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE',
N'EXECSYNC', N'FSAGENT',
N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX',
N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE',
N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE',
N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP',
N'LOGMGR_QUEUE', N'ONDEMAND_TASK_QUEUE',
N'PWAIT_ALL_COMPONENTS_INITIALIZED',
N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP',
N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP',
N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE',
N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH',
N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP',
N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY',
N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP',
N'SLEEP_SYSTEMTASK', N'SLEEP_TASK',
N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT',
N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH',
N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS',
N'WAITFOR', N'WAITFOR_TASKSHUTDOWN',
N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG',
N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN',
N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT')
)
, ress as (
SELECT
[W1].[wait_type] AS [WaitType],
CAST ([W1].[WaitS] AS DECIMAL (16, 2)) AS [Wait_S],--ΠΠ±ΡΠ΅Π΅ Π²ΡΠ΅ΠΌΡ ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΡ Π΄Π°Π½Π½ΠΎΠ³ΠΎ ΡΠΈΠΏΠ° Π² ΠΌΠΈΠ»Π»ΠΈΡΠ΅ΠΊΡΠ½Π΄Π°Ρ
. ΠΡΠΎ Π²ΡΠ΅ΠΌΡ Π²ΠΊΠ»ΡΡΠ°Π΅Ρ signal_wait_time_ms
CAST ([W1].[ResourceS] AS DECIMAL (16, 2)) AS [Resource_S],--ΠΠ±ΡΠ΅Π΅ Π²ΡΠ΅ΠΌΡ ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΡ Π΄Π°Π½Π½ΠΎΠ³ΠΎ ΡΠΈΠΏΠ° Π² ΠΌΠΈΠ»Π»ΠΈΡΠ΅ΠΊΡΠ½Π΄Π°Ρ
Π±Π΅Π· signal_wait_time_ms
CAST ([W1].[SignalS] AS DECIMAL (16, 2)) AS [Signal_S],--Π Π°Π·Π½ΠΈΡΠ° ΠΌΠ΅ΠΆΠ΄Ρ Π²ΡΠ΅ΠΌΠ΅Π½Π΅ΠΌ ΡΠΈΠ³Π½Π°Π»ΠΈΠ·Π°ΡΠΈΠΈ ΠΎΠΆΠΈΠ΄Π°ΡΡΠ΅Π³ΠΎ ΠΏΠΎΡΠΎΠΊΠ° ΠΈ Π²ΡΠ΅ΠΌΠ΅Π½Π΅ΠΌ Π½Π°ΡΠ°Π»Π° Π΅Π³ΠΎ Π²ΡΠΏΠΎΠ»Π½Π΅Π½ΠΈΡ
[W1].[WaitCount] AS [WaitCount],--Π§ΠΈΡΠ»ΠΎ ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΠΉ Π΄Π°Π½Π½ΠΎΠ³ΠΎ ΡΠΈΠΏΠ°. ΠΡΠΎΡ ΡΡΠ΅ΡΡΠΈΠΊ Π½Π°ΡΠ°ΡΠΈΠ²Π°Π΅ΡΡΡ ΠΊΠ°ΠΆΠ΄ΡΠΉ ΡΠ°Π· ΠΏΡΠΈ Π½Π°ΡΠ°Π»Π΅ ΠΎΠΆΠΈΠ΄Π°Π½ΠΈΡ
CAST ([W1].[Percentage] AS DECIMAL (5, 2)) AS [Percentage],
CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (16, 4)) AS [AvgWait_S],
CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (16, 4)) AS [AvgRes_S],
CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (16, 4)) AS [AvgSig_S]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2]
ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS],
[W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage]
HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95 -- percentage threshold
)
SELECT [WaitType]
,MAX([Wait_S]) as [Wait_S]
,MAX([Resource_S]) as [Resource_S]
,MAX([Signal_S]) as [Signal_S]
,MAX([WaitCount]) as [WaitCount]
,MAX([Percentage]) as [Percentage]
,MAX([AvgWait_S]) as [AvgWait_S]
,MAX([AvgRes_S]) as [AvgRes_S]
,MAX([AvgSig_S]) as [AvgSig_S]
FROM ress
group by [WaitType];
In this case, you can determine the lack of RAM with the following query:
SELECT [Percentage]
,[AvgWait_S]
FROM [inf].[vWaits]
where [WaitType] in (
'PAGEIOLATCH_XX',
'RESOURCE_SEMAPHORE',
'RESOURCE_SEMAPHORE_QUERY_COMPILE'
);
Here you need to pay attention to the indicators Percentage and AvgWait_S. If they are significant in their totality, then there is a very high probability that there is not enough RAM for the MS SQL Server instance. Significant values ββare determined individually for each system. However, you can start with the following: Percentage>=1 and AvgWait_S>=0.005.
To output indicators to a monitoring system (for example, Zabbix), you can create the following two queries:
- how many types of waits are occupied by RAM in percentage (the sum of all such types of waits):
select coalesce(sum([Percentage]), 0.00) as [Percentage] from [inf].[vWaits] where [WaitType] in ( 'PAGEIOLATCH_XX', 'RESOURCE_SEMAPHORE', 'RESOURCE_SEMAPHORE_QUERY_COMPILE' );
- how many RAM wait types take in milliseconds (the maximum value of all average delays for all such wait types):
select coalesce(max([AvgWait_S])*1000, 0.00) as [AvgWait_MS] from [inf].[vWaits] where [WaitType] in ( 'PAGEIOLATCH_XX', 'RESOURCE_SEMAPHORE', 'RESOURCE_SEMAPHORE_QUERY_COMPILE' );
Based on the dynamics of the obtained values ββfor these two indicators, we can conclude whether there is enough RAM for an instance of MS SQL Server.
CPU Overload Detection Method
To identify the lack of processor time, it is enough to use the sys.dm_os_schedulers system view. Here, if the runnable_tasks_count is constantly greater than 1, then there is a high probability that the number of cores is not enough for the MS SQL Server instance.
To output an indicator to a monitoring system (for example, Zabbix), you can create the following query:
select max([runnable_tasks_count]) as [runnable_tasks_count]
from sys.dm_os_schedulers
where scheduler_id<255;
Based on the dynamics of the obtained values ββfor this indicator, we can conclude whether there is enough processor time (the number of CPU cores) for an instance of MS SQL Server.
However, it is important to keep in mind the fact that requests themselves can request multiple threads at once. And sometimes the optimizer cannot correctly estimate the complexity of the query itself. Then the request may be allocated too many threads that cannot be processed at the same time at the given time. And this also causes a type of wait associated with a lack of processor time, and growth of the queue for schedulers that use specific CPU cores, i.e. the runnable_tasks_count indicator will grow in such conditions.
In this case, before increasing the number of CPU cores, it is necessary to correctly configure the parallelism properties of the MS SQL Server instance itself, and from the 2016 version, correctly configure the parallelism properties of the required databases:
Here you should pay attention to the following parameters:
- Max Degree of Parallelism - sets the maximum number of threads that can be allocated to each request (the default is 0 - limited only by the operating system itself and the edition of MS SQL Server)
- Cost Threshold for Parallelism - estimated cost of parallelism (default is 5)
- Max DOP - sets the maximum number of threads that can be allocated to each query at the database level (but not more than the value of the "Max Degree of Parallelism" property) (default is 0 - limited only by the operating system itself and the edition of MS SQL Server, as well as the restriction on the "Max Degree of Parallelism" property of the entire instance of MS SQL Server)
Here it is impossible to give an equally good recipe for all cases, i.e. you need to analyze heavy queries.
From my own experience, I recommend the following algorithm of actions for OLTP systems for setting up parallelism properties:
- first disable parallelism by setting the instance-wide Max Degree of Parallelism to 1
- analyze the heaviest requests and select the optimal number of threads for them
- set the Max Degree of Parallelism to the selected optimal number of threads obtained from step 2, and for specific databases set the Max DOP value obtained from step 2 for each database
- analyze the heaviest requests and identify the negative effect of multithreading. If it is, then increase the Cost Threshold for Parallelism.
For systems such as 1C, Microsoft CRM and Microsoft NAV, in most cases, prohibiting multithreading is suitable
Also, if there is a Standard edition, then in most cases the prohibition of multithreading is suitable due to the fact that this edition is limited in the number of CPU cores.
For OLAP systems, the algorithm described above is not suitable.
From my own experience, I recommend the following algorithm of actions for OLAP systems for setting up parallelism properties:
- analyze the heaviest requests and select the optimal number of threads for them
- set the Max Degree of Parallelism to the selected optimal number of threads obtained from step 1, and for specific databases set the Max DOP value obtained from step 1 for each database
- analyze the heaviest queries and identify the negative effect of limiting concurrency. If it is, then either lower the Cost Threshold for Parallelism value, or repeat steps 1-2 of this algorithm
That is, for OLTP systems we go from single-threading to multi-threading, and for OLAP-systems, on the contrary, we go from multi-threading to single-threading. Thus, you can choose the optimal parallelism settings for both a specific database and the entire instance of MS SQL Server.
It is also important to understand that the settings of the parallelism properties need to be changed over time, based on the results of monitoring the performance of MS SQL Server.
Guidelines for Setting Trace Flags
From my own experience and the experience of my colleagues, for optimal performance, I recommend setting the following trace flags at the run level of the MS SQL Server service for versions 2008-2016:
- 610 - Reduced logging of inserts into indexed tables. Can help with inserts into tables with many records and many transactions, with frequent long WRITELOG waits for changes in indexes
- 1117 - If a file in a filegroup meets the autogrowth threshold requirements, all files in the filegroup grow
- 1118 - Forces all objects to be located in different extents (prohibition of mixed extents), which minimizes the need to scan the SGAM page, which is used to track mixed extents
- 1224 - Disables lock escalation based on the number of locks. However, excessive memory usage can trigger lock escalation
- 2371 - Changes the fixed automatic statistics update threshold to the dynamic automatic statistics update threshold. Important for updating query plans for large tables, where an incorrect count of records results in erroneous execution plans
- 3226 - Suppresses backup success messages in the error log
- 4199 - Includes changes to the query optimizer released in CUs and SQL Server Service Packs
- 6532-6534 - Includes performance improvements for query operations on spatial data types
- 8048 - Converts NUMA partitioned memory objects to CPU partitioned ones
- 8780 - Enables additional time allocation for query planning. Some requests without this flag may be rejected because they don't have a query plan (very rare bug)
- 8780 - 9389 - Enables additional dynamic grant memory buffer for batch mode statements, which allows the batch mode operator to request additional memory and avoid moving data to tempdb if additional memory is available
Also prior to 2016, it is useful to enable trace flag 2301, which enables enhanced decision support optimizations and thus helps in choosing more correct query plans. However, as of version 2016, it often has a negative effect on quite long overall query execution times.
Also, for systems with a lot of indexes (for example, for 1C databases), I recommend enabling trace flag 2330, which disables collection of index usage, which generally has a positive effect on the system.
For more information about trace flags, see
From the link above, it's also important to consider versions and builds of MS SQL Server, as for newer versions, some trace flags are enabled by default or have no effect.
You can turn the trace flag on and off with the DBCC TRACEON and DBCC TRACEOFF commands, respectively. For more details see
You can get the status of the trace flags using the DBCC TRACESTATUS command:
In order for trace flags to be included in the autostart of the MS SQL Server service, you must go to SQL Server Configuration Manager and add these trace flags via -T in the service properties:
Results
In this article, some aspects of monitoring MS SQL Server were analyzed, with the help of which you can quickly identify the lack of RAM and free CPU time, as well as a number of other less obvious problems. The most commonly used trace flags have been reviewed.
Sources:
Β»
Β»
Β»
Β»
Β»
Β»
Β»
Source: habr.com