Yes, my old laptop is several times more powerful than your production server.

That is the claim I heard from our developers. The most interesting part is that it turned out to be true, which led to a lengthy investigation. We will be talking about SQL Servers running on VMware.

Yes, my old laptop is several times more powerful than your production server.

Actually, it is easy to leave the production server hopelessly behind the laptop. Run the following code (not in tempdb and not in a database with Delayed Durability enabled):

set nocount on
create table _t (v varchar(100))
declare @n int=300000
-- each INSERT and DELETE below is its own autocommit transaction,
-- so the loop produces roughly 600,000 commits, each waiting for a log flush
while @n>0 begin 
  insert into _t select 'What a slowpoke!'
  delete from _t
  set @n=@n-1
  end
GO
drop table _t

It takes 5 seconds on my desktop and 28 seconds on the production server. That is because SQL Server has to wait for each write to the transaction log to physically complete, and we are running very short transactions here. Roughly speaking, we drove a big powerful truck into city traffic and are now watching pizza couriers on scooters nimbly overtake it: throughput does not matter here, only latency does. And no network storage, no matter how many zeros are in its price tag, will beat a local SSD on latency.

(In the comments it turned out that I had lied: Delayed Durability was enabled in both places. Without Delayed Durability the results are:
Desktop: 39 seconds, 15K tr/sec, 0.065 ms per I/O roundtrip
PROD: 360 seconds, 1,600 tr/sec, 0.6 ms
I should have noticed that 5 seconds was suspiciously fast.)
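
For reference, Delayed Durability is a per-database setting (SQL Server 2014 and later), and the log-write roundtrip can be estimated from a DMV. This is a quick sketch of mine, not part of the original test; the transaction log is typically file_id = 2:

alter database current set delayed_durability = forced     -- FORCED: commits no longer wait for the log flush
-- alter database current set delayed_durability = disabled -- DISABLED is the default

-- average stall per write against the transaction log file of the current database
select num_of_writes,
       io_stall_write_ms * 1.0 / nullif(num_of_writes, 0) as avg_ms_per_log_write
from sys.dm_io_virtual_file_stats(db_id(), 2)
GO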

However, that is a trivial example (the trivial zeros of the Riemann zeta function, so to speak). The example the developers brought me was different. I became convinced that they were right and started stripping all their business-logic specifics out of it. At some point I realized I could throw their code away entirely and write my own that demonstrates the same problem: in production it runs 3-4 times slower:

create function dbo.isPrime (@n bigint)
returns int
as
  begin
  if @n = 1 return 0
  if @n = 2 return 1
  if @n = 3 return 1
  if @n % 2 = 0 return 0
  declare @sq int
  set @sq = sqrt(@n)+1 -- check odds up to sqrt
  declare @dv int = 1
  while @dv < @sq
    begin
    set @dv=@dv+2
    if @n % @dv = 0 return 0
    end
  return 1
  end
GO
declare @dt datetime set @dt=getdate()
select dbo.isPrime(1000000000000037)
select datediff(ms,@dt,getdate()) as ms
GO

If everything is fine on your server, the primality check takes 6-8 seconds. That was the case on a number of servers. But on some, the check took 25-40 seconds. Interestingly, there were no servers where it took, say, 14 seconds: the code ran either very fast or very slow, that is, the problem was, let's say, black and white.

What did I do? I dug into the VMware metrics. Everything was fine there: plenty of resources, Ready time = 0, enough of everything; during the test, both fast and slow servers showed CPU = 100% on one vCPU. I ran a test that computes digits of Pi: it showed the same results on all servers. The smell of black magic grew stronger and stronger.

Having moved over to the DEV farm, I started playing with the servers. It turned out that a vMotion from host to host can "cure" a server, but it can also turn a "fast" server into a "slow" one. So that's it, some hosts have a problem... but... no. One virtual machine would slow down on host A, say, but run fast on host B. And another virtual machine, on the contrary, ran fast on A and slowed down on B! "Fast" and "slow" machines often ran on the same host at the same time!

From that moment on, there was a distinct smell of sulfur in the air. The problem could not be blamed on a particular virtual machine (Windows patches, for example), because a vMotion could turn it into a "fast" one. But it could not be blamed on the host either, because one host could carry both "fast" and "slow" machines. Nor was it related to load: I managed to get a "slow" machine on a host that had nothing else running on it.

Out of desperation, I fired up Sysinternals' Process Explorer and looked at the SQL Server thread stacks. On slow machines, one line immediately caught my eye:

ntoskrnl.exe!KeSynchronizeExecution+0x5bf6
ntoskrnl.exe!KeWaitForMultipleObjects+0x109d
ntoskrnl.exe!KeWaitForMultipleObjects+0xb3f
ntoskrnl.exe!KeWaitForSingleObject+0x377
ntoskrnl.exe!KeQuerySystemTimePrecise+0x881 <- !!!
ntoskrnl.exe!ObDereferenceObjectDeferDelete+0x28a
ntoskrnl.exe!KeSynchronizeExecution+0x2de2
sqllang.dll!CDiagThreadSafe::PxlvlReplace+0x1a20
... skipped
sqldk.dll!SystemThread::MakeMiniSOSThread+0xa54
KERNEL32.DLL!BaseThreadInitThunk+0x14
ntdll.dll!RtlUserThreadStart+0x21

That was already something. A test program was written:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class Program
{
    // P/Invoke for the precise system time Win32 API
    // (the user-mode counterpart of the KeQuerySystemTimePrecise seen in the stack)
    [DllImport("kernel32.dll")]
    static extern void GetSystemTimePreciseAsFileTime(out FILE_TIME lpSystemTimeAsFileTime);

    [StructLayout(LayoutKind.Sequential)]
    struct FILE_TIME
    {
        public int ftTimeLow;
        public int ftTimeHigh;
    }

    static void Main(string[] args)
    {
        for (int i = 0; i < 16; i++)
        {
            int counter = 0;

            var stopwatch = Stopwatch.StartNew();

            // count how many precise-time calls fit into one second
            while (stopwatch.ElapsedMilliseconds < 1000)
            {
                GetSystemTimePreciseAsFileTime(out var fileTime);
                counter++;
            }

            // skip the first (warm-up) second
            if (i > 0)
            {
                Console.WriteLine("{0}", counter);
            }
        }
    }
}

This program showed an even more pronounced slowdown: on "fast" machines it managed 16-18 million calls per second, while on slow ones it managed a million and a half, or even just 700 thousand. That is a 10-20x difference (!!!). This was already a small victory: at least there was no longer a risk of getting stuck between Microsoft and VMware support while they pointed fingers at each other.

Then progress stalled: vacation, important things, viral hysteria and a sharp increase in workload. I kept mentioning the magical problem to colleagues, but at times it seemed they did not quite believe me: the claim that VMware slowed the code down by a factor of 10-20 was too monstrous.

I tried to dig up the cause myself. At times it seemed I had found a fix: toggling hot plug on and off, or changing the amount of memory or the number of processors, often turned the machine into a "fast" one. But not forever. What did turn out to be true is that it was enough to get out and kick the tires, that is, to change almost any parameter of the virtual machine, and the machine would become "fast" again for a while.

Finally, my American colleagues suddenly found the root cause.

Yes, my old laptop is several times more powerful than your production server.

The hosts differed in CPU frequency!

  • As a rule, this is not scary. But when a VM moves from its "native" host to a host with a "different" frequency, VMware has to adjust the result of GetTimePrecise (that is, virtualize the TSC).
  • As a rule, this is not a problem either, unless some application requests the precise time millions of times per second, the way SQL Server does.
  • But even that is not scary, since SQL Server does not always do this (see the Conclusion).

But there are cases when this rake does hit you. And yes, by kicking the tires (changing something in the VM settings) I was forcing VMware to "recalculate" the configuration, so that the frequency of the current host became the machine's "native" frequency.

Solution

www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf

When you disable virtualization of the TSC, reading the TSC from within the virtual machine returns the physical machine's TSC value, and writing the TSC from within the virtual machine has no effect. Migrating the virtual machine to another host, resuming it from suspended state, or reverting to a snapshot causes the TSC to jump discontinuously. Some guest operating systems fail to boot, or exhibit other timekeeping problems, when TSC virtualization is disabled. In the past, this feature has sometimes been recommended to improve performance of applications that read the TSC frequently, but performance of the virtual TSC has been improved substantially in current products. The feature has also been recommended for use when performing measurements that require a precise source of real time in the virtual machine.

In short, you need to add the parameter below to the VM's advanced configuration (the .vmx file, or Edit Settings -> VM Options -> Advanced -> Configuration Parameters; as far as I know, the VM has to be powered off for the change to take effect):

monitor_control.virtual_rdtsc = FALSE

Conclusion

You probably have a question: why would SQL Server call GetTimePrecise so often?

I don't have the SQL Server sources, but the logic seems to be this. SQL Server is almost an operating system in its own right, with cooperative multitasking, where each thread has to "give way" from time to time. Where is the best place to do that? Wherever there is a natural wait: a lock or an I/O. Fine, but what if we are spinning pure computational loops? Then the obvious and almost the only place is in the interpreter (it is not quite an interpreter), after each statement is executed.

As a rule, SQL Server is not used for pure computation, so this is not a problem. But loops that work with all sorts of temporary tables (which are cached right away) turn the code into a sequence of very short, very quickly executed statements.
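
To put a number on "very quickly executed statements", here is a throwaway sketch of mine (not from the original investigation) that simply counts how many iterations an empty T-SQL loop manages per second; every WHILE check and every SET is a separate statement, and, if the reasoning above is right, each statement boundary is another point where the scheduler asks for the time:

set nocount on
declare @start datetime = getdate()
declare @i bigint = 0
while datediff(ms, @start, getdate()) < 5000   -- spin for ~5 seconds
  begin
  set @i = @i + 1
  end
select @i / 5 as loop_iterations_per_second
GO

On an affected VM one would expect this number to drop by roughly the same factor as the isPrime timing above.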

By the way, if the function is made NATIVELY COMPILED, it stops requesting the time, and its speed increases by a factor of 10. But what about cooperative multitasking? Well, for natively compiled code, preemptive multitasking had to be implemented in SQL Server.
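
For illustration, here is a sketch of what that could look like for the test function above. This is my own example, not from the original article; it assumes SQL Server 2016 or later and a database that already has a memory-optimized filegroup (a prerequisite for native compilation), and it avoids SQRT by comparing @dv * @dv with @n:

create function dbo.isPrimeNative (@n bigint)
returns int
with native_compilation, schemabinding
as
begin atomic with (transaction isolation level = snapshot, language = N'us_english')
  if @n = 1 return 0;
  if @n = 2 return 1;
  if @n = 3 return 1;
  if @n % 2 = 0 return 0;
  declare @dv bigint = 3;
  -- check odd divisors up to sqrt(@n)
  while @dv * @dv <= @n
    begin
    if @n % @dv = 0 return 0;
    set @dv = @dv + 2;
    end
  return 1;
end
GO
select dbo.isPrimeNative(1000000000000037)

The same timing harness as before (declare @dt ... datediff) can be used to compare it with the interpreted version.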

Source: habr.com
