Some aspects of LINQ query optimization in C#.NET for MS SQL Server

LINQ entered .NET as a powerful new data manipulation language. LINQ to SQL as part of it allows you to communicate with the DBMS quite conveniently using, for example, the Entity Framework. However, using it quite often, developers forget to look at what kind of SQL query the queryable provider will generate, in your case, the Entity Framework.

Let's look at two main points with an example.
To do this, in SQL Server we will create a database Test, and in it we will create two tables using the following query:

Creating tables

USE [TEST]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[Ref](
	[ID] [int] NOT NULL,
	[ID2] [int] NOT NULL,
	[Name] [nvarchar](255) NOT NULL,
	[InsertUTCDate] [datetime] NOT NULL,
 CONSTRAINT [PK_Ref] PRIMARY KEY CLUSTERED 
(
	[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[Ref] ADD  CONSTRAINT [DF_Ref_InsertUTCDate]  DEFAULT (getutcdate()) FOR [InsertUTCDate]
GO

USE [TEST]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[Customer](
	[ID] [int] NOT NULL,
	[Name] [nvarchar](255) NOT NULL,
	[Ref_ID] [int] NOT NULL,
	[InsertUTCDate] [datetime] NOT NULL,
	[Ref_ID2] [int] NOT NULL,
 CONSTRAINT [PK_Customer] PRIMARY KEY CLUSTERED 
(
	[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[Customer] ADD  CONSTRAINT [DF_Customer_Ref_ID]  DEFAULT ((0)) FOR [Ref_ID]
GO

ALTER TABLE [dbo].[Customer] ADD  CONSTRAINT [DF_Customer_InsertUTCDate]  DEFAULT (getutcdate()) FOR [InsertUTCDate]
GO

Now let's populate the Ref table by running the following script:

Filling the table Ref

USE [TEST]
GO

DECLARE @ind INT=1;

WHILE(@ind<1200000)
BEGIN
	INSERT INTO [dbo].[Ref]
           ([ID]
           ,[ID2]
           ,[Name])
    SELECT
           @ind
           ,@ind
           ,CAST(@ind AS NVARCHAR(255));

	SET @ind=@ind+1;
END 
GO

Let's populate the Customer table in the same way using the following script:

Populating the Customer table

USE [TEST]
GO

DECLARE @ind INT=1;
DECLARE @ind_ref INT=1;

WHILE(@ind<=12000000)
BEGIN
	IF(@ind%3=0) SET @ind_ref=1;
	ELSE IF (@ind%5=0) SET @ind_ref=2;
	ELSE IF (@ind%7=0) SET @ind_ref=3;
	ELSE IF (@ind%11=0) SET @ind_ref=4;
	ELSE IF (@ind%13=0) SET @ind_ref=5;
	ELSE IF (@ind%17=0) SET @ind_ref=6;
	ELSE IF (@ind%19=0) SET @ind_ref=7;
	ELSE IF (@ind%23=0) SET @ind_ref=8;
	ELSE IF (@ind%29=0) SET @ind_ref=9;
	ELSE IF (@ind%31=0) SET @ind_ref=10;
	ELSE IF (@ind%37=0) SET @ind_ref=11;
	ELSE SET @ind_ref=@ind%1190000;
	
	INSERT INTO [dbo].[Customer]
	           ([ID]
	           ,[Name]
	           ,[Ref_ID]
	           ,[Ref_ID2])
	     SELECT
	           @ind,
	           CAST(@ind AS NVARCHAR(255)),
	           @ind_ref,
	           @ind_ref;


	SET @ind=@ind+1;
END
GO

Thus, we got two tables, one of which has more than 1 million rows of data, and the other has more than 10 million rows of data.

Now in Visual Studio, you need to create a test project Visual C# Console App (.NET Framework):

Some aspects of LINQ query optimization in C#.NET for MS SQL Server

Next, you need to add a library for the Entity Framework to interact with the database.
To add it, right-click on the project and select Manage NuGet Packages from the context menu:

Some aspects of LINQ query optimization in C#.NET for MS SQL Server

Then, in the NuGet package management window that appears, in the search box, enter the word "Entity Framework" and select the Entity Framework package and install it:

Some aspects of LINQ query optimization in C#.NET for MS SQL Server

Next, in the App.config file, after closing the configSections element, add the following block:

<connectionStrings>
    <add name="DBConnection" connectionString="data source=ИМЯ_Π­ΠšΠ—Π•ΠœΠŸΠ›Π―Π Π_MSSQL;Initial Catalog=TEST;Integrated Security=True;" providerName="System.Data.SqlClient" />
</connectionStrings>

In connectionString you need to enter the connection string.

Now let's create 3 interfaces in separate files:

  1. Implementation of the IBaseEntityID interface
    namespace TestLINQ
    {
        public interface IBaseEntityID
        {
            int ID { get; set; }
        }
    }
    

  2. Implementation of the IBaseEntityName interface
    namespace TestLINQ
    {
        public interface IBaseEntityName
        {
            string Name { get; set; }
        }
    }
    

  3. Implementation of the IBaseNameInsertUTCDate interface
    namespace TestLINQ
    {
        public interface IBaseNameInsertUTCDate
        {
            DateTime InsertUTCDate { get; set; }
        }
    }
    

And in a separate file, we will create a BaseEntity base class for our two entities, which will include common fields:

Implementation of the base class BaseEntity

namespace TestLINQ
{
    public class BaseEntity : IBaseEntityID, IBaseEntityName, IBaseNameInsertUTCDate
    {
        public int ID { get; set; }
        public string Name { get; set; }
        public DateTime InsertUTCDate { get; set; }
    }
}

Next, in separate files, we will create our two entities:

  1. Ref class implementation
    using System.ComponentModel.DataAnnotations.Schema;
    
    namespace TestLINQ
    {
        [Table("Ref")]
        public class Ref : BaseEntity
        {
            public int ID2 { get; set; }
        }
    }
    

  2. Implementation of the Customer class
    using System.ComponentModel.DataAnnotations.Schema;
    
    namespace TestLINQ
    {
        [Table("Customer")]
        public class Customer: BaseEntity
        {
            public int Ref_ID { get; set; }
            public int Ref_ID2 { get; set; }
        }
    }
    

Now let's create a UserContext context in a separate file:

Implementation of the UserContex class

using System.Data.Entity;

namespace TestLINQ
{
    public class UserContext : DbContext
    {
        public UserContext()
            : base("DbConnection")
        {
            Database.SetInitializer<UserContext>(null);
        }

        public DbSet<Customer> Customer { get; set; }
        public DbSet<Ref> Ref { get; set; }
    }
}

We got a ready-made solution for conducting optimization tests with LINQ to SQL via EF for MS SQL Server:

Some aspects of LINQ query optimization in C#.NET for MS SQL Server

Now in the Program.cs file, enter the following code:

Program.cs file

using System;
using System.Collections.Generic;
using System.Linq;

namespace TestLINQ
{
    class Program
    {
        static void Main(string[] args)
        {
            using (UserContext db = new UserContext())
            {
                var dblog = new List<string>();
                db.Database.Log = dblog.Add;

                var query = from e1 in db.Customer
                            from e2 in db.Ref
                            where (e1.Ref_ID == e2.ID)
                                 && (e1.Ref_ID2 == e2.ID2)
                            select new { Data1 = e1.Name, Data2 = e2.Name };

                var result = query.Take(1000).ToList();

                Console.WriteLine(dblog[1]);

                Console.ReadKey();
            }
        }
    }
}

Next, let's run our project.

At the end of the work, the following will be displayed on the console:

Generated SQL query

SELECT TOP (1000) 
    [Extent1].[Ref_ID] AS [Ref_ID], 
    [Extent1].[Name] AS [Name], 
    [Extent2].[Name] AS [Name1]
    FROM  [dbo].[Customer] AS [Extent1]
    INNER JOIN [dbo].[Ref] AS [Extent2] ON ([Extent1].[Ref_ID] = [Extent2].[ID]) AND ([Extent1].[Ref_ID2] = [Extent2].[ID2])

That is, in general, the LINQ query generated a SQL query to the MS SQL Server DBMS quite well.

Now let's change the AND condition to OR in the LINQ query:

LINQ query

var query = from e1 in db.Customer
                            from e2 in db.Ref
                            where (e1.Ref_ID == e2.ID)
                                || (e1.Ref_ID2 == e2.ID2)
                            select new { Data1 = e1.Name, Data2 = e2.Name };

Let's run our application again.

The execution will crash with an error related to the command execution time exceeding 30 seconds:

Some aspects of LINQ query optimization in C#.NET for MS SQL Server

If you look at which query was generated by LINQ:

Some aspects of LINQ query optimization in C#.NET for MS SQL Server
, then you can make sure that the selection occurs through the Cartesian product of two sets (tables):

Generated SQL query

SELECT TOP (1000) 
    [Extent1].[Ref_ID] AS [Ref_ID], 
    [Extent1].[Name] AS [Name], 
    [Extent2].[Name] AS [Name1]
    FROM  [dbo].[Customer] AS [Extent1]
    CROSS JOIN [dbo].[Ref] AS [Extent2]
    WHERE [Extent1].[Ref_ID] = [Extent2].[ID] OR [Extent1].[Ref_ID2] = [Extent2].[ID2]

Let's rewrite the LINQ query like this:

Optimized LINQ query

var query = (from e1 in db.Customer
                   join e2 in db.Ref
                   on e1.Ref_ID equals e2.ID
                   select new { Data1 = e1.Name, Data2 = e2.Name }).Union(
                        from e1 in db.Customer
                        join e2 in db.Ref
                        on e1.Ref_ID2 equals e2.ID2
                        select new { Data1 = e1.Name, Data2 = e2.Name });

Then we get the following SQL query:

SQL query

SELECT 
    [Limit1].[C1] AS [C1], 
    [Limit1].[C2] AS [C2], 
    [Limit1].[C3] AS [C3]
    FROM ( SELECT DISTINCT TOP (1000) 
        [UnionAll1].[C1] AS [C1], 
        [UnionAll1].[Name] AS [C2], 
        [UnionAll1].[Name1] AS [C3]
        FROM  (SELECT 
            1 AS [C1], 
            [Extent1].[Name] AS [Name], 
            [Extent2].[Name] AS [Name1]
            FROM  [dbo].[Customer] AS [Extent1]
            INNER JOIN [dbo].[Ref] AS [Extent2] ON [Extent1].[Ref_ID] = [Extent2].[ID]
        UNION ALL
            SELECT 
            1 AS [C1], 
            [Extent3].[Name] AS [Name], 
            [Extent4].[Name] AS [Name1]
            FROM  [dbo].[Customer] AS [Extent3]
            INNER JOIN [dbo].[Ref] AS [Extent4] ON [Extent3].[Ref_ID2] = [Extent4].[ID2]) AS [UnionAll1]
    )  AS [Limit1]

Alas, in LINQ queries there can be only one join condition, therefore it is possible to make an equivalent query through two queries for each condition, followed by their union through Union to remove duplicates among rows.
Yes, the queries will generally be non-equivalent, given that complete duplicate rows may be returned. However, in real life, full duplicate lines are not needed and they are trying to get rid of them.

Now let's compare the execution plans of these two queries:

  1. for CROSS JOIN, the average execution time is 195 sec:
    Some aspects of LINQ query optimization in C#.NET for MS SQL Server
  2. for INNER JOIN-UNION the average execution time is less than 24 sec:
    Some aspects of LINQ query optimization in C#.NET for MS SQL Server

As can be seen from the results, for two tables with millions of records, the optimized LINQ query is many times faster than the non-optimized one.

For the variant with AND in the conditions of a LINQ query of the form:

LINQ query

var query = from e1 in db.Customer
                            from e2 in db.Ref
                            where (e1.Ref_ID == e2.ID)
                                 && (e1.Ref_ID2 == e2.ID2)
                            select new { Data1 = e1.Name, Data2 = e2.Name };

almost always a correct SQL query will be generated, which will run on average for about 1 second:

Some aspects of LINQ query optimization in C#.NET for MS SQL Server
Also for LINQ to Objects manipulations instead of querying the view:

LINQ query (1st option)

var query = from e1 in seq1
                            from e2 in seq2
                            where (e1.Key1==e2.Key1)
                               && (e1.Key2==e2.Key2)
                            select new { Data1 = e1.Data, Data2 = e2.Data };

You can use a query like:

LINQ query (2st option)

var query = from e1 in seq1
                            join e2 in seq2
                            on new { e1.Key1, e1.Key2 } equals new { e2.Key1, e2.Key2 }
                            select new { Data1 = e1.Data, Data2 = e2.Data };

where:

Defining two arrays

Para[] seq1 = new[] { new Para { Key1 = 1, Key2 = 2, Data = "777" }, new Para { Key1 = 2, Key2 = 3, Data = "888" }, new Para { Key1 = 3, Key2 = 4, Data = "999" } };
Para[] seq2 = new[] { new Para { Key1 = 1, Key2 = 2, Data = "777" }, new Para { Key1 = 2, Key2 = 3, Data = "888" }, new Para { Key1 = 3, Key2 = 5, Data = "999" } };

, and the Para type is defined as follows:

Para type definition

class Para
{
        public int Key1, Key2;
        public string Data;
}

Thus, we have considered some aspects in optimizing LINQ queries to MS SQL Server.

Unfortunately, even experienced and leading .NET developers forget that it is necessary to understand what the instructions they use are doing behind the scenes. Otherwise, they become configurators and can lay a time bomb in the future, both when scaling a software solution, and with minor changes in external environmental conditions.

There was also a small review here.

Sources for the test - the project itself, creating tables in the TEST database, as well as filling these tables with data is located here.
Also in this repository in the Plans folder are plans for executing queries with OR conditions.

Source: habr.com

Add a comment