LINQ query optimization techniques in C#.NET

Introduction

In a previous article, some methods for optimizing LINQ queries were considered.
Here we present some further approaches to code optimization related to LINQ queries.

It is known that LINQ (Language-Integrated Query) is a simple and convenient language for querying a data source.

LINQ to SQL is a data access technology for a DBMS. It is a powerful tool for working with data, where queries are written in a declarative language and then converted by the platform into SQL queries sent to the database server for execution. In our case, by DBMS we mean MS SQL Server.

However, LINQ queries are not converted into the optimally written SQL queries that an experienced DBA could produce with all the nuances of SQL query optimization in mind:

  1. optimal joins (JOIN) and filtering of results (WHERE)
  2. many nuances in the use of joins and grouped conditions
  3. many variations of replacing IN with EXISTS and NOT IN, and <> with EXISTS
  4. intermediate caching of results via temporary tables, CTEs, and table variables
  5. using the OPTION clause with directives and table hints WITH (...)
  6. using indexed views as one way to avoid redundant data reads during selection

The main performance bottlenecks of the SQL queries generated from LINQ queries are:

  1. consolidation of the entire data selection mechanism in one query
  2. duplication of identical blocks of code, which ultimately leads to multiple redundant data reads
  3. groups of multipart conditions (logical AND and OR) combined into complex conditions; as a result, even when suitable non-clustered indexes exist on the necessary fields, the optimizer ends up scanning the clustered index (INDEX SCAN) over the condition groups
  4. deep nesting of subqueries, which makes parsing the SQL statements and analyzing the query plan very difficult for developers and DBAs

Optimization methods

Now let's move on to optimization methods.

1) Additional indexing

It is best to start by considering filters on the main selection tables, since very often the entire query is built around one or two main tables (applications-people-operations) with a standard set of conditions (IsClosed, Canceled, Enabled, Status). It is important to create appropriate indexes for the selections you identify.

This solution makes sense when filtering by these fields significantly limits the set returned to the query.

For example, suppose we have 500,000 applications, but only 2,000 of them are active. Then a properly chosen index will save us from an INDEX SCAN over the large table and allow the data to be selected quickly through a non-clustered index.
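As an illustration, such an index might look as follows. This is only a sketch: the table and column names (dbo.Applications, CreatedAt) are hypothetical, chosen to match the condition fields mentioned above.

```sql
-- Sketch: a non-clustered index over the standard filter fields
-- of a hypothetical Applications table.
CREATE NONCLUSTERED INDEX IX_Applications_Flags
ON dbo.Applications (IsClosed, Canceled, Enabled);

-- Since only ~2,000 of 500,000 rows are "active", a filtered index
-- that stores just the active slice can be even narrower:
CREATE NONCLUSTERED INDEX IX_Applications_Active
ON dbo.Applications (CreatedAt)
WHERE IsClosed = 0 AND Canceled = 0 AND Enabled = 0;
```

A filtered index like the second one stays small and cheap to maintain, since SQL Server stores only the rows matching its WHERE predicate.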

Missing indexes can also be detected through query plan analysis hints or by collecting statistics from the MS SQL Server system views:

  1. sys.dm_db_missing_index_groups
  2. sys.dm_db_missing_index_group_stats
  3. sys.dm_db_missing_index_details

All of these views contain information about missing indexes, except for spatial indexes.
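For example, a query joining these three views to list the most impactful suggestions might look like this sketch (the scoring expression is one common convention, not the only one):

```sql
-- Sketch: list missing indexes suggested by SQL Server,
-- ordered by estimated impact.
SELECT TOP (25)
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_user_impact
FROM sys.dm_db_missing_index_groups AS g
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
JOIN sys.dm_db_missing_index_details AS d
    ON d.index_handle = g.index_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;
```

Keep in mind that these statistics are reset on server restart, so they are most useful after a representative workload has been running for a while.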

However, indexes and caching are often just ways of dealing with the consequences of poorly written LINQ queries and SQL queries.

As harsh practice shows, it is often important for a business to deliver features by certain deadlines, so heavy queries are frequently moved to the background with caching.

This is partly justified, since users do not always need the latest data and the level of user interface responsiveness remains acceptable.

This approach solves business needs, but ultimately lowers the performance of the information system by simply postponing the solution of the problems.

It is also worth remembering that the index suggestions offered by the MS SQL optimizer may be incorrect, including under the following conditions:

  1. if there are already indexes with a similar set of fields
  2. if the fields in the table cannot be indexed due to indexing restrictions (for more details, see here).

2) Combining attributes into one new attribute

Sometimes several fields from the same table, over which a group of conditions is applied, can be replaced by introducing one new field.

This is especially true for state fields, which by type are usually either bit or integer.

Example:

IsClosed = 0 AND Canceled = 0 AND Enabled = 0 is replaced by Status = 1.

Here an integer attribute Status is introduced, populated from those status fields in the table. This new attribute is then indexed.

This is a fundamental solution to the performance problem, because we are accessing the data without unnecessary calculations.
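The article populates a real column; as an alternative sketch (with a hypothetical dbo.Applications table), a persisted computed column gives the same single attribute without having to touch every INSERT/UPDATE path, and SQL Server can index it directly:

```sql
-- Sketch: collapse the three-flag condition group into a single
-- indexable Status attribute via a persisted computed column.
ALTER TABLE dbo.Applications
ADD Status AS (CASE WHEN IsClosed = 0 AND Canceled = 0 AND Enabled = 0
                    THEN 1 ELSE 0 END) PERSISTED;

CREATE NONCLUSTERED INDEX IX_Applications_StatusFlag
ON dbo.Applications (Status);

-- Queries can now filter on one attribute instead of three:
-- ... WHERE Status = 1
```

Either way, the payoff is the same: one narrow indexed attribute instead of a multi-field condition group.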

3) Materialization of the view

Unfortunately, LINQ queries cannot directly use temporary tables, CTEs, or table variables.

However, there is another way to optimize for this case: indexed views.

The group of conditions from the example above, IsClosed = 0 AND Canceled = 0 AND Enabled = 0 (or a set of other similar conditions), is a good candidate for an indexed view, caching a small slice of data from a large set.

But there are a number of limitations when materializing a view:

  1. subqueries and EXISTS clauses must be replaced by JOINs
  2. UNION, UNION ALL, EXCEPT, and INTERSECT cannot be used
  3. table hints and the OPTION clause cannot be used
  4. there is no way to work with loops
  5. it is not possible to display data from different tables in the same view

It is important to remember that the real benefit of using an indexed view can actually only be obtained when it is indexed.

But when calling the view, these indexes may not be used, and to explicitly use them, you must specify WITH(NOEXPAND).

Since table hints cannot be specified in LINQ queries, you have to create one more view, a "wrapper", of the following form:

CREATE VIEW VIEW_NAME AS SELECT * FROM MAT_VIEW WITH (NOEXPAND);
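For completeness, a sketch of what the materialized MAT_VIEW itself might look like (table and column names are hypothetical). Note that SCHEMABINDING and two-part table names are mandatory, and the view only becomes "indexed" once the unique clustered index is created:

```sql
-- Sketch: an indexed (materialized) view over the "active" slice.
CREATE VIEW dbo.MAT_VIEW
WITH SCHEMABINDING
AS
SELECT Id, CreatedAt
FROM dbo.Applications
WHERE IsClosed = 0 AND Canceled = 0 AND Enabled = 0;
GO

-- The unique clustered index is what actually materializes the data.
CREATE UNIQUE CLUSTERED INDEX PK_MAT_VIEW ON dbo.MAT_VIEW (Id);
```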

4) Using table functions

Often in LINQ queries, large blocks of subqueries, or blocks using views with a complex structure, form a final query with a very complex and suboptimal execution structure.

The main advantages of using table functions in LINQ queries:

  1. As with views, the function can be used and specified as an object, but it can also take a set of input parameters:
    FROM FUNCTION(@param1, @param2 ...)
    as a result, flexible data sampling can be achieved
  2. When using a table function, there are no such strong restrictions as with the indexed views described above:
    1. Table hints:
      via LINQ you cannot specify which indexes to use or define the data isolation level for a query.
      But a function has these capabilities.
      With a function, you can achieve a fairly stable query execution plan, where the rules for working with indexes and data isolation levels are defined.
    2. Compared with indexed views, using a function allows:
      • complex data sampling logic (up to the use of loops)
      • fetching data from many different tables
      • the use of UNION and EXISTS

  3. The OPTION clause is very useful when we need to control concurrency with OPTION (MAXDOP N) or the order of the query execution plan. For example:
    • you can force the query plan to be re-created with OPTION (RECOMPILE)
    • you can force the query plan to use the join order specified in the query with OPTION (FORCE ORDER)

    More details about OPTION are described here.

  4. Using the narrowest and most necessary data slice:
    There is no need to keep large datasets in caches (as with indexed views), from which the data still has to be filtered by a parameter.
    For example, there is a table whose WHERE filter uses three fields (a, b, c).

    Conditionally, all queries carry the constant condition a = 0 AND b = 0.

    However, the query on field c is more variable.

    Let the condition a = 0 AND b = 0 really help us limit the required resulting set to thousands of records, while the condition on c narrows the selection down to hundreds of records.

    This is where a table function can be the better option.

    Also, a table function is more predictable and constant in execution time.
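The OPTION hints mentioned in point 3 can be illustrated with a sketch like the following (the table names are hypothetical):

```sql
-- Sketch: query-level hints via the OPTION clause.
SELECT a.Id
FROM dbo.Applications AS a
INNER JOIN dbo.Operations AS o ON o.ApplicationId = a.Id
WHERE a.Status = 1
OPTION (RECOMPILE,   -- rebuild the plan on every execution
        FORCE ORDER, -- keep the join order as written
        MAXDOP 1);   -- limit the degree of parallelism
```

Such hints should be used sparingly: they fix a plan shape for the cases where the optimizer's own choice has proved unstable.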

Examples

Consider an implementation example using the Questions database.

We have a SELECT query that joins several tables and uses one view (OperativeQuestions), in which membership by email is checked (via EXISTS) against the "Active Queries" ([OperativeQuestions]):

Query #1

(@p__linq__0 nvarchar(4000))SELECT
1 AS [C1],
[Extent1].[Id] AS [Id],
[Join2].[Object_Id] AS [Object_Id],
[Join2].[ObjectType_Id] AS [ObjectType_Id],
[Join2].[Name] AS [Name],
[Join2].[ExternalId] AS [ExternalId]
FROM [dbo].[Questions] AS [Extent1]
INNER JOIN (SELECT [Extent2].[Object_Id] AS [Object_Id],
[Extent2].[Question_Id] AS [Question_Id], [Extent3].[ExternalId] AS [ExternalId],
[Extent3].[ObjectType_Id] AS [ObjectType_Id], [Extent4].[Name] AS [Name]
FROM [dbo].[ObjectQuestions] AS [Extent2]
INNER JOIN [dbo].[Objects] AS [Extent3] ON [Extent2].[Object_Id] = [Extent3].[Id]
LEFT OUTER JOIN [dbo].[ObjectTypes] AS [Extent4] 
ON [Extent3].[ObjectType_Id] = [Extent4].[Id] ) AS [Join2] 
ON [Extent1].[Id] = [Join2].[Question_Id]
WHERE ([Extent1].[AnswerId] IS NULL) AND (0 = [Extent1].[Exp]) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[OperativeQuestions] AS [Extent5]
WHERE (([Extent5].[Email] = @p__linq__0) OR (([Extent5].[Email] IS NULL) 
AND (@p__linq__0 IS NULL))) AND ([Extent5].[Id] = [Extent1].[Id])
));

The view has a rather complex structure: it contains subquery joins and uses DISTINCT, which in the general case is a rather resource-intensive operation.

The sample from OperativeQuestions is about ten thousand records.

The main problem with this query is that for the records of the outer query, an inner subquery is executed against the [OperativeQuestions] view, which should limit the output selection by [Email] = @p__linq__0 (via EXISTS) to hundreds of records.

It may seem that the subquery should compute the records for [Email] = @p__linq__0 once, then these couple of hundred records should be joined by Id with Questions, and the query will be fast.

In fact, all the tables are joined sequentially: both matching the Id of Questions with the Id from OperativeQuestions, and filtering by Email.

As a result, the query works through all the tens of thousands of OperativeQuestions records, although only the data matching the Email is needed.

The text of the OperativeQuestions view:

Query #2

 
CREATE VIEW [dbo].[OperativeQuestions]
AS
SELECT DISTINCT CQ.Id, USR.email AS Email
FROM            [dbo].Questions AS CQ INNER JOIN
                         [dbo].ProcessUserAccesses AS BPU ON BPU.ProcessId = CQ.Process_Id 
OUTER APPLY
                     (SELECT   1 AS HasNoObjects
                      WHERE   NOT EXISTS
                                    (SELECT   1
                                     FROM     [dbo].ObjectUserAccesses AS BOU
                                     WHERE   BOU.ProcessUserAccessId = BPU.[Id] AND BOU.[To] IS NULL)
) AS BO INNER JOIN
                         [dbo].Users AS USR ON USR.Id = BPU.UserId
WHERE        CQ.[Exp] = 0 AND CQ.AnswerId IS NULL AND BPU.[To] IS NULL 
AND (BO.HasNoObjects = 1 OR
              EXISTS (SELECT   1
                           FROM   [dbo].ObjectUserAccesses AS BOU INNER JOIN
                                      [dbo].ObjectQuestions AS QBO 
                                                  ON QBO.[Object_Id] =BOU.ObjectId
                               WHERE  BOU.ProcessUserAccessId = BPU.Id 
                               AND BOU.[To] IS NULL AND QBO.Question_Id = CQ.Id));

View source mapping in DbContext (EF Core 2)

public class QuestionsDbContext : DbContext
{
    //...
    public DbQuery<OperativeQuestion> OperativeQuestions { get; set; }
    //...
    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Query<OperativeQuestion>().ToView("OperativeQuestions");
    }
}

Source LINQ query

var businessObjectsData = await context
    .OperativeQuestions
    .Where(x => x.Email == Email)
    .Include(x => x.Question)
    .Select(x => x.Question)
    .SelectMany(x => x.ObjectQuestions,
                (x, bo) => new
                {
                    Id = x.Id,
                    ObjectId = bo.Object.Id,
                    ObjectTypeId = bo.Object.ObjectType.Id,
                    ObjectTypeName = bo.Object.ObjectType.Name,
                    ObjectExternalId = bo.Object.ExternalId
                })
    .ToListAsync();

In this particular case, we consider a solution to this problem without infrastructure changes and without introducing a separate table with ready-made results ("Active Queries"), which would require a mechanism to fill it with data and keep it up to date.

Although this is a good solution, there is another way to optimize this problem.

The main goal is to cache entries by [Email] = @p__linq__0 from the OperativeQuestions view.

We introduce the table function [dbo].[OperativeQuestionsUserMail] into the database.

Passing Email as an input parameter, we get back a table of values:

Query #3


CREATE FUNCTION [dbo].[OperativeQuestionsUserMail]
(
    @Email  nvarchar(4000)
)
RETURNS
@tbl TABLE
(
    [Id]           uniqueidentifier,
    [Email]      nvarchar(4000)
)
AS
BEGIN
        INSERT INTO @tbl ([Id], [Email])
        SELECT Id, @Email
        FROM [OperativeQuestions]  AS [x] WHERE [x].[Email] = @Email;
     
    RETURN;
END

This returns a table of values with a predefined data structure.

For queries against OperativeQuestionsUserMail to be optimal and have optimal query plans, a strict structure is needed, rather than RETURNS TABLE AS RETURN...
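For contrast, the inline form being avoided here would look roughly like this sketch (the function name is hypothetical); it returns RETURNS TABLE AS RETURN without declaring a column structure of its own:

```sql
-- Sketch: the inline table-valued form (RETURNS TABLE AS RETURN),
-- contrasted with the multi-statement function above, which declares
-- an explicitly typed @tbl result.
CREATE FUNCTION dbo.OperativeQuestionsUserMail_Inline
(
    @Email nvarchar(4000)
)
RETURNS TABLE
AS
RETURN
(
    SELECT [x].[Id], @Email AS [Email]
    FROM [OperativeQuestions] AS [x]
    WHERE [x].[Email] = @Email
);
```

With the inline form, the optimizer expands the function body into the calling query, which is exactly the behavior being avoided here in favor of a predictable plan.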

In this case, Query #1 is converted into Query #4:

Query #4

(@p__linq__0 nvarchar(4000))SELECT
1 AS [C1],
[Extent1].[Id] AS [Id],
[Join2].[Object_Id] AS [Object_Id],
[Join2].[ObjectType_Id] AS [ObjectType_Id],
[Join2].[Name] AS [Name],
[Join2].[ExternalId] AS [ExternalId]
FROM (
    SELECT Id, Email FROM [dbo].[OperativeQuestionsUserMail] (@p__linq__0)
) AS [Extent0]
INNER JOIN [dbo].[Questions] AS [Extent1] ON([Extent0].Id=[Extent1].Id)
INNER JOIN (SELECT [Extent2].[Object_Id] AS [Object_Id], [Extent2].[Question_Id] AS [Question_Id], [Extent3].[ExternalId] AS [ExternalId], [Extent3].[ObjectType_Id] AS [ObjectType_Id], [Extent4].[Name] AS [Name]
FROM [dbo].[ObjectQuestions] AS [Extent2]
INNER JOIN [dbo].[Objects] AS [Extent3] ON [Extent2].[Object_Id] = [Extent3].[Id]
LEFT OUTER JOIN [dbo].[ObjectTypes] AS [Extent4] 
ON [Extent3].[ObjectType_Id] = [Extent4].[Id] ) AS [Join2] 
ON [Extent1].[Id] = [Join2].[Question_Id]
WHERE ([Extent1].[AnswerId] IS NULL) AND (0 = [Extent1].[Exp]);

View and function mapping in DbContext (EF Core 2)

public class QuestionsDbContext : DbContext
{
    //...
    public DbQuery<OperativeQuestion> OperativeQuestions { get; set; }
    //...
    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Query<OperativeQuestion>().ToView("OperativeQuestions");
    }
}
 
public static class FromSqlQueries
{
    public static IQueryable<OperativeQuestion> GetByUserEmail(this DbQuery<OperativeQuestion> source, string Email)
        => source.FromSql($"SELECT Id, Email FROM [dbo].[OperativeQuestionsUserMail] ({Email})");
}

Final LINQ query

var businessObjectsData = await context
    .OperativeQuestions
    .GetByUserEmail(Email)
    .Include(x => x.Question)
    .Select(x => x.Question)
    .SelectMany(x => x.ObjectQuestions,
                (x, bo) => new
                {
                    Id = x.Id,
                    ObjectId = bo.Object.Id,
                    ObjectTypeId = bo.Object.ObjectType.Id,
                    ObjectTypeName = bo.Object.ObjectType.Name,
                    ObjectExternalId = bo.Object.ExternalId
                })
    .ToListAsync();

Query execution time decreased from 200-800 ms to 2-20 ms, i.e. tens of times faster.

Taking a more average figure, instead of 350 ms we got 8 ms.

Among the obvious advantages, we also get:

  1. an overall reduction in read load
  2. a significant reduction in the probability of blocking
  3. a reduction in the average blocking time to acceptable values

Conclusion

Optimization and fine-tuning of MS SQL database access via LINQ is a solvable problem.

In this work, care and consistency are very important.

At the beginning of the process:

  1. check the data the query works with (values, selected data types)
  2. index this data properly
  3. check the correctness of the join conditions between tables

At the next optimization iteration, the following are revealed:

  1. the basis of the query and its main filter are defined
  2. repeating similar query blocks and intersecting conditions are analyzed
  3. the SQL query is hand-optimized in SSMS or another GUI for SQL Server (selecting an intermediate data store and building the resulting query using that store; there may be several)
  4. at the last stage, taking the resulting SQL query as a basis, the structure of the LINQ query is rebuilt

As a result, the resulting LINQ query should become identical in structure to the optimal SQL query identified in point 3.

Acknowledgements

Many thanks to our colleagues jobgemws and alex_ozr from Fortis for their help in preparing this material.

Source: habr.com
