Query Tuning
Minimise the Data Requested
Minimising the data queried is a fundamental tenet of performance tuning for any method of data access. In querying a database using SQL, the below query should almost never make it into production :
Select * From [Table]
Instead, the query should specify the fields to be returned from the table. This is extremely familiar to any database developer or DBA. However this principal is often ignored when developing with Entity Framework.
In a recent project I was working on an alerting system which would send an email alert when a company released an SEC filing. The Alert entity had a related Company entity:
public class Alert
{
public virtual int ID { get; set; }
public virtual DateTime CreateDate { get; set; }
public virtual Company Company { get; set; }
//......more properties
}
public class Company
{
[Key]
public virtual string Ticker { get; set; }
public virtual string Name { get; set; }
//......more properties
}
The UI needed to display the CreateDate of an Alert and the Name of the Company. How should this be written in LINQ-to-Entities?
Often, the related entity (Company in this case) would be requested using Include() as below:
Often, the related entity (Company in this case) would be requested using Include() as below:
var testQuery = db.Alerts // db is the instance of the DbContext .Where(x => x.ID == 1) .Include(y => y.Company);
This is a common pattern, however, it is actually the equivalent of loading the entire rows in both the Company and Alert tables in the database. Look at the SQL this generates:
That is a lot of data requested when all that was needed was the CreateDate and Company Name. One way to solve this is by using an anonymous type which only selects the data requested :
var testQuery = db.Alerts.Where(x => x.ID == 1) .Select(y => new { CreateDate = y.CreateDate, Company = y.Company.Name });
This would generate the below SQL which is a vast improvement.
Anonymous types are not always that convenient since they cannot be used outside the scope of the method they are declared in. There are several alternatives such as creating public classes with only the properties which are required (these public classes would in some ways be analogous to Views in SQL Server since they would be a light-weight composite of the properties/fields of other entities).
Do Not Count After Query Execution
To perform a simple count, ensure that you include Count() in the LINQ statement (which will only generate a SQL COUNT query) and don’t return an IEnumberable and then execute a Count() on it which returns all the entities and then performs an in memory count. There are several varieties of this issue:
var empCount = db.Employees .Where(x => x.CreateDate < DateTime.Now.AddDays(-10) .Count() //only executes a SQL COUNT var emps = db.Employees .Where(x => x.CreateDate < DateTime.Now.AddDays(-10) var empCount = emps.Count() // returns all matching employees and then executes a count
Filter Before ToList()
Another issue is executing ToList() prior to any filtering. ToList() will instantly trigger a database query for all LINQ statements prior to it being hit, subsequent LINQ filter statements such as Where() Take() Skip() etc will then be executed on the in-memory List:
//ToList() executed too early var emps = db.Employees .Where(x => x.CreateDate < DateTime.Now.AddDays(-10) .ToList() // All employees created within last 10 days queried .Take(50) // List is filtered for the top 50 //ToList() executed last var emps = db.Employees .Where(x => x.CreateDate < DateTime.Now.AddDays(-10) .Take(50) .ToList() //Query made for top 50 employees created within last 10 days
Don’t Load Entities To Delete Them
A common problem is loading an entity which needs to be deleted. The DbContext Remove() method requires an entity to be passed in, so if only provided with a primary key the below code would effect a delete on the database:
var emp = db.Employees.Find(empID); //Executes a query against the database db.Employees.Remove(emp); db.SaveChanges(); //Executes a second hit to the database
Instead of this, create a temporary entity and use Attach() to attach it to the context and then call Remove (it is not relevant that the entity will not have all its properties loaded – all that is needed for the Remove() method is the primary key):
var emp = new Employee { ID = empID } db.Employees.Attach(emp); db.Employees.Remove(emp); db.SaveChanges(); //Executes a single hit to the database
Overall, the primary issue to bear in mind is that when an entity is requested all the properties on that entity are queried and returned unless the query states otherwise.
Profile Generated SQL
A problem in using Entity Framework (especially versions 4.1 and later) is that it encourages developers to focus solely on code and ignore the data access. This can result in an over-reliance on EF to handle all data access and SQL generation. It is often difficult to determine the efficiency of an operation simply by looking at the code, and so bloated queries due to poor LINQ structure as noted above or from EF’s SQL generation can arise. It is thus essential to look at the actual SQL generated by EF.
The simplest way to profile the SQL is to call the ToString() method on an IQueryable<> object. In the previous examples I generated the SQL by using:
var sql = testQuery.ToString(); //Note this is only in EF 4.1 +
Beyond that, most SQL Server DBAs/developers should be familiar with SQL Profiler which would provide more indepth analytics.
Julie Lerman wrote a good roundup of EF profiling methods in Profiling Database Activity in EF (note that this was written prior to the release of EF 4.1 so the coding examples use the ObjectContext and not the DbContext which has some simpler methods such as calling ToString(0 directly on an IQueryable).
Working With Change Tracking
EF works in the background to track the changes of any in-memory entities. There are two methods by which it performs this function. Snapshot Change Tracking, keeps an in-memory snapshot of the entity when it originally sees it and compares it to the current values. Change Tracking Proxies by contrast generates a dynamic proxy for the entity at run-time which notifies EF of changes when they are made.
EF will use change tracking proxies if an entity meets all its criteria (the class must be public as well as neither sealed nor abstract, every navigation property must be marked ‘virtual’ plus several other criteria), otherwise the class will use snapshot change tracking.
The advantage of change tracking proxies is that they can automatically notify EF when a entity is updated and save the overhead of EF having to scan through numerous entities comparing them to their previous values. Thus if you have a large number of tracked entities and only a small number of entities whose values change then using change tracking proxies is likely to be very efficient.
Unfortunately, change tracking proxies can be extremely inefficient in scenarios when there are a large number of changes since each proxy has to notify EF of a change. Snapshot change tracking will only perform a single scan when DetectChanges() is triggered (this is automatically called by EF when an event such as SaveChanges() occurs).
In many scenarios the performance difference between the two change tracking methods may not be great but each should be considered when looking at performance optimisation. A scan through the questions of StackOverflow reveals that change tracking proxies does seem to cause quite a few performance and development issues as the entity model is much more restricted when using them, so I would typically default to using snapshot tracking.
When using snapshot change tracking note that change tracking can be temporarily disabled. This may be useful in performance tuning as DetectChanges() is called (and all entities in memory therefore scanned) if any LINQ query is executed on the DbContext object or when any of the below methods are executed :
- DbSet.Add()
- DbSet.Find()
- DbSet.Remove()
- DbSet.Local
- DbSet.Attach()
- DbContext.SaveChanges()
- DbConext.GetValidationErrors()
- DbContext.Entry()
- DbChangeTracker.Entries()
In circumstances where there are a lot of entities being tracked in memory and when you are sure that the existing entities will not be changed it may make sense to test the performance improvement from briefly turning of change tracking:
db.Configuration.AutoDetectChangesEnabled = false; db.Employees.Add(new Employee {name=”jude”}); db.Employees.SaveChanges(); db.Configuration.AutoDetectChangesEnabled = true;
The above code sample avoids an unnecessary call to DetectChanges for Add and SaveChanges, since we know neither of these will change the existing entities we can safely turn change tracking off while this work is performed.
Use Eager Loading Instead of Lazy Loading
Lazy loading (when EF automagically loads related entities when they are required) is convenient but leads to several performance issues. Most notable is when a related entity is called within a loop. When using lazy loading this can trigger a separate database query each time the related entity is requested :
var employees = db.Employees; foreach(var emp in employees) { var address = emp.Address; //Database query is executed every time this line is hit }
By contrast, the below code eagerly loads the related Address entity and only executes a single query on the database.
var employees = db.Employees .Include(x => x.Address) .ToList(); //All data loaded from database foreach(var emp in employees) { var address = emp.Address; //No database query executed }
Lazy loading also causes a variety of development headaches (as noted in Entity Framework Gotchas) and can be globally turned off in the constructor for the context:
public MyContext() : base() { this.Configuration.LazyLoadingEnabled = false; }
Query In-Memory Entities First
EF keeps track of in-memory entities so retrieving an entity from memory is preferable to executing a database query. The easiest to implement example of this is to use the Find() instead of Where() when retrieving an entity by primary key. The Find() method first queries the in-memory entities and if no match is found it then executes a database query, whereas Where() will always hit the database:
var emp = db.Employees.Find(empID); // Queries in memory first var emp = db.Employees.Where(x => x.ID == empID); //Always hits the database
It is not always convenient to do so but can manually query in-memory entities using Local if required :
var emp = db.Employees.Local.Where(x => x.ID == empID) ;
Don’t Rely on Entity Framework Exclusively For Data Access
Finally, EF is great for productivity but executing raw SQL or a stored procedure may often produce performance benefits.
In EF 4.0 and above the DbContext the SQLQuery() method to execute raw SQL:
var emps = db.Database.SQLQuery("select email from Employees");
Similarly use ExecuteSqlCommand() to execute a stored procedure:
var emps = db.Database.ExecuteSqlCommand("...");
Wrap Up
I have of course ignored the actual SQL Server database – which is a massive subject by itself, but these are the primary performance issues I encountered. Please do let me know if you think anything has been left out.
No comments:
Post a Comment