Thursday 13 November 2014

Is Entity Framework Suitable For High-Traffic Websites?

The question "which ORM should I use" is really targeting the tip of a huge iceberg when it comes to the overall data access strategy and performance optimization in a large scale application.
All of the following things (roughly in order of importance) are going to affect throughput, and all of them are handled (sometimes in different ways) by most of the major ORM frameworks out there:
  1. Database Design and Maintenance
    This is, by a wide margin, the single most important determinant of the throughput of a data-driven application or web site, and often totally ignored by programmers.
    If you don't use proper normalization techniques, your site is doomed. If you don't have primary keys, almost every query will be dog-slow. If you use well-known anti-patterns such as using tables for Key-Value Pairs (AKA Entity-Attribute-Value) for no good reason, you'll explode the number of physical reads and writes.
    If you don't take advantage of the features the database gives you, such as page compression, FILESTREAM storage (for binary data), SPARSE columns, hierarchyid for hierarchies, and so on (all SQL Server examples), then you will not see anywhere near the performance that you could be seeing.
    You should start worrying about your data access strategy after you've designed your database and convinced yourself that it's as good as it possibly can be, at least for the time being.
  2. Eager vs. Lazy Loading
    Most ORMs use a technique called lazy loading for relationships, which means that by default they load one entity (table row) at a time, and make a round-trip to the database every time they need to load one or many related (foreign-key) rows.
    This isn't inherently a good or bad thing; it depends on what's actually going to be done with the data, and how much you know up-front. Sometimes lazy-loading is absolutely the right thing to do. NHibernate, for example, may decide not to query for anything at all and simply generate a proxy for a particular ID. If all you ever need is the ID itself, why should it ask for more? On the other hand, if you are trying to print a tree of every single element in a 3-level hierarchy, lazy-loading becomes an O(N²) operation, which is extremely bad for performance.
    One interesting benefit of using "pure SQL" (i.e. raw ADO.NET queries/stored procedures) is that it basically forces you to think about exactly what data is necessary to display any given screen or page. ORMs and lazy-loading features don't prevent you from doing this, but they do give you the opportunity to be... well, lazy, and accidentally explode the number of queries you execute. So you need to understand your ORM's eager-loading features and be ever vigilant about the number of queries you're sending to the server for any given page request.
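    To make the trade-off concrete, here's a minimal EF sketch, assuming a hypothetical Customer/Order model where "db" is your context and lazy loading is enabled:

      // Lazy (relationship loading on demand): one query for the customers,
      // then one more round-trip per customer the first time its Orders
      // collection is touched -- the classic N+1 problem.
      foreach (var c in db.Customers)
          Console.WriteLine(c.Orders.Count);

      // Eager: Include tells EF to fetch customers and their orders in a
      // single joined query -- one round-trip total.
      foreach (var c in db.Customers.Include("Orders"))
          Console.WriteLine(c.Orders.Count);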
  3. Caching
    All major ORMs maintain a first-level cache, AKA "identity cache", which means that if you request the same entity twice by its ID, it doesn't require a second round-trip, and also (if you designed your database correctly) gives you the ability to use optimistic concurrency.
    The L1 cache is pretty opaque in L2S and EF; you kind of have to trust that it's working. NHibernate is more explicit about it (Get/Load vs. Query/QueryOver). Still, as long as you try to query by ID as much as possible, you should be fine here. A lot of people forget about the L1 cache and repeatedly look up the same entity over and over again by something other than its ID (e.g. a lookup field). If you need to do this, you should save the ID or even the entire entity for future lookups.
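    To see the L1 cache at work in EF4 (the model and entity-set names below are hypothetical): GetObjectByKey consults the identity cache before touching the database, whereas a LINQ query - even one filtering by ID - always makes a round-trip:

      // First call queries the database; the second is served straight
      // from the ObjectContext's identity cache.
      var key = new EntityKey("MyEntities.Customers", "Id", 42);
      var customer = (Customer)db.GetObjectByKey(key);
      var again = (Customer)db.GetObjectByKey(key);   // no second round-trip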
    There's also a level 2 cache ("query cache"). NHibernate has this built in. Linq to SQL and Entity Framework have compiled queries, which can help reduce app-server load quite a bit by compiling the query expression itself, but they don't cache the data. Microsoft seems to consider this an application concern rather than a data-access concern, and this is a major weak point of both L2S and EF. Needless to say, it's also a weak point of "raw" SQL. In order to get really good performance with basically any ORM other than NHibernate, you need to implement your own caching façade.
    There's also an L2 cache "extension" for EF4 which is okay, but not really a wholesale replacement for an application-level cache.
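    Such a façade doesn't have to be elaborate. Here's a minimal sketch built on .NET 4's MemoryCache; the class and method names are illustrative, not from any framework:

      using System;
      using System.Runtime.Caching;

      public static class QueryCache
      {
          private static readonly MemoryCache Cache = MemoryCache.Default;

          // Executes the query at most once per TTL window per key;
          // all other calls within the window are served from memory.
          public static T GetOrAdd<T>(string key, Func<T> query, TimeSpan ttl)
          {
              var cached = Cache.Get(key);
              if (cached != null)
                  return (T)cached;

              T result = query();
              if (result != null)   // MemoryCache can't store null values
                  Cache.Set(key, result, DateTimeOffset.Now.Add(ttl));
              return result;
          }
      }

      // Usage: the product query hits the database at most every 5 minutes.
      // var featured = QueryCache.GetOrAdd("featured-products",
      //     () => db.Products.Where(p => p.Featured).ToList(),
      //     TimeSpan.FromMinutes(5));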
  4. Number of Queries
    Relational databases are based on sets of data. They're really good at producing large amounts of data in a short amount of time, but they're nowhere near as good in terms of query latency because there's a certain amount of overhead involved in every command. A well-designed app should play to the strengths of the DBMS and try to minimize the number of queries and maximize the amount of data in each.
    Now I'm not saying to query the entire database when you only need one row. What I'm saying is, if you need the Customer, Address, Phone, CreditCard, and Order rows all at the same time in order to serve a single page, then you should ask for them all at the same time; don't execute each query separately. Sometimes it's worse than that: you'll see code that queries the same Customer record 5 times in a row, first to get the Id, then the Name, then the EmailAddress, then... it's ridiculously inefficient.
    Even if you need to execute several queries that all operate on completely different sets of data, it's usually still more efficient to send it all to the database as a single "script" and have it return multiple result sets. It's the overhead you're concerned with, not the total amount of data.
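    In raw ADO.NET that looks something like this (table and column names are hypothetical; needs using System.Data.SqlClient) - one command, one round-trip, two result sets:

      using (var conn = new SqlConnection(connectionString))
      using (var cmd = new SqlCommand(
          "SELECT Id, Name FROM Customer WHERE Id = @id; " +
          "SELECT Id, Total FROM [Order] WHERE CustomerId = @id;", conn))
      {
          cmd.Parameters.AddWithValue("@id", customerId);
          conn.Open();
          using (var reader = cmd.ExecuteReader())
          {
              while (reader.Read()) { /* materialize the customer */ }
              reader.NextResult();   // advance to the orders result set
              while (reader.Read()) { /* materialize the orders */ }
          }
      }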
    This might sound like common sense but it's often really easy to lose track of all the queries that are being executed in various parts of the application; your Membership Provider queries the user/role tables, your Header action queries the shopping cart, your Menu action queries the site map table, your Sidebar action queries the featured product list, and then maybe your page is divided into a few separate autonomous areas which query the Order History, Recently Viewed, Category, and Inventory tables separately, and before you know it, you're executing 20 queries before you can even start to serve the page. It just utterly destroys performance.
    Some frameworks - and I'm thinking mainly of NHibernate here - are incredibly clever about this and allow you to use something called futures which batch up entire queries and try to execute them all at once, at the last possible minute. AFAIK, you're on your own if you want to do this with any of the Microsoft technologies; you have to build it into your application logic.
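    For reference, NHibernate futures look something like this (entity names hypothetical); nothing is sent to the database until one of the results is first enumerated, at which point both queries go out in a single batch:

      var customers = session.QueryOver<Customer>().Future();
      var products = session.QueryOver<Product>().Future();

      foreach (var c in customers) { /* one batched round-trip happens here */ }
      foreach (var p in products) { /* already loaded; no extra round-trip */ }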
  5. Indexing, Predicates, and Projections
    At least 50% of devs I speak to, and even some DBAs, seem to have trouble with the concept of covering indexes. They think, "well, the Customer.Name column is indexed, so every lookup I do on the name should be fast." Except it doesn't work that way unless the Name index covers the specific columns you're looking up. In SQL Server, that's done with INCLUDE in the CREATE INDEX statement.
    If you naïvely use SELECT * everywhere - and that is more or less what every ORM will do unless you explicitly specify otherwise using a projection - then the DBMS may very well choose to completely ignore your indexes because they contain non-covered columns. A projection means that, for example, instead of doing this:
    from c in db.Customers where c.Name == "John Doe" select c
    
    You do this instead:
    from c in db.Customers where c.Name == "John Doe"
    select new { c.Id, c.Name }
    
    And this will, for most modern ORMs, instruct it to only go and query the Id and Name columns, which are presumably covered by the index (but not the Email, LastActivityDate, or whatever other columns you happened to stick in there).
    It's also very easy to completely blow away any indexing benefits by using inappropriate predicates. For example:
    from c in db.Customers where c.Name.Contains("Doe")
    
    ...looks almost identical to our previous query but in fact will result in a full table or index scan because it translates to LIKE '%Doe%'. Similarly, another query which looks suspiciously simple is:
    from c in db.Customers where (maxDate == null) || (c.BirthDate >= maxDate)
    
    Assuming you have an index on BirthDate, this predicate has a good chance of rendering it completely useless. Our hypothetical programmer here has obviously attempted to create a kind of dynamic query ("only filter the birth date if that parameter was specified"), but this isn't the right way to do it. Written like this instead:
    from c in db.Customers where c.BirthDate >= (maxDate ?? DateTime.MinValue)
    
    ...now the DB engine knows how to parameterize this and do an index seek. One minor, seemingly insignificant change to the query expression can drastically affect performance.
    Unfortunately, LINQ in general makes it all too easy to write bad queries like this, because sometimes the providers are able to guess what you were trying to do and optimize the query, and sometimes they aren't. So you end up with frustratingly inconsistent results which would have been blindingly obvious (to an experienced DBA, anyway) had you just written plain old SQL.
    Basically it all comes down to the fact that you really have to keep a close eye on both the generated SQL and the execution plans they lead to, and if you're not getting the results you expect, don't be afraid to bypass the ORM layer once in a while and hand-code the SQL. This goes for any ORM, not just EF.
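    In EF, inspecting the generated SQL is straightforward, assuming "db" is an ObjectContext-style context: ObjectQuery.ToTraceString hands you the exact statement, which you can then run through the execution-plan viewer:

      var query = from c in db.Customers
                  where c.Name == "John Doe"
                  select new { c.Id, c.Name };

      // Prints the SQL that would be sent, without executing the query.
      Console.WriteLine(((ObjectQuery)query).ToTraceString());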
  6. Transactions and Locking
    Do you need to display data that's current up to the millisecond? Maybe - it depends - but probably not. Sadly, Entity Framework doesn't give you NOLOCK; you can only use READ UNCOMMITTED at the transaction level (not the table level). In fact none of the ORMs are particularly reliable about this; if you want to do dirty reads, you have to drop down to the SQL level and write ad-hoc queries or stored procedures. So what it boils down to, again, is how easy it is for you to do that within the framework.
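    For the record, if a dirty read really is acceptable, the transaction-level route in EF looks like this (via System.Transactions):

      var options = new TransactionOptions
      {
          IsolationLevel = IsolationLevel.ReadUncommitted
      };
      using (var scope = new TransactionScope(
          TransactionScopeOption.Required, options))
      {
          // Every query inside the scope may read uncommitted rows.
          var count = db.Products.Count();
          scope.Complete();
      }

    But for table-level hints, you're back to writing the SQL yourself.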
    Entity Framework has come a long way in this regard - version 1 of EF (in .NET 3.5) was god-awful and made it incredibly difficult to break through the "entities" abstraction, but now you have ExecuteStoreQuery and Translate, so it's really not too bad. Make friends with these guys, because you'll be using them a lot.
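    A quick sketch of those escape hatches (the entity and column names here are hypothetical):

      // ExecuteStoreQuery: raw SQL in, materialized entities out.
      // {0} is a real command parameter, not string interpolation.
      var recent = db.ExecuteStoreQuery<Customer>(
          "SELECT * FROM Customer WITH (NOLOCK) WHERE LastActivity > {0}",
          cutoff).ToList();

      // Translate does the same materialization from an existing
      // DbDataReader, e.g. one returned by a stored procedure call.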
    There's also the issue of write locking and deadlocks and the general practice of holding locks in the database for as little time as possible. In this regard, most ORMs (including Entity Framework) actually tend to be better than raw SQL because they encapsulate the Unit of Work pattern, which in EF is SaveChanges. In other words, you can "insert" or "update" or "delete" entities to your heart's content, whenever you want, secure in the knowledge that no changes will actually get pushed to the database until you commit the unit of work.
    Note that a UOW is not analogous to a long-running transaction. The UOW still uses the optimistic concurrency features of the ORM and tracks all changes in memory. Not a single DML statement is emitted until the final commit. This keeps transaction times as low as possible. If you build your application using raw SQL, it's quite difficult to achieve this deferred behaviour.
    What this means for EF specifically: Make your units of work as coarse as possible and don't commit them until you absolutely need to. Do this and you'll end up with much lower lock contention than you would using individual ADO.NET commands at random times.
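    A minimal illustration, again with a hypothetical model:

      // Stage as many changes as you like; nothing is sent to the
      // database yet and no write locks are taken.
      db.Customers.AddObject(new Customer { Name = "Jane Doe" });

      var existing = db.Customers.First(c => c.Id == 42);
      existing.Email = "jane@example.com";   // tracked in memory only

      // One short transaction commits the entire unit of work.
      db.SaveChanges();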

In Conclusion:

EF is completely fine for high-traffic/high-performance applications, just like every other framework is fine for high-traffic/high-performance applications. What matters is how you use it. Here's a quick comparison of the most popular frameworks and what features they offer in terms of performance (legend: N = not supported, P = partially supported, Y = supported):
                                | L2S | EF1 | EF4 | NH3 | ADO
                                +-----+-----+-----+-----+-----
Lazy Loading (entities)         |  N  |  N  |  N  |  Y  |  N
Lazy Loading (relationships)    |  Y  |  Y  |  Y  |  Y  |  N
Eager Loading (global)          |  N  |  N  |  N  |  Y  |  N
Eager Loading (per-session)     |  Y  |  N  |  N  |  Y  |  N
Eager Loading (per-query)       |  N  |  Y  |  Y  |  Y  |  Y
Level 1 (Identity) Cache        |  Y  |  Y  |  Y  |  Y  |  N
Level 2 (Query) Cache           |  N  |  N  |  P  |  Y  |  N
Compiled Queries                |  Y  |  P  |  Y  |  N  | N/A
Multi-Queries                   |  N  |  N  |  N  |  Y  |  Y
Multiple Result Sets            |  Y  |  N  |  P  |  Y  |  Y
Futures                         |  N  |  N  |  N  |  Y  |  N
Explicit Locking (per-table)    |  N  |  N  |  N  |  P  |  Y
Transaction Isolation Level     |  Y  |  Y  |  Y  |  Y  |  Y
Ad-Hoc Queries                  |  Y  |  P  |  Y  |  Y  |  Y
Stored Procedures               |  Y  |  P  |  Y  |  Y  |  Y
Unit of Work                    |  Y  |  Y  |  Y  |  Y  |  N
As you can see, EF4 (the current version) doesn't fare too badly, but it's probably not the best if performance is your primary concern. NHibernate is much more mature in this area and even Linq to SQL provides some performance-enhancing features that EF still doesn't. Raw ADO.NET is often going to be faster for very specific data-access scenarios, but, when you put all the pieces together, it really doesn't offer a lot of important benefits that you get from the various frameworks.
And, just to make completely sure that I sound like a broken record, none of this matters in the slightest if you don't design your database, application, and data access strategies properly. All of the items in the chart above are for improving performance beyond the baseline; most of the time, the baseline itself is what needs the most improvement.
