
A Deep Dive into Entity Framework Performance When Using "Contains"


During my everyday work, I use Entity Framework. It is very convenient, but in some cases its performance is slow. Despite the many good articles on improving EF performance, and the very useful advice they give (e.g., avoid complex queries, avoid parameters in Skip and Take, use views, select only the fields you need), not much can really be done when you need to use a complex Contains on two or more fields; in other words, when you join data to an in-memory list.

Problem

Let’s check the following example:

var localData = GetDataFromApiOrUser();
var query = from p in context.Prices
            join s in context.Securities on 
              p.SecurityId equals s.SecurityId
            join t in localData  on 
              new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
              new { t.Ticker, t.TradedOn, t.PriceSourceId }
            select p;
var result = query.ToList();

The code above doesn’t work in EF 6 at all, and while it does work in EF Core, the join is actually done locally—since I have ten million records in my database, all of them get downloaded and all the memory gets consumed. This is not a bug in EF. It is expected. However, wouldn’t it be fantastic if there was something to solve this? In this article, I’m going to do some experiments with a different approach to work around this performance bottleneck.

Solution

I’m going to try different ways to achieve this, starting from the simplest and moving to more advanced ones. At each step, I will provide code and metrics, such as time taken and memory usage. Note that I will interrupt the benchmarking run if it takes longer than ten minutes.

The code for the benchmarking program is located in the following repository. It uses C#, .NET Core, EF Core, and PostgreSQL. I used a machine with an Intel Core i5, 8 GB of RAM, and an SSD.

The DB schema for testing looks like this:

Tables in the database: prices, securities and price sources

Just three tables: prices, securities and price sources. The prices table has tens of millions of records.

Option 1. Simple and Naive

Let’s try something simple, just to get started.

var result = new List<Price>();
using (var context = CreateContext())
{
  foreach (var testElement in TestData)
  {
    result.AddRange(context.Prices.Where(
      x => x.Security.Ticker == testElement.Ticker &&
           x.TradedOn == testElement.TradedOn &&
           x.PriceSourceId == testElement.PriceSourceId));
  }
}

The algorithm is simple: for each element in the test data, find the matching records in the database and add them to the result collection. This code’s main advantage is that it is very easy to implement; it’s also readable and maintainable. Its obvious drawback is that it is the slowest option: even though all three columns are indexed, the overhead of network communication still creates a performance bottleneck. Here are the metrics:

Results of the first experiment

So, for a large volume, it takes approximately one minute. Memory consumption seems to be reasonable.

Option 2. Naive with Parallelism

Now let’s try to add parallelism to the code. The core idea here is that hitting the database in parallel threads can improve the overall performance.

var result = new ConcurrentBag<Price>();
var partitioner = Partitioner.Create(0, TestData.Count);

Parallel.ForEach(partitioner, range =>
{
  var subList = TestData.Skip(range.Item1)
                        .Take(range.Item2 - range.Item1)
                        .ToList();
  using (var context = CreateContext())
  {
    foreach (var testElement in subList)
    {
      var query = context.Prices.Where(
            x => x.Security.Ticker == testElement.Ticker &&
                 x.TradedOn == testElement.TradedOn &&
                 x.PriceSourceId == testElement.PriceSourceId);
      foreach (var el in query)
      {
        result.Add(el);
      }
    }
  }
});

Interestingly, for smaller test data sets this approach works slower than the first solution, but for bigger samples it is faster (approximately two times faster, in this instance). Memory consumption changes a bit, but not significantly.

Results of the second experiment

Option 3. Multiple Contains

Let’s try another approach:

  • Prepare three collections of unique values of Ticker, TradedOn, and PriceSourceId.
  • Perform the query in one run, filtering with three Contains calls.
  • Recheck locally (see below).
var result = new List<Price>();
using (var context = CreateContext())
{
  var tickers = TestData.Select(x => x.Ticker).Distinct().ToList();
  var dates = TestData.Select(x => x.TradedOn).Distinct().ToList();
  var ps = TestData.Select(x => x.PriceSourceId)
                   .Distinct().ToList();

  var data = context.Prices
               .Where(x => tickers.Contains(x.Security.Ticker) &&
                           dates.Contains(x.TradedOn) &&
                           ps.Contains(x.PriceSourceId))
               .Select(x => new { 
                           x.PriceSourceId,
                           Price = x, 
                           Ticker = x.Security.Ticker, 
                      })
                .ToList();

  var lookup = data.ToLookup(x => 
     $"{x.Ticker}, {x.Price.TradedOn}, {x.PriceSourceId}");

  foreach (var el in TestData)
  {
    var key = $"{el.Ticker}, {el.TradedOn}, {el.PriceSourceId}";
    result.AddRange(lookup[key].Select(x => x.Price));
  }
}

This approach is problematic. Execution time is very data-dependent: the query may retrieve just the required records (in which case it will be very fast), but it may return many more (maybe even 100 times more).

Let’s consider the following test data:

Response data

Here I query the prices for Ticker1 traded on 2018-01-01 and for Ticker2 traded on 2018-01-02. However, four records will actually be returned.

The unique values for Ticker are Ticker1 and Ticker2. The unique values for TradedOn are 2018-01-01 and 2018-01-02.

So, four records match this expression.
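To make the over-fetch concrete, here is a minimal sketch (with hypothetical in-memory data standing in for the database) of the cross-product effect and the local recheck:

```csharp
using System;
using System.Linq;

// Per-column Contains matches the cross product of the unique values,
// not the requested (Ticker, TradedOn) pairs themselves.
public static class ContainsOverfetchDemo
{
    public static (int Fetched, int Kept) Run()
    {
        // The two pairs we actually want.
        var requested = new[]
        {
            (Ticker: "Ticker1", TradedOn: new DateTime(2018, 1, 1)),
            (Ticker: "Ticker2", TradedOn: new DateTime(2018, 1, 2)),
        };

        // Four rows in the "database": every Ticker/TradedOn combination.
        var dbRows = new[]
        {
            (Ticker: "Ticker1", TradedOn: new DateTime(2018, 1, 1)),
            (Ticker: "Ticker1", TradedOn: new DateTime(2018, 1, 2)),
            (Ticker: "Ticker2", TradedOn: new DateTime(2018, 1, 1)),
            (Ticker: "Ticker2", TradedOn: new DateTime(2018, 1, 2)),
        };

        var tickers = requested.Select(x => x.Ticker).Distinct().ToList();
        var dates = requested.Select(x => x.TradedOn).Distinct().ToList();

        // What the Contains-based SQL filter effectively does:
        var fetched = dbRows
            .Where(r => tickers.Contains(r.Ticker) && dates.Contains(r.TradedOn))
            .ToList();

        // The local recheck keeps only the pairs actually asked for:
        var kept = fetched.Where(r => requested.Contains(r)).ToList();

        return (fetched.Count, kept.Count);
    }

    public static void Main()
    {
        var (fetched, kept) = Run();
        Console.WriteLine($"{fetched} fetched, {kept} kept"); // 4 fetched, 2 kept
    }
}
```

Two requested pairs come back as four rows; with less regular data the amplification can be far worse.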

That’s why a local recheck is needed and why this approach is dangerous. The metrics are as follows:

Results of the third experiment

Awful memory consumption! Tests with big volumes failed due to a timeout of 10 minutes.

Option 4. Predicate Builder

Let’s change the paradigm: Let’s build a good old Expression for each test data set.

var result = new List<Price>();
using (var context = CreateContext())
{
  var baseQuery = from p in context.Prices
                  join s in context.Securities on 
                    p.SecurityId equals s.SecurityId
                  select new TestData()
                  {
                    Ticker = s.Ticker,
                    TradedOn = p.TradedOn,
                    PriceSourceId = p.PriceSourceId,
                    PriceObject = p
                  };

  var tradedOnProperty = typeof(TestData).GetProperty("TradedOn");
  var priceSourceIdProperty =
    typeof(TestData).GetProperty("PriceSourceId");
  var tickerProperty = typeof(TestData).GetProperty("Ticker");

  var paramExpression = Expression.Parameter(typeof(TestData));
  Expression wholeClause = null;
  foreach (var td in TestData)
  {
    var elementClause = 
      Expression.AndAlso(
        Expression.Equal(
          Expression.MakeMemberAccess(
            paramExpression, tradedOnProperty),
          Expression.Constant(td.TradedOn)
        ),
        Expression.AndAlso(
          Expression.Equal(
            Expression.MakeMemberAccess(
              paramExpression, priceSourceIdProperty),
            Expression.Constant(td.PriceSourceId)
          ),
          Expression.Equal(
            Expression.MakeMemberAccess(
              paramExpression, tickerProperty), 
              Expression.Constant(td.Ticker))
          ));

    if (wholeClause == null)
      wholeClause = elementClause;
    else
      wholeClause = Expression.OrElse(wholeClause, elementClause);
  }

  var query = baseQuery.Where(
  (Expression<Func<TestData, bool>>)Expression.Lambda(
     wholeClause, paramExpression)).Select(x => x.PriceObject);

  result.AddRange(query);
}

The resulting code is pretty complex. Building expressions is not the easiest thing and involves reflection (which, itself, is not that fast). But it lets us build a single query of the form (.. AND .. AND ..) OR (.. AND .. AND ..) OR (.. AND .. AND ..) .... These are the results:

Results of the fourth experiment

Even worse than any of the previous approaches.

Option 5. Shared Query Data Table

Let’s try one more approach:

I added a new table to the database to hold query data. For each query, I can now:

  • Start a transaction (if one is not already started)
  • Upload the query data to that table (temporarily)
  • Perform the query
  • Roll back the transaction, which deletes the uploaded data
var result = new List<Price>();
using (var context = CreateContext())
{
  context.Database.BeginTransaction();

  var reducedData = TestData.Select(x => new SharedQueryModel()
  {
    PriceSourceId = x.PriceSourceId,
    Ticker = x.Ticker,
    TradedOn = x.TradedOn
  }).ToList();
  
  // Here query data is stored to shared table
  context.QueryDataShared.AddRange(reducedData);
  context.SaveChanges();

  var query = from p in context.Prices
         join s in context.Securities on 
           p.SecurityId equals s.SecurityId
         join t in context.QueryDataShared on 
           new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
           new { t.Ticker, t.TradedOn, t.PriceSourceId }
         select p;
  result.AddRange(query);

  context.Database.RollbackTransaction();
}

Metrics first:

Results of the fifth experiment

The result is very good. Very fast. Memory consumption is also good. But the drawbacks are:

  • You have to create an extra table in the database to perform just one type of query,
  • You have to start a transaction (which consumes DBMS resources anyway), and
  • You have to write something to the database (in a READ operation!), which basically means this will not work with something like a read replica.

But apart from that, this approach is nice—fast and readable. And a query plan is cached in this case!

Option 6. MemoryJoin Extension

Here I’m going to use a NuGet package called EntityFrameworkCore.MemoryJoin. Despite the word Core in its name, it also supports EF 6. It is called MemoryJoin, but in fact it sends the specified query data to the server as a VALUES construct, and all the work is done on the SQL server.

Let’s check the code.

var result = new List<Price>();
using (var context = CreateContext())
{
  // better to select needed properties only, for better performance
  var reducedData = TestData.Select(x => new { 
    x.Ticker, 
    x.TradedOn, 
    x.PriceSourceId 
  }).ToList();

  var queryable = context.FromLocalList(reducedData);
  var query = from p in context.Prices
              join s in context.Securities on 
                p.SecurityId equals s.SecurityId
              join t in queryable on 
                new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
                new { t.Ticker, t.TradedOn, t.PriceSourceId }
              select p;

  result.AddRange(query);
}

Metrics:

Results of the final experiment

This looks awesome. Three times faster than the previous approach—that makes it the fastest yet. 3.5 seconds for 64K records! The code is simple and understandable. This works with read-only replicas. Let’s check the query generated for three elements:

SELECT "p"."PriceId",
       "p"."ClosePrice",
       "p"."OpenPrice",
       "p"."PriceSourceId",
       "p"."SecurityId",
       "p"."TradedOn",
       "t"."Ticker",
       "t"."TradedOn",
       "t"."PriceSourceId"
FROM "Price" AS "p"
INNER JOIN "Security" AS "s" ON "p"."SecurityId" = "s"."SecurityId"
INNER JOIN
  ( SELECT "x"."string1" AS "Ticker",
           "x"."date1" AS "TradedOn",
           CAST("x"."long1" AS int4) AS "PriceSourceId"
   FROM
     ( SELECT *
      FROM (
            VALUES (1, @__gen_q_p0, @__gen_q_p1, @__gen_q_p2), 
                   (2, @__gen_q_p3, @__gen_q_p4, @__gen_q_p5), 
                   (3, @__gen_q_p6, @__gen_q_p7, @__gen_q_p8)
           ) AS __gen_query_data__ (id, string1, date1, long1)
       ) AS "x"
   ) AS "t" ON ("s"."Ticker" = "t"."Ticker")
AND ("p"."TradedOn" = "t"."TradedOn")
AND ("p"."PriceSourceId" = "t"."PriceSourceId")

As you can see, this time actual values are passed from memory to the SQL server in the VALUES construction. And this does the trick: The SQL server managed to perform a fast join operation and use the indexes correctly.

However, there are some drawbacks (you can read more on my blog):

  • You need to add an extra DbSet to your model (though there is no need to create a corresponding table in the database).
  • The extension doesn’t support model classes with many properties; it handles three string properties, three date properties, three GUID properties, three float/double properties, and three int/byte/long/decimal properties. This is more than enough in 90% of cases, I guess. If it’s not, you can create a custom class and use that. HINT: pass only the values you actually need in a query; otherwise, resources are wasted.
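For reference, the extra DbSet mentioned above is registered roughly like this. This is a sketch from memory of the package’s setup; the QueryModelClass type ships with the package, so consult its README for the exact details:

```csharp
using EntityFrameworkCore.MemoryJoin;
using Microsoft.EntityFrameworkCore;

public class PricesContext : DbContext
{
    // ... your regular DbSets (Prices, Securities, etc.) go here ...

    // The extra set required by MemoryJoin. QueryModelClass comes from the
    // package; no corresponding table needs to be created in the database.
    public DbSet<QueryModelClass> QueryData { get; set; }
}
```

Once this set exists, context.FromLocalList(...) can map your in-memory objects onto its generic columns (string1, date1, long1, and so on, as seen in the generated SQL above).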

Conclusion

Among the things I’ve tested here, I’d definitely go for MemoryJoin. Someone might object that the drawbacks are insurmountable, and since not all of them can be resolved at the moment, we should abstain from using the extension. Well, for me, that’s like saying you should not use a knife because you could cut yourself. Optimization was never a task for junior developers, but for someone who understands how EF works, this tool can improve performance dramatically. Who knows? Maybe one day, someone at Microsoft will add core support for dynamic VALUES.

Finally, here are a few more diagrams comparing the results.

Below is a diagram of the time taken to perform the operation. MemoryJoin is the only approach that does the job in a reasonable time across all volumes. Only four approaches can process big volumes at all: the two naive implementations, the shared table, and MemoryJoin.

Time taken in various cases for each experiment

The next diagram shows memory consumption. All approaches demonstrate more or less the same numbers, except the one with multiple Contains; this phenomenon was described above.

Memory consumption in various cases for each experiment

Understanding the Basics

What is DbSet in Entity Framework?

DbSet is an abstraction that represents a collection of objects (usually lazily loaded) stored in a table. Operations performed on a DbSet are translated into SQL queries and executed against the actual database records.
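As an illustration, here is a minimal sketch of a DbSet declaration and how a query over it behaves (the entity, context, and property names are made up for this example):

```csharp
using System.Linq;
using Microsoft.EntityFrameworkCore;

public class Security
{
    public int SecurityId { get; set; }
    public string Ticker { get; set; }
}

public class TradingContext : DbContext
{
    // Maps to a table of securities; rows are materialized only when queried.
    public DbSet<Security> Securities { get; set; }
}

// A LINQ query over the DbSet is translated to SQL and runs on the server,
// e.g. (conceptually):
//   var tickers = context.Securities
//       .Where(s => s.Ticker.StartsWith("A"))
//       .Select(s => s.Ticker)
//       .ToList();
// → SELECT "Ticker" FROM "Securities" WHERE "Ticker" LIKE 'A%'
```

Nothing is fetched until the query is enumerated (here, by ToList), which is what makes a DbSet "usually lazily loaded."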

About the author

Anton Shkuratov, Russia
member since September 28, 2014
Anton is a software developer and technical consultant with over ten years of experience in the creation of desktop-distributed applications using the most modern frameworks and methodologies. He has expertise in all stages of design, implementation, and deployment of information systems in both desktop and web areas.