SQL Server and the Pie Eating Contest

When coding SQL, it can be a challenge to decide whether a particular SQL statement should be broken down into multiple statements, or whether one should keep adding clauses to solve the problem at hand with one statement. Hopefully this article will provide you with a framework for thinking about the issue and help you make the right choice.


As an analogy, imagine that we are entering a pie eating contest. Our challenge is to eat the pie we are given in the least time. We have a basic choice to make. Should we take many small bites, or should we take a few large bites, or should we perhaps just try to cram the whole pie at once into our mouth. The tradeoff of course is that the more pie we try to chew at once, the longer it takes to chew. For small pies, (and large mouths), a small number of large bites will be most efficient. As the pie gets larger, the number of optimal bites (and the size of the bite) will change. We also run the risk of taking far too large a bite and choking.


It’s similar with SQL. Suppose we are crafting a query that has to do a join between N tables, and also check for other Y criteria (perhaps sub queries or function calls). The question then arises, will it be faster to do it in one step or break it down into several steps and if so, how many steps?


Where SQL differs from pie eating, is that with pies, it’s relatively easy to measure the size of a pie (and of a mouth), and how much can fit in one bite. With a given piece of SQL, that’s not the case.  The “size” of a SQL statement doesn’t just depend on the amount of code, but on the size of the underlying tables and even the indexing structure.


What I would suggest is this;


Before deciding how best to write a query, one needs to answer the following questions;

  1. What would be considered reasonable performance for this query?
  2. What would be considered really bad performance for this query?
  3. Am I trying to optimize for best performance or am I trying to optimize to ensure I stay within reasonable performance and avoid really bad performance?


How you answer those questions will determine how you tackle the coding


If I had to ensure that something ran in 5 seconds or less, I would first try to avoid multiple steps, and see what the query looks like as one statement. If the query looks overly complex( many table joins, sub queries, function calls etc.) , I would see if I could eliminate parts of it, optimize indexes, or even change the way processing works in my system so that some of the items are pre-calculated. I might also have to accept that I’m asking for too much at once to expect results reliably in that amount of time and re-think my whole request.


If acceptable performance is greater than 5 seconds, and what I am really trying to avoid is the equivalent of choking on the pie, then I would start by breaking down the SQL query into multiple steps. The basic pattern for this would be to create a temporary table, populate it with an initial set of rows, and then continue to either filter the rows through multiple steps (deletes) or update columns of the temp table. It might very well be that for a  given result set needed, breaking it down into multiple steps might double or triple the time it takes to run (i.e. the complex single version might take 2 seconds while the multi-step version takes 6 seconds or more). But with a larger number of smaller steps, each one can be tuned separately and it is far less likely that SQL Server will “choke” on any of the steps.


The pie analogy can be used to explain several other aspects of SQL Server’s behavior – such as why does the statistics that SQL Server keeps about the data it is storing affect the performance of a query so dramatically? In essence, SQL Server uses its statistics to measure the size of the pie and the sizes of the different bites it might break it up into. If SQL Server has the wrong statistics, and therefore makes the wrong decision about the size of the pie, it can essentially choke.


I’d also like to use the pie analogy to point out a key crucial difference between other programming languages that developers normally work in versus coding SQL. In a language like Java or C, when we write code, to a large extent, we are literally controlling the flow of the code and the order things will be accomplished (putting aside issues like when does garbage collection occur etc.). So when we program something like a FOR loop, though we are not handing coding the actually assembly language instructions, we expect that the FOR loop will operate the same way every time we run it. Early databases operated like this also. Essentially you would open up a table (specifying index to use), then loop through each row, and if necessary, for each row, opening up yet another table etc. In effect, you got to decide at a very low level, just how to traverse the data in your database. The downside of a database engine like that is the tremendous amount of code to accomplish even the simplest queries.   SQL on the other hand, is a declarative language – you aren’t telling SQL how to do something in a given query, you are stating what data you want – and leaving it up to the database engine. Think of it as saying, “I want a chocolate pie using the following ingredients (the ingredients being the various pieces of data in different tables)” and then leaving it up to the baker to figure out how to bake the pie.  You might ask for the same pie two days in a row, but depending on the number of ingredients or how difficult it is to find particular ingredients, the baker might bake the “same” pie in different ways on each of those requests.


SQL Server actually has ways that you can influence the way it tackles a given problem. One can specify the type of join operations used, the index it should be used, and even cache whole plans (called plan guides) and tell SQL Server to use them. However, if one needs to resort to such commands, it’s a clear indication that the query is so complex that SQL Server is having a hard time figuring out the best way to accomplish it. It is far better to re-factor the query than force SQL Server to use a particular set of operations to accomplish the query.


In summary, as my mother always told me, “watch your manners and don’t stuff your mouth!” That seems to work both for table manners and SQL Server.