SQL article thumbnail

Flattening multi-level hierarchies in SQL server – Solution and Performance

Software applications are for sure one of the most valuable assets for companies, providing critical capabilities and functionalities to perform a wide range of operations in the industry. But when they don’t perform as expected could add more pain than gain.

Poor performance costs the software industry millions of money annually in the form of lost revenue, hardware costs, damaged customer relations and decreased productivity.

One of the reasons why an applications are not working as expected could be the database and how it is organized.

Let’s deep dive into this by taking one simple example

The Problem

Our data is stored in a single table and has a hierarchical relationship with itself. We must interpret it and be able to say which elements are “above” a certain element and which elements are “below”. For example, we have a table of Employees and each Employee either manages another Employee or is managed by another Employee. A more practical example is Users and Groups inside an Active Directory – a user can be part of a group but also a group can be part of a group and we should be able to list all the groups that a user is part of (including groups that contain a group that the user in in) and also list all members of a group (including members of other groups that are part of the group in question). We will be using the latter example.

Manually joining the table with itself is out of the question because the number of hierarchical levels is dynamic, it can change at any point in time (e.g. when an employee leaves the company / when a group or a user is deleted from the AD). So then what are our options?

First of all, let’s define the tables.

One solution: a while loop

First thing that comes to mind is to use to iterate through all level of the hierarchy, expanding all groups on each iteration until there’s no more groups to expand.

Ok that works but what if we have circular references (e.g. Group A -> Group B -> Group C -> Group A)? All we need to do is remember which groups have already been expanded and not expand them a second time.

As we can see this solution uses two temporary tables (three if we include need to also remember already expanded groups) and also SQL is not a programming language that is designed to work fast with loops / iterations. There has to be a better solution, and there is!

The real solution

Have you heard of the term “common table expression” (or CTE)? If not, you can think of a CTE as a table variable but instead of declaring by defining the table you declare it by defining a query. The upside is that this special kind of table variable is more optimized than the normal table variable, the downside is that you can only use it inside a single query statement. You can also define more than one CTE per query and you can reference a CTE inside another CTEs definition, so even their usage is limited you can get creative with them. And before you ask, yes, you can reference the same CTE inside its own definition.

CTEs offer us the tool necessary to write an efficient query in this case: recursion. Everything can be written in a single query statement and the best part is that SQL server can optimize the execution plan better than in the other case and this, combined with less temporary tables and less query statements in total) makes the execution more efficient.

We will use recursive CTE magic to go through the hierarchy and get a list of all the groups that are part of the group in question – but only the groups. After we have that we only need to expand all the groups in that list once (excluding groups this time) to get our desired result set.

Ok, but again, what if we have circular references? Like before, we will need to know which groups have already been marked for expansion and stop the recursion so that we don’t enter an infinite loop (SQL has a max number of recursions: 32,767, but this upper limit won’t be reached in most real world scenarios).

We will achieve this by adding a second column to the CTE, we’ll call it “hierarchy” and it will remember all the “parents” of a group – a history of all groups that were expanded to get to that specific group. We will store this data as a string of comma separated values and we will use the LIKE operator to check if a value is inside the list.

Execution Times and Conclusions

I populated my database using a mock data generation tool (like SQL Data Generator from RedGate), this gives me evenly spread out data. This most definitely does not reproduce a real-world scenario. Why? Well… for example if, after I generate the data I have 1000 members half of those members will be groups, and if I have 4000 memberships those memberships will be evenly spread out among those groups – so on average each group will have 8 members, 4 of them will be users and the other 4 will be groups.

If we execute these queries on such a dataset we get:

  • sub-second execution time for the while loop
  • 4 seconds for the recursive CTE

Unexpected? Not really. As I said this does not reflect a real-world scenario, but it does underline one of the flaws of the recursive solution – it does not work well when the dataset contains a lot of circular references.

Why? Let’s imagine the following data:

A: {B, C, D}

B: {A, C, D}

C: {A, B, D}

D: {A, B, C}

If we want to expand group A, let’s imagine what the recursive query will do.

  1. It will expand group A – for each of the resulting rows it will remember that A was expanded
  2. It will expand groups B, C and D – for each of these rows it will remember that A and B, C, or D respectively were expanded

    So right now the result table looks like this:

  3. Now it will expand groups C, D, B, D, B, C

    ...

As you can see, it expands groups needlessly because of how the circular reference check is implemented. This, in turn, makes the resulting data set much larger than needed, slowing down the execution time. Whereas the while loop solution avoids this issue by remembering the list of ALL previously expanded groups, not just the list of groups expanded to get to a particular result, so even though it introduces many more operations (inserts, deletes) it’s still better than an exponentially growing result set.

Usually enterprises are not arranged like this. There may be a small amount of circular references present in the AD but in general the hierarchy structure resembles a tree.

I choose to populate my database with 30000 random users and then I proceeded to create a binary tree hierarchical structure – every group will contain only 2 members, these two members will either be groups or users. This leaves us with 62767 rows in the Members table and 59998 rows in the Memberships table. These extra entries were created manually using a stored procedure that called itself.

Expanding the “root” group gives us the following execution times:

  • 1 minute and 3 seconds execution time for the while loop
  • 1 second for the recursive CTE

Now that’s a big difference.

In conclusion, recursive CTEs offer us a substantial increase in performance if used in the correct setting.