Choosing the right calculation type…

[Last year at Tableau Conference in Vegas, I co-presented a session on this topic with the fantastic Bethany Lyons from our UK offices. I delivered it again last week at Tableau Conference on Tour in Sydney to a very positive response, so to reach a broader audience I’m sharing the key points from that presentation in this post. A lot of the initial work, including developing the examples, was done by Bethany, so much of the credit must go to her. Thanks, Bethany – you rock!]

Today, Tableau has multiple forms of calculation:

Basic Calculations
These calculations are written as part of the query created by Tableau and therefore are done in the underlying data source. They can be performed either at the granularity of the data source (a row-level calculation) or at the level of detail of the visualisation (an aggregate calculation).
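For example (an illustrative sketch using the standard Superstore fields, not calculations taken from the workbook below), a row-level calculation and an aggregate calculation might look like:

[Profit Ratio (row level)]
[Profit] / [Sales]

[Profit Ratio (aggregate)]
SUM([Profit]) / SUM([Sales])

The first is evaluated once for every row in the data source; the second is evaluated once per mark at the level of detail of the viz.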

Level of Detail expressions
Like basic calculations, level of detail expressions are also written as part of the query created by Tableau and therefore are done in the data source. The difference is that LOD expressions can operate at a granularity other than that of the data source or the visualisation. They can be performed at a more granular level (via INCLUDE), a less granular level (via EXCLUDE) or an entirely independent level (via FIXED). For more information on LOD expressions, see my series of posts starting here:
https://blog.databender.net/2015/01/22/understanding-level-of-detail-expressions-part-1/
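To illustrate the three scopes (again just a sketch using Superstore fields, not calculations from the examples below):

{INCLUDE [Order ID] : SUM([Sales])} – more granular than a viz that does not contain [Order ID]
{EXCLUDE [Region] : SUM([Sales])} – less granular than a viz that does contain [Region]
{FIXED [Customer ID] : SUM([Sales])} – always at the [Customer ID] level, independent of the viz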

Table calculations
Table calculations are performed after the query returns and therefore can only operate over values that are in the query result set.
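For example (illustrative only), a running total of sales is a table calculation that works entirely on the aggregate values already returned to the viz:

[Running Sales]
RUNNING_SUM(SUM([Sales])) compute using Order Date

Note that the inner SUM([Sales]) is the aggregate already in the result set – the table calculation never touches the underlying rows.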

One of the challenges faced by people new to Tableau is understanding which type of calculation to employ for a given problem. This post aims to provide some clarity on how to make that determination. We do so by comparing the different calculation types against one another.

#1 – Basic calculation vs. table calculation

When trying to choose between basic calculations and table calculations, the important question is “do I already have all the data values I need on the visualisation?” If the answer is yes, then you can calculate the answer without further interaction with the data source. This will often be faster as there is less data that needs to be processed (i.e. we are just computing using the aggregated values from the result set). If you do not, then you have no choice but to go to the underlying data source to calculate the answer.

Consider the following example where we ask “what is the 90th percentile of our order details, shown by country”:

Both sides of this dashboard answer the question – if you were just interested in the 90th percentile value and didn’t need to determine further insights then the chart on the left would be optimal. It provides a minimal result set (one number per country) via a basic aggregation PCT90([Sales]) which is calculated in the underlying data source.

However, if you wanted to gain further insight (e.g. understand the greater distribution and identify outliers) or add other aggregations (e.g. you also wanted the median values) then the chart on the right allows you to do that without further queries. The initial query to return all the order detail records (the green dots) provides all the data necessary to locally compute the 90th percentile as well as explore other insights.
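If you also wanted to display that locally computed 90th percentile on the right-hand chart (say, as a reference value), one way to do it – an assumption on my part, not necessarily how the workbook above is built – is with a table calculation such as:

[90th Percentile (table calc)]
WINDOW_PERCENTILE(SUM([Sales]), 0.90) compute using Row ID

Because this is computed from the result set, adding it (or a WINDOW_MEDIAN for the median) does not trigger another query.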

One of the key takeaways from this post is that the layout of the viz matters. As we’ve already discussed above, the viz design will impact how much data you initially return from the data source and this is an important factor in determining your approach. However, there are situations where, although you have all the data you need in your result set, it is not possible to achieve the required layout using a table calculation. So you also need to ask “does the layout of the viz permit me to use a table calc?”

Consider the following example where we ask for the YoY difference in sales in two formats – one as a chart and the other as a table:

The top half of this dashboard is easily achieved using a table calculation – simply duplicate the [Sales] field and apply a Difference quick table calculation, run across the [Order Date] dimension. However, if you try to then convert that computation structure into a table you end up with the following:

You will realise that it’s not possible to achieve the specified layout with a table calculation because you need the Year dimension with the Measure Names dimension nested inside it. Tableau cannot suppress the “Difference in Sales” row for 2013, so in this example your only option is to use basic calculations:

[2013 Sales]
IF YEAR([Order Date]) = 2013 THEN [Sales] END

[2014 Sales]
IF YEAR([Order Date]) = 2014 THEN [Sales] END

[Difference]
SUM([2014 Sales]) - SUM([2013 Sales])

This approach allows you to have just the Measure Names dimension, which you can sort to meet the layout requirements.
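As an aside, the Difference quick table calculation used in the top half of the dashboard is roughly equivalent to writing the formula out by hand:

[Difference in Sales]
ZN(SUM([Sales])) - LOOKUP(ZN(SUM([Sales])), -1) compute using Order Date

which makes it easy to see why the first year returns Null – there is no prior year for LOOKUP to find.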

#2 – Basic calculation vs. level of detail expression

If we do not have all the data we need on the visualisation, we need our calculation to be passed through to the data source. This means we must use a basic calculation or a LOD expression. But how to choose? The important question to ask here is “does the granularity of the question match either the granularity of the viz or the granularity of the data source?”

Basic calculations can be performed either as row-level calculations or as aggregate calculations, so they can only answer questions at the granularity of the data source or at the level of detail of the visualisation. Level of detail expressions, on the other hand, can answer questions at any granularity.

Consider the following example, where we ask “what is the 90th percentile of sales at the order detail level compared to the order total level?”:

If you are familiar with Tableau’s Superstore dataset you will know that it has one row of data per line item of each order. So if we consider the question above, we determine:

  • Granularity of data source: Order Detail
  • Granularity of viz: Country
  • Granularity of left chart: Order Detail
  • Granularity of right chart: Order

So for the left chart we can solve this with a basic calculation – PCT90([Sales]) – however for the right chart we must first total the Order Details to the Order level and then perform the percentile aggregation. So we must use a level of detail expression:

[Total Sales including Order]
{INCLUDE [Order ID] : SUM([Sales])}

We can then use the same aggregation as above – PCT90([Total Sales including Order]) – to get the answer. The following diagram explains how the LOD expression works:

Note that we use the INCLUDE expression so that Orders that are split across Countries are allocated correctly and not double-counted. Some readers might prefer to solve this problem with a FIXED expression, in which case we would need to write:

[Total Sales including Order]
{FIXED [Country], [Order ID] : SUM([Sales])}

This would be correct for the required chart but would limit our flexibility to change the grouping to some other dimension – e.g. by Region or by Shipping Type.

#3 – Table calculation vs. level of detail expression

This is the decision that many people find confusing; however, the process for choosing between a table calculation and a LOD expression is the same as for choosing between a table calculation and a basic calculation. You need to ask the same questions: “do I already have all the data values I need on the visualisation?” and “does the layout of the viz permit me to use a table calc?”

Consider the following example where we ask “what is the 90th percentile of sales at the order total level, shown by country?”:

You will notice that this is almost identical to the question asked in #1 above. The only difference is that the percentile calculation is done on the order total, not the order detail. You may also realise that the chart on the left side is in fact the same chart we saw on the right side in #2. We already know the granularity of this problem differs from both the data source and the viz, so we should use a LOD expression.

The chart on the right side is the same as the right side from #1, however the dots now represent Orders, not Order Details. This is simply done by changing the granularity of the viz (swap Row ID for Order ID on the Detail shelf). Because table calculations keep the calculation logic separate from the computation scope and direction, we don’t even need to change the calculation – just compute using Order ID.
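For example (a sketch of one way to build it, not necessarily how the workbook does it), the right-hand chart could compute the percentile locally with:

[90th Percentile of Order Totals (table calc)]
WINDOW_PERCENTILE(SUM([Sales]), 0.90) compute using Order ID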

It can get tricky to be sure about the answer to our decision process questions, and sometimes you can solve a problem one way until you later introduce a complication. Consider the following example where we ask “for each age group, what percent of illnesses does each disease account for?”:

This is clearly a percent-of-total problem and we can solve it very quickly with a Percent of Total quick table calculation computed across Disease. However, when we then add the complication of allowing the user to filter to a specific disease, we find the following:

This is because our result set no longer contains all the data we need – the filter has removed the Patient Count data for the other diseases. You could solve this by making the filter a table calculation:

[Filter Disease]
LOOKUP(MIN([Disease]), 0) compute using Disease

Or you can use LOD expressions, knowing that FIXED calculations are evaluated before dimension filters are applied. First, work out the total number of patients in each age group:

[Total Patients per Age Group]
{FIXED [Age] : SUM([Patient Count])}

Then you can compute the % total:

[Pct Total]
SUM([Patient Count]) / SUM([Total Patients per Age Group])

#4 – When only table calculations will do

Finally, we need to add one final concept to our decision process. There are several categories of problems that can only be solved using table calculations:

  • Ranking
  • Recursion (e.g. cumulative totals)
  • Moving calculations (e.g. rolling averages)
  • Inter-row calculations (e.g. period vs. period calculations)

So the question to ask here is “does my problem require the use of ranking, recursion, moving calculations or inter-row calculations?”
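Ranking is one such case – it needs a table calculation along the lines of the following (illustrative only):

[Sales Rank]
RANK(SUM([Sales])) compute using the dimension you want to rank (e.g. Country)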

This is because table calculations can output multiple values for each partition of data, while LOD and basic expressions output a single value for each partition/grouping of data. Consider the following exercise where we ask “in the last year, how many times has a stock broken its record close value to date?”:

We need a recursive calculation here – to tell whether a given day sets a new maximum, we need to consider all the values that came before it. We can do this with a RUNNING_MAX function. So we first calculate the highest value to date:

[Record to Date]
RUNNING_MAX(AVG([Close])) compute using Day

Then at the day level we need to flag those days where the record was broken:

[Count Days Record Broken]
IF AVG([Close]) = [Record to Date] THEN 1 ELSE 0 END

Finally, we need to count these days:

[Total Times Record Broken]
RUNNING_SUM([Count Days Record Broken]) compute using Day

Conclusion

The key takeaways from all this are:

  • There is no silver bullet. The answer is always “it depends…” but the decision process will get you started selecting the right approach.
  • The layout of the viz matters – as it changes you might need to change your calculation type.
  • There are scenarios where different solutions perform differently, depending on the volume and complexity of your data, the complexity of the question, and the required layout.
  • There are always trade-offs to consider (performance vs. flexibility vs. simplicity) – a good rule of thumb is that you can pick any two.

For more information and examples, you can download the above workbook from the following link: https://www.dropbox.com/s/hzuza3j347wdkxg/FINISH%20-%20Optimizing%20Calculation%20Methods.twbx?dl=0

You can also download a PPT with the above flowcharts and LOD diagrams from here: https://www.dropbox.com/s/t0ig6116f8j1ncq/For%20Blog%20Post.pptx?dl=0

Enjoy!
