45 degree reference line with dynamic axes

I read a great blog post today from Andy Kriebel where he shows how to create a 45-degree reference line on a scatter plot:

http://vizwiz.blogspot.com.au/2016/03/tableau-tip-tuesday-how-to-create-45.html

His solution is very elegant, using a trend line against a secondary axis to create the reference line. However, one limitation he points out in his video is that if the X and Y axes have different scales, it doesn't work.

You either need to live with the visual confusion or fix the axes, but the latter isn't a good solution when a filter can cause the range of the data to change significantly. A better approach is to ensure that both axes always show the same range, and one way to do this is via reference lines.

First, we need to determine the maximum value on either of the two axes – we can do this with the MAX() function as follows:

MAX(
WINDOW_MAX(SUM([Selected Year Sales])),
WINDOW_MAX(SUM([Prev Year Sales]))
)
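
(In case you are wondering what [Selected Year Sales] and [Prev Year Sales] might look like: something along these lines, assuming a [Select Year] parameter. This is just a sketch; your own definitions may differ.)

[Selected Year Sales]
IF YEAR([Order Date]) = [Select Year] THEN [Sales] END

[Prev Year Sales]
IF YEAR([Order Date]) = [Select Year] - 1 THEN [Sales] END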

We can then use this to plot a reference line on both axes – as you can see this forces them to have the same range:

By formatting the reference lines to have no line and no label they are effectively invisible, but they still control the maximum range of the axes:

You can download the workbook with an example of this technique here:

https://www.dropbox.com/s/97wxik2e3d0yynt/45%20ref%20line.twbx?dl=0

Enjoy!


Choosing the right calculation type…

[Last year at Tableau Conference in Vegas, I co-presented a session on this topic with the fantastic Bethany Lyons from our UK offices. I delivered it again last week at Tableau Conference on Tour in Sydney to a very positive response so to reach a broader audience I’m sharing the key points from that presentation in this post. A lot of the initial work, including developing the examples, was done by Bethany so much of the credit must go to her. Thanks, Bethany – you rock!]

Today, Tableau has multiple forms of calculation:

Basic Calculations
These calculations are written as part of the query created by Tableau and therefore are done in the underlying data source. They can be performed either at the granularity of the data source (a row-level calculation) or at the level of detail of the visualisation (an aggregate calculation).
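
To make the distinction concrete, here is a row-level calculation and an aggregate calculation side by side (a sketch using Superstore field names; they give different answers because they operate at different granularities):

[Profit Ratio (row level)]
[Profit] / [Sales]

[Profit Ratio (aggregate)]
SUM([Profit]) / SUM([Sales])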

Level of Detail expressions
Like basic calculations, level of detail expressions are also written as part of the query created by Tableau and therefore are done in the data source. The difference is that LOD expressions can operate at a granularity other than that of the data source or the visualisation. They can be performed at a more granular level (via INCLUDE), a less granular level (via EXCLUDE) or an entirely independent level (via FIXED). For more information on LOD expressions, see my series of posts starting here:
https://blog.databender.net/2015/01/22/understanding-level-of-detail-expressions-part-1/
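
As a quick example (again a sketch using Superstore field names, not a field from the workbook below), a FIXED expression can compute each customer's first order date regardless of what is on the viz:

[First Order Date]
{FIXED [Customer ID] : MIN([Order Date])}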

Table calculations
Table calculations are performed after the query returns and therefore can only operate over values that are in the query result set.
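
For example, a running total is a table calculation that operates purely on the aggregated values already returned to the viz (a sketch using Superstore's [Sales]; the direction it runs in is set via compute using):

[Running Total of Sales]
RUNNING_SUM(SUM([Sales]))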

One of the challenges faced by people new to Tableau is understanding which type of calculation to employ for a given problem. This post aims to provide some clarity on how to make that determination. We do so by comparing the different calculation types against one another.

#1 – Basic calculation vs. table calculation

When trying to choose between basic calculations and table calculations, the important question is “do I already have all the data values I need on the visualisation?” If the answer is yes, then you can calculate the answer without further interaction with the data source. This will often be faster as there is less data that needs to be processed (i.e. we are just computing using the aggregated values from the result set). If you do not, then you have no choice but to go to the underlying data source to calculate the answer.

Consider the following example where we ask “what is the 90th percentile of our order details, shown by country”:

Both sides of this dashboard answer the question – if you were just interested in the 90th percentile value and didn’t need to determine further insights then the chart on the left would be optimal. It provides a minimal result set (one number per country) via a basic aggregation PCT90([Sales]) which is calculated in the underlying data source.

However, if you wanted to gain further insight (e.g. understand the greater distribution and identify outliers) or add other aggregations (e.g. you also wanted the median values) then the chart on the right allows you to do that without further queries. The initial query to return all the order detail records (the green dots) provides all the data necessary to locally compute the 90th percentile as well as explore other insights.
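
One way to compute that 90th percentile locally is a table calculation along the following lines, computed across the dots within each country (a sketch; a percentile reference line achieves the same result visually):

[90th Percentile (local)]
WINDOW_PERCENTILE(SUM([Sales]), 0.90)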

One of the key takeaways from this post is that the layout of the viz matters. As we’ve already discussed above, the viz design will impact how much data you initially return from the data source and this is an important factor in determining your approach. However, there are situations where, although you have all the data you need in your result set, it is not possible to achieve the required layout using a table calculation. So you also need to ask “does the layout of the viz permit me to use a table calc?”

Consider the following example where we ask for the YoY difference in sales in two formats – one as a chart and the other as a table:

The top half of this dashboard is easily achieved using a table calculation – simply duplicate the [Sales] field and apply a Difference quick table calculation, run across the [Order Date] dimension. However, if you try to then convert that computation structure into a table you end up with the following:

You will realise that it's not possible to achieve the specified layout with a table calculation, as you need the Year dimension with the Measure Names dimension nested inside it. Tableau cannot suppress the "Difference in Sales" row for 2013, so in this example your only option is to use basic calculations:

[2013 Sales]
IF YEAR([Order Date]) = 2013 THEN [Sales] END

[2014 Sales]
IF YEAR([Order Date]) = 2014 THEN [Sales] END

[Difference]
SUM([2014 Sales]) - SUM([2013 Sales])

This approach allows you to just have the Measure Names dimension which you can sort to meet the layout requirements.

#2 – Basic calculation vs. level of detail expression

If we do not have all the data we need on the visualisation, we need our calculation to be passed through to the data source. This means we must use a basic calculation or a LOD expression. But how to choose? The important question to ask here is “does the granularity of the question match either the granularity of the viz or the granularity of the data source?”

Basic calculations can be performed either as row-level calculations or as aggregate calculations, so they can only answer questions at the granularity of the data source or at the level of detail of the visualisation. Level of detail expressions, on the other hand, can answer questions at any granularity.

Consider the following example, where we ask “what is the 90th percentile of sales at the order detail level compared to the order total level?”:

If you are familiar with Tableau's Superstore dataset you will know that it has one row of data per line item of each order. So if we consider the question above, we determine:

  • Granularity of data source:    Order Detail
  • Granularity of viz:        Country
  • Granularity of left chart:    Order Detail
  • Granularity of right chart:    Order

So for the left chart we can solve this with a basic calculation – PCT90([Sales]) – but for the right chart we must first total the Order Details to the Order level and then perform the percentile aggregation, so we must use a level of detail expression:

[Total Sales including Order]
{INCLUDE [Order ID] : SUM([Sales])}

We can then use the same aggregation as above – PCT90([Total Sales including Order]) – to get the answer. The following diagram explains how the LOD expression works:

Note that we use the INCLUDE expression so that Orders that are split across Countries are allocated correctly and not double-counted. Some readers might prefer to solve this problem with a FIXED expression, in which case we would need to write:

[Total Sales including Order]
{FIXED [Country], [Order ID] : SUM([Sales])}

This would be correct for the required chart but would limit our flexibility to change the grouping to some other dimension – e.g. by Region or by Shipping Type.

#3 – Table calculation vs. level of detail expression

This is the decision that many people find confusing. However, the process to choose between a table calculation and a LOD expression is the same as for a table calculation vs. a basic calculation. You need to ask the same questions: do I already have all the data values I need on the visualisation, and does the layout of the viz permit me to use a table calc?

Consider the following example where we ask “what is the 90th percentile of sales at the order total level, shown by country?”:

You will notice that this is almost identical to the question asked in #1 above. The only difference is that the percentile calculation is done on the order total, not the order detail. You may also realise that the chart on the left side is in fact the same chart we saw on the right side in #2. We already know the granularity of this problem is different to that of the data source and the viz, so we should use a LOD expression.

The chart on the right side is the same as the right side from #1 however the dots represent Orders, not Order Details. This is simply done by changing the granularity of the viz (swap Row ID with Order ID on the Detail shelf). Because table calculations keep the calculation logic separate from the computation scope and direction we don’t even need to change the calculation – just compute using Order ID.

It can get tricky to be sure about the answer to our decision process questions, and sometimes you can solve a problem one way until you later introduce a complication. Consider the following example where we ask “for each age group, what percent of illnesses does each disease account for?”:

This is clearly a percent of total problem, and we can solve it very quickly with a quick table calculation over Disease. However, when we then add the complication of allowing the user to filter to a specific disease we find the following:

This is because our result set no longer contains all the data we need – the filter has removed the Patient Count data for the other diseases. You could solve this by making the filter a table calculation:

[Filter Disease]
LOOKUP(MIN([Disease]), 0) compute using Disease
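
With that late filter in place, the original percent of total quick table calculation still sees every disease in the partition. Written out by hand, it is equivalent to something like the following, computed across Disease:

[Pct Total (table calc)]
SUM([Patient Count]) / TOTAL(SUM([Patient Count]))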

Or you can use LOD expressions, knowing that FIXED calculations are performed before dimension filters are applied. First, work out the total number of people in an age group:

[Total Patients per Age Group]
{FIXED [Age]:SUM([Patient Count])}

Then you can compute the % total:

[Pct Total]
SUM([Patient Count])/SUM([Total Patients per Age Group])

#4 – When only table calculations will do

Finally, we need to add one more concept to our decision process. There are several categories of problems that can only be solved using table calculations:

  • Ranking
  • Recursion (e.g. cumulative totals)
  • Moving calculations (e.g. rolling averages)
  • Inter-row calculations (e.g. period vs. period calculations)

So the question to ask here is “does my problem require the use of ranking, recursion, moving calculations or inter-row calculations?”
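
Ranking is the quickest illustration. A minimal example, using Superstore's [Sales] and computed across whichever dimension you want ranked (a sketch, not a field from the workbook below):

[Sales Rank]
RANK(SUM([Sales]))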

These problems need table calculations because table calculations can output multiple values for each partition of data, while LOD and basic expressions output a single value for each partition/grouping of data. Consider the following exercise where we ask "in the last year, how many times has a stock broken its record close value to date?":

We need a recursive calculation here – for each day we need to consider all the previous values to tell whether it sets a new maximum. We can do this with a RUNNING_MAX function. So we first calculate the highest value to date:

[Record to Date]
RUNNING_MAX(AVG([Close])) compute using Day

Then at the day level we need to flag those days where the record was broken:

[Count Days Record Broken]
IF AVG([Close]) = [Record to Date] THEN 1 ELSE 0 END

Finally, we need to count these days:

[Total Times Record Broken]
RUNNING_SUM([Count Days Record Broken]) compute using Day

Conclusion

The key takeaways from all this are:

  • There is no silver bullet. The answer is always “it depends…” but the decision process will get you started selecting the right approach.
  • The layout of the viz matters – as it changes you might need to change your calculation type.
  • There are scenarios where different solutions perform differently, depending on the volume and complexity of your data, the complexity of the question, and the required layout.
  • There are always trade-offs to consider (performance vs. flexibility vs. simplicity) – a good rule of thumb is that you can pick any two.

For more information and examples, you can download the above workbook from the following link: https://www.dropbox.com/s/hzuza3j347wdkxg/FINISH%20-%20Optimizing%20Calculation%20Methods.twbx?dl=0

You can also download a PPT with the above flowcharts and LOD diagrams from here: https://www.dropbox.com/s/t0ig6116f8j1ncq/For%20Blog%20Post.pptx?dl=0

Enjoy!


New Features == Better Solutions

Some time back I used an example of plotting great circle routes on a map as a way to demonstrate data densification (https://blog.databender.net/2014/06/27/curved-lines-in-tableau-through-data-densification/).


To do data densification for the interpolated points we need two records for each route – one representing the start of the line and the other representing the end. In the original solution I used custom SQL to generate this scaffolding – running the query twice, each time returning a different number for the PointID, then unioning the two result sets together:

SELECT [Routes+Airports].[airline] AS [airline],  
[Routes+Airports].[from_code] AS [from_code],  
[Routes+Airports].[from_lat] AS [from_lat],  
[Routes+Airports].[from_long] AS [from_long],  
[Routes+Airports].[num_flights] AS [num_flights],  
[Routes+Airports].[to_code] AS [to_code],  
[Routes+Airports].[to_lat] AS [to_lat],  
[Routes+Airports].[to_long] AS [to_long],  
1 as [PointID]
FROM [Routes+Airports] 

UNION 

SELECT [Routes+Airports].[airline] AS [airline],  
[Routes+Airports].[from_code] AS [from_code],  
[Routes+Airports].[from_lat] AS [from_lat],  
[Routes+Airports].[from_long] AS [from_long],  
[Routes+Airports].[num_flights] AS [num_flights],  
[Routes+Airports].[to_code] AS [to_code],  
[Routes+Airports].[to_lat] AS [to_lat],  
[Routes+Airports].[to_long] AS [to_long],  
<Parameters.NumPoints> as [PointID]
FROM [Routes+Airports]

Clearly this is not an optimal solution, as it a) requires a custom query, which is inflexible and requires the user to know how to write SQL, and b) has to query the data source twice, which is inefficient. However, at the time it was the only way to generate two result-set records per source record. I was messing about with this workbook yesterday and it occurred to me that we now have a better way to solve this problem. In Tableau 9 we introduced a new feature for data preparation – the PIVOT feature. We can now use this to generate the required scaffolding without custom SQL or multiple passes of the source data.

Here’s our input data:


We can pivot this on the from/to airport codes:


And the result is two records per route:


We can now create a calculated field to assign starting and ending PointIDs:

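A sketch of what that calculated field might look like, assuming the default pivot column name of [Pivot Field Names] (whose values will be the original column names, "from_code" and "to_code") and the existing NumPoints parameter:

[PointID]
IF [Pivot Field Names] = "from_code" THEN 1 ELSE [NumPoints] END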

From there, the densification technique is much the same as the original. However, this time around I wanted to avoid using R to calculate the interpolated values as it is not something all users have set up. Dusting off my Google-fu, I found http://williams.best.vwh.net/avform.htm#Int which had the formula I needed to do the calculation natively in Tableau:

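For reference, the intermediate point formula from that page looks like this, where f is the fraction along the route (0 at the start, 1 at the end), d is the angular distance between the two endpoints and all angles are in radians. The Tableau version is essentially these lines rewritten as calculated fields using SIN(), COS(), SQRT() and ATAN2():

A = sin((1-f)*d) / sin(d)
B = sin(f*d) / sin(d)
x = A*cos(lat1)*cos(lon1) + B*cos(lat2)*cos(lon2)
y = A*cos(lat1)*sin(lon1) + B*cos(lat2)*sin(lon2)
z = A*sin(lat1) + B*sin(lat2)
lat = atan2(z, sqrt(x^2 + y^2))
lon = atan2(y, x)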

The result is a much better performing workbook – partly because we are reading the input data more efficiently and partly because we are using native calculations.


You can download the TWBX for this workbook here.

Enjoy!


Generating high resolution images of Tableau views

[This is one of those posts where I am going to show you something cool and useful, and then advise you that this is something you shouldn’t do. The URL in this post is undocumented and unsupported (by Tableau and by me) so if you are not comfortable with that, go no further. Here, there be dragons…]

Below is a conversation I have semi-regularly with customers:

Customer:  Hey Alan! We love our dashboards and we’d like to use them in some printed materials we produce. Can we do that?

Alan:   Of course you can. You just copy or export an image file and you have a nice PNG, JPG or BMP to work with. Whack that in there and away you go!

Customer:  Yeah, we tried that. But the image file formats are produced at screen resolution and become fuzzy when printed as part of a document. Can we get something better?

Alan:  OK – then try using a vector file format. You can export to EMF or PDF, which produce nice, scalable output.

Customer:  Again, we tried that. It’s pretty good, but it doesn’t work well for embedded bitmap content like map backgrounds, and we can’t export to EMF from Tableau Server so we can’t automate the process very well. What else can we try?

Alan:  Hmm… let me have a think about it.

[Walks over to the marketing team…]

Alan:  Hey guys! How do you produce the nice looking images you use in your handouts and printed materials?

Marketing:  We open them in Tableau Desktop on a Mac and take a screenshot to get a retina-resolution image.

Alan:  Hmm…

So I started looking for a way to have Tableau produce a high resolution image in a way that could be automated, and an idea came to mind. With the new Tableau mobile applications on iOS we have offline snapshots – these are high resolution images of the dashboards/views/stories/etc. They are rendered at retina resolution so is there any way we could use these?

I started digging around and eventually I found that these applications are using a special URL to request these images:

http://<server>/vizql/sheetimage/<view_id>?:pixelratio=2

The view ID is the internal Tableau ID from the repository. You can query it from there, or you can find it by browsing to the view in Tableau Server and then searching the HTML of the page for “current_view_id”. In Chrome, just open the developer tools (F12 or CTRL-SHIFT-I) and then CTRL-F to search:

current_view_id
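
If you would rather query the repository directly, something along these lines should work when run as the readonly user (a sketch only; the workgroup schema is undocumented and can change between versions, and 'My Workbook' is a placeholder for your workbook's name):

-- list view IDs and names for a given workbook
SELECT v.id, v.name, w.name AS workbook_name
FROM views v
JOIN workbooks w ON w.id = v.workbook_id
WHERE w.name = 'My Workbook';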

By adjusting the :pixelratio parameter we can generate higher resolution images:

:pixelratio=1
This is the standard resolution for most devices.

:pixelratio=2
This is the resolution we generate automatically for retina devices.

:pixelratio=4
But we can take this higher…

:pixelratio=10
Much, much higher.

This allows us to create high-resolution images of our dashboards. Playing around with this I’ve discovered a few traps for young players:

  • You need to be logged in to a Tableau Server session to hit this URL. If you aren’t then you get the following error message:

{"result":{"errors":[{"code":46}]}}

  • If your dashboard contains maps, using a :pixelratio greater than 2 won’t help as the maps are generated at :pixelratio=1 and blown up so they become grainy. Using :pixelratio=2 tells Tableau to use the hi-dpi map tiles like we would on a retina device.

The final step was to find a way to automate the production of the images via script. Because this is an internal URL it isn't part of the traditionally scriptable interfaces such as TABCMD or the REST API. However, using cURL (a command-line URL transfer tool) and trusted authentication, we can create a simple script to emulate a user logging in and then fetching an image from this URL:

REM - Obtain a trusted authentication ticket and load it into an environment variable.
curl http://<server>/trusted -d "username=alan" > ticket.txt
set /P ticket=<ticket.txt

REM - Use the ticket to log in using any workable URL. The session cookie is persisted to a file.
curl "http://<server>/trusted/%ticket%/workbook.xml" -L -c cookie.txt

REM - Fetch the image file from the new URL using session data from the cookie file.
curl "http://<server>/vizql/sheetimage/1?:pixelratio=2" -b cookie.txt -o image.png

Note that this requires trusted authentication to be configured with unrestricted tickets enabled.

Enjoy – but remember this is unsupported and the URL structure could change at any time. Make hay while the sun shines, friends!


Even Hexier than Before…

[This blog post is a result of working with Sarah Battersby (Spatial Overlord and all-around Crazy Map Lady) from our Seattle office. We worked together for a TC15 presentation “Go Deep: Interpreting Dense Data with Tableau” where she presented the technique for creating uniformly spaced hexbins. Thanks for doing all the hard work, Sarah! 🙂 ]

At TC15 I had the pleasure of presenting a session with Sarah Battersby, a research scientist from the Seattle Tableau office who is a specialist in hex. We talked about dealing with dense data, and in her session she covered heat maps, which naturally led to a discussion around hexbins.

Back in March I wrote a couple of posts that showed how to create polygons that wrapped around the center points of hexbins calculated using the HEXBINX() and HEXBINY() functions. Here’s the output:

Distorted

If you look closely, you can see that the shape of the polygons is gradually distorting as we move away from the Equator. By the time we reach Tasmania they are visibly taller than they are wide. This is due to the distortion effects of the Web Mercator projection we use in Tableau.

For some customers, it may be preferable to have polygons that are uniformly sized no matter where they are on the map. I'm not going to go into the pros and cons of doing this, or how you might be distorting your data given that the hexbins will no longer cover the same amount of the Earth's surface – I'll leave the details of that for Sarah if/when she ever finds this post. In any case, Sarah presented an elegant solution to this problem:

  • convert your lat/lon data into Web Mercator projection coordinates
  • create the hexbins in this coordinate system
  • calculate the vertex coordinates in WM coordinates
  • convert back to lat/lon and draw on the Tableau map

She outlines the maths for this in her slides:

Go Deep: Interpreting Dense Data with Tableau
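
If you just want the headline conversions from steps one and four, the spherical Web Mercator formulas can be written with native Tableau functions. This is a sketch, assuming [Lat] and [Lon] fields in degrees and the standard radius of 6378137 metres; your field names will differ:

[X (metres)]
6378137 * RADIANS([Lon])

[Y (metres)]
6378137 * LN(TAN(PI()/4 + RADIANS([Lat])/2))

[Lat (from metres)]
DEGREES(2 * ATAN(EXP([Y (metres)] / 6378137)) - PI()/2)

[Lon (from metres)]
DEGREES([X (metres)] / 6378137)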

You can see the results here – notice the hexagons are now uniformly sized no matter how far south we travel:

Tessellated

You can download the workbook from here to see the solution in action. Enjoy!


Great community post on Tableau and R integration

I just got pointed to this post on the Tableau forums. Jonathan Drummey has compiled a great technical resource on how Tableau integrates with R via the SCRIPT_*() functions. A recommended read!

http://community.tableau.com/docs/DOC-5313


Designing Efficient Workbooks – the V9.0 edition!

Update 16 June 2016 – this has been superseded by the V10.0 edition, available here. Unless you are specifically looking for V9 advice, you should use the new document. Even if you aren't, the new version is probably a better resource, with updated structure and content.


Well, it’s finally done.

Designing Efficient Workbooks V9

Thanks to all the people whose hard work has contributed to this document, and to all the reviewers whose input has helped make this approachable.

This will be available via the Tableau blog soon, but this is for the early adopters. Enjoy!
