Points and Polygons in Tableau 10.2

I have previously written about how the features we keep adding to Tableau let us build better solutions to problems that used to be difficult. Here is another example: plotting points and polygons together on a single Tableau map. In an earlier blog post I showed how to present points and polygons together, but in Tableau 10.1 and earlier it was a complex process that required significant data preparation.

In Tableau 10.2 we are introducing a spatial file connector that makes it much easier to plot polygons and points together. You can now simply present your polygon data to Tableau in ESRI shapefile, KML or MapInfo format:


and Tableau can directly plot the polygons:


Note that this is a much less complex viz structure than in earlier versions of Tableau. We are not using a vertex list anymore – we simply double-click on the new Geometry object (the globe field in the Measures section) and Tableau takes care of the rest. Note that in the above viz we have the (generated) latitude and longitude fields on the row/column shelves and the COLLECT(Geometry) measure on the details shelf.

When we want to overlay points on this polygon map, we simply need our point data in a spatial file format as well. For this example, I had raw lat/lon data in a CSV file, which I converted to a spatial object using Alteryx. I also used some of Alteryx’s spatial matching features to tag each location with the name of the SA3 region in which it is contained (it will become clear later why this was done):


Once again we can simply connect to the resulting shapefile:


and plot the points on the map – note the layout of the viz is the same as for the polygon map above with the (generated) lat/lon fields and the COLLECT(Geometry) field on details:


To combine the two spatial data sets we can use the new cross-data-source join feature introduced in Tableau 10 to join the two shapefiles:


Starting with the previous polygon map, we can CTRL-drag the longitude field on the column shelf to duplicate the map and swap the COLLECT(Geometry) field on the detail shelf of the second map to show the Geometry field from the point location data source. Set the mark type to a red circle and you will see the following:


If we make the map a dual-axis map, we now have points and polygons shown together. Yay! This is a much simpler (and more useful) solution than the previous approach.


Finally, because both data sets contain the SA3 name, we can use it to highlight a polygon and the set of points within it, allowing interactions that previously were not easy to do:


So thanks, Tableau development team, for adding these new capabilities and making old, complex techniques obsolete.

You can download the workbook and data for this example here.


Filled maps and low bandwidth connections

I regularly work from my home office which is where my demo server is also located. Having a gigabit Ethernet connection to the server means network bandwidth/latency isn’t something I pay any attention to most of the time. However, the other day I was doing a demo at a customer’s office (my laptop connected to the internet via my phone’s hotspot) and noticed that some of my workbooks were considerably slower to open than I remembered them to be. Not all, but certainly some – and with further investigation I narrowed it down to a couple of dashboards:


Notice anything in common? Yep – they both have filled maps. I wondered whether that had anything to do with the slow performance.

Once I got home, I opened up the offending dashboards again and the response time was nice and quick. Not as quick as other workbooks (which can be almost instant) but still fast enough not to concern me. Suspecting this was something to do with the network, I used Chrome’s developer tools (press CTRL-SHIFT-I) to look at the network traffic for the first dashboard above. There was the answer:

Because we are using client-side rendering, the bootstrap package has to pass all the data to the browser so it can draw the viz locally. This includes the marks, which for filled maps are complex polygons that require much more data to describe than a simple line or bar chart. Consequently, the bootstrap payload for a dashboard with filled maps can be ~1000x the size of an equivalent bar chart or even a symbol map!

To really test this, I created a simple workbook showing all 2,647 postcodes in Australia. One view displayed the data as a simple tabular list, another used a symbol map and the third used a filled map. I published this workbook to my demo server and used the Chrome developer tools to record the bootstrap package size and the total data transfer to the browser. I also used the developer tools to throttle the network to simulate a 4G connection (a really neat tool!) and recorded the time to last byte. The results were:

  • Tabular list: bootstrap package = 2.6 KB; total transfer = 38 KB; total time = ~1.5s
  • Symbol map: bootstrap package = 61.9 KB; total transfer = 182 KB; total time = ~3.5s (including fetching map tiles)
  • Filled map: bootstrap package = a whopping 2.2 MB; total transfer = 2.3 MB; total time = ~10s (including fetching map tiles)

Clearly, filled maps require a much larger data transfer to the client and should therefore be used sparingly in deployments where bandwidth is limited. Or you could switch to server-side rendering – here are the results for the same workbook using the ?:render=no URL parameter – much faster!

  • Tabular list: bootstrap package = 2.6 KB; total transfer = 38 KB; total time = ~1.5s
  • Symbol map: bootstrap package = 3 KB; total transfer = 111 KB; total time = ~2s
  • Filled map: bootstrap package = 3 KB; total transfer = 139 KB; total time = ~2s
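
If you want to try this yourself, the parameter is simply appended to the view URL. The server, workbook and view names below are made up for illustration:

http://myserver/views/PostcodeTest/FilledMap?:render=no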

Enjoy!


Updated ABS boundaries for 2016

Edit July 2017 – It seems a lot of people are still coming to this page from search engines. If this is you, hi! If you are using Tableau 10.2 or later let me be clear… you no longer need these files nor do you need to use the techniques described to show custom regions. Tableau can now read the ESRI boundary files from the ABS website natively. This gives you a much better experience and makes your workbooks much simpler. Check out the Tableau blog post on this new feature here.

One of the most popular posts on my blog is Australian region boundaries in Tableau. Many people come to it looking for the ASGS boundary data files I have converted for use in Tableau. The Australian Bureau of Statistics recently released an update to the ASGS main structure with changes to:

  • Mesh blocks
  • Statistical Area Levels 1 – 4 (SA1 – SA4)
  • Greater Capital City Statistical Areas (GCCSA)

You can read the details about the update here.

I’ve updated the boundary files for these structures and put them in the Dropbox folder, here.

I’ve also generated the boundaries at different resolutions – 10m, 100m and 200m. The finer the resolution, the more data points (e.g. the SA1 file has 781K points @ 200m, 1M points @ 100m, and 2.6M points @ 10m). Use the version that best suits your needs – if you are showing the whole country, use the 200m resolution file; if you are zoomed in on a small area use the 10m resolution file.


Enjoy!


Description fields that work across record types

[Today’s post is just a little trick – not that hard to work out. But I liked it, and it’s my blog, so I’m writing about it. Also, after all the efficient workbook and densification stuff I could use the break.]

I’ve recently been working on some dashboards that run against the Tableau Server audit tables so I can see what my troublesome colleagues are up to on my demo server. One of the views I created was a timeline showing what is being done:

Event Timeline

I wanted the tooltip to show the name of the object that was being accessed/created/deleted/modified, but the name is in a different field depending on the object type – e.g. workbook, view, data source, data connection, etc. And only one or two of those fields would be valid for any event record. Additionally, some events require a concatenation of the names – e.g. if I’m accessing a view I want to also know the name of the workbook that it comes from:

Simply concatenating the fields together doesn’t work as some of the values will always be NULL, and any computation against a NULL results in a NULL. So we dutifully pull out the IFNULL() function and get to work:

ifnull([DC-Name], "") + ";" +
ifnull([DS-Name], "") + ";" +
ifnull([WB-Name], "") + ";" +
ifnull([VW-Name], "")

The problem with this is that for a data connection the label shows:

Exasol 1GB TPC H;;;

while for a view it shows:

;;Server Audit Data;Event Timeline

So – combining our previous finding (a computation against NULL returns NULL) – we can modify the calculation to move the separator inside the IFNULL():

ifnull([DC-Name] + "; ", "") +
ifnull([DS-Name] + "; ", "") +
ifnull([WB-Name] + "; ", "") +
ifnull([VW-Name], "")

And voilà! When a name is NULL, the name-plus-separator expression is also NULL, so IFNULL() replaces the whole segment with an empty string and we get exactly the label we want.

Enjoy!


Manufacturing a polygon from a single lat/lon coordinate

A little while ago I wrote a blog post on how to use pivot and densification to create great arc lines on a map without having to read the source data multiple times. In that example, each record had two fields I could pivot to create the two records – the from/to airport codes. But what if I don’t have a field that I can pivot? What if my data looks like this?


Can I create this? (I think you can probably guess the answer…)


The universal rule is that in order to draw a densified line/polygon we need two records – one marking the start point and the other marking the end point. We then use densification techniques to fabricate the intervening points and calculate their location.

You might think that this data set doesn’t have fields to pivot, but of course it does – lat and lon! In order to convert my single record into two records I pivot on these fields:


We then use the pivot field name to generate our Point ID:

Point ID:
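// the "Latitude" pivot row becomes point 1; the "Longitude" row becomes the last point ([NumPoints])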
if [Lat&Lon] = "Latitude" then 1 else [NumPoints] end

And we unpivot our lat/lon values:

Unpivot Lat:
if [Lat&Lon] = "Latitude" then [Lat&Lon Value] end

Unpivot Lon:
if [Lat&Lon] = "Longitude" then [Lat&Lon Value] end


We then replicate the lat/lon values to fill in the NULLs via LOD calculations:

Lat:
{fixed [Record ID] : max([Unpivot Lat])}

Lon:
{fixed [Record ID] : max([Unpivot Lon])}

And that’s the hard work done! We now have two records per original row, with Point ID values of 1 and a parameter-controlled maximum. We then create a bin of width 1 on the Point ID field to drive our densification, used with table calculations like these:

Angle:
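// 6.28318 ≈ 2π – spread a full circle of radians evenly across the [NumPoints] densified points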
(6.28318 / [NumPoints]) * (index()-1)

Circle Lat:
window_avg(min([Lat])) + [SectorLength]*cos([Angle (circle)])

Circle Lon: 
window_avg(min([Lon])) + [SectorLength]*sin([Angle (circle)])

Now we can assemble our viz:


What started me down this path was revisiting my old mobile towers viz to get it off the previous custom SQL UNION approach. It now runs faster as it only does a single pass over the data. The logic is the same as the example above but the calculations are (obviously) different as we are plotting a sector, not a full circle:

And finally (because I sometimes go a little nuts) I used the technique discussed in this post to eliminate the projection distortion from the circles in my original example, and allow the user to set the radius in real-world measurements.

All the working is available in the workbook, which you can download here.

Enjoy!


Best Practices for Designing Efficient Tableau Workbooks, the V10.0 edition

With the impending release of Tableau 10, I have been refreshing my whitepaper on how to author efficient Tableau workbooks.


Once again, I would like to acknowledge that the document is a distillation of materials written by many authors. Much of what I have done is to bring their work together into a single document and apply some structure. Some people reading it will recognise their fingerprints across sections (in fact some will recognise entire swathes of text). To all of you I give thanks because without your continued excellent work and lack of copyright infringement claims, the document would not exist.

I would also like to thank the many people who have reviewed this document for accuracy and readability. Your attention to detail and patient explanations have made this a much more readable document than I could ever have produced on my own.

Herewith, I present…

Best Practices for Designing Efficient Workbooks, the V10.0 edition.

Enjoy!


You Can Never Be Too Hexy…

Some time ago I wrote a post about how to create hexagonal heat maps. Central to the technique is the need for a “path” field to plot the 6 vertices around the polygon marks. As this is a synthetic value it won’t exist in the source data, so we need to create it. The simplest way would be to just read the data multiple times, once for each PointID value:

SELECT …fields…, 1 as [PointID]
FROM Table
UNION
SELECT …fields…, 2 as [PointID]
FROM Table
UNION
SELECT …fields…, 3 as [PointID]
FROM Table
UNION
SELECT …fields…, 4 as [PointID]
FROM Table
UNION
SELECT …fields…, 5 as [PointID]
FROM Table
UNION
SELECT …fields…, 6 as [PointID]
FROM Table

Clearly this isn’t a great approach as it requires us to read the data 6 times, and for anything other than trivial volumes this becomes an issue. Another way is to simply generate the PointID value endpoints (1 and 6) and use a feature in Tableau called data densification to populate the interpolated values:

SELECT …fields…, 1 as [PointID]
FROM Table
UNION
SELECT …fields…, 6 as [PointID]
FROM Table

This is better as now we only need to read the data twice, but we’re still duplicating data. What if we could do it with just a single pass over the data? That would be even better…

One of my colleagues, Mary Solbrig, commented on the original blog post that if you have a field in your data that has at least 2 values within each hexbin, then you can synthesise the PointID field without multiple passes. At the time I nodded and thought to myself “yeah, that makes sense” but it wasn’t until someone asked me about this in more detail that I actually looked to see how it would be done.

With this approach, the PointID field is a calculation (not an output of custom SQL) and the secret is to use a LOD expression:

IF [ID] = {FIXED [Hex Lat],[Hex Lon] : MIN([ID])} THEN 1 ELSE 6 END

For each hexbin (uniquely identified by its calculated centroid), this produces both of the required values: the row with the minimum [ID] gets a 1 and every other row gets a 6. For example, if a hexbin contains rows with [ID] values 3, 8 and 15, row 3 gets PointID 1 and rows 8 and 15 get PointID 6 – exactly the two endpoints densification needs. And it does this without duplicating the data, so it’s much faster and more efficient – both highly desirable characteristics.

We can then build the densification scaffolding by creating a 1 unit wide bin over the PointID calculation, and then it’s the same approach as before to plot the viz:


I’m surprised that it’s taken me this long to internalise this approach and recognise its brilliance. So a belated thank you, Mary – it might be late but it is heartfelt.

You can download an example workbook for this here.


Choosing the right calculation type… encore!

A couple of days ago, Keith Helfrich (@KeithHelfrich) posted a question on Twitter about updating an old calculation selection diagram Bethany Lyons and I presented in a hands-on training session at Tableau Conference a couple of years ago. I responded with this diagram from my earlier post that outlines my current view.

Jonathan Drummey (@jonathandrummey) then jumped in and posted this absolute killer link to a discussion on the Tableau forums. He covers the options in way more detail, looking at the challenge from multiple angles such as performance, data volumes and complexity of maintenance.

This is a must-read article. Thanks for sharing, Jonathan!


45 degree reference line with dynamic axes

I read a great blog post today from Andy Kriebel where he shows how to create a 45-degree reference line on a scatter plot:

http://vizwiz.blogspot.com.au/2016/03/tableau-tip-tuesday-how-to-create-45.html

His solution is very elegant, using a trend line against a secondary axis to create the reference line; however, one limitation he points out in his video is that it doesn’t work when the X and Y axes have different scales.

You either need to live with the visual confusion or fix the axes; however, the latter isn’t a good option when a filter can cause the range of the data to change significantly. A better solution is to ensure that both axes always show the same range, and one way to do this is with reference lines.

First, we need to determine the maximum value on either of the two axes – we can do this with the MAX() function as follows:

MAX(
WINDOW_MAX(SUM([Selected Year Sales])),
WINDOW_MAX(SUM([Prev Year Sales]))
)

We can then use this to plot a reference line on both axes – as you can see this forces them to have the same range:

By formatting the reference lines to have no line and no label, they become effectively invisible but still control the maximum extent of the axes:

You can download the workbook with an example of this technique here:

https://www.dropbox.com/s/97wxik2e3d0yynt/45%20ref%20line.twbx?dl=0

Enjoy!


Choosing the right calculation type…

[Last year at Tableau Conference in Vegas, I co-presented a session on this topic with the fantastic Bethany Lyons from our UK offices. I delivered it again last week at Tableau Conference on Tour in Sydney to a very positive response so to reach a broader audience I’m sharing the key points from that presentation in this post. A lot of the initial work, including developing the examples, was done by Bethany so much of the credit must go to her. Thanks, Bethany – you rock!]

Today, Tableau has three forms of calculation:

Basic Calculations
These calculations are written as part of the query created by Tableau and therefore are done in the underlying data source. They can be performed either at the granularity of the data source (a row-level calculation) or at the level of detail of the visualisation (an aggregate calculation).
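
For example (illustrative calculations of my own against the Superstore data set, not from the original session), a row-level calculation and an aggregate calculation might look like this:

[Shipping Delay] (row-level – evaluated for every row in the data source)
DATEDIFF('day', [Order Date], [Ship Date])

[Profit Ratio] (aggregate – evaluated at the level of detail of the viz)
SUM([Profit]) / SUM([Sales])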

Level of Detail expressions
Like basic calculations, level of detail expressions are also written as part of the query created by Tableau and therefore are done in the data source. The difference is that LOD expressions can operate at a granularity other than that of the data source or the visualisation. They can be performed at a more granular level (via INCLUDE), a less granular level (via EXCLUDE) or an entirely independent level (via FIXED). For more information on LOD expressions, see my series of posts starting here:
https://blog.databender.net/2015/01/22/understanding-level-of-detail-expressions-part-1/
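
To give a flavour of the syntax (again, illustrative examples of my own against Superstore):

{INCLUDE [Customer Name] : SUM([Sales])}     // a more granular level than the viz
{EXCLUDE [Region] : SUM([Sales])}            // a less granular level than the viz
{FIXED [Customer Name] : MIN([Order Date])}  // an independent level – each customer's first order date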

Table calculations
Table calculations are performed after the query returns and therefore can only operate over values that are in the query result set.
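
For example, a running total over whatever dates are in the view is a table calculation, because it is computed from the aggregated values already in the result set (again, an illustrative example of my own):

RUNNING_SUM(SUM([Sales]))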

One of the challenges faced by people new to Tableau is understanding which type of calculation to employ for a given problem. This post aims to provide some clarity on how to make that determination. We do so by comparing the different calculations types against one another.

#1 – Basic calculation vs. table calculation

When trying to choose between basic calculations and table calculations, the important question is “do I already have all the data values I need on the visualisation?” If the answer is yes, then you can calculate the answer without further interaction with the data source. This will often be faster as there is less data that needs to be processed (i.e. we are just computing using the aggregated values from the result set). If you do not, then you have no choice but to go to the underlying data source to calculate the answer.

Consider the following example where we ask “what is the 90th percentile of our order details, shown by country”:

Both sides of this dashboard answer the question – if you were just interested in the 90th percentile value and didn’t need to determine further insights then the chart on the left would be optimal. It provides a minimal result set (one number per country) via a basic aggregation PCT90([Sales]) which is calculated in the underlying data source.

However, if you wanted to gain further insight (e.g. understand the greater distribution and identify outliers) or add other aggregations (e.g. you also wanted the median values) then the chart on the right allows you to do that without further queries. The initial query to return all the order detail records (the green dots) provides all the data necessary to locally compute the 90th percentile as well as explore other insights.

One of the key takeaways from this post is that the layout of the viz matters. As we’ve already discussed above, the viz design will impact how much data you initially return from the data source and this is an important factor in determining your approach. However, there are situations where, although you have all the data you need in your result set, it is not possible to achieve the required layout using a table calculation. So you also need to ask “does the layout of the viz permit me to use a table calc?”

Consider the following example where we ask for the YoY difference in sales in two formats – one as a chart and the other as a table:

The top half of this dashboard is easily achieved using a table calculation – simply duplicate the [Sales] field and apply a Difference quick table calculation, run across the [Order Date] dimension. However, if you try to then convert that computation structure into a table you end up with the following:

You will realise that it’s not possible to achieve the specified layout with a table calculation, as you need the Year dimension with the Measure Names dimension nested inside it. Tableau cannot suppress the “Difference in Sales” row for 2013, so in this example your only option is to use basic calculations:

[2013 Sales]
IF YEAR([Order Date]) = 2013 THEN [Sales] END

[2014 Sales]
IF YEAR([Order Date]) = 2014 THEN [Sales] END

[Difference]
SUM([2014 Sales]) - SUM([2013 Sales])

This approach allows you to just have the Measure Names dimension which you can sort to meet the layout requirements.

#2 – Basic calculation vs. level of detail expression

If we do not have all the data we need on the visualisation, we need our calculation to be passed through to the data source. This means we must use a basic calculation or a LOD expression. But how to choose? The important question to ask here is “does the granularity of the question match either the granularity of the viz or the granularity of the data source?”

Basic calculations can be performed either as row-level calculations or as aggregate calculations, so they can only answer questions at the granularity of the data source or at the level of detail of the visualisation. Level of detail expressions, on the other hand, can answer questions at any granularity.

Consider the following example, where we ask “what is the 90th percentile of sales at the order detail level compared to the order total level?”:

If you are familiar with Tableau’s Superstore data set you will know that it has one row of data per line item of each order. So if we consider the question above, we determine:

  • Granularity of data source:    Order Detail
  • Granularity of viz:        Country
  • Granularity of left chart:    Order Detail
  • Granularity of right chart:    Order

So for the left chart we can solve this with a basic calculation – PCT90([Sales]) – however for the right chart we must first total the Order Details to the Order level and then perform the percentile aggregation. So we must use a level of detail expression:

[Total Sales including Order]
{INCLUDE [Order ID] : SUM([Sales])}

We can then use the same aggregation as above – PCT90([Total Sales including Order]) – to get the answer. The following diagram explains how the LOD expression works:

Note that we use the INCLUDE expression so that Orders that are split across Countries are allocated correctly and not double-counted. Some readers might prefer to solve this problem with a FIXED expression, in which case we would need to write:

[Total Sales including Order]
{FIXED [Country], [Order ID] : SUM([Sales])}

This would be correct for the required chart but would limit our flexibility to change the grouping to some other dimension – e.g. by Region or by Shipping Type.

#3 – Table calculation vs. level of detail expression

This is the decision that many people find confusing; however, the process for choosing between a table calculation and a LOD expression is the same as for choosing between a table calculation and a basic calculation. You need to ask the same questions: do I already have all the data values I need on the visualisation, and does the layout of the viz permit me to use a table calc?

Consider the following example where we ask “what is the 90th percentile of sales at the order total level, shown by country?”:

You will notice that this is almost identical to the question asked in #1 above. The only difference is that the percentile calculation is done on the order total, not the order detail. You may also realise that the chart on the left is in fact the same chart we saw on the right side of #2. We already know the granularity of this problem is different to both the data source and the viz, so we should use a LOD expression.

The chart on the right is the same as the right side of #1; however, the dots now represent Orders, not Order Details. This is done simply by changing the granularity of the viz (swapping Row ID for Order ID on the Detail shelf). Because table calculations keep the calculation logic separate from the computation scope and direction, we don’t even need to change the calculation – just compute using Order ID.

It can get tricky to be sure about the answer to our decision process questions, and sometimes you can solve a problem one way until you later introduce a complication. Consider the following example where we ask “for each age group, what percent of illnesses does each disease account for?”:

This is clearly a percent-of-total problem, which we can solve very quickly with a quick table calculation across Disease. However, when we add the complication of allowing the user to filter to a specific disease, we find the following:

This is because our result set no longer contains all the data we need – the filter has removed the Patient Count data for the other diseases. You could solve this by making the filter a table calculation:

[Filter Disease]
LOOKUP(MIN([Disease]), 0) compute using Disease
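
(For reference, the percent-of-total quick table calculation used above is equivalent to writing something like the following by hand – my approximation of what the quick table calc generates, computed across Disease:)

SUM([Patient Count]) / TOTAL(SUM([Patient Count]))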

Or you can use LOD expressions, knowing that FIXED calculations are computed before dimension filters are applied. First, work out the total number of people in an age group:

[Total Patients per Age Group]
{FIXED [Age] : SUM([Patient Count])}

Then you can compute the % total:

[Pct Total]
SUM([Patient Count]) / SUM([Total Patients per Age Group])

#4 – When only table calculations will do

Finally, we need to add one more concept to our decision process. There are several categories of problem that can only be solved using table calculations:

  • Ranking
  • Recursion (e.g. cumulative totals)
  • Moving calculations (e.g. rolling averages)
  • Inter-row calculations (e.g. period vs. period calculations)

So the question to ask here is “does my problem require the use of ranking, recursion, moving calculations or inter-row calculations?”

This is because table calculations can output multiple values for each partition of data, while LOD and basic expressions output a single value for each partition/grouping of data. Consider the following exercise where we ask “in the last year, how many times has a stock broken its record close value to date?”:

We need a recursive calculation here – to tell whether a given day sets a new record we need to consider all the values that came before it. We can do this with the RUNNING_MAX function. First we calculate the highest close value to date:

[Record to Date]
RUNNING_MAX(AVG([Close])) compute using Day

Then at the day level we need to flag those days where the record was broken:

[Count Days Record Broken]
IF AVG([Close]) = [Record to Date] THEN 1 ELSE 0 END

Finally, we need to count these days:

[Total Times Record Broken]
RUNNING_SUM([Count Days Record Broken]) compute using Day

Conclusion

The key takeaways from all this are:

  • There is no silver bullet. The answer is always “it depends…” but the decision process will get you started selecting the right approach.
  • The layout of the viz matters – as it changes you might need to change your calculation type.
  • There are scenarios where different solutions perform differently – depending on the volume and complexity of your data, the complexity of the question, and the required layout.
  • There are always trade-offs to consider (performance vs. flexibility vs. simplicity) – a good rule of thumb is that you can pick any two.

For more information and examples, you can download the above workbook from the following link: https://www.dropbox.com/s/hzuza3j347wdkxg/FINISH%20-%20Optimizing%20Calculation%20Methods.twbx?dl=0

You can also download a PPT with the above flowcharts and LOD diagrams from here: https://www.dropbox.com/s/t0ig6116f8j1ncq/For%20Blog%20Post.pptx?dl=0

Enjoy!
