Hexbin Scatterplot in Tableau

An interesting tweet came across my Twitter-stream the other day, showing a hexbin scatterplot chart type for Power BI:

Having just presented a session at TC17 on working with dense data where Sarah Battersby and I covered (among other things) hexbinning in Tableau, I was intrigued by this viz type and wondered if it could be created in Tableau. I was a little wary as mixing polygons and points together can be complicated, but I hoped it could be done.

Let’s just say that I’m glad I was bald when I started this exercise because it involved quite a bit of hair-pulling. But after a few hours of trial and error and a well-timed break to go sit in the sun and ruminate, I managed to produce this little beauty:

[Animated GIF: Hexbin Scatterplot]

I started with Alberto Cairo’s Datasaurus dataset – a group of datasets that behave similarly to Anscombe’s quartet. Really I was just being lazy as I had it lying around and therefore didn’t need to mock up my own sample scatterplots. The source data looks like this:

dataset   record id   x         y
dino      1           55.3846   97.1795
dino      2           51.5385   96.0256
dino      3           46.1538   94.4872
dino      4           42.8205   91.4103
dino      5           40.7692   88.3333
dino      6           38.7179   84.8718
dino      7           35.641    79.8718

With the data in this format there are two approaches for generating the hexbins – one uses densification to generate the polygon vertex records, and the other generates them through a join to a scaffolding table. I opted to use the scaffolding approach as a) I have a manageable amount of data and b) it makes life easier when you have hexbins that contain just a single point. The scaffold table looks like this:

Point ID
0
1
2
3
4
5
6

And the join of these tables in Tableau looks like this (the join simulates a Cartesian product of the two tables):

The result of this is 7 rows of data for each point on the scatterplot:
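
As a quick sanity check outside of Tableau, here's a minimal pandas sketch of the same cross join (my own illustration, not part of the workbook; it assumes pandas 1.2 or later for the how="cross" merge):

import pandas as pd

# A couple of sample rows from the Datasaurus data...
points = pd.DataFrame({
    "dataset": ["dino", "dino"],
    "record_id": [1, 2],
    "x": [55.3846, 51.5385],
    "y": [97.1795, 96.0256],
})

# ...and the scaffold table of Point IDs 0-6.
scaffold = pd.DataFrame({"point_id": range(7)})

# Cartesian product: every source row is repeated once per Point ID.
joined = points.merge(scaffold, how="cross")
print(len(joined))  # 14 rows: 2 source rows x 7 scaffold rows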

I’ll use one of these rows (the one with [Point ID] = 0) to plot the actual point location, and the other six to plot the hexagon shape. I’ve blogged on several occasions about how to generate a dynamic hexbin polygon, and we’re going to use the same techniques here:

Generate the hexbin center point:
[HexbinX]: HEXBINX([X]/[Hexbin Size], [Y]/[Hexbin Size]) * [Hexbin Size]
[HexbinY]: HEXBINY([X]/[Hexbin Size], [Y]/[Hexbin Size]) * [Hexbin Size]
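
Purely as an aside, here's a rough Python sketch of what hexagonal binning does conceptually – snap each point to the nearest centre of a hexagonal lattice. It isn't guaranteed to match the exact orientation and spacing convention of Tableau's HEXBINX/HEXBINY functions, but it illustrates the divide-by-size, bin, then multiply-back pattern used above:

import math

def nearest_hex_center(x, y, r=1.0):
    # Snap (x, y) to the nearest centre of a flat-top hexagonal lattice whose
    # hexagons have circumradius r. A hex lattice is the union of two offset
    # rectangular grids, so find the nearest point in each and keep the closer.
    dx, dy = 3.0 * r, math.sqrt(3.0) * r            # spacing of each rectangular sub-grid
    best = None
    for ox, oy in ((0.0, 0.0), (1.5 * r, dy / 2.0)):
        cx = round((x - ox) / dx) * dx + ox         # nearest column centre in this sub-grid
        cy = round((y - oy) / dy) * dy + oy         # nearest row centre in this sub-grid
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        if best is None or d2 < best[0]:
            best = (d2, cx, cy)
    return best[1], best[2]

# Same pattern as the calcs above: scale down by the bin size, bin, scale back up.
hexbin_size = 5.0
x, y = 55.3846, 97.1795
hx, hy = nearest_hex_center(x / hexbin_size, y / hexbin_size)
hexbin_x, hexbin_y = hx * hexbin_size, hy * hexbin_size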

 

Generate a unique identifier for each hexbin. As you may know, I’m an advocate for efficiency, so I use a numeric function for this (based on Cantor’s pairing function) instead of a string function:
[HexbinID]: ([HexbinX]^2 + 3*[HexbinX] + 2*[HexbinX]*[HexbinY] + [HexbinY] + [HexbinY]^2)/2
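
If you want to convince yourself that the pairing produces distinct IDs, here's a quick Python sketch of my own that mirrors the calculation (strictly, Cantor's pairing is only guaranteed to be a bijection over non-negative integers, but it works well enough as a bin identifier here):

def hexbin_id(hx, hy):
    # Mirror of the [HexbinID] calculation: a Cantor-style pairing of the two
    # hexbin coordinates into a single numeric identifier.
    return (hx ** 2 + 3 * hx + 2 * hx * hy + hy + hy ** 2) / 2

# Sanity check: every bin centre on a grid of non-negative integer
# coordinates gets a distinct ID.
centres = [(x, y) for x in range(100) for y in range(100)]
assert len({hexbin_id(x, y) for x, y in centres}) == len(centres)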

 

Generate the actual plot points, keeping the original location when [Point ID] = 0 and using trigonometry to generate the hexagon vertices when [Point ID] is 1 through 6:
[PointType]: IF [Point ID] = 0 THEN 0 ELSE 1 END
[Angle]: (1.047198 * INDEX())
[PlotX]: IF MIN([PointType]) = 0 THEN MIN([X]) ELSE WINDOW_AVG(MIN([HexbinX])) + [Hexbin Size]*COS([Angle]) END
[PlotY]: IF MIN([PointType]) = 0 THEN MIN([Y]) ELSE WINDOW_AVG(MIN([HexbinY])) + [Hexbin Size]*SIN([Angle]) END
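
As a cross-check of the trigonometry (1.047198 radians is 60°, i.e. π/3), here's a small Python sketch of my own that generates the six hexagon vertices the same way the [Angle]/[PlotX]/[PlotY] calcs do:

import math

def hexagon_vertices(cx, cy, size):
    # Six vertices of a hexagon with circumradius `size` centred on (cx, cy),
    # at angles of 60, 120, ..., 360 degrees.
    return [(cx + size * math.cos(k * math.pi / 3),
             cy + size * math.sin(k * math.pi / 3))
            for k in range(1, 7)]

# e.g. the hexagon around a bin centred at (55.0, 95.0) with a hexbin size of 5
print(hexagon_vertices(55.0, 95.0, 5.0))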

We can now start plotting our viz – first let’s just get the points up:

You can see that the blue marks are the original data points and the orange marks are the vertices for the hexagons. Because we want two mark types (a polygon and a point) we need a dual axis chart:

We need to isolate the orange marks on one side and the blue marks on the other. We can’t filter them, so we have to make some clever use of the “hide” function. I duplicated the [PointType] calculation from before so I can use one to colour one axis and the other to colour the other:

We then hide the marks we don’t need on each axis (right-click on the colour swatch in each legend and select “Hide”):

We can now make the hexagon marks on one axis, and circle marks on the other. Tidy up the colours and other formatting:

Finally, we set the chart to dual axis, synchronise the axes, hide the unwanted top axis, and voila:

The last couple of steps I put in were to a) colour the hexbins by the number of points they contain, b) tidy up the tooltips for each mark type, and c) set up a hover action to highlight the elements in a hexbin:

This ended up being quite a challenging viz and required quite a few techniques to get it done. But being able to do it at all reinforces for me that an expressive presentation model that allows you to natively create complex chart types (i.e. the Tableau approach) is faster and more reliable than a model where you are reliant on a developer to write a custom chart widget (i.e. the Power BI model). Even accounting for the trial and error needed to nut out the final successful method, Tableau allowed me to achieve the result much faster than a solution based on coding.

And of course, now that I know how, I can reproduce this solution in minutes.

You can download the workbook from here. Enjoy.

 

PS. I couldn’t help myself. The workbook now includes solution examples using both the scaffolding and the densification approaches.

[Image: Hexbin Scatterplot]

It was a mental itch that needed scratching.


Loupe Tooltips

Well, it’s been a hectic week in Las Vegas for our customer conference but I have a brief window to put up this post. If you’ve been here, I hope you had a great time and learned heaps!

One of the sessions I co-delivered with the ever amazing Sarah Battersby (@MapsOverlord) was “Masters of Hex: Interpreting Dense Data with Tableau”. We’ve presented this session for the last 3 years but as always we update our material for any new features and techniques. This year we have access to the 10.5 beta, and I came up with a great way to use it to dynamically zoom in on dense data.

I’ve called this idea a “loupe” tooltip – after the magnifying eyepiece used by photographers and watchmakers:

Here’s my starting data – a scatter plot with 100,000 densely packed data points:

As you can see, it’s impossible to make out what is going on in the bottom middle of the chart – there is too much overplotting of the marks even when we make the marks as small as possible and ramp up the transparency. But what if we could dynamically zoom in on a small section – like we were looking through a loupe?

To achieve this, I’ve created binning calculations to let me select a small group of points (I had to use calculations rather than native bins because you can’t reference a bin field inside a calculation, which I need to do later):

BinX:    FLOAT(INT([X]*10)/10)
BinY:    FLOAT(INT([Y]*10)/10)
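
In other words, each calc just truncates the coordinate to the lower edge of a 0.1-wide bin. A quick Python equivalent (my own sketch; it assumes positive X/Y values, as in this dataset, since int() truncates toward zero):

def bin_floor(value):
    # Mirror of the BinX/BinY calcs: truncate to the lower edge of a 0.1-wide bin.
    return int(value * 10) / 10

print(bin_floor(3.47), bin_floor(8.02))  # 3.4 8.0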

I’ve also created another sheet which we will use in the tooltip – essentially it’s just a duplicate of the primary scatter plot, but I’ve stripped back the axes and formatting to keep it clean and minimal:

I also made a loupe title sheet to just show the count of the marks in the scope of the loupe:

Now we add them to the tooltip of the primary scatter plot, and set the filter fields to be BinX and BinY:

And voila!

However, we have a problem when we loupe over a sparse bin – the loupe axes zoom in to show just the point:

It would be preferable to fix the loupe to always show the extent of the bin, so we can use the neat trick of placing reference lines to pad out the axes to a larger size than the data demands. We create a couple of boundary calculations for X and Y:

XLowerBound:    MIN([BinX])
XUpperBound:    [XLowerBound] + 0.1
YLowerBound:    MIN([BinY])
YUpperBound:    [YLowerBound] + 0.1

We then put these calcs on the detail shelf of the loupe and we can create a reference band:

Now when we loupe over a sparse bin we have a much nicer view:

The workbook can be downloaded here but remember, you’ll need Tableau 10.5 to view this so make sure you enroll for the beta program!

Enjoy!


Letting go of custom SQL

I thought it might be helpful to share a recent experience where I was helping a customer who was complaining of slow-performing workbooks. When I arrived, their Tableau analyst showed me a couple of workbooks that were running slowly. What do we mean by slowly? Well, for one of the workbooks, just opening it took ~5 mins!


OK – time to start looking for problems, but where to begin? Fortunately, I’m presenting a session at our customer conference next week (designing efficient workbooks: live on stage) and in it I propose a framework for identifying performance issues.

Basically the top layers are where you are most likely to find performance issues, so you should start looking at the design of the dashboards and calculations (which are easy to fix) before you start looking at the data model or the underlying platform (where often you have less ability to change things).

With this framework in mind, I started reviewing the workbook.

Presentation

I can’t include a screenshot of the report, but let me describe it for you:

It was a simple design – a single sheet (not even a dashboard!) with a straightforward layout. But… it was a great big crosstab!

(Me when I see great big crosstabs…)

That could be part of the problem as big crosstabs are not the most efficient viz type to render. However, I noticed that the main delay when opening the workbook was with “executing query” and not “computing layout” so this probably wasn’t the main issue we needed to fix.


Analytics

Note: I’ve created a mock-up of the data, anonymised to avoid any embarrassment for the client. All the screenshots you will see from here on in reflect the original situation but are totally synthetic.

The analytics layer is concerned with the complexity of calculations and how you combine and manipulate the data (e.g. blending). Surprisingly this was extremely straightforward for the report:



Data

The next layer down is concerned with the data sources and data connections – how you are retrieving the data from the original location. When I opened up the very simple data source I saw this:

[Screenshot: the data source]

OK – that explains why there were no calculations in the data source and why it was so simple. All the logic is wrapped up in a custom SQL query. It’s also a bit of a red flag… a live connection using custom SQL isn’t a recommended design pattern. Let’s take a look inside and see what we’re dealing with.

[Screenshot: the custom SQL query]


Oh my! I think we have found the problem.

The customer’s database development team had provided the Tableau analyst with several SQL statements like this from their existing reporting solution. It’s not optimal SQL and I’m sure it could be improved if we tried, but the analyst (like many Tableau users) didn’t have the time or the SQL knowledge to do this, so they simply cut and pasted the query into a custom SQL connection. The result? The whole query had to run before Tableau could sub-select from the interim result set, and the experience was very slow.

What to do? Well, we could follow basic best practice and make this an extract instead of a live connection so the slow query only runs once – after that, everything points to the extract. But the customer wanted to maintain up-to-date views of their data, so a live connection was preferable. Even more so once we determined that the extract would require frequent intra-day refreshes and would potentially be very large.

So, working with the analyst, we managed to reverse-engineer the relationships between the tables as follows:

[Screenshot: the reverse-engineered table relationships]

We then captured the logic for the CATEGORY and the MEASURE fields into a couple of Tableau calculated fields:

[Screenshots: the CATEGORY and MEASURE calculated fields]

And the result? We reduced the time taken to render the report from ~5 mins to < 10 secs – a 30x improvement, which the customer was absolutely delighted with. They are now going to take the learnings from this exercise and apply them to their other workbooks.


The moral of the story? It reinforces that custom SQL can be a serious impediment to performance and should be used with caution. Leveraging the native capabilities of Tableau allows us to generate much more efficient queries which is good for response times and also good for the query load on the DBMS. If you must use custom SQL, try to wrap it in a view in the DBMS if you can or pull it into an extract as a last resort.


I am sooo not worthy…

[Sarah Battersby has on previous occasions shown how outrageously smart she is, but she has blown my mind today. I am sooo not worthy…]

I’ve been asked on occasion if we can show point data from one data source (containing explicit lat/lon fields) along with custom shapes drawn from a shapefile. My response in the past has been that you can if you convert the explicit lat/lon fields to a point shape, so you can just change the GEOMETRY field on a dual axis map. The reason for this thinking was that I believed you needed to use the Latitude (generated) and Longitude (generated) fields for the GEOMETRY field, and that you couldn’t dual-axis these with other lat/lon fields.

Well, it turns out I was wrong. I found this out when reading Sarah’s blog post about dual axis mapping (and specifically the section on combining a shapefile with a CSV):

https://community.tableau.com/people/sarah.battersby.0/blog/2017/06/24/dual-axis-mapping-many-ways

There were two particular tricks she used in that section that caught my eye. The first was the “faux union” done via a full outer join on the criterion 1=0. What a neat trick – I’d never thought of doing that before! But the real winner was how she used ZN(Latitude) and ZN(Longitude) to allow us to plot the GEOMETRY fields when the lat/lon values were NULL. I had no idea you could do that. Forest for the trees, or something like that…

Anyhow – the short of it is that I now know that it’s relatively easy to create a viz like this:

[Image: Dots&Polys]

Thanks, Sarah! Genius!


Tableau Data Day Out – Recorded Sessions

As a follow-up to my last post, we now have all the presentation recordings from our Data Day Out event in Melbourne available for viewing.

TITLE                                                                              LENGTH
Ideas that Should Die                                                              32 MIN
Scribbles and Lines                                                                20 MIN
Playing it Safe: Freedom and Governance in a Self-Service Analytics Environment    26 MIN
More Than Dots on a Map                                                            26 MIN
So You Want to be a Data Scientist?                                                30 MIN
Traditional BI versus Modern BI – Does it Matter?                                  29 MIN
What’s the Story?                                                                  28 MIN

Congratulations and thanks to all the presales team members who helped make this event such a success!


More than Dots on a Map

At the recent Tableau Data Day Out in Melbourne, I presented a session entitled “More than Dots on a Map”. The idea was to show how Tableau could allow users to do spatial analysis of their data – irrespective of whether it had implicit or explicit location fields. By this I mean to go beyond just plotting shapes and empower the user to interact in powerful ways, asking lots of different questions that require the data to be shaped and visualised differently.

Below is a recording I made of the session – enjoy!


More on spatial file rendering

Following on from my post yesterday, I had a colleague ask me: “So what happens when you force client-side rendering at the detailed level? Does it blow up with too much data?”

Interesting question. Let’s find out.

(For those of you who don’t want to read the rest of the post… the answer is “no – we’re too clever for that!”)

As outlined in the documentation here, you can adjust the complexity threshold where we switch between client-side and server-side rendering, or you can force either using “:render=true” or “:render=false” on the URL. So here is my workbook from yesterday with the “:render=true” flag set at the raw level – notice that the map is being rendered in the browser:

Clearly it didn’t blow up. This is because Tableau Server does dynamic simplification of the polygons based on the zoom level of the viewer. We reduce the complexity of the data we deliver to the browser, removing data points that can’t be seen. Clever, huh? This means that for the above map we only ship 7.2MB of data to the browser (see the bootstrap package in the debug window) so there is no danger of the runtime exploding due to too many points.

Performance was acceptable for the initial view, but bear in mind I’m on gigabit Ethernet, local to my server. As previously documented, it would be slower on a low-bandwidth connection, but once the map is rendered, responsiveness is great! You can test it yourself here:

https://tab.databender.net/views/ShapefileRenderingTest/GCCMap?:embed=y&:render=true&:jsdebug=true

The simplification of the polygons is done on the server, and this becomes apparent when you zoom. From the above map, I zoomed right in to Port Melbourne – a much closer view. Initially the map renders with the outlines of the polygons fuzzy – it has just zoomed the existing client-side rendered image:

However, the server is busy in the background computing the new model. Watching the trace, the pane-zoom event sits in “pending” state for a few seconds, then it starts to stream down the more granular polygon definitions and the boundaries sharpen up:

Additionally, to manage the amount of data we ship to the browser, we clip the data to the viewport so we only transfer the vertices we need to render. You can see the pane-pan-server event in a pending state after I pan:

The viewport is updated once the data has arrived:

So – you can safely force client-side rendering even with very complex shapefiles; however, there are performance trade-offs, as the server must compute simplified polygons for each visual model. The upside is that you can zoom in a long way and still have detailed boundaries.

Kudos to our product and development teams for developing this feature in such a robust way!
