MakeLine() + Transparency + Mark Hiding = Oh My!

Since I arrived at Snowflake, I’ve been using a dataset from the NYC Citibike program as my demo data. It’s about 61M bike trip records with lat/lon data for the starting and ending bike stations, and I’ve been showing it using the following dashboard where the starting and ending locations are in different maps…

What I really wanted to do though was to make a map that allowed me to select a starting station and show where riders went. Something like this…

Well, with the latest 2019.2 beta, I did (as you can see). It’s running over my Snowflake demo database – naturally – and uses the following calculation to create the line:

TripLine =
MAKELINE(
  MAKEPOINT([Start Station Latitude], [Start Station Longitude]),
  MAKEPOINT([End Station Latitude], [End Station Longitude])
)

Oh, how long I have waited for thee, sweet MakeLine() and MakePoint()…

The reason this new feature helps so much is that the starting and ending location data is in the same record. In earlier versions of Tableau I would have had to pivot the data, splitting each record into two rows, so I could use a line mark and the path shelf. This has performance implications, as it effectively doubles the amount of data we need to read and process.
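For reference, here's a rough sketch of the kind of pivot that would otherwise have been needed – one row per trip endpoint plus a path-order field (the table and column names here are assumed, not the actual schema):

-- Split each trip into two rows: one for the start point, one for the end point.
-- POINT_ORDER would then go on the Path shelf against a line mark.
SELECT TRIP_ID,
       0 AS POINT_ORDER,
       START_STATION_LATITUDE  AS LATITUDE,
       START_STATION_LONGITUDE AS LONGITUDE
FROM   TRIPS
UNION ALL
SELECT TRIP_ID,
       1 AS POINT_ORDER,
       END_STATION_LATITUDE,
       END_STATION_LONGITUDE
FROM   TRIPS;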

So MakeLine() gives me the line between the start and end points, which I display in a sheet:

I also want the point map for the starting locations (so I can select them) – here’s that sheet:

As you can see, both of these maps are set up with no background. Thanks to this blog post (https://www.tableau.com/about/blog/2018/11/express-your-data-artistry-transparent-worksheets-tableau-98807) for showing me the trick for layering maps and getting the transparency to work. The secret is to set the sheet shading to “None”:

Once I did this, I was able to layer them together using floating placement:

To get them to line up, I used the search/zoom feature of the map to set the viewport on each map to the same region:

Now to make the two maps work together… I set up a filter action from the point map to the line map to only show the lines for the selected starting station:

I also need a filter to control the Num Trips panel in the top right – this needs to be a separate filter because its logic is the reverse of the line filter’s: when I clear the selection, I want all the trips to be shown:

This gives me the interactive behaviour between the three elements (points, lines, trip count), but the problem is that the points layer obscures the line layer (because it’s on top):

We need to hide the unselected points, and our good old “hide the mark using an invisible shape” trick comes into play. We define a set that will hold the selected marks, and we use the set to control the mark shape – using a filled circle for IN the set and an invisible 10x10px PNG custom shape for OUT of the set:

We then use an Edit Set action to update the set values to the selection:

The result – the selected points are shown as circles while the unselected points disappear, allowing us to see the lines unobstructed. We get all the points back by clearing the selection:

The final piece is that we need the base map for spatial context. I initially had this in the lines map, but the problem was that it disappeared when there were no lines to show – i.e. when no points are selected.

To overcome this, I simply added a 3rd map sheet with a visible base map but with a single mark that was invisible. This meant that the map would be visible in all states of the upper layers. Again, transparency allows this to show through:

The final result is a map with both points and lines that dynamically updates based on the selection. The key elements were:

  • MakeLine() to create the line mark between the starting and ending stations, even though there are no spatial data elements in the underlying database – just raw lat/lon FLOAT fields. It also means we don’t need to pivot the data into separate starting and ending records;
  • Sheet transparency to allow me to layer the maps together;
  • That good old dynamic mark visibility trick using an invisible custom shape; and
  • Edit Set actions to control the visibility of the point marks.

I’ve uploaded a version of the workbook with a reduced set of data extracted so you can pull it apart.

Enjoy!

Posted in Uncategorized | 3 Comments

Tableau and Snowflake Boolean fields

Hello, gentle reader. It’s been a while. I hope you are well. 🙂

Today, I was helping a colleague with a problem that stems from the fact that the current version of the Snowflake driver in Tableau doesn’t correctly support Boolean data and therefore returns Snowflake BOOLEAN fields as 1/0 integer values. This means the field cannot be used directly in a Boolean expression like:

A AND B

where A and B are both themselves Booleans. So I thought I’d do some digging to understand what was happening under the covers.

For this particular problem, I came up with four different solutions… combinations of using RAWSQL_BOOL to force the result to be interpreted as a Boolean, and simple integer comparisons that return a Boolean result:

  • Create individual Boolean fields for A and B using RAWSQL_BOOL(), then create a third field that ANDs them together.
    • Formula:
      • [A_BOOL] = RAWSQL_BOOL("A")
        [B_BOOL] = RAWSQL_BOOL("B")
        [A&B] = [A_BOOL] and [B_BOOL]
    • Produced the following SQL:
      • (CASE WHEN ((A) AND (B)) THEN 1 WHEN NOT ((A) AND (B)) THEN 0 ELSE NULL END)
  • Create a single field that embeds the AND into the RAWSQL_BOOL function. Less flexible than option 1 but maybe better performing?
    • Formula:
      • [A&B] = RAWSQL_BOOL("A AND B")
    • Produced the following SQL:
      • (CASE WHEN (A AND B) THEN 1 WHEN NOT (A AND B) THEN 0 ELSE NULL END)
  • Create individual Boolean fields for A and B using an integer comparison, then create a third field that ANDs them together. My thought here was that integer comparisons are generally very fast, so this might be faster than the RAWSQL approach…
    • Formula:
      • [A_INT] = [A] = 1
        [B_INT] = [B] = 1
        [A&B] = [A_INT] and [B_INT]
    • Produced the following SQL:
      • (CASE WHEN (("BOOL_TEST"."A" = 1) AND ("BOOL_TEST"."B" = 1)) THEN 1 WHEN NOT (("BOOL_TEST"."A" = 1) AND ("BOOL_TEST"."B" = 1)) THEN 0 ELSE NULL END)
  • Create a single field that does an integer comparison for each field and ANDs the results. To be honest, I didn’t expect much difference between this approach and option 3…
    • Formula:
      • [A&B] = [A]=1 AND [B]=1
    • Produced the following SQL:
      • (CASE WHEN (("BOOL_TEST"."A" = 1) AND ("BOOL_TEST"."B" = 1)) THEN 1 WHEN NOT (("BOOL_TEST"."A" = 1) AND ("BOOL_TEST"."B" = 1)) THEN 0 ELSE NULL END)

I put the four examples above into a Tableau worksheet and the following query was run in Snowflake:

SELECT (CASE WHEN (A AND B) THEN 1 WHEN NOT (A AND B) THEN 0 ELSE NULL END) AS "Calculation_ 1822761623506259971",
(CASE WHEN ((A) AND (B)) THEN 1 WHEN NOT ((A) AND (B)) THEN 0 ELSE NULL END) AS "Calculation_ 1822761623506604036",
(CASE WHEN (("BOOL_TEST"."A" = 1) AND ("BOOL_TEST"."B" = 1)) THEN 1 WHEN NOT (("BOOL_TEST"."A" = 1) AND ("BOOL_TEST"."B" = 1)) THEN 0 ELSE NULL END) AS "Calculation_ 1822761623507091461",
(CASE WHEN (("BOOL_TEST"."A" = 1) AND ("BOOL_TEST"."B" = 1)) THEN 1 WHEN NOT (("BOOL_TEST"."A" = 1) AND ("BOOL_TEST"."B" = 1)) THEN 0 ELSE NULL END) AS "Calculation_ 1822761623510069255",
SUM("BOOL_TEST"."C") AS "sum:C:ok"
FROM "PUBLIC"."BOOL_TEST" "BOOL_TEST"
GROUP BY 1, 2, 4, 3

Looking at the query profile, the two Boolean approaches produced the same expression:

IFF(BOOL_TEST.A AND BOOL_TEST.B, 1, IFF(NOT(BOOL_TEST.A AND BOOL_TEST.B), 0, null ))

As did the two integer approaches:

IFF((BOOL_TEST.A = TRUE) AND (BOOL_TEST.B = TRUE), 1, IFF(NOT((BOOL_TEST.A = TRUE) AND (BOOL_TEST.B = TRUE)), 0, null))

Finally, I timed a series of tests against an admittedly small data volume (~100K rows) and the integer logic came out slightly faster on average – 140ms vs. 166ms over 12 iterations. I’m not convinced the difference is statistically significant, but read into it what you will.
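For anyone wanting to repeat the timing test, one way to pull the numbers is from Snowflake’s query history. A minimal sketch (the filter pattern is just illustrative):

-- Elapsed times, in milliseconds, for the queries Tableau ran against the test table
SELECT QUERY_TEXT,
       TOTAL_ELAPSED_TIME
FROM   TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE  QUERY_TEXT ILIKE '%BOOL_TEST%'
ORDER  BY START_TIME DESC;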

If you want to have a play around with this yourself, here’s the create table statement:

CREATE OR REPLACE TABLE BOOL_TEST (
A BOOLEAN
, B BOOLEAN
, C INTEGER
);
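If you want to fill it with some random test data, something like this will do the job (the row count and value distribution are arbitrary):

INSERT INTO BOOL_TEST
SELECT UNIFORM(0, 1, RANDOM()) = 1,   -- A: random TRUE/FALSE
       UNIFORM(0, 1, RANDOM()) = 1,   -- B: random TRUE/FALSE
       UNIFORM(1, 100, RANDOM())      -- C: random integer between 1 and 100
FROM TABLE(GENERATOR(ROWCOUNT => 100000));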

The data for the table and the Tableau workbook can be found in my Dropbox.

Overall, I’m not sure if that helps but it certainly explains what was going on under the covers. It also gives some workaround options until the problem is fixed by Tableau.


Addendum:

The objective of this piece was to discuss how you can use Boolean values in Boolean logic formulae. However, there is one additional point worth covering: how to get the field to display as a Boolean true/false value. While the approaches above create fields that are true Booleans in Tableau, there is another option that doesn’t require creating a calculation: aliasing the field. You can apply an alias of “true” to the numeric value 1 and “false” to the numeric value 0. The field then displays the alias values but remains numeric for the purposes of calculations and the like.

I just thought it was worth noting this additional option.

Posted in Uncategorized | Leave a comment

Snowflake multi-cluster warehouse vs. TabJolt load

One of Snowflake’s key features is something called multi-cluster warehouses.

By default, a virtual warehouse consists of a single cluster of servers that determines the total resources available to the warehouse for executing queries. As queries are submitted to a warehouse, the warehouse allocates resources to each query and begins executing the queries. If sufficient resources are not available to execute all the queries submitted to the warehouse, Snowflake queues the additional queries until the necessary resources become available.

With multi-cluster warehouses, Snowflake supports allocating a larger pool of resources to each warehouse. As the number of concurrent user sessions and/or queries for the warehouse increases, and queries start to queue due to insufficient resources, Snowflake automatically starts additional clusters, up to the maximum number defined for the warehouse.


Similarly, as the load on the warehouse decreases, Snowflake automatically shuts down clusters to reduce the number of running servers and, correspondingly, the number of credits used by the warehouse.

As you can imagine, this capability is very useful to maintain a consistent response time for users in BI and reporting scenarios where varying user loads are common. Imagine that you normally have 5 concurrent users on your system but on Monday mornings, you have a spike of 20 concurrent users, all wanting to view reports for the weekend’s business. A multi-cluster warehouse can ensure that users during the peak load experience the same response times (from the DB that is… saturation of the BI server is a separate factor) as users during non-peak periods.
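Setting this up is just a couple of parameters on the warehouse definition. Here’s a sketch (the warehouse name and settings are illustrative, sized for the Monday-morning example above):

CREATE OR REPLACE WAREHOUSE REPORTING_WH WITH
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1           -- normal load runs on a single cluster
  MAX_CLUSTER_COUNT = 4           -- allow scale-out when queries start to queue
  SCALING_POLICY    = 'STANDARD'  -- favour starting extra clusters over queuing queries
  AUTO_SUSPEND      = 300
  AUTO_RESUME       = TRUE;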

To show this capability in action, I created a dashboard that queries against about 57M records of Citibike data, hosted in a Snowflake DB. I published this dashboard to Tableau Server and then used TabJolt to simulate a) a single user baseline, and then b) a peak period with multiple concurrent users. Note that caching was turned off both in Tableau Server and in the Snowflake data source to ensure that all queries were actually run in the DB.

You can see this experiment and the results in the following video. Apologies for just providing a download link – I tried posting this to Vimeo but the video quality was terrible. That’s most likely my fault, but right now this is an easier way to share:

https://www.dropbox.com/s/1j1437irfcjc1g1/Multi-cluster%20Scaling.mp4?dl=0

Enjoy!

Posted in Uncategorized | Leave a comment

Snowflake and Tableau – in action!

A few days ago I was invited to present Snowflake as part of a webinar run with our partner BockCorp. Check out the recording of the session here, which includes an overview of the Snowflake architecture as well as a demo showing all the cool capabilities like instant elasticity, semi-structured data support, data sharing and more:

Oh – if you want to jump straight to the demo, it starts around the 29min point. Go ahead, I won’t be offended. 🙂

Enjoy!

Posted in Uncategorized | Leave a comment

Best Practices for Using Tableau with Snowflake

Update 18 July 2018:
The whitepaper has now been released as an official Snowflake whitepaper so I’ve updated the download link to point to the document on the Snowflake website. The new whitepaper is much prettier and has been through editing to clean up all my bad writing habits. Thanks to Vincent Morello and Marta Bright in our content marketing team for all their help in making this happen.

As announced in my last post, since joining Snowflake I’ve been working on a whitepaper that provides best practice guidance for using Tableau with our built-for-the-cloud data warehouse.

Well, I’m pleased to report that it’s done. Or at least, done enough to release. You can download it from here:

https://resources.snowflake.net/ecosystem/best-practices-for-using-tableau-with-snowflake

I hope you find it useful, and please let me know if you have any feedback or corrections.

Posted in Uncategorized | 2 Comments

Tableau and Snowflake

Happy New Year everyone!

I’ve been a bit quiet of late. Probably to be expected, what with getting my head around all the new stuff here at Snowflake. Also, properly relaxing over Christmas and summer requires a degree of focus (ah, the joy of the southern hemisphere!). But I’ve not been completely idle. Over the past few weeks I’ve been steadily working on a new whitepaper:


Here’s an overview of the document scope…

  • Introduction
  • What is Tableau?
  • What is Snowflake?
  • What you DON’T have to worry about with Snowflake
  • Creating Efficient Tableau Workbooks
  • Connecting to Snowflake
  • Working with Semi-Structured Data
  • Working with Snowflake Time Travel
  • Working with Snowflake Data Sharing
  • Implementing Role-Based Security
  • Using Custom Aggregations
  • Scaling Snowflake Warehouses
  • Caching
  • Other Performance Considerations
  • Measuring Performance

Of course, it’s turning out to be a lengthy read – it seems I know no other way. 🙂 But believe me, a lot of that is screenshots and SQL. The document is being reviewed at the moment, but I plan to break it into consumable chunks and release material as posts over the next couple of weeks. Maybe here, maybe on the Snowflake or Tableau blogs.

So, keep your eyes peeled…

Posted in Uncategorized | 4 Comments

Endings and Beginnings

Well it was a bittersweet day on Friday. After 6+ years at Tableau I have decided that it’s time for a new challenge. Tomorrow I start my first day at Snowflake Computing, a company that is revolutionising the cloud analytic database market.


I’m going to continue to blog here and given I still have a deep love for Tableau, some (many) of my posts will continue to be about it and data visualisation in general. However, I’ll also be posting about Snowflake and interesting things I’m learning as I settle in to my new role. Given that a primary use case for Snowflake is BI and analytics, the two topics should be quite complementary.

Thanks for your support and questions over the past few years and I hope you continue to find my ramblings informative.

Posted in Uncategorized | 8 Comments

Hexbin Scatterplot in Tableau

An interesting tweet came across my Twitter-stream the other day, showing a hexbin scatterplot chart type for Power BI:

Having just presented a session at TC17 on working with dense data where Sarah Battersby and I covered (among other things) hexbinning in Tableau, I was intrigued by this viz type and wondered if it could be created in Tableau. I was a little wary as mixing polygons and points together can be complicated, but I hoped it could be done.

Let’s just say that I’m glad I was bald when I started this exercise because it involved quite a bit of hair-pulling. But after a few hours of trial and error and a well-timed break to go sit in the sun and ruminate, I managed to produce this little beauty:

(Animated GIF: the hexbin scatterplot in action)

I started with Alberto Cairo’s Datasaurus dataset – a group of datasets that behave similarly to Anscombe’s quartet. Really I was just being lazy as I had it lying around and therefore didn’t need to mock up my own sample scatterplots. The source data looks like this:

dataset   record id   x         y
dino      1           55.3846   97.1795
dino      2           51.5385   96.0256
dino      3           46.1538   94.4872
dino      4           42.8205   91.4103
dino      5           40.7692   88.3333
dino      6           38.7179   84.8718
dino      7           35.641    79.8718

With the data in this format there are two approaches for generating the hexbins – one uses densification to generate the polygon vertex records, and the other generates them through a join to a scaffolding table. I opted to use the scaffolding approach as a) I have a manageable amount of data and b) it makes life easier when you have hexbins that contain just a single point. The scaffold table looks like this:

Point ID
0
1
2
3
4
5
6

And the join of these tables in Tableau looks like this (the join simulates a Cartesian product of the two tables):
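In SQL terms, that join is equivalent to something like this (the table names are assumed):

SELECT d."dataset",
       d."record id",
       d."x",
       d."y",
       s."Point ID"
FROM   DATASAURUS d
CROSS JOIN SCAFFOLD s;   -- SCAFFOLD just holds the Point ID values 0 to 6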

The result of this is 7 rows of data for each point on the scatterplot:

I’ll use one of these (PointID=0) to plot the actual point location, and the other 6 to plot the hexagon shape. I’ve blogged on several occasions about how to generate a dynamic hexbin polygon, and we’re going to use the same techniques here:

Generate the hexbin center point:
[HexbinX]: HEXBINX([X]/[Hexbin Size], [Y]/[Hexbin Size]) * [Hexbin Size]
[HexbinY]: HEXBINY([X]/[Hexbin Size], [Y]/[Hexbin Size]) * [Hexbin Size]

 

Generate a unique identifier for each hexbin. As you may know, I’m an advocate for efficiency so I use a numeric function for this (based on Cantor’s pairing function) instead of a string function:
[HexbinID]: ([HexbinX]^2 + 3*[HexbinX] + 2*[HexbinX]*[HexbinY] + [HexbinY] + [HexbinY]^2)/2
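For the record, that’s just the pairing function ((x + y)(x + y + 1))/2 + x multiplied out. As a quick worked example, the hexbin centre (2, 3) gets the ID ((2 + 3)(2 + 3 + 1))/2 + 2 = 17, and no other pair of non-negative integers produces 17 – which is what makes it usable as an identifier.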

 

Generate the actual plot points keeping the original location when PointID=0 and using trigonometry to generate the hexagon vertices when PointID=(1..6):
[PointType]: IF [Point ID] = 0 THEN 0 ELSE 1 END
[Angle]: (1.047198 * INDEX())   // 1.047198 ≈ π/3 radians (60°) – one sixth of a turn per hexagon vertex
[PlotX]: IF MIN([PointType]) = 0 THEN MIN([X]) ELSE WINDOW_AVG(MIN([HexbinX])) + [Hexbin Size]*COS([Angle]) END
[PlotY]: IF MIN([PointType]) = 0 THEN MIN([Y]) ELSE WINDOW_AVG(MIN([HexbinY])) + [Hexbin Size]*SIN([Angle]) END

We can now start plotting our viz – first let’s just get the points up:

You can see that the blue marks are the original data points and the orange points are the vertices for the hexagons. Because we want two mark types (a polygon and a point) we need a dual-axis chart:

We need to isolate the orange marks on one side and the blue marks on the other. We can’t filter them out, so we have to make some clever use of the “hide” function. I duplicated the [PointType] calculation from before so I can use one copy to colour each axis:

We then hide the marks we don’t need on each axis (right-click on the colour swatch in each legend and select “Hide”):

We can now make the hexagon marks on one axis, and circle marks on the other. Tidy up the colours and other formatting:

Finally, we set the chart to dual axis, synchronise the axes, hide the unwanted top axis, and voila:

The last couple of steps I put in were to a) colour the hexbins by the number of points they contain, b) tidy up the tooltips for each mark type, and c) set up a hover action to highlight the elements in a hexbin:

This ended up being quite a challenging viz that required a number of techniques to get done. But being able to do it at all reinforces for me that an expressive presentation model that lets you natively create complex chart types (i.e. the Tableau approach) is faster and more reliable than a model where you rely on a developer to write a custom chart widget (i.e. the Power BI model). Even accounting for the trial and error needed to nut out the final successful method, Tableau allowed me to achieve the result much faster than a solution based on coding.

And of course, now that I know how, I can reproduce this solution in minutes.

You can download the workbook from here. Enjoy.

 

PS. I couldn’t help myself. The workbook now includes solution examples using both the scaffolding and the densification approaches.

(Image: Hexbin Scatterplot)

It was a mental itch that needed scratching.

Posted in Uncategorized | 8 Comments

Loupe Tooltips

Well, it’s been a hectic week in Las Vegas for our customer conference but I have a brief window to put up this post. If you’ve been here, I hope you had a great time and learned heaps!

One of the sessions I co-delivered with the ever amazing Sarah Battersby (@MapsOverlord) was “Masters of Hex: Interpreting Dense Data with Tableau”. We’ve presented this session for the last 3 years, but as always we update our material for any new features and techniques. This year we had access to the 10.5 beta, and I came up with an idea for using the new viz-in-tooltip feature to dynamically zoom in on dense data.

I’ve called this idea a “loupe” tooltip – after the magnifying eyepiece used by photographers and watchmakers:

Here’s my starting data – a scatter plot with 100,000 data points packed densely:

As you can see, it’s impossible to make out what is going on in the bottom middle of the chart – there is too much overplotting of the marks even when we make the marks as small as possible and ramp up the transparency. But what if we could dynamically zoom in on a small section – like we were looking through a loupe?

To achieve this, I’ve created binning calculations to allow me to select a small group (I had to use calculations because a native bin field can’t be referenced in a calculation, which I need to do later):

BinX:    FLOAT(INT([X]*10)/10)
BinY:    FLOAT(INT([Y]*10)/10)

I’ve also created another sheet which we will use in the tooltip – essentially it’s just a duplicate of the primary scatter plot, but I’ve cleaned up the axes and the formatting to make it clean and minimal:

I also made a loupe title sheet to just show the count of the marks in the scope of the loupe:

Now we add them to the tooltip of the primary scatter plot, and set the filter fields to be BinX and BinY:

And voila!

However, we have a problem when we loupe over a sparse bin – the loupe axes zoom in to show just the point:

It would be preferable to fix the loupe to always show the extent of the bin, so we can use the neat trick of placing reference lines to pad out the axes to a larger size than the data demands. We create a couple of boundary calculations for X and Y:

XLowerBound:    MIN([BinX])
XUpperBound:    [XLowerBound] + 0.1
YLowerBound:    MIN([BinY])
YUpperBound:    [YLowerBound] + 0.1

We then put these calcs on the detail shelf of the loupe and we can create a reference band:

Now when we loupe over a sparse bin we have a much nicer view:

The workbook can be downloaded here but remember, you’ll need Tableau 10.5 to view this so make sure you enroll for the beta program!

Enjoy!

Posted in Uncategorized | 4 Comments

Letting go of custom SQL

I thought it might be helpful to share a recent experience where I was helping a customer who was complaining of slow-performing workbooks. When I arrived, their Tableau analyst showed me a couple of workbooks that were performing badly. How badly? Well, for one of the workbooks, just opening it took ~5 mins!

(GIF: Cookie Monster)

OK – time to start looking for problems, but where to begin? Fortunately, I’m presenting a session at our customer conference next week (Designing Efficient Workbooks: Live on Stage) in which I propose a framework for identifying performance issues.

Basically, the top layers are where you are most likely to find performance issues, so you should start by looking at the design of the dashboards and calculations (which are easy to fix) before you look at the data model or the underlying platform (where you often have less ability to change things).

With this framework in mind, I started reviewing the workbook.

Presentation

I can’t include a screenshot of the report, but let me describe it for you:

It was a simple design – a single sheet (not even a dashboard!) with a straightforward layout. But… it was a great big crosstab!

(GIF: frustrated reaction – me when I see great big crosstabs…)

That could be part of the problem, as big crosstabs are not the most efficient viz type to render. However, I noticed that the main delay when opening the workbook was in “executing query” rather than “computing layout”, so this probably wasn’t the main issue we needed to fix.

(GIF: these aren’t the droids…)

Analytics

Note: I’ve created a mock-up of the data, anonymised to avoid any embarrassment for the client. All the screenshots you will see from here on in reflect the original situation but are totally synthetic.

The analytics layer is concerned with the complexity of calculations and how you combine and manipulate the data (e.g. blending). Surprisingly this was extremely straightforward for the report:


(GIF: nope)

Data

The next layer down is concerned with the data sources and data connections – how you are retrieving the data from the original location. When I opened up the very simple data source I saw this:


OK – that explains why there were no calculations in the data source and why it was so simple. All the logic is wrapped up in a custom SQL query. It’s also a bit of a red flag… a live connection using custom SQL isn’t a recommended design pattern. Let’s take a look inside and see what we’re dealing with.


(GIF: surprised reaction)

Oh my! I think we have found the problem.

The customer’s database development team had provided the Tableau analyst with several SQL statements like this from their existing reporting solution. It’s not optimal SQL and I’m sure it could be improved if we tried, but the analyst (like many Tableau users) didn’t have the time or the SQL knowledge to do that, so they simply cut and pasted the query into a custom SQL connection. The result? The whole query had to run before Tableau could sub-select from the interim result set, and the experience was very slow.

What to do? Well, we could follow basic best practice and make this an extract instead of a live connection so the slow query only runs once – after that, everything points to the extract. But the customer wanted to maintain up-to-date views of their data, so a live connection was preferable. Even more so once we determined that the extract would require frequent intra-day refreshes and was potentially very large.

So, working with the analyst, we managed to reverse-engineer the relationships between the tables:


We then captured the logic for the CATEGORY and the MEASURE fields into a couple of Tableau calculated fields:



And the result? We reduced the time taken to render the dashboard from ~5 mins to < 10 secs! That’s a 30x improvement, which the customer was absolutely delighted with. They are now going to take the learnings from this exercise and apply them to their other workbooks.

(GIF: clapping)

The moral of the story? It reinforces that custom SQL can be a serious impediment to performance and should be used with caution. Leveraging the native capabilities of Tableau allows us to generate much more efficient queries, which is good both for response times and for the query load on the DBMS. If you must use custom SQL, wrap it in a view in the DBMS if you can, or pull it into an extract as a last resort.
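For example, a minimal sketch of the view approach (the names and columns are purely illustrative – the point is just to push the SQL into the database once and let Tableau treat it like a table):

CREATE OR REPLACE VIEW REPORTING.V_ORDER_SUMMARY AS
SELECT o.ORDER_ID,
       o.ORDER_DATE,
       c.CATEGORY_NAME AS CATEGORY,
       o.AMOUNT        AS MEASURE
FROM   SALES.ORDERS     o
JOIN   SALES.CATEGORIES c
  ON   c.CATEGORY_ID = o.CATEGORY_ID;

-- Tableau then connects to REPORTING.V_ORDER_SUMMARY like any other table,
-- and the database can optimise the view definition as part of each query it receives.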

Posted in Uncategorized | 10 Comments