The stories of my failures

Work-around for Id columns in Dimensions using Synapse Serverless SQL engine

Azure Synapse is one of the latest flagship tools for analytics within the Azure Cloud. Synapse analytics is an all-in-one platform for data integration, processing, data warehousing and analytics. Azure Synapse is very cost efficient and highly advised for small to middle companies to start their BI landscape.

However Synapse has some peculiarities, one of which I will go further in depth today as it has been a crucial point for us.
Synapse can be used in a multiple of varieties, serverless SQL pool (which you pay per TB of data around $5 transferred) and the dedicated pool which starts around $900 per month. I would like to leave aside the possibilities with the built-in data flows and Apache Spark for another post.

Our problem was really specific for the non-dedicated SQL pool, as this is the primary option most small-middle range company’s will use. When designing our Dimensions we noticed the IDENTITY operator was not available to generate our surrogate keys.
Our first idea was to use the Hash keys from our Data vault model where the history is stored. However, when using the Hashkeys in PowerBI it would not accept it, Hashbytes generates a varbinary datatype which is not compatible in PowerBI for relations.

One of the next options we tried is using the checksum function on the Business keys to generate a integer checksum, however there could be a collision due to the min -2147483648 and max 2147483647 values of integers.

One of the solutions we came up with is the use of ROW_NUMER() OVER (ORDER BY HK_Customer), this satisified our needs to create a surrogate key for the dimension Id’s which can be retrieved to the fact table without problem!
Here is a example of how a dimension surrogate key can be generated for in this case DimStation.

Thank you all for your time reading!

Leave a Comment

Your email address will not be published.