joel.fail

The stories of my failures

Tracking history in dimensions

One of the most important aspects in data warehousing and Business Intelligence, is the ability to look back in time. In fact tables this is not necessary due to the fact being registered on a set time. Dimensions do not record the moment some of their attributes change.
In this blog I want to show you one of the many approaches to handling type 2 slowly changing dimensions.
As an example let’s create a table with customers, this would be the raw import into the datawarehousing staging layer.
First of all let’s create a local database and schema’s for our demonstrations.
In our approach I have chosen to name the schema’s corresponding to their layer in the data warehouse: stg for staging layer, ods for operational data store, dim for dimensional tables and you could even create fct for fact.
One of the good practices I have learned from my colleagues is to keep the length of schema names and table aliasses the same length and to give them logical names.

Create database SCD_Example;
go
use scd_Example;
go
create schema stg
go
create schema dim
go
create schema ods
go

Now it is time to create the staging table for the customers (credit to mockaroo for the synthetic data).

use SCD_Example;

create table stg.customers (
    id int identity(1,1),
	first_name VARCHAR(50),
	last_name VARCHAR(50),
	email VARCHAR(50),
	gender VARCHAR(50),
	address VARCHAR(50)
);
insert into stg.customers (first_name, last_name, email, gender, address) values ('Oswell', 'Hampshire', 'ohampshire0@ezinearticles.com', 'Female', '905 Oakridge Junction');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Ara', 'Fitzpayn', 'afitzpayn1@fema.gov', 'Genderfluid', '2 Loftsgordon Terrace');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Brandais', 'Vaughn', 'bvaughn2@sina.com.cn', 'Female', '54 Dottie Avenue');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Hanna', 'Castillon', 'hcastillon3@reverbnation.com', 'Male', '15639 Oneill Court');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Arabel', 'Millson', 'amillson4@homestead.com', 'Polygender', '30 Steensland Crossing');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Esta', 'Overstall', 'eoverstall5@opensource.org', 'Bigender', '560 Anhalt Crossing');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Nanni', 'Vittel', 'nvittel6@google.com.br', 'Non-binary', '498 Darwin Way');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Coralie', 'Cowope', 'ccowope7@wordpress.org', 'Non-binary', '3 Bartillon Lane');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Fields', 'Urey', 'furey8@ftc.gov', 'Male', '418 Eggendart Street');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Alix', 'Colloby', 'acolloby9@issuu.com', 'Bigender', '16 Division Street');

As now the staging table is created, we simulated a part of the ETL process, namely the E part (Extract).
In this case we extracted the customer data from our source system into our data warehouse staging layer. If done correctly the output should result as in the screenshot below.

Great! So now we have our customer data in the staging layer. This table is an actual representation of our customer database as it sits in the operational system used by the employees. As you can see this table contains no history at all.
Here is the architectural schematic used by this method. We have just finished discussing the first part, the staging layer.

Operational Data Store

In the operational data store we will be storing our history of the dimensions by implement the insert date.
In this methodology explained here we will focus only on INSERT only, instead of updating records.
There is a very valid reason for this, in the recent years more and more companies designing their data warehouse take this approach.
Since the come of window functions it makes the update function obsolete in the valid_to column in a type 2 slowly changing dimension. A big issue in the past that using UPDATE alot the sql engine locks the rows and you start getting huge bottlenecks.
As said before for our ODS we will use a INSERT ONLY approach, so our ods table will look as follows

use SCD_Example
create table ods.customers (
id int identity(1,1),
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(50),
gender VARCHAR(50),
address VARCHAR(50),
inserted_datetime datetime,
deleted_record tinyint
);

The next step would be to insert the customer data to the ODS Layer right? So let’s go!

INSERT INTO ods.customers
SELECT [first_name]
      ,[last_name]
      ,[email]
      ,[gender]
      ,[address]
      ,getdate() as inserted_datetime
      ,1
  FROM [SCD_Example].[stg].[customers]

The output of the table should now look as follows, great! We have a date and time when the data was added.

This is usually the moment the staging table gets truncated, always use truncate instead of delete. Delete is a iterative function that deletes row by row and for each entry makes a entry in the transaction log (something we want to evade if we run this transformation frequently). Truncate deallocates the pages and makes the entry for the deallocation of pages in the transaction log (src: mssqltips.com).

In the next example we will simulate a new scheduled extraction process has ran and two new records have been added, a new customer: Juan Fernandez Lopez and address change of Esta Overstall.

use scd_example
insert into stg.customers (first_name, last_name, email, gender, address) values ('Esta', 'Overstall', 'eoverstall5@opensource.org', 'Bigender', '2310 Rock Street');
insert into stg.customers (first_name, last_name, email, gender, address) values ('Juan', 'Fernandez Lopez', 'j.fernandez-lopez@ucberkley.edu', 'Male', '110 Sproul Hall #5800');

To see the difference in the two tables we can do this using the SQL EXCEPT function, this will reflect the NOT matching records in the tables based on the selected attributes, in this case first_name, last_name and address.

select first_name, last_name, address from stg.customers
except
select first_name, last_name, address from ods.customers

Thanks to feedback of Mathias Halkjaer Petersen, he is correct about the fact that not all dimensions are adequate to do delta loading with EXCEPT due to the high frequency of changing a example Mathias gives is a payment validation job running each night providing a latest validation date field to the dimensional table.

Great! so now we have the differences between the actual data and the data stored in our ODS layer. This are called the delta’s.
Essentially we would want these delta’s inserted into our ODS to process later in our dimensional model right?
As we stated above we do not update or delete records in our approach even though Esta Overstall has two customer records now this will not form a problem.

with deltas as (
select first_name, last_name, address from stg.customers
except
select first_name, last_name, address from ods.customers
)

Insert into ods.customers
select del.first_name,
	   del.last_name,
	   stg.email,
	   stg.gender,
	   stg.address,
	   getdate() + 1 as inserted_datettime,
	   1
from deltas del
join stg.customers stg on del.first_name = stg.first_name and del.last_name = stg.last_name and del.address = stg.address

As you can see the records have been added, and as expected Esta Overstall is in it as duplicate record.
Of course the two newly added records have other time stamps (for now a slight difference of a day) but in practice the difference will be depended on how often that record is updated (That’s why they are called slowly changing dimensions).

One notably important mention by the commenter Eric is the reverse check (if a record has been removed from the source). The approach to this is similar as to checking the delta’s. First of all let’s remove a few records from our source.

delete stg.customers where id in (7,8,9)

Great! So in this simulation a employee decided to clean up the customers and remove these inactive records, however for reporting purposes we would like to maintain this data right? As Eric suggested let’s do a reverse lookup.

select first_name, last_name, address from ods.customers
except
select first_name, last_name, address from stg.customers

As we can see there are indeed three of our records removed from the source and will not generate any more information, however due to our wishes to maintain their history for reporting we will have to classify them. As you saw above the ods.customers now contains a boolean attribute to see if a record is deleted or not. We will insert these records with a FALSE on that attribute. This will allow us to classify it easily in our dimensional table.

with cte as (
select first_name, last_name, address from ods.customers
except
select first_name, last_name, address from stg.customers
)

insert into ods.customers
select cte.first_name,
	   cte.last_name,
	   ods.email,
	   ods.gender,
	   ods.address,
	   getdate() + 1 as inserted_datettime,
	   0
from cte 
join ods.customers ods on cte.first_name = ods.first_name and cte.last_name = ods.last_name and cte.address = ods.address

Everything comes together in dimensions

Now that we have a solid foundation for the staging and ODS layer it’s time to let the magic happen! As we have spoken about we do a only index approach to avoid locking of the database and overly unneeded expansion of the transaction log.
We will create a view for our dimensional table using the ods.customer as source.
Using a window function with LEAD() we are able to retrieve the activity of the record and classify a record as active or inactive. We can also classify deleted records as inactive.

    with expCTE as
        (
            SELECT [id]
      , [first_name]
      , [last_name]
      , [email]
      , [gender]
      , [address]
      , [inserted_datetime] as active_from
--	  , deleted_record
	  , case when deleted_record = 0 then inserted_datetime else LEAD(inserted_datetime) over (partition by first_name, last_name order by inserted_datetime) end as Valid_to
            FROM [SCD_Example].[ods].[customers]
        )

    select id,
        first_name,
        last_name,
        email,
        gender,
        address,
        active_from,
        Valid_to,
--		deleted_record,
        case when valid_to is null then 'Active' else 'Not Active' end as is_active
    from expCTE

Now let’s see what’s inside this view!

As you can see now the customers that have had been deleted from the source system have been set on all their corresponding records to not active, and their latest record valid_to date equals to their active_from date (as their activity is gone). But their records in fact tabels will still refer to their historical transactions like purchases or quotations.
Thanks again to commenter Eric to bringing up this topic.

As you can see the duplicate record we have inserted for Esta Overstall now has a valid_from and valid_to date with a activity code. This will create a history of all the (slowly) changing attributes.

This technique is essential in datawarehousing, some examples where it can be applied:
– Order status tracking => Order to shipment timing
– Quotation status tracking => How long does it take to receive a RFQ from the supplier
– Customer representative => Did one representative generate more revenue while responsible for a certain Customer? Why and how did he do it?

I hope you all enjoyed my blog and all suggestions are welcome. Please leave any questions below and idea’s for next blogs!
Until next time.

2 thoughts on “Tracking history in dimensions”

  1. There is one thing you still have to cover:
    What if a row is deleted from the source?

    You will want to end the validity of the last known row for a correct history in your ods.

    You can find the deleted rows using:
    select first_name, last_name, address from ods.customers
    except
    select first_name, last_name, address from stg.customers
    i.e. reversing stg and ods data.

  2. Hi Eric,
    Thank you kindly for your constructive feedback. You are totally right about this.
    I have edited the post and included your approach for checking removed records (added a deleted_record bool attribute in ods layer).

Leave a Comment

Your email address will not be published.