Sunday, May 11, 2014

Business Intelligence ETL

There appears to be a quiet revolution going on in the commoditisation of business intelligence. Microsoft (late as ever) has weighed in with BI platforms that are very compelling for existing MS shops. Utilising Excel and the VertiPaq column-store engine that powers in-memory tabular data models allows advanced, fast dashboard and reporting solutions to be achieved with very little effort and cost.

The key, as ever, is separating the transactional requirements of the business systems that provide the raw data from the reporting requirements of the business, which involve joining data from various sources. Too often vendors assume that their software will be the centre of your universe and that the reports that come with the application are all you will need.

Data needs to be centralised, and doing this in a simple, source-agnostic manner is a huge challenge requiring good business analysis, technical knowledge, and a degree of foresight. To address this challenge we're working on a framework that maps raw data sources to data marts, supports various types of slowly changing data, and allows both current and point-in-time reporting in a tabular model, utilising Excel (and PowerPivot) as the platform to surface the information.
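
To make the distinction between 'current' and 'point-in-time' reporting concrete, the two query shapes below show what each looks like against a type 6 customer dimension. The table and column names (dim.Customer, ValidFrom, ValidTo, IsCurrent, CurrentCustomerName) are illustrative assumptions for this post rather than the framework's actual schema.

    -- Current view: one row per customer carrying today's attribute values
    SELECT CustomerNaturalKey, CurrentCustomerName
    FROM   dim.Customer
    WHERE  IsCurrent = 1;

    -- Point-in-time view: the attribute values that applied on a given date
    DECLARE @AsAt date = '2014-03-31';

    SELECT CustomerNaturalKey, CustomerName
    FROM   dim.Customer
    WHERE  ValidFrom <= @AsAt
      AND (ValidTo > @AsAt OR ValidTo IS NULL);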

The 'Transactional' to 'Reporting' format process supports the following features:

  • The mapping from the source data to the reporting format is defined in mapping configuration tables (a sketch follows this list)
  • The mapping configuration tables auto-generate the T-SQL required to update and insert into the reporting tables
  • The framework supports type 1 and type 6 updates to slowly changing data (the shape of the generated statements is also sketched after this list)
  • The framework will support set-based operations to maximise the performance of the load. No row-by-solitary-row operations should be considered
  • The framework will track all operations including the row count of the updated / inserted rows
  • The framework will support loading of backdated data if available
  • The framework will support a denormalised input data source - decomposing this data into its constituent dimension and fact tables in the target data mart
  • The framework validates denormalised input data to ensure there are no inconsistencies in the import - e.g. different customer names for the same customer ID in the import table
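
To give a flavour of the first two points, here is a minimal sketch of what a mapping configuration table might look like. The schema, table and column names (etl.ColumnMapping, stg.SalesExtract, dim.Customer and so on) are illustrative assumptions, not the framework's real definitions.

    -- Hypothetical mapping configuration table (all names illustrative)
    CREATE TABLE etl.ColumnMapping
    (
        MappingId     int IDENTITY(1,1) PRIMARY KEY,
        SourceTable   sysname NOT NULL,   -- denormalised staging table
        SourceColumn  sysname NOT NULL,
        TargetTable   sysname NOT NULL,   -- dimension or fact table in the data mart
        TargetColumn  sysname NOT NULL,
        IsNaturalKey  bit     NOT NULL DEFAULT 0,   -- used to match existing rows
        ScdType       tinyint NOT NULL DEFAULT 1    -- 1 = overwrite, 6 = hybrid history
    );

    -- Example: map two columns of the staging extract to the customer dimension
    INSERT INTO etl.ColumnMapping
            (SourceTable, SourceColumn, TargetTable, TargetColumn, IsNaturalKey, ScdType)
    VALUES  ('stg.SalesExtract', 'CustomerId',   'dim.Customer', 'CustomerNaturalKey', 1, 1),
            ('stg.SalesExtract', 'CustomerName', 'dim.Customer', 'CustomerName',       0, 6);

And this is roughly the shape of the set-based statements such a mapping might generate for a type 6 change to CustomerName - again a sketch under the same assumed names, not the framework's actual output. Capturing @@ROWCOUNT after each statement is one way to feed the row-count tracking mentioned above.

    DECLARE @LoadDate datetime2(0) = SYSDATETIME();

    -- 1. Close off the current version of any customer whose name has changed
    UPDATE d
    SET    d.ValidTo   = @LoadDate,
           d.IsCurrent = 0
    FROM   dim.Customer AS d
    JOIN   stg.SalesExtract AS s
           ON s.CustomerId = d.CustomerNaturalKey
    WHERE  d.IsCurrent = 1
      AND  s.CustomerName <> d.CustomerName;
    -- @@ROWCOUNT captured here would be written to the framework's load log

    -- 2. Insert a new current version for those customers
    --    (brand-new customers would be handled by a separate insert, omitted here)
    INSERT INTO dim.Customer
            (CustomerNaturalKey, CustomerName, CurrentCustomerName, ValidFrom, ValidTo, IsCurrent)
    SELECT DISTINCT s.CustomerId, s.CustomerName, s.CustomerName, @LoadDate, NULL, 1
    FROM   stg.SalesExtract AS s
    JOIN   dim.Customer     AS d
           ON d.CustomerNaturalKey = s.CustomerId
    WHERE  d.ValidTo = @LoadDate               -- the rows just closed in step 1
      AND  d.CustomerName <> s.CustomerName;

    -- 3. The type 1 element of type 6: refresh the 'current' column on every
    --    historical row so older facts still roll up under today's name
    UPDATE d
    SET    d.CurrentCustomerName = s.CustomerName
    FROM   dim.Customer AS d
    JOIN   stg.SalesExtract AS s
           ON s.CustomerId = d.CustomerNaturalKey
    WHERE  d.CurrentCustomerName <> s.CustomerName;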

Why go with a denormalised input to the framework?

  • It enforces row-by-row transactionality - if a row is processed, all the dimension and fact data contained in the row must be processed
  • It more clearly segregates the source data structure from the target structure. Without this denormalisation step it is all too tempting to replicate the source schema into the reporting schema
  • That separation also makes it easier to swap out the source of the data for a new (or additional) system if the transactional systems are replaced for any reason
  • The data extraction process is simplified - a single (complex) query can gather all of the required information from the source system to be uploaded to the reporting data mart (a sketch follows this list).
  • If historical extracts of data have been collected for 'human consumption' over time, this method supports the backloading of that data into the reporting data mart, as it is normally presented as a single denormalised set.
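
As an illustration of that single extract query, the sketch below flattens a hypothetical order / customer / product schema into one denormalised staging set. The source tables and columns are assumptions made up for the example only.

    DECLARE @ExtractFrom date = '2014-01-01';  -- incremental window; backdated loads simply widen it

    INSERT INTO stg.SalesExtract
            (OrderId, OrderDate, CustomerId, CustomerName, ProductId, ProductName, Quantity, NetAmount)
    SELECT  o.OrderId,
            o.OrderDate,
            c.CustomerId,
            c.CustomerName,
            p.ProductId,
            p.ProductName,
            ol.Quantity,
            ol.NetAmount
    FROM    src.Orders     AS o
    JOIN    src.OrderLines AS ol ON ol.OrderId   = o.OrderId
    JOIN    src.Customers  AS c  ON c.CustomerId = o.CustomerId
    JOIN    src.Products   AS p  ON p.ProductId  = ol.ProductId
    WHERE   o.OrderDate >= @ExtractFrom;

If the transactional system is ever replaced, only this query needs to change; the mapping from the staging set into the data mart stays the same.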

So what are the drawbacks?
  • The re-translation of the denormalised view of the data into the star or snowflake schema required for reporting is not trivial to write even by hand. Having a framework auto-generate this code is a challenge.
  • Any historical denormalised data often lacks the internal surrogate keys from the source system, so natural keys need to be identified and validated (what do you do if you find two John Smiths - is the second record a change of details or a new person?) - a check like the one sketched after this list helps surface these clashes
  • Auto-generated code is harder to performance tune, since any hand tuning is lost the next time the code is regenerated. Of course indexes can be added to source and target tables to speed up the generated queries (the framework could even define these for you based on its knowledge of the natural keys in use)
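
The validation itself can be as simple as a couple of grouping queries over the staging table. The sketch below (illustrative names again) flags natural keys that arrive with conflicting attribute values, and the reverse case where a candidate natural key maps to more than one source identifier - the 'two John Smiths' problem.

    -- Natural keys arriving with more than one distinct attribute value in this load
    SELECT   CustomerId,
             COUNT(DISTINCT CustomerName) AS DistinctNames
    FROM     stg.SalesExtract
    GROUP BY CustomerId
    HAVING   COUNT(DISTINCT CustomerName) > 1;

    -- Candidate natural keys (here just the name) that map to more than one source identifier
    SELECT   CustomerName,
             COUNT(DISTINCT CustomerId) AS DistinctIds
    FROM     stg.SalesExtract
    GROUP BY CustomerName
    HAVING   COUNT(DISTINCT CustomerId) > 1;

Any rows returned by either query would fail the load, or at least be routed for manual review before the data mart is updated.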