Sunday, September 07, 2014

Graph Databases

Graph databases are not new - sites like LinkedIn and Facebook are built on highly connected data that is not managed on traditional RDBMS (Relational Database Management System) infrastructure. Graph DB technology is being rapidly commoditised, with platforms like Neo4J and OrientDB leading the way. I believe they will become a new de facto standard for developing all sorts of business and online applications once the inertia of 30+ years of RDBMS thinking is slowly broken down.

Often when I describe what a graph database is - the ideal way to store highly connected data - most people just shrug and say that problem was solved years ago by RDBMS platforms; they have 'Relational' in the name after all, right there as the first letter of the acronym! This post is an attempt to explain what makes graph databases a better choice for many applications.

Firstly, let's take a look at an example. Say you have a permissioning service which manages permissions for various systems, grouped by roles which in turn have a list of functions. In a relational model we may end up with tables for 'System', 'Role', 'Function', and 'Person', with additional join tables for 'Role_Function' and 'Person_Role'. A typical query of this model would be to determine which functions Person 'A' has permission to use in System 'X'. The most basic TSQL implementation would be something like:

-- Functions available to Person 'A' in System 'X'
SELECT DISTINCT [Function].Id, [Function].Name
FROM [Function]
INNER JOIN Role_Function ON Role_Function.FunctionId = [Function].Id
INNER JOIN Role ON Role.Id = Role_Function.RoleId
INNER JOIN System ON System.Id = Role.SystemId
INNER JOIN Person_Role ON Person_Role.RoleId = Role.Id
INNER JOIN Person ON Person.Id = Person_Role.PersonId
WHERE Person.Name = 'A'
AND System.Name = 'X'

Of course, you then add in some reasonable complexity: supporting the fact that some functions may imply permissions to other functions (to edit a record you need to be able to view or search for it), or that you might have profiles linked to positions rather than directly to people. You quickly end up with an explosion in the JOIN factory and the TSQL becomes many times more complicated.
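
To give a flavour of how the query grows, here is a rough sketch of how just the function hierarchy part might be handled. It assumes a hypothetical Function_Function join table (ParentFunctionId, ChildFunctionId) recording which functions imply others - that table is not part of the model above and is only here for illustration:

-- Sketch only: assumes a hypothetical, acyclic Function_Function hierarchy table
WITH FunctionHierarchy AS
(
    -- Functions granted directly via the person's roles in the system
    SELECT f.Id, f.Name
    FROM [Function] f
    INNER JOIN Role_Function rf ON rf.FunctionId = f.Id
    INNER JOIN Role r ON r.Id = rf.RoleId
    INNER JOIN System s ON s.Id = r.SystemId
    INNER JOIN Person_Role pr ON pr.RoleId = r.Id
    INNER JOIN Person p ON p.Id = pr.PersonId
    WHERE p.Name = 'A'
    AND s.Name = 'X'

    UNION ALL

    -- Functions implied by functions already granted
    SELECT child.Id, child.Name
    FROM FunctionHierarchy fh
    INNER JOIN Function_Function ff ON ff.ParentFunctionId = fh.Id
    INNER JOIN [Function] child ON child.Id = ff.ChildFunctionId
)
SELECT DISTINCT Id, Name
FROM FunctionHierarchy

And that is before profiles and positions enter the mix.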

In a graph world, however, each row in each table simply becomes a node in the graph. The Person 'A' node would have an 'IS_IN_ROLE' relationship to a bunch of Roles, which would be linked to Systems with a 'HAS_ROLE' relationship and to Functions with an 'INCLUDES' relationship. Functions could relate to each other in a hierarchy. You could add Profile nodes which a Person could hold and which can have Roles of their own, and so on. E.g. a graph looking something like the sketch below:
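
The relationship names in this sketch follow the description where given; 'HOLDS' and 'IMPLIES' are simply illustrative names I've picked for the profile and function-hierarchy links:

// Hedged sketch of the example graph ('HOLDS' and 'IMPLIES' are illustrative names)
CREATE (x:System {Name:"X"}),
       (admin:Role {Name:"Administrator"}),
       (edit:Function {Name:"Edit Record"}),
       (view:Function {Name:"View Record"}),
       (a:Person {Name:"A"}),
       (mgr:Profile {Name:"Manager"}),
       (x)-[:HAS_ROLE]->(admin),
       (admin)-[:INCLUDES]->(edit),
       (edit)-[:IMPLIES]->(view),
       (a)-[:IS_IN_ROLE]->(admin),
       (a)-[:HOLDS]->(mgr),
       (mgr)-[:IS_IN_ROLE]->(admin)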


With graph technology come new query languages and syntaxes. Neo4J provides the very elegant Cypher language, which allows you to query the graph very succinctly. E.g. our complex, non-performant TSQL statement might look like this in a graph world:

MATCH (:System {Name:"X"})-->(r:Role)-[*]->(f:Function)
,(p:Person {Name:"A"})-[*]->(r)
RETURN DISTINCT f

This query accommodates function hierarchies and any sort of connection from a Person to a Role (e.g. via positions). In fact, whilst it is already simpler than the TSQL version above, it is also more powerful, more performant, and more flexible to changes in the underlying model. We actually have this permissioning system in our organisation written with an RDBMS back end, and with the additional complexities mentioned, the TSQL query to retrieve functions for a user runs to over 200 lines - in Cypher it condenses down to the simple 3-line statement above.

So with the intro out of the way, what are the benefits?

  • One of the obvious benefits of graph DBs is the types of queries that are easily supported, which often DO NOT require changing even when the graph structure itself changes. This greatly speeds up development.
  • Another benefit is performance. In the TSQL world there are many index lookups going on to find data in separate tables to JOIN on. In the graph world each node holds direct references to its related nodes, meaning that traversing the graph (given known starting points like the Person named "A" and the System named "X") is super fast, as it only ever considers related nodes to see whether they match the query. In fact, although indexes are supported in graph DBs, they are generally only used to 'anchor' the query to fixed starting points in the graph, not to find the data being retrieved.
  • Flexibility to requirements changes. In an agile development world (which is everywhere now really, right?) graph databases accommodate changes to requirements far more easily. The rise of ORMs was due largely to the impedance mismatch between object-oriented development and the RDBMS data storage structure. Graph DBs remove this issue by allowing data to be stored in a way that more closely matches the code. In fact graph DBs do not strictly have schemas (though this is somewhat dependent on the technology used) - there is nothing to prevent one node representing a Person having an 'Eye Colour' attribute and another node having a 'Height' attribute (see the short sketch after this list). Obviously for use in business applications you will expect some conformity, but this is held and defined in code rather than in a separate DB schema as with an RDBMS.
  • Deployment of changes is also simplified. Though there are gotchas to look out for with the lack of a schema-driven model, you are free to add and remove nodes and relationships dynamically, meaning you can re-organise the structure of the graph in a live environment.
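
As a quick, hedged illustration of the last few points - a sketch rather than a prescription - two Person nodes can happily carry different attributes, indexes exist only to anchor queries at known starting points, and new relationship types can be introduced on a live graph ('ADMINISTERS' is an invented example, not part of the model above):

// Nodes need not share the same attributes - no schema change required
CREATE (:Person {Name:"A", EyeColour:"Blue"}),
       (:Person {Name:"B", Height:180})

// Indexes only anchor queries at fixed starting points
CREATE INDEX ON :Person(Name)

// Re-organise the live graph with a brand new (invented) relationship type
MATCH (p:Person {Name:"A"}), (s:System {Name:"X"})
CREATE (p)-[:ADMINISTERS]->(s)
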
And the drawbacks?
  • Most obvious is the lack of mainstream support. Graph technology is new and untrusted in both the enterprise architecture and development worlds. This will change over time as exposure increases.
  • The market has not yet stabilised, meaning even the most prominent players have not settled on a standardised query language or code base (e.g. Neo4J have recently deprecated their original APIs).
  • There are some applications where a 'good old' RDBMS is still more suitable. Any application with serious aggregation / number-crunching requirements, or where the structure of the information is very static, not highly related, and not subject to frequent change, is probably still going to be developed on an RDBMS back end. Though I'd hazard a guess that there are fewer and fewer of these systems left.
  • Reporting requirements are also probably better served by a properly structured reporting cube maintained separately from the graph. This is actually true of systems running on an RDBMS too, but since TSQL can aggregate data well, reporting and transactional requirements are often supported by a single DB. If you are a reporting purist, this is in some ways another benefit of the graph DB, as it forces us to think about the reporting requirements separately from the transactional requirements of the system.
If you want to investigate graph DBs, some further reading and suggestions:
  • Neo4J - the self-proclaimed world's leading graph DB - has a free community edition and a fantastic query language in Cypher. Beware of the 'Oracle'-like license model for enterprise implementations though. The site includes some great intro information and links to graph DB examples, demos, and tutorials.
  • OrientDB is another great graph DB which also operates as a document DB. Its query language is based on SQL to make it 'more familiar to TSQL developers'. It also supports a more controlled definition of node types, with inheritance from higher-order types. Like Neo4J, there is a community edition available, and licensing for enterprise is very reasonable.


Sunday, May 11, 2014

Business Intelligence ETL

There appears to be a quiet revolution going on in the commoditisation of business intelligence. Microsoft (late as ever) has weighed in with BI platforms that are very compelling for existing MS shops. Utilising Excel and in-memory tabular data models built on the VertiPaq column store allows advanced, fast dashboard and reporting solutions to be achieved with very little effort and cost.

The key, as ever, is separating the transactional requirements of the business systems that provide the raw data from the reporting requirements of the business, which typically join data from various sources. Too often a vendor assumes that its software will be the centre of your universe and that the reports that come with the application are all you will need.

Data needs to be centralised, and doing this in a simple, source-agnostic manner is a huge challenge - it requires good business analysis, technical knowledge, and a degree of foresight. To address this challenge we're working on a framework that maps raw data sources to data marts, supporting various types of slowly changing data in a way that allows both current and point-in-time reporting in a tabular model, utilising Excel (and PowerPivot) as the platform to surface information.

The 'Transactional' to 'Reporting' format process supports the following features:

  • The mapping from the source data to the reporting format is defined in mapping configuration tables
  • The mapping configuration tables auto-generate the TSQL required to update and insert into the reporting tables
  • The framework supports type 1 and type 6 updates to slowly changing data
  • The framework will support set-based operations to maximise the performance of the load - no row-by-solitary-row operations should be considered
  • The framework will track all operations including the row count of the updated / inserted rows
  • The framework will support loading of backdated data if available
  • The framework will support a denormalised input data source - aggregating this data into constituent dimension and fact tables in the target data mart
  • The framework validates denormalised input data to ensure there are no inconsistencies in the import - e.g. different customer names for the same customer ID in the import table
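
As an example of that last validation point, a check for inconsistent customer details in a denormalised import might look something like the following (the staging table and column names are purely illustrative, not the framework's actual schema):

-- Hedged sketch: flag customer IDs arriving with more than one distinct name
SELECT CustomerId,
       COUNT(DISTINCT CustomerName) AS NameVariants
FROM Staging_SalesImport
GROUP BY CustomerId
HAVING COUNT(DISTINCT CustomerName) > 1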

Why go with a denormalised input to the framework?

  • It enforces row-by-row transactionality - if a row is processed, all the dimension and fact data contained in that row must be processed
  • It more clearly segregates the source data structure from the target structure. Without this denormalisation step it is all too tempting to replicate the source schema into the reporting schema
  • The previous point also makes it easier to swap out the source of the data for a new (or additional) system if the transactional systems are replaced for any reason
  • The data extraction process is simplified - a single (complex) query can gather all of the required information from the source system to be uploaded to the reporting data mart.
  • If historical extracts of data have been collected for 'human consumption' over time this method supports the backloading of this data into the reporting data mart, as this data is normally presented as a single denormalised set.

So what are the drawbacks?
  • The re-translation of the denormalised view of the data into the star or snowflake schema required for reporting is not trivial to write even by hand. Building a framework to auto-generate this code is a challenge.
  • Historical denormalised data often lacks the internal surrogate keys from the source system, so natural keys need to be identified and validated (what do you do if you find two John Smiths - is the second record a change of details or a new person?)
  • Auto-generated code is harder to performance tune, especially in any round-trip way. Of course, indexes can be added to the source and target tables to speed up the generated queries (the framework could even define these for you based on its knowledge of the natural keys in use)
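
To give a feel for what the generated code might produce, a set-based type 1 load of a customer dimension from a denormalised staging table could come out as something like the following - a hedged sketch with illustrative table and column names, not the framework's actual output:

-- Sketch of a generated type 1 (overwrite) dimension load; names are illustrative
MERGE dbo.DimCustomer AS target
USING (
    SELECT DISTINCT CustomerId, CustomerName, CustomerRegion
    FROM Staging_SalesImport
) AS source
ON target.CustomerId = source.CustomerId
WHEN MATCHED AND (target.CustomerName <> source.CustomerName
               OR target.CustomerRegion <> source.CustomerRegion) THEN
    UPDATE SET target.CustomerName = source.CustomerName,
               target.CustomerRegion = source.CustomerRegion
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, CustomerName, CustomerRegion)
    VALUES (source.CustomerId, source.CustomerName, source.CustomerRegion);
-- @@ROWCOUNT after the MERGE gives the row count the framework tracks for each operation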

Friday, January 10, 2014

IT jobs

When computers were first around, almost everything revolved around the programmer - it took a programmer to write the code to make the computer actually run, to write the compiler to interpret that code, the code for the compiler to run, etc. Over time, things were abstracted away. Chip sets were standardised, then standard operating systems were introduced, then development platforms and databases made development and storage easier, then whole industry-standard packages meant programmers weren't relied on to deliver a business function. 25 years ago a company may have considered writing their own software for handling accounts payable, 15 years ago they may have considered building a content management system for their web site, 10 years ago they may have considered building an enterprise service bus. All of those decisions, if made today, would be considered crazy.

In the meantime, hardware has also been steadily moving from an engineer-centric world to a commoditised virtual world - physical boxes to virtual machines and now to virtual data centres in the cloud. 10 years ago you might have required 4 weeks' notice to procure a new server; now it can take minutes. In fact most of the software applications we develop now are deployed using a scorched-earth policy - a new server is spun up, code deployed, tests run, DNS switched over, and the old server decommissioned, all automatically in minutes.

What does this mean for IT workers? My guess is that network engineers should consider adding other skills - DevOps being a natural progression. Developers will continue to be needed, but expect more and more of the work to be 'filling in the gaps' between off-the-shelf systems (be that integration gaps or functional gaps). Developers with a few strings to their bow will be in demand, especially around service buses, emerging technologies (graph DBs, mobile), and platforms (SalesForce, Dynamics, Sharepoint). Of course, if you're near retirement age and know Cobol there will still be demand until at least 2023!