Data Virtualization: (Part 1 of 2) What Is It?

Overview

I was recently asked what it would take to implement a data virtualization strategy for an organization.  That is a big, open-ended question, and possibly a premature one.

Data virtualization is one of many data integration strategies.  It may not be the right one for you or it may be a small part of a larger one.  It depends on how data fits into the organization’s strategic vision (fit for purpose).  The decisions made will affect how well data is adopted and leveraged across the organization.

The intent of these two posts is to: (1) provide a simple and concise description of data virtualization; and (2) share advice from those who have gone through the process successfully.  The advice comes from interviews with corporate customers who have first-hand experience with selecting, purchasing, implementing, managing, and being accountable for corporate-wide data virtualization projects.

Framing the Conversation

For the purposes of this conversation, you have lots of data from lots of different sources (some structured, like SQL databases; others unstructured, like Twitter feeds or PDF documents).  You are looking for a strategy that will facilitate business growth by allowing users to access and leverage the entire data landscape.

This conversation is about data integration.  Topics like big data, data formats/languages, structured/unstructured, document stores, appliances, in-memory caching, warehouses, statistical analytics, etc. are secondary to the central question of how best to integrate all of the data in a way that will create business value.

What is Data Virtualization?

Data virtualization is middleware that provides a virtual data layer between disparate data sources and the front-end applications that use the data.

See the diagram below.  Above the virtual layer (in yellow) are connections to consumer-facing applications (A and B).  Below the layer are multiple data sources (1, 2, 3, and 4).  In the middle are data views and functions that transform the source data into outputs that are tailored to the needs of each consumer application.  In this diagram, the consumer-facing applications need a combination of data from different sources.  There are three important points to make.

The Secret Sauce:  Each vendor handles data virtualization solutions differently with regard to data integrity, optimization, connectivity, and security.

First, the data views and functions in the middle access only the relevant data sources.  For example, if application “A” needs combined data from data sources “1” and “3”, data virtualization fetches only the required data from those sources.  It doesn’t copy all of the data from sources “1” and “3”, and it doesn’t touch sources “2” and “4” (which, in this case, are used by application “B”).

Second, data is fetched only when the data is needed.  The process of requesting, accessing, transforming, and assembling the data into what the consumer application needs is on demand.

Third, the data sources remain in their original space and format. They are not physically extracted, transformed, and loaded into one large data store (see footnote 1 below).
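To make these three points concrete, here is a minimal sketch in Python.  It is not how any particular vendor's product works.  The VirtualLayer class, the source names, and the sample rows are all hypothetical, invented purely for illustration, and each "source" is just a function standing in for a real connector.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class VirtualLayer:
    """Hypothetical virtual data layer: it registers sources and views,
    but copies no data until a view is queried."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], List[Row]]] = {}
        self._views: Dict[str, Callable[["VirtualLayer"], List[Row]]] = {}
        self.fetch_log: List[str] = []  # records which sources were touched

    def register_source(self, name: str, fetch: Callable[[], List[Row]]) -> None:
        self._sources[name] = fetch

    def register_view(self, name: str, view: Callable[["VirtualLayer"], List[Row]]) -> None:
        self._views[name] = view

    def fetch(self, source: str) -> List[Row]:
        # Data is read from the source where it lives, at request time.
        self.fetch_log.append(source)
        return self._sources[source]()

    def query(self, view: str) -> List[Row]:
        # On demand: the view runs only when a consumer asks for it.
        return self._views[view](self)


def view_for_app_a(layer: VirtualLayer) -> List[Row]:
    """View for application A: combines sources 1 and 3 only."""
    names = {row["customer_id"]: row["name"] for row in layer.fetch("source_1")}
    totals: Dict[object, float] = {}
    for order in layer.fetch("source_3"):
        cid = order["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + order["total"]
    return [
        {"name": names[cid], "total_spend": total}
        for cid, total in sorted(totals.items())
    ]


def build_layer() -> VirtualLayer:
    """Wire up four made-up sources and one view (all sample data is invented)."""
    layer = VirtualLayer()
    layer.register_source("source_1", lambda: [
        {"customer_id": 1, "name": "Acme"},
        {"customer_id": 2, "name": "Globex"},
    ])
    layer.register_source("source_2", lambda: [{"sku": "X1", "stock": 10}])
    layer.register_source("source_3", lambda: [
        {"customer_id": 1, "total": 250.0},
        {"customer_id": 1, "total": 100.0},
        {"customer_id": 2, "total": 75.0},
    ])
    layer.register_source("source_4", lambda: [{"sku": "X1", "price": 9.99}])
    layer.register_view("app_a", view_for_app_a)
    return layer


if __name__ == "__main__":
    layer = build_layer()
    print(layer.query("app_a"))
    print(layer.fetch_log)
```

After a query for application A, the fetch log lists only sources 1 and 3, and nothing is read before the query is made.  A real product would also push filters down to each source so that only the required rows travel over the wire; this sketch skips that step to stay short.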

The Secret Sauce

What is the jar in the middle of the diagram?  That is the secret sauce.  Each vendor handles solutions differently with regard to data integrity, optimization, connectivity, and security.

At each data source, you have probably already addressed concerns such as performance, security, atomicity, consistency, isolation, durability, availability, etc.  The measure of a data virtualization product is how well it lets you leverage all of that good work, expand upon it, and consistently apply it across the organization.

Summary

With data virtualization, the data consumers are decoupled from the data sources.  Done well, this reduces the complexity of integrating and vending data.

If, after you fully understand your business needs, you have decided that data virtualization is a data integration strategy that you want to pursue, either on its own or as part of a hybrid strategy, then the next step is to evaluate the solutions that are out there.

The secrets in the sauce are each vendor’s approach to data integrity, optimal query performance, security, and making data available to the business.  When evaluating vendors, be sure to get a good taste of the sauce.

In part 2 of this post, I share advice from those who have gone through this process successfully.


Footnotes

  1. Compare this to a data warehouse.  With a data warehouse, all of your data is periodically copied into a separate system in advance and assembled into well-formed data models designed to support all defined requirements.  This is especially good if you need time-series analysis that requires a persistent view of historical data for comparison.  However, data warehouses are difficult to adjust as business strategies evolve.  Data virtualization is much more agile, but it may not provide the depth of historical data you need.