Making the Most of a Logical Data Warehouse
Since the introduction (or, perhaps, popularization) of the term logical data warehouse by Gartner in 2011, the idea has gained ground in the industry and attracted quite a few supporters. Considered thoughtfully and with care, it is a useful concept.
However, there is one key aspect that is often understated and that prospective implementers must consider in order to succeed. The logical data warehouse is not a replacement for a traditional data warehouse; it's an extension.
What Is a Logical Data Warehouse?
Simply put, a logical data warehouse stands in contrast to a physical data warehouse, where "all" data is consolidated into a single physical database before use. A logical data warehouse allows data to stay wherever it currently resides and to be accessed there directly by BI applications and users. Data virtualization tools offer users a single interface -- often based on SQL -- to access data in multiple places or formats, which has led to an alternative name: a virtual data warehouse.
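To make the "single interface" idea concrete, here is a minimal sketch in Python. It is not how any particular data virtualization product works: two separate in-memory SQLite databases stand in for independent sources, and the schema names (`crm`, `sales`), tables, and sample rows are all hypothetical. The point is only that one SQL statement can join data that stays in two different stores.

```python
import sqlite3

# One connection plays the role of the virtualization layer.
conn = sqlite3.connect(":memory:")

# Hypothetical sources: each ATTACH creates a separate in-memory database,
# standing in for an operational CRM system and a sales data mart.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS sales")

conn.execute(
    "CREATE TABLE crm.customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute(
    "CREATE TABLE sales.orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?, ?)",
    [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)],
)


def revenue_by_customer(connection):
    """Join the two sources in a single query; no data is copied first."""
    query = """
        SELECT c.name, SUM(o.amount) AS revenue
        FROM crm.customers AS c
        JOIN sales.orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()


print(revenue_by_customer(conn))
# [('Acme', 350.0), ('Globex', 75.0)]
```

The join works here only because both sources happen to share a clean `customer_id` key. Real sources rarely line up this neatly.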
Despite claims to the contrary, the idea of a logical store to underpin a data warehouse is not new. My original data warehouse architecture paper, published in 1988, specifically defined "The business data warehouse ... [as] the single logical storehouse of all the information used to report on the business." Given the mostly mainframe technology and primitive relational database technology of the time, the phrase was more aspirational than real.
The long-term vision was clear -- the data warehouse should not be confined physically to a single database or machine. Nonetheless, the implemented reality of the time was a highly centralized, single enterprise data warehouse, often supported by distributed data marts, and common practice has changed little since.
It took some 20 years of evolution in technology and thinking for the concept of "a single logical storehouse" to be considered feasible. Data virtualization tools from both big database vendors such as IBM and smaller specialist companies such as Composite Software (acquired by Cisco) and Denodo have allowed more distributed and virtualized implementations.
What Are We Overlooking?
Virtualized access is only one aspect of the story, and arguably, it's the smaller part. As with the original data warehouse, the bigger questions are: How will the data from different sources actually work together? How can business users know what data exists and how it can be joined? A logical data model can answer these questions. (The importance of this model is a strong argument that the description "logical" should be favored over "virtual" in this approach to data warehousing.)
In these common questions, we also find one of the challenges of the logical data warehouse. Under certain circumstances, data from different sources cannot be reconciled at the time of access because of differences in meaning, timeliness, and so on. Sometimes the business meaning of diverse data must be constructed in advance.
This means that the logical data warehouse has to have a physical, beating heart of core data that is created before users gain access to the warehouse.
Reimagining the Enterprise Data Warehouse
Creating such reconciled, meaningful core data was, of course, a central purpose of the original enterprise data warehouse (EDW). In its original incarnation, the EDW had to contain much more than this (because there was no other place to put it) and became a repository of all data required for reporting and analysis purposes.
Such a load was probably too great for any construct to bear; performance suffered, and implementation and maintenance were challenging. With the adoption of a logical data warehouse architecture, the EDW can be reduced in size and scope, but its importance should not be underestimated. It remains a key component in the architecture.
Metadata -- or, as I prefer to call it, context-setting information -- also remains vital. Data virtualization tools certainly create and use it, but its scope goes far beyond the physical metadata needed to find and access diverse data sources. It must also carry the business meaning of the logical data model that enables users to relate to data across the enterprise without knowing where it may reside.
Data integration, or ETL, does not go away either. It continues to have a central role in populating the EDW. Additionally, it must be synchronized with the data virtualization component so that preloaded and real-time data are consistent.
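One simple way to picture that synchronization is a load watermark: the virtualization layer takes preloaded rows up to the time of the last ETL run and live rows only after it, so nothing is counted twice or missed. The sketch below is an illustration under that assumption, not a description of any specific tool; the `Sale` record, the `combined_sales` function, and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Sale:
    sale_id: int
    sold_at: datetime
    amount: float


def combined_sales(preloaded, live, watermark):
    """Return preloaded rows up to the watermark plus live rows after it."""
    historical = [s for s in preloaded if s.sold_at <= watermark]
    recent = [s for s in live if s.sold_at > watermark]
    return historical + recent


# Hypothetical data: the last ETL run loaded everything up to 2 January.
watermark = datetime(2024, 1, 2)
edw_rows = [
    Sale(1, datetime(2024, 1, 1), 100.0),
    Sale(2, datetime(2024, 1, 2), 50.0),
]
# The operational system still holds sale 2 and has a newer sale 3.
operational_rows = [
    Sale(2, datetime(2024, 1, 2), 50.0),
    Sale(3, datetime(2024, 1, 3), 25.0),
]

rows = combined_sales(edw_rows, operational_rows, watermark)
print([s.sale_id for s in rows])    # [1, 2, 3]
print(sum(s.amount for s in rows))  # 175.0
```

Without the shared watermark, sale 2 would appear in both the preloaded and the live result, and the total would be overstated.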
The key lesson, therefore, is that a logical data warehouse is an extension of and improvement to a traditional data warehouse, not something that can replace an existing one or simplify the implementation of a new one.
A logical data warehouse implementation offers the opportunity to focus on new business needs such as timely access to operational data or bridging data warehouse and big data stores. That's where implementers should concentrate their efforts.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.