Dimensional Models in the Big Data Era
Technological advances have dramatically broadened the scope of our BI and analytics solutions. On the surface, many of these technologies appear to threaten the relevance of models in general and of the dimensional model in particular. However, a deeper look reveals that the value of the dimensional model rises with the adoption of big data technologies.
The Dimensional Model of Yesterday
The dimensional model rose to prominence in the 1990s as data warehouse architectures evolved to include the concept of the data mart. During this period, competing architectural paradigms emerged, but all leveraged the dimensional model as the standard for data mart design. The now familiar "stars" and "cubes" that comprise a data mart became synonymous with the concept of the dimensional model.
In fact, schema design is only one of several functions of the dimensional model. A dimensional model represents how a business measures something important, such as an activity or process. For each process it describes, the model captures the metrics that measure the process (if any) and the associated reference data. These models serve several functions, including:
- Capture business requirements (information needs by business function)
- Manage scope (define and prioritize data management projects)
- Design data marts (structure data for query and analysis)
- Present information (a business view of managed data assets)
Because the dimensional model is so often instantiated in schema design, its other functions are easily overlooked. As technologies and methods evolve, some of these functions are beginning to outweigh schema design in terms of importance to data management programs.
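To make the distinction between the model and its schema concrete, here is a minimal sketch in Python. It assumes a hypothetical `ProcessModel` structure and a hypothetical `star_schema_ddl` helper, and the "orders" process, its facts, and its dimensions are invented for illustration. The model is captured first as a plain statement of what the business measures; the star schema is just one artifact derived from it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProcessModel:
    """A dimensional model of one business process (illustrative only)."""
    process: str                 # the activity being measured
    facts: Tuple[str, ...]       # metrics that describe the process (may be empty)
    dimensions: Tuple[str, ...]  # associated reference data


def star_schema_ddl(model: ProcessModel) -> str:
    """Render the model as one fact table keyed to its dimension tables."""
    key_cols = [
        f"    {d}_key INTEGER REFERENCES dim_{d}({d}_key)" for d in model.dimensions
    ]
    fact_cols = [f"    {f} NUMERIC" for f in model.facts]
    body = ",\n".join(key_cols + fact_cols)
    return f"CREATE TABLE fact_{model.process} (\n{body}\n);"


# Hypothetical example: how a business might measure its order-taking process.
orders = ProcessModel(
    process="orders",
    facts=("quantity_ordered", "order_dollars"),
    dimensions=("date", "customer", "product"),
)

print(star_schema_ddl(orders))
```

The same `orders` object could just as easily feed a requirements document or a project plan, which is the point: the schema is an output of the model, not the model itself.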
New Technology and Data Management Programs
Since the 1990s, business uses for data assets have multiplied dramatically. Data management programs have expanded beyond data warehousing to include performance management, business analytics, data governance, master data management, and data quality management.
These new functions have been enabled, in part, by advances in technology. Relational and multidimensional databases can sustain larger data sets with increased performance. NoSQL technology has unlocked new paradigms for organizing managed data sets. Statistical analysis and data mining software have evolved to support more sophisticated analysis and discovery. Virtualization provides new paradigms for data integration. Visualization tools promote communication. Governance and quality tools support management of an expanding set of information assets.
As the scope of data management programs has grown, so too has the set of skills required to sustain them. The field of data management encompasses a broader range of specialties than ever before. Teams struggle to keep pace with the expanding demands, and data generalists are being stretched even thinner. These pressures suggest that something must give.
Amidst the buzz and hype surrounding big data, it's easy to infer that dimensional modeling skills might be among the first to go. It is now possible to manage data in a nonrelational format such as a key-value store, document collection, or graph. New processing paradigms support diverse data formats ranging from highly normalized structures to wide, single-table designs. Schema-less technologies do not require a model to ingest new data. Virtualization promises to bring together disparate data sets regardless of format, and visualization promises to enable self-service discovery.
Coupled with the notion that the dimensional model is nothing more than a form of schema design, these developments seem to imply that it is no longer relevant. The reality is precisely the opposite.
Dimensional Model Functions in the Age of Big Data
In the wake of new and diverse ways to manage data, the dimensional model has become more important, not less. News of its death as a form of schema design has been greatly exaggerated. At the same time, the prominence of its other functions has increased.
Schema Design: The dimensional model's best-known role, the basis for schema design, is alive and well in the age of big data. Data marts continue to reside on relational or multidimensional platforms, even as some organizations choose to migrate away from traditional vendors and into the cloud.
Although NoSQL technologies are contributing to the evolution of data management platforms, they are not rendering relational storage extinct. It is still necessary to track key business metrics over time, and on this front relational storage still reigns. In part, this explains why several big data initiatives seek to support relational processing on top of platforms such as Hadoop. Nonrelational technology is evolving to support relational processing; the future still contains stars.
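As a rough illustration of tracking a business metric over time in relational storage, the sketch below builds a tiny star in an in-memory SQLite database using Python's standard `sqlite3` module. The table names, columns, and rows are all invented for the example; a real data mart would be far larger and would live on whatever relational or multidimensional platform the organization has chosen.

```python
import sqlite3

# In-memory database holding a hypothetical sales star: one fact, two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        sales_dollars REAL
    );
    INSERT INTO dim_date VALUES (1, '2024-01'), (2, '2024-01'), (3, '2024-02');
    INSERT INTO dim_product VALUES (10, 'Books'), (20, 'Games');
    INSERT INTO fact_sales VALUES
        (1, 10, 100.0), (2, 10, 50.0), (2, 20, 30.0), (3, 10, 70.0), (3, 20, 20.0);
""")

# The classic star query: join the fact to its dimensions, then aggregate
# the metric by the dimension attributes of interest.
monthly_sales = conn.execute("""
    SELECT d.month, p.category, SUM(f.sales_dollars)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()

for month, category, dollars in monthly_sales:
    print(month, category, dollars)
```

The query shape stays the same no matter how many months of history accumulate in the fact table, which is much of what makes the star design durable for trend analysis.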
The Business View: That said, there are numerous data management technologies that do not require the physical organization of data in a dimensional format, and virtualization promises to bring disparate data together from heterogeneous data stores at the time of query. These forces lead to environments where data assets are spread across the enterprise and organized in dramatically different formats.
Here, the dimensional model becomes essential as the business view through which information assets are presented and accessed. Like the semantic layers of old, the business view serves as a catalog of information resources expressed in nontechnical terms, shielding information consumers from the increasing complexity of the underlying data structures and protecting them from the increasing sophistication needed to formulate a distributed query.
This unifying business view grows in importance as the underlying storage of data assets grows in complexity. The dimensional model is the business's entry point into the sprawling repositories of available data and the focal point that makes sense of it all.
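One way to picture the business view is as a catalog that maps business terms to wherever the data physically lives. The sketch below is an assumption-laden toy, not a description of any particular virtualization product: the `Source` class, the `BUSINESS_VIEW` catalog, the `plan_request` function, and every term and location in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    store: str      # kind of platform holding the data
    location: str   # where to find it on that platform


# Hypothetical catalog: business terms on the left, physical homes on the right.
BUSINESS_VIEW = {
    "Order Dollars": Source("relational", "fact_orders.order_dollars"),
    "Customer Segment": Source("document", "customers.profile.segment"),
    "Page Views": Source("key-value", "clickstream:page_views"),
}


def plan_request(terms):
    """Group requested business terms by the store that must be queried."""
    plan = {}
    for term in terms:
        source = BUSINESS_VIEW[term]  # a KeyError means the term is not cataloged
        plan.setdefault(source.store, []).append(source.location)
    return plan


# The information consumer asks in business terms only.
plan = plan_request(["Order Dollars", "Page Views"])
print(plan)
```

The consumer never sees table names, document paths, or key prefixes; those details stay behind the catalog, where they can change without disturbing the business vocabulary.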
Information Requirements and Project Scope: As data management programs have expanded to include performance management, analytics, and data governance, information requirements take on a new prominence. In addition to supporting these new service areas, they become the glue that links them together. The process-oriented measurement perspective of the dimensional model is the core of this interconnected data management environment.
The dimensional model of a business process provides a representation of information needs that simultaneously drives the traditional facts and dimensions of a data mart, the key performance indicators of performance dashboards, the variables of analytics models, and the reference data managed by governance and MDM.
In this light, the dimensional model becomes the nexus of a holistic approach to managing BI, analytics, and governance programs. In addition to supporting a unified road map across these functions, a single set of dimensional requirements enables their integration. Used at a program level to define the scope of projects, the dimensional model makes possible data marts and dashboards that reflect analytics insights, analytics that link directly to business objectives, performance dashboards that can drill to OLAP data, and master data that is consistent across these functions.
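The integration benefit of a single set of dimensional requirements can be sketched as a simple consistency check. In the hypothetical Python example below, `MODEL` stands in for the shared dimensional requirements and `ARTIFACTS` lists the data elements each program uses; all names, including the `unmapped_elements` helper, are invented for illustration. Any element a program uses that the shared model does not define is flagged as a gap.

```python
# Hypothetical shared requirements for one business process.
MODEL = {
    "process": "orders",
    "facts": {"order_dollars", "quantity_ordered"},
    "dimensions": {"date", "customer", "product"},
}

# Hypothetical data elements used by each program's deliverable.
ARTIFACTS = {
    "data_mart": {"order_dollars", "quantity_ordered", "date", "customer", "product"},
    "dashboard_kpis": {"order_dollars", "customer"},
    "analytics_variables": {"quantity_ordered", "product", "churn_score"},
    "master_data": {"customer", "product"},
}


def unmapped_elements(model, artifacts):
    """Return, per artifact, the elements that the shared model does not define."""
    known = model["facts"] | model["dimensions"]
    return {
        name: sorted(used - known)
        for name, used in artifacts.items()
        if used - known
    }


gaps = unmapped_elements(MODEL, ARTIFACTS)
print(gaps)
```

Here the analytics program relies on an element the shared model does not yet cover, which is exactly the kind of disconnect a program-level dimensional model is meant to surface before the deliverables drift apart.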
As businesses move to treat information as an enterprise asset, a dimensional model of business information needs has become a critical success factor. It enables the coordination of multiple programs, provides for the integration of their information products, and puts a unifying face on the information resources available to business decision makers.
Editor's Note: This author will be leading an in-depth session on this topic at TDWI Chicago.