What will the data architectures of the future look like? You can bet that something like data federation will be in the mix.
Federation is an old concept that hasn't quite shrugged off its baggage. When it first emerged 15 years ago, vendors -- and even some data management practitioners -- championed it as a replacement for traditional data warehouse architecture.
This was wildly misguided. A decade and a half later, however, similar concepts live on in data virtualization, as well as in products such as Teradata's QueryGrid. IBM, Oracle, and SAP-Sybase not only market data federation technologies but have also introduced federation-like capabilities in their flagship RDBMSs.
Why Data Federation?
Because both data storage and data processing are -- and will continue to be -- highly distributed, something like data federation is inevitable. Data lives in more and different places today than at any other time in human history.
The architectural and platform innovations that were supposed to eliminate this distribution -- the data warehouse and the Hadoop data lake -- have failed. Data still lives in legacy repositories and black box applications, operational data stores and sandboxes, a new generation of streaming repositories, cloud apps and services, and a slew of disparate -- in some cases unknown -- silos.
Data lives in so many places because data use and consumption paradigms have changed, argues Mark Madsen, a research analyst with information management consultancy Third Nature. This is particularly true of how organizations create and use analytics. The data warehouse was designed for a paradigm in which users passively consumed reports and dashboards. It worked because it centralized both data and access, and because its data was extracted from upstream systems and transformed to conform to a predefined schema.
In contrast, the advanced analytics use cases of today are characterized by open-ended exploration and much deeper (usually iterative) data analysis, Madsen says. Unlike the reports and dashboards that were the bread and butter of the data warehouse, exploratory uses have unpredictable data and processing needs.
Data Movement's the Thing
As we retool our organizations for advanced analytics, we're increasingly confronting the problem of accessing data and making it available for analysis. At its core, this is an issue of data movement. The problem is that the physics of moving data at big data scale is incredibly daunting.
According to Madsen and other experts, data movement will be one of the biggest problems going forward. The challenge is not just moving data but minimizing how much of it must be moved. This requires shifting the data engineering workload -- the preparation and transformation of data -- to the systems on which the data to be moved "lives." Instead of moving a large volume of data en bloc, data is processed in place so that only a small subset is actually moved.
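To make the idea of processing data in place concrete, here is a minimal sketch in Python. It is an illustration only: the in-memory SQLite database stands in for a remote source system, and the shipments table, its columns, and the function names are all hypothetical. It compares moving every row and summarizing afterward with asking the source to summarize first.

```python
import sqlite3


def make_source(n_rows=10_000):
    """Build a hypothetical in-memory 'remote' source holding one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shipments (region TEXT, weight_kg REAL)")
    regions = ["north", "south", "east", "west"]
    rows = [(regions[i % 4], float(i % 50)) for i in range(n_rows)]
    conn.executemany("INSERT INTO shipments VALUES (?, ?)", rows)
    conn.commit()
    return conn


def move_then_aggregate(source):
    """Pull every row across, then total the weights on the consumer side."""
    rows = source.execute("SELECT region, weight_kg FROM shipments").fetchall()
    totals = {}
    for region, weight in rows:
        totals[region] = totals.get(region, 0.0) + weight
    return totals, len(rows)  # second value: rows that had to be moved


def aggregate_in_place(source):
    """Let the source compute the totals; only the summary rows are moved."""
    rows = source.execute(
        "SELECT region, SUM(weight_kg) FROM shipments GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


source = make_source()
totals_moved, rows_moved = move_then_aggregate(source)
totals_in_place, rows_in_place = aggregate_in_place(source)
print(rows_moved, rows_in_place)
```

Both approaches produce the same totals, but the second moves one row per region rather than every row in the table. In a real environment the gap would be measured in network transfer rather than Python list lengths.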
Philip Russom, senior director of data management for TDWI Research, tackles this issue in a new TDWI Checklist Report, Evolving Toward the Modern Data Warehouse. Russom sees a more central (if radically transformed) role for the data warehouse than does Madsen, but he likewise zeroes in on data distribution -- and the attendant issue of data movement -- as a challenging problem.
"The trick is integrating big data or data lake platforms and an RDBMS so they work together optimally. For example, an emerging best practice ... is to manage diverse big data in [the Hadoop Distributed File System] but process it and move the results ... to RDBMSs ... that are more conducive to SQL-based analytics," Russom writes.
"This requires new interfaces and interoperability between big data or data lake platforms and RDBMSs, and it requires integration at the semantic layer in which all data -- even multistructured, file-based data in Hadoop or Spark -- looks relational," he argues. "This is the secret sauce that unifies the RDBMS/big data and data lake architecture. It enables distributed queries based on standard SQL that simultaneously access data in the warehouse, HDFS, and elsewhere without preprocessing data to remodel or relocate it."
What Russom describes sounds an awful lot like data federation -- or rather its replacement, data virtualization.
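A small sketch can show what a distributed query over data that "looks relational" means in practice. The example below is a toy, not a description of any vendor's product: it uses SQLite's ATTACH DATABASE statement to put two separate in-memory databases behind one connection, with one standing in for the warehouse and the other for a second store. The table names, column names, and sample values are hypothetical.

```python
import sqlite3


def run_federated_query():
    """Join tables that live in two separate databases with one SQL statement."""
    conn = sqlite3.connect(":memory:")  # stands in for the warehouse
    conn.execute("ATTACH DATABASE ':memory:' AS lake")  # stands in for a second store

    conn.execute("CREATE TABLE main.customers (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE lake.events (customer_id INTEGER, clicks INTEGER)")
    conn.executemany(
        "INSERT INTO main.customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
    )
    conn.executemany(
        "INSERT INTO lake.events VALUES (?, ?)", [(1, 5), (1, 7), (2, 3)]
    )

    # One statement spans both stores; neither data set is copied into the other first.
    return conn.execute(
        "SELECT c.name, SUM(e.clicks) "
        "FROM main.customers AS c "
        "JOIN lake.events AS e ON e.customer_id = c.id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()


print(run_federated_query())
```

The person writing the query sees a single relational namespace. Where each table physically lives is a detail handled by the layer underneath, which is the role Russom assigns to the semantic layer.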
Third Nature's Madsen expands on this point.
By virtue of the diversity and complexity of analytics workloads, Madsen says, the data warehouse is now just one among several environments for analytics. Analytics sandboxes -- in the form of standalone RDBMS systems, small Hadoop (or Hadoop/Spark) clusters, and data lakes -- are increasingly common. So are other repositories of record, from the data lake itself to streaming repositories to graph database systems to (effectively limitless) cloud storage services.
The modern data warehouse must be able to get data from, and share data with, all of these platforms. "Data movement in the new analytics environment is bidirectional. Think about it. Data lives in a variety of sources. It doesn't just flow from these sources. In some cases, for example, you might want to push new or aggregated data back to those sources. The upshot is that analysts will often initiate data movement from different systems at different times," he points out.
"There is no 'center' in the new environment. Every system in it is a possible source of data and a possible source of queries to other systems for data. Data movement requires a fabric, not a one-way connector or a retrieval mechanism that only works from one location."
Federation by Any Other Name
Russom doesn't call the secret sauce he refers to "federation." Madsen, too, shies away from the term. This is because the core problem they're describing isn't strictly one of federated query -- the raison d'etre of data federation and virtualization. The core problem is, instead, least-cost data movement.
Least-cost data movement is a strategy for pushing data transformations and other aspects of data preparation up or down to source or target systems. This is more involved than data federation or virtualization. A good illustration of an approach that substantively tackles this problem is Teradata's QueryGrid.
On the one hand, QueryGrid does something similar to database links in Oracle -- it provides a means to transparently redirect queries to distributed RDBMSs. On the other hand, QueryGrid is a least-cost data movement technology. It's a scheme for transparently shifting data processing to where data lives -- in DBMSs or data sources. More important, QueryGrid is cooperative. It can push processing out to non-Teradata platforms, such as MongoDB, Hadoop, and Spark, as well as to non-Teradata RDBMSs.
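The least-cost idea itself can be sketched independently of any product. The Python below is a toy cost model and makes no claim about how QueryGrid actually plans work: the Plan class, the least_cost_plan function, and the use of row counts as the only cost are assumptions made for illustration. It lists the candidate ways to satisfy a filtered request and picks the one that moves the fewest rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    description: str
    rows_moved: int


def least_cost_plan(total_rows, matching_rows, source_can_filter):
    """Pick the candidate plan that moves the fewest rows (toy cost model)."""
    # Moving everything and filtering at the target is always possible.
    plans = [Plan("move all rows, filter at target", total_rows)]
    # Pushing the filter down is only an option if the source can run it.
    if source_can_filter:
        plans.append(Plan("filter at source, move matches", matching_rows))
    return min(plans, key=lambda plan: plan.rows_moved)


print(least_cost_plan(1_000_000, 2_500, source_can_filter=True))
print(least_cost_plan(1_000_000, 2_500, source_can_filter=False))
```

A real planner would also have to weigh factors such as processing capacity on each system and network bandwidth, but the decision has the same shape: compare the options and move only as much data as the cheapest one requires.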
Bill Grenwelge, a technical advisor with FedEx Services, says QueryGrid will permit FedEx to simplify and optimize its data architecture. Instead of an emphasis on moving data (as with ETL), QueryGrid permits FedEx to move just enough data. The difference is critical, Grenwelge says.
"It's going to enable our users to do things they never could before. From our perspective, it's an opportunity to maybe leave the data where it sits. For example, to do these reports, I don't need to pull data from that platform over there. If your data sits in a separate platform, [QueryGrid is a means to] just grab the information you need -- no more, no less. It aggregates and generates [data] in a summary table over there so that I can create a report from it," he explains.
This has additional benefits, too, Grenwelge says. "QueryGrid is going to give us the opportunity to leave data where it needs to be or where it already is and then utilize it from there in a more efficient manner. I can cut down on the replication and slim down some of my databases because I don't need to have my data replicated unless it's a disaster recovery scenario," he says.
True, QueryGrid is a Teradata-centric technology. It's also very much a work in progress: Teradata continues to develop and revamp it, and the QueryGrid 2.0 release represents a significant improvement over version 1.0, particularly in its support for bidirectional data movement. Teradata-centric or not, QueryGrid is a more cooperative solution than other approaches.
It's also a lighter-weight alternative to full-fledged data virtualization and, because it can push data processing workloads up or down to source or target systems, a more pragmatic one. As a technology for both federated query and least-cost data movement, QueryGrid anticipates the data fabric or synthetic data architecture that will knit together the enterprises (or data centers) of the future.