Once More into the Data Lake
Data lakes disturb me. It's their shallowness -- especially the shallowness of the definition of "data lake" -- that has concerned me most. I am now more deeply troubled than ever.
According to the 9sight/EMA survey results published at the end of November, acceptance and implementation of data lakes have soared. In 2016, fully two-thirds of respondents reported adopting a data lake strategy, up from just over 50 percent in the previous survey some 15 months earlier. There's more, though: 14.9 percent of the 2016 survey respondents said that not only had they adopted a data lake strategy, but it had also replaced their data warehouse environment. This demands explanation.
Defining the Data Lake
I am not against the concept of a data lake. To quote the earliest definition, from Pentaho CTO James Dixon in 2010, "If you think of a data mart as a store of bottled water -- cleansed and packaged and structured for easy consumption -- the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." This is a nice visionary statement, but it doesn't say much about how to design, build, or even dig one.
Detailed and usable architectural definitions of a data lake are rare on the Web. Many amount to little more than lists of Hadoop and related components. Some go so far as to suggest that a data lake can replace the entire data warehouse environment, as seen in the survey results.
A few target the entire traditional data management environment -- operational and informational -- for a major rip-and-replace project. A minority take a more holistic approach, suggesting that the data lake either contains or complements the traditional data warehouse. What, therefore, is a data lake?
Functional Components of a Data Lake
The 9sight/EMA survey probed what respondents considered to be the functional components of their data lakes. Interestingly, a data warehouse -- at 16.2 percent -- was the most popular choice.
One might think that those who replaced their traditional data warehouse with a data lake found they must recreate similar functionality in the new environment. However, cross-tabulating the results of the two questions showed that only one-sixth of those who replaced their data warehouse with a data lake report that their data lake contains a "repository of reconciled business transactional data," such as a data warehouse.
This is perplexing. The results suggest that some 12.5 percent of respondents (five-sixths of those whose data lake has replaced their data warehouse) have discovered that they no longer need the data warehouse functionality that delivers reconciled business transactional data. Further research is needed.
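The arithmetic behind that estimate can be checked with a minimal sketch. It uses only the two figures quoted above (14.9 percent and one-sixth); the variable names are illustrative, and the one-sixth share is treated as exact, which is an assumption because the survey's raw counts are not given here.

```python
# Figures quoted in the article, as percent of all 2016 survey respondents.
replaced_dw_pct = 14.9         # data lake has replaced the data warehouse
with_reconciled_share = 1 / 6  # of those, share whose lake still holds reconciled data
                               # (assumed exact; the raw counts are not published here)

# Split the 14.9 percent into the two cross-tabulated groups.
with_reconciled_pct = replaced_dw_pct * with_reconciled_share
without_reconciled_pct = replaced_dw_pct * (1 - with_reconciled_share)

print(f"Still have reconciled data: {with_reconciled_pct:.1f}% of respondents")
print(f"No reconciled data:         {without_reconciled_pct:.1f}% of respondents")
```

Under these assumptions the second figure comes out at about 12.4 percent of all respondents, in line with the rounded estimate in the text.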
The responses also suggest that a data lake can be whatever you want it to be: a similar number of respondents chose each of the seven possible functional components. The least popular choice was event streaming, cited by 11.5 percent of respondents, while a diverse group of options -- operational applications, departmental or functional data marts, and data discovery environments -- was bunched together with the data warehouse at around 16 percent.
Determining Data Lakes' Business Focus
For enterprises striving to improve and expand their use of data, this confusion in the market presents major challenges. The market's fascination with the bright, shiny baubles of new technology diverts attention both from old technology that works but could be improved and from the often-ignored reality that technology -- shiny or dull -- cannot solve all problems.
From a technological viewpoint, the data lake is closely aligned with the open source software/commodity hardware environments that can handle the quantities and characteristics of externally sourced data from social media, Web commerce, and the Internet of Things. With much of this data being unreliable in various respects and highly transient, business value emerges from broad analysis, speed of use, and low-risk, narrowly scoped action. This is one clear set of business needs.
This is a very different business and technical environment from that supported by data warehousing. The business goal of data warehousing is to address the opportunities and problems enabled by a consistent, historically reconciled record of the enterprise. With cleaner and smaller data and, in many cases, longer timeframes for action, well-developed and stable technologies -- such as relational databases and data integration tools -- provide a better, more reliable foundation.
In short, data lakes and data warehouses focus on very different business needs and technological constraints.
Considering Modern Challenges
In light of these considerations, the 14.9 percent of survey respondents who have replaced their data warehouse with a data lake may be on the wrong track, and the 16.2 percent who consider a data warehouse to be a functional component of their data lake may be disappointed with the result. Unfortunately, we cannot be sure what is actually happening or what to recommend because we don't know what definition of "data lake" they are using.
In the end, the problem is more serious. This confusion around terminology is not only potentially misdirecting technology choices; it is diverting attention from the deeper issues that technology -- warehouse or lake, Hadoop or RDBMS -- cannot solve. We need to step out of the lake water and address the tough challenges of data governance, organizational dysfunction, and social ethics that are growing as we gather ever more data.