Successful Data Lakes: A Growing Trend
- By Dale Kim
- February 16, 2017
A fortunate trend we're seeing in the big data world is the re-emergence of the data lake. Although the data lake itself never went away, it certainly has collected baggage over the years. This widely misunderstood big data implementation is now benefiting from lessons learned and helping organizations gain significant value.
How Not to Use a Data Lake
Pejoratively referred to as data swamps and data landfills, data lakes have received a bad rap over the years because of how they're misused. Like many technologies, they will fail if not deployed with proper planning. The problem starts with the notion that if you put all information in a consolidated repository, you'll automatically get value.
Bringing all your data together appears to be an obvious solution to address data access and data correlation requirements. However, without the right rules and processes, the right buy-in, and even the right technologies, you end up fixing problems rather than achieving positive business impact. After all, the data management effort is the most difficult aspect of big data implementations.
Organizations will gain value from data lakes if they plan ahead for various processes, especially those that deal with interoperability, business continuity, security, and governance. Choosing the right technology platform to support those processes will make the implementation phase run much more smoothly.
The Role of the Data Warehouse
One reason data lakes are gaining acceptance is the growing recognition that data warehouses still provide value and still have a place in the data center. This coexistence of data warehouses and data lakes simply means that organizations should continue to use the right tool for the job. They should not expect a complete replacement when such an effort is often unnecessary and perhaps even detrimental.
The early big data vendors promised lower costs and greater flexibility, which led to misguided attempts to replace existing data warehouses with data lakes. Many implementers were disappointed when they saw that the promises of data warehouse replacement were unfounded. Instead, a multi-temperature data strategy, in which the data lake covers all but the most actively used data, is working well for businesses today.
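As a rough illustration of a multi-temperature strategy, the sketch below routes data sets to a warehouse or a data lake based on how recently they were accessed. The function name, the tier labels, and the 30-day threshold are all assumptions made for this example; a real deployment would pick its own criteria (access frequency, query latency requirements, cost).

```python
from datetime import date

def assign_tier(last_accessed, today, hot_window_days=30):
    """Return a storage tier for a data set based on its last access date.

    Assumption for illustration: anything accessed within hot_window_days
    counts as "most actively used" and stays in the warehouse; everything
    else is covered by the data lake.
    """
    age_days = (today - last_accessed).days
    return "warehouse" if age_days <= hot_window_days else "data_lake"

def plan_placement(catalog, today, hot_window_days=30):
    """Map each data set name in a hypothetical catalog to its tier."""
    return {
        name: assign_tier(last_accessed, today, hot_window_days)
        for name, last_accessed in catalog.items()
    }
```

Calling `plan_placement` with a dictionary of data set names and last-access dates returns a placement plan that could be reviewed before any data is moved.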
Data Governance and Management
Businesses are also seeing that throwing more technology at the problem will not necessarily get them to the end goal. This is particularly true with data governance, where the popular misconception was that as long as you had a set of tools designed for data governance, you'd end up with a successful data lake.
Two common problems arose with this approach. First, organizations ended up implementing processes according to the framework of the tools rather than according to their business operation. This led to a mismatch where organizations spent effort on governance activities that delivered little value and ignored the areas that truly needed attention. Second, many of the tools that worked well with traditional data platforms did not function well in environments that leveraged many different data sources and transformations.
One example of this was relying on graphical user interfaces for handling data management tasks; the heavy requirement for human intervention was inefficient at scale. One can imagine the difficulty of visualizing thousands or millions of files to identify data lineage.
Instead, successful data lake users are thinking in a "big data way" that views data as sets rather than individual fields and puts more emphasis on automation and agility. Several best practices help organizations gain value from a data lake: creating separate work areas for data scientists as part of a data pipeline, masking personally identifiable data to allow greater information sharing, and transforming data in multiple ways for multiple user groups.
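To make the masking practice concrete, here is a minimal sketch that replaces personally identifiable values with keyed hashes (HMAC-SHA-256) before a data set is shared. The field names and the 16-character token length are hypothetical, and keyed hashing is only one of several masking techniques; this is not a complete privacy solution.

```python
import hashlib
import hmac

# Hypothetical PII field names; a real data set would define its own.
PII_FIELDS = {"name", "email", "phone"}

def mask_record(record, secret_key, pii_fields=PII_FIELDS):
    """Return a copy of record with PII values replaced by keyed-hash tokens.

    The same input and key always yield the same token, so masked data sets
    can still be joined on a masked field without exposing the raw value.
    """
    masked = {}
    for field, value in record.items():
        if field in pii_fields and value is not None:
            digest = hmac.new(
                secret_key, str(value).encode("utf-8"), hashlib.sha256
            ).hexdigest()
            masked[field] = digest[:16]  # shortened token, for readability only
        else:
            masked[field] = value
    return masked

def mask_dataset(records, secret_key, pii_fields=PII_FIELDS):
    """Apply mask_record to every record in an iterable of dicts."""
    return [mask_record(r, secret_key, pii_fields) for r in records]
```

Because the function works on whole sets of records rather than on individual fields edited by hand, it can run as an automated pipeline step that produces a shareable copy for each user group.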
Big Data Thinking
A related trend is that organizations are recognizing that big data calls for its own way of thinking. Rather than simply following the best practices and processes developed for traditional systems, newer approaches acknowledge the challenges of volume, velocity, and variety.
As mentioned earlier, automation is emerging as a necessary component of a data lake. It is important to include operational capabilities that can derive insights in real time and immediately apply them to business operations with little to no human intervention. An example of this is tracking real-time cellphone signals so that when user hot spots arise, self-directing antennas can be tuned to address those hot spots while not starving other areas, ensuring optimal coverage for all customers.
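The antenna example can be sketched as a simple allocation rule: capacity follows observed load, but every sector keeps a guaranteed minimum so that no area is starved. The function name, the sector labels, and the `min_share` parameter are assumptions for illustration and do not describe how any real network controller works.

```python
def allocate_capacity(load_by_sector, total_capacity, min_share=0.05):
    """Split total_capacity across sectors in proportion to load, with a floor.

    Each sector first receives total_capacity * min_share; the remainder is
    divided in proportion to each sector's share of the observed load.
    """
    n = len(load_by_sector)
    if n == 0:
        return {}
    floor = total_capacity * min_share
    if floor * n > total_capacity:
        raise ValueError("min_share is too large for the number of sectors")
    remaining = total_capacity - floor * n
    total_load = sum(load_by_sector.values())
    allocation = {}
    for sector, load in load_by_sector.items():
        # With no load anywhere, fall back to an even split.
        share = load / total_load if total_load > 0 else 1 / n
        allocation[sector] = floor + remaining * share
    return allocation
```

Rerunning such a rule on each new batch of signal measurements is one way to apply insights to operations without a person in the loop.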
Another intriguing example of "big data thinking" comes from a start-up whose technology helps rate the trustworthiness of business data. Instead of assessing data after the fact (i.e., through data lineage models), this new approach uses an experiential model for assuring data quality. This lets users focus more on deriving insights from accurate data than on troubleshooting unverified data sets. In this environment, more usage of verified data sets leads to higher levels of trustworthiness and more business-critical analysis.
Data Lake Best Practices
It was only about two years ago that an industry analyst told me there simply were no established best practices for data lakes. Since then, many organizations have learned to take the right approaches to their initial big data implementation.
As long as organizations plan ahead, reduce the challenges of big data, and set their objectives up front, data lakes will provide value for more businesses.
Dale Kim is the senior director of industry solutions at MapR. Although his experience includes work with relational databases, much of his career pertains to nonrelational data in the areas of search, content management, and NoSQL and includes senior roles in technical marketing, sales engineering, and support engineering. Dale holds an MBA from Santa Clara University and a BA in Computer Science from the University of California, Berkeley.