Use a Hadoop-based Data Lake to Empower New Best Practices for Business Analytics

Managing diverse big data is important, and yet it's just a means to an end. The end goal for savvy managers is to gain business value and drive organizational effectiveness from the data, not just capture it in a cost center. The primary path to that value is through analytics. In that context, what follows are some of the new analytics best practices empowered by a Hadoop-based data lake.

Advanced Analytics that Complement Older Analytics

In a nutshell, organizations need to preserve existing analytics based on reporting, OLAP, and SQL but also complement these with advanced analytics based on technologies for mining, clustering, graphing, statistics, and natural language processing.

Traditional forms of analytics and reporting are mostly about tracking the facts and business entities that you know well and need to monitor over time. New, advanced forms of analytics are mostly about discovering facts that you did not know before as well as linking together highly diverse facts, events, and entity characteristics (such as customer behavior, partner reliability, and operational metrics) to form new insights and develop new business opportunities.

Note that traditional analytics tends to require squeaky-clean data on a relational platform to produce highly precise and structured output, as seen in standard reports and cubes. By contrast, the new advanced analytics focuses on raw, detailed data because that data fuels discoveries and complex linkages without the need for obsessive precision or structure.

Given the differences in data requirements, a growing number of data warehouse and data management teams work with both relational databases and Hadoop-based data lakes.

Multiple Forms of Analytics in Tandem

One of the strongest trends in analytics is toward using multiple forms of analytics because each method tells you something different about the same issue. Connect multiple analytics results together and you get more comprehensive insight for business advantage.

When a Hadoop-based data lake captures and manages data in its raw, original state, data can easily be repurposed for multiple forms of analytics. Depending on the design and data content of an individual data lake, it may support both set-based analytics (based on OLAP, SQL, and other relational techniques) and algorithmic analytics (based on mining, clustering, graphing, statistics, and NLP).
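The idea of repurposing the same raw records for both set-based and algorithmic analytics can be sketched in a few lines of Python. This is a minimal, illustrative example using only the standard library; the sample records are hypothetical, with sqlite3 standing in for a relational engine and the statistics module standing in for an algorithmic technique.

```python
import sqlite3
import statistics

# Raw, unmodeled records as they might land in a data lake (hypothetical sample data).
raw_events = [
    {"customer": "A", "region": "east", "amount": 120.0},
    {"customer": "B", "region": "west", "amount": 75.5},
    {"customer": "A", "region": "east", "amount": 60.0},
    {"customer": "C", "region": "west", "amount": 210.0},
]

# Set-based analytics: load the raw data into a relational engine and aggregate with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (customer TEXT, region TEXT, amount REAL)")
db.executemany("INSERT INTO events VALUES (:customer, :region, :amount)", raw_events)
by_region = dict(db.execute("SELECT region, SUM(amount) FROM events GROUP BY region"))
print(by_region)

# Algorithmic analytics: repurpose the identical raw records for a statistical view.
amounts = [e["amount"] for e in raw_events]
print(statistics.mean(amounts), statistics.stdev(amounts))
```

The point is that neither path required remodeling the data first: because the lake keeps records in their original state, each analytic method reads the same source its own way.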

Integrated Sequence of Self-Service Best Practices

One of the most desirable emerging analytics practices today is to connect, in sequence, several related self-service, data-driven tasks. The sequence typically follows this order: data access, exploration, prep, visualization, and analysis.

For example, as users access and explore data, they may discover something meaningful, such as the root cause of the most recent churn or a cost center that's eroding bottom-line profits. After the discovery, they want to quickly prepare a data set based on what they learned, then share the prepped data set with colleagues or seamlessly move the data set to other tools for further analytics and visualization.

The assumption is that several tool types are employed (one per step in this multistep process) and the tools are tightly integrated for seamless handoff. This multistep analytics process seems to work well with Hadoop-based data lakes but only when users are given an integrated toolset that supports self-service. Self-service isn't for everyone; it succeeds when provided to certain classes of users who are governed carefully.

Value from Human Language, Text, and Other Unstructured Data

Theoretically, you can put any data or other digital information in a file, and Hadoop can manage it and make it available for analytics processing. Within the category of unstructured data, file-based human language text is already being leveraged via analytics.

The "killer app" is sentiment analysis, which scans mountains of comments from customers, prospects, and other people (perhaps drawn from social media or text fields in call center apps) to determine what the marketplace is saying about your firm, its products, and its services.
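At its simplest, sentiment analysis scores each comment by the tone of the words it contains. The sketch below is a toy lexicon-based scorer in Python; real sentiment analysis relies on trained NLP models and far richer lexicons, and the word lists and comments here are purely illustrative.

```python
# Tiny illustrative lexicons (a real system would use thousands of scored terms).
POSITIVE = {"great", "love", "excellent", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "terrible", "rude", "expensive"}

def sentiment_score(comment: str) -> int:
    """Return positive-word count minus negative-word count for one comment."""
    words = comment.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical comments, as might be drawn from social media or call center text fields.
comments = [
    "Love the new app, support was fast and helpful",
    "Checkout is broken and shipping is slow",
    "Prices are fair",
]
scores = [sentiment_score(c) for c in comments]
print(scores)  # [3, -2, 0]
```

Aggregated over mountains of such comments, even crude scores like these reveal the overall direction of what the marketplace is saying.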

As another example, the claims process in insurance captures a ton of text about losses; insurance companies collect this text in data lakes, process it to extract facts about entities of interest, and use the output data to extend analytics applications in fraud detection and actuarial calculations. Similar text-driven analytics are seen in patient outcome analyses in healthcare (by both insurer and provider).
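Extracting facts about entities of interest from claims text can be as simple as pattern matching, as in this illustrative Python sketch. The claim narrative, field names, and patterns are hypothetical, not any insurer's actual extraction schema; production systems would use NLP entity extraction rather than hand-written regular expressions.

```python
import re

# A hypothetical claim narrative, as captured during the claims process.
claim_text = (
    "Claim CL-20417: policyholder reported water damage on 2024-03-18 "
    "with an estimated loss of $12,500 at the insured property."
)

# Illustrative patterns for the facts an analytics application might need.
patterns = {
    "claim_id": r"CL-\d+",
    "date": r"\d{4}-\d{2}-\d{2}",
    "loss_amount": r"\$[\d,]+",
    "peril": r"(water|fire|wind|theft) damage",
}

facts = {}
for field, pattern in patterns.items():
    match = re.search(pattern, claim_text)
    if match:
        facts[field] = match.group(0)

print(facts)
```

The extracted fields become structured output data that can feed fraud detection models or actuarial calculations downstream.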

Further Reading

For a deeper dive into these issues, read the 2016 TDWI Checklist Report: Emerging Best Practices for Data Lakes, online at https://tdwi.org/research/2016/12/checklist-emerging-best-practices-for-data-lakes.aspx.

About the Author

Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 500 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at prussom@tdwi.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.

