You Still Need a Model! Data Modeling for Big Data and NoSQL
NoSQL systems are footloose and schema-free. That's the conventional wisdom, at any rate.
A session on NoSQL data modeling at TDWI's upcoming Chicago conference will put this conventional wisdom to the test. It's part of TDWI's Data Modeling for the Future certificate track, a series of four classes that look at data modeling in context.
In his session, "Data Modeling in the Age of Big Data," veteran TDWI instructor Chris Adamson will separate fact from fiction when it comes to nonrelational data modeling. The biggest fiction of them all might be that it isn't necessary to model nonrelational data.
Is Data Modeling Optional with NoSQL?
"There's a lot of confusion right now in the market ... that leads people to believe you don't need a model with NoSQL technologies," argues Adamson, president of information management consultancy Oakton Software. "Some of it is a function of messaging for vendors, which are touting these new, so-called schema-less products where you can put in data without having to model it first. This leads people to believe you don't need a model."
There is some truth to this. With a relational database, you need to define a schema before you can load data into the database. (Vendors use some tricks, such as late binding, to work around this, but most of the data destined for an RDBMS will be modeled beforehand.)
Not so with a NoSQL system, where data modeling is strictly optional -- at least during the ingest phase. As a result, you really can put data of any type into a NoSQL repository. Using that data once it's there is a more complicated problem, however, as is getting the same data -- exactly the same data -- back out again.
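The contrast can be sketched in a few lines of Python. This is an illustration only, not an example from Adamson's session: SQLite stands in for the relational database, a plain Python list of parsed JSON documents stands in for a NoSQL repository, and the `orders` table, field names, and sample records are all hypothetical.

```python
import json
import sqlite3

# Schema-on-write: the table must be defined before any row can be loaded.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)
conn.execute("INSERT INTO orders (order_id, amount) VALUES (?, ?)", (1, 19.99))


def try_relational_insert(connection):
    """Try to load a field ('coupon') that the schema does not define."""
    try:
        connection.execute(
            "INSERT INTO orders (order_id, amount, coupon) VALUES (?, ?, ?)",
            (2, 5.00, "SPRING"),
        )
        return "accepted"
    except sqlite3.OperationalError as exc:
        return f"rejected: {exc}"


# Schemaless ingest: a list of parsed JSON documents is a stand-in
# (an assumption for this sketch) for a document store.
document_store = []


def ingest(raw_json):
    """Accept any well-formed JSON document, whatever its shape."""
    document_store.append(json.loads(raw_json))


ingest('{"order_id": 1, "amount": 19.99}')
ingest('{"orderId": "2", "total": "5.00", "coupon": "SPRING"}')
ingest('{"order_id": 3, "amount": null, "items": ["a", "b"]}')

relational_result = try_relational_insert(conn)

# Getting the data back out is the hard part: a naive total over "amount"
# skips the second document (different field name) and the third (null).
naive_total = sum(
    doc["amount"]
    for doc in document_store
    if isinstance(doc.get("amount"), (int, float))
)

print(relational_result)
print(len(document_store), "documents ingested; naive total:", naive_total)
```

The relational insert fails because the column was never modeled, while all three documents load without complaint. The naive total, however, counts only one of the three orders, which is the "using that data" problem in miniature.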
The upshot, Adamson argues, is that far from obviating schema, NoSQL systems make modeling more important than ever -- especially when the systems are used as data sources for advanced analytics. "Even though you don't have to model when you bring information into them, the process of making sense of that information and producing something useful from it actually yields a model as a byproduct even if people don't realize it," he points out.
Models Support Fundamental Practices
There's an iron law of data management: if you want to do anything with data, you're eventually going to have to derive, impute, or invent schema. You have to model data.
"A model, a data model, is the basis of a lot of things that we have to do in data management, BI, and analytics. You need a model to do things like change management. You need a model as the centerpiece of a data quality program. You need a model around which you can do data governance," Adamson says. "A model also supports that most fundamental of activities: somebody needing to query the data. If you don't know what's there, how do you get to it?"
Nonrelational Modeling Principles
This isn't to say that the same practices and methods we used to model data in a relational context will transfer to the world of nonrelational data modeling. "Everything else is different. When you model is different. We model at a different time. Because NoSQL systems are schema-on-read, you can dump data into them without a schema -- but by the time you pull stuff out, you're imposing a model," Adamson explains.
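What "imposing a model" at read time can look like is sketched below in Python. It is a minimal illustration under stated assumptions, not a method taken from Adamson's class: the `Order` type, the `read_order` function, the field names, and the sample documents are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Order:
    """The model imposed at read time (hypothetical fields)."""
    order_id: int
    amount: Optional[float]


def read_order(doc: Dict[str, Any]) -> Order:
    """Map one raw document onto the Order model.

    Every decision here (which field names are equivalent, which types
    to coerce to, how to treat missing values) is a modeling decision.
    """
    raw_id = doc.get("order_id", doc.get("orderId"))
    if raw_id is None:
        raise ValueError(f"document has no order identifier: {doc!r}")
    raw_amount = doc.get("amount", doc.get("total"))
    amount = None if raw_amount is None else float(raw_amount)
    return Order(order_id=int(raw_id), amount=amount)


# Raw documents as they might sit in a schemaless store: no two agree
# on field names or types.
raw_documents: List[Dict[str, Any]] = [
    {"order_id": 1, "amount": 19.99},
    {"orderId": "2", "total": "5.00", "coupon": "SPRING"},
    {"order_id": 3, "amount": None, "items": ["a", "b"]},
]

orders = [read_order(doc) for doc in raw_documents]
known_total = sum(o.amount for o in orders if o.amount is not None)

for order in orders:
    print(order)
print("total of known amounts:", known_total)
```

Nothing was declared when the documents were stored, yet the reading code ends up encoding a schema anyway: two fields, their types, and the rules for reconciling variants.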
"Different people may be doing the modeling. Rather than an architect or a requirements analyst, modeling may be done by a programmer, by a business analyst, or in some cases by a business subject matter expert. Also, how you do the modeling is different. The process is inverted. It tends to be the outcome of an exploratory process, rather than a starting point for everything else you do."
There's another critical difference. Traditional approaches to data modeling developed in the context of a highly centralized IT model: a scheme in which IT acted as a gatekeeper, controlling access to data. The rise of nonrelational data -- and the NoSQL systems and cloud services optimized for storing it -- coincides with the widespread decentralization of data access, use, and dissemination.
"BI evolved over time out of an IT function. For the most part, it's always been centralized, usually under IT. Historically, analytics has evolved in the opposite direction -- it started in many organizations inside of business areas, inside of marketing, inside of finance, inside of risk management, where people were usually hand coding analytics," Adamson says.
"Now organizations are trying to figure out ways to centralize [analytics] because they need to scale it beyond these niche functions. The danger here is that we treat it the same way we treat the data warehouse and install a modeler as a gatekeeper. That would be a disaster with analytics because the entire advantage that we get out of these nonrelational technologies is that we can explore data and find value first before we develop a model."
Learning All the Types of Modeling
"One of the key points is that we shouldn't throw away everything we've learned: this knowledge base is incremental. There are still some things we will continue to do with good old-fashioned relational data. Then there are altogether new things we need to do with the nonrelational stuff," Adamson concludes.
[Editor's note: In addition to Adamson's big data modeling class, TDWI's certificate track includes three other data modeling-oriented sessions: "Dimensional Modeling from a Business Perspective," "TDWI Dimensional Modeling Primer," and "Dimensional Modeling Beyond the Basics." Participants who complete all four classes will master both established data modeling methods and new techniques to support advanced analytics use cases.]