Amazon Risks Repeating Itself with New Service

Last month, Amazon announced Spectrum, a new service that lets users of its Redshift massively parallel processing (MPP) database query data stored in S3, Amazon's Simple Storage Service. With Spectrum, Redshift users no longer have to extract data from S3 and transform it -- in Redshift itself or via another engine, such as Amazon Elastic MapReduce (EMR) -- in order to run SQL queries against it.

S3 is increasingly used as a persistence store for data archiving, data lakes, and other data management initiatives. Spectrum is able to query S3 data on-demand without moving it. The upshot, Amazon claims, is that Spectrum opens up S3 to SQL query via common business intelligence (BI) and BI discovery tools, including offerings from IBM, Microsoft, MicroStrategy, Oracle, Qlik, SAS, and Tableau.

Spectrum supports the same SQL syntax as Redshift itself, which means that if your BI tools are already pointing at Redshift, you should be able to query against both Redshift and S3 without modifying your queries. What's not to like?

Is Amazon Devouring Itself?

Amazon didn't say it, but it's a virtual certainty that Spectrum is based on the same technology (Presto) that underpins Athena, the SQL-query-for-S3 service it announced at its annual re:Invent event last November. Back then, some industry watchers speculated that Athena could actually depress demand for Amazon's retinue of cloud RDBMS services: viz., Redshift, Amazon Relational Database Service (RDS), and Amazon Aurora.

Prior to Athena, if analysts wanted to query against data in S3, they first had to spin up a Redshift or Aurora instance and move data from S3 into either database engine. With Athena, they could query directly against data in S3. This raises a question: does Amazon risk cannibalizing the profits of offerings such as Redshift and Aurora by promoting a service such as Athena?

The basic idea behind first Athena and now Spectrum -- i.e., in situ SQL query of an object data store -- hardly requires a conceptual leap. Analytics database specialists such as Pivotal Greenplum and Teradata have supported in situ SQL query against S3 for more than a year now.

Amazon's pre-Athena posture wasn't anomalous, either. In Google's Cloud ecosystem, you can use Google BigQuery to query against data stored in Google Storage -- with a catch: you must first import the requisite data from Google Storage into BigQuery. This is analogous to moving data from S3 into Aurora or Redshift.

To answer the question: Yes, there is some risk of cannibalization, but it's (a) negligible, especially at a time when Amazon Web Services (AWS) is generating $12.22 billion in annual revenues, and (b) more than likely to be offset by revenue from new/increased workload and subscriber growth.

The fact is, services such as Athena and Spectrum are extremely attractive to AWS subscribers.

Redshift, Aurora, and Amazon's other data management services are doing just fine, judging by the overall success of AWS. AWS generated $3.53 billion in Q4 2016 revenues -- an increase of 47 percent from Q4 of 2015 -- even though Amazon cut prices seven times in the quarter. For Q1 of 2017, AWS did even better in absolute terms, posting $3.66 billion in revenue and growing at a 43 percent clip, year-over-year. Amazon doesn't break out the performance of its data management services, but it doesn't appear overly concerned about cannibalization.

Besides, Redshift is a special case. Cannibalization, if a threat, is more likely with RDS or Aurora, both of which are more likely to be used for one-off, ad hoc, and dev-test use cases -- such as staging, preparing, and transforming data for SQL query -- and are priced accordingly.

A Bona Fide Boon for Redshift Customers

Spectrum brings this specialness into sharp relief. Consider that Aurora is compatible with the open source MySQL database and that RDS hosts conventional engines such as MySQL; design-wise, they're conventional, nonparallel, SMP RDBMS platforms -- albeit SMP RDBMSs that have been retrofitted to run in the AWS cloud.

Redshift, by contrast, is an MPP database. It's designed to power common BI use cases (reporting, ad hoc query, ad hoc analysis) and can be used in conjunction with conventional BI tools, BI discovery tools, and "extreme" SQL analytics. In an on-premises context, an MPP system can support dozens of concurrent interactive users. MPP is a query processing platform par excellence; it does one thing, and it does it very well.

Not all of this MPP goodness translates into the cloud, of course. Nevertheless, Redshift is used in different ways, and to different ends, than Amazon's other RDBMSs. Unlike RDS and Aurora, Redshift isn't a good candidate for one-off, ad hoc, or dev-test scenarios; it's arguably too expensive to be used for anything other than SQL query and SQL analytics use cases.

From the perspective of Redshift subscribers, then, Spectrum probably looks like a win-win: it opens up S3 to ad hoc SQL query and makes it easier to support a number of other data management requirements, such as the data lake and the query-able data archive.

It does so at a cost, however: $5 per TB scanned -- the same rate Amazon charges for Athena.

This is costlier than the status quo. Today, Redshift subscribers can query against S3, but they can only do so by creating data transformation jobs to extract the data from S3 and load it into Redshift.

Understandably, then, you're paying a price for the on-demand convenience Spectrum provides. For this reason, most customers will probably use Spectrum in tandem with batch-oriented S3-to-Redshift data integration routines. Five dollars per TB can quickly add up, after all.
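To see how quickly $5 per TB scanned can add up, here is a rough back-of-the-envelope sketch. Only the $5 rate comes from Amazon's stated pricing; the function names and the workload figures (half a terabyte scanned per query, 20 queries a day) are made up for illustration, and real bills may differ because of billing minimums, rounding, or compressed and columnar formats that reduce the amount of data scanned.

```python
# Rough estimate of on-demand scan charges at the published $5-per-TB rate.
# Workload numbers below are hypothetical, chosen only for illustration.

PRICE_PER_TB_SCANNED_USD = 5.00  # rate cited in the article


def query_cost_usd(tb_scanned: float) -> float:
    """Cost of a single query that scans `tb_scanned` terabytes in S3."""
    return tb_scanned * PRICE_PER_TB_SCANNED_USD


def monthly_cost_usd(tb_per_query: float, queries_per_day: int, days: int = 30) -> float:
    """Cost of running the same kind of query repeatedly for `days` days."""
    return query_cost_usd(tb_per_query) * queries_per_day * days


if __name__ == "__main__":
    # Hypothetical workload: 20 ad hoc queries a day, each scanning 0.5 TB.
    print(f"per query: ${query_cost_usd(0.5):,.2f}")
    print(f"per month: ${monthly_cost_usd(0.5, 20):,.2f}")
```

Under those assumed numbers, a query that costs a couple of dollars on its own turns into a four-figure monthly charge, which is why frequently queried data is a better candidate for a batch load into Redshift.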

Spectrum Versus Athena

The availability of Spectrum raises one more question: When, if at all, should Redshift customers use Athena? Some are probably already making use of it. Should they stop? Are there use cases for which Athena is superior to Spectrum?

Amazon's got you covered, devoting a new entry in its Redshift FAQ to just this question. The simplest and most direct explanation is that Athena is a general-purpose SQL-query-for-S3 service; you can't use it to query against Redshift itself. Spectrum, by contrast, is designed specifically for Redshift: it makes it possible to query across both Redshift and S3.

"If you have frequently accessed data, that needs to be stored in a consistent, highly structured format, then you should use a data warehouse like Amazon Redshift," the FAQ says. "This gives you the flexibility to store your structured, frequently accessed data in Amazon Redshift, and use Redshift Spectrum to extend your Amazon Redshift queries out to the entire universe of data in your S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need."
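As a minimal sketch of what querying "across both Redshift and S3" might look like in practice, the snippet below holds a single SQL statement that joins a table stored in Redshift with an external table backed by files in S3. Every schema, table, and column name here is hypothetical, and the statement assumes the external schema has already been registered with Redshift; the small helper merely lists which schemas the query touches.

```python
import re

# Hypothetical names throughout: "public.customers" stands in for a table
# stored in Redshift itself, while "s3_archive.page_views" stands in for an
# external table whose underlying files live in S3.
CROSS_STORE_QUERY = """
SELECT c.region, COUNT(*) AS views
FROM public.customers AS c
JOIN s3_archive.page_views AS v
  ON v.customer_id = c.customer_id
GROUP BY c.region
ORDER BY views DESC;
"""


def referenced_schemas(sql: str) -> set:
    """Return the schema names of schema-qualified tables after FROM or JOIN."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)\.\w+", re.IGNORECASE)
    return set(pattern.findall(sql))


if __name__ == "__main__":
    # One statement, two storage tiers: a local schema and an external one.
    print(sorted(referenced_schemas(CROSS_STORE_QUERY)))
```

From the BI tool's point of view this is ordinary SQL; the only hint that part of the data sits in S3 is the schema the table belongs to.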

Powered by TDWI. Advancing All Things Data
A Division of 1105 Media, Inc.