What is cloud data warehousing and how does it work?
A cloud data warehouse performs all of the functions you would expect of a traditional data warehouse — data processing, collation, integration, cleansing, loading, reporting, and so on — but does so within a public cloud environment. Major examples include Microsoft Azure SQL Data Warehouse, Amazon Redshift, Teradata Vantage, Google Cloud's BigQuery, and Snowflake Cloud Data Platform.
Like its on-premises counterpart, an enterprise data warehouse deployed in the cloud is typically a relational database, focusing on structured and semi-structured data. This is the kind you'd see in various customer relationship management (CRM), enterprise resource planning (ERP), and point-of-sale applications, to name just a few. Unstructured data, meanwhile, is typically aggregated using a data lake framework, which can also be cloud-based.
At the most granular level, the majority of data stored in a warehouse is characterized as either facts, measures, or dimensions:
- Facts: Data points connected to specific events or transactions, e.g., "Compensated employee John Smith $4,000 for the month of $48,000 per year gross salary."
- Measures: Precise numbers connected to facts — based on the above example, one of the measures would be "Monthly pay: $4,000."
- Dimensions: These categorize facts and measures with more structured contextual information, such as "Employee name: John Smith" and "Direct deposit dates: Nov. 15 and Nov. 30."
A cloud database is distinguished by its versatility, and as such it can easily be multi-dimensional. In addition to its ability to easily manage many dimensions of both current and historical big data in a single venue, a modern cloud database can operate on serverless architecture, which can help minimize an enterprise's data management responsibilities. Alternatively, cloud databases may use the cluster-and-node approach, in which two or more physical servers are used.
Traditional data warehousing vs. cloud data warehousing
Aside from being located on-premises as opposed to in the cloud, the most basic differences between traditional enterprise data warehouse tools and warehouses managed by a cloud provider are found in the architecture and modeling:
Types of traditional warehouse architectures
A traditional
data warehouse solution is delineated by tiers.
- Bottom: This is where the database server is located, collecting data from its many origin sources.
- Middle: The online analytical processing (OLAP) server is here, transforming the data relationally or multi-dimensionally.
- Top: The highest of the three tiers is where reporting and data analytics tools reside.
The models seen in traditional data warehousing are as follows:
- Virtual: A virtual warehouse is a set of separate databases that can be simultaneously queried, giving the illusion of a single data warehouse.
- Enterprise: This style of traditional data warehouse aggregates data from every unit of the business.
- Data mart: In this model, the warehouse is split into components that each focus on one department within an organization. For example, one data mart might solely collate marketing-related data from the enterprise's sources, whereas another would stick to sales.
Cloud data warehouse architecture
The types of cloud data warehouse services you'll see can generally be categorized as either a cluster-based or serverless architecture:
- Cluster-based: Multiple server nodes combine to host this data warehouse within a public or hybrid cloud deployment, each with its own compute, storage, and RAM resources. The lead nodes handle query intakes and assign them to compute nodes that execute those queries and produce results.
- Serverless: In this architectural style, the database cluster is managed wholly by the cloud service provider and spread across many different clients.
Both cloud data warehouse types detailed above can offer very quick query responses. The main difference is management: Enterprises must oversee cluster-based warehousing to a certain extent, requesting that their provider add or subtract nodes based on data traffic. Serverless users will expect their provider to dynamically allocate resources as necessary to maximize query speed.
Benefits of using cloud data warehousing for analytics
Amassing and collating all of those gigabytes upon gigabytes (and, eventually, terabytes) of data isn't about storage or operations. The insights it can reveal are capable of being the foundation for strategic development that drives growth and the bottom line — and they must be unlocked with analytics tools.
Running data analytics and reporting on a data warehouse hosted via a cloud solution is quite different from completing the same tasks for an on-premises warehouse. In fact, it's arguably one of the most exciting cloud computing trends in the enterprise world right now.
Whether working in a single public cloud, using a multi-cloud solution, or operating a hybrid cloud deployment tied to on-premises data infrastructure, a cloud data warehouse offers greater and more cost-effective scalability and elasticity for analytics workloads, as they expand and contract in conjunction with an enterprise's shifting priorities. Queries will run more quickly than they would in an on-premises warehouse, at a lower overall cost due to the lack of hardware overhead.
With the right data analytics engine for the cloud, you can give your organization the flexibility to craft and implement algorithms as sophisticated as your circumstances require, using the programming languages you're familiar with, such as SQL, Python, SAS, and R. Scalable analytics in this context brings to bear leading-edge machine learning processes, clustering and segmentation, sentiment parsing, text extraction, graphing, and geospatial or time series analysis.
Additionally, running data warehouse analytics in the cloud allows you to integrate with numerous data management services: Amazon EBS, S3, SageMaker, Glue, and Lambda, as well as Azure Blob Storage, Data Factory, ML Studio, and PowerBI, are just a few examples.
How to select and deploy a cloud-based data warehouse
First, you must consider whether a cluster-based or serverless warehouse architecture will be right for your organization's cloud deployment.
Clustered warehouses have more predictable pricing and allow more direct oversight, but the latter advantage comes at the cost of devoting more time and resources to managing elasticity, capacity, and cluster health. By contrast, serverless models are completely overseen by your CSP and elasticity is scaled automatically, but you pay either per query or based on utilization, which can be difficult to predict.
Pricing, in fact, may be the most complicated aspect of choosing a cloud data warehouse, regardless of model. One of the chief advantages of a strong cloud platform is its elasticity, but at times when data workloads are steady, you may come upon cost inefficiencies. Additionally, it's critical to monitor any costs associated with workflows that move data out of the cloud, as well as complicated budgeting and cost controls that can quickly spin out of control.
Last but not least, initial implementation of a cloud-based warehouse may come with slower-than-expected performance and require users to change their practices to accommodate this early hiccup.
The key to making the most of a cloud data warehouse is using it alongside an agile, scalable, and flexibly priced connected multi-cloud data platform like Teradata Vantage. Vantage is compatible with complementary data tools from major cloud providers, and pricing is based solely on use. Also, the platform works seamlessly in any cloud environment or on-premises, and allows for fluid movement of data and applications back and forth from physical data infrastructure to the cloud — and even between cloud providers in a multi-cloud model — as need be.
Learn more about Vantage