Data lakehouse defined
A data lakehouse combines the structured data management and processing capabilities of a data warehouse with the inexpensive storage capacity of a data lake.
In the structure of a data lakehouse, storage and compute resources are decoupled, allowing for greater scalability, and data is typically kept in standardized, open storage formats. The data lakehouse also supports strong schema enforcement and governance, allows concurrent reading and writing of data, supports end-to-end streaming, and accommodates multiple data types: structured, semi-structured, and unstructured.
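To make these properties concrete, below is a minimal sketch of schema enforcement and transactional writes using the open source `deltalake` Python package (Delta Lake is one of several standardized lakehouse table formats; Apache Iceberg and Apache Hudi are comparable alternatives). The local path and columns are illustrative; a real deployment would target cloud object storage.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

TABLE = "/tmp/lakehouse/sales"  # illustrative; real lakehouses use object storage URIs

# Each write is an atomic transaction: readers never see a partial commit.
write_deltalake(TABLE, pd.DataFrame({"id": [1, 2], "amount": [9.99, 14.50]}), mode="overwrite")
write_deltalake(TABLE, pd.DataFrame({"id": [3], "amount": [5.25]}), mode="append")

# Schema enforcement: an append whose types don't match the table is rejected.
bad_rows = pd.DataFrame({"id": ["not-an-int"], "amount": ["oops"]})
try:
    write_deltalake(TABLE, bad_rows, mode="append")
except Exception as err:  # delta-rs raises a schema mismatch error here
    print(f"rejected by schema enforcement: {err}")

print(DeltaTable(TABLE).to_pandas())  # only the three well-typed rows landed
```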
Data lake vs. data warehouse vs. data lakehouse
Before we go any further, it's important to clarify the key differences between the two terms from which data lakehouse is derived:
- Data lake: A collection of raw data that can be structured, semi-structured, or unstructured, stored in a flat architecture. Low-cost, long-term storage of data for eventual use in analytics applications is arguably the data lake's key benefit, along with its flexibility. However, governance and security are often lacking, as are the transactional guarantees and query performance of a warehouse.
- Data warehouse: An architecture that stores structured data in hierarchical tables and dimensions. Traditional data warehouses are widely regarded as ideal for complex queries and offer considerable security and governance. That said, when IT teams are the main drivers of a warehouse implementation, there can be issues with agility.
The two concepts can and should be used simultaneously, as they both serve valuable business functions. Data warehouses allow your business users to see the data the way they need to, while data lakes are excellent for the staging and processing layers.
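As a rough illustration of that division of labor, the pandas sketch below treats the lake as a schema-on-read staging layer for heterogeneous raw records, then imposes the typed, tabular structure business users expect from a warehouse. The records and column names are invented for illustration.

```python
import pandas as pd

# Staging layer (data lake): raw records land as-is, with no upfront schema,
# and individual records may differ in shape.
raw_events = [
    {"user": "a1", "event": "click", "ts": "2023-01-05T10:00:00"},
    {"user": "a1", "event": "purchase", "ts": "2023-01-05T10:02:31", "amount": 19.99},
]

# Consumption layer (data warehouse): impose types and structure so business
# users get a predictable, queryable table.
df = pd.DataFrame(raw_events)
df["ts"] = pd.to_datetime(df["ts"])
purchases = df.loc[df["event"] == "purchase", ["user", "ts", "amount"]]
print(purchases.dtypes)  # typed columns that reporting tools can rely on
```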
In theory, the data lakehouse gives enterprises all of the advantages of the data warehouse and data lake and very few of the drawbacks. The pros of the data warehouse architecture offset any limitations of the data lake, and vice versa.
Why and how might you implement a data lakehouse architecture?
Because the data lakehouse is still, in many ways, more a theoretical concept than an established practice, experts differ in their assessments of how one should be put together.
Data lakehouse models
In an early concept of the architecture, dating from October 2017, the data warehouse sits within the data lake. Data ingestion processes ranging from extract, transform, and load (ETL) jobs to stream processing funnel data from many sources into the data lake. The data then passes through raw and refined data zones, as well as analytics sandboxes, before being integrated and processed in the data warehouse. Finally, it flows through data access and preparation tools that provide security and governance before reaching the applications that need it.
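A highly simplified sketch of that funnel might look like the following, with hypothetical helper functions standing in for real ingestion and processing services; the local paths, field names, and the choice of Parquet for the warehouse layer are all assumptions made for illustration.

```python
import json
from pathlib import Path

import pandas as pd

LAKE = Path("/tmp/lake")  # illustrative; production lakes live on object storage

def ingest_to_raw(records: list) -> Path:
    """Land source records untouched in the raw zone (ETL extract or stream sink)."""
    raw_file = LAKE / "raw" / "orders.jsonl"
    raw_file.parent.mkdir(parents=True, exist_ok=True)
    raw_file.write_text("\n".join(json.dumps(r) for r in records))
    return raw_file

def refine(raw_file: Path) -> pd.DataFrame:
    """Refined zone: parse, apply types, and deduplicate replayed records."""
    df = pd.read_json(raw_file, lines=True)
    df["ts"] = pd.to_datetime(df["ts"])
    return df.drop_duplicates(subset="order_id")

def integrate_into_warehouse(df: pd.DataFrame) -> None:
    """Warehouse layer: persist the refined data as a columnar table."""
    out = LAKE / "warehouse" / "orders.parquet"
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out, index=False)

integrate_into_warehouse(refine(ingest_to_raw([
    {"order_id": 1, "ts": "2023-01-05T10:00:00", "total": 42.0},
    {"order_id": 1, "ts": "2023-01-05T10:00:00", "total": 42.0},  # duplicate from a replay
])))
```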
Microsoft data and intelligence strategist Pradeep Menon proposed a similar but much more multifaceted model in 2021. Structured, semi-structured, and unstructured data enters the data lake through ingestion services, then moves between the lake's zones and a data processing service as it is cleansed, joined, and formatted. From there it can pass through various analytics environments and processes, such as sandboxes and artificial intelligence (AI) or machine learning (ML) frameworks, and eventually emerge from the data lake for downstream consumption.
The data warehouse portion of Menon's architecture sits outside the data lake. It can meet any structured processing needs data teams have, satisfying more demanding service-level agreements (SLAs) and enabling self-service for end users. Cataloging and security tools run parallel to the data lake to ensure proper data governance and protection.
Advantages of a data lakehouse
For proponents of the data lakehouse concept, one of its major advantages is that it serves as the repository for all data, including anything that must be warehoused. Because of this, the data lakehouse architecture could mitigate the administrative and governance challenges a data lake would face on its own. Decoupling storage and compute, another common thread among proposed data lakehouses, also permits far greater flexibility and scalability.
Enterprises' business needs are becoming more complex and more data-dependent than ever, particularly with the continued emergence of advanced ML and disciplines like decision intelligence (DI). The data lakehouse's combined capabilities could bring immense value to organizations' most intricate data science and ML initiatives. Features such as complex schema enforcement and support for atomic, consistent, isolated, and durable (ACID) transactions, an area in which data warehouses excel, may help achieve such objectives.
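As a hedged sketch of why ACID commits matter to data science teams, the example below again uses the `deltalake` package to read a pinned table version, which keeps an ML training run reproducible even while new data is being appended. The path, contents, and version numbers are illustrative.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

PATH = "/tmp/lakehouse/features"  # illustrative path

# Two atomic commits; a failure mid-write would leave no partial data behind.
write_deltalake(PATH, pd.DataFrame({"user": ["a1"], "score": [0.42]}), mode="overwrite")
write_deltalake(PATH, pd.DataFrame({"user": ["b7"], "score": [0.91]}), mode="append")

# An ML job can pin the snapshot it trained on, while dashboards read the latest.
training_set = DeltaTable(PATH, version=0).to_pandas()  # first commit only
latest = DeltaTable(PATH).to_pandas()                   # both commits
print(len(training_set), len(latest))                   # 1 2
```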
Challenges of a data lakehouse approach
Adopting the data lakehouse methodology may not be simple. For one, thinking of the data lakehouse as an all-in-one perfect structure or "one platform for everything" is a mistake due to the complexity of its component parts. Also, widespread implementation of data lakehouses will require enterprises to migrate their existing data warehouses to their data lakes, based only on a promise of success with no proven business value. This could end up being costly, time-consuming, and potentially risky if there are any latency issues or outages during the migration process.
Some vendors offer solutions that are explicitly or implicitly positioned as data lakehouses and require business users to adopt very specific tools. These tools may not be compatible with others connected to the data lake at the architecture's core, causing further difficulties. Delivering 24/7 analytics for business-critical workloads is also a challenge, one that requires an infrastructure designed for cost-effective scalability.
Maximizing data storage and analytics within the cloud
There is already an established trend of using cloud-based object storage to create data lakes. As such, it's not surprising that some cloud service providers (CSPs) are adding data warehousing functions to such architectures, though solutions of this kind won't be ideal for every organization. Nevertheless, various cloud vendors have begun selling solutions that are either explicitly called data lakehouses or function much like proposed data lakehouse models.
Meanwhile, data lakes, the foundation of data lakehouse architecture, remain a huge part of the overall data management market. According to analysis from Mordor Intelligence, data lakes had a $3.74 billion market value in 2020 and are projected to reach $17.6 billion by 2026, a 29.9% compound annual growth rate between 2021 and 2026. Enterprise customers clearly believe in data lakes' importance. As data teams and IT leaders become more aware of the benefits of adding structured processing capabilities to their data lakes, data lakehouses could easily become more popular.
Operating Teradata Vantage, the connected multi-cloud data platform for enterprise analytics, atop a cloud data lake can provide the essential warehouse functionality you need while preserving the data lake's large-scale, low-cost storage of data from many sources. Dashboards, reporting, streaming analytics, and advanced data science operations can bring structure and clarity to your organization's critical data without any of a traditional data warehouse's limitations.
If you use an on-premises data lake in conjunction with Hadoop, Vantage can help you leave behind the complexities and resourcing burdens imposed by the open-source platform. Migrating your data lake to a multi-cloud environment—or a hybrid multi-cloud architecture that preserves some on-premises data center resources—with Vantage by your side can allow you to leverage your data in ways you never thought possible.
Learn more about Data Lakehouses