What is a data mart?
The data mart is a subject-oriented slice of the data warehouse or database logical model serving a narrow group of users. It's created so that a particular business unit, such as sales, marketing, or customer service, can quickly access crucial data for faster analysis.
Many data marts contain only a subset of data from the full tables in the data warehouse: For example, while a data mart for a specific department may have information from multiple sources—like sales transactions and inventory records—there won't be many sources. Typical data marts range from 5 to 20 tables, as opposed to the 4,000 or more tables common in an enterprise data warehouse.
A small number of tables doesn't necessarily mean a small amount of data. A single data mart table could hold hundreds of terabytes. But that vast sum is just one type of data, like call logs at a wireless telecom company. Additionally, data marts only capture summaries of data as result tables, potentially leaving behind worthwhile details.
Data mart vs. data warehouse vs. data lake
Data marts have become an important part of many enterprises' overall data management strategies. Because of this, it's worth taking the time to explain exactly how they differ not only from a central data warehouse, but also from a data lake—and why those differences are important.
Data mart
Based on the example of a repository containing many terabytes of phone call logs, a good way to characterize the data mart would be "deep, but never wide." Also, they aren't necessarily long term: Some data marts are completely reloaded weekly or monthly, as it's easy to delete and refresh the data to only look at a month's worth of transactions.
Data warehouse
Many data warehouses contain multiple data marts, as they're meant to manage massive amounts of enterprise data. The clearest distinction between data marts and data warehouses comes down to subject areas and integration: Marts are dedicated to one—or a select few—of the former and don't depend on the latter. Data warehousing, meanwhile, relies on integration to function as a central informational hub.
Data lake
If a data mart can be deep but not wide while a data warehouse is always wide in scope but is limited to structured or semi-structured data, a data lake is—at least theoretically—limitless in its dimensions. The data lake is designed to store raw data from many disparate sources in its original formats. This allows it to handle unstructured data and structured data, unlike a data warehouse or relational database. Data lakes have become just as important to cloud computing, storage, and management as the cloud data warehouse.
Data mart structure: Tables and schema
Tables are the primary components of a data mart, and they combine to form schema, which come in two main categories:
Star schema
Imagine rows and columns of data in five spreadsheets. Four of the spreadsheets are connected via key fields that match the largest sheet, called the fact table. If that fact table contains 50 million records, it likely won't fit in any conventional spreadsheet, so it must be spread into multiple data mart tables. When arranged in 5 to 10 tables, the resulting design pattern is called a star schema. The fact table is the center, and the others, known as dimension tables, make up the points of the star.
Snowflake schema
To handle large data sets that require multiple fact tables, it's still possible to do so with a data mart, but the structure must be different: Hence snowflake schema, so named for the shape taken by a diagram of the tables and relationships within this data mart structure. Each fact table in the mart has four or five dimension tables, and their combination forms the snowflake.
Data mart types: Advantages and disadvantages
Today's enterprise can use any of the following data mart types, and sometimes do so concurrently.
Dependent data mart
Dependent data marts rely on information sourced frzom data warehouses to organize the data into particular subsets, as discussed earlier. Storage, engineering, and other key operations are handled by those running the warehouse.
Dependent data marts are especially useful for business units whose team members want to harness the power of analytics without having to be experts in data science or data warehouse architecture. They can focus on specific queries and analysis most relevant to their departments. That said, because these data marts are bound to a warehouse, any problems with the warehouse become problems with the marts.
Independent data mart
An independent data mart doesn't need a data warehouse to exist. In fact, it effectively functions as its own warehouse: The independent data mart collects external and internal data from various sources and aggregates it into a miniature data warehouse.
This type of mart is most useful for short-term projects, small and medium enterprises that want to quickly create a data warehouse, and niche business units that want a highly customized data repository. The main drawback to an independent data mart is that it must put the data through extract, transform, and load (ETL) and cleansing processes, both of which require significant data engineering knowledge.
Hybrid data mart
With a hybrid data mart, analysts and other users source data from a data warehouse, but it's not the only data source it can access: The hybrid mart also collects data from independent databases unconnected to the enterprise data warehouse, cloud applications, and many other points of origin.
Hybrid data marts are ideal for organizations with multiple databases or even multiple warehouses. But like its independent counterpart, it demands near-expert technical aptitude to run most effectively.
Challenges of data marts
Data marts are another tool in the data management toolbox and should be used when they can provide a specific benefit. For example, their structure and relative simplicity make them ideal for ad-hoc reporting in a hurry, or for maintaining reporting operations when resources are tight.
However, data mart limitations make them impractical for every business unit. Redundancies and duplicate data are common when overusing data marts, potentially leading to data drift that impugns data quality and can endanger key reporting processes. Additionally, data marts invariably create data silos. While the occasional data silo isn't disastrous—and may be necessary for business units with extraordinary security needs—many silos create a disjointed organization. Finally, data marts can't handle the complex queries that warehouses can.
Data teams and their leaders should use a balanced approach that leverages data marts, warehouses, and lakes in their appropriate contexts. Teradata VantageCloud, the complete cloud analytics and data platform, provides an ideal database management system (DBMS) for this multifaceted, disciplined data management approach.