Companies and consumers are generating more data than ever. The proliferation of digital devices and products is driving exponential expansion of the digital universe. While notionally an asset, the scale of this data presents a challenge: how can companies practically organize their information to reveal actionable insights?
While data mining and business intelligence provide the valuable extraction and presentation of such insights, the data warehouse (DWH) is the preparatory aggregation and reorganization of the vast underlying data, which often resides in multiple locations. Understanding the role of the DWH within the broader ecosystem of data science, data mining and business intelligence is essential for the modern manager.
What is a data warehouse?
The DWH is a centralized repository of digital information, aggregated from a variety of disparate sources, and organized in a structure optimized for reporting. Most importantly, the DWH provides actionable information to the entire enterprise, enabling employees to perform tailored analysis and to make better decisions.
Essential data warehouse concepts
The relational versus dimensional model
To appreciate the functionality of a data warehouse, it is important to understand the difference between a relational and dimensional model. While technical sounding, they are easy to distinguish.
From a practical use perspective, relational and dimensional databases differ in one critical criteria: information flow. While relational databases are optimized for data input, dimensional databases are built for output, notably in the form of reporting and analytics known as business intelligence.
The relational model organizes information around a single point of information, for example a customer name. In such a model, the customer name exists in one location, with all associated information - such as contact details and transaction dates - listed in associated or related tables.
By contrast, the dimensional database essentially “unpacks” the relational database, allowing users to easily “slice and dice” data in the requisite permutation needed to fulfill their reporting needs. For example, in the relational database entry above, customer contact details would be broken out into discrete fields, such as phone number, street address, city, state and zip code.
The dimensional database essentially “unpacks” the relational database, allowing users to easily “slice and dice” data
The distinction between the relational and dimensional database may seem abstract. However, for those charged with delivering increasingly complex analytics and reporting, appreciating the distinction provides a valuable foundational understanding for working with technical teams who maintain these resources.
The Data Warehouse - “It’s On”
As detailed by Bill Inmon, one of the creators of the data warehouse, a few specific characteristics govern the design of the data warehouse. According to Inmon, the data warehouse is a subject-oriented, nonvolatile, integrated, time variant collection of data in support of management’s decisions.
That’s a mouthful, but once broken down into parts, this definition paints a clear picture of the DWH basic structure. To make these criteria easy to remember, we reorganized Inmon’s criteria according to the anagram “It’s On”
Integrated: Data must have consistent formats. Often drawn from different sources, data fields must have consistent naming conventions.
Time variant: The DWH reveals trends, which depend on change over time. Recording data points over time is fundamental to revealing relationships between data.
Subject-Oriented: The DWH enables subject-focused analysis and reporting. For example, a company may want to evaluate the sales of a product over time, and then drill down into regional or customer-segment specific trends.
Nonvolatile: Once data enters the warehouse, it does not change.
The Data Warehouse differs from the transactional database
The DWH and transactional database systems perform fundamentally different functions, and serve different users. While the DWH is optimized for reporting and analysis, transaction systems - often referred to as OnLine Transaction Processing (OLTP) - are optimized for availability and processing speed.
OLTP users are usually front-end employees, and they typically access several records at a time. DWH users are often analysts and managers, whose reports may simultaneously call up to several million records.
The transaction system and DWH also differ in data granularity and permanence. In the OLTP, data contains current values, which are detailed and highly variable (every few seconds, thousands of transactions change the values of these records). By contrast, the DWH contains restructured data that cannot be changed once loaded.
The consumer loan process succinctly illustrates key differences between these systems. When a customer secures a car loan, for example, the transactional database captures details such as car type, color, year of purchase, purchase price, and buyer personal details. Once converted into the DWH model, transactional information (around the single customer transaction) is disaggregated into component parts. These parts, in turn, are pooled with comparable parts from other transactions.
Querying the DWH, an employee at the lender might access reports comprised of aggregated customer data. For example, trying to optimize advertising spend, a marketing manager might seek cars of a given type or price range with the highest loan approval rate, or the average age and income level of loan applicants over time. Such information might direct re-direct advertising spend to more relevant channels with more targeted messaging.
The Data Warehouse vs. the Data Mart and Data Lake
The DWH may be accompanied by related databases - the data mart and data lake - whose descriptive names suggest distinct functions. A subset of the DWH, the data mart serves a specific group of users, for example a division or specific business unit. While the DWH holds multiple subjects relevant to multiple departments - such as sales, customers, products, inventory, suppliers - a data mart typically holds one subject area for one department, such as sales or finance.
There are two types of data marts - dependent and independent - and each presents unique benefits. The dependent data mart draws from the DWH, and has the advantage of consistency. Because all data is centralized and consistent within the DWH, the resulting data marts are also consistent. While more robust, dependent data marts require a DWH, and are therefore more costly to develop.
Independent data marts, on the other hand, draw data directly from the same source databases, much like a mini DWH. Although faster and less costly to develop, independent data marts carry increased risk, as data definitions can become inconsistent across independently developed data marts. However, if developed with discipline, independent data marts can ultimately be assembled into a DWH.
Data lakes are usually configured on a cluster of inexpensive and scalable commodity hardware. This allows data to be dumped in the lake without having to worry about storage capacity. While the DWH is typically limited to text and numerical data, the lake can also hold a wider variety, including social media, sensor data, and images.
Data Warehouse and Data Mining
The DWH enables data mining, which equips companies with the power to predict the future. The primary objective of data mining is to reveal patterns in large datasets. Such patterns, in turn, reveal relationships between different categories of data and their underlying business functions.
Such relationships provide managers actionable information, essentially new levers to pull to drive desired business outcomes such as customer growth, or increased sales per customer. For example, reviewing historical sales data by geographic or industry segment may highlight anomalous growth, the source of which may provide sales managers with learnings to apply to other segments.