Data Warehousing

Data Warehousing (DW) is a method for storing large amounts of data and making it accessible via convenient channels for analysis. It differs from Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) by not being 'Online'. In a data warehouse, data in its complete form (i.e., with all its properties/attributes and relationships) is stored for the long term and its history is preserved, so that analysts and scientists can observe all kinds of changes (major or minor) over time. This nature of DWs makes them the ideal backend for statistical and machine learning models.

We will not be discussing OLTP and OLAP in this post, but in layman's terms: OLTP systems support CRUD (Create, Read, Update, Delete) operations on real-time data in databases that are online and interact directly with front-end systems that read and write data. OLAP systems, on the other hand, support online analytical processing of data stored in data warehouses: the data in a DW is modelled in OLAP systems for online analysis and reporting. OLAP systems create the end products of the business intelligence pipeline, thereby directly influencing decision support.

Because of the nature of the data they process, OLTP systems sit upstream and OLAP systems downstream of the DW in the business intelligence pipeline. One advantage of data warehouses is that they support both traditional business intelligence pipelines, by providing data to OLAP systems, and modern data science and advanced analytics pipelines powered by artificial intelligence.

Approaches

The two prominent approaches to Data Warehousing are:

  1. The Normalisation approach devised by Bill Inmon. In this approach, the data is stored in Third Normal Form (3NF).
  2. The Dimensional Modelling approach devised by Ralph Kimball. In this approach, the data is stored in a Star Schema or its normalised derivative, the Snowflake Schema.

There are other approaches that focus on specific aspects of data warehousing (e.g. the Data Vault is a good approach for preserving history), but the two aforementioned approaches are the most commonly used and cover a wide range of use cases.

I work mainly with the Dimensional Model and will describe it in more detail, as it lends itself easily to the needs of both OLAP systems and advanced analytics platforms. If you're interested in understanding and trying the Normalisation approach, you can check this post or other resources on the web.

Dimensional Modelling

Star Schema

At the heart of the Dimensional Modelling approach is the Star Schema, which consists of two types of tables/relations: facts and dimensions. Facts are usually trades, transactions or events that happen in real time and reference the dimensions. Dimensions are typically things like securities, customers, products etc. When you draw the fact table in the middle and link the dimensions to it, like a graph with nodes and edges branching out from a central node, the diagram looks like a star.

Example (source: Wikipedia)

Consider a database of sales, perhaps from a store chain, classified by date, store and product.

Fact_Sales is the fact table and there are three dimension tables Dim_Date, Dim_Store and Dim_Product.

Each dimension table has a primary key on its Id column, relating to one of the columns of the Fact_Sales table's three-column (compound) primary key (Date_Id, Store_Id, Product_Id). The non-primary-key Units_Sold column of the fact table represents a measure or metric that can be used in calculations and analysis. The non-primary-key columns of the dimension tables represent additional attributes of the dimensions (such as the Year of the Dim_Date dimension).

For example, the following query answers how many TV sets have been sold, for each brand and country, in 1997:

SELECT
P.Brand,
S.Country,
SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY
P.Brand,
S.Country
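To make the example concrete, here is a minimal sketch in Python using SQLite. The table and column names follow the example above, but the stores, brands and sales figures are made up for illustration:

```python
import sqlite3

# Build a tiny in-memory star schema with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Date TEXT, Year INTEGER);
CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Store_Name TEXT, Country TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT, Product_Category TEXT);
CREATE TABLE Fact_Sales (
    Date_Id INTEGER, Store_Id INTEGER, Product_Id INTEGER, Units_Sold INTEGER,
    PRIMARY KEY (Date_Id, Store_Id, Product_Id)  -- compound key of dimension FKs
);
INSERT INTO Dim_Date    VALUES (1, '1997-03-01', 1997), (2, '1998-03-01', 1998);
INSERT INTO Dim_Store   VALUES (1, 'Store A', 'Germany'), (2, 'Store B', 'France');
INSERT INTO Dim_Product VALUES (1, 'BrandX', 'tv'), (2, 'BrandY', 'radio');
INSERT INTO Fact_Sales  VALUES (1, 1, 1, 10), (1, 2, 1, 5), (2, 1, 1, 7);
""")

# The star schema query from the text: 1997 TV sales per brand and country.
rows = conn.execute("""
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country
""").fetchall()
print(sorted(rows))  # [('BrandX', 'France', 5), ('BrandX', 'Germany', 10)]
```

Note how the fact table carries only foreign keys and the measure, while every descriptive attribute lives on a dimension one join away.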

Snowflake Schema

Oft-cited complaints against the Star Schema include loss of data integrity and the potential for duplicates, because the data is not fully normalised. For intuition, you can think of normalisation in terms of the granularity of data: when data is not broken down to its lowest grain, it is difficult to update or retrieve exactly the right grain. This led to the creation of the Snowflake Schema, the normalised version of the Star Schema. In this schema, the dimensions are broken down further and separate tables are created for these "sub-dimensions". In the graph, when the dimensions themselves grow branches, the star starts looking a bit like a snowflake.

In summary, facts are transactions that are performed on dimensions, and the Snowflake Schema introduces additional dimensions that define the fact more deeply and/or are attributes and properties of the existing dimensions.

Example (source: Wikipedia)

The following example query is the snowflake schema equivalent of the star schema example code which returns the total number of units sold by brand and by country for 1997. The benefit of using the snowflake schema in this example is that the storage requirements are lower since the snowflake schema eliminates many duplicate values from the dimensions themselves.

SELECT
B.Brand,
G.Country,
SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id
INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
WHERE
D.Year = 1997 AND
C.Product_Category = 'tv'
GROUP BY
B.Brand,
G.Country
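The storage saving is easiest to see with a hypothetical Dim_Store: in the star layout, every store row repeats its full geography attributes, whereas the snowflake layout stores each distinct geography exactly once and keeps only a foreign key on the store (all rows below are made up for illustration):

```python
# Star layout: geography attributes are repeated on every store row.
star_dim_store = [
    {"Id": 1, "Store_Name": "Store A", "Country": "Germany", "Region": "Europe"},
    {"Id": 2, "Store_Name": "Store B", "Country": "Germany", "Region": "Europe"},
    {"Id": 3, "Store_Name": "Store C", "Country": "France",  "Region": "Europe"},
]

# Snowflake layout: the repeated values move into a Dim_Geography
# sub-dimension, and the store rows keep only a foreign key to it.
dim_geography = [
    {"Id": 1, "Country": "Germany", "Region": "Europe"},
    {"Id": 2, "Country": "France",  "Region": "Europe"},
]
snow_dim_store = [
    {"Id": 1, "Store_Name": "Store A", "Geography_Id": 1},
    {"Id": 2, "Store_Name": "Store B", "Geography_Id": 1},
    {"Id": 3, "Store_Name": "Store C", "Geography_Id": 2},
]

# Count the geography strings each layout has to store
# (Country + Region per row).
star_strings = 2 * len(star_dim_store)  # one copy per store row
snow_strings = 2 * len(dim_geography)   # one copy per distinct geography
print(star_strings, snow_strings)       # 6 4
```

With only three stores the saving is small, but in a real warehouse with thousands of stores and a handful of countries, the duplicated attribute values (and the risk of updating only some copies) grow with the dimension, which is exactly the integrity complaint the Snowflake Schema addresses.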

Slowly Changing Dimensions

One of the main purposes of the Data Warehouse is to preserve the history of changes that happen to dimensional values. Typically, these values don't change often and are hence said to be "slowly changing", but the DW's Extract-Transform-Load (ETL) process needs strategies for recording the history of its dimensions. These are known as Slowly Changing Dimensions (SCD) strategies (or types) and are defined for each dimension based on the business's reporting and analytics requirements.

The internet abounds with definitions and examples of SCD. Here's one such page with a succinct explanation.

As mentioned on the above page, a high-level description of the SCD Types is:

  • Type 0 - The passive method
  • Type 1 - Overwriting the old value
  • Type 2 - Creating a new additional record
  • Type 3 - Adding a new column
  • Type 4 - Using a historical table
  • Type 6 - Combining the approaches of Types 1, 2 and 3 (1+2+3=6)

Do read the page fully for a better understanding of the various types.
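As an illustration of Type 2 (creating a new additional record), here is a hypothetical sketch: the customer dimension, its City attribute, and the scd2_apply helper are all made up for this example. A change closes the current row with an end date and inserts a new versioned row, so both the old and new values survive in the dimension:

```python
from datetime import date

def scd2_apply(dim_rows, key, new_attrs, today):
    """Apply an SCD Type 2 change to an in-memory customer dimension.

    dim_rows: list of dicts with Customer_Id, City, Valid_From,
    Valid_To and Is_Current (None Valid_To means 'open-ended').
    """
    for row in dim_rows:
        if row["Customer_Id"] == key and row["Is_Current"]:
            if row["City"] == new_attrs["City"]:
                return dim_rows          # no change, nothing to record
            row["Valid_To"] = today      # close the old version...
            row["Is_Current"] = False
    dim_rows.append({                    # ...and open a new one
        "Customer_Id": key,
        "City": new_attrs["City"],
        "Valid_From": today,
        "Valid_To": None,
        "Is_Current": True,
    })
    return dim_rows

# Customer 42 moves from Berlin to Munich; both rows are preserved.
dim = [{"Customer_Id": 42, "City": "Berlin",
        "Valid_From": date(2019, 1, 1), "Valid_To": None, "Is_Current": True}]
scd2_apply(dim, 42, {"City": "Munich"}, date(2020, 6, 1))
print([(r["City"], r["Is_Current"]) for r in dim])
# [('Berlin', False), ('Munich', True)]
```

Queries that need "the customer as they are today" filter on Is_Current, while historical reports join facts to the dimension row whose Valid_From/Valid_To range contains the fact's date.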

Conclusion

As you can see, Dimensional Modelling is an optimal way to design a relational Data Warehouse, and it has been adopted widely across industries. A valuable resource for an in-depth understanding of the idea and its nuances is Ralph Kimball's own book: The Data Warehouse Toolkit.
