Category Archives: data warehouse

Query a Star Schema for Trends over Time

A data mart in the form of star schema provides us with a powerful but easy way to develop SQL for metrics and reports.  The typical query will get data from a fact table, join to one or more dimensions, and perform some grouping and aggregations.  Very quickly the foundation for a new report can be developed.

With a good data model we can develop a fair number of reports with queries like this, but there are more complex queries that we need to write.  SCD type 2 dimensions ensure that fact data is related to dimension data in the state the dimension was in when the fact was created.  However, to report accurate sales results requires us to stray from the typical star schema query.

A requirement comes to you asking you to compare 1st quarter sales with 2nd quarter for each business unit in your organization, but to only include business units that have greater than $100K in sales year to date.

The first thing we do is get a list of business units with YTD sales greater than $100K.

Select  bud.bus_unit_id from Sales_Fact sf
Join Business_Unit_Dim bud on bud. business_unit_dim_id = sf. business_unit_dim_id
Join Date_Dim dd on dd.date_dim_id = sf.date_dim_id
Where dd.Year_YYYY = 2012
Group by bud.bus_unit_id
Having sum(sales_amt) > 100000

The Business_Unit_Dim table has SCD type 2 changes but the query doesn’t care.  It will return the right results even if a particular business unit has been changed 100 times this year.  The key is getting the right list of business unit ids and then using them as a subquery in the outer query.

The outer query looks like this:

Select bud.business_unit_name, dd.quarter_name, sum(sales_amt) from Sales_Fact sf
Join Business_Unit_Dim bud on bud. business_unit_dim_id = sf. business_unit_dim_id
Join Date_Dim dd on dd.date_dim_id = sf.date_dim_id
Where dd.Year = 2012 and dd.Quarter in (1,2) and
Bud.business_unit_id in
(Select  bud.bus_unit_id from Sales_Fact sf
Join Business_Unit_Dim bud on bud. business_unit_dim_id = sf. business_unit_dim_id
Join Date_Dim dd on dd.date_dim_id = sf.date_dim_id
Where dd.Year_YYYY = 2012
Group by bud.bus_unit_id
Having sum(sales_amt) > 100000)
Group by bud.business_unit_name, dd.quarter_name
Order by bud.business_unit_name, dd.quarter_name

This example shows us how to trend data over time using a star schema and that it’s the date dimension that give us the ability to do that.  It doesn’t depend on SCD changes and it shows we often need to query a star schema to get the right subset of data we want first, and then we query the star again to get the ‘report’ data.  It also shows us that a healthy library of queries that we can use as subqueries can greatly improve our query development process.

Advertisements

Good Business Intelligence depends on a good Data Warehouse

Good BI, Good DW

There are certain components that are essential for a successful Business Intelligence implementation and in this post I want to emphasize the importance of the data warehouse.  It’s obviously a foundational component and doing it right requires choosing the right DW model and implementing a quality design.

My Baseball Woes

I play on a baseball team that hasn’t been very successful this season.  Last week we were winning early in the game and had a very good chance, but starting in the 4th inning our pitchers lost control and walked batter after batter, totaling 12 walks in a span of three innings.  We could live with some of our other failings, but one thing we can’t live with and be successful is to walk the opposing team’s batters.  The team’s fortunes are sometimes documented by our stalwart third baseman and well-known database author Steve Jones.

The BI ‘Team’ of Technical Components

A Business Intelligence project is also a team effort and while it’s important to put together the human part of the team properly, the numerous technical components that make up BI must work as a team also, and this is what I want to focus on in this post.  While all of the components are important, we can live with weakness in some areas, just like my baseball team.  There are some failings we can’t succeed with, and one of these is the data warehouse.  I’ve never seen a successful BI implementation without a really good data warehouse.

The success of a Business Intelligence project often depends on starting with a clear vision as to what you need in a data warehouse.  The DW is rightly known as the foundation of BI but it can still be a vague concept to digest.  Do you create one big dimensional model or a lot of small data marts?  Should you implement a data vault model or a corporate information factory?  There’s no one size fits all answer and each implementation should be evaluated individually.

It’s important that the direction you choose be based on the BI requirements, which hopefully you have completed in a comprehensive manner.  Make sure it’s an autonomous data model, meaning that it should be independent and live on its own, and not simply be a mirror of source application databases.

How to Choose the DW Model?

Take a look at these links to get an idea of some of the different types of data warehouse models that are generally accepted as good practice.  Choosing whether to go with a Dimensional Model, Corporate Information FactoryData Vault, OLAP Cubes, or some combination or variation of these methodologies, is a good place to start.  There are various schools of thought on how to use and implement these models and their strengths and weaknesses, so get comfortable with them before moving forward.

After you learn about each models strengths and weaknesses, here’s a list of questions to ask yourself when deciding what direction to take.

What are your primary needs?  This is a short list to start with to determine where to head next:

  • Keep a record of every transaction.
  • Centralized Reporting.
  • Distribute data for analytics to specialized teams.
  • Operational reporting.
  • Create cubes for ad-hoc data analysis.
  • A full Audit Trail of data modifications.
  • Ad-hoc SQL queries by power users.

Sample Models

There are as many solutions as there are implementations.  These are a few different scenarios that I have been a part of over the last few years in my consulting work:

Company A doesn’t have large data and needs quick reporting and analysis.  They were able to build a few data marts using a dimensional model.  Some data staging was necessary but in many cases the DW can be populated directly from its source(s).

Company B is a manufacturer that requires a record of every process, tool and measurement in their factories, while the engineers want to rapidly analyze the data to improve the quality of the product they’re building.  To record the data this company implemented a Data Vault warehouse model and from there populate a Dimensional Model with a number of targeted data marts.  The data marts are used for reports but are also the source for a number of engineering and manufacturing OLAP cubes that allow engineers to perform quick data analysis in Excel.

Company C wants to create a set of consistent reports that can be deployed to anyone within the company that accurately reflect what the company is doing.  To produce the reports correctly, the data is gathered from multiple databases within the company and stored in a staging area.  A dimensional model is then populated from which all reports are produced, using a single controlled data source for all reports, providing the consistency the organization requires.

Informed Choices, Successful BI

Make an informed choice you can grow with and don’t make the data warehouse the sink hole of an otherwise solid Business Intelligence implementation.  Throw more strikes than balls and you’ll have a data warehouse that will let your team develop great BI; don’t be the pitcher who loses by walking the opposing batters.

Core Principles of Data Warehouse Design #5 – 8

Last year I started a sequence of blog posts to discuss what I believe are the core principles you should adhere to when developing a data warehouse.  I meant to post them over a period of several weeks, but I only made it through the first half.  In order to complete what I start, this post will finish the sequence.  I’ll cram the last 4 posts into one, which I probably should have done to begin with.  These were the posts that comprise the first half of this sequence, followed by the remaining four principles:

#5 Slowly Changing Dimensions

A good data warehouse has good history, and this can’t be done in a dimensional model without slowly changing dimensions (SCD).  The key to appropriate SCD development to identify for each dimension table on a column-by-column basis whether it is important for that column’s previous state to be maintained in the DW.  For an Employee dimension, the worker’s name might not be a something I want to maintain historically.  If the name changes, it’s still the same worker, so when I’m reporting sales figures by worker, I want all of the sales reported under the worker’s current name.  However, if the worker changes department, I would want the change to be maintained historically, so when I report sales by department, the numbers are reported correctly.

Here’s an example of a Worker dimension before any changes.  For worker Fred Smith, he changes offices from West to East.  Jane Doe’s has a name change to Jane Jones.  Since office is a SCD type 2 column the historical row is maintained, but it is end dated.  A new row is added and it is made the current row since it has current data.  Since Name is a SCD type 1 column, a new row is not added for Jane Jones, instead the existing row is updated.

Worker Dimension

Worker Dim ID Name Gender Office Start Date End Date Current
545 Fred Smith M West 6/1/08 true
711 Jane  Doe M North 7/15/09 true

After SCD Process

Worker Dim ID Name Gender Office Start Date End Date Current
545 Fred Smith M West 6/1/08 5/21/10 false
910 Fred Smith M East 5/22/10 true
711 Jane  Jones F North 7/15/09 true

#6 New Data Structures

This principle says that new database tables should be created that don’t exist in any source system.  It’s a reminder to not get locked in to how source databases are structured, and to invent new entities that supoort the requirements of the data warehouse.  An example of this is a data warehouse I designed for a foster care agency.  The OLTP database tracked abuse incidents, but the agency needed to report on repeat abuse incidents within a certain time frame and in particular situations.  This caused me to create a fact table that specifically tracked Repeat Abuse.  This way the agency could report on this metric easily, without concern for applying the right business rules or time frames.

#7 Ease of Query

The data warehouse should be easy to query; no long SQL, complex joins, nested subqueries, etc.  The queries should perform well and they should be easy and obvious.  There’s a standard query that should deliver most metrics from the dimensional model and it looks like this:

SELECT col1, col2, SUM(col3) FROM Fact Table
 JOIN Date Dimension
 JOIN Dimension2
 JOIN Dimension3
 WHERE col3 = ? and col4 = ?
 GROUP BY col1, col2
 

#8 Completeness

This is simple.  How often does the data warehouse satisfy a user’s request for a metric, report, or analysis?  How often do they have to go to sources other than the DW to get the answer?  This is a measure of Completeness.  A high percentage demonstrates that the DW contains most or all of the data it requires.  A low number indicates that users still don’t use the DW as much as they should.  Completeness is difficult to achieve in Phase 1 of a development effort.  A DW that can provide 65-70% of the data requests in Phase 1 is a good accomplishment.

Conclusion

If you are embarking on your first Data Warehouse design effort, keep these 8 principles in mind.  Achieve them all and you will produce an excellent DW that will be valuable right away on initial deployment.