My first Strata Conference came to an end yesterday evening but I’ve already decided it won’t be my last. I found the conference a good place to learn about a large variety of technologies and methodologies, and it was also an excellent networking opportunity. As a whole it was a great collection of people on the leading edge of data technology.
The conference kept me busy from early morning until the evening and I’m happy to say that I didn’t skip one session. I enjoyed hearing Jeremy Howard from Kaggle talk about predictive modeling two times as well as chatting with him afterwards in the bar. I also heard details of the next release of Hadoop from Arun Murthy of Hortonworks, saw a good session on Automated Understanding by Tim Estes of Digital Reasoning, and attended an interesting session called Exploring Social Data by Chris Moody of Gnip, a local company in Boulder, CO.
The best day for me was the first day since this was the tutorial/deep dive day. This is the day we saw more code and demos. I wanted to see more live demos the next two day but most of the speakers didn’t show any, and that’s one thing I’d like to see Strata have more of. The regular sessions were 40 minutes long which make doing a demo difficult, but it’s often helpful to see things in action to go along with the speaker’s words and slide deck.
I also had the opportunity to have good conversations over lunch or coffee with people from Microsoft, Red Gate, the Census Bureau, Kaggle, Shell, and many more. What I found out was that there are so many organizations out there setting the bar higher and doing great things in the realm of data, and it’s going to change our world.
Good BI, Good DW
There are certain components that are essential for a successful Business Intelligence implementation and in this post I want to emphasize the importance of the data warehouse. It’s obviously a foundational component and doing it right requires choosing the right DW model and implementing a quality design.
My Baseball Woes
I play on a baseball team that hasn’t been very successful this season. Last week we were winning early in the game and had a very good chance, but starting in the 4th inning our pitchers lost control and walked batter after batter, totaling 12 walks in a span of three innings. We could live with some of our other failings, but one thing we can’t live with and be successful is to walk the opposing team’s batters. The team’s fortunes are sometimes documented by our stalwart third baseman and well-known database author Steve Jones.
The BI ‘Team’ of Technical Components
A Business Intelligence project is also a team effort and while it’s important to put together the human part of the team properly, the numerous technical components that make up BI must work as a team also, and this is what I want to focus on in this post. While all of the components are important, we can live with weakness in some areas, just like my baseball team. There are some failings we can’t succeed with, and one of these is the data warehouse. I’ve never seen a successful BI implementation without a really good data warehouse.
The success of a Business Intelligence project often depends on starting with a clear vision as to what you need in a data warehouse. The DW is rightly known as the foundation of BI but it can still be a vague concept to digest. Do you create one big dimensional model or a lot of small data marts? Should you implement a data vault model or a corporate information factory? There’s no one size fits all answer and each implementation should be evaluated individually.
It’s important that the direction you choose be based on the BI requirements, which hopefully you have completed in a comprehensive manner. Make sure it’s an autonomous data model, meaning that it should be independent and live on its own, and not simply be a mirror of source application databases.
How to Choose the DW Model?
Take a look at these links to get an idea of some of the different types of data warehouse models that are generally accepted as good practice. Choosing whether to go with a Dimensional Model, Corporate Information Factory, Data Vault, OLAP Cubes, or some combination or variation of these methodologies, is a good place to start. There are various schools of thought on how to use and implement these models and their strengths and weaknesses, so get comfortable with them before moving forward.
After you learn about each models strengths and weaknesses, here’s a list of questions to ask yourself when deciding what direction to take.
What are your primary needs? This is a short list to start with to determine where to head next:
- Keep a record of every transaction.
- Centralized Reporting.
- Distribute data for analytics to specialized teams.
- Operational reporting.
- Create cubes for ad-hoc data analysis.
- A full Audit Trail of data modifications.
- Ad-hoc SQL queries by power users.
There are as many solutions as there are implementations. These are a few different scenarios that I have been a part of over the last few years in my consulting work:
Company A doesn’t have large data and needs quick reporting and analysis. They were able to build a few data marts using a dimensional model. Some data staging was necessary but in many cases the DW can be populated directly from its source(s).
Company B is a manufacturer that requires a record of every process, tool and measurement in their factories, while the engineers want to rapidly analyze the data to improve the quality of the product they’re building. To record the data this company implemented a Data Vault warehouse model and from there populate a Dimensional Model with a number of targeted data marts. The data marts are used for reports but are also the source for a number of engineering and manufacturing OLAP cubes that allow engineers to perform quick data analysis in Excel.
Company C wants to create a set of consistent reports that can be deployed to anyone within the company that accurately reflect what the company is doing. To produce the reports correctly, the data is gathered from multiple databases within the company and stored in a staging area. A dimensional model is then populated from which all reports are produced, using a single controlled data source for all reports, providing the consistency the organization requires.
Informed Choices, Successful BI
Make an informed choice you can grow with and don’t make the data warehouse the sink hole of an otherwise solid Business Intelligence implementation. Throw more strikes than balls and you’ll have a data warehouse that will let your team develop great BI; don’t be the pitcher who loses by walking the opposing batters.
Last year I started a sequence of blog posts to discuss what I believe are the core principles you should adhere to when developing a data warehouse. I meant to post them over a period of several weeks, but I only made it through the first half. In order to complete what I start, this post will finish the sequence. I’ll cram the last 4 posts into one, which I probably should have done to begin with. These were the posts that comprise the first half of this sequence, followed by the remaining four principles:
- Data Warehouse Design Fundamentals
- #1 Autonomous Design
- #2 Dimensional Model
- #3 Transformation
- #4 Redundancy
#5 Slowly Changing Dimensions
A good data warehouse has good history, and this can’t be done in a dimensional model without slowly changing dimensions (SCD). The key to appropriate SCD development to identify for each dimension table on a column-by-column basis whether it is important for that column’s previous state to be maintained in the DW. For an Employee dimension, the worker’s name might not be a something I want to maintain historically. If the name changes, it’s still the same worker, so when I’m reporting sales figures by worker, I want all of the sales reported under the worker’s current name. However, if the worker changes department, I would want the change to be maintained historically, so when I report sales by department, the numbers are reported correctly.
Here’s an example of a Worker dimension before any changes. For worker Fred Smith, he changes offices from West to East. Jane Doe’s has a name change to Jane Jones. Since office is a SCD type 2 column the historical row is maintained, but it is end dated. A new row is added and it is made the current row since it has current data. Since Name is a SCD type 1 column, a new row is not added for Jane Jones, instead the existing row is updated.
|Worker Dim ID||Name||Gender||Office||Start Date||End Date||Current|
After SCD Process
|Worker Dim ID||Name||Gender||Office||Start Date||End Date||Current|
#6 New Data Structures
This principle says that new database tables should be created that don’t exist in any source system. It’s a reminder to not get locked in to how source databases are structured, and to invent new entities that supoort the requirements of the data warehouse. An example of this is a data warehouse I designed for a foster care agency. The OLTP database tracked abuse incidents, but the agency needed to report on repeat abuse incidents within a certain time frame and in particular situations. This caused me to create a fact table that specifically tracked Repeat Abuse. This way the agency could report on this metric easily, without concern for applying the right business rules or time frames.
#7 Ease of Query
The data warehouse should be easy to query; no long SQL, complex joins, nested subqueries, etc. The queries should perform well and they should be easy and obvious. There’s a standard query that should deliver most metrics from the dimensional model and it looks like this:SELECT col1, col2, SUM(col3) FROM Fact Table JOIN Date Dimension JOIN Dimension2 JOIN Dimension3 WHERE col3 = ? and col4 = ? GROUP BY col1, col2
This is simple. How often does the data warehouse satisfy a user’s request for a metric, report, or analysis? How often do they have to go to sources other than the DW to get the answer? This is a measure of Completeness. A high percentage demonstrates that the DW contains most or all of the data it requires. A low number indicates that users still don’t use the DW as much as they should. Completeness is difficult to achieve in Phase 1 of a development effort. A DW that can provide 65-70% of the data requests in Phase 1 is a good accomplishment.
If you are embarking on your first Data Warehouse design effort, keep these 8 principles in mind. Achieve them all and you will produce an excellent DW that will be valuable right away on initial deployment.