I’ll be presenting The Data Mining Lifecycle for the PASS Data Warehousing/Business Intelligence Virtual Chapter on June 2 at 12pm MST. This is the first time I’ll speak in front of the Virtual Chapter and I’m really looking forward to it. You might have seen me deliver this session at SQL Rally or at a SQL Saturday. It’s pretty much the same session even though I change it slightly almost every time I do it.
If you haven’t seen my presentation before, it’s a good chance to see data mining presented in a way that will allow you to take it with you and start using it. I show you how to go through the process of developing a model that you can use for forecasts and predictions, but I also emphasize the ways to get the most bang from the model by using it in your .NET applications, ETL and reports. It’s a great way to take your Business Intelligence work to the next level by adding predictive analytics to your solutions.
I look forward to seeing you there and to have a good discussion about Data Mining with Analysis Services!
Last year I started a sequence of blog posts to discuss what I believe are the core principles you should adhere to when developing a data warehouse. I meant to post them over a period of several weeks, but I only made it through the first half. In order to complete what I start, this post will finish the sequence. I’ll cram the last 4 posts into one, which I probably should have done to begin with. These were the posts that comprise the first half of this sequence, followed by the remaining four principles:
- Data Warehouse Design Fundamentals
- #1 Autonomous Design
- #2 Dimensional Model
- #3 Transformation
- #4 Redundancy
#5 Slowly Changing Dimensions
A good data warehouse has good history, and this can’t be done in a dimensional model without slowly changing dimensions (SCD). The key to appropriate SCD development to identify for each dimension table on a column-by-column basis whether it is important for that column’s previous state to be maintained in the DW. For an Employee dimension, the worker’s name might not be a something I want to maintain historically. If the name changes, it’s still the same worker, so when I’m reporting sales figures by worker, I want all of the sales reported under the worker’s current name. However, if the worker changes department, I would want the change to be maintained historically, so when I report sales by department, the numbers are reported correctly.
Here’s an example of a Worker dimension before any changes. For worker Fred Smith, he changes offices from West to East. Jane Doe’s has a name change to Jane Jones. Since office is a SCD type 2 column the historical row is maintained, but it is end dated. A new row is added and it is made the current row since it has current data. Since Name is a SCD type 1 column, a new row is not added for Jane Jones, instead the existing row is updated.
|Worker Dim ID||Name||Gender||Office||Start Date||End Date||Current|
After SCD Process
|Worker Dim ID||Name||Gender||Office||Start Date||End Date||Current|
#6 New Data Structures
This principle says that new database tables should be created that don’t exist in any source system. It’s a reminder to not get locked in to how source databases are structured, and to invent new entities that supoort the requirements of the data warehouse. An example of this is a data warehouse I designed for a foster care agency. The OLTP database tracked abuse incidents, but the agency needed to report on repeat abuse incidents within a certain time frame and in particular situations. This caused me to create a fact table that specifically tracked Repeat Abuse. This way the agency could report on this metric easily, without concern for applying the right business rules or time frames.
#7 Ease of Query
The data warehouse should be easy to query; no long SQL, complex joins, nested subqueries, etc. The queries should perform well and they should be easy and obvious. There’s a standard query that should deliver most metrics from the dimensional model and it looks like this:SELECT col1, col2, SUM(col3) FROM Fact Table JOIN Date Dimension JOIN Dimension2 JOIN Dimension3 WHERE col3 = ? and col4 = ? GROUP BY col1, col2
This is simple. How often does the data warehouse satisfy a user’s request for a metric, report, or analysis? How often do they have to go to sources other than the DW to get the answer? This is a measure of Completeness. A high percentage demonstrates that the DW contains most or all of the data it requires. A low number indicates that users still don’t use the DW as much as they should. Completeness is difficult to achieve in Phase 1 of a development effort. A DW that can provide 65-70% of the data requests in Phase 1 is a good accomplishment.
If you are embarking on your first Data Warehouse design effort, keep these 8 principles in mind. Achieve them all and you will produce an excellent DW that will be valuable right away on initial deployment.
There was a lot to like about the first (and best) SQL Rally that concluded last Friday on May 13th. It was held in a first class venue at the Marriott World Center in Orlando, FL that made it feel important and professional. Even though the Marriott is a large complex, the sessions, registration desk, vendor booths, and lunch were all right next to each other, making everything easy to find and allowing as much free and easy contact between attendees, organizers, and speakers. A diverse set of of sessions were delivered spanning the SQL Server product line and every session I attended had excellent audience participation with a lot of good questions and comments. It seems that because of the proximity of the sessions, I met more people than at any other conference I’ve attended. This was great because meeting people is one of the real benefits of being there. I also got to hang out with familiar faces from back home in Colorado including Chris Shaw and Mike Fal.
The success of a conference starts with the organizers and it was no different with SQL Rally. Jack Corbett, Kendal Van Dyke and Andy Warren and the volunteers are to be congratulated for making it happen. Registration was easy, locations were well-marked, and every session I attended was delivered by a true professional. I also went to the Women in Technology (WIT) luncheon and participated in the Speed Networking sessions. The WIT luncheon discussed the role of mentors in the high tech world and stressed the importance of finding a good mentor, especially for women. The mentor doesn’t have to be a woman, and men who have good mentoring skills should be open to mentoring women also. The Speed Networking session is a great tool for us ‘introverts’ and gives us a chance to engage in conversation with people we might not normally talk to.
On Friday morning I presented ‘The Data Mining Lifecycle’ and like the sessions I attended, had really good participation by the attendees. I think I got more questions in this session than any other time I’ve spoken on the subject, and the follow up after the session kept me talking in the hall for another hour, which I enjoyed very much. Then I went to lunch and continued talking about data mining, so I really got my money’s worth.
Two Successful Launches
I stayed in Orlando over the weekend and visited the Kennedy Space Center on Sunday. On Monday morning I looked out of my hotel room at 8:56 and saw the Space Shuttle Endeavour take off for the last time. I was luckier than most of the people who attended SQL Rally, since I got to see two successful launches in one week!