Monthly Archives: September, 2011

Predictive Models start by Gathering Data

In preparation for the PASS Summit, I’m putting together a few posts that describe the process I went through to put together my demo for the session See the Future with Visual Predictive Analytics.  I use SQL Server Analysis Services and Report Services in the presentation, and the first 3 posts will show how I created the predictive models using the Data Mining algorithms in Analysis Services.

We’re going to start by gathering the data we want to use and making it useful to us by structuring it in a way that works for us.  We could use what is in our OLTP databases or our data warehouse.  But why don’t we?  Because we want it to be structured something like this:

The key point of this post is to make the data look how you need it to look.  This is a mantra I repeat often about any work you do in Business Intelligence development.  The data model of an OLTP database does not serve BI well.  That’s why we create dimensional models, data vaults, and other data structures that make it easier for us to produce BI reports and analysis.  See these posts for more info on this concept: Autonomous Design and Transformation.

Get your Data Together and Transform It

The columns on the left in the table are what we are going to use to predict the column on the right titled ‘Predictive Value’.  In its most simple form this data can all come from the same table and it might already take this form, but that’s unlikely, so we have to prepare it.

Let’s change the table above to have specific column names.  We want to predict the total purchases a customer will make with us based on demographics we have acquired about our customers:

This is an important step.  Now we have a ‘view’ of the data that lays it out the way we want.  Where does this data comes from?  Hypothetically, Gender, Marital Status, Education Level and Total Children came from a customer survey we took so they are in our survey DB.  Yearly Income and Home Owner are in our loan application DB.  Total Annual Purchases can be found in our data warehouse where we have a star schema with a sales fact.

Adventure Works Sales Data Mart

The AdventureWorksDW2008R2 data warehouse has a sales data mart we can use to get all of this data.  Originally, it may have come to the data warehouse from different source systems, but now it is all in one place.  I created a Named Query in the data source view in my project called ‘Customer Purchasing Power’ with this underlying query:

SELECT     dc.CustomerKey, dc.Gender, dc.MaritalStatus, dc.YearlyIncome, dc.EnglishEducation, dc.HouseOwnerFlag, dc.NumberCarsOwned, DATEDIFF([year], dc.BirthDate, ‘2005-12-31’) AS Age, dg.PostalCode, dg.StateProvinceName, SUM(fs.SalesAmount) AS PurchaseTotal
FROM      FactInternetSales AS fs
INNER JOIN DimCustomer AS dc ON fs.CustomerKey = dc.CustomerKey
INNER JOIN DimGeography AS dg ON dg.GeographyKey = dc.GeographyKey
GROUP BY dc.CustomerKey, dc.Gender, dc.MaritalStatus, dc.YearlyIncome, dc.EnglishEducation, dc.HouseOwnerFlag, dc.NumberCarsOwned, dc.BirthDate, dg.PostalCode,  dg.StateProvinceName

I calculated Age as of 12/31/2005 because the Adventure Works data is not current and this seemed to be reasonable.  The result is that we get a view of the data that looks like this and it give us a good way to see what variables we are going to use to make the prediction lined up with the value we want to predict:


We’ve completed the first phase of data mining development by putting the data together that we need.  In my demo project I set this up so in the data source view and then I’m able to use it as the source for a Mining Structure.

Advertisements

Denver SQL Server Users Group and the Gartner Hype Cycle

I had a good time presenting my Visual Predictive Analytics session last night at the Denver SQL Server Users Group.  I’ve been a regular member of the group for abut 6 years and this is at least the 4th time of I’ve done something for the group.  There were around 50 people in attendance with really good interaction, several job openings announced, a great sponsor in Confio, and food from Maggiano’s.  The group is really well run by Marc Beacom and the Board, and I’ve gotten to know so many members of the group that they’ve become good friends.  I even had some clients in the audience who attended just to see me speak!

During the session, I show a graph called The Gartner  Hype Cycle for Emerging Trends to point out where Predictive Analytics is on the curve.  I had a little blip during the session when the graph came up on my slide deck.  I couldn’t remember the name of the graph and I didn’t have the title on my slide deck (I was reminded of it’s name by a member of the audience).  I also noticed that my version of the graph was one year old and it dawned on me that Gartner has published a newer graph this year.  I made sure I replaced it in my slide deck with the new graph published in July 2011:

It’s interesting that Predictive Analytics hasn’t changed much from one year to the next.  It’s still in pretty much the same spot having moved slightly forward within the slope of enlightenment.  It’s also interesting to see where other areas of Business Intelligence are.  Big Data has been getting a lot of attention lately and is on the way up the curve, and Social Analytics seems ready to peak and then fall.

Gartner does a good job putting this measurement of hype together and I recommend anyone working with emerging technology pay attention to where it is in the cycle.

From SQL Saturday 64 to 94

I worked on my presentation and finalized my travel plans for SQL Saturday #94 in Salt Lake City this weekend.  I’m looking forward to heading a short distance west to our neighboring state of Utah and spending time with other SQL Server professionals, as well as presenting my session See the Future with Visual Predictive Analytics.  This session will show people how to use the newer features of Reporting Services to display the results of predictive models built using the data mining algorithms of Analysis Services.  It’s a shorter and simpler version of the same presentation I will do at the PASS Summit in October.

As I write this, it feels like I just I just came back from SQL Saturday #64 in Baton Rouge that was held on August 6.  I didn’t even write a recap of the event like I normally do because the last several weeks have been so busy.  It was an excellent event that was held on the Louisiana State University campus.  Speaking there made me feel like a real professor!  I presented two sessions, Business Intelligence for Managers/Decision Makers and Tuning a Star Schema for High Performance.  The former session drew a large group but had little interaction, while the latter session had a small group but a lot of interaction during the session.  I want to thank all of the organizers and volunteers who treated us so well with their famous southern hospitality.

I’m looking forward another SQL Saturday as this has become a regular event for me.  I’ve been to a dozen or so and I’ve had a great experience at each one.  I believe that because they are held on Saturday (a day off work)  in a cozy venue is one reason because people are relaxed, but it also seems to me that these events attract more people that are either new to SQL Server or beginner level.  Meeting and working with people like this who take a weekend day to work on their technical skills and knowledge is very satisfying and enjoyable.