Predictive Models start by Gathering Data

In preparation for the PASS Summit, I’m putting together a few posts that describe the process I went through to put together my demo for the session See the Future with Visual Predictive Analytics.  I use SQL Server Analysis Services and Report Services in the presentation, and the first 3 posts will show how I created the predictive models using the Data Mining algorithms in Analysis Services.

We’re going to start by gathering the data we want to use and making it useful to us by structuring it in a way that works for us.  We could use what is in our OLTP databases or our data warehouse.  But why don’t we?  Because we want it to be structured something like this:

The key point of this post is to make the data look how you need it to look.  This is a mantra I repeat often about any work you do in Business Intelligence development.  The data model of an OLTP database does not serve BI well.  That’s why we create dimensional models, data vaults, and other data structures that make it easier for us to produce BI reports and analysis.  See these posts for more info on this concept: Autonomous Design and Transformation.

Get your Data Together and Transform It

The columns on the left in the table are what we are going to use to predict the column on the right titled ‘Predictive Value’.  In its most simple form this data can all come from the same table and it might already take this form, but that’s unlikely, so we have to prepare it.

Let’s change the table above to have specific column names.  We want to predict the total purchases a customer will make with us based on demographics we have acquired about our customers:

This is an important step.  Now we have a ‘view’ of the data that lays it out the way we want.  Where does this data comes from?  Hypothetically, Gender, Marital Status, Education Level and Total Children came from a customer survey we took so they are in our survey DB.  Yearly Income and Home Owner are in our loan application DB.  Total Annual Purchases can be found in our data warehouse where we have a star schema with a sales fact.

Adventure Works Sales Data Mart

The AdventureWorksDW2008R2 data warehouse has a sales data mart we can use to get all of this data.  Originally, it may have come to the data warehouse from different source systems, but now it is all in one place.  I created a Named Query in the data source view in my project called ‘Customer Purchasing Power’ with this underlying query:

SELECT     dc.CustomerKey, dc.Gender, dc.MaritalStatus, dc.YearlyIncome, dc.EnglishEducation, dc.HouseOwnerFlag, dc.NumberCarsOwned, DATEDIFF([year], dc.BirthDate, ‘2005-12-31’) AS Age, dg.PostalCode, dg.StateProvinceName, SUM(fs.SalesAmount) AS PurchaseTotal
FROM      FactInternetSales AS fs
INNER JOIN DimCustomer AS dc ON fs.CustomerKey = dc.CustomerKey
INNER JOIN DimGeography AS dg ON dg.GeographyKey = dc.GeographyKey
GROUP BY dc.CustomerKey, dc.Gender, dc.MaritalStatus, dc.YearlyIncome, dc.EnglishEducation, dc.HouseOwnerFlag, dc.NumberCarsOwned, dc.BirthDate, dg.PostalCode,  dg.StateProvinceName

I calculated Age as of 12/31/2005 because the Adventure Works data is not current and this seemed to be reasonable.  The result is that we get a view of the data that looks like this and it give us a good way to see what variables we are going to use to make the prediction lined up with the value we want to predict:

We’ve completed the first phase of data mining development by putting the data together that we need.  In my demo project I set this up so in the data source view and then I’m able to use it as the source for a Mining Structure.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: