Big Data, BI and SQL Saturday in Silicon Valley

In preparing to speak at SQL Saturday in Silicon Valley this week, I was thinking of ways to explain map/reduce in my session. One way is to look at the parts of map/reduce that are analogous to a SQL query, which most of the attendees will find familiar. A simple SQL statement like the ones we write for business intelligence purposes often has a filter, an aggregation, and a group by that defines the grain of the aggregation.
Here’s an example of a query that has all 3 of these components to sum sales by region and quarter for the year 2013:

select dg.region, dd.quarter, sum(f.sales_amt) as sum_sales
from fact_sales f
join dim_geography dg on dg.dim_geography_id = f.dim_geography_id
join dim_date dd on dd.dim_date_id = f.dim_date_id
where dd.year = 2013
group by dg.region, dd.quarter

When writing the code in a map/reduce environment, the three components are still there, just expressed differently. Any data not needed for the result is filtered out in code in the mapper function, and the group by values are identified as the key the mapper emits. In the reducer, the aggregation is explicitly coded to sum the values for each key. All three steps take place in a procedural way in this C# code:

public override void Map(string inputLine, MapperContext context)
{
   //Split the pipe-delimited input into an array
   var aryValues = inputLine.Split('|');

   // This is the filter: offset 10 in the array is the year of sale. If it isn't 2013, discard the input.
   if (aryValues[10] == "2013") {
      // write the region and quarter (this is the group by) as the key, and the sales amount (offset 22) as the value
      context.EmitKeyValue(aryValues[11].ToUpper() + " " + aryValues[12].ToUpper(), aryValues[22]);
   }
}

//Reducer
public class NamespaceReducer : ReducerCombinerBase
{
   public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
   {
      // Initialize the sum of sales for this key (region and quarter)
      decimal dSumSales = 0;

      // loop through each sales amount and add it to the sum
      foreach (string value in values)
         dSumSales += Convert.ToDecimal(value);

      // write the key and the sum of the sales to the output
      context.EmitKeyValue(key, dSumSales.ToString());
   }
}
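For reference, here’s a rough sketch of how a job like this might be submitted, assuming the Microsoft .NET SDK for Hadoop (Microsoft.Hadoop.MapReduce), which is where MapperContext and ReducerCombinerBase come from. The class names SalesMapper and SalesReducer are placeholders for the mapper and the NamespaceReducer shown above, and the paths are made up, so treat it as a starting point rather than a finished driver:

using Microsoft.Hadoop.MapReduce;

// Hypothetical driver: SalesMapper and SalesReducer stand in for the mapper and reducer above,
// and the input/output paths are placeholders.
public class SalesJobDriver
{
   public static void Main()
   {
      var hadoop = Hadoop.Connect();

      var config = new HadoopJobConfiguration();
      config.InputPath = "/input/sales";                        // pipe-delimited sales extract
      config.OutputFolder = "/output/sales_by_region_quarter";

      // Run the map and reduce steps over the input and write the summed key/value pairs to the output folder
      hadoop.MapReduceJob.Execute<SalesMapper, SalesReducer>(config);
   }
}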

Since we’re working without a schema and the data is semi-structured (or unstructured), the challenge is finding the sales amount, sales quarter, and region in each data row. The example shown assumes the data is pipe-delimited, which makes those fields easier to locate.
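To make those offsets concrete, here’s a small, self-contained sketch of what one of the pipe-delimited rows could look like. The 23-field layout and the values are made up purely for illustration; they simply mirror the offsets the mapper reads (10 for the year, 11 for the region, 12 for the quarter, 22 for the sales amount):

using System;

public class SampleRow
{
   public static void Main()
   {
      // Build a hypothetical 23-field pipe-delimited row so the offsets line up with the mapper above.
      // All of the values are made up for illustration.
      var fields = new string[23];
      fields[10] = "2013";    // year of sale
      fields[11] = "West";    // region
      fields[12] = "Q3";      // quarter
      fields[22] = "149.95";  // sales amount
      string inputLine = string.Join("|", fields);

      // Splitting the line recovers the same positions the mapper checks.
      var aryValues = inputLine.Split('|');
      Console.WriteLine(aryValues[10] == "2013");                                  // True: the row passes the year filter
      Console.WriteLine(aryValues[11].ToUpper() + " " + aryValues[12].ToUpper());  // WEST Q3: the group by key
      Console.WriteLine(aryValues[22]);                                            // 149.95: the value the reducer will sum
   }
}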

To see this in action, join me at 11am in Mountain View this weekend for my session Getting Started with Big Data and BI.

BI and Analytics in 2014

It’s become a tradition at the end of the year for bloggers and industry pundits to make predictions about their fields of interest, and business intelligence is not spared this annual ritual. While I’ve avoided (or forgotten) making my own predictions, it seems that the future always comes up when I’m speaking at an event. People want to know what’s coming so they can prepare.

Four Areas to Explore

With the PASS BA conference coming up in May in San Jose, the question of future direction will certainly come up again. The field of Business Intelligence is undergoing a shift in how it’s delivered to users and how it’s developed. After a period of settling on common data architectures and practices, including Star Schema models, ETL processes, and cubes for what-if analysis, we are seeing new technologies drive us in different directions. In my opinion, BI Architects and Practitioners should start thinking about the following:

1 – Add predictive analytics to BI teams to get even more value from their data.

2 – Look into data virtualization for some reporting solutions.  This can greatly reduce the time and cost of ETL development.

3 – Consider putting some data in the cloud, when it makes sense.

4 – Make Excel a centerpiece of your designs. If you’re like a lot of BI Architects, you’ve been trying to displace Excel for years, but with so much BI functionality now built in, it’s worth keeping.

Take the time to learn more about these topics and find out how they fit into your architecture’s future. All of them will be covered in multiple sessions at the PASS BA Conference.

24 Hours of PASS BA Edition

As a preview to the PASS Business Analytics conference in San Jose this May, a full day of webinars called 24 Hours of PASS BA Edition is being held today. The webinar speakers are all presenting at the BA conference, and I’m scheduled to present tonight at 8pm MT (3am GMT). The session I’m doing is ‘An Introduction to Predictive Modeling’ and is intended for people who have worked with data and business intelligence but haven’t been exposed to predictive analytics. The session will introduce you to some of the terms, concepts, and ideas in this growing area so you can start moving yourself and your company in that direction.

As is usually the case, I think I have too much material for a 45-50 minute session, and I probably won’t be able to get through my churn demo, so I’ll only mention it briefly. I’m also going to talk about hiring decisions and how to build a model for that difficult process. You can use the churn data as a self-teaching exercise by downloading the Excel file below, and I’ve also included a link to the employment data for your own use.

Employment Data

Churn Data File

If you attended, thanks for being at my 24 Hours of PASS session. I hope to see you in San Jose in May.