Category Archives: Big Data

Big Data, BI and SQL Saturday in Silicon Valley

In preparing to speak at SQL Saturday in Silicon Valley this week, I was thinking of ways to explain map/reduce in my session.  One way is to look at the parts of map/reduce that are analogous to a SQL query, which most of the attendees will find familiar. A simple SQL statement like ones we write for business intelligence purposes often have a filter, an aggregation, and a group by for aggregations.
Here’s an example of a query that has all 3 of these components to sum sales by region and quarter for the year 2013:

select dg.region, dd.quarter, sum(f.sales_amt) as sum_sales from fact_sales f
join dim_geography dg on dg.dim_geography_id = f.dim_geography_id
join dim_date dd on dd.dim_date_id = f.dim_date_id
group by dg.region, dd.quarter
where dd.dd.year = 2013

When writing the code in a map/reduce environment, the 3 components are still there in a different way.  Any data not needed for the result is filtered out in the mapper function using code, and the the group by values are identified as the key in the mapper function.  In the reducer, the aggregation is explicitly coded to sum by the key value.  All three steps take place in a procedural way in this C# code:

public override void Map(string inputLine, MapperContext context)
{
   //Split the pipe-delimited input into an array
   var aryValues = inputLine.Split(‘|’);

   // This is the filter since the 10th position in the array is year of sale. If it isn’t 2013, discard the input.
   if (aryValues[10] == “2013”) {
      // write the region and quarter (this is the group by), and sales amount (offset 22)
      context.EmitKeyValue(aryValues[11].ToUpper() + ” ” + aryValues[12].ToUpper(), aryValues[22]);
   }
}

//Reducer public class NamespaceReducer : ReducerCombinerBase {
public override void Reduce(string key, IEnumerable<float> values, ReducerCombinerContext context) {
   // Initialize sum of sales for this key (region)
   decimal dSumSales = 0;

   // loop though each sales and sum it
   foreach (string value in values)
      dSumSales += Convert.ToDecimal(value);
   // write the sum of the sales to output
   context.EmitKeyValue(key, dSumSales.ToString());
   }
}

Since we’re working without a schema and the data is semi-structured (or unstructured), the challenge that we have is in finding the sales amount, sales quarter, and the region in each data row.  The example shown assumes the data is pipe-delimited, so that makes it easier to identify.

To see this in action, join me at 11am in Mountain View this weekend for my session Getting Started with Big Data and BI.

Advertisements

SQL Saturday – The Spring Season

This Saturday February 23 I’ll be in Silicon Valley speaking on Big Data for SQL Server DBAs for a SQL Saturday event.  The SQL Saturday events have become so prevalent that it I see them as being grouped together by season.  For me the spring season started a couple of weeks ago in early February when I went to Albuquerque for their inaugural event, which was very well put together.  I was impressed with the participation, considering the size of the city, and the enthusiasm.

I went to the Silicon Valley event at the Microsoft offices in Mountain View, CA last year and it was a first-rate experience too.  Everything was great, including the speakers, the food, and location.  If you’ve been wondering what a SQL Saturday is all about, make it a point to attend one this spring.  If you are in the USA, there will be one near you no matter where you live.   There are also numerous events planned internationally.

I know Mark Ginnebaugh, the organizer of SQL Saturday in Silicon Valley, is putting together another fantastic event this year.  I hope to see you there!