Monthly Archives: December, 2011

Columnstore Indexing at SQL Saturday #104

SQL Server 2012 will give us more index options by adding the Columnstore index. This post is an overview of this new feature, and it’s also a prelude to the presentation I’m delivering on Columnstore indexing January 7, 2012 at SQL Saturday #104 in Colorado Springs, CO.

My presentation, Using Columnstore Indexes in SQL Server 2012, will include demos that cover the important things to know about Columnstore indexing. If you haven’t heard much about this type of index structure, here are a few things to know:

– Intended for data warehousing, reporting, and ad-hoc queries
– Does not use a B-tree structure like row-based indexes
– Limited to one per table
– Makes the table read-only while the index exists
– Best built on columns with non-unique, repetitive data
– Delivers more advantage as data volume grows
– Column order doesn’t matter
– Provides maximum benefit for queries that aggregate data
– Not suited to single-row lookups
– Memory-intensive to build and to query
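Because the table becomes read-only, loading new data means working around the index. One common pattern (a sketch only; the table and index names here are hypothetical) is to disable the index, load, then rebuild:

```sql
-- Disable the columnstore index so the table accepts writes again
ALTER INDEX CSI_FactSales ON dbo.FactSales DISABLE;

-- Load the new rows while the index is disabled
INSERT INTO dbo.FactSales (ProductKey, OrderDateKey, SalesAmount)
SELECT ProductKey, OrderDateKey, SalesAmount
FROM staging.FactSales;

-- Rebuild to re-enable the index (memory-intensive on large tables)
ALTER INDEX CSI_FactSales ON dbo.FactSales REBUILD;
```

Partition switching is another option for large fact tables, since it avoids rebuilding the whole index.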

It’s Fast

Why build a Columnstore index? Queries against it can be very fast, much faster than with the typical row-based indexing we are used to. Columnstore indexes will make reporting faster, and they can be part of the solution to the elusive ad-hoc query challenge posed by power business users.

Creating a Columnstore

A Columnstore index can, and probably should, contain every column in the table, with a few exceptions. Columns with unique data should sometimes be left out. Also, because creating a Columnstore index requires a lot of memory, excluding columns you’ll never query reduces the stress on your system.
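As a sketch (the index and column names are hypothetical), a Columnstore built on just the columns a reporting workload actually touches might look like this:

```sql
-- Hypothetical subset index: skip unique columns like an order number
-- and anything reports never touch, to cut build-time memory pressure
CREATE NONCLUSTERED COLUMNSTORE INDEX CSI_FactSales_Subset
ON dbo.FactSales
(
    ProductKey,
    OrderDateKey,
    SalesTerritoryKey,
    SalesAmount
);
```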

Using the Index

The SQL Server optimizer decides whether it can benefit from the index, just as it does with other indexes. You can check the execution plan to see whether SQL Server chose to use it.
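If you want to compare plans side by side, SQL Server 2012 provides a query hint that suppresses the Columnstore index for a single statement (the query itself is just illustrative):

```sql
-- Ask the optimizer to plan as if the columnstore index weren't there,
-- which makes before/after comparisons easy
SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactInternetSales
GROUP BY ProductKey
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);
```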

Here’s a preview of the type of comparisons, analysis, and best practices I’ll be presenting at SQL Saturday #104. I’ll focus on comparing this new index type with row-based indexes, and I’ll help you determine when it’s the right tool to use.

This query uses the AdventureWorksDW and is a common type of query when using a dimensional model. It aggregates data from a fact table while joining to dimension tables:

SELECT dp.EnglishProductName, dp.ProductLine, dd.CalendarQuarter,
       SUM(f.SalesAmount) AS TotalSales
FROM FactInternetSales f
INNER JOIN DimProduct dp ON dp.ProductKey = f.ProductKey
INNER JOIN DimDate dd ON dd.DateKey = f.OrderDateKey
WHERE dd.CalendarYear = 2004
GROUP BY dp.EnglishProductName, dp.ProductLine, dd.CalendarQuarter
ORDER BY dp.EnglishProductName, dp.ProductLine, dd.CalendarQuarter
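The I/O and timing figures that follow assume the session-level statistics switches are turned on first:

```sql
-- Emit per-table logical/physical read counts and CPU/elapsed times
-- along with each statement's results
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
```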

This is the I/O and CPU usage:

CPU time = 200 ms,  elapsed time = 493 ms.
Table 'DimProduct'. Scan count 0, logical reads 516, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'FactInternetSales'. Scan count 1, logical reads 2062, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'DimDate'. Scan count 1, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

And here’s the execution plan, which shows SQL Server doing a clustered index scan on the fact table:

Columnstore Query

I’ll run the same query after adding a Columnstore index that includes every column in the FactInternetSales table.  Below is an abbreviated version of the DDL:

CREATE NONCLUSTERED COLUMNSTORE INDEX CX_FACT_INTERNET_SALES
ON dbo.FactInternetSales
(
    ProductKey,
    OrderDateKey,
    DueDateKey,
    ...
)
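Once the index exists, you can get a feel for how the data was segmented by querying the catalog views. A sketch, assuming the index name above (SQL Server 2012 exposes columnstore metadata through sys.column_store_segments):

```sql
-- Count segments and total rows per column of the columnstore index
SELECT s.column_id,
       COUNT(*)         AS segment_count,
       SUM(s.row_count) AS total_rows
FROM sys.column_store_segments s
JOIN sys.partitions p ON p.hobt_id = s.hobt_id
JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id
WHERE i.name = 'CX_FACT_INTERNET_SALES'
GROUP BY s.column_id
ORDER BY s.column_id;
```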

Executing the same query produces the following results:

CPU time = 80 ms,  elapsed time = 343 ms.
Table 'DimProduct'. Scan count 0, logical reads 516, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'FactInternetSales'. Scan count 1, logical reads 64, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'DimDate'. Scan count 1, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see from the execution plan, the query optimizer used the Columnstore index this time. CPU time dropped from 200 ms to 80 ms, and logical reads on FactInternetSales went from 2062 to 64. The disclaimer is that this is not a scientific test, but it’s a quick preview of what Columnstore indexes can do.

SQL Saturday #104

This is a snippet of what you’ll see if you attend my presentation in Colorado Springs. I’ll use larger tables to demonstrate how to get the most out of Columnstore indexes, how to use them with a partial set of columns and with partitioning, and I’ll compare them to their equivalent row-based queries.


Query a Star Schema for Trends over Time

A data mart in the form of star schema provides us with a powerful but easy way to develop SQL for metrics and reports.  The typical query will get data from a fact table, join to one or more dimensions, and perform some grouping and aggregations.  Very quickly the foundation for a new report can be developed.

With a good data model we can develop a fair number of reports with queries like this, but some queries are more complex. SCD type 2 dimensions ensure that fact data stays related to the dimension row as it existed when the fact was created. However, reporting accurate sales results sometimes requires us to stray from the typical star schema query.

Suppose a requirement comes in asking you to compare 1st quarter sales with 2nd quarter sales for each business unit in your organization, but to include only business units with greater than $100K in sales year to date.

The first thing we do is get a list of business units with YTD sales greater than $100K.

SELECT bud.bus_unit_id
FROM Sales_Fact sf
JOIN Business_Unit_Dim bud ON bud.business_unit_dim_id = sf.business_unit_dim_id
JOIN Date_Dim dd ON dd.date_dim_id = sf.date_dim_id
WHERE dd.Year_YYYY = 2012
GROUP BY bud.bus_unit_id
HAVING SUM(sf.sales_amt) > 100000

The Business_Unit_Dim table has SCD type 2 changes, but the query doesn’t care. It will return the right results even if a particular business unit has changed 100 times this year. The key is getting the right list of business unit ids and then using it as a subquery in the outer query.
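To see why, here’s a hypothetical pair of SCD type 2 rows (the data is invented for illustration): both versions of the same business unit share the natural key bus_unit_id, so grouping on that key rolls their facts together no matter how many surrogate keys exist:

```sql
-- Hypothetical SCD2 rows: the same business unit (natural key 'BU-07')
-- appears twice because it was renamed mid-year
DECLARE @Business_Unit_Dim TABLE (
    business_unit_dim_id INT,          -- surrogate key, one per version
    bus_unit_id          VARCHAR(10),  -- natural key, stable across versions
    business_unit_name   VARCHAR(50)
);
INSERT INTO @Business_Unit_Dim VALUES
    (101, 'BU-07', 'West Region'),
    (102, 'BU-07', 'Western Region');

-- Grouping on the natural key collapses both versions into one row,
-- which is why the YTD subquery is immune to SCD2 changes
SELECT bus_unit_id, COUNT(*) AS version_count
FROM @Business_Unit_Dim
GROUP BY bus_unit_id;
```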

The outer query looks like this:

SELECT bud.business_unit_name, dd.quarter_name, SUM(sf.sales_amt) AS total_sales
FROM Sales_Fact sf
JOIN Business_Unit_Dim bud ON bud.business_unit_dim_id = sf.business_unit_dim_id
JOIN Date_Dim dd ON dd.date_dim_id = sf.date_dim_id
WHERE dd.Year_YYYY = 2012
  AND dd.Quarter IN (1, 2)
  AND bud.bus_unit_id IN
      (SELECT bud2.bus_unit_id
       FROM Sales_Fact sf2
       JOIN Business_Unit_Dim bud2 ON bud2.business_unit_dim_id = sf2.business_unit_dim_id
       JOIN Date_Dim dd2 ON dd2.date_dim_id = sf2.date_dim_id
       WHERE dd2.Year_YYYY = 2012
       GROUP BY bud2.bus_unit_id
       HAVING SUM(sf2.sales_amt) > 100000)
GROUP BY bud.business_unit_name, dd.quarter_name
ORDER BY bud.business_unit_name, dd.quarter_name
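The same logic reads a little more cleanly as a common table expression (a sketch reusing the same hypothetical schema):

```sql
-- Name the qualifying-business-units query once, then join to it
WITH qualifying_units AS (
    SELECT bud.bus_unit_id
    FROM Sales_Fact sf
    JOIN Business_Unit_Dim bud ON bud.business_unit_dim_id = sf.business_unit_dim_id
    JOIN Date_Dim dd ON dd.date_dim_id = sf.date_dim_id
    WHERE dd.Year_YYYY = 2012
    GROUP BY bud.bus_unit_id
    HAVING SUM(sf.sales_amt) > 100000
)
SELECT bud.business_unit_name, dd.quarter_name, SUM(sf.sales_amt) AS total_sales
FROM Sales_Fact sf
JOIN Business_Unit_Dim bud ON bud.business_unit_dim_id = sf.business_unit_dim_id
JOIN Date_Dim dd ON dd.date_dim_id = sf.date_dim_id
JOIN qualifying_units qu ON qu.bus_unit_id = bud.bus_unit_id
WHERE dd.Year_YYYY = 2012 AND dd.Quarter IN (1, 2)
GROUP BY bud.business_unit_name, dd.quarter_name
ORDER BY bud.business_unit_name, dd.quarter_name;
```

Either form works; the CTE just makes the “qualifying units” step reusable and easier to test on its own.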

This example shows how to trend data over time with a star schema, and that it’s the date dimension that gives us the ability to do that. The approach doesn’t depend on SCD changes, and it shows we often need to query the star once to get the right subset of data, then query it again to get the ‘report’ data. It also shows that a healthy library of queries we can reuse as subqueries greatly improves our query development process.