Category Archives: T-SQL

TSQL Tuesday #026 – Second Chances

This month’s T-SQL Tuesday #026, hosted by David Howard, gave people a wide-open task: write about any of the previous 25 topics.  Having so many to choose from should have made it easy, but it actually made it harder.  I could have done something with every topic, since rereading the old posts only sparked new ideas.

The Future

Instead of going to the past, I decided to go to the future and write about the new features in SQL Server 2012 that will help us with aggregation, a subject that should be close to any BI developer’s heart.  I chose the topic because of T-SQL Tuesday #016 by Jes Borland on aggregation, since SQL Server 2012 introduces some interesting new functions to help us aggregate better.

Windows in T-SQL

The windowing functions (which have nothing to do with Windows) in the upcoming version of SQL Server let us perform analytic calculations, including sliding aggregations, standard deviations, and more, in a single query.  This way we can compare averages, sums, and other measures across preceding and current rows in a much cleaner, simpler way than we can today.

SELECT OrderDateKey, DailySales,
    -- Sliding average over the current day and the 6 preceding days
    WeekAvgCurrent = AVG(DailySales)
        OVER (ORDER BY OrderDateKey ROWS BETWEEN 6 PRECEDING AND CURRENT ROW),
    -- Sliding average over the 7 days before that
    WeekAvg1WeekAgo = AVG(DailySales)
        OVER (ORDER BY OrderDateKey ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING),
    -- Week-over-week difference between the two averages
    WeekDiff = AVG(DailySales)
        OVER (ORDER BY OrderDateKey ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
        - AVG(DailySales)
        OVER (ORDER BY OrderDateKey ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING),
    -- Standard deviation over the same sliding week
    WeekStdDev = STDEV(DailySales)
        OVER (ORDER BY OrderDateKey ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
FROM (SELECT f.OrderDateKey, SUM(f.SalesAmount) AS DailySales
      FROM FactInternetSales f
      INNER JOIN DimDate dd ON dd.DateKey = f.OrderDateKey
      WHERE dd.CalendarYear = 2003
      GROUP BY f.OrderDateKey) AS DailyTable
ORDER BY OrderDateKey

The query produces this result from AdventureWorksDW. Only the first 12 rows are shown, and I modified the column titles slightly for readability:

Order Date Key   Daily Sales   This Week Avg   Last Week Avg   Difference Week-to-Week   This Week Std Dev
20030101         12444.75      12444.75        NULL            NULL                      NULL
20030102         19703.48      16074.12        NULL            NULL                      5132.7
20030103         13519.61      15222.62        NULL            NULL                      3917.584
20030104         18628.97      16074.21        NULL            NULL                      3623.873
20030105         13497.29      15558.82        NULL            NULL                      3343.268
20030106         4363.125      13692.87        NULL            NULL                      5461.919
20030107         14623.17      13825.77        NULL            NULL                      4998.41
20030108         15733.28      14295.56        12444.75        1850.814                  5001.517
20030109         18141.69      14072.45        16074.12        -2001.67                  4748.503
20030110         15558.27      14363.69        15222.62        -858.929                  4771.408
20030111         13976.53      13699.05        16074.21        -2375.15                  4386.78
20030112         13254.69      13664.39        15558.82        -1894.43                  4389.597

The results show that we can get the average sales for the last 7 days, the average for the 7 days before that, the difference between those two weeks, and the standard deviation for the week.

All in one fairly simple, straightforward query. Very powerful.
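
One refinement worth noting: WeekDiff repeats both window aggregates verbatim. A hedged sketch of the same query restructured with CTEs, so each average is computed once and the difference is derived in the outer SELECT (the results should match the original):

-- Compute each window aggregate once, then derive the difference
WITH Daily AS (
    SELECT f.OrderDateKey, SUM(f.SalesAmount) AS DailySales
    FROM FactInternetSales f
    INNER JOIN DimDate dd ON dd.DateKey = f.OrderDateKey
    WHERE dd.CalendarYear = 2003
    GROUP BY f.OrderDateKey
),
Windowed AS (
    SELECT OrderDateKey, DailySales,
        WeekAvgCurrent = AVG(DailySales)
            OVER (ORDER BY OrderDateKey ROWS BETWEEN 6 PRECEDING AND CURRENT ROW),
        WeekAvg1WeekAgo = AVG(DailySales)
            OVER (ORDER BY OrderDateKey ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING)
    FROM Daily
)
SELECT OrderDateKey, DailySales, WeekAvgCurrent, WeekAvg1WeekAgo,
    WeekDiff = WeekAvgCurrent - WeekAvg1WeekAgo
FROM Windowed
ORDER BY OrderDateKey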

T-SQL Tuesday and Best Practices – SQL is a Set-Based Language

This month’s T-SQL Tuesday #20 – T-SQL Best Practices is a tall order to write about.  Best practices can cover coding style, documentation, performance, and so many other areas that whole books can be, and have been, written about them.  When I consider the poor practices I’ve seen in T-SQL code, I think about the basic flaws developers commonly introduce: the misuse of subqueries when joins are appropriate, the overuse of DISTINCT when it’s not necessary, or the use of SELECT * almost by default, among others.

SQL is Set-Based

The worst case of poor T-SQL practice I have come across in the last few years was a client that had written a stored procedure in a non-set-based manner.  The 3,000 lines of code in a key SP iterated through a series of SELECT statements that handled one entity at a time in a procedural manner.  Data was queried in small chunks by incrementing row numbers in loops and placed in temp tables while other data was acquired; functions were then called and other logic was performed, until finally the next set of data was processed.  A normal execution of this SP made anywhere from 1,000 to 10,000 database calls.  The developer’s rationale for coding it this way was that it would be faster to get smaller pieces of data on each call and not put as much of a burden on the database engine.
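
To make the anti-pattern concrete, here is a hedged sketch of the loop shape I mean. The tables (#CustomerWork, #OrderTotals) and columns are hypothetical stand-ins; the point is the one-entity-per-iteration round trip to the database:

-- Hypothetical illustration of row-by-row processing (all names invented)
DECLARE @RowNum INT = 1, @MaxRow INT, @CustomerId INT
SELECT @MaxRow = COUNT(*) FROM #CustomerWork

WHILE @RowNum <= @MaxRow
BEGIN
    -- One query per entity: a separate database call on every iteration
    SELECT @CustomerId = CustomerId
    FROM #CustomerWork
    WHERE RowNum = @RowNum

    -- Process that single customer's data in isolation
    INSERT INTO #OrderTotals (CustomerId, OrderTotal)
    SELECT o.CustomerId, SUM(o.PurchaseAmt)
    FROM OrderHeader o
    WHERE o.CustomerId = @CustomerId
    GROUP BY o.CustomerId

    SET @RowNum = @RowNum + 1
END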

SQL Server Likes Joins

The SP in question was completely rewritten to perform 3 large queries each with 15-20 joins and performance improved by a factor of 50.  The SQL Server database engine is tuned to perform joins very well, especially those with good indexing and appropriate foreign keys.  Trying to work around the engine got this company and developer in trouble, and made the code very slow and much harder to work with.
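
A hedged sketch of the set-based equivalent of the loop above, using the same hypothetical tables (the client's real rewrite was of course far larger):

-- One statement, one database call; the engine picks the join strategy
INSERT INTO #OrderTotals (CustomerId, OrderTotal)
SELECT c.CustomerId, SUM(o.PurchaseAmt)
FROM #CustomerWork c
INNER JOIN OrderHeader o ON o.CustomerId = c.CustomerId
GROUP BY c.CustomerId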

The Moral

Don’t turn T-SQL into a procedural language.  Do make use of its really strong capacity to process data in sets.

T-SQL Tuesday and APPLY Knowledge: An Essential Tool for BI Development

Business Intelligence solutions require arranging data in ways that are different from our OLTP databases.  Navigating the source databases can often require advanced T-SQL techniques, and the APPLY operator is a great asset in the BI developer’s toolkit.

Here’s an example.  Consider how useful the APPLY operator can be as you move data from a source database to the data warehouse.  You’re developing ETL, you have to perform complex data transformations to populate a star schema, and the requirements call for a dimension table to hold a particular piece of data at a point in time.  Do you want the largest purchase made by a customer within the last year?  The date of the first time the customer made a purchase larger than $100?  Or the average of their last five purchases?

Consider a Customer dimension table that I want to populate.  The MaxPurchase column holds the largest purchase made by a customer on a single order within the last year, and MaxPurchaseDate holds the date of that purchase; these two columns are the focus of my example.

I’ll show how I think most developers would approach this problem, and then I’ll demonstrate how to do it with APPLY.  Often, a developer’s first inclination is to start with a standard join and then tweak it to get the right answer:

SELECT c.CustomerId, c.CustomerName, oh1.OrderDate, oh1.PurchaseAmt
FROM Customer c
INNER JOIN OrderHeader oh1 ON oh1.CustomerId = c.CustomerId
WHERE oh1.OrderDate BETWEEN '04/01/2010' AND '03/31/2011'

Of course, this returns every order for every customer for the year and that’s not what I want.  I only need the order with the maximum purchase amount, so I wrap the original query and do something like this:

SELECT CustomerId, CustomerName, OrderDate, MAX(PurchaseAmt) AS MaxPurchase
FROM (SELECT c.CustomerId, c.CustomerName, oh1.OrderDate, oh1.PurchaseAmt
      FROM Customer c
      INNER JOIN OrderHeader oh1 ON oh1.CustomerId = c.CustomerId
      WHERE oh1.OrderDate BETWEEN '04/01/2010' AND '03/31/2011') AS tbl
GROUP BY CustomerId, CustomerName, OrderDate
ORDER BY CustomerId

At first this looks right, but then I realize I’m still getting more than one row per customer: because OrderDate is in the GROUP BY, the query returns the maximum purchase for each date, not the maximum purchase for the year. Where do I go next? Maybe I start to add row numbers, add a join back to the OrderHeader table, or add a subquery.
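
For reference, the row-number route looks something like this; it produces the right answer, but the ranking logic buries the intent:

-- Rank each customer's orders by amount and keep only the top one
SELECT CustomerId, CustomerName, OrderDate, PurchaseAmt
FROM (SELECT c.CustomerId, c.CustomerName, oh1.OrderDate, oh1.PurchaseAmt,
             ROW_NUMBER() OVER (PARTITION BY c.CustomerId
                                ORDER BY oh1.PurchaseAmt DESC) AS rn
      FROM Customer c
      INNER JOIN OrderHeader oh1 ON oh1.CustomerId = c.CustomerId
      WHERE oh1.OrderDate BETWEEN '04/01/2010' AND '03/31/2011') AS ranked
WHERE rn = 1
ORDER BY CustomerId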

Instead, let’s see why using the APPLY operator gives us the most elegant solution:

SELECT c.CustomerId, c.CustomerName, p.OrderDate, p.PurchaseAmt
FROM Customer c
CROSS APPLY (SELECT TOP 1 oh.OrderDate, oh.PurchaseAmt
             FROM OrderHeader oh
             WHERE oh.CustomerId = c.CustomerId
               AND oh.OrderDate BETWEEN '04/01/2010' AND '03/31/2011'
             ORDER BY oh.PurchaseAmt DESC) p
ORDER BY c.CustomerId

Now I have a concise query that populates my Customer dimension with the maximum purchase and the date it occurred. The only choice left is whether to use CROSS APPLY or OUTER APPLY. The former returns only customers that made a purchase within the last year; the latter returns all my customers even if they didn’t make a purchase during the time frame (OrderDate and PurchaseAmt come back as NULL, just like a LEFT OUTER JOIN that failed to join).
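
The same pattern covers the other scenarios I mentioned earlier. Here is a hedged sketch of one of them, the average of each customer’s last five purchases, against the same example tables:

-- Average the five most recent purchases per customer
SELECT c.CustomerId, c.CustomerName, p.AvgLastFive
FROM Customer c
CROSS APPLY (SELECT AVG(t.PurchaseAmt) AS AvgLastFive
             FROM (SELECT TOP 5 oh.PurchaseAmt
                   FROM OrderHeader oh
                   WHERE oh.CustomerId = c.CustomerId
                   ORDER BY oh.OrderDate DESC) AS t) p
ORDER BY c.CustomerId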

What about performance? That probably deserves another blog post comparing it with the other solutions in detail. However, if the OrderHeader table is indexed on CustomerId and OrderDate (descending), performance should be satisfactory, and it may well be the best of the alternatives.
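
For reference, the index I have in mind would look something like this; the INCLUDE column is my own addition, there to avoid key lookups for PurchaseAmt:

-- Supports the seek on CustomerId and the OrderDate range filter
CREATE NONCLUSTERED INDEX IX_OrderHeader_CustomerId_OrderDate
ON OrderHeader (CustomerId, OrderDate DESC)
INCLUDE (PurchaseAmt)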

The APPLY operator: elegant, concise, essential for Business Intelligence, and useful in so many scenarios.