Monthly Archives: March, 2010

The Business Process Workflow Data Warehouse – Part 2

In part 2 of The Business Process Workflow Data Warehouse I will review some of the problems encountered with data quality when developing this type of warehouse. Data quality is often a sizeable problem the data architect needs to solve before deploying a data warehouse. Poor data quality in any data warehouse can be a showstopper that causes major delays and even outright failure. Be prepared for these problems to be worse in the Business Process Workflow-oriented Data Warehouse. The points numbered 4-6 below are the second of six points I made earlier and they address the quality of your data in different ways:

4. Workflow data can be removed or hidden due to incorrect data entry, legal issues, or duplicate data entry.
5. There are no standard practices for data as there are with financial, accounting and sales applications.
6. Data is often much more poorly entered and verified.

The central theme to these three statements is that workflows are vague and imprecise when compared with their transactional counterparts. Businesses are dependent on their transactional data. An order can’t be processed unless the transaction has been entered into a software application. However, a child can be visited by a social worker without entering any data into their computer system. If the social worker visits the child and enters the wrong date for the visit, nothing is missed, except the measure is reported incorrectly. Two rules of thumb to keep in mind when working with workflows:

o Data entry for workflows can be wrong while the work performed was correct.
o Data entry often occurs after the work has been performed.

To support this ‘flexible’ type of data entry requires a change in the data architect’s way of thinking. Your design must allow for data to be entered after the fact. Your design must allow for data to be corrected. Your design must allow for partial data to be entered for an activity, and then updated later with more data for the same activity.

This approach can be hard to accept if you’ve developed a transactional data warehouse before. You will break rules that you previously thought were untouchable, but accepting these ‘broken’ rules will put more accurate measurable data into your data warehouse.

Advertisements

The Business Process Workflow Data Warehouse – Part 1

In my previous post I introduced a particular type of Data Warehouse, one focused on business processes.  I pointed out six differences between a traditional, transaction-based data warehouse and the business process workflow-oriented warehouse.  This section will cover the first 3 items in the list:
1. Business processes span long periods of time, some take hours, but others months or even years.
2. Workflows don’t have definite time periods and some may not end.
3. Workflows almost always require accumulating facts that must be updated after they are stored when more data arrives.
These assertions force us to think about fact tables differently than we are accustomed to.  Business process workflows modeled as fact tables will often be factless, but with a large number of dimensions.  Business processes often don’t have an obvious value that should be measured, so the important value to measure is the ‘fact’ that the business process has occurred or is occurring, and the length of time the workflow has taken to complete.  These types of fact tables become the norm, instead of the exception that they are in the transactional data warehouse.
This type of table is known as an accumulating fact table, as termed by the widely used Kimball methodology for developing data warehouses.  An accumulating fact table is updated as new data emerges when all of the data is not present at the time the fact row is created.  For example, I worked on a data warehouse where children were placed in foster homes by social services.  The placement of a child prompts the creation of a fact row in the data warehouse, but several columns of the fact could not be populated, since they had not yet occurred.  Data such as the end date of the placement and the placement end reason are not known until the placement comes to an end at some indeterminate date in the future.  However, creating the fact at the time the placement begins has value since it can be used to measure the number of children in placement in various ways using a number of dimensions.  The process from beginning to end often takes months and can take years.
Don’t worry about having to update this kind of table several times after it’s been stored.  This will  be a common event in a business process workflow data warehouse.  In the example of the child’s placement, several checkpoints along the way generate more data.  The dates of the checkpoints and any relevant data about the checkpoints is captured and the fact row is updated with the new data.  Before the late arriving data arrives, any dimension pointers for that data will point to a dimensional value called ‘Unknown Value’ or ‘Data not Available’.  This provided our users with a good way to know that the placement hasn’t ended yet.

The Business Process Workflow Data Warehouse

There are a number of different ‘types’ of data warehouses. I say this with trepidation because a data warehouse is usually not referred to as having a type. Perhaps I should rephrase my statement; there are a number of different types of data marts. A data warehouse can contain multiple data marts, each with a different aim. For the purposes of this discussion, I’m talking about a data warehouse as if it has a primary objective, which in my experience most do.

Read any book on data warehousing, look at a blog with a sample star schema, or take an introductory data warehouse design class, and you will see the same pattern. You will see a fact table representing a well-defined transaction surrounded by several dimensions. The fact table will be easy to understand; it usually reflects a common transaction that has been mapped from an application in a straightforward manner. It will look something like this:

In the diagram above, the fact table is derived from a sales transaction that occurred in an OLTP database. This star schema is a good learning tool and it’s a pattern that can be reused for many transactional processes, but let’s talk about business process workflows.

Examples of Business Process Workflows

Business Process Workflows happen over periods of time, are vague in their scope, and acquire new or changing data as the process moves forward. Also, it can be uncertain when they are actually complete. These are a few examples of the kinds of business processes I’m referring to:

– A financial services company receives a request from a customer to open a new 401k account.
– A social worker places a child in a foster home.
– A laptop in sent in to the manufacturer for warranty repair.

What’s different about the Business Process Workflow data warehouse vs. the Transactional data warehouse? The following are significant distinctions:

1. Workflows span long periods of time, some take hours, but others months or even years.
2. Workflows don’t have definite time periods and some may not end.
3. Workflows almost always require accumulating facts that must be updated after they are stored when more data arrives.
4. Workflow data can be removed or hidden due to incorrect data entry, legal issues, or duplicate data entry.
5. There are no standard practices for data as there are with financial, accounting and sales applications.
6. Data is often much more poorly entered and verified.

In the next several days, I will discuss each of these points in more detail, review why they make the Business Process Workflow data warehouse a true challenge, and discuss solutions for each. There is a paradigm shift that must be grasped and some deep-seated data warehouse modeling principles must be relinquished.