This article is the second and final part in the series “How to Solve Common Data Quality problems using Data Quality Services”. In part 1, I covered:

  1. Data Standardization
  2. Identifying and correcting unrealistic or invalid values
  3. Validating and correcting records using Regular Expressions

If you haven’t read part 1; I’ll also encourage you to practice techniques described in part 1 by downloading the data files from here DQSMaterialsPart-1 because to keep Part 2 brief, I’ve not documented detailed steps.

In Part 2, I’ll cover following topics:

  1. Validating and correcting value in a field based on value in another field(s) in the same row
  2. Reduce data duplication
  3. Cleaning records using reference data-set from Windows Azure Data Market

 1. Validating and correcting value in a field based on value in another field(s) in the same row

Scenario:

You have a data set of employee name, employee category and yearly salary. You want to validate the value in the yearly salary column based on the employee category. Here are the business rules:

Employee Category Yearly Salary Greater Than Yearly Salary less Than
A 100000 200000
B 80000 100000
C 60000 80000
D 40000 60000

Read As: If employee category is A then yearly salary should be greater than 100000 and less than 200000.; for the purpose of the demo, every number is a dollar.

Here’s the dataset that we’ll clean using DQS:

Employee Name Employee Category Yearly Salary
Jon V Yang A 127000
Eugene L Huang B 90000
Ruben  Torres C 83000
Christy  Zhu D 70000
Elizabeth  Johnson A 90000
Julio  Ruiz C 65000
Janet G Alvarez D 43000
Marco  Mehta B 81000

*Names are taken from Adventure works database. The values in the names and salary column are purely fictional.

Solution:

It’s just an overview; it’s not covered in step by step fashion:

1. Create a KB > created three domains: Employee Category, Employee Name and Yearly Salary

2. Created a composite domain:

DQS SQL Server Composite Domains

DQS SQL Server Composite Domains

3. under Composite Domain (CD) Rules Tab:

I started out with defining the rules for category A:
2-DQS-Article-SQL-Server

 

And I completed with specifying business rules for all four categories:

3-Data-Quality-Services-SQL-Server

4.  Published KB

5. Created a New DQS project > Selected the KB created above

6.  Selected the data source > Mapped domains

7. I also selected from the list of selected composite domains:

4-Data-Quality-Services-SQL-Server

8. After seeing the cleaning statistics, I switched to the invalid tab to see the records that didn’t match the record:

5-Data-Quality-Services-SQL-Server

9. So by now, we have identified records that do not match the rules. A data steward can now correct them if he/she wants to or leave them as it is. Notice the Approve/reject check boxes.

Note that: Not only can you update the yearly salary but you can also update the employee category. So if you think that the employee has been wrongly categorized, you can change that.

10. After this, you can export the data-set which has records that match the business rules and the data-set would be ready to be consumed!

2. Reduce Data Duplication

In this post, we’ll see matching activity in action. For the demo purpose, I’ll be using Data-Sets that I’ve found via Movies Sample & EIM tutorial .

Situation 1:

we’ve a list of movies and we want to identify “matching” movie titles.

Solution 1:

Create a Matching Policy
1) Mapping the Domains:

6-Data-Quality-Services-SQL-Server

2) Configuring the Matching Policy:

7-Data-Quality-Services-SQL-Server

Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.
3) View Matching results:

8-Data-Quality-Services-SQL-Server

4) Publish the KB (which stores the matching policy). Once you have the matching policy, you can use this in a Data Quality Project:

5) See How I ran a Data Quality Project (with matching activity) in the image below.

9-Data-Quality-Services-SQL-Server

Note: You can export the clean data-set via Data Quality Project.

Situation 2:
We’ve a list of Supplier Names and we want to identify “matching” supplier names.

Note that in this situation, you would see how to use more than one domain to create a matching rule.
Solution 2:
Most of the steps would remain same as situation 1, But I want to show you Matching Policy & Matching Results
Matching Policy:

10-Data-Quality-Services-SQL-Server

Matching results:

11-Data-Quality-Services-SQL-Server

Also, I want to show that, the matching policy (DQS KB) that we created earlier can be used in Master Data Services too!

12-Data-Quality-Services-SQL-Server

3. Cleaning records using reference data set from Windows Azure Data Market

Data Quality Services can help you clean your data using a reference data set from Windows Azure Data Market. For the purpose of the demo, we’re going to clean few address records by using the Melissa Data’s Address check reference data set.

1) Configure Data Quality Services to use reference data sets. Tutorial here:

http://msdn.microsoft.com/en-us/library/hh213070.aspx

2) Go to datamarket.azure.com > and I subscribed to “Address check – verify, correct, Geocode US and Canadian Addresses Data” ; we’ll see how to use this in next steps.

3) Note that as of now, we can only have reference data-sets from Azure Data Market. However, the MSDN thread: http://social.msdn.microsoft.com/Forums/hu-HU/sqldataqualityservices/thread/750faef8-dd69-4a71-b0c1-18ca2f93d59d suggests that we’ll have an ability to provide our (private/self-provided) reference data-sets in next service updates. So for now we’ll have to connect to Azure data market for reference data-sets and for the purpose of the Demo, I connected to Melissa Data’s Address Check.

4) Now via DQS client, let’s create a Knowledge Base!

5) I’ve created following domains:

Address Line

City

State

Zip

And a composite domain: Full Address which contains domains: Address line, city, state and zip in the composite domains.

31-Data-Quality-Services-SQL-Server

6) And for the composite domain Full Address – this is how you configure reference data-sets:

32-Data-Quality-Services-SQL-Server

7) After creating the Knowledge Base, start a new DQS project. Here the cleaning happens at the composite domain level and this is a demo of how DQS uses third party reference data-set to classify records as suggested, new, invalid, corrected or correct. You can see that DQS corrected few records and mentions that the reason in this case was Melissa data’s address verify reference data-set:

33-Data-Quality-Services-SQL-Server

Thus you can see that you can use reference data sets to clean your data.

before we conclude this article, here’s a Bonus Tip!

Difference between domain values vs. Term based Relations:

Domain Values Term Based Relations
It allows us to correct the entire value in a domain It allows us to correct a word/term within a value in a domain
E.g.USA -> United StatesUS -> United States E.g.John Senior -> John Sr.John Junior -> John Jr.George Sr -> George Sr.Mister Smith -> Mr. Smith
Note that the entire value in the domain is changed. Note that only PART of the domain value was changed.

Conclusion:

In Part 2, we covered following topics:
1. Validating and correcting value in a field based on value in another field(s) in the same row
2. Reduce data duplication
3. Cleaning records using reference data-set from Windows Azure Data Market