Handling duplicates with SSIS and Kingswaysoft

The article is written in english with a swedish summary. Scroll down for the english version.

Sammanfattning på svenska

Det är inte varje dag man vid en migrering eller integration behöver skapa en dubbletthantering som måste kunna hantera kontroller i flera steg. Men ibland händer det att man springer på ett system som varit i drift under flera år och som helt enkelt inte har någon dubbletthantering.

När ett sådant system ska agera datakälla mot ett CRM måste man reda ut vilken väg man ska välja för att städa upp alla dubbletter. Oavsett val står man inför en ganska stor utmaning räknat i mantimmar.

I de fall där det går faller ofta valet på en manuell genomgång av källsystemet vilket medför många långa timmar av hopslagning samt rättning genom att kopiera och klistra in data. Detta är något som oftast utförs av kunden själv då det är kunden som känner sin data bäst. Man måste därtill vara medveten om att detta arbete aldrig blir klart utan måste fortgå kontinuerligt för att inte datakvaliteten, sakta men säkert, ska bli sämre igen.

Ett annat val är att man läser in all data och sköter dubbletthanteringen i CRM. Detta är dock endast en praktiskt genomförbar lösning om man bara ska läsa över data till CRM en gång och efter det klippa bandet.

I det aktuella fallet jag nu ska beskriva var det inte möjligt att manuellt städa bland dubbletterna i källsystemet och migreringen skulle följas av en integration på dagsbasis. Källsystemet skulle alltså leva vidare precis som vanligt medan CRM skulle ge en lättöverskådlig men komplett bild av varje kund, inklusive information om alla dess dubbletter. Dubbletthanteringen behövde alltså hanteras av själva integrationen.

För ett ge fler personer, som råkar hamnar i samma eller en liknande sits, tillgång till vår lösning skrev jag mitt inlägg på engelska. Alla måste ju dra sitt strå till stacken för att upprätthålla den fantastiska kunskapsdelning som omger Dynamics CRM.

Vill du läsa mer om att städa ditt CRM så kan du läsa Kathrine Hogesths blogg Städa ditt CRM.

The scenario

A customer wants to use CRM, more or less exclusively, as a marketing hub. CRM should be updated with new/updated data from each source system on a daily basis.

To avoid the risk of sending duplicate offers and marketing material to the same person, the person using the system needs to make sure that every physical person only exists as a single Contact in CRM. To make it even more complicated, several source systems will be integrated into CRM and the duplicate detection rules need to be applied both within each source and between different sources.

The task at hand

The source systems contain customer data that includes quite a lot of duplicates, i.e. the same physical person can have several Customer Numbers in each source system with the same, or varying, personal data.

In CRM, all duplicates belonging to the same physical person should be merged into a single Contact. Each Contact will have one record for each Customer Number per source, in an associated entity (called Customer Number). The Contact should always have the latest information available from any source system while the Customer Number entity holds the latest information for that specific Customer Number and source.

The definition of a duplicate was decided as being an exact match for at least one of the following fields/group of fields:

Social security number
First name, Last name, Address1_ line1, Address1_city
E-mail
Mobile phone

Step 1: Creating caches used for lookups

First, we use a Kingswaysoft Dynamics CRM Source to fetch all existing Customer Numbers belonging to the current source system, together with the GUID of the associated Contact and the last modified date. This can easily be done with Fetch XML. If you don’t know how to build a query with Fetch XML, you can always build your query by using advanced find in CRM and download the script.

We then pipe this data into a Cache Transform. This cache will be used both to determine if a Customer Number already exists in CRM and if the post should be updated from the source data. I use the Cache Transform component, instead of querying CRM directly, for performance reasons. Querying against a cache is very quick and since it doesn’t use any threads against CRM we can use all threads for doing creates/updates.

We also need to create one cache for each lookup we will make against non-existing Customer Numbers, in this case the 4 mentioned above. These caches need to contain all the Contact ID’s and the fields we will use in the matching, i.e. not just the data from the same source system we are currently using. The reason we need one cache for each lookup is that they need to be indexed on the respective lookup fields.

Note that all columns of type string should be normalized before the caches are loaded.

Step 2: The source data

Here too we need to normalize the data with Derived Column. For example:

Add new columns with lower case strings that we will use in the matching. In our case: first name, last name, address1_ line1, address1_city and e-mail.
Make sure the e-mail field contain correct data, i.e. only have one “@” etc.
Add information about the source system, both as information in CRM and for error logging.
Clean up telephone number and social security number formatting using Regex in a Script Component

Step 3: Handle updates of Contacts/Customer Numbers that already exist in CRM

First, we filter the source data so that we only include rows updated since the last run. This is done by using a Conditional Split component and comparing the modified date for each row. Then we use the Lookup component with the first cache we created to split the source data depending on if we got a match on the Customer Number or not.

The matching rows are then split again depending on if the data in the source is newer or older than the data on the Contact in CRM. If newer, the Contact is updated with the data from the source since we want the Contact to contain the latest information. The data flows are then merged, and the Customer Numbers are updated with the new data.

Step 4: The non-matching rows from step 3

The non-matching rows are now compared with the Lookup component against the 4 caches containing all Contacts, one after another. If a match is found on social security number, name/address, email or mobile phone, the rows are complemented with the matching ContactId. I will spare you a picture of this since it’s rather complicated.

After that they are merged into one data flow, checked if the modified date of the source data is newer than in the CRM just like above; if the data is newer the Contact is updated and then the new customer numbers are created and linked to the Contact.

All rows that was not matched in this step are stored in a database table called Customers. This table will serve as the source data in the next steps.

Step 5: Handle the non-matching rows from step 4 as new Customer Numbers

Now we are ready to use the first Duplicate Detector component on the column Social security number.

All unique rows are piped to a Kingswaysoft Dynamics CRM Destination that creates the contact. The output from the destination component includes the CustomerId of the new contact and this allows us to create the Customer Number and link it to the new Contact.

The duplicate rows are routed to a database table that will be used as source in the next Data Flow Task.

Step 6: Handle the duplicate rows

The duplicate detector includes an output column that groups the rows it has classified as duplicates. From this data we want to do two things:

For each group we want to create one Contact
For each row we want to create one Customer Number and associate it to the Contact we just created.

We start by adding an OLE DB Source and configuring it with a SQL script that fetches all unique customers from the duplicate database table. The table is called customers, the custno is the customer number and the groupcode is from the duplicate detector. We fetch the data row for the highest customer number from each group with the following script:

select c1.*

from customers c1

inner join (

select distinct c2.groupcode, max(c2.custno) as custno

from customers c2

group by c2.groupcode

) as dist1 on c1.custno = dist1.custno

The output from this script is routed to a Kingswaysoft Dynamics CRM Destination that creates the contact records. The output from the Kingswaysoft Dynamics CRM Destination contains the ContactID for the created contact, so now we Merge Join this output with the rest of the duplicate rows in the database by joining on the groupcode. For the output of the Merge Join we choose the data from the rows in the duplicate database table, so that each Customer Number have their unique data, and add the ContactId from the first data flow.

For each of these rows we create a Customer Number and by using the ContactId we link them to the correct Contact.

When we have created all the customer numbers, we need to remove these rows from the Customers database table since we don’t want to process them again. For this I use an OLE DB Command. Unfortunately, this isn’t as straight forward as you would think since you must rely on parameters. I add the following SQL command “delete from customers2 where custno=?” on the Component Properties tab and then map the custno to the variable “Param_0” on the Column Mapping tab. This means that the “?” will be substituted with the custno for each row and the delete command will work as intended.

Now we are ready to repeat the same thing for the next set of matching criteria.

As long as you remove all the rows that has been imported into CRM from the Customers database table, you can add as many matching criteria steps as you like. Once the last criteria have been run, you use the remaining rows to create new unique Contacts and Customer Numbers.

Step 7: How to avoid creating duplicates Contacts within the same integration run

After the first Duplication Detection package, we need to make sure that we don’t create duplicate Contacts in the following steps. Since the check for existing Contacts has only been done against a pre-filled cache, any new Contact created during the integration run must be considered as well. I solved this by adding all new Contacts created to a database table and before any new Contact is created, I perform a Lookup against this table with the matching criteria I’m currently looking for. I.e. in the e-mail step I check if any contact with the same e-mail has been created in any of the previous steps and if so, I only update the Contact instead of creating a new one (depending on the Last Modified Date of course) and then create the Customer Number and associate it with the Contact.

Summary

It might not be every day you need to perform a very extensive duplication handling before loading data from a source system into CRM, but it does happen. By handling the duplicate matching within the integration layer, we are keeping the Contact entity as clean and up-to-date as possible while still retaining all details from each source system in the associated Customer Number entity.

Although Kingswaysoft delivers a wide variety of excellent SSIS tools that will most likely cover everything you need to do, the solutions to complex problems can become a bit difficult to work with. A prime example of this is if you need to add a new source field or change a data type in the beginning of a long flow of operations. However, if you are looking for a no-code/low-code option to migrate data or integrate systems, it really does the job.

Would you like to know more about managing duplicates? Feel free to contact us