BizTalk Siebel Adapter Error Resolution

At work we’ve been seeing the following error with the BizTalk Siebel Adapter (BizTalk Server 2006) quite frequently:

No connection could be made because the target machine actively refused it

The first time it happened, I immediately presumed that the error was accurate, and notified the Siebel system administrator that something seemed to be wrong with the Siebel server. After checking things out he told me I was mistaken, and that everything was fine.

We have a host instance dedicated to the Siebel adapter, so I simply restarted it and found that things returned to normal again… at least for a few more hours. The error would then reoccur, and so forth. This particular error worried me quite a bit because it meant that a crucial business process was halted as a result. In consequence, a great deal of work was required by the business users to manually redo the process that had failed. Ouch.

Over a course of two weeks I tried all sorts of things. I couldn’t find any articles on the internet that described my problem, and I ended up setting up a scheduled task to restart the SiebelHost every 30 minutes. This worked for the most part, but of course was very sloppy fix. While this was happening I put in a ticket to Microsoft for help.

After 2-3 of days I got a call from Microsoft’s support team (Sajid is really helpful by the way), with an idea to try and help. Here’s the substance of the email:

To fix this issue, create a DWord registry value in the registry for the key HKLM\software\Microsoft\BizTalkAdapters\New Reg Value : StartAgentSleep
Type: DWord
Value : 1000 (Decimal) measured in Milli secondsThe value is configurable and can change according to machines and to different users. 1000 is the value which Microsoft had tried with another customer and seemed to work. 1000 Ms of time = 1 sec.

Setting this causes the adapter to wait longer than the default timeout for the browsingagent.exe or runtimeagent.exe process to prepare itself to talk to the adapter.

So I gave it a whiz and guess what?  Problems solved. We’ve now been up for about five consecutive days without seeing the problem reoccur.

Leave a Comment

Health and Activity Tracking Inaccurate?

Just the other day we had a new BizTalk application move into production. However, shortly after, problems began where the orchestration that was running encountered an error. I opened up Health and Activity Tracking (HAT), and saw this:

Health and Activity Tracking Image

This error was found in the suspended orchestration and in the Application Log:

System.Data.OracleClient.OracleException: ORA-01017: invalid username/password; logon deniedat System.Data.OracleClient.OracleException.Check(OciErrorHandle errorHandle, Int32 rc)
at System.Data.OracleClient.OracleInternalConnection.OpenOnLocalTransaction(String userName, String password, String serverName, Boolean integratedSecurity, Boolean unicode, Boolean omitOracleConnectionName)
at System.Data.OracleClient.OracleInternalConnection..ctor(OracleConnectionString connectionOptions)
at System.Data.OracleClient.OracleConnectionFactory.CreateConnection(DbConnectionOptions options, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningObject)
at System.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnection owningConnection, DbConnectionPool pool, DbConnectionOptions options)
at System.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject)
etc.

Because HAT showed a send shape as being the last thing to start execution, I immediately assumed there was a problem with the Oracle Send Port associated with that send shape. A teammate, also trying to solve the problem, assumed the same. We were both frustrated because this problem hadn’t happened in dev or test – what was different now? We tried many things to try and get the production problem solved, but our efforts were in vain. The system ended up being rolled back and the old infrastructure put back in place – a true disaster.

A day later, after some of the pain and sorrow subsided, I contacted Microsoft. I was put in contact with a great support person, who looked at the problem and started asking some questions. I was a little frustrated with his questions at first, but now I know why he was asking them. He kept asking about database connections we were opening in C# code, as opposed to what seemed obvious to me, that this was related to the Oracle Adapter Static One-Way Send Port. He eventually explained his stubbornness for not accepting the problem at face value: 1) the error didn’t indicate anything about the Oracle Adapter and 2) HAT is not always accurate! He mentioned that he has seen many instances where HAT does not actually show the exact node where the problem is occurring; furthermore, he has even seen HAT debug values to be incorrect!

I couldn’t believe what I was hearing! Once he said this, I started to wonder… well, if the Oracle Adapter send port is not the problem (which would make sense since an earlier call to the same database via a solicit-response Oracle send port had worked), what could it be? I began examining the next node after the send, and found the node SiebelUpdate, which was trying to update an Oracle database via C# code. Immediately things started to click – right before deployment, I had been asked to use a new username/password for the Oracle Siebel connection. I had tested all of the previous credentials to avoid this kind of problem (there were 4 data sources being connected to in about 7 or 8 different ways), but I hadn’t tested the new one that had been given to me. I tried opening up the Siebel database using the credentials I had been given, and guess what, same error.

So here’s what I learned:

1. Don’t trust HAT. If we would have known this, I’m pretty sure we’d have figured this out prior to the rollback.
2. Look carefully at the error message – careful inspection showed that it wasn’t related to the Oracle Adapter.
3. Microsoft DOES use the ODBC connection for the Oracle Adapter (we had enabled logging but didn’t see anything – now I know why).
4. Don’t mistrust good old Oracle errors. I had, because there seemed to be no another explanation based on what was shown in HAT.

Comments (3)

BizTalk 2006 Performance

At work we have a BizTalk farm consisting of 2 servers, both of which are relatively fast (they have dual processors with dual-cores). The various BizTalk SQL Server databases sit on an even faster server with 4 processors, each with a dual-core. The databases are clustered to provide fail-over. Okay, great. So why I am sharing all this…

We started having performance problems at work. The production load was not all that great, and the servers were basically idling. I was quite disappointed when BizTalk, the service I manage, took the blame for ERP performance issues (SAP interfaces with the rest of the company by sending out messages from XI to BizTalk, who then fans out messages to all other downstream systems… when XI could not deliver messages to BizTalk fast enough, a backlog started building up until all XI communications came to a halt). So what did I do?  A few things…

First,  someone on the XI team pointed out that if I tried to visit the WSDL of the service sitting on the BizTalk box it took a very long time to open.  Based on this, I then looked at how things were set up in IIS (run ‘inetmgr’ command) and found that the 5 or 6 web services being called from XI were all sharing a single application pool, with the “web garden” setting set to 1.  Apparently, if not set otherwise.  I added new application pools, one for each web service and then also bumped up the “web garden” setting to 5 (# of processors + 1).

Web Garden Setting in IIS

Once the heyday of directors arguing over what to do next subsided and someone accidentally turned the XI interface on again without permission (whoops!), the problem of getting messages into BizTalk from XI immediately went away.  I thought my problems were solved, but then soon found out that many of the downstream subscribers weren’t getting messages!  I checked and sure enough, there were lots and lots of messages queued up, although “Active”, for each downstream system.  I then looked into throttling.

Like most of us out there, I started to read the manuals when trouble started (and too often not before then, unfortunately). The first “manual” to read in my opinion so far as I am concerned is Professional BizTalk Server 2006. After reading on throttling and finding the MSDN site on Host Throttling Performance Counters (which describes things well), I found that we hadn’t made all of the registry changes we had started (and planned in the beginning) as new hosts were added to the system.  Pages 460-461 of Jefford’s book describes those registry settings.  I made those changes and then took a look at throttling again.  The system was still throttling, in State 6, which is due to database size (basically after fixing the IIS settings I was now getting messages in faster than I could send them out).  I couldn’t believe it – here the server is idling and yet we’re throttling.  I added a counter on database size and found that the database was huge and that this accounted for the throttling.

Database Size

So I changed this:

Throttling Thresholds

I found that I had to bounce the host instance for the change to take effect, but once I did that it worked like a charm.  This particular throttling setting is kind of funny because you’d think that if you’re database size is growing too fast you’d want NOT to throttle (so the messages could get out and have the size go down), right?  Well, I guess someone thought otherwise (probably someone who thought about this a lot more than me, I might add).  Nonetheless, in my situation, this is exactly what I needed – to stop throttling.

Comments (1)

Siebel Adapter for BizTalk Server 2006

After working on a project that uses the Siebel Adapter heavily, I learned a lot. I found so little documentation on the adapter and best practices, that I’m hoping the post will help someone out there with a similar problem. I suppose some of the lessons learned might very well apply to the Oracle Adapter or another adapter.

So, what was my problem? Inconsistency. For starters, here’s a UML activity diagram to give you an idea of what my orchestration looked like:

Orchestration with Many Siebel Adapter Calls

And this goes on and on for something like 8 Siebel calls (I didn’t bother to draw this).

I had the basic functionality working. However, like I mentioned, I’d get inconsistent results. Sometimes I’d get decent performance; sometimes I’d get horrible performance; sometimes I’d get cryptic Siebel errors. This post will hopefully explain what I ended up finding after days of frustration.

Let’s start with the cryptic Siebel errors. Here’s what I saw a couple of times:

E-SBL0030: Call to failed; Siebel error msg- Business object handle could not be got from the name.The task 41080 has either been shutdown or timed-outFailed to get business object from business object name : Customer Complaint Case BO; Siebel error code- 106

This error was always accompanied by a similar error:

E-SBL0030: Call to LogOff failed; Siebel error msg- The task 41080 has either been shutdown or timed-out; Siebel error code- 106

Fortunately, our Siebel guru and fellow BizTalk Developer, Prashant Kulkarni, recognized the error enough to provide a simple solution: restart the Siebel Applications Server. I guess as a custom they typically bounce the Enterprise Server and Gateway Server as part of that.

Okay, the next problem I had was inconsistency:

After fixing the infinite loop that occurred in a fringe case (oops!), the main issue with consistency I faced were timeouts from what I believed to be Siebel. I assumed the Siebel server was under heavy load when I got these errors at first. It wasn’t until I started monitoring the throttling state constantly, that I found the real problem: BizTalk had started throttling my orchestration and the Siebel calls, so consequently messages would take too long before being sent out to/from Siebel. The result: timeouts. Finding this problem was a little tricky at first. Here’s why:

When inconsistency and poor performance were first detected, I moved the host instance to two servers, thinking that doubled resources would solve the problem. Well, this assumption was naive. The equation was a little more complicated than that.

Also, the BizTalk Admin Console by default doesn’t show you the processing server, so unless you’re careful, you might monitor the wrong machine and find nothing. Fortunately, in my case, I was used to this problem and I always add the “processing server” as soon as I start looking at performance issues. Here’s how if you’ve never done this before. From the Admin Console right click on the shown columns displayed from any query. You’ll then see:

Add/Remove Columns

After adding the processing server you should see something like:

Processing Server Shown

Before I get started into the meat of this let me first say that if you have no idea on how to check throttling you might want to run a Google search on “BizTalk Performance Monitoring” or something like that. I used the graphical Windows tool called “Performance” in the Administrative Tools section for my work on this…

I first started monitoring the “Message Delivery Throttling State” on the first server and found nothing (throttle state of zero). Then, I took a look at server #2: nothing. At this point I probably sat frustrated for a few minutes and complained to myself. After the complaining was over, I looked for more throttling monitors. I found “Message Publishing Throttling State”. I didn’t think this was my problem, but I added the monitor to server #1: nothing. I then added it to server #2, and lo and behold: throttling in state 4 (throttling due to process memory usage)!

Once I found this, things immediately started looking brighter. I was able to adjust the throttling settings so that it wouldn’t throttle until I began using 1000 MB of memory for the host instance (I tried some other values first but found this to be about right). This immediately helped. After that, I found that monitoring memory usage and the throttling state was very helpful while I checked on performance. Professional BizTalk Server 2006 by Jefford, Smith and Fairweather has a good section on performance, which I referenced quite a bit too.

The irony of the whole thing is that the Siebel Adapter starts using more memory for its processing, and BizTalk starts throttling, which causes BizTalk (or me, rather) more problems than it started with. In conclusion I’d say that I think I found the Siebel Adapter’s worse enemy for my kind of situation: throttling. Why? Because of timeouts. Querying Siebel Business Objects can take a while – the last thing you want is to start timing out as a result. Performance went from 30+ mins for processing of my orchestration down to 60 seconds! Amazing. And, my orchestration started getting consistent results.

Comments (2)

Ordered Delivery Trick in BizTalk Server 2006

The other day at work I needed to change the URL and proxy class of a send port that was using ordered delivery, but that was already suspended with lots of messages queued up in the instance.  So, I did what most of us would have tried, I changed the URL and proxy class, and then I resumed the send port.  Did this work?  No.  The send port kept trying to send the messages to the old URL, using the old proxy class.

So I started trying to see if there was anyway to resolve this without losing messages.  I did what most administrators would have done next: I bounced the host instance.  Did that work?  No.

Then I got creative.  I decided to change the host instance of the send port to another available host instance.  I made the change and pressed ok, but got an error indicating that I couldn’t change the host instance since ordered delivery was on.  So I put things back the way they were and tried once more to resume the instance.  At this point, you’re probably thinking this didn’t make any difference (as I did).  But guess what?  It made all the difference.   Suddenly messages began flowing with the new URL and proxy class!  I couldn’t believe it.  I guess the attempted change of the host instance triggers something in BizTalk, causing it to read in the URL and proxy class again…

I presume this problem exists if you are only trying to change the URL as well (and not the proxy), but I didn’t take the time to verify this.  Hope this information helps someone out there!

Comments (2)

Refactoring the Namespace of Files in a BizTalk Project

Most of us have probably started a BizTalk project with limited knowledge of what the end solution will look like. This often means that we start out with silly names for things, including our project and solution. Then, after BizTalk has autogenerated namespaces (the C# namespace for the file – not the xsd namespace) for many items (such as xsd’s), we decide to improve our naming, only to find out that we’re kind of stuck, unless we want to start all over again. Just to be really clear, I’m referring to the namespace shown in the properties window of a file before it’s opened. I know this can be confusing since there is another namespace, e.g. the target namespace, which is shown once an xsd is opened.

Well, something similar happened to me the other day at work – I was handed some code developed by another person who is now on vacation. I realized that the project was created in a very agile fashion, which is usually a great thing, but the downside of this iterative development was that I now had namespaces for the XSDs such as “TestSiebelAdapter” and “http://SiebelPOC”. Well after a colleague, Richard Seroter, and I did some looking in the Visual Studio project files, Richard found the the .btproj file is an xml file that contains a lot of this information. So, if you’re careful, you can search and replace within this file to fix some of your horrid naming. This will cause orchestrations to break – but you can search and replace within those as well. Of course any time you’re going to do something like this I highly recommend making a backup of your solution first.

Here’s a sample of the xml you’d be changing in the .btproj file:

<File
RelPath = “XYZ_To_ABC.btm”
BuildAction = “Compile”
Namespace = “TestSiebel
Type = “Transform_1″
/>

Leave a Comment

Suspending a BizTalk Orchestration with Delay Shapes

Okay, here’s a question for you: what happens to an orchestration that is processing a delay shape when suspended? Well, let’s find out.

First, I built a simple schema to kick off my orchestration (nothing surprising here). Then, I built a simple orchestration:

Timer Orchestration Example

Write_Time simply prints out the time to the debugger, e.g.

System.Diagnostics.Debug.WriteLine("Time: " + System.DateTime.Now.ToLongTimeString());

The Delay shape causes a delay of 1 minute. I built my project and deployed it, and then opened up the Debugger. Next, I dropped in a message to kick off my orchestration. I got my first debug statement:

[3404] Time: 10:30:34 AM

I then suspended the orchestration at 10:30:46. I waited until 10:31:35, and then resumed the message. So, here’s the question, will it A) print the current time as soon as resumed, B) wait about 48 seconds to print, or C) wait another minute to print? Let’s see our next debug statement:

[3404] Time: 10:31:35 AM

Interesting. While suspended, it appears that the Delay shape keeps track of time. Let’s do some more tests. This time I left the orchestration running until 10:32:30 and then I suspended it again. I resumed it at 10:33:25 this time. My next message in the debugger was:

[3404] Time: 10:33:25 AM

Hmmm… will it print out every time I resume the orchestration or did this happen because it had been greater than a minute? Let’s do some more tests. I suspended the orchestration at 10:34:00. I then resumed it at 10:34:05. My next print out was:

[3404] Time: 10:34:26 AM

Interesting! No, it isn’t just printing out when I resume (it’s not just going back to the first shape after the Delay). Let’s do another (after I do some other stuff). New message kicks off an orchestration and we get our first debug:

[3404] Time: 11:31:39 AM

Then, I suspend the message at 11:31:50. I then wait a couple of minutes, until 11:34:00, and resume the orchestration. Here’s what we get for the next few printouts:

[3404] Time: 11:34:00 AM
[3404] Time: 11:35:00 AM
[3404] Time: 11:36:00 AM

So in conclusion, if the orchestration is resumed before the Delay time specified, it will continue to wait.  If more time has passed, the orchestration picks up right after the Delay shape, and continues.  In my case this means that I will now be on a new 1 minute interval, that ending right on the minute (00 seconds).

You might be asking, so what? Well, here’s why this is important. Let’s say you are using a Delay shape in your orchestration to handle retries… this means that if you suspend your orchestration for any reason, your retry time is running out! This is definitely something to account for when you’re using the Delay shape in building an orchestration.  The good news is that if your Delay shape runs out of time, the orchestration will pick up right after the Delay shape.  Good luck developing!

Leave a Comment

BizTalk Server 2006 SQL Adapter Lock Resolution

We just fixed a rather hairy problem here at work that I’m sure will benefit someone out there. Here’s a brief description:

Using BizTalk Server 2006 (R1) we had a problem with a particular application that used the SQL Server Adapter for polling. It would run a very simple stored procedure every 30 seconds. The stored procedure basically queried a table and then truncated it. The problem we were facing was that the SQL Server DBA was finding the database table exclusively locked after some amount of time (usually within a minute or so). We tried a bunch of stuff to no avail and then decided to call Microsoft.

The Microsoft gentleman tried a few things here and there, but the problem persisted until he called back with a new idea. We had 2 Receive Handlers associated with the SQL Adapter.  However, only one was in use and so I was asked to delete the second Receive Handler. I reluctantly agreed after realizing that no one else was using that Receive Handler. Well, guess what – that fixed the issue. I wish I could have shared some obscure registry key that we flipped or something, but no. I guess the moral of the story is to only use a single Receive Handler for your SQL Adapter.

Comments (4)

Send Port Groups

This probably won’t be anything new for most experienced BizTalk developers, but I learned something so I figured I’d write it down to make sure I don’t forget.

At work we have several systems that subscribe on message type. Before moving to our Test and Prod environments I figured it’d be nice to simplify maintenance by using a Send Port Group. So I read the section on Send Port Groups in my book written by Darren Jefford, Kevin B. Smith, and Ewan Fairweather. It didn’t quite address the question I had in the way I was expecting an answer so I figured I’d just try things out.

What I had in mind worked as I suspected for my simple cases. For example, when I had

  • SendPort1 – subscribing on MessageTypeA
  • SendPort2 – subscribing on MessageTypeA
  • etc…

I created a Send Port Group and added the filter of MessageTypeA. Then I removed the filters from SendPort1 & SendPort2 and sure enough things worked fine. So then I moved to my more complex case (simplified here).

  • SendPort5 – subscribing on MessageTypeA
  • SendPort6 – subscribing on MessageTypeA and CustomFilter1
  • etc…

I figured I could have a new Send Port Group filtering on MessageTypeA and add both SendPort5 and SendPort6, and that I could then add an additional filter to SendPort6 (that of CustomFilter1). Want to guess what happened?

I got two messages in my SendPort6 destination for an incoming message of MessageTypeA meeting CustomFilter1’s criteria. This is when I realized that I hadn’t understood the book well. As you probably know, the subscription for the Send Port Group is met, and then a separate subscription is met by the filter on SendPort6, causing two messages in my output. I hadn’t expected this.

Leave a Comment

What is DataFlux?

It’s been a while since my last post, so I thought I’d share something on DataFlux.

DataFlux Logo

So what is DataFlux? Yes, a leader in data quality, it’s both a company and a product; better stated, DataFlux (the company) provides a suite of tools (often simply called DataFlux) that provide data management capabilities, with a focus in data quality.

DataFlux’s tools can do a lot of really neat things; I’d say it’s a must-have for Sales & Marketing, and it’d benefit most enterprises out there in other ways. To see what all of this pomp is about, let’s use an example. Think of these entries in your company’s xyz system:

Name Address City,State,Zip Phone
Mr. Victor Johnson 1600 Pennsylvania Avenue NW Washington, DC 20500 202-456-1414
Victor Jonson, JD 1600 Pennsylvania Avenue Washington, DC 456-1414
VICTOR JOHNSON 255 DONNA WAY SAN LUIS OBISPO, CA 93405 (805) 555-1212
Bill Shares 1050 Monterey St SLO, CA 93408 8055444800
Doctor William Shares 1052 Monterrey Ave San Luis Obispo, California n/a
william shares, sr 1001 cass street omaha, nebraska, 68102  

In this example, a human could probably pretty easily figure out that the first two Victors are probably one and the same and that Bill in SLO and William in San Luis Obispo are also the same person. The other records might be a match, but most of us would agree that we can’t be sure based on name alone. Furthermore, it is obvious that some data inconsistencies exist such as name prefixes and suffixes, inconsistent casing, incomplete address data, etc.; DataFlux can’t (and shouldn’t try) to fix all of these quirks, but it should at least be able to reconcile the differences, and, if we choose, we should be able to do some data cleanup automatically. So let’s get started. I’ll open up dfPower Studio.

dfPowerStudio Main Window

This interface is new in version 8 and helps provide quick access to the functions one would use most often. This change is actually helpful (as opposed to some GUI changes made by companies) by combining a lot of the settings into a central place.

In my case, I’ll start Architect by clicking on the icon in the top left, where most design takes place. On this note I guess I should say that Architect is the single most useful product in the suite(in my opinion anyway), and it’s where I’ll spend most of my time in this posting.

DatFlux Architect Initial Screen

On the left panel you’ll see a few categories. Let me explain what you’ll find each one (skip over this next section if you want):

Data Inputs – Here you’ll find nodes allowing you to read from ODBC sources, text files, SAS data sets (DataFlux is a SAS company), and more. I’ll cover one other data input later…

Data Outputs – Similar to inputs, you’ll find various ways of storing the output of the job.

Utilities – Utilities contain what many would refer to as “transformations”, which might be helpful to know if you’ve worked with Informatica or another ETL (Extract, Transfer, Load) tool.

Profiling – Most nodes here help provide a synopsis of the data being processed. Another DataFlux tool is dedicated to profiling – in some ways these nodes are a subset of the other’s functionality, but there’s one primary difference. Here the output of profiling can be linked to other actions.

Quality – Here’s where some of DataFlux’s real magic takes place, so I’ll go through the task of describing each node briefly: Gender Analysis (determine gender based on a name field), identification analysis (e.g. is this a person’s name or an organization name?), parsing (we’ll see this), standardization (we’ll see one application of this), Change Case (although generally not too complicated, this gets tricky with certain alphabets), Right Fielding (move data from the “wrong” field to the “right”), Create Scheme (new in Version 8 – more of an advanced topic), and Dynamic Scheme Application (new in Version 8 – another advanced topic)

Integration – Another area where magic takes place. We’ll see this in this post.

Enrichment – As the name suggests, these nodes help enrich data, i.e. they provide data that’s missing in the input. This section includes: address verification (we’ll see this), geocoding (obtaining demographic and other information based on an address) and some phone number functions (we’ll see one example).

Enrichment (Distributed) – Provides the same functionality as I just described, but distributed across servers for performance/reliability gains.

Monitoring – Allows for action to take place on a data trigger, e.g. email John if sales fall under $10K.

Now that we’ve gone through a quick overview of Architect’s features, let’s use them. I’ll first drag my data source on to the page and double click on it to configure its properties. For my purposes today I’ll read from a delimited text file I created with the data I described at the beginning of the article. I can use the “Suggest” button to populate the field names based on the header of the text file.

Text Input Properties

What’s nice here is I can have auto-preview on (which by the way drives me crazy), or I can turn it off and press F5 for a refresh, which shows the data only when asked. Either way, the data will appear in my preview window (instant gratification is one of the great things about Architect).

Preview of Input Data

Next, I’ll start out my data quality today by verifying these addresses. I do this by dragging on the Address Verification (US/Canada) node. After attaching the node to Text File Input 1 and double-clicking on the node, in the input section I map my fields to the ones expected by DataFlux and in another window I specify what outputs I’m interested in. I’ve selected a few fields here but there are many other options available.

Address Verification Properties

Preview
You’ll notice here I’ve passed through only the enriched address fields in the output. I could have also kept the originals side by side, plus I could have added many more fields to the output, but these will suffice for now (It’d be tough to fit on the screen here). Already you can see what a difference we’ve made. I want to point out just two things here:

1. There is one “NOMATCH”. This is likely to have happened because too many fields are wrong and the USPS data verification system is designed not to guess too much…

2. 1052 Monterey St is an address I made up and consequently the Zip-4 could not be determined. The real address for the courthouse in San Luis Obispo is 1050 Monterey St. If I would have used that, the correct Zip-4 would have been calculated. So why did we get a US_Result_Code of “OK”? This is because the USPS system recognizes 1052 as an address within a correct range.

Nonetheless, pretty neat, eh? I’d also like to point out that the county name was determined because I added this output when I configured the properties. At our company we’ve configured DataFlux to comply with USPS Publication 28, which among other things, indicates that addresses should always be uppercased. For this reason you see this here. Having said this, you have the option to propercase the result set if you’d like.

Moving on, let’s clean up the names. It’d be nice if we could split the names into a first & last name. First, I reconfigured the USPS properties to allow additional outputs (the original name & phone number). Next, I dragged the Parsing node onto the screen and configured its properties to identify what language & country the text was based on (DataFlux supports several locales and in version 8 supports Unicode). After that, I can preview as before. Note how well DataFlux picked out the first, middle and last names, not to mention the prefixes and suffixes.

DataFlux Parsing

For simplicity, I’ll remove the Parse step I just added and use a Standardize node instead. Here in the properties I’ll select a “Definition” for the name and phone inputs. There are many options to choose from including things like: Address, Business Title, City, Country, Name, Date, Organization, Phone, Postal Code, Zip, and several others. Let’s see what this does…

DataFlux Standardization

You might be wondering how DataFlux does this. After all, if the input name were “Johnson, Victor” would it have correctly standardized the name to “Victor Johnson”? The answer here is yes. DataFlux utilizes several algorithms and known last names, first names, etc. to analyze the structure and provide a best “guess.” Of course this means that with very unusual names the parsing algorithm could make a mistake; nonetheless I think that most users would be surprised how good this “guessing” can be, especially with the help of a comma. By that I mean that the placement of a comma in a name greatly enhances the parser ability to determine the location of the last name. If you’re interested in learning more about this let me know and perhaps I’ll write another blog to go into the details. All in all, it’s pretty neat stuff and of course the good part is that it is customizable. This helps if someday you want to write a standardization rule for your company’s specific purpose.

Let’s move on. I’m next going to make “Match Codes.” Match codes allow duplicate identification (and resolution). For example, often times (perhaps most of the time), nothing can be done about data in a system once it is entered. For example if a name is Rob, we can’t assume the real name is Robert yet we may have a burning desire to do something like that to figure out that 1 record is a potential duplicate of another… this is where match codes come in. Here’s the section of the Match Codes Properties window where we assign the incoming fields to the Definition. This step is important because intelligent parsing, name lookups, etc. occur based on the data type.

DataFlux Match Codes Properties

Let’s preview a match code to see what this does.

Match Codes

I couldn’t get the whole output to fit on the screen here, but I think the match codes seen in the name and the address will get my point across. Here you can see that match codes ignore minor spelling differences, take into account abbreviations, nicknames, etc. Why is this so significant? We now have an easy way to find duplicates! Match codes could be stored in a database and allow quick checks for duplicates! Let’s move on to see more… I’m now going to use Clustering next to see how duplication identification can be done. First, I’ll set the clustering rules in the Properties window (note that I use the match code instead of the actual value for the rule):

Cluster Conditions

And let’s preview…

Cluster Preview

Note that the cluster numbers are the same for records that match, based on the clustering conditions I set a moment ago. Pay special attention to the fact that our Bill & William Shares didn’t match. Why? Well because of the clustering conditions I set. We could modify our Quality Knowledge Base (QKB) to indicate that SLO = San Luis Obispo or I could remove the City as a clustering condition, together with lowering the sensitivity on the address match code (sensitivities range from 50-95) and the two would match. Let’s do this to be sure:

Cluster Preview Final

There are a lot of really neat things that DataFlux can do. I’ll try to post a thing or two out here now and again if I see anyone interested…

Comments (10)

« Newer Posts · Older Posts »