A Problem with BizTalk Server 2006 Performance – 100% CPU Utilization

May 31, 2008 Fehlberg Victor

A few weeks ago, a new application was deployed at work that took about 45-60 seconds to run through one cycle of a particular orchestration set up as a singleton. As time went by, the cycle seemed to take longer: instead of 60 seconds, it was taking about 3 minutes. This bothered me a bit, but I wasn’t sure what to do. Around this time, I noticed that the host instance process, which used to take only a few CPU cycles on the server, had begun using 25% of the total capacity of the 4 CPU cores (the equivalent of 100% of one core). Although I know BizTalk uses more than one thread in its processing, I couldn’t help but be suspicious. It sure seemed like a process was out of control.

Nonetheless, the process was finishing, and I had lots of other pressing matters to worry about, so after trying a couple of simple things to speed things up (which all failed), I left the application alone. Yesterday, I noticed that many of the instances now running were working on messages received more than 24 hours earlier. This raised an alarm, and a meeting was held to figure out what to do. At the end of the day the problem still wasn’t solved, which, interpreted, means that Victor was going to work on Saturday until he figured things out. I searched the internet and found many articles on performance. Most of them I had already read, or their advice had already been taken into account when the BizTalk Server was set up, but a few were new to me. Each “new” issue I found seemed promising. I crossed my fingers as I implemented each fix, but to no avail. Nothing seemed to help.

So I called Microsoft to get help. I talked on the phone with Sajid, one of their good support engineers, and described the issue. He ran through a handful of possible causes (similar to what I had seen in other articles) to see if he could quickly isolate the problem. No luck. I might add that it was comforting that he too suspected an issue with the SQL Server database, but we found nothing. He didn’t have internet access with him at the time (after all, it was a weekend), so he called back later. He looked at the issue and asked how many messages were in the queue for the process. Unfortunately the admin console doesn’t have a count feature, but after I limited the query to show 5000 results and the tool reported that there were still more that weren’t being shown, Sajid became suspicious. We terminated the orchestration that was running (the singleton), since I could recreate the input messages, and voilà! Problem solved.

So, you may be curious as to what the problem was in the first place. That is, why were there so many messages? Well, as you may know, messages that are consumed by an orchestration still show up in the message holder of the orchestration until the orchestration finishes. A singleton that doesn’t terminate will eventually accumulate a very large queue of messages (even if they are marked “consumed”), and apparently the host instance then runs wild simply trying to pick out the messages that it needs for its own processing. A more sophisticated singleton design would cause the orchestration to complete if no new messages had been received after a given time, say 15 minutes.
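
The idle-timeout idea can be sketched outside BizTalk as a toy model. The following Python is only an illustration, not BizTalk code: `SingletonInstance`, `idle_timeout_s`, and `retained` are hypothetical names, and the `retained` list merely stands in for the consumed messages that stay attached to a running instance. In an orchestration, the equivalent would typically be a Listen shape with a Receive branch and a Delay branch inside the loop.

```python
import queue


class SingletonInstance:
    """Toy model of one singleton orchestration instance (not a BizTalk API)."""

    def __init__(self, inbox, idle_timeout_s):
        self.inbox = inbox                    # messages routed to this instance
        self.idle_timeout_s = idle_timeout_s  # hypothetical idle limit in seconds
        self.retained = []                    # consumed messages still held by the instance

    def run(self, handle):
        """Process messages until none arrives within the idle timeout, then complete."""
        while True:
            try:
                msg = self.inbox.get(timeout=self.idle_timeout_s)
            except queue.Empty:
                break  # idle for the full timeout: let the instance complete
            handle(msg)
            self.retained.append(msg)  # consumed, but not released until completion
        released = len(self.retained)
        self.retained.clear()  # completing the instance releases everything it held
        return released


inbox = queue.Queue()
for i in range(3):
    inbox.put(f"msg-{i}")

processed = []
instance = SingletonInstance(inbox, idle_timeout_s=0.05)
released = instance.run(processed.append)
```

With a 15-minute timeout in place of the 0.05 seconds used here, the instance would end during any quiet period and the next message would start a fresh one, so the retained list never grows without bound. The sketch does not deal with messages that arrive just as the instance is completing, which a production design would need to handle.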

I hope that, as a follow-up to this post, I’ll have enough time to show some better singleton designs. I know posts without pictures can be boring. =)

Categories: BizTalk Server

Comments (13) Trackbacks (1)

Richard Seroter

May 31, 2008 at 5:50 pm

Sheesh, good fix. I guess I have a singleton design to update on Monday …
Sandra Sequiera

May 31, 2008 at 7:47 pm

Thanks, Victor, for being so diligent. I guess you pointed out a few opportunities!
Felix

May 31, 2008 at 8:58 pm

Technically speaking, I didn’t understand the solution… “We terminated the orchestration that was running (the singleton)…” – but what about the functionality of the orchestration? I realize it caused the problem, but I am assuming it had a “baby” along with the water. Did you throw it out as well?
Fehlberg Victor

May 31, 2008 at 9:24 pm

Yes, by terminating the orchestration we lost all of the messages that were being processed. I did this, however, knowing that we could find out which messages these were by having the business users run a reconciliation report. They have done so, and those messages have been reprocessed. If this were not possible, you’re right, this might not have been a very good option.
Mike Stephenson

June 1, 2008 at 4:20 pm

Hi

Nice article. Can I add the following comments for anyone reading this, as they may also help anyone new to these kinds of issues.

1. This is one of the many examples of why monitoring software such as MOM with a good BizTalk management pack is so important. If you had MOM in place here, I would expect it to detect this problem situation (probably from the spool size?) and warn you that this was happening.

2. When troubleshooting these kinds of problems, one of the first things to check is the Message Agent and Host/Messagebox performance counters. These can give you a decent picture of what is happening in your BizTalk environment and help you work out what is happening over a period of time. Very useful in these cases.

3. There is lots of info around these days relating to performance and ways of optimising BizTalk, but probably the first place to check is your orchestration/code. Also, in my experience, if you have done some benchmarking on the environment before you start, you can mitigate a lot of the potential problems with your infrastructure setup.
Bembeng Arifin

June 1, 2008 at 11:14 pm

Hi Victor,

I actually had quite a similar singleton issue a month ago.

My orchestration acts as a message distribution controller, passing messages from several receive locations to some orchestrations using direct binding to the MsgBox.

However, as in your case, the messages were consumed but never released by the singleton orchestration. I guess that, by design, the messages are only cleaned up after the instance has completed.

In load testing, we found that with a singleton orchestration in this kind of scenario, where we have high-frequency and large messages, performance kept decreasing (hourly) and affected the other orchestrations as well.

In the end, we didn’t proceed with this singleton orchestration and found another solution for our requirements, one with more consistent processing performance.

I have thought about logic to exit (complete) the singleton orchestration. However, there is a slight chance that one or more messages are already queued for the singleton orchestration when the current instance exits (completes). Those queued messages may not be consumed until a new instance is created. I may be wrong about this, but I believe the queued messages cannot start a new orchestration instance; only new messages can do that.
Fehlberg Victor

June 1, 2008 at 11:52 pm

To Mike’s comments – yes, we use MOM. I also used the Performance Monitoring tool quite a bit when trying to figure this problem out. I looked at spool size, but it didn’t seem outrageous – it was at 50K whereas we normally operate in production at 40K or less. Because there are so many other applications running on the server, it’s hard to detect an issue looking at the spool size unless it’s very large.

I looked at many of the other performance monitors hoping for other insight, but didn’t really find much – the host instance wasn’t throttling and other counters seemed normal.

The environment generally operates pretty well. Unfortunately, most benchmarking and testing doesn’t involve running the application for long periods with the same load that production will encounter. I’ll admit that this could have been done better. Thanks for the comment, Mike.

To address Bembeng’s point, I believe a white paper was written on how to finish an orchestration without risking losing messages that might be queued up – I agree that this is something I’d definitely be sure of before putting a system in production. I’ll find that white paper and add a link. Thanks for the comment.
Bembeng Arifin

June 2, 2008 at 2:26 am

Hi Victor,
Sure, no problem, I’m just glad to share this with you since I was also unable to find much information about this at that time 😉
Jayesh Amarnani

August 3, 2010 at 7:56 pm

I think all the fixes for BizTalk 2006 R2 issues are now available in the new SP1 service pack release. Below are two of the fixes that relate to 100% CPU utilization:

943125 – The XLANG Scheduler Engine enters an infinite loop, and a BizTalk Server 2006 process uses 100 percent of the CPU resources if a time-out exception is handled by a non-transactional scope

950456 – FIX: In BizTalk Server 2006 or in BizTalk Server 2006 R2, the transformation may stop responding, and the CPU utilization may be 100 percent
- John
  
  August 16, 2010 at 7:31 pm
  
  Thanks, Jayesh, good fix! This is helpful for my R2 setup. I also have a non-R2 BizTalk setup where I cannot apply SP1. Any suggestions?
  - Anuj
    
    August 18, 2010 at 11:12 pm
    
    John, you can check fix numbers 943125 and 944234. These are for non-R2. Apply only these hotfixes. I hope you resolve your CPU issues with this.
vamshi

August 12, 2010 at 11:58 pm

Thanks for sharing your experience with us.
Pratik

March 26, 2012 at 9:23 pm

I have an issue installing SP1. It throws a memory error. Has anyone seen this before?