
Archive for April, 2008

Health and Activity Tracking Inaccurate?

April 30, 2008

Just the other day we had a new BizTalk application move into production. Shortly after, though, the running orchestration began failing with an error. I opened up Health and Activity Tracking (HAT) and saw this:

Health and Activity Tracking Image

This error was found in the suspended orchestration and in the Application Log:

System.Data.OracleClient.OracleException: ORA-01017: invalid username/password; logon denied
at System.Data.OracleClient.OracleException.Check(OciErrorHandle errorHandle, Int32 rc)
at System.Data.OracleClient.OracleInternalConnection.OpenOnLocalTransaction(String userName, String password, String serverName, Boolean integratedSecurity, Boolean unicode, Boolean omitOracleConnectionName)
at System.Data.OracleClient.OracleInternalConnection..ctor(OracleConnectionString connectionOptions)
at System.Data.OracleClient.OracleConnectionFactory.CreateConnection(DbConnectionOptions options, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningObject)
at System.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnection owningConnection, DbConnectionPool pool, DbConnectionOptions options)
at System.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject)
etc.

Because HAT showed a send shape as the last thing to start executing, I immediately assumed there was a problem with the Oracle send port associated with that send shape. A teammate, also working the problem, assumed the same. We were both frustrated because this problem hadn’t happened in dev or test – what was different now? We tried many things to resolve the production problem, but our efforts were in vain. The system ended up being rolled back and the old infrastructure put back in place – a true disaster.

A day later, after some of the pain and sorrow subsided, I contacted Microsoft. I was put in contact with a great support person, who looked at the problem and started asking questions. His questions frustrated me a little at first, but now I know why he was asking them. He kept asking about database connections we were opening in C# code, rather than what seemed obvious to me: that this was related to the Oracle Adapter Static One-Way Send Port. He eventually explained why he refused to take the problem at face value: 1) the error message said nothing about the Oracle Adapter, and 2) HAT is not always accurate! He said he has seen many instances where HAT does not show the exact node where the problem is occurring; he has even seen HAT display incorrect debug values!

I couldn’t believe what I was hearing! Once he said this, I started to wonder… well, if the Oracle Adapter send port is not the problem (which would make sense since an earlier call to the same database via a solicit-response Oracle send port had worked), what could it be? I began examining the next node after the send, and found the node SiebelUpdate, which was trying to update an Oracle database via C# code. Immediately things started to click – right before deployment, I had been asked to use a new username/password for the Oracle Siebel connection. I had tested all of the previous credentials to avoid this kind of problem (there were 4 data sources being connected to in about 7 or 8 different ways), but I hadn’t tested the new one that had been given to me. I tried opening up the Siebel database using the credentials I had been given, and guess what, same error.
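In hindsight, a simple pre-deployment smoke test of every connection string would have caught this. Here’s a minimal sketch of what I mean, using the same System.Data.OracleClient provider that threw the error – the connection string and names below are placeholders, not our real ones:

```csharp
// Pre-deployment credential smoke test (connection strings are hypothetical).
using System;
using System.Data.OracleClient;

class CredentialSmokeTest
{
    static void Main()
    {
        string[] connectionStrings =
        {
            // One entry per data source / credential combination in use.
            "Data Source=SIEBELDB;User Id=newSiebelUser;Password=newSiebelPwd;"
        };

        foreach (string cs in connectionStrings)
        {
            try
            {
                using (OracleConnection conn = new OracleConnection(cs))
                {
                    conn.Open(); // ORA-01017 surfaces right here on bad credentials
                    Console.WriteLine("OK: " + cs);
                }
            }
            catch (OracleException ex)
            {
                Console.WriteLine("FAILED: " + cs);
                Console.WriteLine("  " + ex.Message);
            }
        }
    }
}
```

Thirty seconds of running something like this against the new credentials would have saved us the rollback.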

So here’s what I learned:

1. Don’t trust HAT. If we had known this, I’m pretty sure we’d have figured this out prior to the rollback.
2. Look carefully at the error message – careful inspection showed that it wasn’t related to the Oracle Adapter.
3. Microsoft DOES use the ODBC connection for the Oracle Adapter (we had enabled logging but didn’t see anything – now I know why).
4. Don’t mistrust good old Oracle errors. I had, because there seemed to be no other explanation based on what HAT was showing.

Categories: BizTalk Server

BizTalk 2006 Performance

April 2, 2008

At work we have a BizTalk farm consisting of 2 servers, both of which are relatively fast (dual processors, each dual-core). The various BizTalk SQL Server databases sit on an even faster server with 4 processors, each dual-core. The databases are clustered to provide fail-over. Okay, great. So why am I sharing all this…

We started having performance problems at work. The production load was not all that great, and the servers were basically idling. I was quite disappointed when BizTalk, the service I manage, took the blame for ERP performance issues (SAP interfaces with the rest of the company by sending messages from XI to BizTalk, which then fans out messages to all the downstream systems… when XI could not deliver messages to BizTalk fast enough, a backlog built up until all XI communications came to a halt). So what did I do? A few things…

First, someone on the XI team pointed out that if you tried to visit the WSDL of a service sitting on the BizTalk box, it took a very long time to open. Based on this, I looked at how things were set up in IIS (run the ‘inetmgr’ command) and found that the 5 or 6 web services being called from XI were all sharing a single application pool, with the “web garden” setting set to 1 – apparently the default if you don’t change it. I added new application pools, one for each web service, and bumped up the “web garden” setting to 5 (# of processors + 1).
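If you want to script that change instead of clicking through inetmgr, the IIS 6 metabase exposes the web-garden size as the MaxProcesses property on the application pool. A rough sketch using System.DirectoryServices – the pool name here is hypothetical, and you’d adjust the metabase path to your own setup:

```csharp
// Sketch: set the web-garden size on an IIS 6 application pool via ADSI
// (the pool name "XIServicePool" is hypothetical).
using System;
using System.DirectoryServices;

class WebGardenConfig
{
    static void Main()
    {
        using (DirectoryEntry pool =
            new DirectoryEntry("IIS://localhost/W3SVC/AppPools/XIServicePool"))
        {
            // MaxProcesses is the metabase property behind the
            // "web garden" setting in the IIS Manager UI.
            pool.Properties["MaxProcesses"].Value = 5; // # of processors + 1
            pool.CommitChanges();
        }
    }
}
```

Handy if you have to repeat the change across a farm rather than per-box by hand.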

Web Garden Setting in IIS

Once the flurry of directors arguing over what to do next subsided and someone accidentally turned the XI interface on again without permission (whoops!), the problem of getting messages into BizTalk from XI immediately went away. I thought my problems were solved, but soon found out that many of the downstream subscribers weren’t getting messages! I checked, and sure enough, there were lots and lots of messages queued up, though “Active”, for each downstream system. I then looked into throttling.

Like most of us out there, I started to read the manuals when trouble started (and too often not before then, unfortunately). The first “manual” to read, in my opinion, is Professional BizTalk Server 2006. After reading up on throttling and finding the MSDN page on Host Throttling Performance Counters (which describes things well), I found that we hadn’t carried through all of the registry changes we had started (and planned from the beginning) as new hosts were added to the system. Pages 460-461 of Jefford’s book describe those registry settings. I made those changes and then took another look at throttling.

The system was still throttling, in state 6, which is due to database size (basically, after fixing the IIS settings I was now getting messages in faster than I could send them out). I couldn’t believe it – here the server is idling and yet we’re throttling. I added a counter on database size and found that the database was huge, and that this accounted for the throttling.
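Those same performance counters can be read programmatically, which is how I’d keep an eye on this going forward: the “Message delivery throttling state” counter under the “BizTalk:Message Agent” category reports the current throttling condition (6 being database size). A quick sketch – the host instance name “MyHost” is hypothetical:

```csharp
// Sketch: poll the throttling state and database size for a BizTalk host
// instance ("MyHost" is a hypothetical host name).
using System;
using System.Diagnostics;

class ThrottlingMonitor
{
    static void Main()
    {
        using (PerformanceCounter state = new PerformanceCounter(
            "BizTalk:Message Agent", "Message delivery throttling state", "MyHost"))
        using (PerformanceCounter dbSize = new PerformanceCounter(
            "BizTalk:Message Agent", "Database size", "MyHost"))
        {
            // State 6 = throttling due to database size.
            Console.WriteLine("Throttling state: " + state.NextValue());
            Console.WriteLine("Database size:    " + dbSize.NextValue());
        }
    }
}
```

The same counters are visible in Performance Monitor if you’d rather just watch them there.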

Database Size

So I changed this:

Throttling Thresholds

I found that I had to bounce the host instance for the change to take effect, but once I did that it worked like a charm. This particular throttling setting is kind of funny, because you’d think that if your database size is growing too fast you’d want NOT to throttle (so the messages could get out and the size could go down), right? Well, I guess someone thought otherwise (probably someone who thought about this a lot more than me, I might add). Nonetheless, in my situation, this is exactly what I needed – to stop throttling.

Categories: BizTalk Server