BizTalk Disaster Recovery Planning
I agree entirely with Nick Heppleston’s post: you had better test your DR plan before a real disaster if you want it to work. I can speak from experience here. My company’s previous plan sounded perfectly fine to me when I worked on it. In fact, when I discussed the plan with Tim Wieman and Ewan Fairweather, they also agreed that it sounded reasonable (to their credit, they never saw a written plan, and both cautioned that I had better test things to be sure). But, as you can probably guess, the plan didn’t work when tested. In this post I’ll describe what I did (that didn’t work) so that you can avoid doing the same thing, and then point out what I see as a shortcoming in the MSDN documentation.
Here’s what I attempted to do. Since our test/QA environment resides in another data center, far away from our production environment, we decided to use the test environment as our DR environment in case of a lasting failure at the normal production site. In short, we were planning for a scenario where we’d lose both the BizTalk application servers and the SQL Server databases. We thought this was reasonable, since many disasters could easily cause both sets of hardware to stop functioning for an extended period of time.
Here’s a high-level overview of the steps we had in place before the disaster:
- We had configured regular backups with transaction log shipping on the destination (DR) database server.
- We had the IIS web directories and GAC of the BizTalk application servers backed up to the DR BizTalk application servers.
- We had a backup of the master secret available on the DR BizTalk application servers.
Here’s a high-level overview of the steps we planned to take after the disaster:
- Recover the production binaries and production web directories onto the test servers.
- Restore the production database using the DR instance.
- Unconfigure the test BizTalk application servers.
- Run the UpdateDatabase.vbs script and UpdateRegistry.vbs script on the test BizTalk application servers.
- Reconfigure the test BizTalk application servers, joining the SSO and BizTalk Groups.
- Recover Enterprise Single Sign-On.
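One way to keep a drill honest is to write the recovery steps down as an ordered list and record the outcome of each one during a test. The sketch below is a minimal illustration in Python, not part of any BizTalk tooling: the step names paraphrase the list above, and `DrillLog`, `record`, `first_failure`, and `untested` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Step names paraphrase the post-disaster list above.
# Nothing in this sketch talks to BizTalk, SQL Server, or SSO.
RECOVERY_STEPS = [
    "Recover production binaries and web directories onto the test servers",
    "Restore the production database using the DR instance",
    "Unconfigure the test BizTalk application servers",
    "Run UpdateDatabase.vbs and UpdateRegistry.vbs",
    "Reconfigure the test servers, joining the SSO and BizTalk groups",
    "Recover Enterprise Single Sign-On",
]


@dataclass
class DrillLog:
    """Hypothetical helper: records pass/fail per step during a DR drill."""
    results: dict = field(default_factory=dict)

    def record(self, step_number: int, passed: bool, note: str = "") -> None:
        """Store the outcome of a 1-based step number."""
        if not 1 <= step_number <= len(RECOVERY_STEPS):
            raise ValueError(f"unknown step {step_number}")
        self.results[step_number] = (passed, note)

    def first_failure(self) -> Optional[int]:
        """Return the lowest-numbered failed step, or None if none failed."""
        failed = [n for n, (ok, _) in sorted(self.results.items()) if not ok]
        return failed[0] if failed else None

    def untested(self) -> list:
        """Return the steps that were never exercised in the drill."""
        return [n for n in range(1, len(RECOVERY_STEPS) + 1)
                if n not in self.results]
```

A drill that only exercises some steps then shows up as such: `untested()` lists the step numbers nobody tried, which is exactly where surprises tend to hide.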
In early November, we tested the scenario. To make a long story short, it didn’t work. Microsoft, also uncertain at first why things didn’t work, suggested a “hack” to get things working. The product team would then review the hack and, we hoped, give it their final blessing. The hack involved editing the SSOX_GlobalInfo table (part of the SSODB) and updating gi_SecretServer with the name of the “new” test server that would become the master secret server. At first it appeared to work, but a deeper look showed that we were wrong. It only appeared to be working because the real production master secret server had been brought back up by that point (the DR exercise was taking too long), so both BizTalk groups were running at the same time.
The problem that we faced can be summarized as follows: the MSDN documentation makes an assumption that wasn’t true in our case. Specifically, Microsoft expects the new master secret server (the test BizTalk app server in our case) to have the same name as the previous master secret server OR to already be a part of the group prior to the disaster (of course, it wouldn’t have been the master secret server prior to the disaster). These limitations don’t really have anything to do with BizTalk, but rather with the Enterprise Single Sign-On product. I think they are unrealistic expectations, but they are limitations nonetheless.
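The assumption described above is easy to check before a drill. The following is a minimal sketch in Python, not an official check: the function name, its parameters, and the server names in the usage note are all hypothetical, and comparing names case-insensitively is my own assumption about how host names should be matched.

```python
def can_take_over_master_secret(old_master: str,
                                candidate: str,
                                group_members_before_disaster: set) -> bool:
    """Return True if the candidate meets either condition from the post:
    it has the same name as the old master secret server, OR it was
    already a member of the group before the disaster.
    """
    def norm(name: str) -> str:
        # Assumption: host names are compared case-insensitively.
        return name.strip().lower()

    same_name = norm(candidate) == norm(old_master)
    already_member = norm(candidate) in {
        norm(member) for member in group_members_before_disaster
    }
    return same_name or already_member
```

In our scenario the test server had a different name and had never joined the production group, so a call such as `can_take_over_master_secret("PRODBTS01", "TESTBTS01", {"PRODBTS01", "PRODBTS02"})` returns `False`, which matches what the drill showed the hard way.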
In conclusion, you had better test out your DR scenario if you haven’t already, or be prepared for some ugly surprises. Please share your experiences with me. What does your DR plan assume? Have you had a nasty experience?
Hi Victor,
Some good points here that I hadn’t picked up. Thanks for the post!
Nick.
SSO is the weak link in BizTalk… It forces you to think of one node as special.
S
SSO continues to be a pain to work with. After discussions with Microsoft, we set up two different BizTalk groups, one in each datacenter, and were going to log ship from one to the other. As we were implementing this, we started to raise questions about how this could work. We spoke to different Microsoft people, who told us that both sets of servers in both datacenters must be in the same BizTalk group, and all in the same SSO group.
Otherwise the encryption key is different and you have issues. So now we have 4 application servers (two in each datacenter) pointing to 1 SQL cluster in the primary DC. We log ship to the backup SQL cluster in the 2nd DC. To recover from a disaster, we recover the DBs on the backup SQL cluster and point the 2 remaining application servers in the same DC to the backup SQL cluster.
I am trying to test recovery from a disaster, but feel the SSO part is like a Catch-22. I cannot get it to work.
We have a DB cluster where SSO is also a clustered service, just like the documentation recommends.
I want to be able to recover from a DB/SSO cluster disaster, so we have a third DB server for log shipping. On that server we also have a “waiting” SSO service to take over as Master.
The problem is to make that server the SSO Master.
We cannot run “ssomanage -updatedb NewServer.xml” if the old DB is down. We cannot run “ssoconfig -restoresecret secret.bak”, since the server is not the Master.
How did you do step 6, Recover Enterprise Single Sign-On? Is it possible to have our setup?
Regards
Martin Bring
I have sent you an email with some additional information.
We are also not able to restore the SSO in DR using “ssoconfig -restoresecret secret.bak”.
Were you able to configure SSO in DR?
Hi Fehlberg,
Thanks for this great tip. However, we have issues planning for step 6, Recover Enterprise Single Sign-On, as Martin mentioned in an earlier post. Could you please share how this is done?
Thanks,
Andy Dang
Hi Andy,
I’ll forward you the message I sent to Martin.
Thanks,
Victor
Hi Fehlberg,
Our client machine configuration has 2 PROD app servers (load-balancing cluster), 2 PROD SQL servers (failover cluster), 1 DR app server, and 1 DR SQL server.
1. Can we use the standard log shipping steps to configure log shipping in this kind of setup? When we run the bts_ConfigureBizTalkLogShipping stored procedure, what should we give for the nvcMgmtServerName parameter? Is it the SQL cluster name?
2. My ex-colleague has already set up log shipping, with all the host instances created on PROD copied to the DR site. We want to bring DR online, shut down the PROD servers, and test the DR. We stopped the log shipping jobs, restored the databases, changed the SSO master secret server, and changed the SSO server name in all databases and registries. Still, we see the host instances pointing to the PROD server names.
Please advise us on how to bring DR online when PROD goes down in a disaster.
Hope to see your reply.
Thanks
Venu
I’ll send an email to you with more information that will help.