BizTalk Disaster Recovery Planning
I agree entirely with Nick Heppleston’s post: you had better test your DR plan in advance of the real disaster if you really want it to work! I can speak from experience here – my company’s previous plan sounded perfectly fine to me when I worked on it. In fact, when I discussed the plan with Tim Wieman and Ewan Fairweather, they also agreed that it sounded reasonable (to their credit, they didn’t get to see a written plan, and they both cautioned that I had better test things to be sure). But, as you can probably already guess, when tested, the plan didn’t work. In this post I’ll talk about what I did (that didn’t work) so that you can avoid doing the same thing, and then point out what I see as a shortcoming in the MSDN documentation.
Here’s what I attempted to do. Since our test/QA environment resides in another data center, far away from our production environment, we decided to use the test environment as our DR environment in the case of a lasting failure at the normal production site. In a few words, we were planning for a scenario in which we’d lose both the BizTalk application servers and the SQL Server databases. We thought this was reasonable, since many disasters could easily cause both sets of hardware to stop functioning for an extended period of time.
Here’s a high-level overview of the steps we had in place before the disaster:
- We had configured regular backups, with transaction log shipping to the destination (DR) database server.
- We had the IIS web directories and GAC of the BizTalk application servers backed up to the DR BizTalk application servers.
- We had a backup of the master secret available on the DR BizTalk application servers.
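As an aside, the master secret backup in the last step can be taken with the ssoconfig utility that ships with Enterprise Single Sign-On. A sketch (the file path is just an example – the tool will prompt you for a password that you must also keep available at the DR site):

```
ssoconfig -backupSecret C:\SSOBackup\SSOXFER.bak
```

Store both the backup file and its password somewhere the DR servers can reach; a master secret backup you can’t open is as good as no backup at all.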
Here’s a high-level overview of the steps we planned to take after the disaster:
- Recover the production binaries and production web directories onto the test servers.
- Restore the production databases onto the DR database instance.
- Unconfigure the test BizTalk application servers.
- Run the UpdateDatabase.vbs script and UpdateRegistry.vbs script on the test BizTalk application servers.
- Reconfigure the test BizTalk application servers, joining the SSO and BizTalk Groups.
- Recover Enterprise Single Sign-On.
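For the script step above, the documented procedure is to edit the SampleUpdateInfo.xml file that accompanies the scripts (it maps the old production server and database names to the new DR names – I won’t reproduce its schema here), and then run the scripts with cscript. Roughly:

```
cscript UpdateDatabase.vbs SampleUpdateInfo.xml
cscript UpdateRegistry.vbs SampleUpdateInfo.xml
```

UpdateDatabase.vbs only needs to run once against the restored databases; UpdateRegistry.vbs needs to run on each BizTalk application server so its registry points at the new database server.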
In early November, we tested the scenario. To make a long story short, it didn’t work. Microsoft, also uncertain (at first, anyway) why things didn’t work, suggested a “hack” to get things going; the hack would then be reviewed by the product team, which would hopefully provide their final blessing. The hack involved editing the SSOX_GlobalInfo table (part of the SSODB) and updating gi_SecretServer with the “new” test server that would become the master secret server. At first it appeared to work, but a deeper look showed that we were wrong: it only appeared to be working because the real production master secret server had been brought back up by that point – since the DR exercise was taking too long – so both BizTalk groups were running at the same time.
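For the record, the hack was a one-row update in the SSODB, along these lines (the server name is a placeholder, and remember this was never blessed by the product team, so don’t treat it as a supported procedure):

```sql
-- SSOX_GlobalInfo is a single-row table in the SSODB;
-- gi_SecretServer names the machine acting as master secret server.
UPDATE SSOX_GlobalInfo
SET gi_SecretServer = 'DRSSOSERVER';  -- placeholder: your DR master secret server
```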
The problem we faced can be summarized as follows: the MSDN documentation makes an assumption that wasn’t true in our case. Specifically, Microsoft expects the new master secret server (the test BizTalk app server, in our case) either to have the same name as the previous master secret server, OR to already have been a member of the group prior to the disaster (of course, it wouldn’t have been the master secret server prior to the disaster). These limitations don’t really have anything to do with BizTalk, but rather with the Enterprise Single Sign-On product. I think they are unrealistic expectations, but they are limitations nonetheless.
In conclusion, you had better test out your DR scenario if you haven’t already, or be prepared for some ugly surprises. Please share your experiences with me. What does your DR plan assume? Have you had a nasty experience?