Friday, 15 February 2013

Importance of monitoring a database mirroring session

Last week I kicked off a survey about how you monitor your database mirroring sessions. .
Here are the results of the survey:
The "Other" values were:
  • 6 people monitor mirroring session state changes
  • One uses a 3rd-party product to do monitoring
  • One monitors response time for the application
  • One made a sheep joke – very good
  • 8 people don't use mirroring, which isn't relevant here
In my opinion, the only people who are monitoring mirroring correctly are those who picked:
  • All of the above metrics
  • SEND and REDO queue sizes
  • Comprehensive monitoring
The problem with database mirroring is the hype around it providing instant failure detection and instant failover.
These properties are mythical.
In reality, the time it takes to detect a failure depends. The time it takes to failover depends. 
It is absolutely essential that you monitor at least the SEND queue size and the REDO queue size of a mirroring session.
  • The SEND queue size shows how much transaction log has been generated on the principal server but has not yet been sent to the mirror server. If it is non-zero, means the mirroring state is not SYNCHRONIZED, meaning that an automatic failover cannot occur. Furthermore, the SEND queue size is an indication of the data loss that will occur if the principal database were to suffer a disaster. You need to monitor this to ensure the size of the SEND queue does not exceed your maximum allowable data loss SLA (or RPO) for the database being mirrored.
  • The REDO queue size shows how much transaction log exists in the mirror database that has not yet been replayed on the mirror database. (Remember that log records just have to be hardened on the mirror database's log drive, not replayed – that is done as an ongoing process on the mirror server.) If a mirroring failover occurs, the mirror database cannot be accessed until all transaction log records in the REDO queue have been replayed on the mirror database – essentially crash recovery has to occur. The larger the REDO queue, the longer a failover will take. (Remember that in Enterprise Edition, fast recovery comes into play and the database becomes available after the REDO phase of recovery has completed and before the UNDO phase begins.) You need to monitor this to ensure the size of the REDO queue does not exceed your maximum allowable downtime SLA (or RTO) for the database being mirrored.
The oldest unsent transaction is another way to monitor the instantaneous amount of data loss you would suffer if the principal database suffered a disaster. It applies in all modes of database mirroring, because even if you are using synchronous mirroring, the principal and mirror can become disconnected, or you may pause mirroring.
The mirror commit threshold is good to monitor to see what kind of delay is being added to transactions waiting to commit on the principal because their log records have not been acknowledged as written to disk on the mirror.

There are a bunch of performance counters for database mirroring but unfortunately the Books Online entry for them is very sparse. The best place these are documented online that I know if is a blog post of mine from 2008 (which the KB article about configuring database mirroring actually references too). 
For those of you who don't know how to monitor database mirroring, there is a half-decent tool in SSMS called the Database Mirroring Monitor. It allows you to easily configure alert thresholds the four metrics I discuss above.
Here's a picture of it running inside the VPC I use to demo mirroring for the MCM prep videos.
It's very easy to use and you can read more about it in Books Online here.

Bottom line: you must monitor database mirroring to ensure the mirroring session is preserving your downtime and data-loss SLAs.
Happy monitoring!

No comments:

Post a Comment