Once upon a time, on my first day of work, the Exchange cluster died. A two-node Exchange server for some 2000 mailboxes just keeled over and died.
This had happened randomly a few times the week before. When the active node went down, the passive node did not come up, and the mail service went TITSUP. The server engineer had been rebooting it as a temporary fix, but that is only a stopgap measure, not a cure.
Now they had fresh eyes: mine. I dug through the logs and found the culprit. A permission flag was being wiped by Active Directory replication, roughly every 11 hours, and that took the active Exchange node down. Dealing with that sort of fixed the Exchange issue.
Now the passive node. I asked the server engineer why the passive node did not come up. The reply was that it had been unable to come up for some months. So I tasked him with hunting down the root cause. It was a config mismatch in the Microsoft clustered services. All nodes should be running the same number and type of services. Someone had configured a POP3 service on the active node, but not on the passive node.
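A minimal sketch of that kind of mismatch check, assuming you can get each node's service list as plain names. The function name and the SMTP and IMAP4 entries are hypothetical, typed in by hand for illustration; only the POP3 mismatch comes from the story.

```python
from collections import Counter


def service_mismatch(node_a, node_b):
    """Return the services configured on one node but not on the other.

    Each argument is a list of service names. Duplicates count, because
    the nodes should match in both number and type of services.
    """
    a, b = Counter(node_a), Counter(node_b)
    return {
        "only_on_a": sorted((a - b).elements()),
        "only_on_b": sorted((b - a).elements()),
    }


# Hypothetical inventories, not real cluster output.
active = ["SMTP", "IMAP4", "POP3"]
passive = ["SMTP", "IMAP4"]

print(service_mismatch(active, passive))
```

An empty list on both sides means the two nodes match. Anything else is worth chasing before the next failover, not after it.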
With the Exchange cluster cleaned up, I asked them how long they had been on one node. More digging through the cluster.log file. It seems that the previous IT crew ran the active role on Node 1 all the time. Node 2 was never used.
I implemented a policy to run the active role on the odd-numbered node in odd months and on the even-numbered node in even months. We never had a problem with the cluster again.
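The rotation rule is simple enough to write down as a few lines of Python. This is only a sketch of the rule itself, assuming a two-node cluster numbered 1 and 2; the function name is made up, and it does not talk to any cluster.

```python
from datetime import date


def scheduled_active_node(today):
    """Return the node that should be active: Node 1 in odd months, Node 2 in even months."""
    return 1 if today.month % 2 == 1 else 2


# Which node should be carrying the load right now?
print(scheduled_active_node(date.today()))
```

The point of the rule is that both nodes carry production load regularly, so a node that cannot take over is found within a month instead of during an outage.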
I tried doing the same thing for the SQL cluster. Somebody had taken the word “passive” too literally: that node had a slower CPU and less RAM! The RAM part was fixed quite easily. Couldn’t do anything about the slow CPU.
What you don’t use, you lose. Somebody important said that.
This article first appeared on Medium.