SyntaxHighlighter

Friday, May 30, 2014

SQL Server: Huge Log Files and Recovery Pending Status

It's been an interesting morning so far.

We all got an email from some phantom IT department member last night (who knows who it was because it was a system account) letting us know that there was going to be some reconfiguration going on at the data center (which is hosted). It basically said, "There's nothing to see here. Move along."

So we moved along.

Last night around 8:33, I got a notification from my main production SQL Server (2008 R2 Enterprise) that the server had been restarted and moved to a different cluster node. I logged in remotely, checked it out, all looked good. I figured it was a failover or a failover test. (Incidentally, for some strange reason, our IT department sees fit not to allow the database administrator -- me -- any access to the clusters). Then at 10:45PM I got another reboot message -- same thing, but this time it was flipping back to the other node. Interesting, but I was able to get to all the data and nothing seemed amiss in the logs.

This morning, there is basically nobody from IT in the building except for myself and one single system admin. This SA has been working with a contractor on a System Center database and noticed that he couldn't get any data from System Center. He logged into SSMS and noticed that the databases were all in a state that said "Recovery Pending" and he couldn't get any information about the databases whatsoever -- no properties, no file information, nothing. So he restarted the server. Still, "Recovery Pending."

He came to me and asked me to help troubleshoot. Right away I knew that there were only a couple of paths that could cause this error message. Basically the log files for these database was missing or corrupt. However, the system databases were all online.

Step 1. Check for properties on the database. Not available. Curious.
Step 2. Check the logs. Opened up an archived log. Or tried. SSMS started choking. Okay, the log must be too big?
Step 3. Check the physical log on the disk. It was 1.2GB. Yes! You read that correctly! It was 1.2GB! No wonder SSMS was choking.
Step 4. Check the data files and the log files.

CRASH! My PC (formerly my computer) was showing a single drive -- the system drive. There used to be 3 LUNS there mapped to three different drive letters (E:, F:, G:).

Step 5. Open up the disk management tool. The LUNs were still there, they just offline. We flipped the switch and they all came online.
Step 6. Restart SQL Server.
Step 7. Verify in SSMS that the databases were online. They were.

I am not a SAN administrator not am I a dedicated Windows Server Admin, but I have not to this date seen a SAN LUN completely go offline. I'll leave that work to the the Sysadmins ...

What's the lesson here? If your databases are in "Recovery Pending" mode, one of the first things to check is whether the files actually exist -- in fact, if the drives they are on actually exist.

UPDATE: I think the log was so large because the LUN volumes went offline while SQL Server was still running. I haven't had a chance to look at the log file yet, but that's my guess.

No comments:

Post a Comment