A little while ago, disaster struck. What seemed like a normal day at work suddenly turned into a frenzy unlike anything I have experienced before.
What happened? We realized something was wrong when we lost contact with one of our non-virtualized servers.
I couldn’t contact it at all; it had just vanished from the face of our network.
My natural reaction was to run into our server room, to check what had happened. I figured it would be a power supply failure, or NIC failure.
Boy, was I wrong.
It turns out that a plastic pipe going through the wall, providing shielding for the power cables that provide power for the outdoor unit of the cooling system, led water straight into the server room. When I entered the server room, and heard splish-splashy sounds as soon as my feet hit the floor, I immediately grabbed a bucket and held it under the aforementioned pipe. While I stood there, trying to do some damage control, several other people rushed to my assistance.
As soon as there were enough hands on deck trying to get rid of the water, I grabbed the file server and brought it downstairs for some open heart surgery.
It’s a well known fact that water and servers don’t really mix that well. Even less so when the water in question flows down the walls in your server room, right on top of your main file server. That’s right; water, meet server.
Of course, the very last of our non-rack based servers was located in a straight line below the pipe. Everything else was fine; the rack servers aren’t located directly on the floor, nor is anything else. We did have a good 2cm of water on the floor, but that wasn’t enough to reach the rack servers or the UPSes.
So, what was the end result? One pretty dead server. It did try to get our hopes up, and initially it did.
At first, things looked good. I removed the HDDs and the power supplies, opened the cabinet and looked for water damage. The power supplies seemed to have gotten a bit wet, which is probably why the server went MIA in the first place. Other than that, everything looked good. I still had some hope that the data on the HDDs was undamaged. With the HDDs removed, I tried powering on the server. And, yay, it started up, went through the BIOS OK and generally seemed like a happy little server again.
I let it run for a while with no apparent errors or hiccups, so I decided to try and boot it with the disks in it again. At first, the RAID controller complained that its logical drive(s) were missing, but that was expected after I had started it without the drives in it. I tried setting the logical drive to online, but then it complained about missing information. My next move was to copy the RAID/Logical Drive information from the drives to the controller, and that worked perfectly. The server rebooted, and started without problems. I let it run for a while with no problems whatsoever. It seemed we had caught a lucky break and could continue running.
Sadly that was not the case, as it only lasted a good 20 minutes before the server died completely, breaking the RAID as a result. The drives died, the power supply died, and our inventory is now one physical file server smaller.
Next, restore from backup. Like most small companies/IT-depts., we do backups to tape. We even have a pretty decent LTO3 based changer, and we run Tivoli Storage Manager as our backup software. As this was a physical server that was due to be replaced with a VM, we decided to restore its data to a new pre-provisioned VM. That should be a breeze, right?
As anyone who has attempted to restore large amounts of data from a tape library will attest, things can, and will, fail. Tapes can go bad, drives can go nuts and changers can decide that they don’t want to change anymore. We experienced two of the above:
- Bad Tape
One of the tapes we were going to recover data from was broken, and we could not recover data from it. Thankfully TSM lets us have a copypool of tapes, so we did work around it by collecting the replacement tape from that pool.
- Nutty Drive
Drive 2 in the changer decided that after the initial restore job, a small subset of critical data, it wouldn’t play ball anymore. Now, TSM only uses one drive at a time to restore data with, but it does use the other drive in the changer to prepare the next tape with. So, we were reduced to all the action happening on one drive, which of course means that the restore time was significantly increased.
In the end, we were 100% successful in recovering the data from our latest backup set. We restored nearly 1 000 000 files (which also increased the restore time by a huge amount), but the entire restore process took us close to 56 hours in total.
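To put those numbers in perspective, here is a quick back-of-the-envelope sketch of the restore rate. It only uses the two figures quoted above, rounded (1 000 000 files, 56 hours); the function and variable names are made up for illustration, and this is not anything TSM reports.

```python
# Rough restore throughput from the figures quoted in this post.
# Both numbers are rounded: "nearly 1 000 000 files" and "close to 56 hours".
FILES_RESTORED = 1_000_000
RESTORE_HOURS = 56


def restore_rate(files, hours):
    """Return (files per hour, files per second) for a restore job."""
    per_hour = files / hours
    per_second = per_hour / 3600
    return per_hour, per_second


per_hour, per_second = restore_rate(FILES_RESTORED, RESTORE_HOURS)
print(f"{per_hour:,.0f} files/hour, {per_second:.1f} files/second")
# prints: 17,857 files/hour, 5.0 files/second
```

Roughly five files a second, averaged over the whole job. The file count alone does not say anything about data volume, so this is only a feel for how slowly a million-file restore crawls along, not a measure of tape speed.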
Of course, in hindsight this whole mess could pretty easily have been avoided, on several different levels:
- The pipe should not have been able to lead water directly into the server room.
When we do risk assessments, do we identify problems like this? I for one did not see this one coming, and I’ve practically lived in that server room the last few years.
- We should have installed some sort of water detection system in the server room.
This might not have prevented the server crash, but we could potentially have identified that water was present and been able to shut down the server before it fried.
- Why was the server still located on the floor?
The fileserver should have been virtualized a long time ago, and plans were in place to do so. In fact, the VM that should replace it was already provisioned and semi-configured.
The most significant thing we could have done, before disaster struck, was to have a proper disaster recovery site in place. Irony has it that we got the quote on the hardware from HP, and software from Veeam, on Tuesday, two days before “the incident”. We have the DR location in place, and the lease contracts have been signed. We even have 100Mbit direct access to the DR site being installed as we speak. If this had happened a month or two from now, we would have been up and running through the whole ordeal. Of course, it could not have happened at a worse time, but when would something like this be well timed, really?
Now, we were already in the process of getting a DR site in place, so both IT and Management knew about the need for a secondary location. What surprised us though, was the sheer amount of files we had to restore from tape, and how much time it took. 56 hours is an extremely long time, especially when you are looking at restore jobs...
This means that our DR site setup won’t be based on tape backups. We can’t rely on tape as the primary medium for restores; it simply takes too long and is too error prone for us to base our business on. The fact of the matter is that even small businesses now have so many files and so much critical data floating around that tape just isn’t feasible anymore. Don’t get me wrong, I’m glad we had tape backups, as we don’t really have the storage space available to do disk based backups right now.
As soon as the DR site is up and running, tape is dead as far as I’m concerned.
I’ll outline our DR site setup later, when we have it in place, but I’m definitely looking into using Virtual Tape Libraries (VTL) with dedup built-in for the new setup. And of course, snapshot based VM backups using Veeam Backup and Replication to the DR location, you know, for those really critical VMs that we can’t live without.
I for one will have backups everywhere from now on.







