Tape is dead, long live tape or how water can ruin your weekend

A little while ago, disaster struck. What seemed like a normal day at work suddenly turned into a frenzy unlike anything I had experienced before.

What happened? We realized something was wrong when we lost contact with one of our non-virtualized servers.
I couldn’t contact it at all; it had just vanished from the face of our network.

My natural reaction was to run into our server room to check what had happened. I figured it would be a power supply failure or a NIC failure.

Boy, was I wrong.

It turns out that a plastic pipe going through the wall, a conduit for the power cables feeding the outdoor unit of the cooling system, led water straight into the server room. When I entered the server room and heard splish-splashy sounds as soon as my feet hit the floor, I immediately grabbed a bucket and held it under the aforementioned pipe. While I stood there trying to do some damage control, several other people rushed to my assistance.

As soon as there were enough hands on deck trying to get rid of the water, I grabbed the file server and brought it downstairs for some open heart surgery.

It’s a well-known fact that water and servers don’t really mix that well. Even less so when the water in question flows down the walls in your server room, right on top of your main file server. That’s right: water, meet server.

Of course, the very last of our non-rack-based servers was located directly below the pipe. Everything else was fine; neither the rack servers nor anything else sits directly on the floor. We did have a good 2 cm of water on the floor, but that wasn’t enough to reach the rack servers or UPSes.

So, what was the end result? One pretty dead server, although it did get our hopes up at first.
I removed the HDDs and the power supplies, opened the cabinet and looked for water damage. The power supplies seemed to have gotten a bit wet, which is probably why the server went MIA in the first place. Other than that, everything looked good, and I still had some hope that the data on the HDDs was undamaged. With the HDDs removed, I tried powering on the server. And, yay, it started up, went through the BIOS OK and generally seemed like a happy little server again.

I let it run for a while with no apparent errors or hiccups, so I decided to try booting it with the disks in again. At first, the RAID controller complained that its logical drives were missing, but that was expected after I had started it without the drives. I tried setting the logical drive to online, but then it complained about missing information. My next move was to copy the RAID/logical drive information from the drives to the controller, and that worked perfectly. The server rebooted and started without problems. I let it run for a while with no problems whatsoever; it seemed we had caught a lucky break and could continue running.

Sadly that was not the case, as it only lasted a good 20 minutes before the server died completely, breaking the RAID as a result. The drives died, the power supply died, and our inventory is now one physical file server smaller.
Next, restore from backup. Like most small companies and IT departments, we do backups to tape. We even have a pretty decent LTO-3 based changer, and we run Tivoli Storage Manager as our backup software. As this was a physical server that was due to be replaced with a VM, we decided to restore its data to a new pre-provisioned VM. That should be a breeze, right?

As anyone who has attempted to restore large amounts of data from a tape library will attest, things can, and will, fail. Tapes can go bad, drives can go nuts and changers can decide that they don’t want to change anymore. We experienced two of the above:

  • Bad Tape
    One of the tapes we were going to recover data from was broken, and we could not recover data from it. Thankfully TSM lets us have a copypool of tapes, so we worked around it by fetching the replacement tape from that pool.
  • Nutty Drive
    After the initial restore job, a small subset of critical data, drive 2 in the changer decided that it wouldn’t play ball anymore. Now, TSM only uses one drive at a time to restore data, but it does use the other drive in the changer to prepare the next tape. So, we were reduced to all the action happening on one drive, which of course means that the restore time increased significantly.

In the end, we were 100% successful in recovering the data from our latest backup set. We restored nearly 1 000 000 files (which also increased the restore time by a huge amount), but the entire restore process took us close to 56 hours in total.
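To put those numbers in perspective, here is a rough back-of-the-envelope calculation of the average restore rate. It simply treats the figures above as round numbers (1,000,000 files, 56 hours) and ignores the time lost to the bad tape and the failed drive; the function name is made up for this sketch.

```python
def restore_rate(files: int, hours: float) -> float:
    """Average number of files restored per second."""
    return files / (hours * 3600)


# Round figures taken from the restore described above.
rate = restore_rate(1_000_000, 56)
print(f"{rate:.2f} files per second")           # about 5 files per second
print(f"{1000 / rate:.0f} s per 1,000 files")   # a bit over 3 minutes
```

Roughly five files per second on average suggests that the sheer number of files, not just the data volume, was what made the restore so slow.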
Of course, in hindsight this whole mess could pretty easily have been avoided, on several different levels:

  • The pipe should not have been able to lead water directly into the server room.
    When we do risk assessments, do we identify problems like this? I for one did not see this one coming, and I’ve practically lived in that server room the last few years.
  • We should have installed some sort of water detection system in the server room.
    This might not have prevented the server crash, but we could potentially have identified that water was present and been able to shut down the server before it fried.
  • Why was the server still located on the floor?
    The fileserver should have been virtualized a long time ago, and plans were in place to do so. In fact, the VM that should replace it was already provisioned and semi-configured.
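As an illustration of the water detection idea, here is a minimal sketch of a polling loop that reacts the first time a leak sensor reports water. Everything in it is an assumption for the sake of the example: `read_sensor` and `on_water` are hypothetical callbacks you would have to wire up to whatever sensor and shutdown mechanism you actually have.

```python
import time
from typing import Callable, Optional


def watch_for_water(read_sensor: Callable[[], bool],
                    on_water: Callable[[], None],
                    poll_seconds: float = 5.0,
                    max_polls: Optional[int] = None) -> bool:
    """Poll a leak sensor and call on_water once when it reports water.

    Returns True if water was detected, False if max_polls ran out first.
    With max_polls=None the loop runs until water is detected.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if read_sensor():
            on_water()  # e.g. send an alert or start a clean shutdown
            return True
        polls += 1
        time.sleep(poll_seconds)
    return False


# Simulated sensor: dry twice, then wet.
readings = iter([False, False, True])
watch_for_water(lambda: next(readings),
                lambda: print("Water detected, shutting down!"),
                poll_seconds=0)
```

A real setup would more likely use an environmental monitoring unit that sends alerts by itself, but the principle is the same: detect early, then shut down cleanly before the hardware gets wet.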

The most significant thing we could have done, before disaster struck, was to have a proper disaster recovery site in place. Irony has it that we got the quote on the hardware from HP, and software from Veeam, on Tuesday, two days before “the incident”. We have the DR location in place, and the lease contracts have been signed. We even have 100Mbit direct access to the DR site being installed as we speak. If this had happened a month or two from now, we would have been up and running through the whole ordeal. Of course, it could not have happened at a worse time, but when would something like this be well timed, really?

Now, we were already in the process of getting a DR site in place, so both IT and Management knew about the need for a secondary location. What surprised us, though, was the sheer number of files we had to restore from tape, and how much time it took. 56 hours is an extremely long time, especially when you are looking at restore jobs...

This means that our DR site setup won’t be based on tape backups. We can’t rely on tape as the primary medium for restores; it simply takes too long and is too error prone for us to base our business on. The fact of the matter is that even small businesses now have so many files and so much critical data floating around that tape just isn’t feasible anymore. Don’t get me wrong, I’m glad we had tape backups, as we don’t really have the storage space available to do disk based backups right now.

As soon as the DR site is up and running, tape is dead as far as I’m concerned.

I’ll outline our DR site setup later, when we have it in place, but I’m definitely looking into using Virtual Tape Libraries (VTL) with dedup built-in for the new setup. And of course, snapshot based VM backups using Veeam Backup and Replication to the DR location, you know, for those really critical VMs that we can’t live without.

I for one will have backups everywhere from now on.

March 18, 2010 at 12:27am | 1 Comment

ESXi No More Must Have Have!

Maish Saidel-Keesing has revisited his previous post "Hot Add and "Need have have"" where he (like I did) pokes some fun at a rather strange error message in ESXi 4.0. Now that Update 1 is out, Maish tries again, this time with better results.

Read the whole post: "Need have have" - revisited.

I'm glad to say we don't need have have any more!

November 25, 2009 at 10:17pm | 0 Comments

Howto: Using ExtPart to Expand Windows Server 2003 VM Boot Volume

Over time the boot partition on a Windows Server 2003 installation might just turn out to be too small. There can be various reasons for this, but the fact remains that over time you will accumulate data on the boot drive that you didn't account for when you set it up initially.

Luckily I run almost all of my servers in a VMware based virtualized environment, where it's easy to expand the virtual disks. The problem is that Windows Server 2003 doesn't let you easily expand the boot volume, at least not without downtime. I've previously talked about using tools like GParted to expand the boot volume, but there are easier ways to do it and avoid downtime at the same time!

All you need is love. No, wait, that's something else entirely! All you need is ExtPart. ExtPart is a lovely little 36KB tool that Dell provides to expand partitions on Dell based servers and storage systems. It is a little-known fact that ExtPart can do the job on any 32-bit Windows 2000 Server or Windows Server 2003 install (no 64-bit support, sadly); in Server 2008 there are other methods of doing this.

Enough talk, let's get down to the business at hand.

  1. Download ExtPart from the Dell download site
  2. Expand the VM's virtual disk, either via the Virtual Infrastructure Client or via vmkfstools
  3. Run ExtPart inside your VM to expand your boot volume to the new size
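For those who prefer the command line over screenshots, the small sketch below builds the two commands involved in steps 2 and 3. Treat it as an assumption-laden illustration: the datastore path and sizes are made up, `vmkfstools -X` is given the new total size of the virtual disk, and ExtPart is given the amount to add in megabytes, as far as I recall its usage text. Double-check both tools' own help output before running anything against a real VM.

```python
def vmkfstools_extend_cmd(vmdk_path: str, new_size_gb: int) -> str:
    """Build the command to run on the ESX host (step 2).

    -X takes the NEW total size of the virtual disk, not the increment.
    """
    return f'vmkfstools -X {new_size_gb}G "{vmdk_path}"'


def extpart_cmd(volume: str, old_size_gb: int, new_size_gb: int) -> str:
    """Build the command to run inside the Windows VM (step 3).

    ExtPart takes the volume and the number of megabytes to ADD.
    """
    extend_mb = (new_size_gb - old_size_gb) * 1024
    if extend_mb <= 0:
        raise ValueError("new size must be larger than old size")
    return f"extpart {volume} {extend_mb}"


# Hypothetical example: grow a 12 GB boot disk to 20 GB.
print(vmkfstools_extend_cmd("/vmfs/volumes/datastore1/fs01/fs01.vmdk", 20))
print(extpart_cmd("c:", 12, 20))
```

The space actually available to ExtPart may be slightly smaller than the computed figure, so you might have to lower the number of megabytes a little.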

That's it. The following screenshots outline the process very well, without having to guide you through each step. Have a look!

It can't get much simpler than this, honestly.

October 28, 2009 at 2:28pm | 1 Comment

Does your ESXi Need Have Have?

RequiresNeedHaveHave.png

Nice little error message shown when trying to hot add a new HDD to a VM running on ESXi 4.0.

How much need have have do you need?

Addendum:

Clearly I'm not the first to notice this rather peculiar wording in ESXi 4. Maish Saidel-Keesing posted the same screenshot back in May 2009 in his post called Hot Add and "Need have have".

Read that post instead of mine; besides the poorly worded error messages, it also highlights what ESXi 4 is missing.

Funny thing is that I can even remember reading Maish's post back when it was published, but I don't remember seeing that weird error message. Oh well. :)

Thanks to Jase McCarty for pointing this out to me.

September 23, 2009 at 12:56pm | 1 Comment

ESX PSOD or Purple Screen of Death.

vmwarewolf.com has posted ESX PSODs

If you happen to search Google for one of the following phrases you might expect Google to return a list of official VMware Knowledgebase articles on the topic.
  • crash debug screen
  • machine crash screen
  • ESX Server PSOD
  • Purple screen crash report
  • Decode purple screen error

I know this is a direct copy of some of that article, but it's an attempt to help out getting ESX Server PSOD ranked in Google. I'm sure I'll be forgiven for the verbatim copy/paste job.

September 23, 2009 at 12:26am | 0 Comments
