Tape is dead, long live tape or how water can ruin your weekend

A little while ago, disaster struck. What seemed like a normal day at work suddenly turned into a frenzy unlike anything I had experienced before.

What happened? We realized something was wrong when we lost contact with one of our non-virtualized servers.
I couldn’t contact it at all; it had just vanished from the face of our network.

My natural reaction was to run into our server room to check what had happened. I figured it would be a power supply failure or a NIC failure.

Boy, was I wrong.

It turns out that a plastic pipe going through the wall, which shields the power cables for the outdoor unit of the cooling system, led water straight into the server room. When I entered the server room and heard splish-splashy sounds as soon as my feet hit the floor, I immediately grabbed a bucket and held it under the pipe. While I stood there trying to do some damage control, several other people rushed to my assistance.

As soon as there were enough hands on deck trying to get rid of the water, I grabbed the file server and brought it downstairs for some open heart surgery.

It’s a well-known fact that water and servers don’t really mix that well. Even less so when the water in question flows down the walls in your server room, right on top of your main file server. That’s right: water, meet server.

Of course, the very last of our non-rack based servers was located in a straight line below the pipe. Everything else was fine; the rack servers aren’t located directly on the floor, nor is anything else. We did have a good 2 cm of water on the floor, but that wasn’t enough to reach the rack servers or the UPSs.

So, what was the end result? One pretty dead server. It did try to get our hopes up, though, and initially it succeeded.
At first, things looked good. I removed the HDDs and the power supplies, opened the cabinet and looked for water damage. The power supplies seemed to have gotten a bit wet, which is probably why the server went MIA in the first place. Other than that, everything looked good. I still had some hope that the data on the HDDs was undamaged. Since I had removed the HDDs, I tried powering on the server. And, yay, it started up, went through the BIOS OK and generally seemed like a happy little server again.

I let it run for a while with no apparent errors or hiccups, so I decided to try booting it with the disks in again. At first, the RAID controller complained that its logical drive(s) were missing, but that was expected after I had started it without the drives. I tried setting the logical drive to online, but then it complained about missing information. My next move was to copy the RAID/logical drive information from the drives to the controller, and that worked perfectly. The server rebooted and started without problems. I let it run for a while with no problems whatsoever; it seemed we had caught a lucky break and could continue running.

Sadly, that was not the case: it lasted only a good 20 minutes before the server died completely, breaking the RAID as a result. The drives died, the power supply died, and our inventory is now one physical file server smaller.
Next, restore from backup. Like most small companies and IT departments, we do backups to tape. We even have a pretty decent LTO-3 based changer, and we run Tivoli Storage Manager (TSM) as our backup software. As this was a physical server that was due to be replaced with a VM, we decided to restore its data to a new pre-provisioned VM. That should be a breeze, right?

As anyone who has attempted to restore large amounts of data from a tape library will attest, things can, and will, fail. Tapes can go bad, drives can go nuts and changers can decide that they don’t want to change anymore. We experienced two of the above:

  • Bad Tape
    One of the tapes we were going to recover data from was broken, and we could not recover data from it. Thankfully TSM lets us have a copypool of tapes, so we worked around it by collecting the replacement tape from that pool.
  • Nutty Drive
    Drive 2 in the changer decided that after the initial restore job (a small subset of critical data) it wouldn’t play ball anymore. Now, TSM only uses one drive at a time to restore data, but it does use the other drive in the changer to prepare the next tape. So, we were reduced to all the action happening on one drive, which of course significantly increased the restore time.

In the end, we were 100% successful in recovering the data from our latest backup set. We restored nearly 1 000 000 files (which also increased the restore time by a huge amount), and the entire restore process took us close to 56 hours in total.
Of course, in hindsight this whole mess could pretty easily have been avoided, on several different levels:

  • The pipe should not have been able to lead water directly into the server room.
    When we do risk assessments, do we identify problems like this? I for one did not see this one coming, and I’ve practically lived in that server room the last few years.
  • We should have installed some sort of water detection system in the server room.
    This might not have prevented the server crash, but we could potentially have identified that water was present and been able to shut down the server before it fried.
  • Why was the server still located on the floor?
    The file server should have been virtualized a long time ago, and plans were in place to do so. In fact, the VM that should replace it was already provisioned and semi-configured.

The most significant thing we could have done, before disaster struck, was to have a proper disaster recovery site in place. Irony has it that we got the quote on the hardware from HP, and software from Veeam, on Tuesday, two days before “the incident”. We have the DR location in place, and the lease contracts have been signed. We even have 100Mbit direct access to the DR site being installed as we speak. If this had happened a month or two from now, we would have been up and running through the whole ordeal. Of course, it could not have happened at a worse time, but when would something like this be well timed, really?

Now, we were already in the process of getting a DR site in place, so both IT and Management knew about the need for a secondary location. What surprised us, though, was the sheer number of files we had to restore from tape, and how much time it took. 56 hours is an extremely long time, especially when you are looking at restore jobs...
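
To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the two rounded figures above (nearly 1 000 000 files, close to 56 hours). The function names are mine, and the projection assumes the average rate would stay the same for a larger restore, which is a simplification.

```python
# Rounded figures from our restore: about 1,000,000 files in about 56 hours.
FILES_RESTORED = 1_000_000
RESTORE_HOURS = 56


def files_per_second(files: int, hours: float) -> float:
    """Average number of files restored per second of wall-clock time."""
    return files / (hours * 3600)


def hours_for(files: int, rate_per_second: float) -> float:
    """Projected restore time in hours, assuming the same average rate."""
    return files / rate_per_second / 3600


rate = files_per_second(FILES_RESTORED, RESTORE_HOURS)
print(f"Average rate: {rate:.2f} files/s")              # Average rate: 4.96 files/s
print(f"Time per file: {1000 / rate:.1f} ms")           # Time per file: 201.6 ms

# Hypothetical: twice as many files at the same average rate.
print(f"2,000,000 files: {hours_for(2_000_000, rate):.0f} hours")  # 2,000,000 files: 112 hours
```

Roughly five files per second on average is what a single tape drive gave us here, and the total scales linearly with the file count if nothing else changes.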

This means that our DR site setup won’t be based on tape backups. We can’t rely on tape as the primary medium for restores; it simply takes too long and is too error-prone for us to base our business on. The fact of the matter is that even small businesses now have so many files and so much critical data floating around that tape just isn’t feasible anymore. Don’t get me wrong, I’m glad we had tape backups, as we don’t really have the storage space available to do disk-based backups right now.

As soon as the DR site is up and running, tape is dead as far as I’m concerned.

I’ll outline our DR site setup later, when we have it in place, but I’m definitely looking into using Virtual Tape Libraries (VTL) with dedup built-in for the new setup. And of course, snapshot based VM backups using Veeam Backup and Replication to the DR location, you know, for those really critical VMs that we can’t live without.

I for one will have backups everywhere from now on.

March 18, 2010 at 12:27am | 1 Comment

HP Proliant ML 115 G5, Windows Server 2008 and nvstor.sys

I initially bought a HP Proliant ML 115 server as a cheap test/lab server for VMware vSphere and miscellaneous rollout projects at work, but all of a sudden I needed it for some other project that required that I install Windows Server 2008 directly on the hardware itself.

As is the story with most HP Proliant servers, you should install it with the tools that HP provides. In the case of the ML 115, you can't use the normal SmartStart setup, but you can use its little cousin, the Easy Set-up CD.

The installation started fine, after running through the initial HP wizard, but when the time came to actually get the installation started it went all blue screened on me, complaining about nvstor.sys.
I knew that the Windows 2008 installation medium doesn't include support for the built-in nVidia NFP3400 SATA storage controller in RAID mode, but I wasn't running a RAID based setup on it anyway so that shouldn't cause the problem.

Next I tried installing Windows Server 2008 without the Easy Set-up CD, in other words by simply booting the Windows Server 2008 installation CD, and initially it seemed to be running OK. That is, until it just stopped at 0% progress in the "Expanding files" stage of the installation.

So, there I was. Using the HP tools, the installation ends in a big old BSOD; using the "native" Windows Server 2008 installation, it just stops without any indication of what might be wrong.

As it turns out, the solution was pretty weird. The HDD shipped with the server causes the problem (160GB NHP SATA). I have no idea how, but replacing it with another SATA drive and starting the installation again, with the Easy Set-up CD, fixed it.

The HDD shipped with the server makes the installation of Windows Server 2008 crash; replacing it with a "generic" Western Digital AV-GP 1.5TB SATA drive lets me install without problems.

Obviously the nvstor.sys driver shipped with Windows Server 2008 has problems with some drives, but not all. Imagine that: a cheap server that can run VMware ESX/ESXi right out of the box can't run Windows Server 2008 with the HDD it shipped with.

Now, how weird is that? Note that I did not test this with Windows Server 2008 R2, so the nvstor.sys file shipped with that version might not have the same problem. Also, I did not try loading newer nVidia drivers during the Windows installation, because a) the Easy Set-up CD doesn't give you the option to load third-party drivers, and b) once I had figured out that changing the HDD helped, I didn't want to try another manual installation.

Remind me again, why don't we just virtualize everything? In this instance, it would actually be easier (and quicker!) to install ESXi on the bare metal hardware, create a VM and install Windows Server 2008 in that instead of installing Windows Server 2008 on the hardware directly. How the world has indeed changed.

Update 10. March 2010:

After finishing the installation, I did run into another problem that quite possibly is also related to the nvstor.sys driver. Windows would fail to create partitions if the total space used by the partitions exceeded approximately 1TB.

Upgrading the server to Windows Server 2008 R2 fixed this issue, and I was able to utilize the full disk. This leads me to think that had I installed Server 2008 R2 from the get-go I would not have seen the installation issues with the original drive at all.

March 9, 2010 at 10:33pm | 1 Comment

ESXi No More Must Have Have!

Maish Saidel-Keesing has revisited his previous post "Hot Add and "Need have have"" where he (like I did) pokes some fun at a rather strange error message in ESXi 4.0. Now that Update 1 is out, Maish tries again, this time with better results.

Read the whole post: "Need have have" - revisited.

I'm glad to say we don't need have have any more!

November 25, 2009 at 10:17pm | 0 Comments

Howto: Using ExtPart to Expand Windows Server 2003 VM Boot Volume

Over time the boot partition on a Windows Server 2003 installation might just turn out to be too small. There can be various reasons for this, but the fact remains that over time you will accumulate data on the boot drive that you didn't account for when you set it up initially.

Luckily I run almost all of my servers in a VMware based virtualized environment, where it's easy to expand the virtual disks. The problem is that Windows Server 2003 doesn't let you easily expand the boot volume, at least not without downtime. I've previously talked about using tools like GParted to expand the boot volume, but there are easier ways to do it that avoid downtime at the same time!

All you need is love. No, wait, that's something else entirely! All you need is ExtPart. ExtPart is a lovely little 36KB tool that Dell provides to expand partitions on Dell servers and storage systems. It is a little-known fact that ExtPart can do the job on any 32-bit Windows Server 2000 or 2003 based install (no 64-bit support, sadly); in Server 2008 there are other methods of doing this.

Enough talk, let's get down to the business at hand.

  1. Download ExtPart from the Dell download site
  2. Expand your boot volume, either via the Virtual Infrastructure Client or via vmkfstools
  3. Run ExtPart inside your VM to expand your boot volume to the new size
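
Steps 2 and 3 can also be driven from the command line. The sketch below, in Python, only builds the two command lines and prints them; it does not run anything. It rests on two assumptions about syntax: that `vmkfstools -X` takes the new total size of the virtual disk, and that ExtPart is called as `extpart <volume> <size in MB to add>`. The datastore path, the sizes and the function names are made up for illustration, so check the syntax against your own versions before relying on it.

```python
from typing import List


def vmkfstools_extend_cmd(vmdk_path: str, new_size_gb: int) -> List[str]:
    """Step 2 (command-line variant): grow the virtual disk on the host."""
    # Assumption: -X takes the new TOTAL size, not the amount to add.
    return ["vmkfstools", "-X", f"{new_size_gb}G", vmdk_path]


def extpart_cmd(volume: str, old_size_gb: int, new_size_gb: int) -> List[str]:
    """Step 3: extend the volume inside the VM by the space that was added."""
    if new_size_gb <= old_size_gb:
        raise ValueError("new size must be larger than the old size")
    # Assumption: ExtPart takes the number of megabytes to ADD to the volume.
    extend_by_mb = (new_size_gb - old_size_gb) * 1024
    return ["extpart.exe", volume, str(extend_by_mb)]


# Hypothetical example: grow a 20 GB boot disk to 30 GB.
host_cmd = vmkfstools_extend_cmd("/vmfs/volumes/datastore1/fileserver/fileserver.vmdk", 30)
guest_cmd = extpart_cmd("c:", old_size_gb=20, new_size_gb=30)

print(" ".join(host_cmd))   # run on the ESX host
print(" ".join(guest_cmd))  # run inside the Windows Server 2003 VM
```

Note the difference between the two: the host side is given the final disk size, while the guest side is given the difference. The space ExtPart can actually claim may be slightly less than the raw difference.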

That's it. The following screenshots outline the process very well, without having to guide you through each step. Have a look!

It can't get much simpler than this, honestly.

October 28, 2009 at 2:28pm | 1 Comment

Does your ESXi Need Have Have?

[Screenshot: RequiresNeedHaveHave.png]

Nice little error message shown when trying to hot add a new HDD to a VM running on ESXi 4.0.

How much need have have do you need?

Addendum:

Clearly I'm not the first to notice this rather peculiar wording in ESXi 4. Maish Saidel-Keesing posted the same screenshot back in May 2009 in his post called Hot Add and "Need have have".

Read that post instead of mine; it highlights what ESXi 4 is missing as well as its poorly worded error messages.

Funny thing is that I can even remember reading Maish's post back when it was published, but I don't remember seeing that weird error message. Oh well. :)

Thanks to Jase McCarty for pointing this out to me.

September 23, 2009 at 12:56pm | 1 Comment
