  • There went my weekend.

    I heard an interesting tale from my boss this morning.

    A little over a month ago, the landlord for the building that our data center is in contacted us about some electrical repairs that they wanted to do. They wanted to replace the building's main switch (Note: I don't know the exact details of this repair. I just know the end results of this action). Not a problem, as long as it is done in X minutes, since that is how long our UPS will last.

    So, three weeks ago, on a Saturday, they replace the switch. They find that they have a “ground fault” (not sure if that is what actually happened. Just what was relayed to me). They pull the switch and put the old one back. They spend some time looking for the problem. Can't find it.

    They tell us that they are going to try again the next Saturday. Same thing. They put the old one back.

    Last Saturday, they try again. This time, they decide that they are not going to be able to find the problem unless they leave the switch in. So they cut the power to the building, and didn't tell us.

    We found out around 10:00 AM when the data center went dead. Eight hours later, they fixed the problem (I heard that they never found the problem, it just “went away”) and restored the power.

    We were lucky. Our 240TB of disk arrays all came back online. Lost a number of drives, but none of the RAID sets had multiple failures. We did lose one network switch, two Load Balancers (servicing our most important application), a blade in one of our SAN switches, 4 or 5 internal hard drives on servers and the main logic board in our tape library. You have to remember that some of this equipment has not been powered off in over six years. I have only been there 5.5 years.

    I had all my servers up by 10:00 PM Saturday. Spent Sunday deploying bandages to applications.

    We had all of the customer facing applications running again by 6:00 PM Sunday. Spent most of today “bracing up” the bandages we put in place, getting Development and QA systems back online and restarting services that failed because the target servers were not online yet.

    So...
    How was your weekend?
    Life is too short to not eat popcorn.
    Save the Ales!
    Toys for Tots at Rooster's Cafe

  • #2
    Oy yoy yoy.

    Why didn't you have full power failure handling in place? Stupid landlords aren't the only thing that can cause long-term power outages... Your UPS should be able to tell your network that power's about to fail, and give you a chance to shut down gracefully...
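
    (Illustration only, not from the thread: a minimal sketch of the kind of low-battery monitor being described, assuming a NUT-managed UPS that the stock "upsc" tool can query. The UPS name "rackups" and the shutdown command are placeholders; a real NUT install would normally let upsmon do this job for you.)

    Code:
    #!/usr/bin/env python3
    """Sketch: poll the UPS and shut down cleanly when the battery runs low."""
    import subprocess
    import time

    UPS = "rackups@localhost"   # hypothetical NUT UPS name and host
    POLL_SECONDS = 30

    def ups_status() -> str:
        """Return the ups.status value, e.g. 'OL', 'OB', or 'OB LB'."""
        result = subprocess.run(
            ["upsc", UPS, "ups.status"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    def main() -> None:
        while True:
            if "LB" in ups_status():    # on battery and the battery is low
                # Start a clean shutdown before the UPS runs dry.
                subprocess.run(["/sbin/shutdown", "-h", "now"], check=False)
                return
            time.sleep(POLL_SECONDS)

    if __name__ == "__main__":
        main()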

    • #3
      When you have 240TB of disk drives, you will lose some of them on any power cycle (remember, these are likely enterprise drives, which are usually at most 1TB, often less). Same thing when you have a lot of switches.

      We have a scheduled 8-hour power station remount every year in the building where I work (it usually takes 1.5-2 hours, but they schedule it for 8 hours just in case there are problems). On the day of the remount, I remotely shut down everything (I could let the UPS shut down everything, but there's no point in wasting the batteries), and then go and power things back on when the remount is finished. In the past 8 years we have lost 3 switches, 1 server power supply (redundant) and one hard disk - and we're a small operation - 5 switches, 6 servers.
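
      (Illustration only: a rough sketch of the "remotely shut down everything" step above. The host names and key-based SSH access are assumptions; any remote-management tooling would do the same job.)

      Code:
      #!/usr/bin/env python3
      """Sketch: ask each server to power off cleanly before planned maintenance."""
      import subprocess

      # Hypothetical inventory; the switches have no OS to stop, so only servers are listed.
      SERVERS = ["app1", "app2", "db1", "db2", "file1", "backup1"]

      def shut_down(host: str) -> None:
          """Request a clean power-off over SSH (assumes key-based root access)."""
          subprocess.run(
              ["ssh", f"root@{host}", "shutdown", "-h", "now"],
              check=False,   # a box that is already off just fails quietly here
          )

      if __name__ == "__main__":
          for host in SERVERS:
              shut_down(host)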

      • #4
        So, one way to look at it is that the landlord cost your company a lot of money. I would hope that your company is looking to recover that money.

        • #5
          On a personal note, my own weekend wasn't much better. Wrapped up my EQ raids on Friday and went into Mass Effect for a bit, only to have my desktop lock up. Rebooted and it started doing a chkdsk, which failed with an unexpected error. Reboots just got stuck in a cycle. Safe Mode boots flashed a brief BSOD before cycling as well.

          So at 12:01AM Saturday morning, I was tearing my condo apart looking for my install disks to no avail.

          After a rough night's sleep, I booted the desktop up with a Linux USB stick and managed to back up stuff from both drives to my laptop. The data drive was fine, but the main drive was really sluggish. Some of the files failed, but nothing important. Continued my search for my Vista disks, but I could only find the ones for my laptop (Dell 32-bit install disks, not the 64-bit OEM disk I have for the desktop). Finally gave one last search before I went to bed, checking a chest of drawers I'd checked before, and noticed a white DVD-case-shaped box that turned out to be a cardboard sleeve with MS legalese written on it. Inside the sleeve was my Vista 64 OEM disk.

          Sunday, I tried a repair install, but it didn't work. Tried to format the C: drive and reinstall, and that seemed to work long enough to install the updates, but once I tried to get the last few reboots done, it started reboot-cycling again.

          Finally, I decided to do a reinstall on the second drive. Went into the BIOS to switch the drive boot order, only to see that the first drive was no longer listed. The Vista installer doesn't see it any more either, nor does Vista itself, so the first drive is dead as a doornail now. Thankfully the data drive is working fine.

          In hindsight, I remember hearing what I thought was a loud fan from the desktop system, but didn't think much of it. Since the first drive gave up the ghost, I haven't heard anything, which makes me think that the 'fan' was that first drive going through its death throes.

          I'll eventually disable that old first drive, but for now it doesn't seem to be causing any problems. I'm going to get a new hard drive soon anyway, so I can take the broken one out then. Having most of your data (well, games in my case) on a second drive is a life saver, to say the least.

          But Friday and Saturday (and even most of Sunday) were more than a little stressful, trying to figure out how I could get back online for raids and stuff, considering I don't have any other Windows boxes any more.

          • #6
            Do you need an alibi?
            The High Priest is an Illusion!

            • #7
              Today, the load balancer was replaced. We are now back to full capability. Replaced the SAN switch blade and also got the tape library back online (not fully repaired, but functioning).

              Quoth Pedersen View Post
              So, one way to look at it is that the landlord cost your company a lot of money. I would hope that your company is looking to recover that money.
              Most lease contracts will have a clause protecting the landlord. On the other hand, we have been looking to get out of that lease.

              Quoth ender View Post
              When you have 240TB of disk drives, you will lose some of them on any power cycle (remember, these are likely enterprise drives, which are usually at most 1TB, often less). Same thing when you have a lot of switches.
              Spoke with the SAN Admin. Turns out we only lost one drive in all the RAID arrays. Did lose a controller and three power supplies (all redundant). These are also six years old. We have 144GB FS and 250GB & 320GB SATA.

              Quoth Jetfire View Post
              On a personal note, *snip*
              Been there, done that! Spent two days tearing up the house trying to find the Windows 2K disk several years back.
              Life is too short to not eat popcorn.
              Save the Ales!
              Toys for Tots at Rooster's Cafe

              • #8
                ..................................................wow..................................................(brain breaks)
                Crono: sounds like the machine update became a clusterf*ck..
                pedersen: No. A clusterf*ck involves at least one pleasurable thing (the orgasm at the end).

                • #9
                  Quoth csquared View Post
                  Spoke with the SAN Admin. Turns out we only lost one drive in all the RAID arrays. Did lose a controller and three power supplies (all redundant). These are also six years old. We have 144GB FS and 250GB & 320GB SATA.
                  So 900-1000 drives then? Impressive that only a single one died.

                  Quoth csquared View Post
                  Been there, done that! Spent two days tearing up the house trying to find the Windows 2K disk several years back.
                  Haven't we all been there? Nowadays I just keep the install disk images ripped on my fileserver (which has RAID6 and two spare drives).

                  • #10
                    Pretty sure they breached something in the contract by not telling you they were cutting the power. Our on-call people would've had a heart attack if our data center did that.
                    Out of retail!

                    • #11
                      Our legal team would be all over that one. Loss of servers = loss of revenue. (Of course, that's why we use two datacenters, on opposite coasts.)
                      I will not be pushed, stamped, filed, indexed, briefed, debriefed, or numbered. My life is my own. --#6
