Dead in the water


  • #16
    Downtime: When the system is safe from the users. (Like CS today!)
    I am not an a**hole. I am a hemorrhoid. I irritate a**holes!
    Procrastination: Forward planning to ensure there is something to do tomorrow.
    Derails threads faster than a pocket nuke.



    • #17
      Quoth Crossbow View Post
      Well, we're mostly back to abnormal. (We're never normal around here.)

      Apparently there was a firmware update to the SAN controller that caused it to stop recognizing the iSCSI hardware accelerator. Once that ball got rolling, there was pretty much no stopping it.

      250 virtual servers had to be rebuilt.
      Fuck. Did anyone "check"... wait. If they'd checked, there wouldn't be a hosed site then...
      Curious. Does the SAN do its own updates? ...or was this sent in and the admins pushed it up?

      Either way, shiiiiit
      In my heart, in my soul, I'm a woman for rock & roll.
      She's as fast as slugs on barbiturates.



      • #18
        Quoth Der Cute View Post
        Fuck. Did anyone "check"... wait. If they'd checked, there wouldn't be a hosed site then...
        Curious. Does the SAN do its own updates? ...or was this sent in and the admins pushed it up?

        Either way, shiiiiit
        To be honest, I don't know. The vendor is insisting it wasn't the firmware upgrade (of course it wasn't).

        I'm blaming it on the CFO for not being willing to pony up the money for a failover solution. Our disaster recovery plan seems to be "When in danger or in doubt, run in circles, scream and shout."
        "If your day is filled with firefighting, you need to start taking the matches away from the toddlers…” - HM



        • #19
          Over the years, I've only gotten one disaster recovery plan actually implemented. Seems some actuarial type in Accounting always convinced Pointy-Haired Bosses the odds of a system failure were remote and didn't justify the expense of having duplicate systems, mirrored databases, tertiary backups, etc. The one company that did implement the DRP was, oddly enough, a Real Estate Brokerage.



          • #20
            I've worked at several companies that have had intelligent, extensive, and successful DRP in place. One had multiple emergency locations with mirrored hardware and nightly off-site backups. In the event of a major issue at our main site, we could be up and running again at the backup facility with all available personnel in less than an hour. And this isn't a tech company.

            Another one actually had a minor disaster. A large pipe in the ceiling burst over the second floor, spewing thousands of gallons of water into the cube farm. Directly below said break, on the first floor? The server room. Not good. We were running again in the same facility with about 75% functionality in 15 minutes.


            This place? "Huh? What? La la la! I can't hear you!"
            "If your day is filled with firefighting, you need to start taking the matches away from the toddlers…” - HM



            • #21
              Can you send in a "Hey, you idiots, this cost us $!#$#@$% dollars; if you put the DRP in, it WON'T COST THAT MUCH"?
              In my heart, in my soul, I'm a woman for rock & roll.
              She's as fast as slugs on barbiturates.



              • #22
                Quoth Samardnaz View Post
                Seems some actuarial type in Accounting always convinced Pointy-Haired Bosses the odds of a system failure were remote and didn't justify the expense of having duplicate systems, mirrored databases, tertiary backups, etc.
                As has been noted elsewhere:

                Quoth Terry Pratchett
                ... million-to-one chances crop up nine times out of ten.
                "I don't have to be petty. The Universe does that for me."



                • #23
                  At the last big-corp-type IT job I had, I planned out the DRP for two systems that I was responsible for. This included extensive daily backup tapes, procedures for switching the virtual backbone from the corp HQ to an outside facility, data, program, and system reload procedures, and machine availability.

                  Spent months planning and revising, then actually implemented a crash simulation so we could switch over to the 3rd-party off-site backup facility.

                  You cannot be too careful with a $500-750M company.
                  I'm lost without a paddle and headed up SH*T creek.
                  -- Life Sucks Then You Die.


                  "I'll believe corp. are people when Texas executes one."



                  • #24
                    I actually have a little clip-on screwdriver that I never work with as my St. Vidicon "rosary." Even after over thirty years in what is arguably one of the most stringent of scientific disciplines, computing, it never ceases to amaze me how superstitious we are. I resisted acknowledging Murphy, and never anthropomorphized, but eventually decided, "Hey, it couldn't hurt."

                    Glad they seem on track toward good uptime for you, Crossbow; I know it's a trial.



                    • #25
                      Quoth Terry Pratchett View Post
                      ... million-to-one chances crop up nine times out of ten.

                      According to one professor from back in college, "RAID arrays have such a tiny chance of two drives going bad at the same time that you should never have a problem as long as you replace bad drives promptly."

                      I've seen multiple drives go bad at the same time twice (that I can confirm) at different companies. Maybe I should be playing the lotto.
                      The Rich keep getting richer because they keep doing what it was that made them rich. Ditto the Poor.
                      "Hy kan tell dey is schmot qvestions, dey is makink my head hurt."
                      Hoc spatio locantur.



                      • #26
                        Quoth Geek King View Post
                        According to one professor from back in college, "RAID arrays have such a tiny chance of two drives going bad at the same time that you should never have a problem as long as you replace bad drives promptly."
                        Technically, he's both right and wrong at the same time.

                        The chance of two drives going bad in the same minute is virtually zero. Provided you get the drives replaced in that minute, and the rebuild happens rapidly, you should never have a problem. Of course, then reality sets in, and things don't work so well.

                        In reality, the drives in the RAID were likely manufactured in the same building, within hours of each other, suffering the same manufacturing defects from the same batch of materials, shipped together, installed together, and will therefore have similar MTBF. As a result, if one drive goes bad and it's not replaced immediately, you have a much higher chance of losing the whole RAID when another goes bad shortly thereafter.

                        So, are you likely to see two drives go bad in so small a time window that you could not have replaced one of them and saved the RAID? No. Are you likely to see two of them go bad on the same day? Yes. Are you likely to have responded quickly enough to the first failure to prevent the loss of the whole RAID? That depends. Are you working in a 9 to 5 shop? Did it go bad after hours, or on the weekend? Do you have sufficient monitoring that you would know it went bad in time to do the replacement?

                        In reality, you're likely to miss that window of opportunity. It's not malice, nor even incompetence. It's just being overworked, and not necessarily having the time and/or tools to have noticed the failure before it became critical.
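
                        The monitoring half is at least cheap to fix. A bare-bones sketch, assuming smartmontools is installed and something like cron runs this with rights to read the disks; the device list and the "alert" are just placeholders, and hardware RAID members often need extra smartctl device-type flags not shown here:

```python
import subprocess

# Poll each disk's SMART overall health and yell if one is failing.
# Device names and the alert action are placeholders for this sketch.
DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]   # adjust to your box

def healthy(device: str) -> bool:
    """Return True if smartctl's overall health self-assessment passes."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    # A good drive reports "... self-assessment test result: PASSED".
    return "PASSED" in result.stdout

for dev in DEVICES:
    if not healthy(dev):
        # In real life: page someone, open a ticket, flash the big red light.
        print(f"ALERT: {dev} failed its SMART health check; replace it now.")
```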

                        So, yes, your prof was both right and wrong at the same time, with the same words.
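
                        If anyone wants to see the shape of that math, here's a quick toy Monte Carlo. Every number in it is invented (the failure rate, the batch penalty, the array size); it's only meant to show how a slow replacement window plus same-batch drives stacks the odds:

```python
import random

# Toy simulation: a RAID that tolerates one dead drive is only as safe
# as (a) how fast the dead drive gets replaced and (b) how correlated
# the surviving drives' failures are. All constants below are made up.
TRIALS = 10_000
DRIVES = 8             # array that tolerates exactly one dead drive
BASE_P = 0.0003        # per-drive, per-day chance of failure (made up)
BATCH_PENALTY = 50     # once a batch-mate dies, siblings are assumed to
                       # be near end-of-life too (multiplier is made up)
DAYS = 365

def arrays_lost(replace_days: int, same_batch: bool) -> float:
    """Fraction of simulated years in which a second drive dies
    before the first failure is swapped and rebuilt."""
    lost = 0
    for _ in range(TRIALS):
        failed, days_degraded = 0, 0
        for _ in range(DAYS):
            p = BASE_P * (BATCH_PENALTY if (failed and same_batch) else 1)
            for _ in range(DRIVES - failed):
                if random.random() < p:
                    failed += 1
            if failed >= 2:        # second death before rebuild: array gone
                lost += 1
                break
            if failed == 1:
                days_degraded += 1
                if days_degraded >= replace_days:
                    failed, days_degraded = 0, 0   # swapped and rebuilt
    return lost / TRIALS

for window in (1, 7):
    for batch in (False, True):
        print(f"replace in {window} day(s), same batch: {batch} -> "
              f"~{arrays_lost(window, batch):.3%} of arrays lost per year")
```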



                        • #27
                          Quoth Pedersen View Post
                          In reality, the drives in the RAID were likely manufactured in the same building, within hours of each other, suffering the same manufacturing defects from the same batch of materials, shipped together, installed together, and will therefore have similar MTBF. As a result, if one drive goes bad and it's not replaced immediately, you have a much higher chance of losing the whole RAID when another goes bad shortly thereafter.
                          In fact, if the designer/site admin has enough pull, s/he will usually buy drives from different manufacturers (which usually means a smaller RAID system, sized to what the drives have in common), or, if buying from only one manufacturer, order batches every few months so that they have a mix of drives that did not come off the same production run.
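
                          A rough sketch of what that dealing-out looks like in practice; the inventory names and batch labels are made up for the example:

```python
from itertools import cycle

# Deal drives into an array round-robin from several purchase batches
# (or vendors), so no array is built entirely from one production run.
def mix_batches(batches: dict[str, list[str]], array_size: int) -> list[str]:
    """Pick array_size drives, taking one from each batch in turn."""
    pools = [list(drives) for drives in batches.values()]
    picked = []
    for pool in cycle(pools):
        if len(picked) == array_size or not any(pools):
            break
        if pool:
            picked.append(pool.pop(0))
    return picked

inventory = {
    "vendor_a_jan": ["A1", "A2", "A3"],
    "vendor_b_jan": ["B1", "B2", "B3"],
    "vendor_a_apr": ["C1", "C2", "C3"],
}
print(mix_batches(inventory, 8))
# -> ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'A3', 'B3']
```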



                          • #28
                            I need to dig it out of the box in the barn it is in, but once, when I was bored, I calligraphed a lovely Prayer to St. Vidicon that I kept on my cube wall at my last job. Oddly, I was one of the few people that rarely had computer issues.
                            EVE Online: 99% of the time you sit around waiting for something to happen, but that 1% of action is what hooks people like crack, you don't get interviewed by the BBC for a WoW raid.



                            • #29
                              Quoth Pedersen View Post
                              Technically, he's both right and wrong at the same time.
                              All that you've said is true. There are also failures caused by factors outside the realm of the drives or RAID array itself: heavy power surges; low-flow, up-and-down brownouts; etc. All really good reasons to argue for a multiple-recovery DRP.
                              The Rich keep getting richer because they keep doing what it was that made them rich. Ditto the Poor.
                              "Hy kan tell dey is schmot qvestions, dey is makink my head hurt."
                              Hoc spatio locantur.

