Dead in the water


  • #16
    Downtime: When the system is safe from the users. (Like CS today!)
    I am not an a**hole. I am a hemorrhoid. I irritate a**holes!
    Procrastination: Forward planning to ensure there is something to do tomorrow.
    Derails threads faster than a pocket nuke.



    • #17
      Quoth Crossbow View Post
      Well, we're mostly back to abnormal. (We're never normal around here.)

      Apparently there was a firmware update to the SAN controller that caused it to stop recognizing the iSCSI hardware accelerator. Once that ball got rolling, there was pretty much no stopping it.

      250 virtual servers had to be rebuilt.
      Fuck. Did anyone "check"... wait. If they'd checked, there wouldn't be a hosed site then...
      Curious. Does the SAN do its own updates? ...or was this sent in and the admins pushed it up?

      Either way, shiiiiit
      In my heart, in my soul, I'm a woman for rock & roll.
      She's as fast as slugs on barbiturates.



      • #18
        Quoth Der Cute View Post
        Fuck. Did anyone "check"... wait. If they'd checked, there wouldn't be a hosed site then...
        Curious. Does the SAN do its own updates? ...or was this sent in and the admins pushed it up?

        Either way, shiiiiit
        To be honest, I don't know. The vendor is insisting it wasn't the firmware upgrade (of course it wasn't).

        I'm blaming it on the CFO for not being willing to pony up the money for a failover solution. Our disaster recovery plan seems to be "When in danger or in doubt, run in circles, scream and shout."
        "If your day is filled with firefighting, you need to start taking the matches away from the toddlers…” - HM



        • #19
          Over the years, I've only gotten one disaster recovery plan actually implemented. Seems some actuarial type in Accounting always convinced Pointy-Haired Bosses the odds of a system failure were remote and didn't justify the expense of having duplicate systems, mirrored databases, tertiary backups, etc. The one company that did implement the DRP was, oddly enough, a Real Estate Brokerage.



          • #20
            I've worked at several companies that have had intelligent, extensive, and successful DRP in place. One had multiple emergency locations with mirrored hardware and nightly off-site backups. In the event of a major issue at our main site, we could be up and running again at the backup facility with all available personnel in less than an hour. And this isn't a tech company.

            Another one actually had a minor disaster. A large pipe in the ceiling burst over the second floor, spewing thousands of gallons of water into the cube farm. Directly below said break, on the first floor? The server room. Not good. We were running again in the same facility with about 75% functionality in 15 minutes.


            This place? "Huh? What? La la la! I can't hear you!"
            "If your day is filled with firefighting, you need to start taking the matches away from the toddlers…” - HM



            • #21
              Can you send in a "Hey, you idiots, this cost us $!#$#@$% dollars; if you put the DRP in, it WON'T COST THAT MUCH"?
              In my heart, in my soul, I'm a woman for rock & roll.
              She's as fast as slugs on barbiturates.



              • #22
                Quoth Samardnaz View Post
                Seems some actuarial type in Accounting always convinced Pointy-Haired Bosses the odds of a system failure were remote and didn't justify the expense of having duplicate systems, mirrored databases, tertiary backups, etc.
                As has been noted elsewhere:

                Quoth Terry Pratchett
                ... million-to-one chances crop up nine times out of ten.
                "I don't have to be petty. The Universe does that for me."



                • #23
                  At the last big-corp-type IT job I had, I planned out the DRP for two systems that I was responsible for. This included extensive daily backup tapes, procedures for switching the virtual backbone from the corp HQ to an outside facility, data, program, and system reload procedures, and machine availability.

                  Spent months planning and revising, then actually implemented a crash simulation so we could switch over to the 3rd-party off-site backup facility.

                  You cannot be too careful with a $500-750M company.
                  I'm lost without a paddle and headed up SH*T creek.
                  -- Life Sucks Then You Die.


                  "I'll believe corp. are people when Texas executes one."



                  • #24
                    I actually have a little clip-on screwdriver that I never work with as my St. Vidicon "rosary." Even after over thirty years in what is arguably one of the most stringent of scientific disciplines, computing, it never ceases to amaze me how superstitious we are. I resisted acknowledging Murphy, and never anthropomorphized, but eventually decided, "Hey, it couldn't hurt."

                    Glad they seem on track toward good uptime for you, Crossbow; I know it's a trial.



                    • #25
                      Quoth Terry Pratchett View Post
                      ... million-to-one chances crop up nine times out of ten.

                      According to one professor from back in college, "RAID arrays have such a tiny chance of two drives going bad at the same time that you should never have a problem as long as you replace bad drives promptly."

                      I've seen multiple drives go bad at the same time twice (that I can confirm) at different companies. Maybe I should be playing the lotto.
                      The Rich keep getting richer because they keep doing what it was that made them rich. Ditto the Poor.
                      "Hy kan tell dey is schmot qvestions, dey is makink my head hurt."
                      Hoc spatio locantur.



                      • #26
                        Quoth Geek King View Post
                        According to one professor from back in college, "RAID arrays have such a tiny chance of two drives going bad at the same time that you should never have a problem as long as you replace bad drives promptly."
                        Technically, he's both right and wrong at the same time.

                        The chance of two drives going bad in the same minute is virtually zero. Provided you get the drives replaced in that minute, and the rebuild happens rapidly, you should never have a problem. Of course, then reality sets in, and things don't work so well.

                        In reality, the drives in the RAID were likely manufactured in the same building, within hours of each other, suffering the same manufacturing defects from the same batch of materials, shipped together, installed together, and will therefore have similar MTBF. As a result, if one drive goes bad and it's not replaced immediately, you have a much higher chance of losing the whole RAID when another goes bad shortly thereafter.

                        So, are you likely to see two drives go bad in so small a time window that you could not have replaced one of them and saved the RAID? No. Are you likely to see two of them go bad on the same day? Yes. Are you likely to have responded quickly enough to the first failure to prevent the loss of the whole RAID? That depends. Are you working in a 9 to 5 shop? Did it go bad after hours, or on the weekend? Do you have sufficient monitoring that you would know it went bad in time to do the replacement?

                        In reality, you're likely to miss that window of opportunity. It's not malice, nor even incompetence. It's just being overworked, and not necessarily having the time and/or tools to have noticed the failure before it became critical.
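
                        The monitoring half is at least cheap to fix. A bare-bones sketch, assuming smartmontools is installed and something like cron runs this with rights to read the disks; the device list and the "alert" are just placeholders, and hardware RAID members often need extra smartctl device-type flags not shown here:

```python
import subprocess

# Poll each disk's SMART overall health and yell if one is failing.
# Device names and the alert action are placeholders for this sketch.
DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]   # adjust to your box

def healthy(device: str) -> bool:
    """Return True if smartctl's overall health self-assessment passes."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    # A good drive reports "... self-assessment test result: PASSED".
    return "PASSED" in result.stdout

for dev in DEVICES:
    if not healthy(dev):
        # In real life: page someone, open a ticket, flash the big red light.
        print(f"ALERT: {dev} failed its SMART health check; replace it now.")
```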

                        So, yes, your prof was both right and wrong at the same time, with the same words.
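
                        If anyone wants to see the shape of that math, here's a quick toy Monte Carlo. Every number in it is invented (the failure rate, the batch penalty, the array size); it's only meant to show how a slow replacement window plus same-batch drives stacks the odds:

```python
import random

# Toy simulation: a RAID that tolerates one dead drive is only as safe
# as (a) how fast the dead drive gets replaced and (b) how correlated
# the surviving drives' failures are. All constants below are made up.
TRIALS = 10_000
DRIVES = 8             # array that tolerates exactly one dead drive
BASE_P = 0.0003        # per-drive, per-day chance of failure (made up)
BATCH_PENALTY = 50     # once a batch-mate dies, siblings are assumed to
                       # be near end-of-life too (multiplier is made up)
DAYS = 365

def arrays_lost(replace_days: int, same_batch: bool) -> float:
    """Fraction of simulated years in which a second drive dies
    before the first failure is swapped and rebuilt."""
    lost = 0
    for _ in range(TRIALS):
        failed, days_degraded = 0, 0
        for _ in range(DAYS):
            p = BASE_P * (BATCH_PENALTY if (failed and same_batch) else 1)
            for _ in range(DRIVES - failed):
                if random.random() < p:
                    failed += 1
            if failed >= 2:        # second death before rebuild: array gone
                lost += 1
                break
            if failed == 1:
                days_degraded += 1
                if days_degraded >= replace_days:
                    failed, days_degraded = 0, 0   # swapped and rebuilt
    return lost / TRIALS

for window in (1, 7):
    for batch in (False, True):
        print(f"replace in {window} day(s), same batch: {batch} -> "
              f"~{arrays_lost(window, batch):.3%} of arrays lost per year")
```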



                        • #27
                          Quoth Pedersen View Post
                          In reality, the drives in the RAID were likely manufactured in the same building, within hours of each other, suffering the same manufacturing defects from the same batch of materials, shipped together, installed together, and will therefore have similar MTBF. As a result, if one drive goes bad and it's not replaced immediately, you have a much higher chance of losing the whole RAID when another goes bad shortly thereafter.
                          In fact, if the designer/site admin has enough pull, s/he will usually buy drives from different manufacturers (which usually means a smaller RAID system, sized to what the drives have in common), or, if buying from only one manufacturer, order batches every few months so that they have a mix of drives that did not come off the same production run.
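
                          A rough sketch of what that dealing-out looks like in practice; the inventory names and batch labels are made up for the example:

```python
from itertools import cycle

# Deal drives into an array round-robin from several purchase batches
# (or vendors), so no array is built entirely from one production run.
def mix_batches(batches: dict[str, list[str]], array_size: int) -> list[str]:
    """Pick array_size drives, taking one from each batch in turn."""
    pools = [list(drives) for drives in batches.values()]
    picked = []
    for pool in cycle(pools):
        if len(picked) == array_size or not any(pools):
            break
        if pool:
            picked.append(pool.pop(0))
    return picked

inventory = {
    "vendor_a_jan": ["A1", "A2", "A3"],
    "vendor_b_jan": ["B1", "B2", "B3"],
    "vendor_a_apr": ["C1", "C2", "C3"],
}
print(mix_batches(inventory, 8))
# -> ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'A3', 'B3']
```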



                          • #28
                            I need to dig it out of the box in the barn it is in, but once, when I was bored, I calligraphed a lovely Prayer to St. Vidicon that I kept on my cube wall at my last job. Oddly, I was one of the few people that rarely had computer issues.
                            EVE Online: 99% of the time you sit around waiting for something to happen, but that 1% of action is what hooks people like crack, you don't get interviewed by the BBC for a WoW raid.



                            • #29
                              Quoth Pedersen View Post
                              Technically, he's both right and wrong at the same time.
                              All that you've said is true. There are also failures caused by factors outside the realm of the drives or RAID array itself: heavy power surges; low-flow, up-and-down brownouts; etc. All really good reasons to argue for a multiple-recovery DRP.
                              The Rich keep getting richer because they keep doing what it was that made them rich. Ditto the Poor.
                              "Hy kan tell dey is schmot qvestions, dey is makink my head hurt."
                              Hoc spatio locantur.

