Today, I get an instant message from my manager, letting me know that we have an incoming "hot" problem from a major bank. Their storage loses connectivity to a pair of redundant cost-more-than-I-make-in-five-years switches. The switches have nothing to do with each other, other than being in the same data center, and the storage didn't lose connectivity anywhere else, which is odd, to say the least.
Before I know it, I have three levels of management in the chat window, all wanting answers, and wanting them yesterday. Some helpful soul at the customer site then pastes in the log messages they are looking at (modified here to protect the guilty):
Switch X
Date/Time Port # Message
11:07:25 112 Port Down
11:07:14 96 Port Down
11:07:13 72 Port Down
11:07:12 65 Port Down
11:07:11 80 Port Down
11:07:11 64 Port Down
Switch Y
Date/Time Port # Message
11:07:55 80 Port Down
11:07:51 65 Port Down
11:07:48 72 Port Down
11:07:46 96 Port Down
11:07:40 112 Port Down
11:07:37 97 Port Down
11:07:35 104 Port Down
Note the oddly sequential nature of those timestamps. This is not, to say the least, the normal failure pattern for hardware. When a bunch of things are going to die, they usually fail all at once. As in, simultaneously. And they don't spread to another separate piece of hardware a short time later, like some poisonous cloud.
In fact, I would say this is almost exactly how long it would take a cable monkey to unplug a bunch of cables in a patch panel, one after another.
The ports trickle back online a minute and a half later, at roughly the same speed.
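For anyone who wants the arithmetic rather than my eyeballing, here is a rough Python sketch that computes the gaps between the Port Down events pasted above. The timestamps and port numbers are copied from the logs in this post; the function names (`parse`, `gaps`) are just made up for illustration, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Port Down entries copied from the logs in this post: "time port".
SWITCH_X = """\
11:07:25 112
11:07:14 96
11:07:13 72
11:07:12 65
11:07:11 80
11:07:11 64"""

SWITCH_Y = """\
11:07:55 80
11:07:51 65
11:07:48 72
11:07:46 96
11:07:40 112
11:07:37 97
11:07:35 104"""


def parse(log):
    """Return (time, port) pairs sorted oldest first."""
    events = []
    for line in log.splitlines():
        stamp, port = line.split()
        events.append((datetime.strptime(stamp, "%H:%M:%S"), int(port)))
    return sorted(events)


def gaps(events):
    """Whole seconds between each consecutive pair of events."""
    times = [t for t, _ in events]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


x_events = parse(SWITCH_X)
y_events = parse(SWITCH_Y)

gaps_x = gaps(x_events)
gaps_y = gaps(y_events)

# Time between the last port dropping on Switch X and the first on Switch Y.
handoff = int((y_events[0][0] - x_events[-1][0]).total_seconds())

# Time from the very first drop to the very last one, across both switches.
total = int((y_events[-1][0] - x_events[0][0]).total_seconds())

print("Switch X gaps (s):", gaps_x)
print("Switch Y gaps (s):", gaps_y)
print("X -> Y handoff (s):", handoff)
print("First to last drop (s):", total)
```

The ports go down a second or a few seconds apart, with about ten seconds between finishing one switch and starting on the other. That is the pace of a pair of hands, not of a failing component.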
Let's just say that this one is not exactly going to tax my years of experience troubleshooting enterprise storage equipment.
SirWired
P.S. A few minutes later, someone asks on the conference call who is going to ask the customer to look into staff climbing all over the storage box. He didn't mean that metaphorically. He meant it literally: data center staff are using the customer's $1M-ish piece of hardware as a freakin' stepladder. WTF?
Is this a data center or a frat house?
