In which IT stands for Ignorant Twits
  #1  
09-25-2018, 10:13 PM
otakuneko
^w^
Join Date: Oct 2007
Location: In the anime store
Posts: 1,420

It's been a long time, so a bit of background: I work for a data security vendor. Our product is a multifunctional mess of capabilities, mainly centered around protecting websites from attack, and monitoring database activities. You can get it as a virtual machine, an Amazon Web Services instance, or a real honest-to-god piece of hardware of varying sizes and beefiness. This first story deals with the latter.

Infinite RCA-gress
A customer experienced an outage in their network, but when they opened the case, they only said that a gateway (the component that does all the work, and through which traffic flows) had disconnected. No mention was made of the network outage occurring at the same time. They did say they removed the gateway from the traffic flow, and demanded a root cause before putting it back inline.

Pretty standard stuff, nothing sucky yet. With only the vague description that the device disconnected, I went ahead and did my thing. As it turned out, there wasn't really much to go off of. As best I could tell, all that happened was our application stopped working, but I couldn't find any reason for the failure. No errors in the logs, nothing suspicious at all, zip. The customer had rebooted it, and after the reboot everything seemed normal in the logs. I presented this to the customer, offered a few possibilities, and advised them to patch. They were a few patches behind, after all. Not unusual but since I couldn't find the cause, might as well rule out some bugs, right?

That's when things took a turn. It was then revealed that the failure had caused an outage for everything behind the gateway. That bit of information might have been useful beforehand, since, with few exceptions, the only way that should happen is a hardware failure. Sure enough, there was an entry in the RAID controller log calling out a disk error right before the outage. (And yet after they rebooted it, everything was peachy.)

Normally a disk error in an appliance with mirrored drives shouldn't be an issue, but over the years I've learned our RAID sucks and a single disk issue can indeed bring down a system and even prevent it from booting. When I offered to RMA the hard drives (we always replace both at once), things asploded. To sum up:
  • Why wasn't there a system event for the dead drive? (The OS halted, how's it going to send an event?)
  • Apparently they asked for FOUR drives because reasons (talk to your account team, we can only get you two)
  • They also asked for the specs and if they can source their own drives to replace with, because apparently the device is abroad and shipping would take too long.
  • Hey wait a minute, how come we got this system event for a dead drive just now? (With a serial number different from the one in the raid log, cripes now we gotta RMA the whole device because clearly it's the controller at fault)
  • Oh wait our bad, that system event was just one of our techs popping out the drive just for a look-see! (UGH!)
  • We replaced the drives and everything's fine now, but this thing has RAID, why did this happen? (Cus our RAID sucks? Not that I can tell you that)
  • Also, why did it not Fail-Open like it was supposed to? (Because the OS was completely halted you idiot, same reason it didn't send an event about the drive!)

Yes, indeed, in the end, they ended up getting two drives (at least; I don't know if they succeeded in getting a second pair) AND an appliance. And after all this, even though the RAID controller is still suspect (they did have a tech pop the WRONG drive, and the device kept chugging), they just replaced the drives and called it good. And then they conveniently forgot what I told them earlier and wanted to know why it caused an outage.

When Microsoft is an adjective

In addition to the earlier mentioned product, we also have pieces of software called agents that customers can install on their database servers to help with the monitoring.

So a customer opens a case and says they think our agent crashed their MSSQL server. I hate these kinds of cases. It generally involves a bug hunt, which either ends in me telling the customer to update the agent, escalating to the dev team, or the worst option, a whole lot of nothing.

As soon as I get his logs, I see that it's going to be option one: he's using a version known to have multiple issues, several patches behind. I can't really prove from the logs that we caused the crash, but even if we didn't, it's only a matter of time before we do, so he should patch anyway, and I tell him this.

You would think I'd pissed on his grandma. This set off a series of escalations up to their CISO, who ended up on a call with our VP and Director of Support, plus other calls involving me, my manager, one of our devs, Microsoft reps, and half of the customer's team, all wanting in-the-weeds details about the agent's various issues, how it works, and our relationship with Microsoft. Just update the damn thing, gorrammit! Yes, I understand these are production servers, but you're running an agent with known issues and refusing to update simply because we can't prove it was the problem THIS time?

Eventually they relented and agreed to try the new version in their test environment and gradually move it to prod if things went well. Could've saved everyone a lot of headaches if they'd just done that two months ago.

Degradation of the Needful Upgradation

Over the years I've come to observe an overwhelming preponderance of data showing that all the good IT people in India... are not in India. They emigrated here (we have several in the office with me), there, and anywhere but India. The ones that stayed in India concentrated into one or two consulting companies (anyone in the industry will know their names) and are, generally, very bad at their jobs. They seem to know this, and get very demanding and pushy about it. Whenever I get a case from one of these companies, I know I'm in for a bad time.

This case came from one such company. They were having an issue with AWS snapshots failing.

I get the logs, and sure enough, the snapshots are failing because the instance has two disk volumes and the volume IDs are getting run together. Amazon has no idea what volume 'XYZ123ABCDEF' is so it fails to take a snapshot of it. Our bad.

But wait, why do you even have two volumes in the first place? This is not a supported configuration, and snapshots work just fine when you have only one. I tell them it's unsupported, and they should redeploy with just one volume.
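I never saw the actual code, but the failure mode reads like the classic list-flattening bug: build a request from a list of volume IDs, and with only one volume you never notice the list was mishandled. A hypothetical Python sketch (all IDs made up, and the request-building is stand-in for a real AWS CreateSnapshot call):

```python
# Made-up example volume IDs for an instance with two volumes.
volume_ids = ["vol-0abc123", "vol-0def456"]

# Buggy pattern: treating the list of IDs as a single value. With one
# volume this happens to work; with two, the IDs get run together into
# one string that AWS has no volume for, so the snapshot fails.
run_together = "".join(volume_ids)  # "vol-0abc123vol-0def456"

# Correct pattern: one snapshot request per volume. In real code each
# entry would become something like ec2.create_snapshot(VolumeId=vid).
def snapshot_requests(ids):
    return [{"VolumeId": vid} for vid in ids]

requests = snapshot_requests(volume_ids)  # two separate requests
```

With a single volume, `"".join()` on a one-element list is indistinguishable from the correct code, which would explain why the bug only surfaced on this customer's two-volume (and unsupported) setup.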

This was a while back, so I'm not quite sure where this came into play, but they decided to also upgrade the environment at the same time. I provided them the upgrade instructions. And they followed them.

Two problems: one, they got bit by a bug causing the exported config to be unusable on the new version. Not their fault, except that they were the ones running an unsupported configuration, and to get out of that required using the upgrade process, rather than the patch process. Two, they deleted the old instance before verifying the new one was working properly.

They harassed our support team all weekend for that one, and sent an angry email later in which they naturally placed all the blame on us (including screenshots of one of my emails and a colleague's) and dammit, they should've been warned about this bug and they demand we "consider all possible scenarios and provide concrete guarantee!"

Eventually, we did get them fixed up.

And then they went and expanded the existing single volume. I don't know that this caused any problems, I was long off the case by then, but WHY? It's a server for managing the configuration of gateways, it's not supposed to store anything long term. What are you doing to this thing? Our AWS crap is difficult enough to support without you fiddling with things. Stop it!

Aaaand that's enough for now.
__________________
Supporting the idiots charged with protecting your personal information.

  #2  
09-25-2018, 11:15 PM
Nunavut Pants
Warning: He thinks he's funny.
Join Date: Nov 2015
Posts: 886

Quote:
Quoth otakuneko View Post
It's been a long time, so a bit of background: I work for a data security vendor.
I think you may work for a competitor of my employer.... At first I thought you worked here, but then a few details diverged. But the issues are pretty familiar!



Quote:
...Yes, I understand these are production servers, and you're running an agent with known issues and refusing to update simply because we can't prove it was the problem THIS time?
I can sort of understand this, actually. Upgrading is something with non-zero risk, and there isn't any guarantee that it will fix the problem that they had. The new version fixes problems, sure, but not necessarily ones they were having.

Of course, those attitudes lead to businesses that still run Windows XP in 2018....

  #3  
09-26-2018, 05:20 PM
otakuneko
^w^
Join Date: Oct 2007
Location: In the anime store
Posts: 1,420

InfoSec is a small world.
__________________
Supporting the idiots charged with protecting your personal information.