This week’s featured link:
Southwest CEO: Router failure that grounded flights equated to ‘once-in-a-thousand-year flood’ – Dallas Morning News
Something tells me this is as close as we’re ever going to get to figuring out what really happened when Southwest’s systems melted down. I have a picture of Southwest’s IT systems looking like a giant Jenga with half the pieces already gone. So is it possible that one router went down and killed the airline? Sure. Anyone else believe that?
Two for the road:
Unions want Southwest CEO removed after IT outage
On a related note, this IT meltdown has given the unions even more fodder in their negotiations. They want to show that management is incompetent with this IT problem being the latest example. And now, they’re calling for the ouster of CEO Gary Kelly. Can you imagine that this is the same airline Herb Kelleher ran? It’s just incredible that we’ve reached this point.
Alaska Airlines SVP Joseph Sprague on Virgin America Acquisition – Business Travel News
Here’s an interview with Alaska’s SVP of Communications and External Relations that I found worth a read. I think you’ll agree that the comment about Delta is priceless.
16 comments on “3 Links I Love: Why Southwest Says Its IT Failed, Calls for Southwest CEO’s Head, Talking Alaska-Virgin America”
As an IT guy, I can definitely say that yes, a single router failure (whether it’s due to a mechanical or software failure, or due to human error) can take down an entire network. And the type of issue they had (a failure that caused monitoring agents to not know that they needed to fail over to backup equipment) has certainly happened before. It’s just that a disruption at most businesses, even if it lasts multiple days, doesn’t tend to affect people as visibly as a days-long disruption at an airline does.
I thought there was a principle in critical IT systems like core networks that there should never be a single point of failure. At least that’s what I’ve always had drummed into me…
After the Target Corp. data breach a lot of heads rolled, starting with IT staff (inc. the CIO) and ending with the CEO leaving. In this day and age when you have a massive SNAFU people want heads to roll. Provided there was no malfeasance, I tend to believe people learn from mistakes and will never let it happen again. Should the CEO go just because of this? I don’t think so, but the board will likely want to pin blame somewhere and it’s easiest to blame the guy that no longer works for the company. See it ALL THE TIME.
Sure, it probably was a router in a half-failed state, meaning it showed alive to the monitoring software, but it wasn’t routing. Happens way more than you think. The real issue is not the router, but the process. Realize there is an issue, find the type of issue, fix it or work around it. There was probably a ton of IT infighting and finger pointing, really stressed out network and systems engineers, someone who was too proud to call TAC, etc. Finally someone unplugged the damn thing, fixed the routing, and after that it was catch-up.
But I am willing to bet, somewhere in there is a middle manager who does not understand the tech, has denied requests from engineers to upgrade the router or get a support contract, and is good at blowing smoke to the CIO.
I see it all the time
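To make the "half-failed" state described above concrete, here is a minimal sketch of a health check that treats "the device answers" and "the device actually passes traffic" as two separate questions. Everything in it is an assumption for illustration: the names HealthReport, check_router and should_fail_over are hypothetical, the probes are plain callables standing in for whatever a real monitoring system would use, and nothing here reflects how Southwest's monitoring actually worked.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthReport:
    alive: bool       # device answers a management/liveness probe
    forwarding: bool  # test traffic actually crosses the device

    @property
    def state(self) -> str:
        if self.alive and self.forwarding:
            return "healthy"
        if self.alive:
            # The case from the comment: looks up to monitoring, not routing.
            return "half-failed"
        if self.forwarding:
            return "management-unreachable"
        return "down"


def check_router(liveness_probe: Callable[[], bool],
                 forwarding_probe: Callable[[], bool]) -> HealthReport:
    """Run both probes; a probe that raises is counted as a failure."""
    def safe(probe: Callable[[], bool]) -> bool:
        try:
            return bool(probe())
        except Exception:
            return False

    return HealthReport(alive=safe(liveness_probe),
                        forwarding=safe(forwarding_probe))


def should_fail_over(report: HealthReport) -> bool:
    # Base the decision on whether traffic flows, not on whether the box replies.
    return not report.forwarding


if __name__ == "__main__":
    # Hypothetical probes: the router replies, but the end-to-end test fails.
    report = check_router(lambda: True, lambda: False)
    print(report.state, should_fail_over(report))
```

The design choice is that the failover decision depends only on the forwarding probe, so a router that still answers its liveness check cannot hide a dead data path.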
The terse response to “Is Delta still a partner” really shows how strained the AS-DL partnership is
Especially after he went on and on and on about other partners in answer to the previous question. Hilarious.
The DL remark was equal to “I’m not there yet” about political candidates haha. Ouch.
My friends at WN, from management to front line, have said things at the Happy Fun airline have been tense for quite a while. Layers of administrative oversight have grown with the network. Front line people deal with gate keepers who filter access to decision-makers; a huge culture shift from a decade ago. One common theme I hear is that many middle managers came from Legacy carriers and brought their ‘bad habits’ to WN.
Oh, those darn routers! Didn’t they read the letter from Verizon about (your having the older model router) and while you can pay a $2.80* monthly maintenance charge to keep what you have, why not buy your own Super-Duper certified pre-owned model (not sure why it has to be certified that it was pre-owned) for the one-time $59.99* charge* (* plus taxes, of course).
No, no, not WN, and where are we?
Southwest is learning the high cost of being cheap
Maybe what WN needs to add is cheap Redundancy over the cheaper Redundancy?
I’m a senior software professional, and I agree with Patrick and Jeremy – I’ve seen systems “partially fail” many times. Gary Kelly’s comment comparing the issue to a “once in a thousand year flood” is simply not accurate – these types of failures are actually quite common. (I LUV WN, and hope they learn the right lessons from this incident.)
I love this outage. We @tapjets scored many flights as a result. Keep up the good work in IT.
You don’t fix this type of problem by buying a “better” router or a more reliable router. No such thing. More expensive routers can do more and do it faster; that’s it. In fact, you never want to get into a situation where you have to fix it fast.
You build your IT systems with redundant systems, networks, power, data centers, etc. so that you would have just a momentary blip if the power, data center, router, or whatever went down. So in this case, the CIO should have known better and should be fired immediately along with other key cronies. And the CEO and CIO were both probably being cheap by refusing to invest in redundancy and disaster recovery. Both are very expensive, but so was this outage.
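The redundancy argument above can be put into rough numbers with the textbook formula for components in parallel. This is only a sketch under stated assumptions: the function name combined_availability is hypothetical, the 99% figure is made up for illustration and is not from any Southwest data, and the formula assumes the copies fail independently and that failover always triggers.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components in parallel.

    Assumes independent failures and perfect failover: the system is
    down only when every copy is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Illustrative number only: one router assumed to be up 99% of the time.
    for n in (1, 2, 3):
        print(n, combined_availability(0.99, n))
```

The caveat matters for this thread: if a half-failed router never triggers the failover, the copies are not independent in practice, and the backup adds far less than the formula suggests.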
Hi Cranky, Why the silence over the Delta IT failure?
Donald – I have nothing to add. Just like with the Southwest one, I’ll be linking to articles about the Delta failure tomorrow in the 3 Links I Love feature.
We had a similar failure of an ATT backbone network, caused by a technician screwing up the ATM route propagation. It took 3 days to fix and caused the loss of 20 million of pay-per-view orders on the cable network. Definitely technician error; router failure couldn’t cause a problem like that.