3 Links I Love: Why Southwest Says Its IT Failed, Calls for Southwest CEO’s Head, Talking Alaska-Virgin America

Aug 5, 2016


This week’s featured link:
Southwest CEO: Router failure that grounded flights equated to ‘once-in-a-thousand-year flood’ – Dallas Morning News
Something tells me this is as close as we’re ever going to get to figuring out what really happened when Southwest’s systems melted down. I have a picture of Southwest’s IT systems looking like a giant Jenga with half the pieces already gone. So is it possible that one router went down and killed the airline? Sure. Anyone else believe that?

Two for the road:
Unions want Southwest CEO removed after IT outage
On a related note, this IT meltdown has given the unions even more fodder in their negotiations. They want to show that management is incompetent with this IT problem being the latest example. And now, they’re calling for the ouster of CEO Gary Kelly. Can you imagine that this is the same airline Herb Kelleher ran? It’s just incredible that we’ve reached this point.

Alaska Airlines SVP Joseph Sprague on Virgin America Acquisition – Business Travel News
Here’s an interview with Alaska’s SVP of Communications and External Relations that I found worth a read. I think you’ll agree that the comment about Delta is priceless.

Brett

16 responses to “3 Links I Love: Why Southwest Says Its IT Failed, Calls for Southwest CEO’s Head, Talking Alaska-Virgin America”

Patrick

Aug 5, 2016 at 4:57 am

As an IT guy, I can definitely say that yes, a single router failure (whether it’s due to a mechanical or software failure, or due to human error) can take down an entire network. And the type of issue they had (a failure that caused monitoring agents to not know that they needed to fail over to backup equipment) has certainly happened before. It’s just that a disruption at most businesses, even if it lasts multiple days, doesn’t tend to affect people as visibly as a days-long disruption at an airline does.
David

Aug 5, 2016 at 5:32 am

I thought there was a principle in critical IT systems like core networks that there should never be a single point of failure. At least that’s what I’ve always had drummed into me…
A

Aug 5, 2016 at 5:56 am

After the Target Corp. data breach a lot of heads rolled, starting with IT staff (inc. the CIO) and ending with the CEO leaving. In this day and age, when you have a massive SNAFU, people want heads to roll. Provided there was no malfeasance, I tend to believe people learn from mistakes and will never let it happen again. Should the CEO go just because of this? I don’t think so, but the board will likely want to pin blame somewhere and it’s easiest to blame the guy that no longer works for the company. See it ALL THE TIME.
Jeremy

Aug 5, 2016 at 6:44 am

Sure, it probably was a router in a half-failed state, meaning it showed alive to the monitoring software, but it wasn’t routing. Happens way more than you think. The real issue is not the router, but the process. Realize there is an issue, find the type of issue, fix it or work around. There was probably a ton of IT infighting and finger pointing, so really stressed out network and systems engineers, someone was too proud to call TAC, etc. Finally someone unplugged the damn thing, fixed the routing, and after that it was catch up.

But I am willing to bet, somewhere in there is a middle manager who does not understand the tech, has denied requests from engineers to upgrade the router or get a support contract, and is good at blowing smoke to the CIO.

I see it all the time
Czbb

Aug 5, 2016 at 7:35 am

The terse response to “Is Delta still a partner” really shows how strained the AS-DL partnership is
1. Jimmy
  
  Aug 5, 2016 at 9:09 am
  
  Especially after he went on and on and on about other partners in answer to the previous question. Hilarious.
Eric

Aug 5, 2016 at 8:35 am

The DL remark was equal to “I’m not there yet” about political candidates haha. Ouch.

My friends at WN, from management to front line, have said things at the Happy Fun airline have been tense for quite a while. Layers of administrative oversight have grown with the network. Front line people deal with gate keepers who filter access to decision-makers; a huge culture shift from a decade ago. One common theme I hear is that many middle managers came from Legacy carriers and brought their ‘bad habits’ to WN.
JayB

Aug 5, 2016 at 9:12 am

Oh, those darn routers! Didn’t they read the letter from Verizon about (your having the older model router) and while you can pay a $2.80* monthly maintenance charge to keep what you have, why not buy your own Super-Duper certified pre-owned model (not sure why it has to be certified that it was pre-owned) for the one-time $59.99* charge* (* plus taxes, of course).

No, no, not WN, and where are we?
Yo

Aug 5, 2016 at 11:06 am

Southwest is learning the high cost of being cheap
Ziggy

Aug 5, 2016 at 12:36 pm

Maybe what WN needs to add is cheap Redundancy over the cheaper Redundancy?
UAPhil

Aug 5, 2016 at 11:53 pm

I’m a senior software professional, and I agree with Patrick and Jeremy – I’ve seen systems “partially fail” many times. Gary Kelly’s comment comparing the issue to a “once in a thousand year flood” is simply not accurate – these types of failures are actually quite common. (I LUV WN, and hope they learn the right lessons from this incident.)
Eugene

Aug 7, 2016 at 8:48 am

I love this outage. We @tapjets scored many flights as a result. Keep up the good work in IT.
IT architect

Aug 10, 2016 at 7:48 pm

You don’t fix this type of problem by buying a “better” router or a more reliable router. No such thing. More expensive routers can do more and do it faster; that’s it. In fact, you never want to get into a situation where you have to fix it fast.

You build your IT systems with redundant systems, networks, power, data centers, etc. so that you would have just a momentary blip if the power, data center, router, or whatever went down. So in this case, the CIO should have known better and should be fired immediately along with other key cronies. And the CEO and CIO were both probably being cheap by refusing to invest in redundancy and disaster recovery. Both are very expensive, but so was this outage.
Donald

Aug 11, 2016 at 4:01 am

Hi Cranky, Why the silence over the Delta IT failure?
1. CF
  
  Aug 11, 2016 at 8:10 am
  
  Donald – I have nothing to add. Just like with the Southwest one, I’ll be linking to articles about the Delta failure tomorrow in the 3 Links I Love feature.
Rex Mammel

Aug 12, 2016 at 10:04 am

We had a similar failure of an ATT backbone network caused by a technician screwing up the ATM route propagation. It took 3 days to fix and caused the loss of 20 million of pay-per-view orders on the cable network. Definitely technician error; router failure couldn’t cause a problem like that.
