Thursday, December 20, 2012

Dead Man Walking

This article is about the event that was the greatest disaster of my career. It is also about the event that was the greatest stroke of luck of my career. It was the same event.

In 2005 I was working for a large Telecommunications Equipment Manufacturer. TEM had an impressive product line ranging from pizza-box-sized VOIP gateways to enormous global communications systems. At the time I was a software developer for TEM's Enterprise Communications Product, software that had a code base of millions of lines of code, the oldest sections of which stretched back to the invention of the C language. ECP was the crown jewels that, directly or indirectly, paid all the bills. Although I was an experienced product developer, I was fairly new to this area of the company, having spent the past several years writing software and firmware mostly in C++ closer to the hardware. But I had been working on a team that had developed a major new feature for ECP that was going to be in the upcoming release.

TEM was eager to win a big contract with one of their largest customers, a large, well-known Wealth Management Firm. It is likely that some or all of your retirement funds were under their management. WMF wanted a unified communications system both for their big office buildings, of which they had several, and for their smaller satellite offices scattered all over the country, of which they had many.

TEM was so eager to win this big contract, and the timing for WMF's acquisition was such, that my employer decided to preview the latest release of ECP by sending a team to one of WMF's data centers to install it on one of their servers so that WMF could see just how awesome it was, especially with this new feature that I had helped develop. But this new release of ECP not only wasn't in beta with any other customers yet, it hadn't even passed through TEM's own Quality Assurance organization. It was, at best, a development load, and not a terribly mature one at that. But millions of dollars were riding on TEM convincing WMF that ECP was the way to go.

When they asked me to get on the plane, being the fearless sort, I said yes.

Even given my relative inexperience with this code base, I was probably the logical choice. I had been one of the developers of one of the features in which WMF was interested. And getting this new release of ECP on WMF's server was not the usual process easily handled by one of TEM's technical support people. Because of the immaturity of the release, it wasn't a simple update, but a disk swap that required that I open up WMF's server and do some surgery. I had to back up the configuration and parameter database from the prior ECP release, swap the disks, and restore it to the new release. I was traveling with two disk drives in a hard case and a tool kit.

The conditions under which my work was done at the WMF data center were not optimal. I was in a tightly packed equipment room taking up most of a floor of an office building on a large WMF campus. All the work had to be done in the dead of night outside of normal business hours. I was locked in the equipment room without any means of communicating with anyone. If I walked out even to go to the euphemism, I couldn't get back in without finding a phone outside the room and calling someone. I had a mobile phone, but it couldn't get a signal inside the equipment room. For security reasons, there was no internet access. I had to get the install done quickly so that the other two TEM developers who had come on site could administer the new features and we could demo them before our narrow maintenance window expired. Security was tight, and time was short. I spent almost all my time at WMF sitting on the floor in a very narrow aisle with tools and parts strewn all around me, and a laptop in my lap connected to the server's maintenance Ethernet port. I got the DB backed up, the disks swapped, and the DB restored.

ECP would not come up.

It core dumped during initialization. I didn't even get to a maintenance screen. The system log file told me nothing that the stack trace didn't already. The ECP log was useless. I swapped the disks again and verified that the prior system came up just fine with the same DB, as expected. I tried the spare disk that I had brought with me, to no avail. I desperately needed a clue. The catastrophic failure was in some part of the enormous application that I knew nothing about. Even if I had, I didn't have access to the code base or any of the usual diagnostic tools while sitting on the floor with my laptop. I had saved the DB, the stack trace, and the core dump on my laptop, but had no way to diagnose this level of failure on site, and no way to reach anyone who could. I knew that I was going to have to declare failure and cart everything back home for analysis.

Later, back at TEM, there were lots of meetings and post-mortems, but, remarkably, not a lot of finger pointing. We all knew it was a very immature release. I engaged other, more experienced, ECP developers to diagnose the failure and they set about fixing it in the development base. Once that was done, I set up an ECP server, sans any actual telecommunications hardware, in my office, and installed WMF's DB on it to verify that it did indeed now come up. In the meantime, TEM's QA organization began testing this new ECP release on their own servers, which did have actual hardware. Just a week or two passed before the powers that be decided that the new release had percolated enough that another attempt would be made. WMF would give TEM and ECP another chance.

I said yes, again. In hindsight, I'm a little surprised they asked.

This time I had a copy of the entire ECP code base on my laptop, although I still had no access to any of the complex diagnostic tools used to troubleshoot the system. The circumstances were identical: the same cast of characters, the same cramped cold equipment room, the same DB, the exact same server. Once again, I backed up the DB, swapped the disk, and restored the DB.

ECP came up. But it refused to do a firmware load onto the boards that allowed the server to communicate with any of the distributed equipment cabinets. ECP was up, but it was effectively useless.

We hadn't seen anything like this in our own QA testing of the new release, even though it used the same boards. My intuition told me that it probably had something to do with WMF's specific DB. We hadn't been able to test with that DB in QA because the data in the DB was quite specific to the exact hardware configuration of the system, which involved hundreds if not thousands of individual components that we were unable to exactly replicate. The error didn't appear to be in the ECP software base itself, but in the firmware for the communications board, the source code of which I didn't have. And in any case I was not familiar with the hundreds of thousands of lines of C++ that made up that firmware. I personally knew folks at TEM who were, but even though they were standing by in the middle of the night back at the R&D facility, I had no way to contact them while connected to the server in front of me. After some consulting with the other TEM folks on site, and as our narrow maintenance window was closing, I once again declared failure.

As I got on the plane back east to return home, I knew that this was the end of my career at TEM. I took it as a compliment that they didn't fire me. They didn't even ax me in the inevitable next wave of layoffs. There were lots more meetings and post-mortems, some perhaps harsh but, in my opinion, well-deserved words from TEM management to me, and a lot of discussion about a possible Plan C. But WMF's acquisition timetable had closed. And I knew that I would never be trusted with anything truly important at TEM ever again.

This is not the end of the story.

If you've never worked in a really big product development organization, it may help to know how these things operate.

ECP wasn't a single software product. It was a broad and deep product line incorporating several different types of servers, several possible configurations for some of the servers, many different hardware communications interface boards, and a huge number of features and options, some of which targeted very specific industries or market verticals. Just the ECP software that ran on a server alone was around eight million lines of code, mostly C. The code bases for all of the firmware that ran on the dozens of individual interface and feature boards manufactured by TEM, incorporating many different microprocessors, microcontrollers, FPGAs, and ASICs, and written in many different languages ranging from assembler to C++ to VHDL, added another few million lines of code. As ECP features were added and modified and new hardware introduced, all of this had to be kept in sync by a big globally distributed development organization of hundreds of developers and other experts.

The speed at which new ECP releases were generated by TEM was such that dozens of developers were kept busy fixing bugs in perhaps two prior releases or more, while another team of developers was writing new features for the next release. It was this bleeding edge development release that I had hand carried to WMF. So it was not at all unusual to have at least three branches or forks of the ECP base in play at any one time. As bugs were found in the two prior forks, the fixes had to be ported forward to the latest fork. This was not always simple, since the code containing the bug fix in the older fork may have been refactored, that is, modified, replaced, or even eliminated, in the course of new feature development in the latest fork. While a single code base might have been desirable, it simply wasn't practical given the demands of TEM's large installed user base all over the world, where customers just wanted their bugs fixed so that they could get on with their work and weren't at all interested in new and exciting bugs.

Once I got back home, and got some breathing space between meetings with understandably angry and disappointed managers, I started researching both of the WMF failures. Here is what I discovered: both of the issues I encountered trying to get ECP to run at WMF were known problems. They were documented in the bug reporting system for the prior release, not the development release that I had. The two bug reports were written as a result of TEM's own testing of the prior release. At WMF. At the same data center. On the same server. Those two known bugs had been fixed in the prior release, the very release of ECP that was already running on WMF's test server, but the fixes had not yet been ported forward to the development release that I was using for either of my two site visits. I hadn't known about these issues before; I was new enough to this particular part of the organization that I hadn't been completely conversant with the fu required to search its bug reporting system.

Both of the times I had gotten on the plane to fly to WMF, I was carefully hand carrying disk drives containing software that was absolutely known to be incapable of working. In hindsight, my chances of success were guaranteed to be zero. It had always been a suicide mission.

Here's what I lie awake some nights thinking about. There was a set of folks at TEM who knew we were taking this development release to WMF. There was a set of folks at TEM who knew this development release would not work at WMF. Was the intersection of those two sets empty? Surely it was. What motivation could anyone have to allow such a fiasco to occur?

But sometimes, in my darker moments, I remember that at the time TEM had an HR policy that included that enlightened system of forced ranking. And someone has to occupy those lower rating boxes. Would you pass up the opportunity to eliminate the competition for the rankings at the more rarefied altitudes?

Nevertheless, I have always preferred to believe that the WMF fiasco was simply the result of the right hand not knowing what the left hand was doing. One of the lessons I carried away from this experience is that socializing high-risk efforts widely through an organization might be a really good idea.

Ironically, WMF decided to go ahead and purchase TEM's ECP solution, the very product I had failed to get working, twice, for their main campuses, but to go with TEM's major competitor for the small satellite offices. Technically, it was actually a good solution for WMF, since it played to the strengths of both vendors. Sometimes I wonder what my life would have become if WMF had simply gone with that solution in the first place and we could have avoided both of my ill-fated site visits.

WMF itself, once firmly in the Fortune 100, ceased to exist, having immolated under the tinder of bad debt in the crucible of the financial crisis.

Many of my former colleagues are still at TEM, albeit fewer with each wave of layoffs, still working in that creaky old huge C code base that features things like a four thousand line switch statement. It's probably significantly bigger by now.

As for me, a chance to transfer from the ECP development organization to another project came along. The new project was developing a distributed Java-based communications platform using SOA/EDA with an enterprise service bus. I moved to the new project, and worked there happily for over a year, learning all sorts of new stuff, some of which I've written about in this blog. ECP was probably relieved to see me go.

But knowing that I had made a career-limiting mistake, I eventually chose to leave TEM to be self-employed. My decision surprised a lot of people, most of whom knew nothing, or only a small part, of the WMF story. It was one of the best career decisions I've ever made. I'm happier and less stressed, and I've learned more and made more money than I would have had I stayed at TEM.

Funny how these things work out. Would I ever have followed one of my life-long ambitions had the WMF fiasco not occurred? Or do we sometimes need a little creative destruction in our lives to set us on the right path?

2 comments:

Craig Ruff said...

A case of shoot the messenger? Typical management solution, blame everyone else but yourself. After all you were neither responsible for the lack of bug fixes in the development version, nor for the inability of the relevant parts of the organization to communicate.

Chip Overclock said...

I apologize if I gave the impression that I was trying to deflect the blame for this fiasco from myself. Quite the contrary. In complex situations like this, there is always plenty of blame to go around. Like most catastrophes, there are multiple causes that all line up to allow it to happen; if that weren't the case, existing processes would have prevented the catastrophe in the first place. But I feel that part of being a professional is taking responsibility for your own actions, even when under the circumstances it's hard to see that things could have turned out any differently. In the end, it worked out really well for me, probably more so than for any of the other parties involved. And I learned several valuable lessons that I can now apply to future endeavors. I'm not complaining. My telling of this story is at least in part in the spirit of mentoring and serving as a warning to others, and in part remarking on how something very good can come out of something that at the time seems quite bad.