Friday, April 20, 2012

All the Interesting Problems Are Scalability Problems

In "It's Not Just About Moore's Law" [2006] I presented this graph based on research I had done while I was working in the computing division at the National Center for Atmospheric Research in Boulder Colorado.

Power Curves

The vertical logarithmic axis shows how technologies change over linear time on the horizontal axis. Here are some of the assumptions I used, which were believed to be true at the time I mined the data [1997]:

Microprocessor speed doubles every 2 years.
Memory density doubles every 1.5 years.
Bus speed doubles every 10 years.
Bus width doubles every 5 years.
Network connectivity doubles every year.
Network bandwidth increases by a factor of 10 every 10 years.
Secondary storage density increases by a factor of 10 every 10 years.
CPU cores per microprocessor chip double every 1.5 years.

This data is old enough now that I probably need to revisit it. For example, microprocessor clock speed has stalled in recent years. But the basic idea is sound: all things change, but they do not change at the same rate. This means that over time the architectural decisions you made based on the technology at hand are probably no longer correct for the technology you have today. The balanced design you started with eventually no longer makes sense. The scalable solution you came up with five years ago may only scale up if everything you built it from scales up at the same rate. But over time, it doesn't. And it's only going to get worse.
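
If you want to see just how fast a balanced design drifts apart, the arithmetic fits in a few lines of C++. This is only a back-of-the-envelope sketch using the 1997 doubling periods from the list above; each growth factor is two raised to the number of years elapsed divided by the doubling period.

    #include <cmath>
    #include <cstdio>

    // Growth factor for a technology that doubles every doubling_years,
    // projected years into the future: 2^(years / doubling_years).
    static double growth(double years, double doubling_years) {
        return std::pow(2.0, years / doubling_years);
    }

    int main() {
        const double years = 10.0;
        std::printf("Projected growth over %.0f years:\n", years);
        std::printf("  CPU speed (doubles every 2 years):  %6.1fx\n", growth(years, 2.0));
        std::printf("  Memory (doubles every 1.5 years):   %6.1fx\n", growth(years, 1.5));
        std::printf("  Bus width (doubles every 5 years):  %6.1fx\n", growth(years, 5.0));
        std::printf("  Bus speed (doubles every 10 years): %6.1fx\n", growth(years, 10.0));
        return 0;
    }

Ten years out, the processor is about thirty-two times faster while the bus it talks over is only twice as fast, which is exactly how a design that was balanced on day one turns into a bottleneck hunt.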

It has been my experience that all the interesting problems are scalability problems. This graph shows that there is a temporal component to scalability.

People that don't work in technology have this idea that artifacts like hardware and software are somehow frozen in time. Those people are running Windows 98 on a 300MHz desktop with Office 97. People that work in technology know nothing could be further from the truth. Technology changes so quickly (see above) that it's a Red Queen's Race just to stay in one place and keep everything running.

In "The Total Cost of Code Ownership" [2006] I presented yet another graph, based mostly on data from Stephen Schach's book Object-Oriented and Classical Software Engineering [McGraw-Hill, 2002], which he in turn based on surveys of actual software development projects.

Software Life-Cycle Costs - Schach 2002

Notice that by far the largest share of the cost of the software development life cycle is maintenance, that is, making changes to the software code base after initial development has been completed. It amounts to two-thirds of the entire cost of the software over its life cycle. If you could somehow completely eliminate the entire cost of all the initial development and testing, you would have reduced your software life cycle cost by only a third.

Surprised? Do you think that number is too high or too low? Most people that have never worked with a large code base that supported a broad product line think it's too high. Those are usually the same people that don't have processes with which to measure their actual software life cycle costs. But organizations that do have to support multi-million line code bases that are part of product lines that generate tens of millions of dollars in revenue annually think that number is too low. Way too low. I've heard the number 80% bandied about.

Les Hatton has observed that software maintenance can be broadly classified into three categories: corrective (fixing code that doesn't work), perfective (improving some aspect of working code, such as performance or even cost of maintenance), and adaptive. It's this last category that brings these two graphs together: adaptive maintenance is when you have to change your code because something outside of your control changed.

In my work in the embedded domain, adaptive maintenance frequently occurs because a vendor discontinued the manufacture of some critical hardware component on which your product depends, and there is no compatible substitute. (And by the way, for you hardware guys, pin-compatibility just means you can lay that new part down without spending hundreds of thousands of dollars to re-design, re-layout, and re-test your printed circuit board. With complex hardware components today that may have an entire microcontroller core and firmware hidden inside, that don't mean squat. Case in point: surface-mount solid state disks, to pick an example completely at random.) I've seen a product abandoned because there was no cost-effective solution.

In my work in the enterprise server-side domain, it's not any different. Vast amounts of software are built on commercial or open-source frameworks and libraries. You want to upgrade to the latest release to get critical bug fixes, or new features you've been waiting for that make the difference between making and missing your ship date, only to discover that the software on which you so critically depend has been improved by the provider roto-tilling the application programming interface. This is the kind of thing that sends product managers running around like their hair is on fire.

So here's the deal. You can't get around this. It's like a law of thermodynamics: information system entropy. Because all things change, but not at the same rate, you always face a moving target. Adaptive maintenance is an eternal given. The only way to avoid it is for your product to fail in the marketplace. Short-lived products won't face this issue.

Or you can learn to deal with it. Which is one of the reasons that I like Arduino and its eight-bit Atmel megaAVR microcontroller as a teaching and learning platform.

Wait... what now?

Close Up of JTAG Pod and EtherMega Board

This is a Freetronics EtherMega, sitting a few inches from me right now, festooned with test clips connected to a JTAG debugging pod. It's a tiny (about four inches by two inches) Arduino-compatible board with an ATmega2560 microcontroller. The ATmega2560 runs at 16MHz and its Harvard architecture features 256KB of flash memory for executable code and persistent data and 8KB of SRAM for variable data.
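
One immediate consequence of that Harvard split is that constant data competes with your variables for those 8KB of SRAM unless you deliberately leave it in flash. Here is a minimal Arduino-style sketch showing the usual idioms for doing that; the table and the message are made-up examples, not code from any real product.

    #include <Arduino.h>
    #include <avr/pgmspace.h>

    // Constant table kept in the 256KB of flash (program memory) instead
    // of being copied into the scarce 8KB of SRAM at startup.
    static const uint8_t DAYS_IN_MONTH[12] PROGMEM = {
        31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
    };

    void setup() {
        Serial.begin(9600);
        // The F() macro keeps the string literal in flash as well.
        Serial.println(F("month length lookup from program memory"));
    }

    void loop() {
        for (uint8_t month = 0; month < 12; ++month) {
            // Flash is a separate address space on the AVR, so the value
            // has to be fetched explicitly with pgm_read_byte().
            Serial.println(pgm_read_byte(&DAYS_IN_MONTH[month]));
        }
        delay(1000);
    }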

Mac Mini

This is my Mac Mini, on which I'm writing this article right now. It's smaller than a manila file folder. It runs at 2.4GHz, and its von Neumann architecture features 4GB of RAM.

Ignoring stuff like instruction set, bus speed, and cache, my Mac Mini has a processor that runs about 150 times the speed of that on the EtherMega. But it has almost sixteen thousand times the memory. And that's ignoring the Mac's 320GB disk drive.
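
If you want to check my arithmetic, it fits in a couple of lines of C++; the numbers are just the clock rates and memory sizes quoted above.

    #include <cstdio>

    int main() {
        const double mini_hz  = 2.4e9;                     // Mac Mini: 2.4GHz
        const double mega_hz  = 16.0e6;                    // ATmega2560: 16MHz
        const double mini_mem = 4.0 * 1024 * 1024 * 1024;  // 4GB of RAM
        const double mega_mem = (256.0 + 8.0) * 1024;      // 256KB flash plus 8KB SRAM

        std::printf("clock ratio:  %.0fx\n", mini_hz / mega_hz);    // prints 150x
        std::printf("memory ratio: %.0fx\n", mini_mem / mega_mem);  // prints 15888x
        return 0;
    }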

This means when writing code for the tiny EtherMega, any assumptions I may have been carrying around regarding space and time trade-offs based on prior experience on desktop or server platforms get thrown right out the window. For example, depending on the application, it is quite likely better to re-compute something every time you need it on the EtherMega than it is to compute it once and store it, because on the EtherMega bytes are far more precious than cycles.
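
As a concrete illustration of that trade-off, consider something as mundane as trigonometry. On the Mac I would happily cache a big table of precomputed values; on the ATmega2560 a 1024-entry table of floats would eat half of the SRAM all by itself. This is only a hypothetical sketch of the idea, not code from Amigo or from any shipping product.

    #include <Arduino.h>
    #include <math.h>

    // Desktop habit: precompute and cache. A 1024-entry float table is
    // trivial on a machine with 4GB of RAM, but here it would consume 4KB
    // of the ATmega2560's 8KB of SRAM, half of everything the program has.
    //
    //   static float sine_cache[1024];  // deliberately NOT done on this target
    //
    // Embedded habit: spend cycles instead of bytes and recompute on demand.
    static float sine_of_degree(int degree) {
        return sin(degree * (M_PI / 180.0));
    }

    void setup() {
        Serial.begin(9600);
    }

    void loop() {
        for (int degree = 0; degree < 360; degree += 45) {
            Serial.print(degree);
            Serial.print(F(" -> "));
            Serial.println(sine_of_degree(degree), 3);  // recomputed every pass
        }
        delay(1000);
    }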

I can hear my colleagues from my supercomputing days at NCAR laughing now.

NCAR Mesa Lab (looking East)

I'm not the first to have noticed that the skill sets for developing software for embedded systems are very much the same as those required to develop for high performance computing or big distributed platforms. I built a career around that very fact. There is a trend in HPC (the domain formerly known as supercomputing) to re-compute rather than compute-and-store, or compute-and-transmit, because raw computing power has grown at a much faster rate than that of memory or communications bandwidth (again: see above). It turns out that developing software for itsy bitsy microcontrollers has more in common than you might think with developing for ginormous supercomputers.

Writing software for these tiny microcontrollers forces you to consider serious resource constraints. To face time-space tradeoffs right up front. To really think about not just how to scale up, but how to scale down. To come to grips with how things work under the hood and make good decisions. There is no room to be sloppy or careless.

Working with technologies like Arduino and FreeRTOS on my Amigo project has made me a better, smarter, more thoughtful software developer. I am confident it can do the same for you, regardless of your problem domain, big or small.
