The Solution is More than Just Process™

Ars Technica has been around since before the dotcom bust. As a news, editorial, and reviews site for most things computing- and technology-related, it has more than earned its credibility, both for readability and for journalistic depth, particularly in reviewing gadgetry and hardware. So it was surprising to see this piece on what ails Microsoft Windows 10:

Microsoft’s problem isn’t how often it updates Windows—it’s how it develops it

To oversimplify, the author believes that the bugs cropping up in recent updates (with a particularly bad one in the last release causing user data loss) have to do with how the company develops Windows. Specifically, every release includes a number of features, each of which requires a good amount of planning and coding, but the bulk of development time is actually spent on integration and testing, that is, on merging the feature into the master build. This is contrasted with the modern accepted best practice of continuous integration (CI) and even continuous deployment (CD), whereby the feature is first tested thoroughly locally, and only gets integrated and deployed as part of the master build after it has been vetted. With CD, the idea is that the master build (i.e., “trunk” in version control system parlance) can always be deployed, because every committed feature should have been well tested as a precondition to getting onto the trunk in the first place.
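The gating idea is simple enough to sketch in a few lines of Python. This is only an illustration of the principle, not how Microsoft or any real CI system implements it; the `Feature` and `Trunk` classes, the feature names, and the stand-in checks are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Feature:
    """A hypothetical feature branch: a name plus the checks that vet it."""
    name: str
    checks: List[Callable[[], bool]] = field(default_factory=list)

    def is_vetted(self) -> bool:
        # Vetted only if it has at least one check and every check passes.
        return bool(self.checks) and all(check() for check in self.checks)


@dataclass
class Trunk:
    """The master build: only vetted features are ever merged."""
    merged: List[str] = field(default_factory=list)

    def try_merge(self, feature: Feature) -> bool:
        if not feature.is_vetted():
            return False  # rejected before it can destabilize trunk
        self.merged.append(feature.name)
        return True


trunk = Trunk()
# Stand-in checks: lambdas that simply report pass (True) or fail (False).
accepted = trunk.try_merge(Feature("dark-mode", [lambda: True, lambda: True]))
rejected = trunk.try_merge(Feature("file-sync", [lambda: True, lambda: False]))
```

Under these assumptions, trunk only ever contains features whose checks all passed, which is what makes "trunk is always deployable" a plausible promise. It says nothing about how good the checks themselves are.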

Fixating on process is attractive, in large part because it keeps the other parameters (the number of people, the breadth of features and technological advancements, the promised timeline) the same. Process is a solution against inefficiency, the waste in development that comes from miscommunication, misalignment, poor project management, and a dozen other factors that degrade quality and timeliness. It’s hard to argue against making work more efficient and employees more productive[1].

But compared with other software projects, Windows features unprecedented scale and legacy support, and as an operating system it serves as the intermediate abstraction between apps and a wide range of PC hardware. The counterexample cited, Google Chrome, is indeed more agile and features 6-week release cycles, but the two pieces of software are at least an order of magnitude apart in complexity[2].

If we contrast Windows with other operating systems, it’s not obvious that its quality is as abysmal as the article suggests. Every major version of iOS, Android, and macOS has had a buggy and incomplete x.0 launch; it’s now common wisdom to wait for the first round of bugs to be ironed out before upgrading to a new operating system, if not to wait for the x.1 patch altogether. OS vendors now commonly announce their new versions and then run a prolonged beta period, precisely to root out issues that surface across thousands of real users and that cannot be caught by manual or automated testing in-house.

With CI and CD, the cost of integration (really, unavoidable at some level when you have multiple streams of development on a single software package) is pushed earlier, to when the code is worked on and committed locally. This is likely better than integrating and testing the entirety of Windows after the fact, but it does not magically reduce the cost and time of integration to zero. In fact, most of the bugs that slip past development testing and betas are integration-type bugs that span multiple systems worked on by disparate teams.
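A toy example shows how this kind of bug gets through. The sketch below is entirely hypothetical (the `normalize_path` helper, the `FileIndex` class, and the path are invented for illustration, and none of it reflects the actual Windows bug): two teams' components each pass their own tests, yet fail when combined.

```python
def normalize_path(path: str) -> str:
    """Team A's helper (hypothetical): lower-cases paths for consistency."""
    return path.lower()


class FileIndex:
    """Team B's index (hypothetical): stores paths exactly as given."""

    def __init__(self) -> None:
        self._entries = set()

    def add(self, path: str) -> None:
        self._entries.add(path)

    def contains(self, path: str) -> bool:
        # Case-sensitive lookup: correct by Team B's own specification.
        return path in self._entries


# Each team's own test passes in isolation.
team_a_ok = normalize_path("C:/Users/Ann") == "c:/users/ann"

index = FileIndex()
index.add("C:/Users/Ann")
team_b_ok = index.contains("C:/Users/Ann")

# The integrated path: Team A's output is fed into Team B's lookup.
integrated_ok = index.contains(normalize_path("C:/Users/Ann"))
```

Here `team_a_ok` and `team_b_ok` are both true while `integrated_ok` is false. Neither component is wrong on its own terms; the defect lives in the unstated assumption between them, which is exactly the part that testing each feature in isolation does not exercise.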

The Windows bug was certainly embarrassing. But given the overall complexity and number of moving pieces of such a product, I’m skeptical that a new process, even one now considered a best practice, is enough to make up that difference. That’s not to say that there’s no room for improvement[3], but selling the adoption of an “agile process” as the panacea feels a little too much like ineffective consulting.

[1] There is a difference.
[2] Note that Windows contains a full-fledged web browser in Microsoft Edge.
[3] I talked with someone recently who worked at Microsoft for a number of years half a decade ago, and they mentioned that back then, the company had two Software Development Engineers in Test (SDETs) for every Software Engineer (SWE), which does seem mightily broken.