The 16 Critical Do’s and Don’ts of Project Management

Antelope Canyon. Page, Arizona.

“There is always a well-known solution to every human problem — neat, plausible, and wrong.” — H.L. Mencken. 1880–1956.

I retired from the computer industry four years ago. I’ve continued to advise a few startups but, truth be told, I’ve spent more hours on a bicycle than behind a desk; and that’s just the way I like it. My days of hands-on management of engineering teams are over. At best, I can offer words of advice to a younger generation.

I had a very successful career and am proud of what we’ve accomplished over the past forty years that I’ve been involved with the industry. Somehow, I managed to work on everything from supercomputers to personal computers to the cloud, from distributed operating systems to embedded devices to the backbone of the internet. Having worked on literally hundreds of projects throughout my career, there are bound to be plenty of success stories as well as failures and mistakes.

I rarely talk about the success stories because I worry it comes across as boastful, but I can pretty much guarantee that not a day goes by in which you don’t use, hundreds of times and without even realizing it, some technology that I worked on at some point in my career. When I think back, though, I find I spend more time dwelling on my failures. For some reason, the mistakes are the ones that haunt me; perhaps because they’re the ones that taught me the most lessons.

Every once in a while, I think back to some decision made twenty years ago and admit to myself: you know, I was absolutely wrong on that one. It’s so obvious now. I made the decision based on emotions, based on inter-group politics, based on gut feel, based on inertia, not based on data. Ouch. How stupid of me. If I could only do it all over again…

Let me give at least one concrete example, hopefully without using too much jargon, and follow it up with lessons learned. I hope I don’t offend any of my former colleagues who spent years doing excellent work on these projects. They did exactly what was asked of them and more. It’s not like we had a bad design or a poor implementation; just the opposite. In fact, we had some of our best people working on the project and they all did great work. It’s just that they spent years working on a project that should have never happened. That was at least partly my fault. Mea culpa. I should have known better.

One of my early jobs at Microsoft was to oversee several engineering teams in the Windows division. One of these teams delivered all the core file system functionality to the broader Windows businesses selling solutions to our customers, both consumer editions running on PCs and laptops and enterprise editions running on servers.

One of our success stories was the FAT file system: the same file system that you use every time you plug in a USB thumb drive or an external disk or even an old-fashioned floppy disk! Pretty much every device in the industry, almost regardless of the OS it runs, still uses this simple on-disk format to expose file level functionality to applications. We managed to make it an industry standard for all removable media and for devices like digital cameras.

FAT was originally designed in the early personal computer days and lacked many features. There was no built-in security, it had limited file naming capabilities, the original version supported only small disks, it didn’t have resiliency characteristics and could easily be corrupted during a system crash, etc. But it was simple, fast, and ubiquitous.

The same team was also responsible for NTFS, a modern enterprise-class journaling file system intended for high end scenarios with full security, massive scalability, high reliability, and tons of features.

Both FAT and NTFS projects had been started before my time and I inherited them along with dozens of other projects across multiple teams. And so, for a few years, I got to prioritize investments in these and various other related technologies.

For example, we built a new file system, exFAT, that extended the capabilities of the FAT file system to larger disks. This was an obvious and, in hindsight, well-placed investment. As disks grew in size from megabytes to gigabytes to terabytes, we could continue to support them in new environments and applications. The effort also included working with industry partners like manufacturers of external disks, digital cameras, and embedded devices to ensure compatibility and interoperability. That’s why practically every device on the market today supports FAT and exFAT.

FAT was, by all accounts, a wild success. NTFS was also a great success, eventually shipping on many billions of Windows PCs and servers. I could stop there, pat myself (and my team) on the back for a job well done, and go off for another bike ride. But I promised to also talk about a mistake.

One of the projects I chose to fund, staff, and build during those years was one called TxF: Transactional File System. This was not an entirely new file system but rather an extension of NTFS. The idea was that we would add database-style transactional ACID semantics to NTFS. It was an ambitious project, extremely complex in both architecture and implementation. It could do things that no other file system could accomplish. You could start a transaction, create files and folders, modify them, delete them, rename them, and then eventually “commit” the entire transaction in a single atomic step, making all your changes visible to the rest of the system; or you could just as easily “abort” the transaction and all files and folders would magically revert back to the state they had before the start of the transaction. Such functionality is typical of databases but, to the best of my knowledge, had never been implemented in any other file system.
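The commit/abort behavior described above can be sketched in miniature. The toy Python below is my own illustration, not the actual TxF API (which lived in the kernel and could transact arbitrary existing files across the volume): all writes go to a private staging directory, commit publishes them with a single atomic rename, and abort simply discards the staging area.

```python
import os
import shutil
import tempfile

class FileTransaction:
    """Toy illustration of commit/abort semantics for a batch of new files.

    All writes land in a private staging directory; commit() publishes
    them with one atomic rename, abort() discards them. (The real TxF
    could transact changes to existing files anywhere on the volume;
    this sketch only handles publishing a fresh directory of files.)
    """

    def __init__(self, target_dir):
        self.target_dir = target_dir
        # Stage next to the target so the final rename stays on one filesystem.
        self.staging_dir = tempfile.mkdtemp(
            dir=os.path.dirname(os.path.abspath(target_dir)))

    def write(self, name, data):
        # Invisible to the rest of the system until commit().
        with open(os.path.join(self.staging_dir, name), "w") as f:
            f.write(data)

    def commit(self):
        # One atomic step: before this rename nothing is visible,
        # after it everything is.
        os.replace(self.staging_dir, self.target_dir)

    def abort(self):
        # Revert: throw away the staged changes, leaving no trace.
        shutil.rmtree(self.staging_dir)

# Commit makes both files appear at once...
tx = FileTransaction("published")
tx.write("a.txt", "hello")
tx.write("b.txt", "world")
tx.commit()

# ...while abort leaves nothing behind.
tx2 = FileTransaction("never-published")
tx2.write("c.txt", "discarded")
tx2.abort()
```

The single `os.replace` is what makes the commit atomic here; everything before it is private scratch space, which is also roughly how the simpler userspace alternatives to TxF worked.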

As soon as I heard about TxF, I thought it was a great idea and enthusiastically endorsed it. We had no trouble getting smart engineers to work on the project. We even spun off three satellite projects, KTM (Kernel Transaction Manager), TxR (Transactional Registry), and CLFS (Common Log File System), that together enabled a whole host of interesting scenarios, including participation in distributed transactions with traditional databases, atomic installation of applications (along with “magic” perfect uninstall with a single “abort transaction” API call), atomic update of websites, etc.

So we built such a thing. And it worked. And it sliced and it diced. And it had an elegant design. And it was implemented well. And it shipped in several releases of Windows to billions of customers. It took years to build, spanning multiple releases, and was even incorporated into the much-maligned WinFS project as part of Longhorn. As far as I can tell, fifteen years later, it’s still included in current Windows releases. Many patents were filed, many engineers wrote many millions of lines of code so that we could…

… and therein lies the problem: In hindsight, some of those scenarios seem contrived; and most of the legitimate ones could have been accomplished with simpler (if less architecturally elegant) solutions. Even the current documentation on Microsoft’s website pretty much admits this while gently deprecating the feature in preparation for future removal: “Microsoft strongly recommends developers utilize alternative means to achieve your application’s needs. Many scenarios that TxF was developed for can be achieved through simpler and more readily available techniques. Furthermore, TxF may not be available in future versions of Windows.”

We were naïve in thinking such a capability would usher in a whole slew of new scenarios. What application developer wants to take a dependency on a future major release of an operating system which will take years to gain broad adoption? The vast majority of systems “in the field” continue to run older versions of operating systems long after new versions are released, for every reason from laziness to app compatibility concerns to corporate compliance. This is less true in the new age of frequent automatic updates in consumer scenarios but is still the case in corporate environments. The right approach is to build services “over the top”, without taking deep OS dependencies, so you can ship fast and often, iterate quickly, and only incorporate functionality into the base platform after it’s proven itself.

My failure, our failure, was that we fell in love with the technology we’d invented and ended up innovating at the wrong level of the stack. We generalized from a few interesting and very hard scenarios to convincing ourselves that we needed a general purpose architectural solution that we, being the file system team, of course, knew how to build deep in the bowels of the operating system.

By the time we shipped even the first version of this ambitious project, the world had moved on. By then, it was clear that applications were being built to not depend on such complex file system functionality and instead performed such tasks at higher levels of the stack. They used distributed APIs across the cloud, storing stream data such as movies, photos, and songs in files but depending on databases for structured data storage, metadata operations, indexing and naming and searching as well as for transactional semantics. Intelligence had moved to higher levels of the application stack, relegating the traditional file system to simple streaming and archival needs.

It’s simple to see, in hindsight, that the entire project was ill-conceived and added significant complexity without sufficient end user value. Mea culpa! We should have never built TxF! Why did we all sit in meetings week after week and convince ourselves that it was the right thing to do? Innovator’s dilemma. Period. We were the file system team and, dammit, we would innovate in the file system!

That’s about as harsh a verdict as I’ve ever delivered on a project. I feel bad because I know a lot of smart people did excellent work on it for many years but, in hindsight, we should have never built it. I take full blame for my role; I should have known better. What still baffles me, though, is how we — everyone involved, dozens of people at all levels of the organization — convinced ourselves again and again, month after month, year after year, that the project continued to make sense, that we should keep funding it. What were we thinking?

“People invest their IQ in buttressing their own case rather than in exploring the whole issue more fully and evenhandedly.” — David Perkins. Learning to reason: The influence of instruction, prompts and scaffolding, metacognitive knowledge, and general intelligence on informal reasoning about everyday social and political issues.

The story of TxF is not a unique one. Over the years, I’ve inherited many projects with similar structural issues and seen other groups make similar mistakes. Now extrapolate from that to the entire computer industry. The information highway is littered with such projects. I wonder how many of you, reading this story, see parallels in it with your own current or former projects. The lessons are obvious in retrospect. Others have pointed them out but I guess you don’t learn them until you personally live through them. I call them, collectively, the common sense rules of project management.

Start with crisp, well-defined customer scenarios and work backwards to a solution. Starting with the solution first, no matter how innovative at heart, is akin to grabbing a hammer and then looking around for nails. As we all know, once you have a hammer in hand, everything looks like a nail. And the more nails you find, the more the requirements morph over time; you end up building a system riddled with compromises and burdened with unnecessary complexities that wants to be all things to all people and doesn’t really address any of the scenarios satisfactorily. We didn’t start in a vacuum, of course; we had a few core scenarios but not enough to warrant a complex generalized solution. Once we’d convinced ourselves that we had to build such a solution, it became our hammer and we started looking for nails.

Beware the opportunity cost. Every resource spent on TxF was a resource not used to build a solution like Amazon S3. Instead of embedding complexity in the base platform and asking it to do more, S3 chose to dramatically simplify the API and services provided by the base storage system and even went so far as to remove age-old capabilities such as folder hierarchies and file sharing semantics, moving all such operations to higher levels of the stack. They zigged while we zagged and the results speak for themselves. We built the wrong thing (and did it really well!), using up resources and brainpower that could have been utilized elsewhere. And we did it because we could, not because we had to. Because we were Microsoft. I would like to think that startups, being resource constrained, are forced to be much more careful. But, then again, …

Keep it simple, stupid! (KISS) Every line of code added to a system increases its complexity, making the system more fragile and error-prone. Yet, most engineers rarely consider this fact when they work on a project. They happily write a million lines of code but rarely question the long term implications, the compatibility and interop issues, the additional test cases, the customer support burden, etc. they’re introducing. As a supposedly senior leader with many years of experience behind me, I was well aware of these facts yet somehow convinced myself that the alternatives weren’t acceptable… not as architecturally elegant, not as unique and differentiated from other platforms, not as fast, etc. All reasonable arguments but at what cost? What we ended up building was an elegant world-class solution and definitely a unique capability for an operating system but it was also unnecessarily complex and took too long to deliver. Many of the purported scenarios could have been accomplished with much simpler solutions at higher levels of the stack.
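To make the “simpler solutions at higher levels of the stack” point concrete: the most common thing applications actually wanted from a transactional file system, never exposing a half-written file after a crash, falls out of the decades-old write-temp-then-rename pattern. The sketch below is my own (plain Python, nothing TxF-specific):

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` with `data` so that readers see either the old
    contents or the new contents, never a partially written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Write to a temp file on the same filesystem as the destination,
    # so the final rename cannot degrade into a copy.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the bytes to stable storage
        os.replace(tmp, path)      # swap the new file in atomically
    except BaseException:
        os.unlink(tmp)             # clean up the temp file on failure
        raise

atomic_write("config.txt", "v1")
atomic_write("config.txt", "v2")   # readers never observe a torn file
```

Under the hood, `os.replace` maps to `rename(2)` on POSIX and to `MoveFileEx` with `MOVEFILE_REPLACE_EXISTING` on Windows; on a journaling file system like NTFS that rename is the only step readers can observe, so no transaction machinery is required for this class of scenario.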

Don’t fall in love with the technology. This is a tough one for propellerheads like me but it’s the right answer. You’ll thank me later.

Solve problems at the right level of abstraction even if that’s not the level of the stack you own or control. This one is an easy trap to fall into, especially in large organizations with many divisions, overlapping team charters, and widely divergent customer requirements. At least unconsciously, part of the reason we did what we did was because we were the file system team and felt our responsibility was to… guess what… innovate at the file system level. The right answer, of course, is to solve the customer problem but that often gets lost amid cross-group rivalries and turf battles.

Ship early, ship often. By building the functionality into the bowels of the operating system, we forced ourselves to abide by the slow release cycles of Windows (once every three years). Software delivery models have changed dramatically over the past decade, with the internet enabling fast and painless downloads and “Call Home” functionality, making this rule brain-dead simple to follow today. Compare that to the world we lived in at the time, when we had to ship CD-ROMs (or worse, floppy disks a few years before that!). That constraint, in turn, forced us to bundle disparate changes into large packages that were infrequently delivered and by necessity disruptive in nature, simultaneously reducing our ability to experiment and to get timely customer feedback. Not to mention the fact that we had zero ability to collect telemetry or perform remote diagnostics. We were driving blind. Perhaps we wouldn’t have made the same choices if our hands weren’t tied when it came to delivering the bits. The right answer, especially in this day and age of continuous delivery practices, is to build services and functionality above the OS first, ship frequently, iterate like mad, and then incorporate them into the base platform only once proven.

“Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.” — Samuel Beckett. Worstward Ho.

Fail fast. This is excellent advice not just for programmers dealing with error conditions but also for project managers and team leaders. It’s often better to kill a struggling project than to drag it on for another year. Unfortunately, that decision is also the harder one to make. It’s easier, especially at a time when the company and the product are doing well, to keep kicking the can down the road than to admit defeat.

Don’t drink your own Kool Aid. All humans suffer from confirmation bias, the tendency to favor information that confirms our existing beliefs and to dismiss data that doesn’t. Build the discipline to question your assumptions on a regular basis. As Mark Twain famously said: “What gets us into trouble is not what we don’t know. It’s what we know for sure that just ain’t so.” One of my other favorite quotes is from How Not to Be Wrong: The Power of Mathematical Thinking by Jordan Ellenberg: “I find it’s a good habit to put pressure on all your beliefs, social, political, scientific, and philosophical. Believe whatever you believe by day; but at night, argue against the propositions you hold most dear. Don’t cheat! To the greatest extent possible you have to think as though you believe what you don’t believe. And if you can’t talk yourself out of your existing beliefs, you’ll know a lot more about why you believe what you believe. You’ll have come a little closer to a proof.” I wish I’d practiced this habit more methodically.

Beware Second System Syndrome. This is a well-known curse of those building a second generation of a successful system. They have a tendency to throw in everything but the kitchen sink, ignoring the complexity tax. This is often justified internally by claiming that all the bells and whistles in the new version have been requested by customers, so we’re doing the right thing by “listening to the customer”. The result is bloat and complexity.

When in doubt, the correct answer is usually “no”. I could argue that this was just a small project among a portfolio of hundreds that I was responsible for over the years, one bet among many, never occupying more than a half dozen resources. I could argue that it added unique functionality to the platform. I could argue that it shipped “successfully” to billions of customers. But those are all rationalizations after the fact, none of them sufficient cause for building something that ultimately failed to deliver enough value to customers. An honest cost-benefit analysis of the possible alternatives would have made this clear at the time: do the benefits of adding this feature justify the complexity (and associated costs) it introduces? And be honest about the costs over the years. TxF did not “cost” the half dozen engineers that we put down in the headcount column in the spreadsheets. By now, some twenty years later, it has cost hundreds of man years. When in doubt, say no.

Sunk costs are just that: already sunk. No use crying over spilled milk. Just move on. Never use the “but it’s almost finished” excuse. 95% finished is only 5% done. Trust me. A good litmus test is to ask yourself whether you’d start this project today if you weren’t already doing it.

Shipping is important; shipping the right thing is even more important. Don’t tie rewards to shipping-at-any-cost.

Don’t be afraid to tell the emperor he has no clothes. It’s not just that I, in my position and at that particular point in time, made a single bad decision on one particular day to pursue this project. What amazes me is that it survived scrutiny in meeting after meeting, in presentation after presentation, month after month, quarter after quarter, through a dozen reorgs or more. We’re talking six years of meetings, project reviews, architecture reviews, budget reviews, … before the first version ever shipped to a customer. By the time it shipped, the team didn’t even report to me any more; yet, TxF kept chugging along. Why didn’t any of us stand up once during those years and say: “Hey, this was a bad idea; let’s kill it”? Or, even, “It was a good idea at one point but not any more.” As opposed to “Let’s continue to fund and support it, let’s ship it to billions of customers in order to support scenarios that can, frankly, be solved in much easier ways.” Which is what we chose to do instead.

For best results, “Move up the stack, young man!” Much like the quote that inspired it, the point of this lesson is that in order to continue to innovate in our industry, you need to constantly “go west”, as it were. Keep climbing up the stack of abstraction; find new land to grab. Sometimes, the best answer is to stop innovating at certain levels of the stack.

SDLC is your friend; and so is the other SDLC. We would have avoided many of the pitfalls mentioned above if we had understood, internalized, and followed the Systems Development Life Cycle paradigm diligently. If you do so, make sure you take each step seriously. I’ve been part of organizations that claim to do so but only pay it lip service and use it to tick check boxes on project management slides. That’s a sin even worse than not using the model. But that’s not enough. The other SDLC, the Security Development Life Cycle, needs to also be internalized by the whole organization in order to deliver not just functional, but also secure, software.

Last but not least, heed the order from Apocalypse Now, “Terminate… with extreme prejudice!”, but apply it to projects instead of people. As the leader of an organization, if you are handed a budget cut, be brave and really kill a project instead of giving everyone a 10% haircut across the board. This rule applies doubly for platform projects with ecosystems around them.

So many lessons; so little time.

[ Thanks to Brian Dewey, Rob Short, Surendra Verma, and Mark Zbikowski for their valuable feedback on earlier drafts of this post. ]



Ben Fathi

Former {CTO at VMware, VP at Microsoft, SVP at Cisco, Head of Eng & Cloud Ops at Cloudflare}. Recovering distance runner, avid cyclist, newly minted grandpa.