Moneyball for Software Teams – An Imperfect Heuristic for Quantifying Dev Performance

Someone on Reddit once asked for “unethical career advice” for software developers. Here’s the most highly rated answer, with over two thousand upvotes:

Make a good first impression and you’re set for a while. Something takes longer? They’re a good developer so I guess we under pointed that. It is actually insane to me how bad of an employee I was at some points in my career and not only didn’t get fired but got good reviews. Meanwhile employees who actually did more than me for those months, but had a bad reputation were getting bad reviews.

And a highly rated follow-up comment from another developer:

it used to seem insane to me too until i kicked ass at a job that i still got fired from; then i learned that it’s mostly about whether or not they like you and MUCH less about your skills or experience than i had previously thought.

And lest you think this is just malcontent developers at mediocre companies, here’s a Facebook manager describing their experience:

During the reviews process, managers compete against their peer managers to secure strong grades for their direct reports. Managers are compelled to vouch fiercely for their favorite employees, but don’t speak up for employees they don’t like or who have previously received poor ratings. “There’s a saying at Facebook that once you have one bad half, you’re destined for bad halves the rest of your time there. That stigma will follow you,” said a manager who left in September.

And this isn’t just a problem that affects the occasional unlucky developer – the heuristics companies currently use to evaluate performance are significantly harmful to the business and to customers as well. Here’s the former lead of Google Docs describing how Google’s performance review process made it so hard for them to build vital features that users really needed:

we had a lot of small bugs and usability issues, often in areas where we weren’t at parity with Excel.  Users wanted us to implement … pretty standard spreadsheet fare, and very reasonable requests. … However it was a constant struggle to prioritize these types of issues vs. “bigger impact” projects. Our engineers cared about the product and wanted to polish it. But they also wanted to be promoted. And so we would deprioritize product polish for projects that looked better to a promotion committee

A ton of maladaptive symptoms in the corporate environment, such as the ones voiced above, can be traced back to one root cause – not having good ways of quantifying engineering performance. 

To be sure, quantifying anything is always an error-prone endeavor. And you’re bound to have a million people sagely warning you about the dangers of Goodhart’s Law – “when a measure becomes a target, it ceases to be a good measure”. But every time I hear a warning about Goodhart’s Law, I’m reminded of the experiences quoted above, and of all the other downsides of assessing performance on the basis of subjective feelings. Such as:

  • Top performers not being appropriately rewarded or recognized for the excellent work they’re doing
  • Significant pay disparities on the basis of gender, race, or personal connections
  • Unproductive meetings with too many attendees
  • Chasing new fads instead of sticking to tried and true solutions
  • Over-engineering in order to get promoted or pad one’s resume
  • Toxic managers and team leads that hamstring their team
  • Poor testing, CI/CD, code-review practices

If we had better ways of quantifying team performance, most of those symptoms would be rapidly addressed. Managers would immediately put a stop to poorly run meetings if they were presented with hard data demonstrating the negative impact on the team’s performance. No one would be underpaid because of “lacking charisma” or having the wrong skin color if their performance could be justified in a compelling and quantifiable manner.

Hence why quantifying performance was a priority for us at the startup I had founded some time ago. As someone who has worked at a number of companies, including a “startup hedge fund” as well as two of the FANGs, I had seen first-hand what worked well and what didn’t. As the founder of our startup, I had the freedom to try out my own ideas.

Despite its present ubiquity, sabermetrics was once considered a radical idea. Mainstream consensus held that you couldn’t rely on cold, impersonal numbers to evaluate a player’s baseball performance – that attempting to do so would only invite the dreaded Goodhart’s Law. It took decades before sabermetrics was polished and found mainstream acceptance, exploding in popularity with Billy Beane’s success with the Oakland A’s (popularized by the book and movie Moneyball). We believed we could do something similar in the field of software.

Our methods certainly aren’t perfect and their flaws are obvious. Despite this, we found it to be a net positive and more useful than the alternative – gauging performance on the basis of subjective feelings. I’m always open to discussing ways of further improving our heuristics. This is a summary of how we quantified engineering performance.

The Boring Stuff

Let me preface this by saying that many of the building blocks of what we did aren’t radically new. Most people have probably encountered these approaches in their own companies. I’m repeating them here only to provide context for the other, more unusual things we did.

  • Every task that someone works on is tracked as a “story” in our tracking tool (eg, Asana)
  • Each story is assigned to a single owner, and given a fixed number of points
  • Velocity can then be quantified at both the individual and the team level, by tallying at the end of each week the number of points completed that week

Because the above methods are so widespread, we’ll avoid spending too much time talking about them here. If you’re looking for more information regarding the above, I recommend reading other articles such as an introduction to agile/scrum/kanban etc.
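
Still, to make the mechanics concrete, here is a minimal sketch of how that weekly roll-up could be computed from completed stories. It assumes a simple in-memory representation rather than any particular tracker’s API, and the field and function names are purely illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Story:
    owner: str            # each story has exactly one owner
    points: int           # fixed number of points assigned when the story is created
    completed_week: str   # ISO week in which the story was finished, eg "2023-W14"

def individual_velocity(stories: list[Story]) -> dict[str, dict[str, int]]:
    """Points completed per owner, per week."""
    totals: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for s in stories:
        totals[s.owner][s.completed_week] += s.points
    return {owner: dict(weeks) for owner, weeks in totals.items()}

def team_velocity(stories: list[Story]) -> dict[str, int]:
    """Points completed by the whole team, per week."""
    totals: dict[str, int] = defaultdict(int)
    for s in stories:
        totals[s.completed_week] += s.points
    return dict(totals)
```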

Pay Attention to Velocity

One very curious observation always struck me on my previous teams – even though we were spending so many man-hours on velocity tracking as described above, the output went straight into a black hole, never to be seen again or to have any impact. It’s almost as though the team listened to an Agile™ coach tell them that this is the Agile™ way, but never actually saw any real benefit from examining the data.

To some extent, this makes sense. The way most teams calculate velocity is highly flawed, as discussed later in this article. Hence why teams have decided it’s best to mostly ignore it. But at our startup, we recognized the immense value that comes from accurate velocity metrics, and put significant thought into a process that produces them. Hence why velocity became the cornerstone of all our performance discussions.

Every week during team lunch, I would pull up the team’s UX velocity and show it to the entire team, using it both to stimulate discussion around team bottlenecks and to challenge the team to scale new summits in the coming week.

And every two weeks during my 1:1 discussions with each engineer, I would also pull up that individual’s personal velocity chart and use it to guide our performance discussion. If their chart was trending upward, I would congratulate them on a job well done. And if it was trending downward or significantly lagging behind their teammates’, I would ask what I could do to unblock them and help them achieve their full potential. At no point were performance discussions centered around “how I feel about them” or one-off anecdotal datapoints.

No Stack Ranking

One major benefit of using velocity to guide performance reviews is that there is absolutely no reason to stack rank your developers. In fact, when discussing velocity metrics in 1:1 conversations, you should explicitly avoid comparing them against one another. You want to create a culture where everyone is helping one another move as quickly as possible, in order to push the team’s velocity to ever greater heights. And even at an individual level, everyone’s goal should be to boost their own personal velocity. Not looking over their shoulders and hoping for someone else to trip up.

In large companies where no objective performance metrics exist, top-down stack-ranking mandates are understandable. How else would you force managers to confront the fact that not everyone is a high performer? No one wants to be the bad guy that hurts someone’s livelihood or confronts someone with performance problems. If people were left to their own subjective devices, everyone would be “exceeding expectations” and poor performers would simply coast along perpetually. Hence why companies like Google force managers to stack-rank their developers. Even though: 

  • Most people hate it
  • It destroys team morale and camaraderie, by pitting teammates against each other
  • Talented developers inevitably get fired just because they joined a team of high performers
  • Managers end up doing ridiculous things like firing the new guy just because he’s new, or purposely hiring low performers as sacrificial lambs

Thankfully, in a world where performance metrics do exist, this is completely flipped. It makes no sense to declare one developer a “top performer” and another a “below-average performer” if there is only a small difference in their performance. If someone’s velocity is stellar, they certainly should be rewarded proportionally for their contributions. And if someone’s velocity is far lower than that of others who currently work or have previously worked on the team, they certainly should be given additional coaching and suggestions for improvement. But the idea that developers should be rewarded or punished on the basis of stack ranking is an anachronism from a bygone era where no meaningful performance metrics existed.

No Gamifying Lines-of-Code, Commits, or Pull Requests

As surprising as it may sound, I’ve heard of multiple engineering leaders at multiple companies using metrics like lines of code, commits, and pull requests to quantify engineering performance.

Some do it explicitly and in a heavy-handed manner. To quote an article around Musk’s acquisition of Twitter (based on an alleged insider leak): “Elon Musk just force-ranked Twitter engineers & fired the bottom. Lines of code written in the past year was his metric.”

While others do it implicitly and with a light touch. When I was working at Amazon at the start of the COVID pandemic, I asked our division’s director how we were doing with our new remote-working culture. She responded that she hadn’t really seen any significant difference, neither positive nor negative – that in terms of “lines of code, or commits”, the metrics were mostly unchanged. She didn’t explicitly say it, and I never felt any pressure to gamify these metrics, but her comments revealed that she was using them at least partially as a proxy for engineering performance.

When such metrics are used purely as observational measures, they can indeed be remarkably useful. But let’s be honest – no business leader ever goes to the trouble of collecting and analyzing metrics purely for observational purposes. They do it in order to inform practical decisions. Such as return-to-office mandates. Or performance reviews, bonuses, and salary adjustments. And that’s when the problem arises. In such situations, the above metrics become trivially easy to game, in a way that is also very damaging to code quality. 

If you’re an engineer who wants to continue working remotely and leadership is using lines-of-code to decide whether to return to the office, you’re likely to write extremely bloated code, and defer bug fixes which touch very few lines of code. If you’re a senior manager and your vice president is using commit counts as one of the metrics to grade your performance, you’re going to pressure your managers to increase commit counts. Who will in turn pressure their engineers… who will in turn split up a single atomic and cohesive commit into many half-baked commits.

Perhaps you think you can avoid the above problems by … not telling anyone about what you’re doing. After all, if people don’t know that you’re using these metrics, they won’t game them, right? Call me naive, but I think transparency is the best policy and such tactics are bound to leak and backfire. Just see the various discussions around such topics on anonymous forums, such as the following thread:

I think that my manager (at Facebook) is very much biased towards software engineers who write loads of lines of code but says we don’t count it. But when doing psc says the amount of code written is less and tells stories of other developers writing tens of thousands lines. What the fuck is wrong with managers here ? Sometimes they say lines of code doesn’t matter and make you write so many quips and lead team and at psc time start looking at lines of code

I don’t think managers like to admit LOC matters, lest you try to game it. But even at (Google) I’ve seen this cited for Perf so obviously it is looked at. It isn’t everything but don’t neglect it, I’d say.

Ultimately, using lines of code, commit counts, or pull requests, to quantify engineering performance is a fool’s errand. All it does is create a culture where dishonest people write bad code to get ahead, and those with integrity are penalized and eventually quit in frustration.

Complexity – Not Time

Thankfully most teams are wise enough not to score sprint tasks on the basis of lines-of-code or commits. But there is a different anti-pattern that I’ve seen everywhere: scoring tasks on the basis of time estimates.

Deep down, the people running sprint-planning may know that they should be scoring tasks “based on complexity, not time.” But in reality, almost no one actually does this. When I join a new team and ask them what 1 point represents, they invariably respond with something along the lines of “1 point represents 2 days of work. If you think a task will take 2 weeks to complete, you should score it as 5 points.”

There are many reasons why this is bad. For starters, it actually makes it harder for you to predict when a task will be done. Engineers are notorious for underestimating how long a task will take. Suppose you’re a manager and you notice that your engineers consistently give estimates that are too optimistic by a factor of three. So you learn to triple the estimates given to you by engineers. However, one of your engineers has also caught on to the same fact – so he decides to triple all his time estimates as well. Now, without anyone intending it, the final estimate ends up at triple the time the task will actually take. Scoring tasks based on complexity, instead of time estimates, elegantly solves this problem. If a task is scored as 12 points of complexity, and the assigned developer historically averages 4 points per week, you can figure out that the task will take approximately one month.
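
In code, the conversion from points to a rough timeline is essentially a one-liner; the only wrinkle worth showing is deliberately rounding up to a coarse unit, so the output doesn’t pretend to more precision than the underlying velocity data supports (a choice discussed further in the comments at the end of this article). This is a minimal sketch with made-up rounding thresholds, not a prescription:

```python
import math

def rough_estimate(points: int, avg_points_per_week: float) -> str:
    """Turn a complexity score and a developer's historical velocity into a
    deliberately coarse, human-readable estimate."""
    weeks = points / avg_points_per_week
    if weeks <= 1:
        return "about a week"
    if weeks <= 2:
        return "a couple of weeks"
    # Beyond a couple of weeks, round up to whole months to avoid implying
    # more precision than velocity data can support.
    months = math.ceil(weeks / 4)
    return f"approximately {months} month" + ("s" if months > 1 else "")

# 12 points at a historical average of 4 points/week:
print(rough_estimate(12, 4))   # -> "approximately 1 month"
```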


In the context of this article, there is another big problem with scoring tasks based on time estimates. It makes it impossible to use velocity for performance reviews. 

Suppose developer A has been assigned task B, and estimates that she can complete it in 2 weeks. Hence, the task is marked as 5 points. And once developer A completes the task on schedule, her velocity is calculated to be 2.5 points/week. Developer X has similarly been assigned task Y, estimates that he can complete it in 2 weeks as well, and also ends up with a velocity of 2.5 points/week. Both developers have identical velocities. It might seem as though they are performing equally well. In reality however, task B might be far more complex than Y. The first developer is doing a far better job, hence why she can complete far more complex tasks in the same amount of time! Their velocity metrics are completely out of sync with their performance.
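
To put numbers on it, suppose – purely for illustration – that task B is genuinely twice as complex as task Y:

```python
# Time-based scoring: both tasks were estimated at 2 weeks, so both got 5 points.
velocity_A_time_based = 5 / 2    # 2.5 points/week
velocity_X_time_based = 5 / 2    # 2.5 points/week -- indistinguishable

# Complexity-based scoring: suppose task B is genuinely a 10 and task Y a 5.
velocity_A_complexity = 10 / 2   # 5.0 points/week
velocity_X_complexity = 5 / 2    # 2.5 points/week -- the difference is now visible
```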

Some may try to work around the above problem by estimating “time taken for an average member of that team.” But let’s be honest – producing time estimates for yourself is already hard enough. Producing time estimates for a “hypothetical average developer” is a near impossible task. Besides, this approach also suffers from a number of other problems. Perhaps your new manager is handicapping the team by holding an inordinate number of meetings. Or perhaps the new team lead is taking forever to do code reviews and bottlenecking everyone. In such situations, the team’s velocity graph should show a downward trend, so that people will know there is a problem that needs to be fixed. But if you are simply scoring tasks based on time estimates, not complexity, you will not see any such downward trend in your velocity graph at all.

More broadly, if tasks are scored based on time estimates, then velocity by definition is simply time-estimate-per-week. A completely worthless metric for performance management. Which is exactly why it is imperative that we score tasks based on complexity. Because complexity-per-week is a far more meaningful metric.

Canonical Benchmarks

“Okay, that all sounds great, but how on earth do you actually score tasks based on complexity, without using time estimates?”

The way to do this is to set up canonical benchmarks ahead of time. For example, here is the canonical benchmark that we used for frontend tasks at our startup:

  • 1 point: Minor cosmetic change to an existing feature. Eg: changing button placement
  • 2 points: Small functional change to an existing feature. Eg: adding a new field to an existing form
  • 3 points: New simple feature. Eg: Select an existing customer from a dropdown list
  • 4 points: New feature. Eg: CRUD a customer entity
  • 5 points: New feature with some additional complications. Eg: Reusable “customer-create-or-lookup” UI component

The above benchmarks made sense for us, based on the type of tasks we were frequently working on. Each team’s domain is different, so it would be good for each team to decide ahead of time what benchmarks make most sense for their domain. Maybe you think I’m an idiot for scoring “new feature with some additional complications” as a 5, as opposed to a 7. You’re probably right – pick relative weights that make most sense for your team. The important thing is to:

  1. Consistently score tasks based on which benchmark they are most similar to. Regardless of time estimates, the person assigned to the task, the number of meetings on the calendar, etc
  2. Break down large tasks into smaller subtasks, each of which can be easily scored using the above benchmarks. For example, if a task requires implementing a new CRUD-customer entity, as well as being able to select that customer from a dropdown list on a different form, it should be broken down into two subtasks and scored as 4+3=7 points (see the sketch after this list). Making a list of every single UX change across every single page will also help immensely in better understanding the scope of the task, thus producing better estimates as well
  3. Keep the benchmarks as consistent as possible over a long time period, so that the team’s velocity can be compared against historical velocities. This is vital to track long-term performance trends. If your new CEO decides to lay off all senior developers, and schedule a million meetings to drive “alignment”, these long-term velocity trends will provide compelling evidence of the impact of his decisions
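
One low-tech way of keeping the benchmarks consistent is to record them as data alongside the codebase, so that every scoring discussion starts from the same reference table. Here is a minimal sketch of that idea; the structure and helper function are illustrative, not a prescription.

```python
# Canonical frontend benchmarks, kept in version control so they stay stable
# over time and everyone scores against the same reference points.
FRONTEND_BENCHMARKS = {
    1: "Minor cosmetic change to an existing feature (eg, changing button placement)",
    2: "Small functional change to an existing feature (eg, adding a field to a form)",
    3: "New simple feature (eg, select an existing customer from a dropdown list)",
    4: "New feature (eg, CRUD a customer entity)",
    5: "New feature with additional complications (eg, reusable create-or-lookup component)",
}

def score_task(subtask_points: list[int]) -> int:
    """A large task is broken into subtasks, each matched to its closest
    benchmark, and the task's score is simply the sum of its subtasks."""
    for p in subtask_points:
        assert p in FRONTEND_BENCHMARKS, f"{p} is not a canonical benchmark score"
    return sum(subtask_points)

# CRUD-customer entity (4 points) plus selecting that customer from a
# dropdown on another form (3 points):
assert score_task([4, 3]) == 7
```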

To be sure, there is still some subjectivity when using these benchmarks to score tasks. There is no magic formula that will tell you whether a task is unambiguously 4 points or 5. The goal here is not to eliminate subjectivity entirely, but to mitigate it as much as possible. I wager that using these benchmarks would still present a far more accurate picture of individual and team performance, as compared to doing it purely on the basis of gut instinct. As evidenced by people’s experiences shared at the start of this article.

Essential vs Accidental Complexity

Here is a common discussion that I often hear during sprint-planning meetings:


Pete: For this task, we want to display the user’s birthday on the settings page. As per the canonical benchmarks, this should be a 2 point task

David: It’s not that simple! To get the birthday, you have to first call the Bingo service, which should call the Papaya service, which uses MBS to get the user-session token, which in turn passes it on to LMNOP for use by Racoon and ……. Hence why this should be a 50 point task!

(You should really watch the linked video if you haven’t already. It is a work of art)


Okay, maybe I’m exaggerating. But the question remains – should you score tasks based on their essential complexity or accidental complexity?

If our goal is accurate time estimates and/or performance evaluation for junior developers, we should score tasks based on accidental complexity. After all, accidental complexity presents a more detailed view of the effort involved to get something done within the context of the present-day software system.

The problem with using only accidental complexity is that we lose the ability to identify systemic problems that are handicapping the team.

  • If we make good tech-stack decisions, we should expect to see our velocity improve
  • If we do a good job with system-design, we should expect to see our velocity improve
  • If we clean up our tech debt, we should expect to see our velocity improve
  • If we provide good feedback during code-reviews, we should expect to see our velocity improve

And conversely, if the team is picking overly complicated tech stacks, over-engineered designs, piling on tons of tech debt, and rubber-stamping all pull-requests, we should expect to see the velocity drop. And that will not happen if we score tasks based on their accidental complexity. 

Scoring based on accidental complexity will inflate all points and velocities when a team is suffering from bad design decisions. Thus giving the impression that the team is continuing to perform well, when in reality, it is taking the team longer and longer to build simple features of minimal essential complexity.

That said, it is also unfair to penalize the unlucky junior developer who needs to tackle a mountain of accidental complexity due to the bad choices made by others. Hence why we struck a middle ground and did the following:

  • We assigned each task 2 different scores: One score for its essential complexity, and a second score for its accidental complexity
  • During individual performance reviews, we reviewed velocity using combined complexity. This way, if an individual overcomes significant accidental complexity in order to ship an important feature, they are recognized for their efforts
  • During team performance reviews, we reviewed velocity using only essential complexity (a sketch follows below). That way, if the team is doing a poor job with system design and piling on tons of tech debt, we will see the velocity drop, and we can flag it as a problem to be fixed
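
For concreteness, here is a minimal sketch of how the two scores might be carried on each story and rolled up differently for individual versus team reviews. The field and function names are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class ScoredStory:
    owner: str
    essential_points: int   # complexity inherent to the feature itself
    accidental_points: int  # extra complexity imposed by the current system

def individual_points(stories: list[ScoredStory], owner: str) -> int:
    """Individual reviews credit combined complexity, so an engineer who
    overcomes a mountain of accidental complexity still gets recognized."""
    return sum(s.essential_points + s.accidental_points
               for s in stories if s.owner == owner)

def team_points(stories: list[ScoredStory]) -> int:
    """Team reviews count essential complexity only, so piling on tech debt
    shows up as falling velocity rather than inflating it."""
    return sum(s.essential_points for s in stories)
```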

This gave us the best of both worlds.

UX Improvements vs Tech

Another debate that often comes up is around prioritizing user-facing changes vs purely-technical tasks such as refactoring, cleaning up tech debt, and investing in infrastructure such as CI/CD.

It is no surprise that product managers are often pushing hard for shipping features “yesterday”, and underinvesting in tech improvements. To my surprise, there are also a significant number of engineering leaders who over-invest in “tech improvements”. This is especially a problem in teams that struggle with velocity. Senior leaders are often frustrated by the low velocity – they want their organization to be fast-paced, and think they can achieve this goal by throwing bodies at ambitious large-scale “tech improvement” projects. Line managers and tech leads, for their part, are more than happy to indulge in such projects as well – it’s a fantastic way for them to build flashy new systems that sound really impressive and look great on their resumes and promo packets.

But do these projects actually help the org be more agile and fast-paced? Or are they just contributing yet another over-engineered overly-complicated system to the pile of cruft? Who knows – nobody is really tracking velocity in a reliable way anyway.

Ultimately, tech investments are analogous to financial investments. Some investments generate great returns, some investments generate minimal returns, and some investments even generate negative returns. Figuring out which investments are worth pursuing is more art than science. Some leaders are good at this, others aren’t, and yet others don’t care as long as it helps them get promoted.


When done correctly, velocity metrics are a great way to figure out whether your team is on the right track, and to put in place the right incentives for leads. When scoring tasks, we drew a distinction between tasks that produce UX improvements (eg, new features or reduced latency) and purely-technical tasks (eg, cleaning up tech debt or improving CI/CD). For the former, we would assign “UX points” and for the latter, we would assign “tech points”. When evaluating performance for individual-contributors, we would look at their combined velocity, which includes tech points as well. But when evaluating performance for leads, we would look primarily at the team’s UX velocity, and ignore all tech points.
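
The same roll-up pattern applies to the UX/tech split. Here is a sketch along the same lines, again with purely illustrative names, showing how individual contributors and leads end up looking at different totals.

```python
from dataclasses import dataclass

@dataclass
class PointedStory:
    owner: str
    ux_points: int = 0    # user-facing improvement (eg, new feature, reduced latency)
    tech_points: int = 0  # purely technical work (eg, tech debt cleanup, CI/CD)

def ic_velocity_points(stories: list[PointedStory], owner: str) -> int:
    """Individual contributors are credited for UX and tech work alike."""
    return sum(s.ux_points + s.tech_points for s in stories if s.owner == owner)

def lead_velocity_points(stories: list[PointedStory]) -> int:
    """Leads are judged primarily on the team's UX velocity, so tech
    investments only pay off for them if they boost user-facing output."""
    return sum(s.ux_points for s in stories)
```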

This split ensures that leads continue to have an incentive to make wise investments in things like cleaning up tech debt – by definition, such investments pay for themselves by boosting the team’s UX velocity. It also gives leads a strong disincentive against unwise investments – the team’s UX velocity will drop as its bandwidth is consumed by work that never pays off.

Beyond these incentives, it also acts as a monitoring system to flag organizational problems. If the team is not spending any time working on tech investments, and you’re seeing steady decreases in UX velocity, it’s a sign that the team should spend more time on tech improvements and tech debt. Conversely, if the team is spending a large amount of time on tech improvements, and the UX velocity is still dropping, it’s a sign that the team should spend less time on pet projects and more time on the things that users care about.

Full Ownership for Bug Fixes

Early in my career, I heard a funny story about senior management declaring that the project needs to be feature-complete by a specific date. “It is understandable if some bugs exist – they can be fixed later on. But the project should be feature-complete this quarter, no matter what!” To which engineers promptly responded by filing a bug for every single feature that hadn’t yet been built, and declaring that “The project is now 100% feature complete! Only bug-fixes are now remaining.”

I think of the above story every time I see teams give people credit for fixing bugs. For anyone wanting to game their velocity, that is the easiest way to do so. Do a half-assed job to get your tasks done in record time and juice your velocity. And once other people do the hard work of triaging and documenting your bugs, pad your velocity metrics even further by fixing those bugs.

This is precisely the reason why we instituted the following two policies:

  • The person who was responsible for the bug is also responsible for promptly fixing it
  • Bug fixes are not assigned any points

This is an easy way to prevent people from doing a half-assed job, and instill a culture of ownership for any task that a person takes on. 

In theory, if people are being cavalier about shipping extremely buggy features, you could go a step further and penalize shoddy work. Fortunately we never had to do this because our bug-rate was well below what anyone expected from an early stage startup.

Team Velocity for Leaders

One of the worst work experiences I had in my career was a team lead who had to review every single pull-request, took forever to do so, and gave dribs and drabs of suggestions across countless iterations. The net result was that I spent an order of magnitude more time waiting for code-reviews as compared to actually getting stuff done.

At the time, it was incredibly frustrating. But looking back, his actions made perfect sense. We worked at a company where his performance was being evaluated based on his contributions and his impact. He was not given any credit for the projects that I was working on and delivering. Hence why he made no effort at all to unblock me.

The experience taught me an invaluable lesson – teams succeed when leaders have a stake in the team’s success. It is great when people act altruistically and “do the right thing”, but it is incredibly dangerous to count on that happening consistently. When technical leaders are evaluated in the same way as individual-contributors, they tend to optimize for their individual success instead of the team’s success. And this is exactly what many senior/staff engineers end up doing.

Hence why at our startup, performance evaluations are done very differently for leaders as compared to pure individual contributors. Performance evaluations for individual contributors revolve around their individual velocity (using total complexity). But performance evaluations for leads revolve around the team’s velocity (using essential complexity). This ensures that our leads are given a stake in their entire team’s success, and incentivizes them to focus on all aspects of improving the team’s performance – such as retaining our best engineers, doing prompt and detailed code reviews, and cleaning up tech debt.

No Bikeshedding

We talked earlier about wasteful and unproductive meetings. Ironically, among the meetings I used to dread most at my previous companies were the biweekly “sprint-planning” meetings – meetings where our entire team gathered in one room and did “planning poker” votes in order to prioritize and assign points to every single task. Half our time was spent bringing every single person up to speed on what exactly was or would be done as part of each task. And the other half was spent bikeshedding on how many points should be assigned to it. As an individual contributor, I walked away from most of these meetings feeling that they would have gone perfectly adequately with or without my presence.

Hence why we did away with this entirely at our startup. As the founder and manager, I would spend an hour each week reviewing the various stories and assigning tentative points to all of them. I would then send them over on Slack to our team’s frontend lead and ask for his opinion. We would have a brief discussion on a couple of the stories, and I would then update the points as per that discussion. The points would then be posted on Asana, and made publicly viewable by everyone.

I told all our engineers to ping me at any time if they wanted to discuss or adjust any of the points for any task. For the most part, they were perfectly happy with the point assignments made by myself and the frontend-lead. And if they had any suggestions, they knew that I was always open to discussing it.

Is it possible that our velocity metrics would be slightly more accurate if we had the entire team sit in a room for 1-2 hours each week and vote/discuss/debate point assignments for every single task? Maybe. Would it improve our actual velocity? Definitely not. And what impact would it have on morale? Most likely negative. Making point assignments a topic of public debate and voting, in addition to being a massive bikeshed and time-sink, also has a side-effect of politicizing it. I wager that most engineers would be happier with a neutral arbiter doing point assignments as a first pass, with any lingering concerns and suggestions resolved via a 1:1 discussion.

Limitations and Areas for Improvement

The effectiveness of the above system depends immensely on the integrity of the person who ultimately assigns the complexity points for each story. This person doesn’t have to be perfect – the system automatically self-corrects for people who are too optimistic or too pessimistic. But it certainly can’t self-correct for someone who is actively trying to game the system for personal gain.


The most likely and dangerous scenario where things can go really wrong is the following: The manager wants to make the team’s performance look better than it actually is. Perhaps they are doing a lousy job of leading the team, and want to sweep this under the rug. Perhaps they have a financial incentive for improving the team’s performance. Or perhaps they simply want their director to think they are doing a great job. 

And so, they intentionally inflate all the complexity scores compared to the historical scores. A task that would have previously been a 2-pointer is now assigned 3 or 4 points instead. They then use the inflated velocity metrics to claim that they have boosted team performance by 50-100%. Or to claim that they have kept the team performance stable, even though their bad leadership is actually holding back the entire team.

If managers are rewarded or punished for their team’s performance metrics, they would have a strong incentive to do this. And in a sufficiently large corporation with no safeguards in place, it is almost guaranteed that a number of managers will do this.


To be honest, this isn’t a problem that I’ve faced or solved in practice. As the founder of an early-stage startup, I was the only engineering manager. Hence, I had no incentive to cheat the system that I myself had built. Ultimately my financial success depended on the success of the startup, not velocity metrics on an Excel sheet. But I recognize that this is the exception, not the norm.

One potential solution is having the point assignments be done by someone other than the team’s manager. Perhaps each story can be assigned to a randomly chosen team member, who would decide (with guidance and help) the points for that story. Or perhaps the scoring can be done by a manager or senior engineer in a different team entirely. There are significant pros and cons to both approaches. 

Relying on junior engineers within the team is preferable in that they have a stronger contextual understanding of the requirements and intricacies for each task. Relying on managers or leads from other teams is preferable in that they would have more experience in this process, possess better judgment through their seniority, and are less likely to be unduly influenced by the team’s manager who may want to see inflated scores. Relying on the same person for all stories allows for better consistency, whereas relying on a different person for each story prevents any systemic biases or under/overestimates. There are many different ways of structuring this, none of them perfect.

A second solution is to not use the above system for evaluating manager performance at all. This way we remove the primary incentive for a manager to be dishonest. The above system works great for quantifying developer performance, and perhaps that’s where we should draw the line. Manager performance could be evaluated using a completely different approach – perhaps even the status quo. I have some ideas on this topic as well, but I would love to hear others’ thoughts on this matter.

Final Thoughts

We started off this article with a discussion of Goodhart’s Law, and that seems like a good way to wrap this up as well. Here’s a quote from another article that discusses this topic in far more detail:

One of the things that I find perpetually irritating about using data in operations is that even proposing to measure outcomes often sets off resistance along the lines of “oh, but Goodhart’s Law mumble mumble blah blah.” Which, yes, Goodhart’s Law is a thing, and really bad things happen when it occurs. But then what’s your solution for preventing it? It can’t be that you want your org to run without numbers. And it can’t be that you eschew quantitative goals forever!

I think the biggest lesson of this essay is just how difficult it is to be data driven — and to do it properly. A friend of mine likens data-driven decision making to nuclear power: extremely powerful if used correctly, but so very easy to get wrong, and when things go wrong the whole thing blows up in your face.

Which I think is a perfect summary of my beliefs around building a great engineering culture as well. It certainly isn’t easy, and there are many ways to get it wrong. But there are also tremendous advantages in getting it right. And it holds the key to solving so many of the dysfunctions that hold back so many engineering teams and talented engineers.

7 thoughts on “Moneyball for Software Teams – An Imperfect Heuristic for Quantifying Dev Performance”

  1. There’s a slight error in the article.

    >If a task is scored as 12 points of complexity, and the assigned developer historically averages 4 points per week, you can figure out that the task will take approximately one month.

    12 points of complexity @ 4 points per week = 3 weeks of effort not a full month

    12 points of complexity only roughs out to a month’s worth of work if the developer averages 3 points per week.


    1. Good eye. This was actually intentional. Saying that a task will take “3 weeks” or “15 days” can create a false illusion of precision. Especially since developer velocity can vary quite a bit across time and across different tasks. The last thing you want is your CEO breathing down your neck saying “you said 3 weeks, it has now been 4!” Hence why I prefer to present dev estimates in more vague terms such as “approximately one month”

