The Pitfalls of Measurements

Measuring is important but easy to misuse. So many of our decisions are influenced by measurements:

Deciding if we should chase a competitor or do something new.

Validating if we should move on to the next step of development.

Choosing which library to use in our next project.

Deciding if we should play or pass.

Measurements are critical and necessary. They can be monitored by automated systems, shared across a company or with the entire world, and transferred without an overwhelming amount of context.

Measurements can also be mean and cold. We get rejected when we have 2 years of job experience but the posting requires 3. We lose our royalties when a review score threshold isn't hit, like Obsidian did when Fallout: New Vegas scored an 84 on Metacritic but the publisher's contract required an 85.

We want to measure what's important, but is that even possible? Why do so many measures lead us astray? We seem to chase user count over innovation, dollars over fun, delivery dates over user experience. Numbers can justify throwing the baby out with the bathwater, missing the forest for the trees, and sometimes being awful to others.

The Premise of Measuring

Let's look at how measuring starts. We begin our project with nothing but ideas, goals, and aspirations. At some point, we have an insight: a recognition of a relationship between a number and a result.

When the assault rifle shot faster, players got excited. When we increased the price, purchases went up, or maybe they went down. When we worked more hours, our bug count went down faster.

We then realize that numbers we observed in the past (or near present) can inform our decisions with some predictive accuracy. We could increase the firing rate of other weapons too. We could increase the price until purchases start to dip, and then we've found the optimal price. We add rules, models, and checkpoints around these measures so we can respond to the numbers in real time as we tweak them. We rely on measurements with the expectation that they will produce the results we predict, with anything from a loose relationship to complete 1:1 reliability.
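
To make that concrete, here is a minimal sketch in Python of the "raise the price until purchases dip" rule, loosely interpreted as "stop when total revenue stops improving". The demand curve and the observe_purchases helper are made up for illustration; in practice the numbers would come from live sales telemetry, not a formula.

# Toy sketch: step the price upward until revenue stops improving.
# observe_purchases() is a hypothetical stand-in for real sales telemetry.

def observe_purchases(price):
    # Made-up linear demand curve, purely for illustration.
    return max(0.0, 1000.0 - 8.0 * price)

def find_price_plateau(start_price, step=1.0):
    price = start_price
    best_revenue = price * observe_purchases(price)
    while True:
        candidate = price + step
        revenue = candidate * observe_purchases(candidate)
        if revenue <= best_revenue:   # purchases dipped enough that revenue fell
            return price              # the last price that was still an improvement
        price, best_revenue = candidate, revenue

print(find_price_plateau(20.0))  # prints 62.0 for this toy curve (true peak is 62.5)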

When measuring goes well, that is how it looks. But measurements can be misused in many ways.

A metric is inaccurate

It's easy to miss that a number can be plain wrong because of how it was collected. The software being used didn't count all the voters. The designer added an extra 0 when entering it into the spreadsheet.

There's also the Roman messenger problem. A messenger brings the battle report to the emperor, but in the time it took for the messenger to arrive, the battle situation changed drastically. These days, so many measurements are instantaneous that the delay factor gets missed when it matters.

In these cases, a human or automated sanity check can sound an alarm on something that just "doesn't look right". To do that, the checker needs to understand the context well enough to know what "sane" looks like.
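
As a rough illustration, an automated version of that check can be as simple as a range test plus a staleness test. Here is a minimal sketch in Python; the thresholds are purely illustrative and would need to come from someone who knows what "sane" looks like for this metric.

# Minimal sketch of an automated sanity check: flag values outside the
# range we'd expect and data too old to act on. Thresholds are illustrative.
from datetime import datetime, timedelta, timezone

EXPECTED_RANGE = (0, 50_000)    # plausible daily active users for our game
MAX_AGE = timedelta(hours=6)    # older than this and the "battle" may have changed

def sanity_check(metric_value, collected_at):
    problems = []
    low, high = EXPECTED_RANGE
    if not (low <= metric_value <= high):
        problems.append(f"value {metric_value} outside expected range {EXPECTED_RANGE}")
    age = datetime.now(timezone.utc) - collected_at
    if age > MAX_AGE:
        problems.append(f"data is {age} old; the situation may have changed")
    return problems  # an empty list means "looks sane", not "is correct"

# Example: a typo adds an extra zero to the daily active user count.
print(sanity_check(320_000, datetime.now(timezone.utc) - timedelta(hours=1)))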

It's crucial to be "data-informed" rather than "data-driven", because data really shouldn't be at the wheel. Critical minds should understand the data, or we open the door to garbage in, garbage out.

The model isn’t reality

In addition to misunderstanding the relationships in data (like correlation vs. causation), it's also fair to say that our models will still miss important things entirely. A model is, by definition, an abstraction or simplification of a larger, more complex system. Any system so simple as to be understandable doesn't need a model; we simply transfer the concept through explanation, by word of mouth. Unfortunately, systems very quickly complexify past the point of a single concise phrase, which is why we write papers, custom code, or blog posts to record as much of their essence as we can.

This brings us to a key point: measurements are originally tracked for a reason, and when the original insight (the meaning, the way it fits into a larger model) is lost, the measurement itself often loses its value yet continues to be used regardless.

A metric's context has been lost

As Jeff Bezos describes at Amazon, a metric is a proxy for something real, but when the inventor of that connection is no longer present, people continue to use the metric with the original insight long forgotten. A metric like closed ticket count approximates customer satisfaction. Over time the world shifts, or new markets open where the insight doesn't apply, but the metric has inertia: it has become the de facto substitute for truth. Perhaps the original insight was really expensive to come up with, like how we still use interview questions originating from studies in the 1960s (see "Assessment Centers for Spotting Future Managers" (1977) and behavioral interview questions). Or there's no longer a qualified person to do a sanity check. Or it's as simple as this: models are complicated, it's hard to see the drift between past and present, and we have a really bad habit of continuing without questioning.

It's very common to manage towards metrics that decision-makers don't understand and aren't scrutinizing, and that's where stagnation and mistakes thrive.

Decisions are made using dilutions of dilutions of massive amounts of information. It's critical to understand the characteristics of your sources, especially where their usefulness bends and breaks.

I like to think of a measurement as having a pin attaching it to its origin.

The pin is the time, location, and people involved in the original insight.

An exercise I often do when I hear a metric being used in a decision is to ask for information surrounding that pin. Is the originator still working here? Did they write down context on why they used the measurement and why they made the decision, so I can seek out the essential elements to use it correctly? Contextualizing helps me avoid the pitfalls of misuse.
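
One lightweight way to keep that pin attached is to record it right next to the metric. Here is a minimal sketch in Python; the field names and example values are illustrative, not a prescribed schema.

# Sketch of storing the "pin" alongside a metric so the origin of the
# insight travels with the number. Fields and values are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass
class MetricPin:
    name: str            # what we measure
    proxy_for: str       # the real thing the metric stands in for
    originated_on: date  # when the insight was made
    originated_by: str   # who made the connection
    context: str         # why the relationship was believed to hold
    revisit_by: date     # a forcing function to re-check the insight

closed_tickets = MetricPin(
    name="closed ticket count",
    proxy_for="customer satisfaction",
    originated_on=date(2021, 3, 1),
    originated_by="support team lead",
    context="clearing the backlog correlated with better survey scores",
    revisit_by=date(2023, 3, 1),
)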

The measurement influenced the source correlation

Some metrics are so useful that they become ubiquitous, and that very value becomes the cause of their downfall.

Credentialing

Imagine we're the hiring manager for a prestigious, high-paying job at a top company. We try all sorts of requirement filters and tests but are surprised to see mixed success with candidates. Our company audits the process and finds that only 55% of the candidates we hire are working out, barely better than a coin flip (but actually fairly close to the success rate of an average interview process, as Daniel Kahneman somberly points out in Noise: A Flaw in Human Judgment). One day, our coworker excitedly throws a report onto our desk.

"We've found the Columbia graduate program is the best in the nation! Their candidates are 73% more likely to succeed than average, and 14% better than even the second best school".

Suddenly, the answer seems clear. Prioritize hiring Columbia graduates and deprioritize the other factors. The college degree is a big indicator, a suitable replacement for most other metrics. A valuable insight!

Let's note that the report itself may not have been totally accurate (candidate success rates are themselves a measurement that is notoriously hard to quantify). However, we try it out and the results speak for themselves: the Columbia grads are doing amazingly well! Now, remember we work at a top company, and word gets out quickly. Other hiring managers start to emulate the practice, and suddenly Columbia students are getting swarmed by recruiters well before graduation.

Columbia grads are getting placed into top companies almost universally, and prospective students' parents start to notice too. They encourage their children to accept an offer from Columbia over other schools. Columbia is now swarmed with applications. They expand their programs and funding and admit more students. At the same time, students are taking classes specifically to pass the Columbia admissions process, preparing for the entrance process over general education. The application process is heavily scrutinized and criticized for various reasons, and groups demand adjustments. The program starts to look dramatically different from when we first measured it: the classes, the professors, the people entering, and the qualifications of those who graduate.

Imagine this happens for decades. The paper degree, the credential the school issues, stays exactly the same and is likely to remain in high regard. However, the graduates are completely different as a result of the changes the prestige brought to the program. Of course, it's possible that they are as good as ever, and Columbia will claim the scrutiny and changes have in fact improved the program. Whether that's true is a separate issue. The key point is that, with all the changes originating from the prestige, the current graduates have little to no connection to the original insight in the report at our company decades ago. The meaning has been lost because the measurement's inertia resisted the factual changes in the results. Over time, the value of Columbia in our interview process is a number we don't understand; in fact, it could be zero.

This process of erosion is inevitable, not because of evildoers but because of mostly well-intentioned independent actors (students, parents, schools, and companies) all trying to optimize for their own interests against the value of a metric. When a metric has associative value but no intrinsic value, erosion will happen, especially when the outcome is highly valuable to the actors. We all game the system in different ways to get ourselves a better result, at the small cost of devaluing the credential.

Every user of measurements should understand this inevitability and recognize the need to re-evaluate. The foundation of our decisions must be updated to account for the erosion.

Well-Discussed Pitfalls

There are a few pitfalls which are talked about to death on the internet, so I won't repeat much here. Misunderstanding correlation vs. causation is of course one of the most important. This is in the category of flawed analysis, where the data is correct but the conclusion isn't.

Also, we can overweight an unimportant measurement relative to a critical one: short-term profits over user satisfaction (which we could argue translates into long-term profits). This is a modeling problem, where the metrics can be entirely correct but their relative weights are wrong. We optimize for A, but B is more important yet weighted lower. Or we are measuring A but missing B entirely in our model. It's a problem that statistics (see regression methods) and neural networks (see back-propagation) try to solve mathematically, and it's super topical these days!
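
For the weighting half of that problem, here is a minimal sketch of the regression idea in Python with synthetic numbers: instead of guessing the weights, we let least squares recover how much each metric actually matters to the outcome we care about.

# Sketch: recover the relative weights of two metrics from (synthetic) data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
short_term_profit = rng.normal(size=n)
user_satisfaction = rng.normal(size=n)

# Pretend the outcome we actually care about depends far more on
# satisfaction than on short-term profit (weights 1.5 vs. 0.2).
long_term_revenue = (0.2 * short_term_profit
                     + 1.5 * user_satisfaction
                     + rng.normal(scale=0.1, size=n))

X = np.column_stack([short_term_profit, user_satisfaction])
weights, *_ = np.linalg.lstsq(X, long_term_revenue, rcond=None)
print(weights)  # roughly [0.2, 1.5]: the data says which metric matters more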

A measurement is being exploited or biased by people using it

We work with a lot of people with different skill sets and use findings from external groups we trust but do not know personally. As people or organizations gain expertise, they acquire specialized knowledge that few or no other people have, so we must take their findings with some level of trust. While this allows for delegation and greater insights, it also opens opportunities for people in positions of power or trust to fudge the connection between numbers and recommendations for personal reasons, or even out of unconscious bias. They may ignore an inconvenient truth or pick only the data that fits their preferred solution.

There's something to be said about data that “looks too clean”, a possible sign of manipulation. Again, this requires the reader to do their due diligence and have context to see when data looks strange.

The first step to combating this is the standard of the people in the workplace: hiring great people, keeping them happy, and keeping up with how they are doing. However, it's not realistic to expect that treating people well alone protects our data and decision-making. That is why we should have critical minds auditing our decision-making process.

What seems to work best is to continue delegating the research, data collection, and analysis to one or more people, but also to have reviewers who are willing to question the findings and ensure their quality. This has a good chance of catching not just personal biases but also the other analysis flaws we've talked about.

For example, if a networking expert on the team proposes an innovative but risky network model, and their architecture's success is based on data they've collected around network latency, user behavior, and team needs, it's the responsibility of the team to have peers hear the proposal and ask questions about the data collection process, the architectural decisions, the risks, and the alternative models considered. These people don't need to be as specialized as the networking expert, but they do need to absorb some level of context about the decision being made. Having more people look at an analysis reduces the chance of bias or exploitation and has great benefits beyond the validation of measurements.

The Takeaway

Considering the ways metrics influence decisions matters, especially given the number of subtle ways they can be misused. What I've discussed are just a few of the pitfalls to look out for, but they are ones that have led me to better decisions.

Measurements are very powerful when used correctly. Use them often, but with caution and scrutiny.

