Mature Metrics: 7 Trustworthy Criteria (Based on Microsoft STEDII)
A detailed breakdown of the 'mature metrics' concept and 7 criteria (STEDII from Microsoft) that help build a measurement system resistant to manipulation, explainable, and trustworthy.
In a data-obsessed world, it's easy to fall into the trap of "fake" numbers. Metrics can look good on a dashboard, yet lie, hiding real problems and creating an illusion of success. How can you know if your data is trustworthy?
Mature product teams don't just "look at numbers." They build a system of mature metrics—indicators that have a clear definition, protection against self-deception, and a direct link to decisions.
What is a "Mature Metric"?
A mature metric is not just a number. It's a number + meaning + rules of use. It answers the question: "What behavior has changed, and what decision are we making now?"
Such a metric has a "passport" that includes:
- Reflected behavior (value-event).
- Unit of analysis (user, account, order).
- Time window (day, week, month).
- Segment (new, paying, enterprise).
- Quality criteria (what counts as a valid event).
- Link to decisions and thresholds (success/fail/grey).
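The passport above can be captured as a small data structure. A minimal sketch in Python; the class, field names, and thresholds are hypothetical illustrations, not part of any real library:

```python
from dataclasses import dataclass

@dataclass
class MetricPassport:
    """One metric's 'passport': definition plus rules of use."""
    name: str
    behavior: str          # the value-event the metric reflects
    unit_of_analysis: str  # "user", "account", "order"
    time_window: str       # "day", "week", "month"
    segment: str           # "new", "paying", "enterprise"
    validity_rule: str     # what counts as a valid event
    success_threshold: float
    fail_threshold: float  # values in between fall into the grey zone

    def decide(self, value: float) -> str:
        """Map an observed value to a decision per the thresholds."""
        if value >= self.success_threshold:
            return "success"
        if value <= self.fail_threshold:
            return "fail"
        return "grey"

checkout_cr = MetricPassport(
    name="checkout_conversion",
    behavior="completed purchase after adding to cart",
    unit_of_analysis="user",
    time_window="week",
    segment="new",
    validity_rule="order total > 0 and not refunded within 24h",
    success_threshold=0.30,
    fail_threshold=0.25,
)
print(checkout_cr.decide(0.27))  # → grey
```

Writing the passport down as code has a side benefit: the thresholds and validity rules stop living in people's heads and become reviewable.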
STEDII: A Checklist for Metrics from Microsoft
The experimentation team at Microsoft, working with vast amounts of data, developed the STEDII framework. It's essentially a checklist that helps determine if your metric can be trusted.
S - Sensitivity
- Question: How sensitive is the metric to small but important changes in the product?
- Why it matters: If your metric "doesn't notice" successful A/B tests, it's useless for decision-making. It must be able to capture even a weak, but statistically significant signal.
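Sensitivity can be made concrete: the same small lift that is invisible at low traffic becomes a clear signal at scale. A minimal sketch using a two-proportion z-test on illustrative conversion counts (stdlib only):

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, via normal CDF
    return z, p_value

# The same +0.5 pp lift (10.0% -> 10.5%): noise at n=1,000,
# a statistically significant signal at n=100,000.
z_small, p_small = two_proportion_z(100, 1_000, 105, 1_000)
z_big, p_big = two_proportion_z(10_000, 100_000, 10_500, 100_000)
print(f"n=1,000:   p={p_small:.3f}")
print(f"n=100,000: p={p_big:.5f}")
```

If your metric (or your sample size) cannot separate this kind of lift from noise, the metric will not help you call A/B tests.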
T - Trustworthiness
- Question: How much do we trust the data and the way it's calculated? Are there any bugs in tracking, are events being lost?
- Why it matters: If there are "holes" in the data, you are drawing conclusions based on noise. Trustworthiness is the foundation. Microsoft invests enormous effort in trustworthy experimentation to ensure that the effect is not an artifact of logs or data shifts.
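One well-known trustworthiness diagnostic from the experimentation literature is the Sample Ratio Mismatch (SRM) check: if the observed traffic split deviates from the configured split, the assignment or logging pipeline is likely losing events, and no metric from that experiment should be trusted. A minimal sketch using a chi-square test (stdlib only; the alpha and traffic numbers are illustrative):

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Return (srm_detected, p_value) for an expected traffic split."""
    total = n_control + n_treatment
    expected_c = total * expected_ratio
    expected_t = total * (1 - expected_ratio)
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # p-value for chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_value < alpha, p_value

srm1, p1 = srm_check(50_210, 49_790)  # healthy 50/50 split
srm2, p2 = srm_check(50_000, 48_000)  # treatment arm losing events
print("SRM detected:", srm1, srm2)
```

A failed SRM check is exactly the kind of "hole in the data" this criterion is about: the numbers compute fine, but they describe a broken pipeline, not your users.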
E - Efficiency
- Question: How quickly and cheaply can we calculate this metric?
- Why it matters: If calculating the metric takes several days and a team of data scientists, you won't be able to make quick decisions. Efficiency is especially important for guardrail metrics, which should trigger almost in real time.
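One common way to make a guardrail cheap enough for near-real-time use is to compute it incrementally: O(1) work per event over a sliding window, instead of rescanning the full log on every check. A minimal sketch; the class and the simulated traffic are hypothetical:

```python
from collections import deque

class RollingErrorRate:
    """Sliding-window error rate with O(1) updates per event."""
    def __init__(self, window: int = 1000, threshold: float = 0.05):
        self.events = deque(maxlen=window)
        self.errors = 0
        self.threshold = threshold

    def record(self, is_error: bool) -> bool:
        """Add one event; return True if the guardrail fires."""
        if len(self.events) == self.events.maxlen:
            self.errors -= self.events[0]  # evicted by the append below
        self.events.append(int(is_error))
        self.errors += int(is_error)
        window_full = len(self.events) == self.events.maxlen
        return window_full and self.errors / len(self.events) > self.threshold

guard = RollingErrorRate(window=100, threshold=0.05)
fired = False
for i in range(300):
    # simulated traffic: errors start spiking after event 200
    fired = guard.record(i > 200 and i % 5 == 0) or fired
print("guardrail fired:", fired)
```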
D - Debuggability
- Question: If the metric unexpectedly drops or rises, how easy is it to find the cause?
- Why it matters: A metric that is a "black box" doesn't help with learning. You must be able to "drill down" into the data and understand what caused the change: a specific segment, region, app version, etc.
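A drill-down can be as simple as decomposing the overall change into per-segment deltas and looking for the outlier. A sketch with made-up numbers, where the data shape is segment → (users, conversions):

```python
# Overall conversion dropped; localize the cause by segment.
before = {"ios": (4000, 480), "android": (5000, 450), "web": (1000, 90)}
after  = {"ios": (4000, 478), "android": (5000, 300), "web": (1000, 91)}

def overall(d):
    return sum(c for _, c in d.values()) / sum(u for u, _ in d.values())

deltas = {}
for seg in before:
    (u0, c0), (u1, c1) = before[seg], after[seg]
    deltas[seg] = c1 / u1 - c0 / u0

print(f"overall: {overall(before):.3f} -> {overall(after):.3f}")
for seg, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    flag = "  <-- investigate" if d < -0.01 else ""
    print(f"{seg:8s} {d:+.4f}{flag}")
```

The point is not the code but the requirement: a debuggable metric ships with the dimensions (segment, region, app version) that make this decomposition a five-minute query rather than a week of archaeology.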
I - Interpretability & Actionability
- Question: Is it clear to the team what this metric means and how it can be influenced? Does its change lead to specific actions?
- Why it matters: If a metric is difficult to understand, it won't become the North Star. The team should intuitively understand: "If we do X, metric Y should change."
I - Inclusivity & Fairness
- Question: Is the growth of the metric a success for one group of users at the expense of a degraded experience for another?
- Why it matters: Optimizing for the "average" user can lead to the product becoming unsuitable for people with disabilities, users with old devices, or from regions with slow internet.
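One way to operationalize this criterion is to slice the experiment lift per segment before declaring success, rather than looking only at the pooled number. A sketch with illustrative lifts:

```python
# An "average win" can mask a regression for a vulnerable segment.
pooled_lift = +0.012  # looks like a win on average
lift_by_segment = {
    "fast_network": +0.020,
    "slow_network": -0.015,   # worse for users on slow connections
    "screen_reader": -0.022,  # worse for assistive-technology users
}

regressions = {s: d for s, d in lift_by_segment.items() if d < 0}
ship = pooled_lift > 0 and not regressions
print("ship:", ship, "| regressed segments:", sorted(regressions))
```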
How to Use This?
Before making any metric a key one (North Star or Guardrail), run it through the STEDII checklist. This exercise will force you to think about the quality of your data and what decisions you can actually make based on it.
Mature metrics are not about beautiful dashboards. They are about building a reliable feedback loop with reality, one that enables fast learning and products that truly work.