Borrowing From Clinical Research
The surrogate endpoint problem is well understood in pharmaceutical research. A drug that lowers cholesterol (surrogate endpoint) does not necessarily reduce heart attacks (clinical endpoint). Regulators learned this the hard way after approving drugs based on biomarker improvements that later showed no mortality benefit -- or even increased harm.
UX research has its own surrogate endpoint problem, and we have not yet learned the lesson. Task completion rate, time-on-task, error rate, and satisfaction scores are surrogate endpoints -- they are measurable in controlled settings and correlate somewhat with product success. But they are not product success. The gap between what we can measure in a lab and what determines whether users adopt, retain, and value a product is enormous and systematically underestimated.
Why Lab Metrics Diverge From Real Adoption
The Motivation Gap
In a usability test, participants are paid to complete tasks. They have no choice but to engage. Their motivation is external -- they will work through friction because that is the job. In real life, users encounter friction and leave. They have alternatives. They have limited patience. They have no obligation to persist.
A 94% task completion rate in a test tells you that your interface is learnable under conditions of forced engagement. It tells you nothing about whether real users -- who can abandon at any moment -- will bother to complete the same task. The gap between can-complete and will-complete is where most product failures live.
This connects directly to why the observer effect in UX research is not just a methodological footnote but a fundamental validity threat: observed behavior under test conditions predicts unobserved real-world behavior less reliably than we assume.
The Context Vacuum
Usability tests strip away the context that determines real behavior:
- Competing priorities: Lab participants have nothing else to do. Real users are multitasking, distracted, and time-pressured.
- Information overload: Tests present one task. Real users face your product among dozens of tools, notifications, and demands.
- Social dynamics: Tests are solo. Real usage involves colleagues, managers, and workflows that create friction no lab can simulate.
- Switching costs: Tests evaluate single tasks. Real adoption requires giving up an existing tool or habit and then sustaining engagement across sessions, days, and weeks.
A product can be highly usable in isolation and still fail in context. The lab cannot simulate the full competitive attention landscape in which your product must survive.
The Novelty Effect
Usability test participants encounter your product fresh. Their reactions reflect novelty -- the interest and attention humans give to new things. Real adoption must survive the novelty wearing off. A product that is engaging on first use but tedious on tenth use will score well in tests and fail in production.
Diary studies reveal what interviews miss precisely because they capture the post-novelty reality: what users actually do with your product after the initial engagement fades. Single-session lab tests are structurally incapable of this measurement.
The Metrics That Actually Predict Adoption
Unprompted Return Rate
The strongest predictor of real adoption is whether users return without being prompted. This is unmeasurable in a lab -- it requires longitudinal observation of real behavior. But it is the metric that matters most: does the product create enough value that users voluntarily come back?
Teams that rely on lab metrics often discover at launch that their highly usable product has a retention cliff. Users can use it. They choose not to. No task completion rate predicted this because task completion measures capability, not motivation.
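One way to make "unprompted return" concrete is to compute it from a visit log. The sketch below is a minimal illustration, not a standard definition: the `(user_id, visit_date, prompted)` record shape, the `prompted` flag (meaning the visit followed a notification, email, or similar nudge), and the sample data are all assumptions made for the example.

```python
from datetime import date

def unprompted_return_rate(visits):
    """Share of users who came back on a later day without being prompted.

    visits: iterable of (user_id, visit_date, prompted) tuples (hypothetical shape).
    """
    visits = list(visits)
    # Find each user's first visit date.
    first_seen = {}
    for user_id, day, _ in visits:
        if user_id not in first_seen or day < first_seen[user_id]:
            first_seen[user_id] = day
    # A user counts as returned if any later-day visit was unprompted.
    returned = {
        user_id
        for user_id, day, prompted in visits
        if not prompted and day > first_seen[user_id]
    }
    return len(returned) / len(first_seen) if first_seen else 0.0

# Illustrative data only.
visits = [
    ("a", date(2024, 1, 1), False),
    ("a", date(2024, 1, 4), False),  # unprompted return
    ("b", date(2024, 1, 1), False),
    ("b", date(2024, 1, 3), True),   # came back only after a reminder
    ("c", date(2024, 1, 2), False),  # never returned
]
rate = unprompted_return_rate(visits)
print(f"Unprompted return rate: {rate:.0%}")
```

In this sample only user "a" returns without a nudge, so the rate is one in three. Whether a reminder-driven visit should count is a judgment call; excluding it keeps the metric focused on voluntary behavior.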
Integration Into Existing Workflows
Products succeed when they fit into existing behavior patterns rather than requiring new ones. Lab tests measure whether users can perform tasks in your product. They do not measure whether users will add your product to their existing workflow stack.
The real question is not "can they complete the task?" but "will they choose your product over their current approach?" This requires understanding their current workflow, measuring switching costs, and evaluating motivational thresholds -- none of which standard usability testing protocols address.
Perceived Value After Effort Investment
Adoption depends on whether the value delivered justifies the effort invested. Lab tests measure effort (time-on-task) but not value perception -- because the task is artificially assigned, not genuinely needed. Participants cannot evaluate value from artificial tasks because they have no genuine need the product is fulfilling.
This is why mixed methods research integration is essential for product validation: quantitative lab metrics need qualitative context about value perception, motivation, and real-world constraints that pure usability testing cannot provide.
The Organizational Consequences
False Confidence and Underinvestment in Discovery
High usability scores create organizational confidence that the product is ready. This confidence discourages further discovery research -- why investigate user needs when the product already tests well? Teams ship usable products that solve the wrong problems, and the usability data is cited as evidence of readiness.
The research triangulation principle exists precisely because no single method provides sufficient evidence for major product decisions. Usability testing is one data point. It needs triangulation with adoption data, qualitative depth research, and competitive analysis to support product-market fit claims.
Metric Optimization Without Value Creation
When task completion rate becomes the success metric, teams optimize for it. They add tutorials, tooltips, progressive disclosure, and guided flows that help test participants complete tasks. These same aids may reduce real-world adoption by adding complexity, slowing expert users, or patronizing returning users.
The metric is optimized. The product gets worse. This is Goodhart's Law applied to UX: when a surrogate measure becomes the target, it ceases to be a good measure. Teams that celebrate usability improvements while ignoring retention data are optimizing the wrong thing.
Better Measurement Approaches
Ecological Validity as Design Principle
Rather than abandoning lab testing, redesign tests for ecological validity:
- Embed tasks in realistic scenarios with competing demands and time pressure
- Do not force completion -- give participants permission to abandon and measure abandonment
- Include multi-session protocols to capture post-novelty behavior
- Test with realistic task sequences rather than isolated tasks
- Measure voluntary engagement -- after the required tasks, does the participant explore further?
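If participants are allowed to abandon, the session record needs to distinguish "gave up by choice" from "tried and failed". A minimal sketch of that tally follows; the three outcome labels and the sample data are assumptions for illustration, not an established coding scheme.

```python
from collections import Counter

# Hypothetical outcome labels a moderator might record per attempt.
OUTCOMES = ("completed", "abandoned", "failed")

def summarize_outcomes(outcomes):
    """Return the share of attempts ending in each outcome."""
    outcomes = list(outcomes)
    unknown = set(outcomes) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    if total == 0:
        return {name: 0.0 for name in OUTCOMES}
    counts = Counter(outcomes)
    return {name: counts[name] / total for name in OUTCOMES}

# Illustrative data only: ten attempts at one task.
attempts = ["completed"] * 6 + ["abandoned"] * 3 + ["failed"]
summary = summarize_outcomes(attempts)
for name, share in summary.items():
    print(f"{name}: {share:.0%}")
```

Reporting abandonment as its own category keeps voluntary exits visible instead of folding them into a single pass/fail completion rate.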
Complement Lab Data With Field Data
Lab metrics are useful inputs, not final verdicts. Complement them with:
- First-week retention curves from real usage data
- Time-to-value measurements in natural settings
- Abandonment analysis at real decision points
- Workflow integration audits (does the product fit existing behavior?)
- Switching cost assessments (what must users give up to adopt?)
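As an illustration of the first item, a first-week retention curve can be derived from signup dates and activity events. This is a sketch under stated assumptions: the input shapes (`signups` as a user-to-date mapping, `activity` as `(user, date)` pairs) and the sample data are hypothetical, and "retained on day N" here means active exactly N days after signup, which is only one of several definitions in use.

```python
from datetime import date

def first_week_retention(signups, activity):
    """Fraction of signed-up users active exactly N days after signup, for N = 1..7."""
    offsets = {user: set() for user in signups}
    for user, day in activity:
        if user in signups:
            n = (day - signups[user]).days
            if 1 <= n <= 7:
                offsets[user].add(n)
    total = len(signups)
    if total == 0:
        return []
    return [
        sum(1 for seen in offsets.values() if n in seen) / total
        for n in range(1, 8)
    ]

# Illustrative data only.
signups = {
    "a": date(2024, 1, 1),
    "b": date(2024, 1, 1),
    "c": date(2024, 1, 2),
    "d": date(2024, 1, 2),
}
activity = [
    ("a", date(2024, 1, 2)), ("a", date(2024, 1, 8)),
    ("b", date(2024, 1, 2)),
    ("c", date(2024, 1, 3)),
    ("d", date(2024, 1, 9)),
]
curve = first_week_retention(signups, activity)
for n, share in enumerate(curve, start=1):
    print(f"Day {n}: {share:.0%}")
```

With this sample, three of four users are active on day 1 and two of four on day 7. The shape of the curve between those points is what a single-session lab test cannot show.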
Reframe Success Metrics
Shift organizational language from surrogate endpoints to clinical endpoints:
| Surrogate (Lab) | Clinical (Real) |
|---|---|
| Task completion rate | Weekly active usage |
| Time-on-task | Time-to-first-value |
| Satisfaction score | Net retention |
| Error rate | Support ticket volume |
| Learnability | Self-sufficiency rate |
This reframing does not eliminate lab testing. It contextualizes it. Lab metrics become leading indicators that require confirmation from lagging indicators, not standalone proof of product viability.
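The pairing in the table can also be built into how results are recorded, so that a lab number is never reported without its field counterpart or an explicit note that none exists yet. The sketch below is one possible shape; the `MetricPair` class, its field names, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricPair:
    """A lab metric and the real-usage metric that should confirm it."""
    lab_name: str
    lab_value: float
    field_name: str
    field_value: Optional[float] = None  # None until real-usage data exists

    def status(self) -> str:
        # A lab result with no field counterpart stays explicitly unconfirmed.
        return "lab only" if self.field_value is None else "lab + field"

# Illustrative values only.
pairs = [
    MetricPair("Task completion rate", 0.94, "Weekly active usage"),
    MetricPair("Error rate", 0.05, "Support ticket volume", 120.0),
]
for pair in pairs:
    print(f"{pair.lab_name} -> {pair.field_name}: {pair.status()}")
```

The useful part is not the class itself but the default: a missing field value is recorded as missing rather than silently omitted from the report.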
The Deeper Epistemological Problem
Measurement Creates Its Own Reality
The surrogate endpoint problem is ultimately about confusing measurement with understanding. We measure task completion because it is measurable. We treat it as meaningful because we measured it. The circular logic -- what is measurable is important; what is important is measurable -- prevents teams from grappling with the harder, unmeasurable questions that determine product success.
Will users care enough to learn this? Will they remember to use it next week? Will it survive their first moment of frustration? Will they recommend it? These questions matter more than any lab metric, but they resist the clean quantification that makes usability testing feel scientific.
As observability for AI systems demonstrates in the engineering domain, the metrics you can easily collect are rarely the metrics that predict system health. Production AI systems fail in ways that test-environment metrics never surface. Products fail in ways that usability metrics never surface. The lesson is the same: instrument for the failures that matter, not the measurements that are convenient.
What Good Looks Like
Teams that avoid the surrogate endpoint trap:
- Use lab testing for what it measures -- learnability, interaction design quality, error-proneness
- Never claim product-market fit from usability data alone
- Design longitudinal studies that track real behavior over time
- Report lab metrics with explicit validity boundaries -- "this shows users can complete the task; it does not show they will"
- Invest equally in field research that captures natural usage patterns
- Hold product decisions to triangulated evidence -- lab data + field data + qualitative depth
The product that scores 94% in your usability test and 12% in week-two retention is not a measurement paradox. It is a product that is usable but not valuable. Lab metrics told you the first thing. Only real-world data tells you the second. Teams that conflate the two ship products that users can use perfectly well -- and never open again.



