research

The Preference–Performance Paradox: Why Your Chatbot's NPS Is Lying to You

Pedro Villa

Co-Founder & CTO

Co-founder of Serena Labs. Holds an MSc in Economics from Fundação Getulio Vargas (FGV/EESP). Also founder of Anouk Partners, a firm specialized in data, applied econometrics and digital transformation. Previously founder of multiple healthcare ventures in Brazil with successful exit, and over 15 years of executive experience as CEO, CFO and CIO of large companies.

LinkedIn ORCID Email

February 16, 2026 · 5 min read · analysis

The Preference–Performance Paradox: Why Your Chatbot's NPS Is Lying to You

The setup

You launched a conversational AI interface for your enrollment flow. Six months in, the dashboards look good:

NPS: up
CSAT: up
Self-reported preference vs. the old form: roughly 70% in favor of the chatbot
Qualitative feedback: warm, conversational, modern

You also notice, when you bother to check, that conversion has dropped, time-to-completion is up, and your operations team is fielding more clarification calls. But the satisfaction numbers are clear. So the chatbot must be working.

This is the preference–performance paradox, and it is one of the most well-documented and least-acknowledged patterns in digital health interface research.

The clearest evidence

The cleanest demonstration to date comes from Soni et al. (2022), published in Frontiers in Digital Health. The authors ran a within-subject counterbalanced experiment with N = 206 participants. Same standardized task: completing the PRAPARE social determinants of health assessment. Each participant did it twice, once on a conversational agent (Dokbot), once on a structured online form (REDCap).

Subjective metrics:

69.9% preferred the chatbot (vs 30.1% who preferred the form)
Net Promoter Score favored the chatbot (24 vs 13)
System Usability Scale scores were statistically comparable (chatbot 69.7, form 67.7)

Objective metrics:

Chatbot was significantly slower: median 212.5 seconds vs. 123.0 seconds for the form (a 73% time penalty)
Equivalent data quality on this particular task (it was a structured assessment, not a multi-attribute decision)

Same participants. Same data. The chatbot is preferred by a 2:1 margin and is roughly 75% slower at producing equivalent output.

This is not an isolated finding. Iftikhar et al. (2021) in JMIR Human Factors replicated the pattern in a counterbalanced experiment with N = 20 healthcare staff: the conversational form was rated as "intuitive" and "engaging," but was slower than the single-page form. Participants self-reported preferring the conversational interface even though they performed measurably worse on it.

When the task complexity increases (from single-document data collection to multi-attribute decision making such as insurance plan selection), the performance penalty widens dramatically. Adjacent-domain benchmarks suggest 30–40% completion rates for open chatbot enrollment flows against 67–86% for well-designed wizards (Baymard 2025; Do et al. 2024).

Why the paradox exists

The paradox is not a measurement error. It is a real psychological phenomenon with at least two distinct mechanisms.

Mechanism 1: Interaction enjoyment and decision quality are different psychological dimensions. When users evaluate an interface, they primarily assess how it felt to use (the conversational naturalness, the responsiveness, the perceived effort). They do not directly evaluate whether they made an objectively better decision, because they cannot observe the counterfactual. A chatbot that feels warm and conversational scores well on enjoyment, even when it produced a worse decision outcome.

Mechanism 2: Expectation–confirmation dynamics produce delayed dissatisfaction. Chatbots create expectations of flexibility, personalization, and intelligence. In structured enrollment tasks, those expectations cannot be fulfilled. The user wants a coherent plan recommendation, but the chatbot keeps asking open-ended follow-up questions. The frustration is real but accumulates after the typical satisfaction survey window. By the time the dissatisfaction surfaces (as a service-desk call, a re-enrollment, a churn event), it has been decoupled from the original interface evaluation.

The combination produces a structural divergence: short-term, in-session metrics favor the chatbot. Long-term, outcome-based metrics favor the structured interface.

What this means for your dashboards

If you are running an A/B test on a healthcare interface and your primary outcome is preference, NPS, CSAT, or System Usability Scale, you are systematically biased toward the chatbot, regardless of which interface is actually better for your customers. This is not a hypothetical risk. It is the modal pattern in the published evidence.

The fix is not to abandon satisfaction metrics. It is to read them alongside objective performance metrics that the user cannot self-assess.

Three categories of objective metrics that the preference–performance paradox literature converges on:

Completion quality, not completion rate. Completion rate alone is insufficient if the completed outcome is wrong. For insurance enrollment, this means measuring plan–profile adequacy: was the selected plan the right plan for the user's stated medical, financial, and family circumstances? Dominated choice rate (the percentage of users who selected an option strictly worse than another available option on every dimension) is a powerful indicator and was the centerpiece of Bhargava et al.'s 2017 finding that the majority of insured Americans chose dominated plans.

Time-to-completion, weighted by task complexity. Slower is not always worse, but for routine high-volume tasks, chatbot time penalties compound across your customer base. If your structured interface completes the same task in half the time with equivalent quality, that is a real operational signal, visible in your call-center load and in churn.

Equity-disaggregated outcomes. Average performance hides the most important signal: how does each interface perform for your lowest-literacy customers? If your chatbot performs adequately for high-Health-Insurance-Literacy users but catastrophically for low-HIL users, you are systematically shifting decision burden onto the population least equipped to bear it. Disaggregate by literacy proxy variables and report distributional outcomes.

The product implication

The paradox is not an argument against conversational AI. It is an argument for matching interface type to task type, and for measuring what matters when you evaluate the match.

Conversational interfaces work well for tasks where:

Decision stakes are low
Variability is high (no single right answer)
Empathic, flexible interaction is itself part of the value (re-engagement, support, recovery)

Structured interfaces win on tasks where:

Decision stakes are high (long-term financial or health consequences)
The decision involves comparing many interdependent attributes
The user lacks domain expertise to formulate the right questions
Regulatory compliance requires deterministic completeness (IDD suitability assessments, ACA SBC disclosures, ANS Resolução)

Most healthcare enrollment flows contain both types. Most current chatbot deployments use the conversational interface for both. The published evidence predicts, and field deployments confirm, that this design choice optimizes for satisfaction at the expense of outcomes.

What we do at Serena Labs

Serena Labs builds AI-driven engagement for healthcare that allocates interface type to task type. Conversational AI for triage, onboarding, and recovery, where the evidence supports it. Structured AI (what we call Structurally Guided LLM Interfaces) for configuration and selection, where the evidence requires it. We evaluate every deployment on the joint distribution of completion, decision quality, and equity outcomes, not on satisfaction alone.

If you are deploying conversational AI in enrollment and want a second opinion on what your current metrics are actually telling you, talk to our team.