
The $36 Billion Question: What the Conversational AI in Healthcare Hype Misses

Pedro Villa

Co-Founder & CTO

Co-founder of Serena Labs. Holds an MSc in Economics from Fundação Getulio Vargas (FGV/EESP). Also founder of Anouk Partners, a firm specialized in data, applied econometrics and digital transformation. Previously founder of multiple healthcare ventures in Brazil with successful exit, and over 15 years of executive experience as CEO, CFO and CIO of large companies.


March 16, 2026 · 7 min read · analysis


The number you've heard

If you have sat through a vendor pitch or investor deck on AI in healthcare in the last 18 months, you have seen a version of this slide.

USD 8.2 billion in 2023. USD 36 billion by 2032. A 4.4× expansion of the global healthcare conversational AI market in nine years (Gartner, 2024). The slope is steep, the trajectory inevitable, the implicit message clear: be in this market, or be left behind.

The number is real. It comes from Gartner's Market Guide for Conversational AI Solutions (report G00807063, April 2024). The methodology is the standard vendor-survey-plus-extrapolation that drives most enterprise software market sizing. Healthcare is one of the high-growth verticals.

What the slide does not show is the second number you should know.

The number you have not

The most rigorous synthesis to date of patient engagement with conversational agents in healthcare is Cevasco et al. (2024), published in the Journal of Medical Systems. The meta-analysis pools 14 randomized controlled trials conducted between 2016 and 2022, with individual study samples ranging from 28 to 9,124 participants. The pooled effect size:

RR = 0.99 (95% CI 0.95–1.03).

A null effect. The confidence interval includes unity. Translated into plain language: across 14 controlled studies and tens of thousands of patients, conversational AI did not produce a statistically detectable improvement in engagement and retention versus standard-of-care alternatives.
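To make the "includes unity" reading concrete, here is a minimal sketch that works only from the three reported numbers (RR = 0.99, 95% CI 0.95 to 1.03). It assumes the interval was built symmetrically on the log scale with a normal approximation, which is common for risk ratios but is not confirmed here for Cevasco et al.; the function name `rr_summary` is illustrative, and the outputs are approximate because the inputs are rounded to two decimals.

```python
import math

def rr_summary(rr, lower, upper, z_crit=1.96):
    """Approximate z-score and two-sided p-value from a risk ratio and its CI.

    Assumption: the CI is symmetric on the log scale (normal approximation).
    """
    # Back out the standard error of log(RR) from the interval width.
    se_log_rr = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(rr) / se_log_rr
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "includes_unity": lower <= 1.0 <= upper,
        "z": z,
        "p_value": p_value,
    }

# Pooled estimate as reported in the article.
result = rr_summary(rr=0.99, lower=0.95, upper=1.03)
print(result)
```

Under these assumptions the interval contains 1.0 and the approximate p-value sits far above the conventional 0.05 threshold, which is what "no statistically detectable improvement" means in practice.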

Only one of the 14 trials (Fitzpatrick et al.) showed significantly higher retention in the chatbot condition. The authors of the meta-analysis conclude, in their own words, that "including chatbot technology in eHealth applications is not a substitute for holistic eHealth product development."

The earlier scoping review by Milne-Ives et al. (2020) in the Journal of Medical Internet Research examined 31 studies of AI conversational agents in healthcare. Effectiveness was reported for 30 of those; 23 showed positive outcomes, but the positive cases were concentrated in narrow task types (manualized protocols, single-domain assessments). Generalization across healthcare interaction types was not supported.

Sohn et al. (2026), an npj Digital Medicine meta-analysis spanning 38 studies on depression (N = 7,401) and 34 studies on anxiety (N = 7,621), found small-to-moderate positive effects on mental-health symptoms (g = 0.31 and g = 0.28). Notably, the positive subset was concentrated in structured, manualized chatbot protocols, not open-ended generative agents. This is the same pattern that recurs across the literature: structure matters; generative naturalness alone does not.

The pattern across the most rigorous syntheses available is consistent: aggregate effectiveness of conversational AI in healthcare is closer to null than to the order-of-magnitude productivity gains implied by the market projections.

The arithmetic of the gap

Hold the two numbers next to each other.

A market projected to grow from $8.2B to $36B between 2023 and 2032 implies that buyers are expected to add nearly $28 billion in annual spending on healthcare conversational AI products by the end of that horizon. The strongest evidence base assembled to date (14 RCTs, N > 55,000) supports a null aggregate effect on the primary outcome those products are sold for.
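The market-side arithmetic can be reproduced in a few lines. This sketch uses only the two Gartner figures quoted above; the implied compound annual growth rate is a derived quantity that assumes smooth year-over-year compounding, and it is not a number taken from the Gartner report. The function name `market_gap` is illustrative.

```python
# Figures quoted in the article (Gartner, 2024), in USD billions.
MARKET_2023 = 8.2
MARKET_2032 = 36.0
YEARS = 2032 - 2023  # nine-year horizon

def market_gap(start, end, years):
    """Return the expansion multiple, implied CAGR, and absolute gap."""
    multiple = end / start
    # Assumption: constant compounding over the whole horizon.
    cagr = multiple ** (1 / years) - 1
    gap = end - start
    return multiple, cagr, gap

multiple, cagr, gap = market_gap(MARKET_2023, MARKET_2032, YEARS)
print(f"expansion multiple: {multiple:.1f}x")
print(f"implied CAGR: {cagr:.1%}")
print(f"gap between 2023 and 2032 market size: ${gap:.1f}B")
```

The multiple and the gap match the 4.4× and roughly $28 billion figures used in this piece; the implied growth rate works out to about 18% per year under the compounding assumption.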

Both can be true at the same time. Markets are not priced by efficacy; they are priced by adoption velocity, vendor enthusiasm, and the conviction of buyers that this generation of technology will succeed where the previous one did not. Healthcare conversational AI is currently in the conviction phase. The evidence catches up later, or it does not.

The question for any healthcare operator allocating budget this cycle is not "is the market real?" The market is real. The question is what fraction of the $28 billion will be deployed on interface types that the published evidence already predicts will underperform, versus on interface types the same evidence supports.

This is not a hypothetical risk. It has a name.

Agent washing

Gartner's Innovation Insight for the AI Agent Platform Landscape (report G00825163, March 2025) coined the term "agent washing" to describe the practice of rebranding existing chatbot products as "AI agents" without substantive capability improvements. The motivation is clear: agent is the new keyword; budget moves toward it; vendors that own incumbent chatbot products do not want to be left in the previous category.

The result is a market in which the term "AI agent" no longer reliably distinguishes products with autonomous, multi-step, tool-using behavior from products that are open-ended generative chatbots with a new sticker. Buyers who evaluate vendors by category label rather than by demonstrated capability are vulnerable to deploying the rebranded product and discovering, six months in, that they bought the previous generation at the new price.

Dall'Occhio (2026), in a Gartner report on how healthcare CIOs realize AI value, documents that many AI projects in healthcare fail to capture the ROI promised by vendors. The gap between projection and realization is structural, not anomalous.

For investors and analysts modeling the $36B trajectory, this is the operational reality below the curve.

What the evidence does support

Not all conversational AI in healthcare is in the null-effect category. Two recent randomized trials illustrate where the technology is supported and where it is not.

Tao et al. (2026), an RCT published in Nature Medicine (N = 2,069 patients plus 111 specialists), evaluated an LLM chatbot called PreA for primary-to-specialist care transitions. The chatbot performed structured pre-assessment before the specialist visit. The trial reported a 28.7% reduction in physician consultation duration. Note the framing: this is a chatbot for a structured, sequential pre-assessment task, with explicit clinical scaffolding. Not open-ended dialogue.

Kaphingst et al. (2024), the BRIDGE randomized clinical trial published in JAMA Network Open (N = 3,073), compared a structured rules-based chatbot against standard-of-care delivery for cancer genetic services. The result: equivalence (estimated percentage-point difference 2.0; 95% CI −1.1 to 5.0). The chatbot matched standard-of-care, which is a real outcome for scalability arguments. Again: the chatbot was structured, scripted, and rules-based, not open-ended generative.
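An equivalence claim is usually judged by whether the whole confidence interval falls inside a prespecified margin. The sketch below applies that check to the interval reported above. The BRIDGE trial's actual prespecified margin is not stated in this article, so the two margins used here are purely hypothetical, chosen to show how the conclusion depends on the margin.

```python
def ci_within_margin(lower, upper, margin):
    """True if the whole CI lies strictly inside (-margin, +margin)."""
    return -margin < lower and upper < margin

# Reported interval for the percentage-point difference (Kaphingst et al., 2024).
BRIDGE_CI = (-1.1, 5.0)

# Hypothetical margins in percentage points; NOT the trial's own margin.
for margin in (10.0, 3.0):
    verdict = ci_within_margin(*BRIDGE_CI, margin)
    print(f"margin ±{margin} pp -> equivalence supported: {verdict}")
```

With a wide margin the interval fits comfortably; with a narrow one the upper bound of 5.0 falls outside it. This is why the prespecified margin, and not just the point estimate, is the number to ask a vendor or author for.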

The pattern is consistent with the broader literature: conversational AI works when the task is appropriately structured and the dialogue is constrained. It does not work, in aggregate, when deployed as an open-ended interface for high-complexity decisions.

What this means for the next budget cycle

Three operational implications for anyone allocating against the $36B trajectory.

First, decompose the category. "Healthcare conversational AI" is not one market; it is at least three. Triage and qualification (low-complexity, high-volume) supports a chatbot interface and has positive evidence. Recovery and re-engagement (asynchronous, empathic) supports a chatbot interface for similar reasons. High-complexity decision support (plan selection, treatment choice, multi-attribute comparison) does not. Treating the three as one budget line will produce average outcomes that mask large internal heterogeneity.

Second, ask vendors for the specific evidence base, not for the category label. "We are an AI agent platform" is an asserted category. The question that distinguishes capability from rebranding is "In what controlled comparison has your product been evaluated, and what was the primary outcome?" If the answer is internal-only benchmarks or vendor-funded pilot studies without controls, you are in the conviction phase, not the evidence phase. That is fine, but price the risk accordingly.

Third, decouple satisfaction metrics from performance metrics. As covered separately in our piece on the preference–performance paradox, user satisfaction metrics systematically favor conversational interfaces even when objective performance is worse. If your vendor evaluation rests on NPS, CSAT, or qualitative feedback alone, you will systematically over-adopt the underperforming category.

What Serena Labs does

Serena Labs operates in the part of the $36B market where the evidence is strongest: structured AI for healthcare engagement. We build conversational interfaces for triage and recovery (where the literature supports them) and structured AI interfaces (what we call Structurally Guided LLM Interfaces) for configuration and selection (where the literature requires them). We benchmark every deployment on the joint distribution of completion, decision quality, and equity outcomes, not on satisfaction alone.

If you are an operator, investor, or analyst trying to distinguish signal from noise in the healthcare conversational AI category, book a call. We are happy to walk through the evidence base in detail.