We analyzed thousands of conversations in a first-of-its-kind study. The biggest failures were invisible.

Invisible Failures in AI

How much does a user’s skill with AI shape what AI actually delivers for them?

This question is critical for users, AI product builders, and society at large. Using an annotated sample of 27K transcripts from WildChat, we show that fluent users take on more complex tasks than novices and adopt a very different mode of interaction.

Reading time: 2 min

TL;DR: Bigspin analyzed 27K richly annotated transcripts from the WildChat-4.8M dataset and found that high-fluency users fail 64% of the time, while low-fluency users fail just 24% of the time. The catch: 86% of novice failures are invisible, and 59% of expert failures are caught and recovered. Fluency is about engaging critically with AI – pushing back, asking questions. For teams building AI products, this shifts the understanding of what user success looks like within these products.


What is AI fluency?

AI fluency is a user's level of skill in working with AI, ranging across a spectrum from low to high (novice to expert). At the high end, users operate in an augmentative mode: iterating with the AI, refining goals mid-conversation, and critically assessing outputs. At the low end, users operate in a delegative mode: passively accepting the AI's plans and responses and treating the output as final. Most users fall somewhere in between.

The distinction matters because the same model produces dramatically different outcomes depending on which mode the user is in. In our analysis, 93% of high-fluency interactions were augmentative. Fewer than 1% of low-fluency interactions were.


The paradox: experts fail more

The intuitive story says experts should fail less. These users know the tools, they write better prompts, and they understand the failure modes of AI.

The data says the opposite. 64% of expert conversations contain a failure, compared to 24% of novice conversations. Experts attempt harder work — average task complexity of 3.1 on a five-point scale, versus 1.5 for novices — but that alone does not explain the gap.

What explains it is how users engage with AI. Experts catch failures because they are looking for them: they probe, ask follow-up questions, and check outputs against what they expect. This critical posture surfaces failures that would otherwise pass unnoticed.


Visible vs. invisible failures

  • 59% of expert failures are visible. The user notices, names the problem, and works through it. Often the conversation ends with partial recovery — value salvaged from a flawed start.

  • 86% of novice failures are invisible. The conversation reads as fine. The user accepts the output, leaves, and never registers that the AI missed the mark.

Mapped against the eight invisible failure archetypes from earlier Bigspin research, two patterns dominate. Experts cluster around the Partial Recovery: hitting a wall, recognizing it, and steering toward something useful. Novices cluster around the Walkaway: abandoning the conversation without resolution and without a signal to the system that anything went wrong.

This is the core of the paradox. Novices look successful because they cannot see what they are missing. Experts look failure-prone because they can.


What this means for product and engineering leaders

If you are building an AI product, your usage metrics are probably misleading you. A clean conversation log is not evidence of a good experience. It might be evidence of a user who did not know what to ask for and accepted whatever they got.

Three implications for how you build:

Frictionless is not the same as good. The dominant UX pattern for AI products — ask, get an answer, accept — is optimized for the delegative mode. It rewards the behavior most strongly correlated with low fluency and invisible failure. Designing for engagement means giving users surfaces to push back, compare, and verify, even when that adds friction.

Instrument for the failures you cannot see. The ones that matter least are the ones users complain about. The ones that matter most are the ones that look fine in the logs and quietly erode trust over weeks. Product analytics built on thumbs-up/thumbs-down or session length will systematically underweight novice failures. Quality monitoring needs to read the conversation, not just count it.
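To make this concrete, here is a minimal sketch of what "reading the conversation, not just counting it" could look like. This is an illustration, not the study's methodology: the marker list, turn schema, and threshold are all hypothetical assumptions, and a production monitor would use a trained classifier rather than keywords. The idea is to flag conversations that show no augmentative signals (pushback, follow-up questions) after the opening request, since those are the ones most likely to be silent Walkaways.

```python
# Hypothetical heuristic for surfacing possible invisible failures.
# Marker list and conversation schema are illustrative assumptions.

PUSHBACK_MARKERS = (
    "are you sure", "not quite", "that's wrong", "instead",
    "can you check", "why", "source",
)

def augmentative_signals(turns):
    """Count user turns that question, refine, or push back on the AI."""
    signals = 0
    for turn in turns:
        if turn["role"] != "user":
            continue
        text = turn["text"].lower()
        if "?" in text or any(m in text for m in PUSHBACK_MARKERS):
            signals += 1
    return signals

def possible_walkaway(conversation):
    """Flag a conversation whose log 'looks fine' but contains no
    augmentative engagement after the opening request."""
    user_turns = [t for t in conversation if t["role"] == "user"]
    return len(user_turns) >= 1 and augmentative_signals(conversation[1:]) == 0
```

A sweep like this inverts the usual analytics lens: instead of ranking conversations by explicit complaints, it surfaces the quiet ones where a thumbs-up/thumbs-down signal would never fire.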

Teach the augmentative behaviors. Pushing back, refining mid-conversation, treating plausibility with suspicion — these are specific and learnable. Onboarding flows, empty states, and in-product nudges can shape user behavior toward the patterns that produce real value. You are not just shipping a model. You are shaping how people engage with it.


What this means for individuals

Engage, do not defer. Question outputs. Refine goals partway through. Treat a confident-sounding answer as a hypothesis, not the final answer. The behaviors that produce value with AI are not mystical – they look a lot like the behaviors that produce value in any collaboration with a smart, fast, occasionally wrong colleague.


The bigger picture

The "AI as oracle" narrative works against users. It frames the model as a source of answers and the user as a recipient. The data points to a different frame: the model is a collaborator, and the user's posture toward it is the deciding variable.

Fluency is a set of practices. Teams that recognize this — in how they build products, measure quality, and onboard users — will ship AI experiences that actually deliver. Teams that do not will keep shipping interfaces that look great in demos yet fail at scale.


This post draws on Bigspin's research with Chris Potts (Stanford NLP) and Moritz Sudhof, analyzing 27,000 annotated conversations from the WildChat-4.8M dataset. The full technical report is available here.


Megan Melack

Head of Design + Brand