We analyzed thousands of conversations in a first-of-its-kind study. The biggest failures were invisible.

Invisible failures in human–AI interactions



Most teams building AI products today monitor a range of standard quality signals, such as completion rates, response times, and user satisfaction scores. These signals are meant to capture the state of the product and help teams home in on systematic failures as they emerge.

Do these quality signals actually provide an accurate picture? Much depends on this question, but it has not been systematically addressed. To begin to fill this empirical gap, we conducted a large-scale study of the WildChat dataset, a collection of over 1 million ChatGPT conversations. WildChat users were given free access to ChatGPT in exchange for having their deidentified conversations released publicly, making it the largest naturalistic conversational AI dataset available to date.

Our analysis reveals that the standard quality signals are woefully inadequate. Of all the failures we identified in WildChat, 78% produced no visible signal of failure: no corrections, no complaints, no abandonment. We call these invisible failures. Luckily, invisible failures are not random. They cluster into recognizable patterns that we can monitor for.


Published paper available on arXiv:
Invisible failures in human-AI interactions


Chris Potts

Co-Founder + Chief Scientist