What is AI fluency and why does it matter for AI products?

AI fluency is a user's level of skill in working with AI, ranging from novice to expert. High-fluency users operate in an augmentative mode — iterating, refining goals, and critically assessing outputs. Low-fluency users operate in a delegative mode — passively accepting responses as final. The same AI model produces dramatically different outcomes depending on which mode the user is in, making fluency a critical factor in product success.

Why do AI product metrics look good when users aren't being retained?

Most AI failures are invisible to standard monitoring. Bigspin's analysis of 27,000 conversations found that 86% of novice user failures leave no trace in logs, feedback, or analytics. These users accept flawed outputs without complaint and quietly disengage. Clean conversation logs and positive CSAT scores can mask widespread quality problems that drive silent churn.

Why do expert AI users fail more often than beginners?

Expert users fail 64% of the time compared to 24% for novices, but not because they are worse at using AI. Experts attempt harder tasks (average complexity 3.1 vs 1.5 on a five-point scale) and actively probe for errors. 59% of expert failures are visible — the user catches the problem and works through it. Novices fail less often but miss 86% of their failures entirely.

How does user skill level affect AI product outcomes?

User skill level is the deciding variable in AI conversation quality. In Bigspin's research, 93% of high-fluency interactions were augmentative — users iterated, refined, and challenged the AI. Fewer than 1% of low-fluency interactions were. Teams building AI products need to instrument for invisible failures and design experiences that encourage critical engagement rather than passive acceptance.

What is the difference between augmentative and delegative AI use?

Augmentative users iterate with the AI, refine goals mid-conversation, and critically assess outputs. Delegative users passively accept the AI's plans and responses, treating the output as final. Augmentative use is strongly correlated with high fluency and visible failure recovery. Delegative use is correlated with low fluency and invisible failures that erode product quality silently.

How can AI product teams detect invisible conversation failures?

Standard monitoring tools like thumbs-up/thumbs-down feedback, session length, and error rates systematically miss most failures. Quality monitoring needs to analyze the actual content of conversations, not just count them. Bigspin's multi-pass analysis reads 100% of transcripts to surface failure patterns that leave no trace in conventional analytics — the silent mismatches, walkways, and confidence traps that drive users away without a signal.

The Mystery of Opus 4.6’s Sudden Tokenflation

Why did Opus 4.6’s token usage in Claude Code skyrocket in the period February 25 to March 4, 2026?

June 4, 2026

Why did Opus 4.6’s token usage in Claude Code skyrocket in the period February 25 to March 4, 2026? We stumbled upon this trend in the course of our analysis of the SWE-chat corpus, and we can’t get it out of our heads. We’re now obsessed with solving the mystery for its own sake… but of course these token-usage trends also have a significant impact on everyone’s Anthropic bills!

The Mystery

The following figure summarizes what we know so far. We are tracking various token-usage and token-usage-adjacent quantities over time (x-axis), and the y-axis measures the changes in each of these quantities relative to its respective early February baseline.
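As a rough illustration of the normalization described above, here is a minimal sketch of expressing a daily series relative to its own baseline. The function name, the exact baseline window dates, and the numbers are all assumptions made for the example; the article only says "early February", and none of the values below are SWE-chat data.

```python
from datetime import date
from statistics import mean


def relative_to_baseline(daily, baseline_start, baseline_end):
    """Divide each daily value by the mean over the baseline window (inclusive)."""
    baseline_values = [
        value for day, value in daily.items()
        if baseline_start <= day <= baseline_end
    ]
    if not baseline_values:
        raise ValueError("no observations in the baseline window")
    baseline = mean(baseline_values)
    return {day: value / baseline for day, value in sorted(daily.items())}


# Made-up numbers for illustration only; not SWE-chat data.
output_tokens = {
    date(2026, 2, 6): 1000.0,
    date(2026, 2, 7): 1200.0,
    date(2026, 2, 8): 800.0,
    date(2026, 2, 26): 4000.0,
}

# Hypothetical baseline window: February 6 to February 8.
rel = relative_to_baseline(output_tokens, date(2026, 2, 6), date(2026, 2, 8))
print(rel[date(2026, 2, 26)])  # 4.0, i.e. four times the baseline mean
```

Putting every quantity on this shared "multiple of its own baseline" scale is what allows token counts, turn counts, and durations to be compared on one y-axis.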

Starting around February 25, Opus 4.6 token usage (the red line in our plot) skyrocketed. This is the highlighted area of the plot: the climb. We are not aware of any change to the model that might have caused the climb. That is the essence of our mystery.

Adaptive thinking does not solve the mystery

Opus 4.6 launched on February 5, 2026. The big change from Opus 4.5 was the introduction of adaptive thinking, which “lets Claude dynamically determine when and how much to use extended thinking based on the complexity of each request”. Anthropic launched the model with a default thinking effort of “high”. People immediately noticed that Opus 4.6 used far more tokens than Opus 4.5. That was before the late-February climb even began.

On March 4, Anthropic changed the default thinking effort to “medium”. In an April 23 postmortem, they said this was a response to the latency that “high” introduced. As you can see, this might have slowed the rise in tokens somewhat, but it did not reverse it. On April 7, they changed the default back to “high”. This change did not cause a sudden spike in token usage comparable to the February 25 spike. In other words, the known changes to the default thinking effort seem to have had a smaller effect than whatever happened on or around February 25.

The bugs that Anthropic shipped and reverted (introduced Mar 26 and Apr 16, reverted Apr 10 and Apr 20; see their postmortem for details) came well after the climb and do not seem to change things very much, so we set them aside as potential factors.

Ruling out other explanations

Our plot tracks a number of other quantities that might be relevant to this sharp increase, but none of them unravels our mystery.

For example, as models improve, session length is likely to increase. (We thank Will Held for raising this issue with us.) The brown line in our plot tracks session length at the level of turns, and the pink line tracks session length in seconds. Both show a modest increase overall. In contrast, the orange line tracks output tokens per turn, which controls for session length. This rises just as steeply as the per-session rate during the mystery period. Thus, increased session length does not solve the mystery; the per-turn rate also soared.
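To make the session-length control concrete, here is a small sketch that computes output tokens per session and per turn for two periods. The record layout (`output_tokens`, `turns`), the pooled per-turn definition (total tokens divided by total turns), and the numbers are assumptions for illustration, not the SWE-chat schema or its values.

```python
def per_session_and_per_turn(sessions):
    """Return (mean output tokens per session, pooled output tokens per turn)."""
    total_tokens = sum(s["output_tokens"] for s in sessions)
    total_turns = sum(s["turns"] for s in sessions)
    return total_tokens / len(sessions), total_tokens / total_turns


# Made-up sessions: turns grow by 1.2x, output tokens by 4.8x.
before = [
    {"output_tokens": 2000, "turns": 10},
    {"output_tokens": 4000, "turns": 20},
]
during = [
    {"output_tokens": 9600, "turns": 12},
    {"output_tokens": 19200, "turns": 24},
]

session_before, turn_before = per_session_and_per_turn(before)
session_during, turn_during = per_session_and_per_turn(during)

session_ratio = session_during / session_before  # growth per session
turn_ratio = turn_during / turn_before           # growth per turn
print(session_ratio, turn_ratio)
```

In this toy data, longer sessions account for only a factor of 1.2; the per-turn rate still quadruples. That is the shape of the argument in the paragraph above: if session length were the explanation, the per-turn ratio would stay near 1.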

Is Opus 4.6 just chattier? The dark and light purple lines track visible response tokens, with and without normalization by turn. These show that Opus 4.6 did become somewhat chattier, but the timing of this increase doesn’t align with the climb. During the mystery week itself, visible response tokens per turn is essentially flat (≈60 tokens / turn before and ≈63 during).

We’ve included a few other measures of session activity: tool calls, API calls, cache usage. All of these do increase over time, but they stay within a narrow band of roughly 1–3x the baseline. Only the number of output tokens jumped far beyond that band, starting February 25.

One might wonder whether the mystery is explained by a change in the user base for SWE-chat. In principle, a group of new users might have arrived on February 25 and caused a fundamental shift in the usage patterns. To address this, we defined a cohort of 15 users who had sessions throughout the key period. This cohort’s pattern tracks the larger dataset pattern almost exactly (dashed blue line), so this seems like a dead end. (We also looked within individual days: every percentile of the per-session token distribution moves up together through the climb, which rules out a gradual server-side rollout where some sessions got the new behavior before others.)
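The two checks in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not our actual pipeline: "had sessions throughout the key period" is approximated here as "at least one session in each of several date windows", the window boundaries and record fields (`user`, `day`, `output_tokens`) are hypothetical, and the data are made up.

```python
from collections import defaultdict
from datetime import date
from statistics import quantiles


def stable_cohort(sessions, windows):
    """Users with at least one session in every (start, end) window, inclusive."""
    windows_hit = defaultdict(set)
    for s in sessions:
        for i, (start, end) in enumerate(windows):
            if start <= s["day"] <= end:
                windows_hit[s["user"]].add(i)
    return {user for user, hit in windows_hit.items() if len(hit) == len(windows)}


def daily_quartiles(sessions):
    """Quartile cut points of per-session output tokens for each day."""
    by_day = defaultdict(list)
    for s in sessions:
        by_day[s["day"]].append(s["output_tokens"])
    # statistics.quantiles needs at least two data points on older Pythons.
    return {
        day: quantiles(values, n=4, method="inclusive")
        for day, values in sorted(by_day.items())
        if len(values) >= 2
    }


# Made-up sessions for illustration only.
sessions = [
    {"user": "a", "day": date(2026, 2, 20), "output_tokens": 100},
    {"user": "b", "day": date(2026, 2, 20), "output_tokens": 200},
    {"user": "a", "day": date(2026, 2, 27), "output_tokens": 400},
    {"user": "c", "day": date(2026, 2, 27), "output_tokens": 800},
    {"user": "a", "day": date(2026, 3, 8), "output_tokens": 500},
    {"user": "b", "day": date(2026, 3, 8), "output_tokens": 900},
]
# Hypothetical windows: before, during, and after the climb.
windows = [
    (date(2026, 2, 10), date(2026, 2, 24)),
    (date(2026, 2, 25), date(2026, 3, 4)),
    (date(2026, 3, 5), date(2026, 3, 15)),
]

cohort = stable_cohort(sessions, windows)
quartiles_by_day = daily_quartiles(sessions)
```

If a new behavior were being rolled out to only a fraction of sessions, the upper quartiles would be expected to move before the lower ones. Comparing the cut points day by day is one way to look for that signature.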

The SWE-chat data do not allow us to precisely count the number of tokens that come from thinking. However, all the more specific measures of token usage that we do have remain in that 1–3x band relative to their baselines, which strongly suggests that the mystery increase is caused by either tool-orchestration overhead or Opus’s private lucubrations.
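The reasoning here is attribution by subtraction: whatever part of the increase the measurable components cannot explain is left to the components we cannot see. A minimal sketch follows, using the visible-response figures quoted earlier (about 60 tokens per turn before the climb and about 63 during it) together with total output-per-turn values that are invented purely for the example.

```python
def unattributed_share_of_increase(total_before, total_during,
                                   visible_before, visible_during):
    """Fraction of the rise in total tokens NOT explained by visible tokens."""
    increase = total_during - total_before
    if increase <= 0:
        raise ValueError("total token usage did not increase")
    visible_increase = visible_during - visible_before
    return (increase - visible_increase) / increase


# Visible response tokens per turn: approximate values from the text above.
visible_before, visible_during = 60.0, 63.0
# Total output tokens per turn: HYPOTHETICAL values, not SWE-chat data.
total_before, total_during = 200.0, 800.0

share = unattributed_share_of_increase(
    total_before, total_during, visible_before, visible_during
)
print(f"{share:.1%} of the increase is not visible response text")
```

With these placeholder totals, nearly all of the increase lands in the unmeasured remainder. The exact share depends on the real totals, but the logic is the same: a component that barely moves cannot account for a total that jumps.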

Why this matters

Our ultimate goal is to assess the extent to which all these tokens are making us more successful as engineers. The above situation seems ideal for addressing this, because we can track outcome success measures over this same period and see whether they correlate with token usage. From there, given the richness of SWE-chat, we might even be able to make progress in identifying the causal links between these quantities. However, it makes us really nervous that we don’t know what caused the biggest change in token usage here.

Help us solve the mystery

If you would like to join our detective team, please reach out to our tip-line: hello@bigspin.ai. We welcome hunches, hypotheses, and insights. We also have a lightweight script you could run to gather your own token-usage data, and then you could share the resulting spreadsheet with us to expand our collective evidence base.

Moritz Sudhof

Co-Founder & CEO

Chris Potts

Co-Founder + Chief Scientist