
WHO IS THE MODEL WORKING FOR?

  • Apr 21
  • 4 min read

A developer pins their workflow to a stable version of a closed-API model. They change nothing. Not the prompt. Not the pipeline. Not the evaluation set. Six months later, success rate has dropped from roughly nine out of ten to roughly six out of ten. There is no changelog, no deprecation notice, no visible model-version change. The pin is still in.


That is the drift problem in its most operational form. The model name is pinned. The endpoint is pinned. The inputs are unchanged. But the behaviour has moved.
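The opening scenario can be made concrete with a drift canary: re-score a frozen eval set against the pinned endpoint on a schedule and compare against the baseline recorded when the pin was set. A minimal sketch, assuming boolean pass/fail scoring; the `drift_check` helper and its thresholds are illustrative, not any vendor's API:

```python
# Hypothetical drift canary: compare today's pass rate on a frozen eval set
# with the baseline recorded when the model was pinned.
def drift_check(results, baseline=0.9, tolerance=0.05):
    """results: list of booleans, one per frozen eval case (did it pass?).

    Returns (pass_rate, drifted). `drifted` is True once the pass rate
    falls more than `tolerance` below the recorded baseline.
    """
    pass_rate = sum(results) / len(results)
    return pass_rate, pass_rate < baseline - tolerance

# The scenario from the opening paragraph: nine in ten at pin time,
# six in ten months later, with nothing changed on the caller's side.
rate, drifted = drift_check([True] * 6 + [False] * 4, baseline=0.9)
# rate == 0.6, drifted is True
```

The point of a canary like this is not the arithmetic; it is that without a frozen eval set and a recorded baseline, there is nothing to compare against when the served system quietly moves.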


The evidence is public. In September 2025, Anthropic's postmortem identified three overlapping infrastructure issues that intermittently degraded Claude over roughly a month – a routing error that peaked at 16% of Sonnet 4 requests in its worst hour, a TPU server misconfiguration, and a compiler bug confirmed on Haiku 3.5 and believed to have touched a subset of Sonnet 4 and Opus 3 requests. In April 2025, OpenAI rolled out a GPT-4o update and began rolling it back four days later after users documented a shift towards sycophantic behaviour. The lab's status page has separately logged a degraded-performance incident on gpt-4o-2024-08-06 – evidence that a dated model identifier does not guarantee a stationary served system, whatever the reason for the degradation. Peer-reviewed work in Harvard Data Science Review documented GPT-4's accuracy on prime-number identification falling from 84% to 51% over three months – further corroboration that the service is not stationary, rather than proof of any particular mechanism behind the shift.


A pinned model identifier does not necessarily mean a stationary served system. The provider can keep the same model name while changing the infrastructure around it. Routing, deployments, hardware configuration, compiler behaviour, sampling implementation and serving setup can all shift without the caller changing anything. From the buyer's point of view, the pin held. From the system's point of view, something still moved.


This is the economic reality of serving frontier models at scale. These are among the most expensive production systems in software. Providers are continuously optimising for cost, latency, throughput and reliability. The problem is that those optimisations can be caller-visible, even when they do not appear as a formal model change.


For most casual use, this is tolerable. A human in a chat window can absorb small shifts in tone, verbosity or style. But enterprise workflows are often not using the model that way. They are parsing structured outputs, triggering downstream actions, or relying on consistent behaviour across repeated tasks. In those settings, predictability is part of the product.


That is why the drift question matters more in agentic and production workflows than in chat. A small behavioural change that a person would barely notice can break a pipeline. Slightly different JSON, a different tool-selection pattern, a weaker reading of the same instruction, a higher tendency to hedge or refuse, and the workflow stops behaving as designed. The provider's internal optimisation schedule becomes an external dependency the buyer did not explicitly choose.
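A sketch of how "slightly different JSON" surfaces in practice: a strict parser for structured output fails loudly the moment the model's response shifts. The schema here (`action`, `target`) is hypothetical, standing in for whatever contract a real pipeline enforces:

```python
import json

# Hypothetical contract for a tool-calling pipeline; a real workflow would
# enforce whatever schema its downstream actions depend on.
REQUIRED_KEYS = {"action", "target"}

def parse_tool_call(raw: str) -> dict:
    """Parse the model's structured output; fail loudly rather than guess."""
    payload = json.loads(raw)
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"drifted output, missing keys: {sorted(missing)}")
    return payload

parse_tool_call('{"action": "open_ticket", "target": "billing"}')   # parses cleanly
# A shift a human would barely notice -- one renamed key -- stops the pipeline:
# parse_tool_call('{"action": "open_ticket", "destination": "billing"}')  # ValueError
```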


The usual question has been which model is most capable. That framing assumes capability and predictability come together. Often they do not. The more useful question for many enterprises is simpler: which model will keep doing this job reliably, the way we need to run it?


Once you ask that question, a large share of workloads look different. Many enterprise tasks do not need frontier-level generality. They need repeatability, control and clear change management. They need a model that performs one bounded function well and does it again tomorrow. In those cases, maximum intelligence can be less valuable than governed behaviour.


That is where self-hosted and specialised open-weight models start to look stronger. They move when you move them – the model file on disk does not change unless you change it. That is a meaningful operational property for teams that care about auditability, regression testing and controlled deployment.
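That property is checkable. One sketch: record a SHA-256 digest of the weights file at deployment, then re-verify it whenever you want proof that the artefact has not moved. File names and paths here are illustrative:

```python
import hashlib
from pathlib import Path

def weights_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a local weights file, read in chunks so large files fit in memory.

    If today's digest matches the one recorded at deployment, the model
    artefact has not changed; any mismatch is a change you made, or can see.
    """
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Record weights_digest(Path("model.safetensors")) at deployment and
# re-check it in CI; the file name is a placeholder.
```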


Running your own inference means owning your own incidents. Open-weight models still lag the best closed models on some real-world robustness measures. Inference is not perfectly deterministic even at temperature zero – batch-level non-determinism, floating-point non-associativity and kernel selection introduce variance on every major GPU inference stack. But those are implementation problems inside the buyer's own change boundary. That is often preferable to depending on a vendor's invisible one.


For genuinely open-ended reasoning tasks, frontier APIs will often remain the right choice. But for many enterprise workflows, the task is narrower. More repeatable. More operational. In those environments, the best model is often the one whose behaviour the buyer can govern, regardless of whether it tops a leaderboard.


Model choice is becoming less about raw capability in isolation, and more about whose incentives shape the system's behaviour over time. If the provider is continuously optimising the served system for its own economics, the buyer is forced to live with the byproducts. But when the buyer controls the deployment boundary, those trade-offs become visible and manageable.


For many workloads, what matters is a model whose reliability the buyer can control. With that frame, much of the noise about benchmarks and frontier races becomes background to a simpler conversation about fit.

