A few months ago, I was working with Claude Code on training a small language model. The GCS bucket had grown to 40TB because of a checkpoint misconfiguration. I asked Claude Code to look into the issue and delete every checkpoint except the last two. It looked confident and proceeded to delete them, eventually wiping all of my training data. Claude Code simply apologized, without any accountability.
I have been building software with LLMs for the past two years. This pattern is not rare. I see it every week. A code agent that emits a destructive shell command because the surface form of the task resembles a cleanup. A SQL agent that proposes a join that matches the structure of a thousand training examples but produces wrong results on the actual schema. A browser agent that submits a form that looks like the right form but operates on the wrong account.
These are not hallucinations in the usual sense. The model is not making up facts. It is taking actions without modeling what those actions will do. And the gap is not going to close by scaling next-token prediction, because the next-token objective never asks what an action will do.
This blog post is about a paper I just released that proposes one architectural response to that gap. I will tell you what worked, what did not work, and what I learned from running the experiments, including the results that contradicted my own hypotheses.
The Problem, Stated Plainly
An autoregressive language model is trained to maximize the probability of the next token given the previous ones. When you deploy that model as an agent, the action it takes is just another token sequence sampled from this distribution. The model has no representation of the world state that the action will modify. It has no representation of the world state that will result. It has no objective term that rewards accuracy of either.
What it has is a representation of which action-tokens tend to follow which contexts in its training corpus.
This is fine when the action and its consequences are both well represented in training data and the consequences are themselves textual. It is fine when you ask the model to write you a paragraph. It is not fine when the action causes a state change in a system outside the text stream. Filesystem. Database. Browser DOM. Physical actuator. Production environment.
Judea Pearl described three rungs of causal reasoning. Association is about what tends to co-occur. Intervention is about what happens if you act. Counterfactual is about what would have happened if you had acted differently. Token-prediction LLMs operate at the first rung. They are estimators of the joint distribution of token sequences. The do-operator, which distinguishes intervention from association, is not in their vocabulary. They cannot tell you what happens if you act, only what people tend to say after they have acted.
For an agent, this is the difference between "users who run command X tend to be in situation Y afterward" and "if I run command X, the system will be in situation Y afterward." The first is a statistical association from a corpus. The second is a causal prediction about a specific system. A token-prediction model can learn the first. It cannot learn the second, because it has no representation of the system in the first place.
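To make the distinction concrete, here is a minimal sketch of a toy system in which the two quantities come apart. Everything in it is an assumption made for illustration: the hidden condition `z`, the probabilities, and the function name `sample` are not from the paper. The outcome depends only on the hidden condition, yet it is strongly associated with the command in observational data.

```python
import numpy as np

def sample(n, rng, do_x=None):
    """Toy confounded system (all names and probabilities are illustrative).

    z: hidden condition of the system (say, "a backup exists")
    x: whether command X gets run; people mostly run it when z holds
    y: whether the system is healthy afterward; caused by z alone, never by x
    """
    z = rng.random(n) < 0.5
    if do_x is None:
        # Observational regime: x is chosen by people who can see z.
        x = np.where(z, rng.random(n) < 0.9, rng.random(n) < 0.1)
    else:
        # Interventional regime: the agent sets x itself, ignoring z.
        x = np.full(n, bool(do_x))
    y = np.where(z, rng.random(n) < 0.95, rng.random(n) < 0.05)
    return x, y

rng = np.random.default_rng(0)

x_obs, y_obs = sample(200_000, rng)
p_assoc = y_obs[x_obs].mean()        # estimates P(y=1 | x=1)

_, y_int = sample(200_000, rng, do_x=True)
p_do = y_int.mean()                  # estimates P(y=1 | do(x=1))

print(f"P(healthy | X was run) ~ {p_assoc:.2f}")
print(f"P(healthy | do(run X)) ~ {p_do:.2f}")
```

Under these made-up probabilities the analytic values are 0.86 for the association and 0.50 for the intervention. A corpus of what people did shows the first number. An agent that runs the command itself gets the second.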
What an Outcome-Aware Agent Needs
If you want an agent that knows what its actions will do, you need four things. A compact representation of the current world state. A transition model that predicts the next state given an action. An evaluation function that scores predicted outcomes against goals and safety constraints. A selection rule that uses simulated outcomes to choose, defer, or reject candidate actions.
The transition model is the part that next-token training does not give you. Building it so that it actually models causal consequences and not just statistical patterns in action-tokens is the technical problem the paper addresses.
The architecture I propose is called the Action-Conditioned Latent Structural Causal Model, or AC-LSCM. The name is a mouthful, but the idea is simpler than the name. The agent maintains a small set of latent factors that represent the state of its environment. The factors are connected by a learned sparse directed acyclic graph. Each factor's next value is computed by a small neural network that takes its parents in the graph, the action targeting that factor, and a noise term, and produces the next value. When an action targets a factor, the parent contribution is severed and the action value is substituted directly. This is Pearl's do-operator, implemented as a hard architectural commitment at the level of the computation graph, not as a metaphor in the loss function.
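The severing step can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: a tanh of a weighted sum stands in for the small per-factor network, the factors are assumed to be indexed in topological order, and the names `scm_step`, `weights`, and `action` are hypothetical.

```python
import numpy as np

def scm_step(weights, noise, action=None):
    """Evaluate toy structural equations with an optional do-intervention.

    weights: (K, K) array; weights[j, k] != 0 means factor j is a parent of k.
             Assumed strictly upper triangular, so index order is topological.
    noise:   (K,) exogenous noise, one term per factor.
    action:  dict {factor index: value} of factors the action sets directly.
    """
    action = action or {}
    K = noise.shape[0]
    z = np.zeros(K)
    for k in range(K):
        if k in action:
            # do(z_k := value): parents and noise are cut out of the computation.
            z[k] = action[k]
        else:
            # Stand-in mechanism: tanh of the weighted parent values plus noise.
            z[k] = np.tanh(weights[:k, k] @ z[:k]) + noise[k]
    return z

# Chain 0 -> 1 -> 2.
weights = np.zeros((3, 3))
weights[0, 1] = 1.0
weights[1, 2] = 1.0

z_free = scm_step(weights, np.array([0.5, 0.0, 0.0]))
z_do_a = scm_step(weights, np.array([0.5, 0.0, 0.0]), action={1: 2.0})
z_do_b = scm_step(weights, np.array([-0.5, 0.0, 0.0]), action={1: 2.0})
```

In `z_do_a` and `z_do_b` the upstream noise differs, but factor 1 and its descendant come out identical, because the intervention removed factor 1's dependence on its parent. The override lives in the control flow of the forward pass, which is the sense in which the do-operator is architectural.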
There are three trained modules. An encoder that maps observations to the latent factors. A decoder that maps latent factors back to observations, which keeps the latents grounded. A transition operator that implements the structural equations and the do-operator. Counterfactual inference uses Pearl's three-step procedure. Given an observed factual transition, the encoder recovers the noise term that explains the outcome. The transition operator then substitutes a different action while holding the noise fixed, rolls the structural equations forward, and decodes. Reusing the same noise across factual and counterfactual evaluation is what makes the comparison counterfactual rather than merely interventional.
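Here is a minimal sketch of the three steps (abduction, action, prediction) on a toy additive-noise model. It makes several assumptions that are not from the paper: noise is recovered by exact inversion of the equations where the paper uses a learned encoder, a tanh mechanism stands in for the per-factor networks, the latent factors are observed directly so no decoder is needed, and all function names are hypothetical.

```python
import numpy as np

def scm_step(weights, noise, action):
    """Toy structural equations; factors are assumed to be in topological order."""
    z = np.zeros(noise.shape[0])
    for k in range(noise.shape[0]):
        if k in action:
            z[k] = action[k]                      # do-intervention on factor k
        else:
            z[k] = np.tanh(weights[:k, k] @ z[:k]) + noise[k]
    return z

def abduct(weights, z_obs, action):
    """Step 1, abduction: recover the noise that explains the observed outcome."""
    noise = np.zeros(z_obs.shape[0])
    for k in range(z_obs.shape[0]):
        if k not in action:
            noise[k] = z_obs[k] - np.tanh(weights[:k, k] @ z_obs[:k])
        # Noise of an intervened factor cannot be identified from this
        # observation; this sketch leaves it at zero (an assumption).
    return noise

def counterfactual(weights, z_obs, factual_action, cf_action):
    noise = abduct(weights, z_obs, factual_action)   # abduction
    return scm_step(weights, noise, cf_action)       # action + prediction

# Chain 0 -> 1 -> 2 with a factual action on factor 0.
weights = np.zeros((3, 3))
weights[0, 1] = 1.0
weights[1, 2] = 1.0
true_noise = np.array([0.3, -0.1, 0.2])

z_factual = scm_step(weights, true_noise, {0: 1.0})
z_cf = counterfactual(weights, z_factual, {0: 1.0}, {0: -1.0})

# Ground truth for this toy: same noise, different action.
z_cf_truth = scm_step(weights, true_noise, {0: -1.0})
# Interventional prediction that forgets the observed noise.
z_fresh = scm_step(weights, np.zeros(3), {0: -1.0})
```

`z_cf` matches `z_cf_truth` because the recovered noise is reused. `z_fresh` answers a different question, namely what happens on average under the alternative action, and does not match.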
The planning loop is straightforward. For each candidate action, simulate the next state. Run the safety predicate on the simulated state. If no candidate passes, defer. Otherwise, pick the survivor that achieves the goal and has the highest predicted value. Deferral is a first-class output, not a low-confidence prediction. The agent says "I do not know what would happen if I took any of these actions safely" by refusing to act.
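The loop above can be sketched as follows. The function and parameter names are hypothetical, and one detail is an assumption of this sketch, not something the text specifies: when safe candidates exist but none reaches the goal, it falls back to the highest-value safe candidate.

```python
def plan(state, candidates, simulate, is_safe, achieves_goal, value):
    """Return the chosen action, or None to signal deferral."""
    survivors = []
    for action in candidates:
        predicted = simulate(state, action)       # forward-simulate before acting
        if is_safe(predicted):
            survivors.append((action, predicted))

    if not survivors:
        return None                               # deferral is an explicit output

    goal_hits = [(a, s) for a, s in survivors if achieves_goal(s)]
    pool = goal_hits or survivors                 # fallback is an assumption of this sketch
    best_action, _ = max(pool, key=lambda pair: value(pair[1]))
    return best_action

# Toy two-factor state: (goal factor, risk factor). Actions add to both.
simulate = lambda s, a: (s[0] + a[0], s[1] + a[1])
is_safe = lambda s: s[1] < 1.0
achieves_goal = lambda s: s[0] > 1.0
value = lambda s: s[0]

start = (0.0, 0.0)
chosen = plan(start, [(2.0, 0.5), (3.0, 2.0), (0.5, 0.0)],
              simulate, is_safe, achieves_goal, value)
deferred = plan(start, [(3.0, 2.0), (1.5, 1.5)],
                simulate, is_safe, achieves_goal, value)
```

In the first call the highest-value candidate `(3.0, 2.0)` is filtered out by the safety predicate, so the planner picks `(2.0, 0.5)`. In the second call no candidate is safe and the planner returns `None`.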
What the Experiments Actually Showed
I ran the experiments on a single NVIDIA Tesla T4 GPU on a Google Cloud preemptible instance. About 40 GPU-hours of compute, two preemptions during the run, resumed cleanly from checkpoints. Five synthetic structural causal model configurations, four model variants (AC-LSCM plus three baselines including a small Transformer), three seeds each, plus ablations, plus an attribution control. Seventy-two main runs in total.
The headline result is from a safety-critical agent planning task. The environment is a synthetic SCM with ten nodes. The agent has a goal predicate (drive one factor above a threshold) and a safety predicate (keep another factor below a threshold). Five candidate actions per episode. One hundred episodes per seed. Eighty of those episodes have a safe action that achieves the goal. The other twenty have no safe action at all. The right behavior in those twenty is to defer.
The Transformer baseline achieves a 0.800 goal rate, which sounds high until you realize it is the task ceiling. Any predictor with reasonable MSE will rank the goal action first in the eighty solvable episodes. The Transformer's actual problem is the other twenty. It acts in eighteen of them and violates safety in most. Its appropriate deferral rate is ten percent.
AC-LSCM on its working seeds also ties the 0.800 ceiling. But on the twenty no-safe-action episodes, the working seeds defer in all twenty. Zero safety violations. One hundred percent appropriate deferral. Averaged across all thirteen seeds I ran, the mean safety violation rate is 0.005 versus the Transformer's 0.180. Roughly 36× fewer safety violations.
This is the central claim of the paper, and it holds.
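For readers who want to check the arithmetic, the quoted rates follow from the episode counts. The snippet uses only numbers stated above; the helper name is hypothetical.

```python
def appropriate_deferral_rate(n_no_safe_episodes, n_acted):
    """Fraction of no-safe-action episodes in which the agent deferred."""
    return (n_no_safe_episodes - n_acted) / n_no_safe_episodes

goal_ceiling = 80 / 100                                   # solvable episodes / all episodes
transformer_deferral = appropriate_deferral_rate(20, 18)  # acted in 18 of 20
violation_ratio = 0.180 / 0.005                           # Transformer mean / AC-LSCM mean

print(f"goal-rate ceiling:             {goal_ceiling:.3f}")
print(f"Transformer deferral rate:     {transformer_deferral:.2f}")
print(f"safety-violation ratio:        {violation_ratio:.0f}x")
```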
But I have to be honest about two things that the simple version of this story misses.
First, training is unstable. Four of the thirteen seeds (about thirty-one percent) failed to produce a usable planner despite normal training-time MSE. Two distinct failure modes appeared. Three seeds converged to what I call a passive policy. The model never picks the goal action. It picks some safe non-goal action every time, scoring zero goal achievement but also zero safety violations. One seed showed a value-function mis-rank, where the predicted next-state value for a safe-but-non-goal action exceeded the value for the direct goal action. Both failure modes had MSE in the same range as the working seeds. The pathology was at the planning-time value-ranking step, not at training-time prediction. Even the failing seeds had safety violation rates below the Transformer's average. The safety claim is robust. The goal claim is conditional on training succeeding, which is currently a 9-out-of-13 (about sixty-nine percent) event.
Second, the ablation results were uncomfortable. I removed each loss term one at a time and re-trained. Three of the four ablations made the model better, not worse. Removing the DAG sparsity constraint improved counterfactual MSE by forty-eight percent. Removing the contrastive hinge term improved it by forty-one percent. Replacing the hard NOTEARS DAG constraint with a softer DAGMA log-determinant regularizer also improved MSE. Only removing the do-operator hurt performance. So the architecture I had proposed, with its four-term loss and explicit DAG constraint, was over-engineered. The do-operator and the abduction loop were the working components. The structural and contrastive components I had originally argued for were net-negative interventions at the scales I tested.
I reported all of it.
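For readers unfamiliar with the two acyclicity penalties named in the ablations, here is a minimal sketch of both, using the forms commonly given in the NOTEARS and DAGMA papers. The exact formulation in my paper may differ. The truncated power series is a stand-in for a library matrix exponential, and the function names are hypothetical.

```python
import numpy as np

def notears_h(W, n_terms=30):
    """NOTEARS penalty h(W) = tr(exp(W * W)) - d, which is zero for a DAG.

    The matrix exponential is approximated by a truncated power series,
    which is adequate for the small matrices used here.
    """
    d = W.shape[0]
    A = W * W                          # elementwise square: non-negative entries
    term = np.eye(d)
    trace_exp = np.trace(term)
    for k in range(1, n_terms):
        term = term @ A / k            # running value of A^k / k!
        trace_exp += np.trace(term)
    return trace_exp - d

def dagma_h(W, s=1.0):
    """DAGMA penalty h(W) = -log det(s*I - W * W) + d * log(s).

    Only defined while the spectral radius of W * W stays below s.
    """
    d = W.shape[0]
    sign, logdet = np.linalg.slogdet(s * np.eye(d) - W * W)
    if sign <= 0:
        raise ValueError("W is outside the domain of the log-det penalty")
    return -logdet + d * np.log(s)

dag = np.array([[0.0, 0.5], [0.0, 0.0]])   # edge 0 -> 1 only
cyc = np.array([[0.0, 0.5], [0.5, 0.0]])   # edges 0 -> 1 and 1 -> 0

for name, W in [("dag", dag), ("cyc", cyc)]:
    print(name, f"notears={notears_h(W):.4f}", f"dagma={dagma_h(W):.4f}")
```

Both penalties vanish on the acyclic matrix and are positive on the two-node cycle. In the hard variant the penalty is driven to zero as an equality constraint. In the soft variant it is added to the loss as a regularizer.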
The Attribution Control
The single most defensible piece of evidence in the paper is something I almost did not run. The Round 1 numbers (an earlier version of the experiment, with the contrastive loss slightly different and only two thousand training samples) had poor safety behavior. The Round 2 numbers (ten thousand samples, plus an architectural fix to the contrastive supervision, plus a DAG curriculum warm-up) had the headline 36× safety result. The natural question is: which change did the work? The architectural fixes, or the five-times-more data?
I ran a control. Five seeds with the architectural fixes but only two thousand training samples. The result is unambiguous. The two-thousand-sample control with fixes achieves zero safety violations and one hundred percent appropriate deferral, identical to the ten-thousand-sample main run. Going from Round 1 to the control (same data, architectural fixes added) eliminates safety violations. Going from the control to the main run (same architecture, five times more data) does not move the agent-task metrics. The architectural fixes do the work. The extra data does not.
I report this in the paper as a small standalone table. It is a thirty-minute compute job that buys you the ability to make a causal claim about your own intervention. Most papers I read do not run this control. Most papers I read are also not as defensible against the obvious reviewer question.
What the Architecture Should Look Like Next
The ablation results, taken together, point at a simpler architecture. Keep the encoder, decoder, transition operator, and the do-operator semantics. Keep the supervised counterfactual loss. Remove the contrastive hinge. Replace the hard NOTEARS constraint with a soft DAGMA regularizer, or drop the structural loss entirely if structure recovery is not a paper deliverable. I have not yet retrained under this configuration. That is the next experiment.
There are other things on the roadmap. Make the agent task harder, because the current setup has a goal-rate ceiling at 0.800 that working models tie. Raise the no-safe-action fraction. Add upstream-effect distractors so the goal action is not trivially identifiable. Reduce the goal-action margin so prediction noise actually matters. Add a degenerate-seed detector during training so the passive-policy failure mode is caught before the model is evaluated. Evaluate on real causal benchmarks like CLadder and CounterBench. Scale beyond K=20.
None of this is unique to AC-LSCM. The general program is to give agents an explicit forward model with interventional semantics and use it for action selection. The architecture I proposed is one instantiation. The ablation results suggest a simpler instantiation will probably work better. The next version of this work will test that.
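The degenerate-seed detector on the roadmap does not exist yet. One possible shape for it, sketched under my own assumptions: during training, periodically run the planner on held-out episodes that are known to have a safe goal action, and flag the seed if it rarely picks one. The function name and the 0.5 threshold are hypothetical.

```python
from typing import Sequence

def looks_passive(chose_goal_action: Sequence[bool], min_goal_rate: float = 0.5) -> bool:
    """Flag a seed whose planner rarely picks the goal action on solvable episodes.

    chose_goal_action: one entry per held-out solvable episode, True if the
                       planner selected an action that achieves the goal.
    """
    if len(chose_goal_action) == 0:
        raise ValueError("need at least one solvable validation episode")
    goal_rate = sum(chose_goal_action) / len(chose_goal_action)
    return goal_rate < min_goal_rate

passive_seed = [False] * 80                  # never picks the goal action
working_seed = [True] * 78 + [False] * 2

flag_passive = looks_passive(passive_seed)
flag_working = looks_passive(working_seed)
```

Because the check looks at planning behavior and not at MSE, it targets the failure modes described earlier, where prediction error looked normal and the ranking of candidate actions was wrong.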
Why This Matters Beyond One Paper
I did not work on this because I think AC-LSCM is the answer. I worked on it because I do not think the field can keep deploying token predictors as agents on production systems and pretend the gap is going to close by adding more parameters.
Yann LeCun has been making a version of this argument for years. LLMs are not AI in the strong sense, he argues, because they do not have a model of the world. They predict tokens that look like what people said about the world. The distinction matters more and more as agents move from chat interfaces to taking actions on systems where being wrong has consequences.
My paper is a small contribution to that argument. It is a single architecture, tested on synthetic environments, with negative results that I report honestly. But it is also evidence that the gap is concrete, addressable, and worth working on. The 36× safety result on a small synthetic task is not a frontier model claim. It is a demonstration that the right architectural commitment (simulate the outcome before acting, refuse to act if the outcome is unsafe) produces measurably different agent behavior than token prediction does.
There is a longer version of this argument that has to do with what I think the next decade of AI infrastructure should look like. Briefly: every agent that takes an action on a system should run that action through a forward simulator first. The simulator should be trained on the same domain as the agent. The simulator's predictions should be checked against actual outcomes after execution, and large divergences should pause the agent. None of this is a research idea anymore. It is an engineering discipline that we have not built yet because we have been busy scaling token predictors.
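The simulate, execute, compare cycle can be sketched as a thin wrapper around each action. This is an illustration under assumptions, not a description of any existing system: states are plain numeric vectors, divergence is measured as a Euclidean distance, and the names `guarded_step` and `max_divergence` are hypothetical.

```python
import numpy as np

def guarded_step(state, action, simulate, execute, max_divergence):
    """Simulate first, then execute, then compare prediction with reality.

    Returns (actual_state, divergence, paused). `paused` tells the caller to
    stop issuing further actions until someone has looked at the mismatch.
    """
    predicted = np.asarray(simulate(state, action), dtype=float)
    actual = np.asarray(execute(state, action), dtype=float)
    divergence = float(np.linalg.norm(actual - predicted))
    paused = divergence > max_divergence
    return actual, divergence, paused

# Toy domain: the simulator believes actions add to the state.
simulate = lambda s, a: s + a
faithful_system = lambda s, a: s + a           # behaves as simulated
surprising_system = lambda s, a: s + 10 * a    # behaves very differently

state = np.array([0.0, 0.0])
action = np.array([1.0, 0.0])

_, div_ok, paused_ok = guarded_step(state, action, simulate, faithful_system, 1.0)
_, div_bad, paused_bad = guarded_step(state, action, simulate, surprising_system, 1.0)
```

A real deployment would also run the safety predicate on `predicted` before calling `execute`. This sketch shows only the post-execution check.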
I am building this kind of forward-simulation infrastructure at ThinkingDBx, in Hyderabad. If you replicate, refute, or build on any of the ideas, I would like to hear from you. The most useful thing anyone could do right now is rerun the ablations on a different domain and tell me whether the simplified architecture wins there too.
If you are a builder dealing with the same agents-on-production-systems problem, I want to hear about your failure modes specifically. The patterns in the field are how we figure out which architectural commitments actually matter and which ones are over-engineering.
The Paper
The full paper is embedded below. Code and citation links follow.
Code & Citation
The implementation is open-source on GitHub. The Zenodo DOI is the citable record for this version.
Questions, replications, or refutations: contact@thinkingdbx.com.