CID-01 | Foundational | Position Paper
Core Thesis
As AI models become more capable, the binding constraint on domain-sufficient outcomes shifts from model capability to governing dynamics — the measurement infrastructure, agent harness, tool definitions, and environment fit.
Key Implications
- The model is not the bottleneck. The next performance gains come from the components around the model: the agent harness, tool definitions, measurement infrastructure, and environment fit.
- Clinical AI agents require careful systems engineering. Agents interacting with real patients face higher stakes than coding agents, and need an information architecture informed by domain experts as well as stronger guardrails.
- Governing dynamics is the scarce resource. Foundation models are commoditizing. The durable position belongs to whoever owns the orchestration layer.
Abstract
Scaling model capability does not automatically lead to agentic AI with domain-sufficient outcomes, meaning results reliable enough for consequential deployment in a specific field. The binding constraint is governing dynamics: the measurement infrastructure, agent harness, tool definitions, and environment within which a model operates. This paper makes two structural arguments. First, the capability-reliability gap (the divergence between average-case performance and failure detectability) intensifies as models improve: stronger models fail less often but also less visibly, and the organizational response is typically to reduce verification effort at precisely the moment it matters most. Second, achieving domain-sufficient outcomes in unforgiving environments requires stronger governing dynamics, specifically compositional, contract-bounded architectures. The independent convergence of multiple clinical AI efforts on this pattern constitutes early empirical evidence. The quality of the orchestration layer complements the model and improves the marginal return on model improvement.
1. Introduction
The next gains in domain-sufficient AI outcomes will not come from the next model generation. They will come from the orchestration infrastructure surrounding the model.
The empirical basis for optimism about scaling is real. Scaling foundation models — increasing parameters, training data, and compute — produces predictable capability gains, and qualitative jumps emerge at scale thresholds. These models are learned systems: statistical functions whose behavior is determined by parameters fitted to data. Models are only one part of modern AI agents. The agent acts in an environment, and consists of a model that is embedded in complementary software elements: an agent harness, which can update agent beliefs and supply guardrails; a structural connection to tools and an environment; and a system for measuring and validating agent success. Successful agent design requires systems engineering to unlock the value of foundation models.
The largest capability gains in practice appear not when a single model improves, but when many sufficiency thresholds are crossed together: data hygiene, measurement infrastructure, tooling, orchestration, and post-processing all improving in concert. This is criticality: the system-level state in which a conjunction of necessary conditions is simultaneously satisfied, producing threshold-like emergence of outcomes. The components are complements: improving one raises the marginal return of improving the others. Below the threshold, strengthening any single component yields diminishing returns, because the remaining unmet conditions still block the outcome. At the threshold, outcomes become structurally reachable.
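The complementarity claim can be made concrete with a toy model. This is an illustrative assumption, not the paper's own formalism: each component's sufficiency is treated as a smooth threshold function of its quality, and the system outcome as the product of those terms. The component names, threshold values, and logistic form are all hypothetical.

```python
import math

def sufficiency(quality: float, threshold: float, sharpness: float = 20.0) -> float:
    """Smooth 0-to-1 indicator of whether one component clears its threshold."""
    return 1.0 / (1.0 + math.exp(-sharpness * (quality - threshold)))

def outcome(qualities: dict, thresholds: dict) -> float:
    """Conjunction of necessary conditions, modeled as a product of sufficiencies."""
    result = 1.0
    for name, threshold in thresholds.items():
        result *= sufficiency(qualities[name], threshold)
    return result

def marginal_gain(qualities: dict, thresholds: dict, component: str, delta: float = 0.1) -> float:
    """Change in outcome from improving a single component by delta."""
    improved = dict(qualities, **{component: qualities[component] + delta})
    return outcome(improved, thresholds) - outcome(qualities, thresholds)

# Hypothetical thresholds and quality scores on an arbitrary 0-to-1 scale.
THRESHOLDS = {"model": 0.5, "harness": 0.5, "tools": 0.5, "measurement": 0.5}
weak_system = {"model": 0.5, "harness": 0.3, "tools": 0.3, "measurement": 0.3}
strong_system = {"model": 0.5, "harness": 0.7, "tools": 0.7, "measurement": 0.7}

# The same model improvement, applied inside a weak and a strong system.
gain_weak = marginal_gain(weak_system, THRESHOLDS, "model")
gain_strong = marginal_gain(strong_system, THRESHOLDS, "model")
print(f"gain in weak system:   {gain_weak:.6f}")
print(f"gain in strong system: {gain_strong:.6f}")
```

Under these assumptions the identical model improvement yields almost nothing when the surrounding components sit below their thresholds and a large gain when they sit above them. The product form is chosen only because it is the simplest function whose cross-partial derivatives are positive.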
This paper makes two contributions. Section 2 makes the case that the capability-reliability gap intensifies with model capability. Section 3 argues that domain structure in unforgiving regimes selects for compositional, contract-bounded architectures, and that independent convergence on this pattern in clinical AI constitutes early empirical evidence.
2. The Capability-Reliability Gap
2.1 Inductive Bias and Structured Failure
Every learned system generalizes by compressing training regularities into representations and applying them to novel inputs by interpolating across structurally similar cases. This requires inductive bias: built-in assumptions about which generalizations to favor when data underdetermines the answer. Without inductive bias, no learner generalizes beyond its training examples. But inductive bias means the system interpolates along the dimensions its bias treats as salient and systematically discounts the rest. This applies to any learned system — image classifiers, structured prediction models, and large language models alike, though the texture of bias differs by architecture and training regime.
More training on typical cases worsens this dynamic: as representations of typical features become more dominant, the signal from rare features is further suppressed. Geirhos et al. (2020) term this shortcut learning: the tendency to rely on statistically dominant features, causing systematic failure when those features are absent or misleading. Human expertise shows a related pattern. Grove et al. (2000), in a meta-analysis of 136 clinical and behavioral prediction studies, found that mechanical (actuarial) prediction was on average more accurate than clinical judgment, and that this advantage held regardless of the judges' amount of experience. One interpretation consistent with the shortcut-learning account is that richer experiential representations do not protect against, and may even reinforce, the discounting of atypical signal.
A learned system can therefore fail not just on out-of-distribution inputs, but on in-distribution inputs where the decisive features are ones the system has learned to discount. It can fail with confidence. This is well-documented for LLMs specifically: RLHF fine-tuning — the training stage that produces the instruction-following models widely deployed today — introduces systematic verbalized overconfidence, because the RL objective concentrates probability mass on the most likely response rather than preserving a calibrated distribution over possibilities (Leng et al., 2025). The model does not signal uncertainty; it signals certainty it has not earned.
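To illustrate what unearned certainty looks like operationally, the sketch below computes expected calibration error (ECE), a standard summary of the gap between stated confidence and observed accuracy. The data are hypothetical, and equal-width binning is one common choice rather than a method taken from the cited papers.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece

# Hypothetical outcomes: 7 of 10 answers are correct in both cases.
outcomes = [True] * 7 + [False] * 3

# A system that states 0.95 confidence on every answer.
overconfident = expected_calibration_error([0.95] * 10, outcomes)
# A system that states 0.70 confidence on every answer.
calibrated = expected_calibration_error([0.70] * 10, outcomes)
print(f"overconfident ECE: {overconfident:.2f}")
print(f"calibrated ECE:    {calibrated:.2f}")
```

Both hypothetical systems are equally accurate. Only the second one's stated confidence tells a downstream verifier anything useful about when to look more closely.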
The practical consequence is that reliability is not a fixed property of a learned system. It is a function of the match between the system's posterior distribution (developed from training data and inductive bias) and the structure of the deployment environment.
2.2 Unforgiving Environments
AI agents have made impressive advances, but this success has mainly been on a narrow set of problems. For example, Karpathy (2026) identifies coding as "the perfect first task for AI". Code is text, an LLM's natural modality. Trajectories are short: a function, a module, a pull request — not a months-long care episode. Feedback is immediate and binary; the code compiles or it does not, the tests pass or they do not. The data density is high. In such a forgiving environment, it is relatively easy to detect and reverse errors, and an agentic AI system can, through many rapid trials, often succeed.
Clinical AI agents operate in a very different, unforgiving environment. While the agent may contact the patient through text or speech, what is ultimately of consequence is the patient's health. The results of medical treatment may take months to come to fruition, and can be confounded with factors outside of the clinician's control. Training data for such AI systems can be hard to find, especially for rare medical conditions. And, most unforgiving of all, each patient has only one life to lead, and it may not be possible to reverse errors. Clinical AI deployments face potentially severe ethical and legal consequences for failure.
At Amigo, we are concerned mainly with the quality of agentic AI systems in unforgiving domains. This makes us especially attentive to how monolithic model-based systems can fail confidently or silently.
2.3 The Gap
As models grow more capable, confident failure of the kind described in Section 2.1 becomes harder to detect.
A more capable learned system has representations more finely tuned to its training distribution. Its outputs are fluent and correct on the common cases that constitute most of the deployment surface. Its remaining failures are concentrated at the distribution boundary — where inputs begin to differ from training examples in ways that matter — and those failures arrive with the same surface confidence as its successes. This is the capability-reliability gap: as average-case reliability increases, the remaining failures concentrate at the distribution boundary where they are hardest to detect. Capability improvement and failure detectability may move in opposite directions.
This gap is structural at the model level. It becomes more acute at the agent level. An agent acting over extended trajectories accumulates error across sequential decisions, each one shifting the system's state further from the admissible region. If the model has no mechanism to detect its own drift, the errors compound. In an unforgiving environment, where errors are not easily corrected and can lead to costly failures, it is essential for the AI agent to have an adaptive policy that detects and corrects deviations. For an agentic AI system using a general foundation model, this environment-specific error correction must be built into the agent harness.
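A back-of-the-envelope sketch shows why compounding matters. It rests on two simplifying assumptions that real deployments will not satisfy exactly: steps fail independently with a fixed probability, and a harness check catches a fixed fraction of errors before they propagate. The numbers are illustrative, not measured.

```python
def trajectory_success(step_accuracy: float, steps: int, catch_rate: float = 0.0) -> float:
    """Probability that no uncaught error occurs over a whole trajectory.

    Assumes independent steps and a harness that intercepts a fixed
    fraction (catch_rate) of per-step errors.
    """
    uncaught_error = (1.0 - step_accuracy) * (1.0 - catch_rate)
    return (1.0 - uncaught_error) ** steps

# A model that is right 99% of the time per step, over a 100-step trajectory.
unassisted = trajectory_success(0.99, steps=100)
# The same model with a hypothetical harness that catches 90% of step errors.
with_harness = trajectory_success(0.99, steps=100, catch_rate=0.9)

print(f"unassisted:   {unassisted:.3f}")    # roughly 0.37
print(f"with harness: {with_harness:.3f}")  # roughly 0.90
```

In this toy setting, a per-step accuracy that sounds excellent still leaves most long trajectories containing at least one uncaught error. The improvement comes entirely from the error-catching layer, with the model unchanged.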
Error may also go undetected at the organizational level. As output quality rises, each verification pass finds fewer problems, so the perceived cost of verification rises relative to its apparent benefit. Parasuraman and Manzey (2010) document this as automation bias: the tendency to reduce monitoring effort as system reliability increases. The effect intensifies in high-reliability systems where failures are rare enough that vigilance is rarely rewarded. Capability gains can therefore reduce verification effort at precisely the moment when remaining failures are most consequential and least detectable.
The governing dynamics infrastructure is what closes this loop: feedback systems sensitive to distribution-boundary failures, verification processes that do not treat output quality as a proxy for reliability, and transition constraints that catch trajectory deviation before it compounds. This is not a model improvement. It is the architecture within which any model — however capable — must operate to produce reliable agent-level outcomes.
3. Domain Structure Selects for Compositional Architecture
Monolithic policies can fail silently, which is unacceptable in unforgiving regimes. In regimes with high partial observability, long decision horizons, narrow admissible regions, and high cost of error, great care must be taken in the design of the agent harness, tools, and environment fit. This is why Amigo has developed novel compositional, contract-bounded architectures that do not depend on the model noticing its own failures. The system is decomposed into specialized components, each operating within a defined contract: a specification of what inputs it accepts, what outputs it produces, and what invariants it maintains. Transitions between components are governed by explicit rules rather than by the model's judgment alone. Constraint violations are caught by the architecture rather than by the model's self-assessment. The governing dynamics do not make the model more capable. They confine the consequences of a failure to a bounded region.
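The contract idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Amigo's implementation: every name here (`Contract`, `Component`, `execute`, `ALLOWED_TRANSITIONS`, the triage example) is hypothetical, and `stub_triage` stands in for what would be a model call in a real system.

```python
from dataclasses import dataclass
from typing import Callable

class ContractViolation(Exception):
    """Raised by the architecture, not by the model's self-assessment."""

@dataclass(frozen=True)
class Contract:
    accepts: Callable[[dict], bool]          # which inputs are allowed
    produces: Callable[[dict], bool]         # which outputs are allowed
    invariant: Callable[[dict, dict], bool]  # what must hold across the step

@dataclass(frozen=True)
class Component:
    name: str
    contract: Contract
    run: Callable[[dict], dict]

# Explicit transition rules: which component may follow which (hypothetical).
ALLOWED_TRANSITIONS = {"intake": {"triage"}, "triage": {"escalate", "advise"}}

def execute(component: Component, state: dict, previous: str = None) -> dict:
    """Run one component, enforcing transition rules and its contract."""
    if previous is not None and component.name not in ALLOWED_TRANSITIONS.get(previous, set()):
        raise ContractViolation(f"transition {previous} -> {component.name} not allowed")
    if not component.contract.accepts(state):
        raise ContractViolation(f"{component.name}: input rejected")
    output = component.run(state)
    if not component.contract.produces(output):
        raise ContractViolation(f"{component.name}: output rejected")
    if not component.contract.invariant(state, output):
        raise ContractViolation(f"{component.name}: invariant broken")
    return output

def stub_triage(state: dict) -> dict:
    # Stand-in for a model call; the rule below is purely illustrative.
    urgency = "high" if "chest pain" in state["symptoms"] else "low"
    return {**state, "urgency": urgency}

triage = Component(
    name="triage",
    contract=Contract(
        accepts=lambda s: "patient_id" in s and "symptoms" in s,
        produces=lambda o: o.get("urgency") in {"low", "high"},
        invariant=lambda s, o: o.get("patient_id") == s["patient_id"],
    ),
    run=stub_triage,
)

result = execute(triage, {"patient_id": "p1", "symptoms": ["chest pain"]}, previous="intake")
print(result["urgency"])
```

The design choice worth noting is that every check lives in `execute`, outside the component. A component that returns a malformed or out-of-range result is stopped at its boundary whether or not it reports any uncertainty.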
Other substantial clinical AI efforts, developed without a shared framework, arrived at the same structural pattern. Nori et al. (2025) built MAI-DxO not by scaling a single generalist, but by orchestrating specialized virtual physician roles — diagnostician, skeptic, cost monitor, safety checker — and found that this orchestration layer improved diagnostic accuracy across every underlying foundation model tested, regardless of its capability tier. The implication is that the orchestration layer is doing independent work, not merely packaging model outputs. Moritz, Topol, and Rajpurkar (2025), in a survey of clinical AI limitations, conclude that coordinated networks of specialized agents are the future of AI in healthcare.
These empirical findings support our central thesis, though as early evidence rather than confirmation. In unforgiving regimes, the architecture question resolves toward closed-loop, contract-bounded composition because high partial observability, long horizons, narrow admissible regions, and high error cost make monolithic policies unable to maintain trajectory admissibility.
This carries a direct strategic implication. Foundation models are increasingly commodities. The durable competitive position belongs to whoever owns the orchestration infrastructure: the governing dynamics that define what the model optimizes, close the loop when it drifts, and maintain auditability across the full trajectory. As models strengthen, this layer becomes more binding, not less.
Glossary
| Term | Definition |
|---|---|
| Admissible region | The set of states and actions that remain acceptable in a given domain. Leaving this region constitutes a failure. |
| Automation bias | The tendency of users to reduce monitoring effort as system reliability increases, even when remaining failures are consequential. |
| Capability-reliability gap | The structural divergence between a system's average-case performance and the detectability of its remaining failures. As capability improves, remaining failures concentrate at the distribution boundary where they are hardest to detect. |
| Closed loop | A system architecture in which outputs are continuously checked against independent signals, enabling detection and correction of errors. |
| Complementarity | A property of a system in which the value of each component depends on the quality of the others. Improving one component raises the marginal return of improving others. |
| Compositional, contract-bounded architecture | A system design in which specialized components operate within defined contracts and transitions between them are governed by explicit rules, rather than relying on a single model's judgment. |
| Contract | A specification of what inputs a component accepts, what outputs it produces, and what invariants it maintains. |
| Criticality | The system-level state in which a conjunction of necessary conditions is simultaneously satisfied, producing threshold-like emergence of outcomes. |
| Distribution boundary | The region where inputs begin to differ from training examples in ways that affect the system's reliability. |
| Domain-sufficient outcomes | Results reliable enough for consequential deployment in a specific field. |
| Governing dynamics | The full set of structures that couple a model to its operating environment: measurement infrastructure, agent harness, tool definitions, and environment fit, together with the feedback loops, transition constraints, and outcome specifications they implement. |
| Inductive bias | The assumptions built into a learning system's architecture or training setup that determine which generalizations it favors. |
| Miscalibration | A condition in which a system's stated confidence scores do not track its actual correctness likelihood. |
| Monolithic policy | A single model mapping beliefs to actions without external structural constraints. |
| Out-of-distribution inputs | Inputs that differ meaningfully from the examples a system was trained on. |
| Partial observability | The condition in which the true state of the world is never fully visible to the system; the system must act on incomplete information. |
| Shortcut learning | The tendency of learned systems to rely on the most statistically dominant features in the training distribution, causing systematic failure when those features are absent or misleading. |
| Trajectory deviation | The gradual drift of a sequence of decisions away from the intended path. |
| Unforgiving regime | An environment where decisive features may be rare, errors are costly or irreversible, and the system's learned representations may not match the structure of the problem. |
References
Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2, 665–673.
Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). Clinical versus mechanical prediction: A meta-analysis. Psychological Assessment, 12(1), 19–30.
Karpathy, A. (2026). Interview.
Leng, J., Huang, C., Zhu, B., & Huang, J. (2025). Taming overconfidence in LLMs: reward calibration in RLHF. International Conference on Learning Representations (ICLR) 2025.
Moritz, M., Topol, E., & Rajpurkar, P. (2025). Coordinated AI agents for advancing healthcare. Nature Biomedical Engineering, 9, 432–438.
Nori, H., et al. (2025). Sequential diagnosis with language models. Microsoft Research.
Parasuraman, R., & Manzey, D. H. (2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52(3), 381–410.