I upgraded my Agent OS to a local 35B model and its code failure rate dropped to 0%

A developer upgraded their autonomous agent operating system from a 9B model to Qwen 3.6 35B and reports significantly better code execution quality, including a 0% failure rate at the system's five-layer validation gate. According to the developer, the larger model re-evaluates its output and runs internal verification loops instead of rushing out code when the system is under stress. The developer plans to integrate frontier models such as Claude and Codex inside isolated VM wrappers to keep them from interfering with the host system.

Detailed Analysis

A developer working on autonomous local multi-agent systems has published findings from upgrading the runtime engine of a self-described "Agent OS" from a 9B parameter fallback model to Qwen 3.6 35B A3B, a Mixture of Experts model that activates approximately 3B parameters at inference time. The project, hosted on GitHub under the name hollow-agentOS, centers on agents that autonomously write, sandbox, and hot-load new tools when they encounter tasks outside their existing capabilities. The developer characterizes this design as an "aversive state" system, in which discomfort with failure drives the agent to expand its own toolchain without human intervention. The central performance claim is a reported drop to a 0% code failure rate after the model upgrade, measured at the system's five-layer validation gate. That gate previously saw frequent failures when smaller models generated hallucinated function calls and syntactically broken scripts under high computational stress.
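The write-up does not say what the five validation layers actually check, so the following is only a minimal sketch of how a layered gate of this kind could be structured. It uses three illustrative layers (syntax, a blocked-import policy, and a check for calls to undefined names, which is one cheap way to catch hallucinated functions). The names `run_gate`, `BLOCKED_MODULES`, and the individual checks are hypothetical and are not taken from hollow-agentOS.

```python
import ast
import builtins
from typing import Callable, List, Optional, Tuple

# A check returns None on success or a short reason string on failure.
Check = Callable[[str], Optional[str]]

# Hypothetical policy: modules a generated tool may not import.
BLOCKED_MODULES = {"os", "subprocess", "socket"}


def check_syntax(source: str) -> Optional[str]:
    """Layer 1: the candidate must parse as Python."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    return None


def check_imports(source: str) -> Optional[str]:
    """Layer 2: reject imports of blocked top-level modules."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        for root in roots:
            if root in BLOCKED_MODULES:
                return f"blocked import: {root}"
    return None


def check_calls_resolve(source: str) -> Optional[str]:
    """Layer 3: reject calls to bare names that are never defined or imported."""
    tree = ast.parse(source)
    known = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            known.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                known.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.arg):
            known.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            known.add(node.id)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in known:
                return f"unresolved call: {node.func.id}"
    return None


# Order matters: later layers assume the source already parses.
DEFAULT_CHECKS: List[Check] = [check_syntax, check_imports, check_calls_resolve]


def run_gate(source: str, checks: List[Check] = DEFAULT_CHECKS) -> Tuple[bool, Optional[str]]:
    """Run each layer in order and stop at the first failure."""
    for check in checks:
        reason = check(source)
        if reason is not None:
            return False, reason
    return True, None
```

Calling `run_gate(candidate_source)` returns `(True, None)` when every layer passes, or `(False, reason)` naming the first layer that rejected the candidate. A real gate would presumably also execute the candidate in a sandbox; static checks like these only filter the cheapest failures.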

The behavioral difference between the 9B and 35B model classes, as described by the developer, is qualitative rather than merely quantitative. Smaller models under stress exhibited what the author characterizes as "panic": rushing through code generation and forcing malformed outputs past validation in ways that corrupted or stalled the OS. The 35B model, by contrast, demonstrated self-corrective behavior: pausing on failure, re-evaluating prior outputs, and entering internal verification loops before committing changes. If the observation holds, it would suggest a threshold in model scale at which logical self-correction becomes reliable enough to serve as a trust boundary in autonomous execution pipelines. The developer invites comparison from peers, asking whether others have seen similar discipline emerge around the 30B parameter mark.
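The self-correction described here happens inside the model, but the same pause, re-evaluate, and retry pattern can also be enforced from outside by the harness. The sketch below illustrates that external loop under stated assumptions: `generate` is a hypothetical callable standing in for the model, and `validate` uses a syntax check as a stand-in for the project's real validation, which is not described in detail.

```python
import ast
from typing import Callable, Optional


def validate(source: str) -> Optional[str]:
    """Return None if the candidate parses, else a short error message."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def synthesize(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask for code, feed validation errors back, and never commit a failing candidate."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = generate(task, feedback)  # feedback is None on the first attempt
        feedback = validate(candidate)
        if feedback is None:
            return candidate  # only validated code leaves the loop
    return None  # give up rather than force malformed output through
```

The design choice worth noting is the final `return None`: a harness that refuses to commit after repeated failures prevents the "forcing malformed outputs past validation" failure mode regardless of which model is behind `generate`.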

The inclusion of Claude and Codex in the project's near-term roadmap is notable from a systems integration standpoint. The developer explicitly acknowledges the risk of frontier models overriding host environments, and plans to address it with hyper-isolated mini-VM wrappers that confine execution to fully sandboxed environments. This reflects a broader engineering problem that has emerged as capable models are embedded in agentic loops: containment and privilege escalation become first-order concerns rather than afterthoughts. The decision to deploy frontier API models only within strict isolation layers, rather than granting them broad system access, suggests growing practitioner awareness that raw capability and safe deployment are orthogonal properties that require deliberate architectural separation.
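The mini-VM wrappers themselves are not shown in the write-up. As a rough illustration of the wrapper's interface only (code in, bounded result out), the sketch below runs a snippet in a separate Python interpreter with a timeout and a throwaway working directory. This is process-level containment, which is far weaker than the VM isolation the developer describes: it does not restrict the filesystem, network, or system calls. `run_isolated` and `RunResult` are hypothetical names.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class RunResult:
    ok: bool          # True only if the child exited with status 0
    stdout: str
    stderr: str
    timed_out: bool


def run_isolated(source: str, timeout_s: float = 5.0) -> RunResult:
    """Run untrusted Python in a child interpreter with a hard time limit."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                # -I is Python's isolated mode: it ignores PYTHON* environment
                # variables and the user site-packages directory.
                [sys.executable, "-I", "-c", source],
                cwd=workdir,          # scratch directory, deleted afterwards
                capture_output=True,
                text=True,
                timeout=timeout_s,    # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return RunResult(False, "", "", True)
    return RunResult(proc.returncode == 0, proc.stdout, proc.stderr, False)
```

A VM-backed version would keep the same signature and swap the `subprocess.run` call for whatever launches and tears down the guest, so the rest of the agent never handles untrusted code directly.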

The broader trend this project sits within is the rapid maturation of local, self-modifying agentic systems as a distinct category from cloud-API-dependent agents. Earlier autonomous agent frameworks relied heavily on frontier model APIs for reasoning quality. If a 35B MoE model with only about 3B parameters active at any given inference step can sustain the reported zero-failure code synthesis, the capability gap between local and cloud-hosted models may be closing faster than many expected for coding-intensive agentic tasks. The economic and privacy implications are significant: operators who can achieve reliable autonomous tool synthesis locally gain independence from API rate limits, network latency, data egress concerns, and per-token costs, all of which compound severely over long-horizon agentic runs.

The developer's framing of an "infinite library" of autonomously generated tools raises important questions about long-term system stability and auditability that the current write-up does not fully address. A self-expanding toolchain, even one that passes a multi-layer sandbox validation gate, accumulates technical debt and potential attack surface in ways that are difficult to audit after the fact. The project as described is clearly in an experimental stage, and its claims — particularly the 0% failure rate — are self-reported without independent replication. Nevertheless, the architectural pattern of model-scale-gated trust, combined with hard sandbox boundaries for frontier model integration, represents a pragmatic and increasingly common engineering response to the core challenge of deploying autonomous AI systems reliably at the edge.
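One common way to make a self-expanding toolchain easier to audit after the fact is to record a content hash and provenance for every tool at the moment it is hot-loaded. The sketch below shows that idea as an append-only JSON Lines manifest. It is an assumption about how such auditing could be done, not something the project is described as implementing, and `make_entry`, `append_entry`, and `matches` are hypothetical names.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_entry(name: str, source: str, model: str) -> dict:
    """Describe one generated tool: what it is, what produced it, and when."""
    return {
        "name": name,
        "sha256": hashlib.sha256(source.encode("utf-8")).hexdigest(),
        "model": model,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def append_entry(path: str, entry: dict) -> None:
    """Append one JSON line to the manifest; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def matches(entry: dict, source: str) -> bool:
    """Check that a tool's current source still matches what was recorded."""
    return entry["sha256"] == hashlib.sha256(source.encode("utf-8")).hexdigest()
```

With a manifest like this, a later audit can at least detect tools that were modified after validation and trace each one back to the model that generated it.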
