Show HN: Gonfire – Assess how well candidates steer AI coding agents

Technical assessments for AI Engineer positions have shifted from traditional LeetCode-style challenges to take-home or live case studies where AI use is encouraged. These formats still evaluate candidates' problem-solving ability poorly, and they cost interviewers significant time. Gonfire addresses both problems by acting as a proxy that records and analyzes a candidate's Claude Code interactions during an assessment, then generates a digestible report for hiring managers from session logs that would otherwise go unused.

Detailed Analysis

Gonfire, a new hiring-assessment tool built by a recent software engineering graduate, represents an early commercial attempt to solve a structural problem emerging in technical recruiting: as AI coding agents become standard in professional software development, the industry's inherited interview formats have failed to keep pace. The founder, who graduated from a CS program in 2020, observed during his own job search that startups conducting "AI Engineer" interviews had largely abandoned timed LeetCode-style evaluations in favor of open-ended case studies where AI tool use is explicitly permitted. The resulting assessments, however, arrived in two formats — take-home assignments and live screenshare sessions — both of which the founder argues are fundamentally broken in different ways.

The core critique Gonfire advances is that current AI-friendly assessments discard the most diagnostic signal available. A take-home submission produces a codebase that is, in practice, entirely AI-generated, leaving hiring managers with no reliable way to distinguish a strong engineering thinker from someone who simply prompted their way to a passing result. Live assessments, by contrast, take an hour of senior staff time per candidate screened, often the CTO's at an early-stage startup, which is unsustainable at any meaningful hiring volume. Gonfire's proposed solution is to instrument the assessment itself: the tool operates as a proxy layer that intercepts and records a candidate's Claude Code session in real time, then analyzes the interaction log and surfaces a structured report to the hiring manager. The premise is that *how* a candidate directs an AI agent (which prompts they issue, how they decompose problems, where they intervene and where they defer) is far more revealing than the finished artifact the agent produces.
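To make the idea of analyzing an interaction log concrete, here is a minimal Python sketch. It is not Gonfire's implementation, and nothing in it reflects Claude Code's actual log format: the `Event` record, its `actor` and `kind` labels, the `summarize` function, and the definition of an "intervention" (a candidate prompt that directly follows an agent tool call) are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One entry in a hypothetical recorded session log."""
    actor: str  # "candidate" or "agent" (hypothetical labels)
    kind: str   # "prompt", "tool_call", or "reply" (hypothetical labels)
    text: str


def summarize(events):
    """Reduce a session log to a few coarse signals about how it was steered."""
    counts = Counter((e.actor, e.kind) for e in events)
    prompts = [
        e.text for e in events
        if e.actor == "candidate" and e.kind == "prompt"
    ]
    # Assumed definition: an intervention is a candidate prompt that
    # immediately follows an agent tool call.
    interventions = sum(
        1
        for prev, cur in zip(events, events[1:])
        if prev.actor == "agent" and prev.kind == "tool_call"
        and cur.actor == "candidate" and cur.kind == "prompt"
    )
    avg_words = (
        sum(len(p.split()) for p in prompts) / len(prompts) if prompts else 0.0
    )
    return {
        "prompts": len(prompts),
        "tool_calls": counts[("agent", "tool_call")],
        "interventions": interventions,
        "avg_prompt_words": avg_words,
    }


# Invented sample session, for illustration only.
session = [
    Event("candidate", "prompt", "Write a failing test for the parser first"),
    Event("agent", "tool_call", "write tests/test_parser.py"),
    Event("agent", "tool_call", "run pytest"),
    Event("candidate", "prompt", "Stop, the fixture path is wrong"),
    Event("agent", "reply", "Fixed the fixture path"),
]
report = summarize(session)
print(report)
```

Counts like these are only a starting point. Judging whether a given intervention was well reasoned would presumably require a human reviewer or a more sophisticated analysis of the prompt text itself.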

The technical approach is notable for what it reveals about the current AI tooling landscape. Gonfire is built specifically around Claude Code, Anthropic's terminal-based agentic coding tool, rather than a generic interface. This matters because Claude Code, unlike simple chat-based coding assistants, operates with substantial autonomy, executing shell commands, reading and writing files, and managing multi-step workflows, so the candidate's role shifts from author to director. That directorial competence is precisely the skill most relevant to engineering roles where AI agents are embedded in the development loop, and no standardized rubric for assessing it yet exists. Gonfire is essentially proposing one, even if its current form is more observational than scored.

The broader trend this tool exemplifies is the market's scramble to define what "AI engineering" competence actually means. Anthropic itself has published its thinking on AI-resistant technical evaluations, and the gap between that conversation and commercially deployable hiring infrastructure remains wide. Gonfire is entering that gap early, betting that the interaction log, meaning the sequence of decisions a human makes while collaborating with an AI system, will become the primary unit of evaluation in technical hiring, displacing the static code artifact that LeetCode and take-home assignments treat as the final output. Whether companies will trust an automated analysis of those logs, or will want human review of the raw session data, remains an open design question the tool will need to resolve as it scales.
