
I made an open source tool to use on device hardware to integrate with Claude Code and start-to-finish edit a video and export it to Premiere

Reddit · Kemerd · April 22, 2026
An open-source video editing tool uses Claude Code to automatically edit raw footage by analyzing speech transcription, visual captions, and audio classification generated entirely on local hardware, then exports the results in Premiere-compatible XML format. Claude makes its edit decisions from these text descriptions rather than from the video frames themselves, which the developer reports is enough to edit four hours of 4K footage in approximately 15 minutes. The approach targets rough-cut work that typically requires hours of manual labor.

Detailed Analysis

A developer operating under the GitHub handle Kemerd has released an open-source, fully local video editing pipeline called `video-use-premiere` that integrates with Anthropic's Claude Code to automate the end-to-end editing of raw footage into Premiere Pro-native XML exports, with additional support for FCPXML compatible with DaVinci Resolve and Final Cut Pro X. The tool's core architectural decision is that Claude never directly processes video frames; instead, a multi-model preprocessing stage constructs three parallel text-based timelines — speech transcription via NVIDIA's Parakeet, visual scene descriptions via Microsoft's Florence-2 captioning model at one frame per second, and ambient audio classification via CLAP — which are semantically compressed and handed to Claude as structured text. Claude then performs edit decisions entirely from those textual representations. The developer reports processing four hours of 4K 60FPS HDR footage into a 20-minute cut in approximately 15 minutes on an RTX 5090 paired with an Intel Core i9-14900K, with the resulting edit described as roughly 90% immediately usable.
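To make the "three parallel text-based timelines" idea concrete, here is a minimal Python sketch of how speech, visual, and audio events might be interleaved into one chronological text block for a language model to read. The `Event` type, the `merge_timelines` function, and the line format are all hypothetical illustrations; the source does not describe the tool's actual data structures or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One timestamped observation on a single track (hypothetical structure)."""
    start: float  # seconds from the start of the footage
    end: float
    track: str    # "speech", "visual", or "audio"
    text: str


def fmt_time(seconds: float) -> str:
    """Format seconds as H:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


def merge_timelines(*tracks: list[Event]) -> str:
    """Interleave events from every track in chronological order as plain text."""
    events = sorted(
        (event for track in tracks for event in track),
        key=lambda e: (e.start, e.track),
    )
    return "\n".join(
        f"[{fmt_time(e.start)}-{fmt_time(e.end)}] {e.track}: {e.text}"
        for e in events
    )


# Made-up sample data standing in for real model output.
speech = [Event(2.0, 5.5, "speech", "Okay, let's run that play again.")]
visual = [Event(0.0, 4.0, "visual", "players standing on an indoor basketball court")]
audio = [Event(3.0, 4.0, "audio", "ball bounce")]

timeline_text = merge_timelines(speech, visual, audio)
print(timeline_text)
```

The point of a layout like this is that every modality ends up as short, timestamped lines of text, so an edit decision ("keep 0:00:02 to 0:00:05") can be expressed purely in terms of timestamps the model has already seen.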

The audio classification subsystem is one of the more technically novel aspects of the project. Rather than mapping audio to a fixed taxonomy of a few hundred predefined categories, the standard limitation of closed-vocabulary audio classifiers, the tool first uses Claude to generate a custom vocabulary of audio event types specific to the video's inferred context. Because the system already knows from visual captions and speech transcription that a video involves, for instance, a basketball court and athletic discussion, Claude can generate a tailored label set like "sneaker squeak," "ball bounce," and "crowd murmur." CLAP, which scores audio against arbitrary text labels, then classifies each audio segment against that set. This context-driven audio labeling loop substantially expands the system's semantic range without requiring cloud infrastructure or a pre-trained domain-specific model.
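The matching step in that loop can be sketched in a few lines of Python. CLAP-style models embed audio and text into a shared space and compare them by similarity; the sketch below assumes that setup but substitutes toy three-dimensional vectors for real embeddings, and the `classify_clip` function and its `threshold` parameter are hypothetical rather than taken from the tool.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_clip(audio_embedding, label_embeddings, threshold=0.5):
    """Return (label, score) for the closest label, or None if below threshold."""
    best_label, best_score = max(
        ((label, cosine(audio_embedding, emb)) for label, emb in label_embeddings.items()),
        key=lambda pair: pair[1],
    )
    return (best_label, best_score) if best_score >= threshold else None


# Context-specific vocabulary of the kind the article describes.
# The vectors are toy stand-ins, NOT real CLAP text embeddings.
label_embeddings = {
    "sneaker squeak": [0.9, 0.1, 0.0],
    "ball bounce": [0.1, 0.9, 0.1],
    "crowd murmur": [0.0, 0.2, 0.9],
}

# Toy stand-in for the embedding of one short audio segment.
clip_embedding = [0.2, 0.8, 0.1]

result = classify_clip(clip_embedding, label_embeddings)
print(result)
```

Because the labels are just text, swapping in a different vocabulary for a cooking video or a car review requires no retraining, only a new dictionary of label embeddings.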

The project sits within a small but active ecosystem of Claude Code-integrated video editing tools that have emerged in early 2026. Projects like ButterCut use a comparable local-processing philosophy — combining WhisperX for word-level transcription, FFmpeg for frame extraction, and Claude Code terminal skills for rough-cut generation — exporting to YAML-based sequences for Premiere, Final Cut, and Resolve. The Claude-Soundbite-Editor takes a narrower approach, accepting a pre-exported Premiere transcript JSON and returning an XML edit sequence for highlights or podcast cuts. What distinguishes `video-use-premiere` from these peers is the fusion of all three modalities — speech, vision, and audio — into a single unified preprocessing pipeline, and the semantic compression step that allows Claude to reason over hours of footage without hitting context window limits. The tool requires no API keys beyond a Claude Code subscription and is designed to run entirely on consumer GPU hardware from an RTX 3060 Ti upward.
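The source does not say how the semantic compression step works. One simple technique in that spirit, shown here purely as an assumption, is to collapse runs of identical per-second captions into time spans so that a long static shot costs one line instead of hundreds. The `collapse_captions` function is hypothetical, and exact string matching is a simplification: real captions from consecutive frames often differ slightly and would likely need fuzzy grouping.

```python
from itertools import groupby


def collapse_captions(captions: list[str]) -> list[tuple[int, int, str]]:
    """Collapse runs of identical captions into (start_sec, end_sec, text) spans.

    `captions` holds one caption per second of footage, so the caption at
    index i is taken to cover the interval [i, i + 1).
    """
    spans = []
    second = 0
    for text, run in groupby(captions):
        length = sum(1 for _ in run)
        spans.append((second, second + length, text))
        second += length
    return spans


# Made-up captions at one frame per second.
captions = ["empty court"] * 3 + ["player dribbling"] * 2 + ["empty court"]

spans = collapse_captions(captions)
for start, end, text in spans:
    print(f"{start}s-{end}s: {text}")
```

Here six caption lines shrink to three spans; on hours of footage with long static stretches, the same idea reduces the text a model has to read by a much larger factor.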

The broader significance of this release lies in what it reveals about the current capability frontier for agentic AI applied to creative production workflows. Video editing has historically resisted automation because it demands multimodal understanding — knowing not just what was said, but what was visible, what sounds were present, and how those elements combine into a coherent narrative. By decomposing that multimodal problem into text representations that a language model can reason over, the tool effectively converts a creative judgment task into a structured text-editing task, which is precisely where large language models like Claude perform most reliably. The developer's own observation that "this kind of thing would not be possible even a few months ago" reflects the rapid maturation of lightweight on-device vision and audio models — Florence-2, Parakeet, CLAP — that make local multimodal preprocessing viable on prosumer hardware for the first time, enabling Claude to serve as the reasoning layer atop a locally executed sensor stack.
