Built a Claude voice assistant for the Cardputer ADV — press OK, talk, it talks back

A developer built a pocket voice assistant for the M5Stack Cardputer ADV that uses Whisper for speech-to-text, Claude for answering queries, and OpenAI TTS for spoken responses, with every API call made directly and no cloud middleware in between. The device includes seven applications: the voice assistant with context persistence, a voice translator, weather forecast, internet radio, snake game, a Bluetooth companion app, and settings. The primary technical challenge was fitting the system into approximately 90 KB of available heap, which the developer addressed by writing captured audio directly to flash storage during recording. TTS responses are downloaded in full before playback.

Detailed Analysis

A developer has created a fully functional pocket voice assistant running on the M5Stack Cardputer ADV, a compact handheld device, integrating Anthropic's Claude as its core conversational intelligence. The system operates through a three-stage pipeline: OpenAI's Whisper handles speech-to-text transcription, Claude processes the query and generates a response, and OpenAI's text-to-speech converts that response to audio output. The project, published to GitHub under the handle Nachtfux, requires no cloud middleware beyond direct API calls made with user-supplied keys, and maintains a rolling 10-message session context that allows coherent multi-turn conversations. The device also ships as a small launcher ecosystem with seven applications including a voice translator, a weather app, internet radio, and a Snake game.
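
The rolling 10-message context can be pictured as a fixed-size window over the conversation. The following is a minimal host-side sketch in Python, not the project's firmware code; the `RollingContext` class and its method names are hypothetical, and only the window size of 10 comes from the description above.

```python
from collections import deque

MAX_MESSAGES = 10  # window size taken from the description above


class RollingContext:
    """Keeps only the most recent messages of a conversation (hypothetical helper)."""

    def __init__(self, max_messages=MAX_MESSAGES):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._messages = deque(maxlen=max_messages)

    def add(self, role, content):
        self._messages.append({"role": role, "content": content})

    def as_list(self):
        """Return the retained messages, oldest first."""
        return list(self._messages)


# Seven question/answer turns produce 14 messages; only the last 10 survive.
ctx = RollingContext()
for turn in range(1, 8):
    ctx.add("user", f"question {turn}")
    ctx.add("assistant", f"answer {turn}")
```

With an even window size and strictly alternating turns, the oldest retained message is always a user message, so the window never starts with a reply that has lost its question.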

The most technically significant aspect of the project is the engineering required to make it run on the StampS3A microcontroller, which lacks PSRAM and operates with only 320 KB of SRAM. After Wi-Fi, the TLS stack, and the display claimed their share, only about 90 KB of heap remained, a margin that barely accommodated a TLS handshake. The developer resolved the central memory bottleneck by keeping recorded audio out of the heap: microphone PCM data is written directly to the device's LittleFS flash filesystem during recording, then streamed from flash to the Whisper API in 1 KB chunks. Because only one small chunk needs to be resident at a time, the upload completes reliably and recording length is no longer capped by available RAM.
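
The chunked upload described above can be illustrated with a short host-side Python sketch. It is an assumption-laden stand-in rather than the firmware: a temporary file plays the role of the recording on LittleFS, and `send` is a hypothetical callback standing in for whatever writes bytes to the TLS connection. Only the 1 KB chunk size comes from the article.

```python
import os
import tempfile

CHUNK_SIZE = 1024  # 1 KB, as described in the article


def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield a file's contents in pieces of at most chunk_size bytes."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def upload(path, send):
    """Pass each chunk to send(); return (total bytes sent, largest chunk size)."""
    total = 0
    largest = 0
    for chunk in iter_chunks(path):
        send(chunk)  # hypothetical: on the device this would be a socket write
        total += len(chunk)
        largest = max(largest, len(chunk))
    return total, largest


# Stand-in for a recording saved to flash: 5000 bytes of silent PCM.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(5000))
    recording_path = tmp.name

received = bytearray()
total_sent, largest_chunk = upload(recording_path, received.extend)
os.remove(recording_path)
```

However long the recording is, the sender holds at most one `CHUNK_SIZE` buffer at a time, which is the property that matters when the heap is this small.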

Several additional low-level obstacles shaped the final implementation. The mbedTLS library stalled when handling TLS 1.3 post-handshake records during streaming playback, which forced the developer to download each TTS response in full before playing it rather than streaming it incrementally. A separate artifact, an audible echo, was traced to the `playRaw` function storing a pointer to the audio data rather than copying it, so the samples were corrupted whenever the buffer was reused before playback finished; a rotating buffer pool corrected this. Both fixes show how embedded development forces assumptions about memory ownership and protocol behavior to be rechecked at every layer.
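
The echo bug and its fix can be reproduced in miniature. The Python sketch below is a simulation under stated assumptions, not the project's code: `FakePlayer` is a hypothetical stand-in that, like the `playRaw` behaviour described above, keeps a reference to the caller's buffer instead of copying it, and `BufferPool` is one plausible shape for a rotating buffer pool.

```python
class FakePlayer:
    """Hypothetical audio queue that stores a view of the caller's buffer, not a copy."""

    def __init__(self):
        self.queued = []

    def play_raw(self, buf, length):
        # A memoryview aliases buf: later writes to buf change queued audio too.
        self.queued.append(memoryview(buf)[:length])

    def drain(self):
        """Return what would actually be played, then empty the queue."""
        out = [bytes(view) for view in self.queued]
        self.queued.clear()
        return out


class BufferPool:
    """Hands out a fixed set of buffers in round-robin order."""

    def __init__(self, count, size):
        self._buffers = [bytearray(size) for _ in range(count)]
        self._next = 0

    def acquire(self):
        buf = self._buffers[self._next]
        self._next = (self._next + 1) % len(self._buffers)
        return buf


def feed(player, chunks, get_buffer):
    """Copy each chunk into a buffer from get_buffer() and queue it."""
    for chunk in chunks:
        buf = get_buffer()
        buf[:len(chunk)] = chunk
        player.play_raw(buf, len(chunk))


chunks = [b"AAAA", b"BBBB", b"CCCC"]

# One shared buffer: every queued entry aliases the same memory.
single = bytearray(4)
naive = FakePlayer()
feed(naive, chunks, lambda: single)
naive_out = naive.drain()

# Rotating pool: each in-flight chunk keeps its own buffer.
pool = BufferPool(count=3, size=4)
pooled = FakePlayer()
feed(pooled, chunks, pool.acquire)
pooled_out = pooled.drain()
```

In this simulation the shared-buffer run plays the last chunk three times, while the pooled run plays the chunks as queued. The pool only helps if it holds at least as many buffers as there are chunks in flight; the right count for the real device is not stated in the article.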

The project sits within a growing maker-community trend of pairing frontier AI models with severely resource-constrained hardware, a space that has accelerated as API-accessible models like Claude have lowered the barrier to adding large language model capabilities without on-device inference. Rather than running a local model, which remains impractical on microcontrollers with sub-megabyte RAM, the architecture offloads all intelligence to cloud APIs while keeping the device logic lean. This approach trades latency and offline capability for access to a far more capable model than anything that could run locally, a tradeoff increasingly common in hobbyist and embedded IoT contexts. The open invitation for collaboration on the BLE GATT pairing component suggests the project is positioned as a community platform rather than a finished product.

The work also illustrates how Claude is penetrating non-commercial, grassroots hardware contexts beyond the enterprise and developer-tool deployments that typically dominate AI coverage. Enthusiast projects like this one serve as informal proof-of-concept demonstrations that voice-driven AI interaction is achievable on commodity embedded hardware with modest budgets, potentially informing future low-cost consumer devices. The complete debug history preserved in the commit log — which the developer describes as "the more interesting half of the project" — offers an unusually transparent record of the failure modes encountered when pushing TLS-dependent AI APIs onto microcontrollers, a practical resource for others attempting similar integrations.
