I wanted to run a bot I could talk to by voice, entirely on my own machine, for free. That was the starting point.
No hitting external APIs. No monthly subscription. Nothing going up to the cloud. Just a single M1 Max on my desk: I speak into the mic, it gets transcribed, a local LLM thinks, and a synthesized voice answers me back. I wanted that kind of "fully local, free, owned" voice conversation bot. I named it "kit." You speak into the mic, after a while it answers in a voice, and then it automatically goes back to listening for the next thing — it's like a partner for recording a one-person show.
Why did I want a partner? Because, originally, I wanted to do a podcast. In the previous entry, "(Podcast Log #1) I Want to Try a Podcast After Turning 40," I wrote that "talking alone is hard, so I'll build the tool (a voice to talk with) first." This is entry #2, the continuation, and kit is what's inside that tool. If that piece is the "why I'm building it," this one is the "how I built it": the implementation record that pairs with it.
A normal voice chat bot has a fairly simple structure: "the person finishes speaking → transcribe → think → speak back." A single line. One straight road. But I just couldn't stand how this never ends up feeling like talking to a person. When you talk with someone, they don't sit silently and wait for you to finish. They nod while listening, they backchannel, they nudge you along when you trail off, and in their head they're already preparing a reply. It's all running at the same time. So with kit, I dropped the single line and built it as a parallel system.
This article is an honest, end-to-end write-up of what's inside that implementation. I'll give you the function names, the config values, and the numbers I actually measured, without rounding them off. And rather than only the nice parts, I'll honestly write about what — because of the physical constraint of a single GPU — "isn't fully realized yet," because that's where a technical article earns its value.
What makes it different from the usual — dropping the single line
While the person is in the middle of speaking, kit is doing several things at once.
- The ears (whisper) keep transcribing the in-progress speech with a sliding window
- The DSP keeps measuring the voice's volume (RMS) and pitch (true F0) every 80 ms
- When it finds a "phrase valley," it lays a colorless backchannel ("uh-huh," "ahh") over the voice
- As silence deepens, it gradually offers questions or new topics
- In the background, a large model is slowly simmering a deeper answer
All of this runs in parallel during the span of a single human utterance. The problem is, when you have five such "eager talkers," multiple of them try to vocalize at the same instant and collide. The voice trying to fill the silence, the backchannel laid over you, and the interrupting utterance all fight over the same beat.
So kit is designed so that the moment of vocalizing always passes through a single point: the Arbiter (control thread). Even when the five parallel systems each say "I want to speak," the one and only thing that decides "what do we do at this exact moment" is the Arbiter. The answer it returns is exactly one of {STAY_SILENT | BACKCHANNEL | FACILITATE | SPEAK}. I'll cover the details under "The heart of the design."
The big picture
Saying "it runs in parallel" out loud is easy, but the moment I actually started writing code, I kept getting lost until I drew, once and properly, where things branch and where they merge. So first, here is how the five systems spin in parallel and where they converge to a single point. It's fastest to just show you in one diagram.
                     ┌───────────────────────────────────────┐
                     │           Human (mic input)           │
                     └───────────────────┬───────────────────┘
                                         │ getUserMedia (1 stream, no double tap)
                                         ▼
       ┌─────────────── single AnalyserNode (fftSize=2048) ────────────────┐
       │                                                                   │
┌──────┴───────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────┴───────┐
│ ① Percept    │ │ ② Control    │ │ ③ Timing     │ │ ④ Gen        │ │ ⑤ Output     │
│ P layer      │ │ FSM          │ │ reflex/bc    │ │ 2-tier       │ │ O record     │
│              │ │              │ │              │ │              │ │              │
│ readRms      │ │ IDLE         │ │ bcTick 80ms  │ │ Swallow      │ │ raw mux      │
│ readF0       │ │ ACQUIRING    │ │ overlap bc   │ │ fast ~8s     │ │ AI voice     │
│ partial      │ │ LISTENING    │ │ silence nudge│ │ +            │ │ bc mix       │
│ STT          │ │ THINKING     │ │ stage1/2/3   │ │ Gemma-4      │ │ → wav        │
│ (whisper)    │ │ NUDGING      │ │ thinking fill│ │ deep         │ │ → mp4        │
│              │ │ SPEAKING     │ │              │ │ ~110-185s    │ │              │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────────────┘
       │                │                │                │
       ▼                ▼                ▼                ▼
┌─────────────────────────────────────────────────────────────────┐
│ ★ single Arbiter (control thread)   arbitrate(kind, stage)      │
│ look at muteMode → floor → governor, return 1 action            │
│ {STAY_SILENT | BACKCHANNEL | FACILITATE | SPEAK}                │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
         Server (body: HTTP I/O, serialize the single GPU)
The point is that while systems ① through ⑤ all spin in parallel, the moment a sound goes out it always passes through one point: the Arbiter. This structurally prevents the three-way collision of "I want to fill the silence," "I want to lay a backchannel over you," and "I want to interrupt and speak" all fighting over the same instant. If there's only one place where the decision is made, a collision can't happen in the first place.
Here's the hardware and the models I'm using. Since everything runs locally, this is the physical ceiling.
| Role | Implementation |
|---|---|
| Machine | M1 Max 64GB / 1 Metal GPU |
| Ears (speech→text) | whisper.cpp (whisper-cli) + ggml-small |
| Brain (fast voice LLM) | Llama-3.1-Swallow-8B-Instruct Q4_K_M |
| Brain (deep / text LLM) | Gemma-4-26B Q4_K |
| Mouth (text→speech) | AivisSpeech (VOICEVOX-compatible local HTTP) |
All numbers are real values confirmed from the actual config file (data/chat_config.json, about 400 keys) — not guesses.
The heart of the design — the single Arbiter
The most important thing in kit is this Arbiter. A function called arbitrate() is the one and only decision point: it takes the "I want to speak" requests from each parallel system and returns exactly one verdict.
Why did I insist this hard on "narrowing it to one point"? Because at first each system decided on its own whether to speak, and that made the voices collide over and over. Scatter "am I allowed to speak?" checks across the code with if-statements, and a hole always opens up somewhere. So I made up my mind: gather the decision into a single place, and just get that one place right. Here's the actual priority order.
arbitrate(request):
1. Arbiter disabled → legacy path (kept to avoid regression)
2. muteMode (the human has muted it) → SILENT (first class)
3. non-overlap and floor not open to AI → SILENT
(a backchannel doesn't take the floor, so it's exempt from the floor gate)
4. non-overlap and stage >= FACILITATE → legacy path
(staged escalation isn't cut by budget = separation of duties)
5. overlap and governor-exempt → legacy path
6. governor on and this valley's AI utterances over budget → SILENT
7. legacy path (map temperature to 1 action)
Two "separations of duties of the same shape" are at work here.
The first is floor. The conversational "floor" (the right to speak) is held by the human by default. Only when the human has been silent for a while (floor_open_silence_ms=2200 or more) does the floor open to the AI. Normally, if the floor is closed, the AI can't speak. But the backchannel alone is an exception, exempt from the floor gate. A backchannel is "a signal that may sound even while the human holds the floor, saying I'm listening." It clearly separates utterances that take the floor from backchannels that don't.
The second is the governor (valley budget). To keep the AI from rattling off fillers in the "valley" between when the human finishes a stretch of talking and starts again, there's a budget on how many times the AI can speak in one valley (governor_valley_budget=2). However, staged escalation into deep silence (stage2/3) is not counted against the budget. If you counted it, you'd get the secondary bug where "one light reflexive backchannel uses up the budget, so the genuinely needed deep prompt never fires."
And crucially, choosing silence (SILENT) is also an active decision, not a non-action. For kit, going quiet isn't "doing nothing" — it's a breath the Arbiter actively chose because "staying quiet is the right call right now."
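To make the priority order concrete, here is a minimal Python sketch of the seven steps. It is an illustration under assumptions, not kit's actual source: the `Request` and `ArbiterState` shapes, the stage numbering, the request kind labels, and the `legacy_action` stand-in are all hypothetical, and "over budget" is read as "the valley's budget is already used up." Only the ordering of the checks and the two config ideas (floor, valley budget) come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STAY_SILENT = "STAY_SILENT"
    BACKCHANNEL = "BACKCHANNEL"
    FACILITATE = "FACILITATE"
    SPEAK = "SPEAK"


# Hypothetical stage numbering; the article only says "stage >= FACILITATE".
STAGE_REFLEX, STAGE_FACILITATE, STAGE_DEEP = 1, 2, 3


@dataclass
class Request:
    kind: str                      # "backchannel", "nudge" or "speak" (assumed labels)
    stage: int = STAGE_REFLEX
    overlap: bool = False          # laid over the human's ongoing speech
    governor_exempt: bool = False


@dataclass
class ArbiterState:
    enabled: bool = True
    mute_mode: bool = False
    floor_open_to_ai: bool = False  # true once silence >= floor_open_silence_ms
    governor_on: bool = True
    valley_budget: int = 2          # governor_valley_budget
    valley_spent: int = 0           # AI utterances already made in this valley


def legacy_action(req: Request) -> Action:
    """Stand-in for the legacy path ("map temperature to 1 action").
    The real mapping is not described, so this just maps the request kind."""
    if req.kind == "backchannel":
        return Action.BACKCHANNEL
    if req.stage >= STAGE_FACILITATE:
        return Action.FACILITATE
    return Action.SPEAK


def arbitrate(req: Request, st: ArbiterState) -> Action:
    if not st.enabled:                                          # 1. arbiter disabled
        return legacy_action(req)
    if st.mute_mode:                                            # 2. human muted it
        return Action.STAY_SILENT
    if not req.overlap and not st.floor_open_to_ai:             # 3. floor gate
        return Action.STAY_SILENT
    if not req.overlap and req.stage >= STAGE_FACILITATE:       # 4. escalation skips budget
        return legacy_action(req)
    if req.overlap and req.governor_exempt:                     # 5. exempt overlap
        return legacy_action(req)
    if st.governor_on and st.valley_spent >= st.valley_budget:  # 6. valley budget used up
        return Action.STAY_SILENT
    return legacy_action(req)                                   # 7. default
```

In this sketch an overlap backchannel passes while the floor is closed, but goes silent once the valley budget is spent, whereas a stage 3 prompt still gets through with the budget exhausted. That is the same separation of duties described above.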
Backchanneling while listening — the reflex layer
A backchannel can't wait on the brain (the LLM). There's no time to call the LLM to lay one over the human mid-speech. So kit puts a "spinal reflex" layer on the front end.
A function called bcTick() runs every 80 ms (bc_tick_ms=80) and lays a colorless backchannel over the "phrase valley" while the human is mid-speech. The trick to finding the "phrase valley" is using the true F0 (pitch).
At first I was hunting for valleys using only the voice's volume (RMS), but that wasn't enough. The moment to backchannel isn't a mere dip in volume — it's the moment the voice drops toward a clause boundary (low-pitch settling). When people head into closing a sentence, their pitch naturally falls. That comes at a different timing than the volume dip. So I started measuring pitch in Hz space and checking whether it dropped below a recent baseline (f0_settle_ratio=0.92). When the pitch is uncertain (clarity below f0_clarity_min=0.45), it doesn't fabricate a sloppy value — it treats it as 0. It doesn't pretend to measure what it can't.
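For readers who want to see what "measure pitch, and return 0 when unsure" can look like, here is a small Python sketch. It is not kit's readRms/readF0 (those run in the browser against the AnalyserNode). The function names, the 70–400 Hz search range, and the choice of a normalized autocorrelation score as the "clarity" value are assumptions for illustration; only the f0_clarity_min=0.45 threshold comes from the text.

```python
import math


def read_rms(samples):
    """Root-mean-square level of one analysis frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def read_f0(samples, sample_rate, f_min=70.0, f_max=400.0, clarity_min=0.45):
    """Return (f0_hz, clarity). f0_hz is 0.0 when the pitch is too uncertain.

    Clarity here is a normalized autocorrelation score in [-1, 1]
    (an assumed definition, in the style of the McLeod pitch method).
    """
    n = len(samples)
    lag_lo = max(1, int(sample_rate / f_max))
    lag_hi = min(n - 1, int(sample_rate / f_min))
    if lag_hi <= lag_lo:
        return 0.0, 0.0

    scores = []
    for lag in range(lag_lo, lag_hi + 1):
        acf = 0.0
        energy = 0.0
        for i in range(n - lag):
            a, b = samples[i], samples[i + lag]
            acf += a * b
            energy += a * a + b * b
        scores.append(2.0 * acf / energy if energy > 0.0 else 0.0)

    best = max(scores)
    if best < clarity_min:
        # Do not fabricate a value: report "no pitch".
        return 0.0, max(best, 0.0)

    # Take the first local peak close to the best score, which avoids
    # picking a multiple of the true period (an octave error).
    k = next(i for i, v in enumerate(scores) if v >= 0.9 * best)
    while k + 1 < len(scores) and scores[k + 1] > scores[k]:
        k += 1
    return sample_rate / (lag_lo + k), scores[k]
```

A caller would keep a short history of non-zero results as the "recent baseline" and compare each new value against baseline × f0_settle_ratio.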
What's interesting is that it fires the backchannel probabilistically. Even when the conditions line up, it only fires with a probability of bc_fire_prob=0.12, and otherwise lets the moment pass. This is to avoid being a fixed metronome. If you knock out "uh-huh, uh-huh, uh-huh" at perfectly regular intervals, it feels mechanical and, ironically, like it's not listening. Backchannels come alive precisely because they arrive at unpredictable timing. On top of that, I layered a refractory window of bc_suppress_window_ms=4500 and a rate limit of at most 8 per 60 seconds, which keeps the effective rate sparse, at about 7–8 per minute. Deliberately turning the firepower down is what gives kit the posture of someone who has stepped back to listen.
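The gating described above can be sketched as a small class that is called once per tick. This is a hedged Python illustration, not the real bcTick(): the class name, method signature, and the sliding 60-second window are assumptions, and the RMS condition is left out so that only the pitch-settling check is modeled. The numeric defaults are the config values quoted in the text.

```python
import random
from collections import deque


class BackchannelGate:
    """Decides, once per tick, whether a backchannel may fire."""

    def __init__(self, settle_ratio=0.92, fire_prob=0.12,
                 suppress_window_ms=4500, max_per_minute=8, rng=None):
        self.settle_ratio = settle_ratio
        self.fire_prob = fire_prob
        self.suppress_window_ms = suppress_window_ms
        self.max_per_minute = max_per_minute
        self.rng = rng or random.Random()
        self.fired_at = deque()  # fire times (ms) within the last 60 s

    def tick(self, now_ms, f0_hz, baseline_f0_hz):
        # Forget fires that have left the 60-second window.
        while self.fired_at and now_ms - self.fired_at[0] >= 60_000:
            self.fired_at.popleft()

        if f0_hz <= 0.0 or baseline_f0_hz <= 0.0:
            return False  # pitch unknown: never guess
        if f0_hz >= baseline_f0_hz * self.settle_ratio:
            return False  # pitch has not settled below the baseline
        if self.fired_at and now_ms - self.fired_at[-1] < self.suppress_window_ms:
            return False  # still inside the refractory window
        if len(self.fired_at) >= self.max_per_minute:
            return False  # rate limit reached
        if self.rng.random() >= self.fire_prob:
            return False  # let this valley pass (no metronome)

        self.fired_at.append(now_ms)
        return True
```

Calling `tick()` every 80 ms with a permanently settled pitch shows the effect of the layering: the random draw spreads fires out, the refractory window keeps them at least 4.5 seconds apart, and the rate limit caps them at 8 per minute.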
What supports this reflex layer is an FSM (finite state machine) that holds only one state. kit's states form this single road, and transitions only ever happen through setState().
IDLE → ACQUIRING → LISTENING → THINKING → NUDGING → SPEAKING
It is never the case that two states are held at once (I forbid flag combinations like isCharging && isStunned). If the state is unique, "what happens in this state" is fully predictable, and no holes or dead zones are born. For example, the overlap backchannel only spins in LISTENING. On top of that, the playback path has a triple guard of "generation counter + watchdog," physically guaranteeing that playback happens "exactly once" even if external async work (mic acquisition, playback end, fetch resolution) cuts in.
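As a rough picture of "one state, one transition function, playback exactly once," here is a Python sketch. It is an assumption-laden illustration rather than kit's code: the class and method names are hypothetical, the counter is bumped on every transition (the article does not say when it is bumped), and the watchdog is omitted.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    ACQUIRING = auto()
    LISTENING = auto()
    THINKING = auto()
    NUDGING = auto()
    SPEAKING = auto()


class KitFsm:
    def __init__(self):
        self.state = State.IDLE
        self.generation = 0
        self.played = []

    def set_state(self, new_state):
        """The only place the state changes. Returns a token that async
        callbacks must present; any later transition invalidates it."""
        self.state = new_state
        self.generation += 1
        return self.generation

    def on_playback_ready(self, token, clip):
        """Called by async work (for example a resolved fetch).
        Plays the clip only if the token is still current."""
        if token != self.generation or self.state is not State.SPEAKING:
            return False  # stale or duplicate callback: ignore it
        self.played.append(clip)
        self.generation += 1  # consume the token so a repeat is a no-op
        return True
```

A callback that arrives twice, or arrives after the state has already moved on, simply fails the token check and does nothing.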
Turning a 100-second lag into an asset — two-tier thinking
This is kit's most intellectual gimmick.
A large local model takes time to return a deep answer. In the production environment, Gemma-4 26B takes over 100 seconds for a single turn. By any normal reasoning, this is a fatal weakness. If you stay silent for 100 seconds, the conversation dies.
So kit runs a fast shallow answer and a slow deep answer in parallel as two tiers. When the human finishes speaking, Swallow 8B first returns a "receiving beat" in about 8 seconds. An immediate "I'm listening." This keeps the conversational tempo alive. And in the background, Gemma-4 26B is simmering a deeper answer for about 110–185 seconds. When the simmering is done — and a "moment it's OK to speak" arrives (the human isn't talking, isn't trailing off, and it's an answer to the latest topic) — it slips the deep answer in.
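The shape of the two tiers can be sketched with asyncio. This is a toy model under assumptions: `fast_tier`, `deep_tier`, and `handle_turn` are hypothetical names, the sleeps stand in for the roughly 8-second and 110–185-second model calls, and `ok_to_speak` collapses the three delivery conditions (human not talking, not trailing off, answer still on the latest topic) into one callback.

```python
import asyncio


async def fast_tier(utterance: str) -> str:
    await asyncio.sleep(0.01)  # stands in for the fast model (about 8 s)
    return f"(receiving beat: {utterance})"


async def deep_tier(utterance: str) -> str:
    await asyncio.sleep(0.05)  # stands in for the deep model (110-185 s)
    return f"(deep answer: {utterance})"


async def handle_turn(utterance, ok_to_speak, say):
    # Start the slow answer first so it simmers in the background.
    deep = asyncio.create_task(deep_tier(utterance))

    # The quick reply goes out as soon as it is ready.
    say(await fast_tier(utterance))

    # The deep answer is only voiced if the moment allows it.
    answer = await deep
    if ok_to_speak():
        say(answer)
```

For example, `asyncio.run(handle_turn("hello", lambda: True, print))` prints the receiving beat first and the deep answer second; with `lambda: False` the deep answer is computed but never spoken.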
There's also an expectation-setting gimmick here. You place a "light, low-expectation reply" up front as a stepping stone. Then, when the deep answer arrives later, the upside ("oh, it was actually thinking"), which is the reward prediction error, is as large as it can be. Set expectations low, then exceed them later.
And the most important part: when a new utterance comes in, it immediately cancels the deep computation and yields the GPU. Even mid-simmer, the instant the human starts saying the next thing, deep_cancel mercilessly stops it. One of kit's North Stars is "the human always wins." No matter how good an answer the AI is in the middle of forming, the human's voice, the human's interruption, the human's new utterance is instantly prioritized over any AI utterance. I built an AI that steps back the instant the human speaks, not one that grabs the GPU and won't let go.
The worker managing the deep tier has a single slot; when a new request comes in, it bumps a generation counter and discards the old computation (it only adopts the result of the latest generation). This prevents "an answer to an old topic showing up just now."
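A single slot with a generation counter can be sketched in a few lines. The version below uses asyncio tasks purely for illustration; `DeepSlot`, `submit`, and `cancel` are hypothetical names, and the real worker may well be a thread or a child process instead.

```python
import asyncio


class DeepSlot:
    """Holds at most one deep computation; newer requests replace older ones."""

    def __init__(self):
        self.generation = 0
        self.task = None

    def _stop_current(self):
        if self.task is not None and not self.task.done():
            self.task.cancel()

    def submit(self, make_coro, on_result):
        """Start a new deep computation, discarding any older one."""
        self.generation += 1
        my_generation = self.generation
        self._stop_current()

        async def run():
            result = await make_coro()
            # Adopt the result only if no newer request arrived meanwhile.
            if my_generation == self.generation:
                on_result(result)

        self.task = asyncio.create_task(run())
        return my_generation

    def cancel(self):
        """The human started talking: drop the work and invalidate its result."""
        self.generation += 1
        self._stop_current()
```

The counter does the real work here: even if cancellation were slow or unavailable, a result from an old generation would fail the comparison and be thrown away.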
The physical wall of doing it all locally
This is the section I should write about most honestly.
The M1 Max has only one Metal GPU. The ears (whisper), the brain (LLM), and the mouth (AivisSpeech) all use this one. Which means they are mutually exclusive in time. There's no choice but to time-slice. So "listening and, at the same time, thinking deeply" — true full parallelism — is not realized, in principle. What looks parallel is actually the single GPU being switched finely; it isn't truly running simultaneously.
To make this work, access to the GPU is serialized through one lock (_GPU_SPAWN_LOCK). The main voice conversation waits until the lock is free (it waits if the deep tier is running). The silence-filling nudge is best-effort: if the GPU is busy, it immediately returns a "GPU busy" error and gracefully gives up with an empty voice (again, the human wins). The deep tier always releases the lock in a finally block no matter which exit it takes, so the GPU never gets permanently jammed.
On top of that, here are the honest limits.
| Limit | Detail |
|---|---|
| The deep tier gets canceled every turn and rarely lands | In long turns where the human keeps talking, the deep tier is discarded each time a new utterance comes in to yield the GPU, so the deep answer often doesn't complete. "Turning the 100-second lag into an asset" is a concept; in real use with frequent intermittent cancels, it still rarely lands. Tuning the content-reply (how the deep answer is delivered) is a work in progress. |
| Memory pressure makes the deep tier slow | As I'll cover below, a model that's fast in isolation slows down a lot with co-resident processes. |
| Overlap interruption only during recording | While the AI is mid-speech (SPEAKING), the analyzer for measuring the voice is torn down, so voice interruption doesn't work — only button interruption. You can interrupt by voice only within the recording window. |
The one I especially want to be honest about is the time a measurement disproved my own hypothesis.
"The deep LLM is slow because of Gemma-4's MoE architecture" — that's what I assumed at first. If the architecture is the problem, swapping to a simpler, smaller model should speed it up. But when I measured the model in isolation, Gemma-4 puts out about 28 tok/s. Plenty fast. When I re-measured in the production environment (with Swallow and AivisSpeech co-resident), it was about 4.7 tok/s. A 6x slowdown. The bottleneck wasn't the architecture — it was memory pressure from co-resident processes. Before I made the move of suspecting the architecture and downgrading the model, the measurement disproved it. That one move of killing a hypothesis with measurement saved me a pointless model swap. Don't act on a guess; measure first. That's a lesson I want to keep on the record as a technical article.
The record of beating it into shape on real hardware
kit was not built by writing a spec and implementing it all at once. I live-tested it on real hardware, pointed out "this feels off" by feel, diagnosed it, and fixed it. It grew through stacking up that loop. Here are some representative grubby fixes.
| Symptom (what I felt on real hardware) | Diagnosis | Fix |
|---|---|---|
| The show's mp4 was missing the person's raw-voice turns | During high latency, interruptions caused recording events to be dropped | On the server side, persist the raw voice to disk before the turn response, and later pick it back up using file mtime as a fallback. The human's voice always remains in the mp4. |
| Backchannels chained and became absurdly long (the owner was furious) | The thinking-filler chain was unbounded | Limited the chain to one shot, placed a silence window, and changed it to a breathing pattern of "filler → silence → one survival signal → merge with the response." |
| A 224-second monologue wouldn't end the turn | Crossing a filler mid-recording reset the turn's cap-timer origin, rewinding it | Pinned the cap origin to a turn-level origin so it stays put even across fillers. |
| The voice stops completely (permanent GPU block) | A lock contention between nudge and turn left the lock dangling when the process was killed | Always reap the child process in finally before releasing the lock. No path leaks even once. |
whisper also sometimes produced hallucinations like "memememe." The original single-character repeat check missed multi-character loops like this, so I changed it to n-gram repeat detection. To protect legitimate two-repeat backchannels like "sou-sou," it only treats something as a hallucination from three repeats up. Verification is done independently from the production build, and the results are recorded in a roughly 70KB verification report.
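The repeat rule is easy to state in code. The following Python sketch is one possible reading of it, not kit's actual filter: the function name, the maximum n-gram length of 4, and the normalization step (lowercasing and dropping everything except letters and digits) are assumptions. The three-repeat threshold is the one given in the text.

```python
def is_repeat_hallucination(text, max_ngram=4, min_repeats=3):
    """True if some unit of 1..max_ngram characters repeats
    at least min_repeats times in a row."""
    # Compare on letters and digits only, so "sou-sou" reads as "sousou".
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    for n in range(1, max_ngram + 1):
        span = n * min_repeats
        for start in range(len(s) - span + 1):
            unit = s[start:start + n]
            if s[start:start + span] == unit * min_repeats:
                return True
    return False
```

With these defaults "memememe" is rejected because "me" repeats four times, while "sou-sou" survives because its unit repeats only twice.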
Closing
kit implements "talking to an AI by voice" not as a single-line request/response, but as a system where five lines — perception, understanding, timing, generation, output — spin in parallel and a single Arbiter collapses them into one verdict. Boiled down to three technical points:
- FSM exclusive state — by holding only one state, holes and dead zones go to zero
- The single Arbiter — by passing through one point only at the moment of vocalizing, three-way collisions are structurally prevented
- Two-tier thinking — running a fast shallow answer and a slow deep answer in parallel, trying to turn the inherently unfavorable 100-second lag into an asset
And honestly: because of the physical constraint of a single GPU, "listening and thinking deeply in full parallel" is not realized yet. The deep tier tends to get canceled every turn, it's slow under memory pressure, and tuning the content-reply is a work in progress. Even so, the two North Stars — "never let the conversation die" and "the human always wins" — are upheld by a design that completes the turn even while dropping quality best-effort, and by serializing the GPU lock.
Fully local, free, owned: a single M1 Max time-slices ears, brain, and mouth while backchanneling at a near-human tempo, thinking deeply, and recording it all as a show. This was a record of building that whole thing as a "parallel system, not a single line." A lot of it is still in progress, but I tried to write up the implementation so far, end to end, honestly.


