December 18, 2025

Adding Me/Them Speaker Detection to an Open-Source Meeting Transcriber

How I contributed dual-VAD speaker attribution to Meetily — using separate microphone and system audio streams to figure out who's talking without any cloud services.

audio rust open-source ai-development

Most meeting transcription tools send your audio to a cloud service that handles speaker diarization — figuring out who said what. That works, but it means your meeting audio leaves your device. Meetily takes a different approach: everything runs locally. Whisper for transcription, Silero for voice activity detection, all processing on your own hardware.

I wanted to add speaker attribution — labeling each transcript segment as either “Me” or “Them” — without breaking the local-only constraint. No cloud API calls, no speaker enrollment, no voice fingerprinting.

The insight

The trick is that your computer already knows who’s talking, if you pay attention to where the audio comes from. Your microphone captures you. Your system audio (what comes through the speakers/headphones) captures everyone else on the call. If you process these streams independently instead of mixing them first, speaker identity falls out naturally.

This is simpler than traditional speaker diarization, which tries to cluster audio segments by voice characteristics. It doesn’t distinguish between individual remote participants — they’re all “Them” — but for most meeting notes that’s exactly what you want. You care about what you said versus what others said.

The implementation

Meetily already captured both microphone and system audio, but it mixed them into a single stream before running VAD and transcription. The change was to keep them separate through the VAD pipeline.

Dual VAD architecture

Each audio source gets its own Silero VAD processor. The microphone stream flows through one VAD instance, system audio through another. When either detects speech, it produces a segment tagged with its source:

  • Microphone speech → “Me”
  • System audio speech → “Them”

The VAD configuration matters more than you’d think. Silero outputs a per-frame speech probability, and the pipeline applies two thresholds to it: a positive threshold (0.50 — how confident it needs to be that speech has started) and a negative threshold (0.35 — how confident it needs to be that speech has stopped). The gap between the two is deliberate hysteresis: a single cutoff would flicker between speech and silence whenever the probability hovers near it. There’s also a 2-second redemption window that bridges natural pauses. Without that, a sentence with a brief pause in the middle gets split into two fragments.
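The two-threshold gate plus redemption window can be sketched as a small state machine. This is an illustrative reconstruction, not Meetily’s actual code — `VadGate`, `VadEvent`, and `push` are made-up names, and the redemption window is expressed in frames rather than seconds:

```rust
// Two-threshold ("hysteresis") speech gate with a redemption window.
// Illustrative sketch; names and frame-based timing are assumptions.
enum VadEvent {
    SpeechStart,
    SpeechEnd,
    None,
}

struct VadGate {
    pos_threshold: f32,     // start speech when probability >= this (0.50)
    neg_threshold: f32,     // count silence when probability < this (0.35)
    redemption_frames: u32, // frames of silence tolerated before ending (~2 s)
    in_speech: bool,
    silence_run: u32,
}

impl VadGate {
    fn new(pos: f32, neg: f32, redemption_frames: u32) -> Self {
        Self {
            pos_threshold: pos,
            neg_threshold: neg,
            redemption_frames,
            in_speech: false,
            silence_run: 0,
        }
    }

    // Feed one per-frame speech probability (what Silero emits).
    fn push(&mut self, prob: f32) -> VadEvent {
        if !self.in_speech {
            if prob >= self.pos_threshold {
                self.in_speech = true;
                self.silence_run = 0;
                return VadEvent::SpeechStart;
            }
        } else if prob < self.neg_threshold {
            self.silence_run += 1;
            // Only end the segment once silence outlasts the redemption window.
            if self.silence_run > self.redemption_frames {
                self.in_speech = false;
                return VadEvent::SpeechEnd;
            }
        } else {
            // Speech resumed within the window: the pause is bridged.
            self.silence_run = 0;
        }
        VadEvent::None
    }
}
```

Probabilities between 0.35 and 0.50 keep an active segment alive without being enough to start one, which is what prevents the flickering.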

Each segment carries 300ms of pre-speech padding and 400ms of post-speech context. This sounds minor, but clipping the start of a word makes transcription noticeably worse — Whisper needs to hear the onset of phonemes.
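In sample terms, the padding is just an expansion of the detected boundaries, clamped to the buffer. A minimal sketch, assuming 16 kHz mono audio (the usual Whisper input rate — an assumption, not something the post states):

```rust
// Expand a detected speech range by pre/post padding, clamped to the buffer.
// 300 ms before onset, 400 ms after offset, at an assumed 16 kHz sample rate.
const SAMPLE_RATE: usize = 16_000;
const PRE_PAD: usize = SAMPLE_RATE * 300 / 1000; // 4_800 samples
const POST_PAD: usize = SAMPLE_RATE * 400 / 1000; // 6_400 samples

fn padded_range(start: usize, end: usize, buf_len: usize) -> (usize, usize) {
    let s = start.saturating_sub(PRE_PAD); // don't underflow at buffer start
    let e = (end + POST_PAD).min(buf_len); // don't run past buffer end
    (s, e)
}
```

The pre-padding is what preserves phoneme onsets: the VAD typically fires a few frames into a word, so without it the first consonant gets clipped.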

The audio pipeline changes

The ring buffer mixer already existed for combining mic and system audio. I modified it to route each source through its own VAD processor before mixing. The mixed audio still goes to Whisper for transcription (Whisper needs to hear everything for context), but the VAD decisions — when speech starts/stops and from which source — happen per-stream.

This reduced the audio sent to Whisper by roughly 70%. Silence gets filtered out by the VAD, so Whisper only processes segments where someone is actually talking.
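The routing shape is the core of the change: per-source VAD decisions, one mixed stream for transcription. Here is a toy sketch of that split, with a trivial energy gate standing in for Silero — `Speaker`, `vad_is_speech`, and `route` are illustrative names, not Meetily’s actual API:

```rust
// Dual-path routing sketch: each source gets its own speech decision,
// while Whisper still receives the mixed audio for context.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Speaker {
    Me,   // microphone stream
    Them, // system audio stream
}

// Toy stand-in for a per-stream VAD decision (Silero in the real pipeline).
fn vad_is_speech(frame: &[f32]) -> bool {
    let energy: f32 = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;
    energy > 1e-4
}

// Route one frame pair: tag active sources, mix the audio for transcription.
fn route(mic: &[f32], sys: &[f32]) -> (Vec<Speaker>, Vec<f32>) {
    let mut active = Vec::new();
    if vad_is_speech(mic) {
        active.push(Speaker::Me);
    }
    if vad_is_speech(sys) {
        active.push(Speaker::Them);
    }
    let mixed = mic.iter().zip(sys).map(|(a, b)| (a + b) * 0.5).collect();
    (active, mixed)
}
```

Frames where `active` is empty are the ones the VAD filters out, which is where the reduction in audio sent to Whisper comes from.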

Database and frontend

Added a speaker column to the transcripts table via a migration. Each transcript segment now stores whether it came from “Me” or “Them”. The frontend renders these with color-coded badges — blue for “Me”, green for “Them” — making it easy to scan a transcript for your own contributions or for what the other party said.
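The stored value is just a label, so the frontend mapping is a lookup. A hypothetical sketch of that mapping — the column values come from the post, but the function and color strings are assumptions about the UI, not the actual code:

```rust
// Hypothetical mapping from the stored speaker label to a badge color.
// "Me" and "Them" are the values the speaker column holds.
fn badge_color(speaker: &str) -> &'static str {
    match speaker {
        "Me" => "blue",
        _ => "green", // "Them", and a safe default for anything unexpected
    }
}
```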

Where it breaks down

The approach has real limitations:

If you’re in a room with other people and one laptop is running Meetily, everyone in the room shows up as “Me” because they’re all captured by the microphone. It only works cleanly for remote calls where you’re the only local participant.

Crosstalk — when you talk over someone on the call — produces overlapping segments from both VAD processors. The transcription handles this OK (Whisper hears the mixed audio), but the attribution can get confused about which segment belongs to which speaker.

System audio capture is platform-specific and sometimes fragile. macOS uses ScreenCaptureKit, Windows uses WASAPI loopback, Linux uses PulseAudio/ALSA. Each has its own quirks around permissions and device discovery.

And obviously, it can’t distinguish between multiple remote participants. If you need “Sarah said X, then Tom replied Y,” you need actual speaker diarization with voice embeddings. That’s a different (harder) problem.

Why this matters for the project

Meetily is privacy-first, which means every feature has to work without phoning home. Traditional diarization typically requires a cloud service or a large model that’s impractical to run locally in real-time alongside Whisper. The dual-VAD approach sidesteps this entirely — no model, no training, no enrollment. Just paying attention to which input device the audio came from.

It’s not as capable as cloud diarization, but it covers the most common use case (one person on a remote call) with zero additional resource cost and zero privacy compromise.


Contributed to Meetily, an open-source, privacy-first meeting assistant.