Chain three models: Whisper (speech → text), an LLM (text → response), and a TTS model such as OpenVoice or StyleTTS (text → speech). Stream between steps for sub-second latency. Deploy as a web app with WebRTC mic access or as a mobile app via Capacitor.
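
To make the chaining concrete, here is a minimal sketch of one conversational turn in Python. The three model functions (`transcribe`, `generate_tokens`, `synthesize`) are hypothetical stubs standing in for real STT, LLM, and TTS clients; only the streaming shape is the point. Tokens are grouped into sentences so that speech synthesis can begin before the LLM has finished its full reply.

```python
import re
from typing import Iterable, Iterator

# Hypothetical stand-ins for the three models. Real STT, LLM, and TTS
# clients would replace these stubs.
def transcribe(audio_chunks: Iterable[bytes]) -> str:
    """Stub STT: pretend each audio chunk decodes to one word."""
    return " ".join(chunk.decode("utf-8") for chunk in audio_chunks)

def generate_tokens(prompt: str) -> Iterator[str]:
    """Stub LLM: stream a canned reply one token at a time."""
    reply = f"You said: {prompt}. Anything else?"
    for token in re.findall(r"\S+\s*", reply):
        yield token

def synthesize(sentence: str) -> bytes:
    """Stub TTS: return fake audio bytes for one sentence."""
    return sentence.encode("utf-8")

def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences so TTS can start early."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush a trailing fragment with no punctuation
        yield buffer.strip()

def voice_turn(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """One turn: speech -> text -> streamed reply -> audio per sentence."""
    text = transcribe(audio_chunks)
    for sentence in sentences(generate_tokens(text)):
        yield synthesize(sentence)

audio_out = list(voice_turn([b"set", b"a", b"timer"]))
```

Because `voice_turn` is a generator, the caller can start playing the first sentence's audio while later sentences are still being generated. The punctuation-based splitter is deliberately naive and would mis-split abbreviations such as "e.g."; a production system would likely need something more careful.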
MediaRecorder API. Ask AI: "Generate a React hook that captures 16 kHz mono audio from the mic and emits 100 ms chunks as WebM."

whisper.cpp compiled to WASM. Target: first partial transcript under 300 ms.

Keep max_tokens short. Voice users want brevity.

Set echoCancellation: true.

| Tool | Best For | Price |
|---|---|---|
| Whisper API | STT | $0.006/min |
| Cartesia | Low-latency TTS | $0.013/1K chars |
| StyleTTS 2 | Self-hosted TTS | Free |
| Silero VAD | End-of-speech | Free |
| LiveKit | WebRTC infra | Free tier |
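
End-of-speech detection decides when the user has stopped talking so the transcript can be handed to the LLM. The sketch below is not Silero VAD; it swaps in a much simpler energy threshold with a "hangover" of quiet frames, purely to illustrate the idea on 100 ms frames of 16 kHz mono audio. The threshold and hangover values are illustrative assumptions, not tuned numbers.

```python
from typing import Optional, Sequence

SAMPLE_RATE = 16_000                            # 16 kHz mono
FRAME_MS = 100                                  # one frame per 100 ms chunk
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # samples per frame

def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1, 1]."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def end_of_speech_frame(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.02,    # assumed energy cutoff between speech and quiet
    hangover_frames: int = 5,   # assumed: 5 quiet frames (500 ms) end the turn
) -> Optional[int]:
    """Return the index of the frame at which the utterance is considered
    finished, or None if no end of speech was detected.

    The utterance ends once speech has been heard and is then followed by
    `hangover_frames` consecutive quiet frames.
    """
    heard_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= hangover_frames:
                return i
    return None

# Synthetic input: 2 quiet frames, 3 loud frames, then 6 quiet frames.
silence = [0.0] * FRAME_SAMPLES
speech = [0.1, -0.1] * (FRAME_SAMPLES // 2)
stream = [silence] * 2 + [speech] * 3 + [silence] * 6
end_index = end_of_speech_frame(stream)
```

A shorter hangover makes the assistant respond faster but risks cutting users off mid-pause; a learned detector such as Silero VAD is generally more robust to background noise than a raw energy threshold.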
Q: Can this run fully offline on phones? Yes — whisper.cpp + small LLM (Phi-3) + on-device TTS. Quality drops, privacy wins.
Q: How do I reduce latency below 500ms? Self-host all three models on the same GPU, stream at every stage, and skip hosted APIs.
Q: What about non-English? Whisper supports 90+ languages. TTS quality varies — test Cartesia or self-hosted XTTS.
Q: Can I clone my own voice? Yes — OpenVoice and XTTS support voice cloning from 10 seconds of audio.
Q: Is WebRTC mandatory? No. Plain HTTPS + MediaRecorder works. WebRTC helps when you need low-latency, full-duplex audio.
Q: How much does ChatGPT Voice cost to replicate? Infra-wise, ~$0.02/min all-in with a well-designed architecture.
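
As a rough check on the per-minute figure, the speech portions can be estimated from the prices in the table above. This is a back-of-the-envelope sketch: the 50/50 talk split and the 900 characters per minute of spoken reply (roughly 150 words per minute) are assumptions, and the LLM cost is left as a parameter because no LLM price is given here.

```python
WHISPER_USD_PER_MIN = 0.006        # STT price from the table
CARTESIA_USD_PER_1K_CHARS = 0.013  # TTS price from the table

def cost_per_conversation_minute(
    user_talk_share: float = 0.5,      # assumed: user speaks half the time
    reply_chars_per_min: float = 900,  # assumed: ~150 spoken words per minute
    llm_usd_per_min: float = 0.0,      # not priced in the table; plug in your own
) -> float:
    """Estimate USD per minute of conversation for STT + TTS (+ optional LLM)."""
    stt = WHISPER_USD_PER_MIN * user_talk_share
    reply_chars = reply_chars_per_min * (1 - user_talk_share)
    tts = CARTESIA_USD_PER_1K_CHARS * reply_chars / 1000
    return stt + tts + llm_usd_per_min

speech_only = cost_per_conversation_minute()
```

Under these assumptions the speech-to-text and text-to-speech legs together come to a little under one cent per conversation minute, so whatever remains of a ~$0.02/min budget would have to cover the LLM and hosting.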
Voice is the next interface. Streaming at every step is the secret to feeling magical. Build one narrow voice assistant (doctor's scribe, cooking helper, language tutor) and nail the latency. Everything else follows.