Voice Commerce: The End of the Keyboard
Typing is friction. Speaking is natural. How to build Voice interfaces using Whisper (STT) and ElevenLabs (TTS) to allow users to shop hands-free.
The History of Input: From Punch Cards to Voice
The QWERTY keyboard was invented in 1873, its layout arranged to keep the arms of mechanical typewriters from jamming. 150 years later, we are still using it. We are tapping on glass screens, using our thumbs to hit tiny virtual keys, battling autocorrect, and dealing with “Fat Finger” errors. It is absurd. Typing is High Friction. It requires visual attention (“look at the keys”) and dexterity (“hit the right key”). Speaking is Zero Friction. It requires no hands and no eyes. Humans speak at roughly 150 words per minute; on mobile, they type around 40. Voice Commerce is the transition from Graphical User Interfaces (GUI) to Conversational User Interfaces (CUI). It is the move from “Command Line” to “Natural Language”.
Why Maison Code Discusses This
At Maison Code, we serve the “Luxury of Time”. Our clients (High Net Worth Individuals) are busy. They are driving. They are holding a baby. They are cooking. They don’t have time to browse 50 pages of filters on a small screen. They want to say: “Send a gift to my mother for her birthday, budget $500, something floral.” And they want it done. We build Voice-First experiences that act as digital concierges. We use the latest AI models to ensure the system understands not just the words, but the Intent.
The Technology Stack (The Modern Voice Pipeline)
For a long time, voice (Siri, Alexa) was bad. It didn’t understand accents (“I’m sorry, I didn’t catch that”). It was rigid. It was a decision tree, not AI. In 2024, the stack matured significantly. We can now build human-level voice interactions. The pipeline consists of three stages: Ear -> Brain -> Mouth.
1. The Ear: Speech-to-Text (STT)
This converts audio waves into text. The Leader: OpenAI Whisper. It is a transformer model trained on 680,000 hours of multilingual data. It handles accents, background noise (Starbucks ambiance), and technical jargon remarkably well.
- Latency: ~300ms (Turbo model).
- API: POST /audio/transcriptions.
- Innovation: It understands “Ums” and “Ahs” and filters them out of the transcript.
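For the curious, here is roughly what that call looks like from a client. A minimal sketch: the key handling is simplified for illustration, and in production you would proxy this request through your own backend.

```typescript
// Minimal sketch: send a recorded audio blob to OpenAI's hosted Whisper model.
// OPENAI_API_KEY is assumed to be available; never ship it to the browser in production.
declare const OPENAI_API_KEY: string;

async function transcribe(audio: Blob): Promise<string> {
  const form = new FormData();
  form.append("file", audio, "speech.webm");
  form.append("model", "whisper-1"); // hosted Whisper model

  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${OPENAI_API_KEY}` },
    body: form,
  });
  const { text } = await res.json();
  return text; // e.g. "I need a gift for my wife…"
}
```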
2. The Brain: Large Language Model (LLM)
This processes the text and decides what to say. The Leader: GPT-4o or Claude 3.5. Voice requires high intelligence to understand context (“I want that one but in red”). Standard chatbots fail here. You need models that understand Intent and Nuance.
- Latency: ~500ms (First token).
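Streaming matters here (more on latency below): you want the first tokens out as fast as possible. A minimal sketch of the “Brain” step using the OpenAI Node SDK, with a hypothetical concierge persona as the system prompt:

```typescript
import OpenAI from "openai";

// Minimal sketch: stream the LLM reply token by token so the TTS stage can
// start speaking before the full answer is generated.
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function* think(userUtterance: string): AsyncGenerator<string> {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    stream: true,
    messages: [
      { role: "system", content: "You are a luxury shopping concierge. Be brief and warm." }, // hypothetical persona
      { role: "user", content: userUtterance },
    ],
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) yield token; // forward tokens immediately (~500ms to the first one)
  }
}
```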
3. The Mouth: Text-to-Speech (TTS)
This converts text back into audio. The Leader: ElevenLabs. It generates hyper-realistic, emotional audio. It breathes. It pauses. It laughs. It intonates questions correctly.
- Latency: ~300ms (Streaming).
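A minimal sketch of the “Mouth” step, assuming ElevenLabs’ streaming REST endpoint. The voice ID and model name below are placeholders; check the ElevenLabs documentation for your own voice and the current low-latency model.

```typescript
// Sketch: request streamed speech from ElevenLabs and return the byte stream.
declare const ELEVENLABS_API_KEY: string;
const VOICE_ID = "your-voice-id"; // placeholder

async function speak(text: string): Promise<ReadableStream<Uint8Array>> {
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream`,
    {
      method: "POST",
      headers: {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text, model_id: "eleven_turbo_v2" }), // model name may change; see docs
    }
  );
  // The response body is an audio stream; start playback as bytes arrive
  // instead of waiting for the full file.
  return res.body!;
}
```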
The Engineering Challenge: Latency
If you chain these three APIs sequentially:
Wait for User -> STT (1s) -> LLM (2s) -> TTS (1s) -> Play Audio.
Total Delay: 4 Seconds.
In a conversation, 4 seconds is an eternity.
“Hello?” … (4s silence) … “Hi there.”
It feels broken. Users will hang up.
We need to get under 1 second (The “Magic Threshold” of conversation).
Solution: Streaming Pipelines and WebSockets. We do not wait for the user to finish speaking. We do not wait for the LLM to finish thinking.
- VAD (Voice Activity Detection): The browser uses the WebAudio API to detect when the user stops speaking (silence > 500ms) and automatically cuts the microphone (see the sketch after this list).
- Optimistic STT: Send audio chunks to Whisper over a WebSocket as they are recorded.
- LLM Streaming: As soon as the LLM outputs its first word (“Hello”), send it to ElevenLabs.
- Audio Streaming: As soon as ElevenLabs generates the first bytes of audio for “Hello”, start playing them.

This parallel processing brings the perceived latency down to ~800ms. GPT-4o (Omni) goes one step further: it handles audio-in / audio-out natively in a single model, reducing latency to ~300ms. This is the holy grail.
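To make the VAD step concrete, here is a browser-side sketch using the WebAudio API. The energy threshold and the 500ms silence window are illustrative values to tune for your environment.

```typescript
// Sketch: record from the microphone and stop automatically after ~500ms of silence.
const SILENCE_MS = 500; // silence window described above
const RMS_THRESHOLD = 0.02; // assumed energy floor separating speech from room noise

async function recordUntilSilence(): Promise<Blob> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 2048;
  ctx.createMediaStreamSource(stream).connect(analyser);

  const recorder = new MediaRecorder(stream);
  const chunks: BlobPart[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start(250); // emit a chunk every 250ms (these could be sent over a WebSocket)

  return new Promise((resolve) => {
    const buf = new Float32Array(analyser.fftSize);
    let silentSince = performance.now(); // in production, arm this only after speech starts

    const tick = () => {
      analyser.getFloatTimeDomainData(buf);
      const rms = Math.sqrt(buf.reduce((sum, x) => sum + x * x, 0) / buf.length);
      if (rms > RMS_THRESHOLD) silentSince = performance.now(); // speech detected, reset timer

      if (performance.now() - silentSince > SILENCE_MS) {
        recorder.onstop = () => {
          stream.getTracks().forEach((t) => t.stop());
          resolve(new Blob(chunks, { type: "audio/webm" }));
        };
        recorder.stop();
      } else {
        requestAnimationFrame(tick);
      }
    };
    requestAnimationFrame(tick);
  });
}
```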
Use Cases for Luxury Commerce
1. The Concierge
Imagine a “Call Concierge” button on your app.
- User: “I need a gift for my wife. She loves silk scarves but hates the color yellow. Budget is around 300 euros.”
- AI: “I understand. I have a beautiful Hermes-style silk square in Azure Blue. It is 250 euros. Shall I show it to you?”
- User: “Yes.”
The app then navigates automatically to the product page. This is Multimodal interaction: Voice drives the Screen.
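How does voice drive the screen? One common approach is LLM function calling: the model returns a structured tool call instead of plain text, and the client executes it. A hedged sketch follows; the navigate_to_product tool is hypothetical, and your app decides which actions actually exist.

```typescript
import OpenAI from "openai";

// Sketch: let the LLM request a navigation action via function calling.
const openai = new OpenAI();

const tools = [
  {
    type: "function" as const,
    function: {
      name: "navigate_to_product", // hypothetical tool name
      description: "Open a product page on the user's screen",
      parameters: {
        type: "object",
        properties: { productId: { type: "string" } },
        required: ["productId"],
      },
    },
  },
];

async function handleUtterance(utterance: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: utterance }],
    tools,
  });

  const call = completion.choices[0].message.tool_calls?.[0];
  if (call?.function.name === "navigate_to_product") {
    const { productId } = JSON.parse(call.function.arguments);
    // Voice drives the screen: route the client to the page the AI picked.
    window.location.href = `/products/${productId}`; // or router.push(...) in an SPA
  }
}
```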
2. The Post-Purchase Support
- User: “Where is my order?”
- AI: “I see order #1234. It is currently in Lyon. FedEx says it will arrive tomorrow by 2 PM. Do you want me to text you the tracking link?”
- User: “Yes, please.”

This replaces the frustrating “Press 1 for English” IVR menus.
3. In-Car Commerce
Drivers cannot look at screens. “Hey Maison, reorder my usual cologne.” The transaction happens purely via audio.
Privacy and Trust: The “Hot Mic” Problem
Users are paranoid that apps are listening to their conversations. This is the biggest barrier to adoption. Best Practices:
- Push-to-Talk: Require a physical button press to listen. It is safer than “Wake Words” (“Hey Siri”), which imply constant surveillance.
- Visual Feedback: Show a waveform animation when listening. Show a “Processing” state.
- Ephemeral Data: Do not store the audio recordings. Transcribe and delete immediately. State this in your Privacy Policy.
- Local Processing: If possible, run the “Wake Word” engine on-device (TensorFlow.js) so no audio is sent to the cloud until the user intends it.
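As an illustration of that last point, the sketch below uses the pre-trained @tensorflow-models/speech-commands model for on-device keyword spotting, so no audio leaves the browser until a trigger word is heard. Note that the pre-trained vocabulary is small and fixed (“go”, “stop”, digits, …); a real “Hey Maison” wake word would require transfer-learning the recognizer on your own recordings.

```typescript
import "@tensorflow/tfjs";
import * as speechCommands from "@tensorflow-models/speech-commands";

// Sketch: spot a trigger word entirely on-device before opening the cloud pipeline.
async function armLocalTrigger(onTrigger: () => void) {
  const recognizer = speechCommands.create("BROWSER_FFT");
  await recognizer.ensureModelLoaded();
  const labels = recognizer.wordLabels(); // fixed vocabulary of the pre-trained model

  recognizer.listen(
    async (result) => {
      const scores = result.scores as Float32Array;
      const best = labels[scores.indexOf(Math.max(...scores))];
      if (best === "go") {
        onTrigger(); // only now start streaming audio to the cloud
      }
    },
    { probabilityThreshold: 0.9 } // ignore low-confidence detections
  );
}
```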
The Skeptic’s View
“People don’t want to talk to robots.” Counter-Point: People don’t want to talk to dumb robots. People love talking to smart assistants (think Her or Jarvis). Once the latency drops and the intelligence rises, the friction feels minimal. Also, Gen Alpha already defaults to voice: today’s kids search YouTube by shouting at the iPad. They are your future customers.
FAQ
Q: Is it expensive? A: Yes. STT + LLM + TTS = ~$0.05 per minute. It is cheaper than a human agent ($0.50/min), but more expensive than a button click ($0.00). Use it for high-value interactions (Sales, Support), not for browsing.
Q: Does it support multiple languages? A: Yes. Whisper and ElevenLabs are natively multilingual. You can speak French and the AI can reply in English (or vice versa). This opens up global markets without hiring local support teams.
Conclusion
Voice is the ultimate interface because it is the oldest interface. We have been speaking for 100,000 years. We have been clicking mice for 40 years. Voice is “Back to Basics”. In 2026, a brand without a Voice Interface will feel as mute as a brand without a website in 2000. We are moving from “Search” to “Ask”.
Voice Authentication (Biometrics)
“Purchase Confirmed.” How do we know it’s you? Voice Biometrics. Your voice print is unique. We can use AI to verify identity with 99.9% accuracy (“My voice is my password”). This is smoother than asking for a PIN code or 2FA SMS. However, for high-value items, we recommend a Hybrid Flow: “Order placed. Please confirm with FaceID on your phone.” This multi-factor approach balances speed with security.
The Hybrid Voice/Screen Flow
Voice is great for Input (“Find red shoes”). Screen is great for Output (Showing 10 red shoes). We build Multimodal apps. User speaks. App updates the screen. User taps “Blue”. App says “Here are blue ones.” The modes reinforce each other. Do not force the user to “Listen” to a list of 10 products (“Product 1: … Product 2: …”). That is terrible UX. Use Voice for intent, Screen for selection.
Voice Search Optimization (Voice SEO)
People speak differently than they type.
- Typed: “best red wine 2025”
- Spoken: “What is a good red wine for a steak dinner for under 50 euros?”
Voice queries are Long Tail and Question-Based. To rank for voice (Siri / Google Assistant), you must structure your content as FAQ answers. The Schema.org Speakable property helps (see the snippet below). But mostly, it’s about having high-quality, conversational content that answers specific questions directly.
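As an illustration, Speakable markup injected on an FAQ page could look like this. The CSS selectors and URL are placeholders to adapt to your own templates.

```typescript
// Sketch: inject Schema.org "speakable" JSON-LD so voice assistants can find the answer.
const speakableJsonLd = {
  "@context": "https://schema.org",
  "@type": "WebPage",
  name: "How do I care for a silk scarf?", // hypothetical FAQ title
  speakable: {
    "@type": "SpeakableSpecification",
    cssSelector: [".faq-question", ".faq-answer"], // placeholder selectors
  },
  url: "https://example.com/faq/silk-scarf-care", // placeholder URL
};

const script = document.createElement("script");
script.type = "application/ld+json";
script.textContent = JSON.stringify(speakableJsonLd);
document.head.appendChild(script);
```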
Accessibility: Beyond Convenience
For us, Voice is a luxury feature. For a blind user, it is an essential feature. By building a Voice Interface, you also make your site accessible to the visually impaired: they can navigate, select products, and check out without a screen reader. This is Inclusive Design. It expands your addressable market while doing social good.
Work With Maison Code
If you want to offer a premium, hands-free shopping experience, Maison Code can build your Voice Strategy. We integrate Whisper, LLMs, and ElevenLabs to build sub-second-latency Voice Interfaces for web and mobile.