anyreach-asr: Speech Recognition for Voice Agents
Sub-300ms streaming transcription across 50+ languages with domain-adaptive accuracy
Voice agents live or die by transcription quality.
Streaming Latency
<300ms
Time-to-first-token
Languages
50+
With regional dialects
Executive Summary
What we built
anyreach-asr is a high-accuracy, low-latency speech recognition engine powering Anyreach's real-time voice agent pipeline. It handles noisy telephony audio, accented speech, domain-specific vocabulary, and multilingual code-switching out of the box.
Why it matters
Transcription quality makes or breaks a voice agent. A misheard medication name, a garbled email address, or a 500ms+ ASR delay creates cascading failures downstream -- wrong LLM responses, broken conversation flow, lost customer trust. Every word matters, and every millisecond counts.
Results
- 5.26% median WER on batch (47.4% lower than competitors)
- 6.84% median WER on streaming (54.3% lower than competitors)
- Sub-300ms time-to-first-token for real-time streaming
- Real-time multilingual code-switching across 10 languages
- Up to 90% improvement in domain-specific term recognition via keyterm prompting
Best for
- High-volume outbound/inbound voice agent calls
- Multilingual customer support (code-switching callers)
- Healthcare, finance, and legal transcription requiring domain accuracy
- Real-time captioning and compliance recording
Limitations
- Streaming PII redaction currently English-only (batch supports all languages)
- Audio Intelligence features (sentiment, summarization) English-only for now
The Problem
Generic ASR falls short in several recurring ways. Each failure mode has a different cause and a different cost.
Symptom: High word error rates in noisy telephony
Cause: Background noise, codec artifacts, and far-field microphones cause generic ASR models to produce garbage transcripts.

Symptom: Name and email spelling failures
Cause: A caller spells out "B as in Bravo, R-O-W-N at gmail dot com" and generic ASR outputs "brown at gmail.com" or, worse, "be our own a gmail com." Proper nouns, email addresses, and alphanumeric sequences are among the hardest recognition tasks.

Symptom: Accent and dialect misrecognition
Cause: Indian English, Australian English, Swiss German, regional Arabic dialects -- generic models struggle with non-standard pronunciation. The British and American pronunciations of "schedule," or Egyptian versus Moroccan Arabic, produce very different phonetic patterns.

Symptom: Multilingual caller confusion
Cause: A caller switches between Spanish and English mid-sentence; monolingual ASR produces gibberish for the non-English segments.

Symptom: Latency-induced conversation breakdown
Cause: ASR taking 500ms+ creates awkward "walkie-talkie" pauses, so the voice agent can't respond naturally. Competitors like OpenAI Whisper have 500ms+ TTFT and lack native streaming entirely.

Symptom: Filler words polluting LLM input
Cause: "Um, so, uh, I wanted to, uh, check on my, um, appointment" -- if filler words aren't handled, the LLM receives noisy input that degrades response quality.
How It Works
Audio flows through four stages, from raw ingestion to structured, intelligent output.
Audio Ingestion
Accepts raw audio from telephony/WebSocket
- Works directly on unprocessed audio without noise reduction preprocessing
- Preserves acoustic cues critical for accuracy
- Handles codec variations and telephony artifacts
Core Recognition Engine
Transformer-based architecture with latent space audio embedding
- Handles accents, dialects, and noisy conditions natively
- Supports 50+ languages with regional variants
- Sub-300ms streaming latency with interim results
Language & Formatting
Smart formatting and domain adaptation
- Punctuation, capitalization, paragraph breaks
- Entity detection (50+ types)
- Filler word handling and keyterm prompting
Output & Intelligence
Speaker diarization and PII protection
- Word-level timestamps with speaker labels
- Real-time PII/PHI/PCI redaction
- Sentiment analysis, topic detection, summarization
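Conceptually, the four stages above compose into a linear pipeline. A toy sketch of that flow (stage bodies are placeholders standing in for the real engine, not actual API calls):

```python
def ingest(audio: bytes) -> bytes:
    # Stage 1: accept raw telephony/WebSocket audio as-is,
    # with no noise-reduction preprocessing
    return audio

def recognize(audio: bytes) -> str:
    # Stage 2: placeholder for the transformer-based recognition engine
    return "i need to refill my clindamycin prescription"

def smart_format(text: str) -> str:
    # Stage 3: stand-in for punctuation/capitalization/entity formatting
    return text[0].upper() + text[1:] + "."

def transcribe(audio: bytes) -> str:
    # Stage 4 (diarization, redaction, intelligence) omitted for brevity
    return smart_format(recognize(ingest(audio)))

result = transcribe(b"\x00\x01")
```

The point of the composition is that each stage only improves the text it receives; nothing upstream depends on downstream configuration.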
Product Features
Ready for production with enterprise-grade reliability.
Sub-300ms Streaming Latency
200-300ms time-to-first-token, with partial/interim transcripts delivered while the caller is still speaking. Endpointing (voice activity detection) is configurable from 10ms to 500ms+ to tune for chatbot-style short utterances vs natural conversation. Competitors like Whisper lack native streaming entirely (500ms+ TTFT). This eliminates "walkie-talkie" delays that make voice agents feel robotic.
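To see why sub-300ms TTFT matters, consider a rough end-to-end turn budget. All numbers other than the ASR figures above are illustrative assumptions, not measured values:

```python
def turn_latency_ms(asr_ttft: int = 300, endpointing: int = 300,
                    llm_ttft: int = 400, tts_ttfb: int = 200) -> int:
    """Approximate silence the caller hears between finishing an
    utterance and hearing the agent's first audio (serial worst case)."""
    return asr_ttft + endpointing + llm_ttft + tts_ttfb

# A 500ms+ ASR TTFT with conservative endpointing blows past 1.5s:
slow = turn_latency_ms(asr_ttft=500, endpointing=500)
# Fast ASR plus aggressive endpointing keeps the turn under a second:
fast = turn_latency_ms(asr_ttft=250, endpointing=100)
```

The same LLM and TTS budgets appear in both cases; ASR TTFT and endpointing are the two knobs this product controls.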
50+ Languages, 10 Simultaneous
Supports languages from English to Arabic (17 dialects) to South Asian languages. Real-time code-switching handles multilingual callers across English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch simultaneously -- without explicit language detection or routing.
Domain-Adaptive Vocabulary (Keyterm Prompting)
Accepts up to 100 custom terms per request. Domain-specific words like "Clindamycin" improve from 71% to 96% confidence instantly, no model retraining required. Supports proper nouns (company names, person names), product names, medical/legal/financial jargon. One customer saw 625% improvement in veterinary term recognition.
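Because requests accept at most 100 custom terms, callers typically dedupe and cap their vocabulary before sending it. A minimal sketch of that client-side step (helper name and payload shape are illustrative, not the actual SDK API):

```python
def build_keyterm_params(terms: list[str], max_terms: int = 100) -> dict:
    """Trim, dedupe, and cap domain vocabulary before attaching it
    to a transcription request."""
    cleaned: list[str] = []
    for term in terms:
        term = term.strip()
        if term and term not in cleaned:
            cleaned.append(term)
    if len(cleaned) > max_terms:
        raise ValueError(f"keyterm limit is {max_terms}, got {len(cleaned)}")
    return {"keyterm": cleaned}

params = build_keyterm_params(["Clindamycin", "Clindamycin", " Rimadyl "])
```

Keeping the list deduped matters in practice: repeated terms waste slots in the 100-term budget without improving recognition.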
Name & Email Spelling Accuracy
Entity detection identifies 50+ entity types including person names, email addresses, phone numbers, and SSNs in real-time. Combined with keyterm prompting for expected proper nouns, and smart formatting that automatically structures emails/URLs/phone numbers. Handles letter-by-letter spelling dictation in voice agent workflows.
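To make the spelling-dictation case concrete, here is a toy post-processing sketch that collapses a spelled-out email span into a canonical address. It assumes the span has already been identified as an email entity; the function and approach are illustrative, not the engine's smart-formatting logic:

```python
import re

def normalize_spelled_email(span: str) -> str:
    """"b r o w n at gmail dot com" -> "brown@gmail.com"."""
    span = span.lower()
    span = re.sub(r"\s+at\s+", "@", span)    # spoken "at" -> @
    span = re.sub(r"\s+dot\s+", ".", span)   # spoken "dot" -> .
    # merge single spelled letters: "b r o w n" -> "brown"
    span = re.sub(r"\b([a-z])\s+(?=[a-z]\b)", r"\1", span)
    return span.replace(" ", "")

address = normalize_spelled_email("b r o w n at gmail dot com")
```

Handling phonetic-alphabet forms ("B as in Bravo") needs an extra mapping step and is omitted here.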
Accent & Dialect Recognition
Not just language support but explicit dialect handling: 5 English variants (US, AU, GB, IN, NZ), Swiss German (de-CH), Flemish (nl-BE), Canadian French (fr-CA), Brazilian vs European Portuguese (pt-BR, pt-PT), Latin American Spanish (es-419), and 17 Arabic regional variants (Egypt, Morocco, Saudi, UAE, etc.). Handles regional phonetic shifts and non-standardized pronunciation patterns.
Filler Word Handling
Recognizes "um", "uh", "uh-huh", "mhmm", and "nuh-uh." Default behavior strips "um" and "uh" for clean LLM-ready transcripts. Verbatim mode (`filler_words=true`) preserves all disfluencies with consistent spelling normalization regardless of spoken duration. Essential for sales coaching (measuring confidence), legal transcription (verbatim record), public speaking analysis, and language instruction.
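A toy sketch of the default strip-versus-verbatim behavior described above (the regex is illustrative; the engine's actual normalization also covers forms like "uh-huh" and "mhmm"):

```python
import re

def strip_fillers(text: str, verbatim: bool = False) -> str:
    """Drop "um"/"uh" for LLM-ready transcripts; verbatim=True keeps
    all disfluencies, mirroring the filler_words=true option."""
    if verbatim:
        return text
    # (?![\w-]) keeps hyphenated forms like "uh-huh" intact
    cleaned = re.sub(r"\b(?:um+|uh+)\b(?![\w-])[,.]?\s*", "", text,
                     flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

clean = strip_fillers("Um, so, uh, I wanted to, uh, check on my, um, appointment")
```

Stripping happens before the transcript reaches the LLM, so the agent reasons over "so, I wanted to, check on my, appointment" rather than the disfluent original.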
Smart Formatting
Automatic punctuation, capitalization, and paragraph breaks. For English: dates, times, currency amounts, phone numbers, email addresses, and URLs are formatted correctly. Works across all languages with broadest support for English.
Speaker Diarization
Word-level speaker labels with precise start/end timestamps and confidence scores. Identifies who said what and when. Works in both streaming (speaker IDs) and batch (IDs + confidence scores) modes.
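Downstream consumers usually fold word-level speaker labels into speaker turns. A minimal sketch, assuming a typical word-list shape (field names are illustrative, not the exact response schema):

```python
def group_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive same-speaker words into turns with
    start/end timestamps."""
    turns: list[dict] = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            turns.append({"speaker": w["speaker"], "text": w["word"],
                          "start": w["start"], "end": w["end"]})
    return turns

words = [
    {"word": "hi", "speaker": 0, "start": 0.00, "end": 0.21},
    {"word": "there", "speaker": 0, "start": 0.24, "end": 0.50},
    {"word": "hello", "speaker": 1, "start": 0.92, "end": 1.30},
]
turns = group_turns(words)
```

The turn boundaries fall out of the labels alone; no separate segmentation pass is needed.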
Real-Time PII Redaction
Supports 50+ entity types across PII (names, locations, SSNs), PHI (medical conditions, drugs, blood types), and PCI (credit card numbers, CVV, expiration). HIPAA-compliant. Granular control -- choose specific entity types to redact or use category groups.
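The granular-control idea can be sketched as a small config expansion step: category groups expand to their member entities, and individual entity types pass through. Group and entity names below are illustrative, not the product's exact identifiers:

```python
CATEGORY_GROUPS = {
    "pii": {"person_name", "location", "ssn"},
    "phi": {"medical_condition", "drug", "blood_type"},
    "pci": {"credit_card_number", "cvv", "expiration_date"},
}

def redaction_entities(selection: list[str]) -> list[str]:
    """Expand category groups and individual entity types into the
    final set of entities to redact."""
    entities: set[str] = set()
    for item in selection:
        entities |= CATEGORY_GROUPS.get(item, {item})
    return sorted(entities)

chosen = redaction_entities(["pci", "person_name"])
```

Mixing a whole category with one extra entity type, as here, is the common pattern for compliance configs that need PCI coverage plus caller names.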
Noise Robustness
Works directly on raw, unprocessed audio rather than relying on noise reduction preprocessing (which can actually degrade accuracy by removing acoustic cues). Handles significant speaker-to-microphone distance, overlapping speech, background noise, codec artifacts. Proven in air traffic control, drive-thru, call center, and clinical environments.
Integration Details
Runs On
Anyreach Cloud or Self-Hosted (AWS, GCP, Azure)
Latency Budget
<300ms TTFT streaming
Providers
REST API, WebSocket Streaming, Python SDK, Node.js SDK, .NET SDK
Implementation
1-2 days typical
Frequently Asked Questions
Common questions about our speech recognition engine.
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
