On Tuesday, Amazon debuted a new generative AI model, Nova Sonic, capable of natively processing voice and generating natural-sounding speech. Amazon claims that Sonic’s performance is competitive with frontier voice models from OpenAI and Google on benchmarks measuring speed, speech recognition, and conversational quality.
Nova Sonic is Amazon’s answer to newer AI voice models such as the model powering ChatGPT’s Voice Mode, which feel more natural to speak with than the more rigid models from Amazon Alexa’s early days. Recent technological breakthroughs have made legacy models and the digital assistants they underpin, such as Alexa and Apple’s Siri, seem incredibly stilted by comparison.
Nova Sonic is available through Bedrock, Amazon’s developer platform for building enterprise AI applications, via a new bi-directional streaming API. In a press release, Amazon called Nova Sonic “the most cost-efficient” AI voice model on the market, and around 80% less expensive than OpenAI’s GPT-4o.
Parts of Nova Sonic are already powering Alexa+, Amazon’s upgraded digital voice assistant, according to Amazon SVP and Head Scientist of AGI Rohit Prasad.
In an interview, Prasad told TechCrunch that Nova Sonic builds on Amazon’s expertise in “large orchestration systems,” the technical scaffolding that makes up Alexa. Compared to rival AI voice models, Nova Sonic excels at routing user requests to different APIs, said Prasad. This capability helps Nova Sonic “know” when it needs to fetch real-time information from the internet, parse a proprietary data source, or take action in an external application, and use the right tool to do it.
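To make the routing idea concrete, here is a minimal sketch of the general pattern, in which a model emits a structured tool request and the application dispatches it to a matching function. It is not Amazon's implementation or the Bedrock API: the tool names, the `dispatch` function, and the shape of the `tool_call` dictionary are all hypothetical, and the tools return placeholder strings rather than calling real services.

```python
from typing import Any, Callable, Dict

# Hypothetical tools: each stands in for one of the three cases Prasad
# describes (real-time lookup, proprietary data, external action).
def search_web(query: str) -> str:
    return f"[web results for: {query}]"

def query_documents(question: str) -> str:
    return f"[internal documents matching: {question}]"

def book_table(restaurant: str, time: str) -> str:
    return f"[booked {restaurant} at {time}]"

# Registry mapping a tool name to the function that implements it.
TOOLS: Dict[str, Callable[..., str]] = {
    "search_web": search_web,
    "query_documents": query_documents,
    "book_table": book_table,
}

def dispatch(tool_call: Dict[str, Any]) -> str:
    """Run the tool named in a model-emitted request (assumed shape:
    {"name": str, "arguments": dict}) and return its result."""
    name = tool_call.get("name")
    handler = TOOLS.get(name)
    if handler is None:
        # Unknown tool: report it instead of raising, so the caller can
        # hand the message back to the model.
        return f"[no tool named {name!r}]"
    return handler(**tool_call.get("arguments", {}))

if __name__ == "__main__":
    print(dispatch({"name": "search_web", "arguments": {"query": "weather in Seattle"}}))
    print(dispatch({"name": "book_table", "arguments": {"restaurant": "Canlis", "time": "19:00"}}))
```

In this pattern the model only decides which tool to request and with what arguments; the application code stays in control of what actually runs.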
During a two-way dialogue, Nova Sonic waits to speak “at the appropriate time,” taking into account a speaker’s pauses and interruptions, says Amazon. It also generates a text transcript of the user’s speech, which developers can use for various applications.
Nova Sonic is less prone to speech recognition errors than other AI voice models, according to Prasad, meaning the model is relatively good at understanding a user’s intent even if they mumble, misspeak, or are in a noisy setting. On a benchmark measuring speech recognition across languages and dialects, Multilingual LibriSpeech, Amazon says Nova Sonic achieved a word error rate (WER) of just 4.2% when averaged across English, French, Italian, German, and Spanish. That means that roughly four out of every 100 words from the model differed from a human transcription in those languages.
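For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the human reference, divided by the number of words in the reference. The sketch below computes it for a single sentence pair using word-level edit distance. The function name and the simple lowercase-and-split normalization are assumptions for illustration; Amazon has not published the scoring script behind its figures, and benchmark pipelines typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference,
    divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A reported 4.2% corresponds to this ratio being about 0.042 over the benchmark's test data.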
On another benchmark measuring loud interactions with multiple participants, Augmented Multi Party Interaction, Amazon says Nova Sonic was 46.7% more accurate in terms of WER than OpenAI’s GPT-4o-transcribe model. Nova Sonic also has industry-leading speed, with an average perceived latency of 1.09 seconds, according to Amazon. That makes it faster than the GPT-4o model powering OpenAI’s Realtime API, which responds in 1.18 seconds, per benchmarking by Artificial Analysis.
Prasad says Nova Sonic is part of Amazon’s broader strategy to build AGI (artificial general intelligence), which the company defines as “AI systems that can do anything a human can do on a computer.” Moving forward, Prasad says Amazon plans to release more AI models that can understand different modalities, including image, video, and voice, as well as “other sensory data that are relevant if you bring things into the physical world.”
Amazon’s AGI division, which Prasad oversees, seems to be playing a larger role in the company’s product strategy as of late. Just last week, Amazon released a preview of Nova Act, a browser-using AI model that appears to be powering parts of Alexa+ and Amazon’s Buy for Me feature. Starting with Nova Sonic, Prasad says the company wants to offer more of its internal AI models for developers to build with.