Theodore Twombly, the main character of the movie ‘Her’, fell in love with a machine called Samantha. He didn’t even need to see her or touch her. It was enough to listen to her voice, which was actually that of actress Scarlett Johansson.
That was science fiction, but little by little we are approaching a point where falling in love with a machine no longer is. We have been seeing this for some time with Replika, the AI service that allows virtual avatars to become our friends or something more.
That service achieves it with an AI model that generates text, like ChatGPT. Until now we chatted with machines, but little by little we are beginning to talk to them. ChatGPT’s voice modes offer precisely that option, and in fact the company had to withdraw one of its voices for being too similar to Scarlett Johansson’s.
But now an artificial intelligence startup called Sesame has gone one step further. At the end of February the company published a demonstration of its conversational voice generation model (CSM, for Conversational Speech Model), and its impact has been remarkable.
Some users have reported feeling an emotional connection with the model’s female and male voices (“Maya” and “Miles”). One of them, who published his impressions on Hacker News, explained: “I am even a little worried about whether I will start feeling emotionally attached to a voice assistant that sounds this human.”
Anyone can try speaking with Maya or Miles thanks to that demo on the Sesame website. The only obstacle is that conversations must be in English: for now these models do not speak other languages.
I just did it for a few minutes, and the way this conversational chatbot works is really surprising. The voice is warm and close, but above all it perfectly imitates the way a person would speak, with pauses, hesitations and changes in intonation. The voice generation is instantaneous, there is no latency, and the sensation is certainly that of having a conversation with another human being. It’s strange, exciting and disturbing at the same time.
As those responsible explain on their blog, “at Sesame our goal is to achieve ‘voice presence’, that magical quality that makes spoken interactions feel real, understood and valued.” They are aiming at something similar to what Replika aimed at: creating “conversational companions” that offer genuine dialogue with which to build trust over time.
These models are not perfect. Maya, for example, has been shown to do strange things from time to time, but comments on some discussion forums, such as this Reddit thread, make it clear that the quality of these models is spectacular.
And if you do not believe it, take a look at this conversation that Gavin Purcell, one of the people behind the podcast AI for Humans, posted on Reddit, in which he argues unsuccessfully with the machine to try to find its limits.
He does not seem to manage it, and in fact it is impossible to detect that one of the speakers is a machine. Its response speed, its changes in tone and its choice of phrases and words are all amazing. Sesame’s conversational chatbot also allows role-playing, something that OpenAI, for example, usually limits.
OpenAI has been working on its voice modes for ChatGPT, and Grok 3 has also implemented different synthesized voices that adapt to diverse personalities. There is even an “unhinged” voice and a “sexy” one, for example, which shows once again that Musk and xAI do not mind experimenting.
As Ars Technica reports, Sesame has achieved this advance thanks to two models (a backbone and a decoder) that work together. Both are based on the Llama architecture, and Sesame has proposed three different sizes. The largest of all combines a backbone model of 8 billion parameters with a decoder of 300 million, which results in a combined 8.3B model. To train it they used a million hours of audio in English.
The comments in a Hacker News discussion made it clear that the quality of Sesame’s voices is almost human, but even so users kept noticing that something was off. One of Sesame’s co-founders, Brendan Iribe, took part in the discussion, thanking users for those comments and confirming that the team still has a lot of work ahead. The model is “still too eager and often inappropriate in its tone, prosody and pacing”, he explained, and it has problems with interruptions, timing and conversational flow. “Today we are firmly in the (uncanny) valley”, he said, “but we are optimistic that we can get out of it.”
The possibilities seem almost unlimited for these types of models, but they cut both ways, for better and for worse. Their use to impersonate people, for example, has already caused some serious scares. Here, creating a “family password” can be very useful for avoiding some of those problems, although for now cloning voices is not allowed.
We will see how AI companies react to these types of problems, but everything indicates that the future in which we will constantly talk with machines (and even fall in love with them) is getting closer.