AI headphones let users focus on a single voice in noisy environments

Ethics & Society

May 28, 2024

  • Researchers developed headphones that single out individual voices from crowds
  • The University of Washington team calls it Target Speech Hearing (TSH)
  • It’s particularly promising for those with auditory problems

Researchers at the University of Washington have developed an AI system that allows noise-canceling headphones to isolate and amplify a single voice in a crowded, noisy environment. 

The technology, called Target Speech Hearing (TSH), enables users to select a specific person to listen to by simply looking at them for a few seconds.

The TSH system addresses a common challenge faced by noise-canceling headphones: while they effectively reduce ambient noise, they do so indiscriminately, making it difficult for users to hear specific sounds they might want to focus on. 

As Shyam Gollakota, a professor at the University of Washington and the project’s lead researcher, explains, “Listening to specific people is such a fundamental aspect of how we communicate and how we interact with other humans. But it can get really challenging, even if you don’t have any hearing loss issues, to focus on specific people when it comes to noisy situations.”

How it works

The system combines noise-canceling headphone hardware with AI to home in on individual voices in loud, crowded settings. It works in stages:

  1. During the “enrollment” phase, the user looks at the target speaker for a few seconds, allowing the binaural microphones on the headphones to capture an audio sample containing the speaker’s vocal characteristics, even in the presence of other speakers and noises.
  2. The captured binaural signal is processed by a neural network that learns the characteristics of the target speaker, separating their voice from interfering speakers using directional information.
  3. The learned characteristics of the target speaker, represented as an embedding vector, are then input into a different neural network designed to extract the target speech from a cacophony of speakers.
  4. Once the target speaker’s characteristics have been learned during the enrollment phase, the user can look in any direction, move their head, or walk around while still hearing the target speaker.
  5. The TSH system continuously processes the incoming audio, using the learned speaker embedding to isolate and amplify the target speaker’s voice while suppressing other voices and background noise.
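The paper's actual pipeline uses trained neural networks on binaural audio, but the enroll-then-extract flow above can be sketched in miniature. In this hedged illustration, a fixed spectral "embedding" stands in for the learned speaker embedding, and a cosine-similarity gain stands in for the extraction network's mask; the function names `enroll` and `extract` are invented for the sketch and are not from the TSH codebase.

```python
import numpy as np

FRAME = 64  # samples per frame in this toy example

def enroll(sample):
    """Enrollment stand-in: average the magnitude spectrum of a short
    audio sample into a fixed-size vector (a crude proxy for the
    learned speaker embedding in the real system)."""
    frames = sample.reshape(-1, FRAME)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    emb = spectrum.mean(axis=0)
    return emb / (np.linalg.norm(emb) + 1e-9)

def extract(mixture_frames, target_emb):
    """Extraction stand-in: weight each incoming frame by its cosine
    similarity to the target embedding, suppressing dissimilar
    (interfering) audio while passing the target's frames through."""
    out = []
    for frame in mixture_frames:
        spec = np.abs(np.fft.rfft(frame))
        spec = spec / (np.linalg.norm(spec) + 1e-9)
        gain = max(0.0, float(spec @ target_emb))  # crude soft mask
        out.append(frame * gain)
    return np.array(out)
```

A frame whose spectrum resembles the enrolled speaker's keeps most of its energy, while spectrally dissimilar frames are attenuated; the real system performs this separation continuously with neural networks, robust to head movement and overlapping talkers.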

The current prototype can only effectively enroll a targeted speaker whose voice is the loudest in a particular direction, but the team is working on improving the system to handle more complex scenarios with diverse, varied audio sources.

Samuele Cornell, a researcher at Carnegie Mellon University’s Language Technologies Institute, praises the research for its clear real-world applications, stating, “I think it’s a step in the right direction. It’s a breath of fresh air.”

While the TSH system is currently a proof of concept, the researchers are in talks to embed the technology in popular brands of noise-canceling earbuds and make it available for hearing aids. 

Combined with the rapid advances in audio and speech analysis exemplified by GPT-4o, technologies like TSH could help those with visual and auditory impairments better connect to the sensory world around them.
