Schedule as of Oct 11, 2022 - subject to change

Default Time Zone is EDT - Eastern Daylight Time

Thursday, October 27 • 11:30am - 11:45am
Audio-Visual-Information-Based Speaker Matching Framework for Selective Hearing in a Recorded Video

When we capture a video using a smartphone, unwanted sounds are often recorded along with the desired ones, or the desired sounds are not heard clearly. Selective hearing technology enables users to select a visual object in a video and focus on the sound of that object. To make this possible, the sound source of each object needs to be separated. It is also necessary to find the mapping between each separated sound source and its visual object. With recent advances in deep learning, the accuracy of sound source separation has improved tremendously, even with lightweight deep learning models. However, with conventional sound source separation methods, which use only audio information, it is still difficult to know which visual object in the video a separated sound source corresponds to.

In this paper, to solve this problem from the perspective of applying selective hearing technology to voice, we propose a deep-neural-network-based speaker matching framework that uses both audio and visual information.
As a pre-processing step for the proposed framework, we first prepare all the voice sources heard in a recorded video by using a conventional voice separation algorithm. We also segment the mouth regions, including the lips, of the main speakers seen in the video. Then, based on the assumption that changes in lip movement and in the muscles around the mouth are highly correlated with changes in the voice signal, the proposed framework predicts the mapping between the prepared sets of voice sources and mouth regions.

The framework consists of three main parts: i) extraction of audio features from audio information including voice, ii) extraction of visual features from visual information, such as the movement of the lips and surrounding muscles, for each speaker, and iii) analysis of the similarities between those features in order to map each separated voice to the corresponding speaker. The framework was implemented as an end-to-end convolutional neural network and trained with the cross-entropy loss.
Meanwhile, we observed that comparing the similarities in sections where the speaker shows different mouth movements, pauses to breathe, or speaks at different speeds can further improve the matching accuracy and the robustness of the framework. To select a suitable time section for the matching analysis, we also introduce an optimal time section selection scheme based on the proposed intra- and inter-voice pattern analysis of the separated voice signals.
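
The similarity analysis in part iii) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract describes a trained end-to-end convolutional network, whereas this example swaps in plain cosine similarity over already-extracted feature vectors and an exhaustive search over one-to-one assignments. The function names, the feature vectors, and the assumption that the number of separated voices equals the number of visible speakers are all hypothetical.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_voices_to_speakers(voice_feats, mouth_feats):
    """Return a list where entry i is the index of the speaker (mouth region)
    assigned to separated voice i.

    Assumption: one voice per visible speaker, so the lists have equal length.
    The search is exhaustive, which is only practical for a handful of speakers.
    """
    if len(voice_feats) != len(mouth_feats):
        raise ValueError("expected one separated voice per visible speaker")
    sim = [[cosine(v, m) for m in mouth_feats] for v in voice_feats]
    n = len(sim)
    # Pick the one-to-one assignment with the highest total similarity.
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )
    return list(best)


# Toy features (hypothetical): each voice resembles exactly one mouth region.
mouth = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
voice = [[0.1, 0.0, 0.9], [0.8, 0.2, 0.0], [0.1, 0.9, 0.1]]
print(match_voices_to_speakers(voice, mouth))
```

With about 2 to 5 speakers, the exhaustive search covers at most 120 assignments; a larger group would call for a dedicated assignment algorithm instead.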

We performed accuracy tests on three categories of datasets: interview (100 speakers), script-reading (30 speakers), and singing-a-song (30 speakers). According to our analysis of a general user scenario for selective hearing, the number of relevant speakers present in a recorded video is likely to be about 2 to 5. Conventional sound source separation methods also typically target that number of speaker voices. Therefore, the matching accuracy of the framework was evaluated by randomly selecting 5 people from each dataset and simulating real situations in which they speak or sing at the same time.
The speaker matching accuracies on the three datasets are 99.2%, 98.2%, and 95.3%, respectively. These results show that the proposed framework can provide accuracy high enough for actual selective hearing applications. Notably, it provides high-accuracy speaker matching even on the singing-a-song dataset, in which the speakers sing the lyrics of the same song with similar beats.
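
The evaluation protocol above can be sketched as follows, under stated assumptions. The abstract does not say how accuracy is aggregated or how many trials were run, so this example assumes per-voice accuracy averaged over repeated random groups of 5 speakers. The `match_fn` callback, the trial count, and the seed are hypothetical placeholders, and the code reproduces none of the reported figures.

```python
import random


def evaluate_matching(speaker_ids, match_fn, n_trials=100, group_size=5, seed=0):
    """Estimate matching accuracy over randomly drawn speaker groups.

    speaker_ids: all speakers in one dataset.
    match_fn:    hypothetical callback that receives the true speaker of each
                 separated voice (in voice order) and returns the predicted
                 speaker for each voice; a real system would take the mixed
                 video as input instead.
    Returns the fraction of voices mapped to the correct speaker (assumed
    metric; the abstract does not define it).
    """
    pool = list(speaker_ids)
    if group_size > len(pool):
        raise ValueError("group_size exceeds the number of speakers")
    rng = random.Random(seed)  # fixed seed so the trials are repeatable
    correct = total = 0
    for _ in range(n_trials):
        group = rng.sample(pool, group_size)  # distinct speakers, no repeats
        predicted = match_fn(group)
        correct += sum(p == t for p, t in zip(predicted, group))
        total += group_size
    return correct / total


# A perfect matcher (identity) scores 1.0 on a 30-speaker pool.
print(evaluate_matching(range(30), lambda group: list(group)))
```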


Jungkyu Kim

Samsung Research, Samsung Electronics

Kyung-Rae Kim

Samsung Research, Samsung Electronics

Woo Hyun Nam

Principal Engineer, Samsung Research, Samsung Electronics
Woo Hyun Nam received the Ph.D. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 2013. Since 2013, he has been with Samsung Research, Samsung Electronics, where he is a Principal Engineer and is currently leading...

Thursday October 27, 2022 11:30am - 11:45am EDT
Online Papers
  Applications in Audio
  • badge type: ALL ACCESS or ONLINE