66 — Lip2AudSpec: Speech reconstruction from silent lip movements video

Akbari et al. (1710.09798)

Read on 26 October 2017
#CNN  #LSTM  #neural-network  #audio  #signal-processing  #speech 

This paper presents a way to lip-read using CNNs to reconstruct a spectrogram. This spectrogram is then passed into a speech-recognition stage to determine what the speaker is saying.
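To make the pipeline concrete, here is a minimal, hypothetical sketch (PyTorch, not the authors' code) of a video-to-spectrogram network: a small 3D CNN over the silent mouth-region frames feeds an LSTM that regresses one spectrogram frame per video frame. All layer sizes, shapes, and names are illustrative assumptions; the paper's actual model additionally predicts bottleneck features of a spectrogram autoencoder.

```python
# Illustrative sketch only: a 3D CNN + LSTM that maps silent video to a spectrogram.
import torch
import torch.nn as nn

class Lip2SpecSketch(nn.Module):
    def __init__(self, n_spec_bins=128):
        super().__init__()
        # 3D conv over (time, height, width) of grayscale mouth crops
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space
        )
        self.lstm = nn.LSTM(input_size=64 * 4 * 4, hidden_size=256, batch_first=True)
        self.head = nn.Linear(256, n_spec_bins)  # one spectrogram frame per video frame

    def forward(self, frames):
        # frames: (batch, 1, T, H, W) silent grayscale video of the mouth region
        feats = self.encoder(frames)                     # (batch, 64, T, 4, 4)
        b, c, t, h, w = feats.shape
        feats = feats.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        out, _ = self.lstm(feats)                        # (batch, T, 256)
        return self.head(out)                            # (batch, T, n_spec_bins)
```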

The system is able to generate a spectrogram with 98% correlation with the spectrogram of the video's original audio track, which suggests either that there is a fundamental correlation between the shape of someone's face/mouth/biology and the sound of their voice, or that the system is picking up on audio information otherwise encoded in the video data.
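For context on that headline number, a "correlation" between reconstructed and reference spectrograms is typically a Pearson correlation over the flattened spectrogram values. A generic sketch of the metric (not the paper's evaluation code) would look like:

```python
# Pearson correlation between two spectrograms of identical shape (time, freq).
import numpy as np

def spectrogram_correlation(reconstructed, reference):
    x = reconstructed.ravel().astype(np.float64)
    y = reference.ravel().astype(np.float64)
    x -= x.mean()
    y -= y.mean()
    # small epsilon guards against division by zero for silent/constant inputs
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))
```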

The system is trained by removing the audio track from a video and training an LSTM to generate the spectrogram of the speech “pronounced” by the silent face on screen.
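A hedged sketch of that training setup, reusing the Lip2SpecSketch model above with synthetic stand-in data: the targets come from the original audio track's spectrogram, while the model only ever sees the silent frames. (The paper actually predicts bottleneck features of a pretrained spectrogram autoencoder; regressing the spectrogram directly with an MSE loss, as done here, is a simplification.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 8 clips of 25 grayscale 48x48 mouth frames and matching
# 25-frame, 128-bin target spectrograms. A real setup would load the training videos
# and spectrograms computed from their (removed) audio tracks.
frames = torch.randn(8, 1, 25, 48, 48)
target_specs = torch.randn(8, 25, 128)
train_loader = DataLoader(TensorDataset(frames, target_specs), batch_size=4)

model = Lip2SpecSketch(n_spec_bins=128)   # model sketch defined above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

for epoch in range(2):
    for clip, spec in train_loader:
        optimizer.zero_grad()
        pred = model(clip)                # the model never sees the audio
        loss = loss_fn(pred, spec)
        loss.backward()
        optimizer.step()
```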

The audio synthesized from the generated spectrograms was then presented to twenty humans on Amazon’s Mechanical Turk to listen to and transcribe; the graders were also asked to decide whether the voice sounded “Natural”, was of the “Correct Gender”, and other metrics.

The authors demonstrated that they could reliably generate realistic-sounding audio from the muted video clips, and that the speech recovered from those spectrograms could be transcribed with high accuracy.

This sort of system could be deployed for automatic transcription of muted video, or other cases in which transcription of video without audio is required. Sure would love to run this on some background characters in old silent films!

The code is available online at https://github.com/hassanhub/LipReading.