238 — Voices Obscured in Complex Environmental Settings (VOICES) corpus

Richey & Barrios, et al. (1804.05053)

Read on 15 April 2018
#dataset  #audio  #acoustics  #echo  #distraction  #corpus  #speech-recognition  #speaker-recognition  #signal-processing  #speech-to-text 

Okay, so I know I just said I don’t usually write about datasets, but here’s another exception, because I think this use case is really interesting: the VOICES dataset is a constructed database of voice utterances in which each utterance is provided both in its simple, “pure” form and in a distorted, echoed, or complex form (i.e., with other voices or sounds co-occurring). Each utterance was recorded with 12 microphones, so the sound at each location in a large room can be cross-referenced.

TV, radio, other human conversation, echo, and other features were added to some of the audio recordings, which makes source separation not only a nontrivial task but an inconsistent one: if you sample randomly from the sound clips, you’re likely to get a variety of different acoustic profiles.
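To make the "random sampling gives you mixed acoustic profiles" point concrete, here is a toy sketch. Everything in it is made up for illustration: the condition labels, the record fields, and the helper names `build_toy_index` and `sample_profiles` are my own assumptions, not the corpus's actual metadata format.

```python
import random
from collections import Counter

# Hypothetical condition labels; the real corpus may name or organize these differently.
CONDITIONS = ["none", "tv", "radio", "conversation"]
MICS = range(1, 13)  # 12 microphones, per the description above


def build_toy_index(n_utterances):
    """Build a made-up index with one record per (utterance, mic, condition)."""
    return [
        {"utterance": u, "mic": m, "condition": c}
        for u in range(n_utterances)
        for m in MICS
        for c in CONDITIONS
    ]


def sample_profiles(index, k, seed=0):
    """Draw k records at random and count the (mic, condition) profiles seen."""
    rng = random.Random(seed)
    batch = rng.sample(index, k)
    return Counter((r["mic"], r["condition"]) for r in batch)


index = build_toy_index(100)
profiles = sample_profiles(index, 32)
print(len(profiles), "distinct (mic, condition) profiles in a batch of 32")
```

If you wanted consistent conditions instead, you would filter the index on `condition` (or `mic`) before sampling rather than drawing from the whole pool.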

The dataset contains over 120 hours of audio per microphone, with 374,688 audio clips in total. In addition, the 3D layout of each scene is provided in a metadata file accompanying each audio file.
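For a rough sense of scale, here is some back-of-the-envelope arithmetic on those figures. It assumes the quoted numbers are exact and that the clip count covers all 12 microphones; both are my assumptions, not something the paper summary above confirms.

```python
# Figures quoted in the post; treating them as exact is an assumption.
HOURS_PER_MIC = 120
NUM_MICS = 12
TOTAL_CLIPS = 374_688  # assumed to count clips across all microphones

total_hours = HOURS_PER_MIC * NUM_MICS
clips_per_mic = TOTAL_CLIPS // NUM_MICS
avg_clip_seconds = total_hours * 3600 / TOTAL_CLIPS

print(f"total audio: ~{total_hours} hours")
print(f"clips per microphone: {clips_per_mic}")
print(f"average clip length: ~{avg_clip_seconds:.1f} s")
```

Under those assumptions, that works out to roughly 1,440 hours of audio overall and clips averaging about 14 seconds each.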

The researchers intend to make the dataset publicly downloadable, though as far as I can tell, it isn’t available yet.