98 — Speech Recognition for Medical Conversations

Chiu et al (1711.07274)

Read on 26 November 2017
#CTC  #LAS  #transcription  #medicine  #audio  #group:google 

Medical transcription is hard. Google is good at hard things because they have a lot of GPUs, but… medical transcription is really really hard.

In this paper, Chiu et al use two models:

The first is the Listen, Attend and Spell (LAS) model, an attention-based encoder-decoder that "spells out" its transcription one character at a time instead of picking whole words from a fixed vocabulary:

“Patient presented with syncope” comes out as p-a-t-i-e-n-t … s-y-n-c-o-p-e

The second is a Connectionist Temporal Classification (CTC) model, which predicts phonemes and then constructs words from them using either a static or dynamic dictionary.
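To make the phoneme-to-word idea concrete, here is a minimal sketch of the standard CTC collapse rule (merge repeated labels, then drop blanks) followed by a dictionary lookup. The blank symbol, the phoneme labels, and the one-entry lexicon are all made up for illustration; this is not the decoder from the paper, which would also involve a language model.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol for this sketch

def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapse rule: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]

# Hypothetical static pronunciation dictionary: phoneme tuple -> word.
LEXICON = {("S", "IH", "NG", "K", "AH", "P", "IY"): "syncope"}

def phonemes_to_word(phonemes, lexicon=LEXICON):
    """Look up a phoneme sequence; unknown sequences map to '<unk>'."""
    return lexicon.get(tuple(phonemes), "<unk>")

# One label per audio frame, as a greedy (best-path) CTC output might look.
frames = ["_", "S", "S", "_", "IH", "NG", "NG", "_",
          "K", "AH", "AH", "P", "_", "IY", "IY"]

phonemes = ctc_collapse(frames)
word = phonemes_to_word(phonemes)
print(phonemes, "->", word)
```

A word missing from the lexicon falls through to `<unk>`, which is one way to see why out-of-vocabulary medical terms are painful for a dictionary-based system.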

CTC generally failed at the beginnings or ends of utterances, where there was less context available for the transcription engine. LAS, paradoxically, failed by replacing mundane, everyday words with similar-sounding medical words.

In general, the LAS model was more robust to noise and had a lower word error rate (20.1% for CTC, 18.3% for LAS), but both systems show promise for future use in medical transcription.
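For reference, error percentages like these are usually word error rates: the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the system output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and with toy sentences that are not from the paper:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Toy example: one substitution plus one insertion over four reference words.
wer = word_error_rate("patient presented with syncope",
                      "patient presented with sink up")
print(f"WER = {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% on very bad output.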

One important point about medical transcription is that the primary difficulty lies in transcribing difficult words, not difficult utterances. (Speech recognition is already very good at reconstructing sentences; out-of-vocabulary words are generally the greater challenge.)