43 — Adversarial Generation of Natural Language

Rajeswar et al. (1705.10929v1)

Read on 03 October 2017
#language  #grammar  #cfg  #gan  #neural-network 

This paper reminds me of a team at HopHacks a few years ago: they trained a simple neural net on the text of a Star Wars movie script, and then tried to generate new scenes. The model accurately reflected the structure of the script: it contained all-caps “names” and mixed-case stage directions, and it even appeared to capture formatting well (text was centered, or left/right aligned). But the text itself was total noise, and didn’t even follow English-like patterns: there were words full of consonants, for example, or words with triple letters in them.

Generating language is hard.

This paper by Rajeswar et al., from May of this year, attempts to use GANs to generate natural language. This is a notoriously difficult problem because of the complexity of language and the nuance introduced by grammar. Even image GANs are easier, because the human visual system is so adept at forming scenes out of noisy signals; a jumbled sentence like “The dog past the tree walks the boy”, on the other hand, is quite obviously broken.
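
The core trick, as I understand it, is to sidestep the fact that sampling discrete tokens is non-differentiable: the generator hands the discriminator a softmax distribution over the vocabulary at every step, so gradients can flow straight through. Here’s a minimal PyTorch sketch of that idea; the architecture and dimensions are my own toy choices, not the authors’.

```python
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, HIDDEN, Z_DIM = 1000, 11, 128, 64  # toy sizes, not the paper's

class Generator(nn.Module):
    """Maps noise to a sequence of per-step distributions over the vocabulary."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(Z_DIM, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, z):                            # z: (batch, SEQ_LEN, Z_DIM)
        h, _ = self.rnn(z)
        return torch.softmax(self.out(h), dim=-1)    # soft tokens, differentiable

class Discriminator(nn.Module):
    """Scores a sequence of (soft or one-hot) token distributions."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(VOCAB, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, 1)

    def forward(self, x):                            # x: (batch, SEQ_LEN, VOCAB)
        h, _ = self.rnn(x)
        return self.out(h[:, -1])                    # one score per sequence

G, D = Generator(), Discriminator()
z = torch.randn(32, SEQ_LEN, Z_DIM)
fake = G(z)                 # a batch of "sentences" as per-step distributions
D(fake).mean().backward()   # gradients reach G through the softmax, no sampling
```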

To make the problem tractable, the authors turned to more constrained corpora: sentences drawn from a simple context-free grammar (CFG) or a probabilistic CFG (PCFG), Chinese poetry, and English phrases from CMU-SE. The BLEU measure, an evaluation usually used for machine-translation fluency, was used to score the ‘human-ness’ of a generated sentence.
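
As a toy illustration of that evaluation setup, the sketch below samples reference sentences from a small hand-written PCFG (my own invention, much simpler than the paper’s grammars) and BLEU-scores a candidate against them with NLTK; I haven’t checked the paper’s exact BLEU configuration.

```python
import random
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# A tiny hand-written PCFG: nonterminal -> list of (expansion, probability).
PCFG = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["the", "N"], 1.0)],
    "VP": [(["V", "NP"], 0.6), (["V"], 0.4)],
    "N":  [(["dog"], 0.5), (["boy"], 0.3), (["tree"], 0.2)],
    "V":  [(["walks"], 0.7), (["sees"], 0.3)],
}

def sample(symbol="S"):
    """Expand a symbol by sampling rules until only terminals remain."""
    if symbol not in PCFG:                  # terminal word
        return [symbol]
    rules, weights = zip(*PCFG[symbol])
    expansion = random.choices(rules, weights=weights)[0]
    return [tok for part in expansion for tok in sample(part)]

references = [sample() for _ in range(100)]       # stand-in "real" corpus
candidate = "the dog walks the boy".split()       # a generated sentence to score
smooth = SmoothingFunction().method1              # short sentences need smoothing
print(sentence_bleu(references, candidate, smoothing_function=smooth))
```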

The paper found that CFG-trained Wasserstein GANs generalize well to longer sentences (n=11), while even conventional GANs generated feasible 5-word sentences. The discussion mentions the models’ shortcomings, but by far my favorite part was trying to dissect what about a sentence fooled the network into thinking it was reasonable.
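
For reference, the Wasserstein variant replaces the usual GAN loss with a critic that estimates an earth-mover distance; I believe the paper uses the gradient-penalty formulation from Gulrajani et al., sketched below (the function name and default weight are mine).

```python
import torch

def wgan_gp_critic_loss(D, real, fake, gp_weight=10.0):
    """Wasserstein critic loss with gradient penalty.

    `fake` should be detached from the generator when updating the critic.
    """
    # The critic tries to score real batches high and fake batches low.
    w_loss = D(fake).mean() - D(real).mean()

    # Penalize the critic's gradient norm (toward 1) on random interpolates
    # between real and fake samples, enforcing an approximate Lipschitz bound.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    return w_loss + gp_weight * gp
```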

Here are some example sentences:

how there , is the restaurant popular this cheese ?

what ’s in the friday food ? ?

extreme crap-not working and eeeeeew a horrible poor imposing se400

great britani ! I lovethis.

And these:

how many snatched crew you have ?

i think i deeply take your passenger

And these, from a character-level model trained on the 1-billion-word dataset, which read like a passage from “Jabberwocky”:

To holl is now my Hubby ,
The gry timers was faller
After they work is jith a
But in a linter a revent
