221 — Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings
Read on 29 March 2018

It would be amazing to be able to design 3D spaces using only your voice: Imagine creating a virtual reality environment simply by describing the space you wanted to stand in.
Text2Shape gets us closer to this world by providing a GAN architecture that can convert between freeform natural-language descriptions of objects and 3D representations of those objects, either in surface-mesh form or as solid voxelizations.
It works by learning a joint embedding between a text description and the shape that the text describes (with data taken from ShapeNet). The shapes are converted to voxelizations (dense volumes), and five humans are asked to describe each object.
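To make the idea concrete, here is a minimal PyTorch sketch of what such a joint embedding might look like: a text encoder and a voxel encoder projecting into the same normalized space, trained so that matching description-shape pairs end up close together. The module sizes, names, and the symmetric cross-entropy loss are my own illustrative choices, not the paper's exact architecture or objective.

```python
# Minimal sketch of a joint text-shape embedding (illustrative, not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Encodes a tokenized description into a fixed-size, L2-normalized embedding."""
    def __init__(self, vocab_size=4000, embed_dim=128, hidden_dim=256, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, tokens):                  # tokens: (batch, seq_len) int64
        _, h = self.rnn(self.embed(tokens))     # h: (1, batch, hidden_dim)
        return F.normalize(self.proj(h[-1]), dim=-1)

class ShapeEncoder(nn.Module):
    """Encodes a voxel grid (here 32^3 with 4 channels: occupancy + RGB) into the same space."""
    def __init__(self, in_channels=4, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),           # 16 -> 8
            nn.Conv3d(64, 128, 4, stride=2, padding=1), nn.ReLU(),          # 8 -> 4
        )
        self.proj = nn.Linear(128 * 4 * 4 * 4, out_dim)

    def forward(self, voxels):                  # voxels: (batch, 4, 32, 32, 32)
        feats = self.conv(voxels).flatten(1)
        return F.normalize(self.proj(feats), dim=-1)

def matching_loss(text_emb, shape_emb, temperature=0.1):
    """Symmetric cross-entropy over cosine similarities: the i-th description should be
    closest to the i-th shape and vice versa (a stand-in for the paper's association losses)."""
    logits = text_emb @ shape_emb.t() / temperature
    targets = torch.arange(len(text_emb), device=text_emb.device)
    return F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
```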
These representations are embedded in a shared space in which associations between objects can be learned: Unlike prior methods, which required the training set to be manually segmented into groups, this technique learns the groupings of shapes automatically.
A Conditional Wasserstein GAN (CWGAN), conditioned on these embeddings, generates new shapes that, though still rough, approximate the ground truth much more closely than those produced by other approaches.
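Roughly, the generator maps a noise vector plus a text embedding to a voxel grid, and a critic scores shapes (together with the conditioning embedding) under a Wasserstein objective. The sketch below assumes a WGAN-GP-style gradient penalty and a hypothetical `critic(voxels, text_emb)` signature; the paper's exact conditioning scheme and critic design differ.

```python
# Rough sketch of conditional voxel generation with a Wasserstein critic (illustrative only).
import torch
import torch.nn as nn

class VoxelGenerator(nn.Module):
    """Maps (noise, text embedding) to a 32^3 voxel grid with 4 channels."""
    def __init__(self, noise_dim=64, text_dim=128):
        super().__init__()
        self.fc = nn.Linear(noise_dim + text_dim, 128 * 4 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose3d(32, 4, 4, stride=2, padding=1), nn.Sigmoid(), # 16 -> 32
        )

    def forward(self, noise, text_emb):
        x = self.fc(torch.cat([noise, text_emb], dim=-1))
        return self.deconv(x.view(-1, 128, 4, 4, 4))

def critic_loss(critic, real_voxels, fake_voxels, text_emb, gp_weight=10.0):
    """Wasserstein critic loss with gradient penalty; the critic also sees the text
    embedding so it can judge whether a shape matches its description."""
    real_score = critic(real_voxels, text_emb).mean()
    fake_score = critic(fake_voxels, text_emb).mean()
    # Gradient penalty on random interpolates between real and fake samples.
    alpha = torch.rand(real_voxels.size(0), 1, 1, 1, 1, device=real_voxels.device)
    interp = (alpha * real_voxels + (1 - alpha) * fake_voxels).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp, text_emb).sum(), interp, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return fake_score - real_score + gp_weight * penalty
```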
I highly recommend the video available at the project website, which shows a very cool application that the embedding provides for free: The ability to add and subtract features from a piece of furniture. (Imagine “This exact table, but with glass on the top instead of wood”.)