237 — CubeNet: Equivariance to 3D Rotation and Translation
Worrall & Brostow (1804.04458)
Read on 14 April 2018

Most neural networks that process spatial information (such as images or 3D scenes) struggle with 3D rotations of an object: a network shown an object from vantage point A is unlikely to recognize that same object from vantage point B, and even if it recognizes it from both A and B, it is just as unlikely to recognize it from point C.
This shortcoming is most often mitigated by training the network on labelled ("truth") data from an enormous number of vantage points or 3D rotations. This is computationally and logistically expensive, and isn't a sustainable solution.
CubeNet begins to solve this problem by representing an object in 3D voxel space identically no matter its right-angle 3D rotation ('pose'). This means a CubeNet's output is fundamentally invariant to which face of the cube the input is viewed from. (An object can be rotated by any $(\theta_x, \theta_y, \theta_z)$ that are multiples of 90° and still be represented identically by the last layer of the network, with additional information supplying the orientation of the object.)
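To make the idea concrete, here is a minimal NumPy sketch of the invariance property, not the paper's method: it enumerates the 24 right-angle rotations of a voxel grid and averages a fixed template response over them, which gives a descriptor that is identical for every cube pose of the input. The function and variable names (`cube_rotations`, `pose_invariant_descriptor`, `template`) are my own illustration, and the brute-force pooling here is a toy stand-in for CubeNet's learned group convolutions.

```python
import numpy as np

def cube_rotations(v):
    """Yield the 24 right-angle rotations of a cubic voxel grid `v`.

    Each of the cube's 6 faces is first brought 'up' (identity, three
    quarter-turns about y, and a +/- quarter-turn about x), then the grid
    is spun 4 times about z: 6 * 4 = 24 distinct poses.
    """
    faces = [
        v,
        np.rot90(v, 1, axes=(0, 2)),   # 90 deg about y
        np.rot90(v, 2, axes=(0, 2)),   # 180 deg about y
        np.rot90(v, 3, axes=(0, 2)),   # 270 deg about y
        np.rot90(v, 1, axes=(1, 2)),   # 90 deg about x
        np.rot90(v, 3, axes=(1, 2)),   # 270 deg about x
    ]
    for f in faces:
        for k in range(4):
            yield np.rot90(f, k, axes=(0, 1))  # spin about z

def pose_invariant_descriptor(v, template):
    """Correlate a fixed `template` (a stand-in for a learned filter) with
    all 24 poses of `v`, then average. The 24 responses permute when `v`
    is rotated (equivariance); their mean does not change (invariance)."""
    responses = [float(np.sum(r * template)) for r in cube_rotations(v)]
    return np.mean(responses)

# Quick check: the descriptor is unchanged by a right-angle rotation of the input.
rng = np.random.default_rng(0)
vox = rng.random((8, 8, 8))
tmpl = rng.random((8, 8, 8))
assert np.isclose(pose_invariant_descriptor(vox, tmpl),
                  pose_invariant_descriptor(np.rot90(vox, 1, axes=(0, 2)), tmpl))
```

As I understand the paper, CubeNet achieves the same end with group convolutions inside the network rather than brute-force pooling at the input: intermediate features stay equivariant (they permute as the input rotates), and only the final layer is pooled down to an invariant representation, which is how the orientation information survives until the end.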
On the ModelNet10 dataset, this CNN achieved state-of-the-art performance at classifying 3D CAD models. And slightly more interestingly (at least to me), it was comparable to other segmentation architectures on the ISBI 2012 connectome segmentation challenge.