295 — Why do deep convolutional networks generalize so poorly to small image transformations?
Read on 11 June 2018
Despite their incredible power to perform seemingly impossible feats of image recognition and image processing, large-scale deep convolutional networks, the fundamental component of state-of-the-art computer vision systems, fail miserably when their inputs are shifted only very slightly.
Why?
Azulay and Weiss show that CNNs are extremely susceptible to simple image manipulations: take an image, set it in a larger “frame” of inpainted pixels, and try to classify it (a very common CV task). Now shift the image very slightly within its frame, rotate it a few degrees, or perturb it in some other minor way, and the net falls to pieces. Not only is the classification result brittle under these modifications; its quality actually decreases with the depth of the net (completely counter to our intuitions).
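A minimal sketch of that kind of probe, assuming a PyTorch image classifier and a preprocessed input tensor of shape `[1, 3, H, W]`; the padding size, zero-filled frame, and `model` are my placeholders, not the authors' exact protocol (which inpaints the frame):

```python
import torch
import torch.nn.functional as F

def embed_in_frame(img, pad=32, fill=0.0):
    # Place the image inside a larger frame of constant-value pixels
    # (a simplification of the paper's inpainted frame).
    return F.pad(img, (pad, pad, pad, pad), value=fill)

def shift(img, dx):
    # Translate horizontally by dx pixels; roll wraps the border around,
    # which is harmless for small shifts into a zero-padded frame.
    return torch.roll(img, shifts=dx, dims=-1)

@torch.no_grad()
def shift_sensitivity(model, img, max_shift=3):
    # Nudge the framed image one pixel at a time and report how the
    # top-1 prediction and its confidence change.
    model.eval()
    framed = embed_in_frame(img)
    for dx in range(max_shift + 1):
        probs = torch.softmax(model(shift(framed, dx)), dim=-1)
        conf, label = probs.max(dim=-1)
        print(f"shift={dx}px  label={label.item()}  p={conf.item():.3f}")
```

Running this against a pretrained classifier is enough to see the effect the paper describes: a one- or two-pixel shift can flip the predicted label or collapse its confidence.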
The researchers attribute this sensitivity to biases in the training datasets: conventional training images hold a single ‘target’ in frame, and that target is generally centered. The bias is easy to describe but not so easy to correct: photographers tend to center their subjects and keep them ‘upright’, with lines and faces aligned to a cardinal axis. It is possible to counteract this during training, but the training set must grow exponentially to cover the added transformations, making this a far more difficult task than simply adding new photos or perturbations to the training set.
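For concreteness, the augmentation itself is mechanically trivial; a hedged sketch using torchvision's standard transforms (my illustration, not the authors' training setup):

```python
from torchvision import transforms

# Small random shifts and rotations applied on the fly. The paper's point
# is that covering a whole group of transformations this way demands far
# more data, not that writing the augmentation is hard.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=5, translate=(0.02, 0.02)),
    transforms.ToTensor(),
])
```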