Paper Title


Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition under Occlusion

Paper Authors

Adam Kortylewski, Qing Liu, Angtian Wang, Yihong Sun, Alan Yuille

Paper Abstract


Computer vision systems in real-world applications need to be robust to partial occlusion while also being explainable. In this work, we show that black-box deep convolutional neural networks (DCNNs) have only limited robustness to partial occlusion. We overcome these limitations by unifying DCNNs with part-based models into Compositional Convolutional Neural Networks (CompositionalNets) - an interpretable deep architecture with innate robustness to partial occlusion. Specifically, we propose to replace the fully connected classification head of DCNNs with a differentiable compositional model that can be trained end-to-end. The structure of the compositional model enables CompositionalNets to decompose images into objects and context, as well as to further decompose object representations in terms of individual parts and the objects' pose. The generative nature of our compositional model enables it to localize occluders and to recognize objects based on their non-occluded parts. We conduct extensive experiments on image classification and object detection, using images of artificially occluded objects from the PASCAL3D+ and ImageNet datasets as well as real images of partially occluded vehicles from the MS-COCO dataset. Our experiments show that CompositionalNets built from several popular DCNN backbones (VGG-16, ResNet50, ResNeXt) improve by a large margin over their non-compositional counterparts at classifying and detecting partially occluded objects. Furthermore, they can localize occluders accurately despite being trained with class-level supervision only. Finally, we demonstrate that CompositionalNets provide human-interpretable predictions, as their individual components can be understood as detecting parts and estimating an object's viewpoint.
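The core mechanism the abstract describes is a generative compositional head that lets an occluder model "explain away" poor part matches, so classification relies on the non-occluded parts and the winning model at each position yields an occlusion map. The toy NumPy sketch below illustrates only that robust per-position max; it is not the paper's actual likelihood model, and all shapes, the random templates, and the constant occluder score are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a small backbone feature map and per-class part templates.
H, W, D = 4, 4, 8            # spatial grid and feature dimension
num_classes, num_parts = 3, 5

# Unit-normalized features and part templates (stand-ins for learned models).
features = rng.normal(size=(H, W, D))
features /= np.linalg.norm(features, axis=-1, keepdims=True)
templates = rng.normal(size=(num_classes, num_parts, D))
templates /= np.linalg.norm(templates, axis=-1, keepdims=True)

OCCLUDER_SCORE = 0.3         # assumed constant score of the occluder model

def compositional_score(features, class_templates, occ=OCCLUDER_SCORE):
    """Score one class: at each position take the best-matching part,
    then let a constant occluder model explain away poor matches."""
    # Cosine similarity of each position to each part template.
    sims = np.einsum('hwd,pd->hwp', features, class_templates)
    part_score = sims.max(axis=-1)                 # best part per position
    occ_plane = np.full(part_score.shape, occ)
    obj_vs_occ = np.stack([part_score, occ_plane], axis=-1)
    per_pos = obj_vs_occ.max(axis=-1)              # robust object/occluder max
    occ_map = obj_vs_occ.argmax(axis=-1) == 1      # True where occluder wins
    return per_pos.sum(), occ_map

scores = []
for c in range(num_classes):
    s, occ_map = compositional_score(features, templates[c])
    scores.append(s)
pred = int(np.argmax(scores))                      # predicted class index
```

Because occluded positions contribute only the constant occluder score rather than an arbitrarily bad match, the class score degrades gracefully under occlusion, which is the intuition behind the robustness the abstract claims.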
