Presented at TTI/Vanguard's Networks, Sensors, & Mobility, May 3–4, 2016, San Francisco, CA. Alex Kendall, Department of Engineering, University of Cambridge.
We can now teach machines to recognize objects. However, to teach a machine to “see,” we need to understand geometry as well as semantics. Given an image of a road scene, for example, an autonomous vehicle needs to determine where it is, what is around it, and what is going to happen next. This requires not only object recognition but also depth, motion, and spatial perception, as well as instance-level identification. A single deep learning architecture can perform all of these tasks at once, even when given only a single monocular input image. Surprisingly, learning these different tasks jointly results in superior performance: explicitly supervising more information about the scene leads the network to uncover a better internal representation. This method outperforms other approaches on a number of benchmark datasets, such as SUN RGB-D indoor scene understanding and CityScapes road scene understanding. Beyond cars, potential applications include factory robotics and systems to help the blind.
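The joint-learning idea in the abstract can be illustrated as one objective that sums per-task losses computed from a single shared representation. What follows is a minimal sketch in plain Python, not the architecture presented in the talk: the names `shared_features` and `joint_loss`, the toy linear layers, the choice of two tasks (semantic class and depth), and the equal task weights are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, features):
    """Apply one weight row per output to the feature vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def shared_features(pixel, encoder_weights):
    """Hypothetical shared encoder: one linear layer followed by ReLU."""
    return [max(0.0, v) for v in linear(encoder_weights, pixel)]

def joint_loss(pixel, label, depth, params, task_weights=(1.0, 1.0)):
    """Weighted sum of a classification loss and a depth loss for one pixel.

    Both heads read the same features, so in a trained network the
    gradients from either task would update the shared encoder
    (gradient computation is omitted in this sketch).
    """
    feats = shared_features(pixel, params["encoder"])
    probs = softmax(linear(params["semantic_head"], feats))
    semantic_loss = -math.log(probs[label])            # cross-entropy
    predicted_depth = linear(params["depth_head"], feats)[0]
    depth_loss = abs(predicted_depth - depth)          # L1 regression
    w_sem, w_depth = task_weights                      # assumed equal weights
    total = w_sem * semantic_loss + w_depth * depth_loss
    return total, semantic_loss, depth_loss

# Toy parameters (illustrative only): 3 input channels, 2 shared features,
# 2 semantic classes, 1 depth output.
params = {
    "encoder": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "semantic_head": [[1.0, 0.0], [0.0, 1.0]],
    "depth_head": [[2.0, 2.0]],
}
total, sem, dep = joint_loss([0.5, 0.5, 0.1], label=0, depth=2.0, params=params)
```

Further tasks mentioned in the abstract, such as instance-level identification, would enter the same way: one more head on the shared features and one more term in the sum. How the tasks should be weighted against each other is a design choice this sketch leaves open.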