Visual data provides rich information about what is in the world, provided that it can be interpreted. We do this with apparent ease: as we go about our daily lives, our own vision systems tell us what is out there, where it is, and what it is like. It is fascinating to ask whether this can be done by computational means. In other words, can we make computers see? Pragmatically, having computers do so opens up a wide range of important applications. Small steps towards this goal have already provided many useful technologies, and continued progress will further benefit many domains, such as: 1) mining and searching images and video for content; 2) robots operating in natural human environments; 3) enhancing security through anonymous surveillance; 4) integrating images and video with graphics to educate and entertain us; and 5) automated interpretation of scientific images in high-throughput experiments, or of satellite environmental scans, where having humans look at every image is impractical.
To make sense of the world from visual data we need to connect it to relevant meaning systems. The IVILAB takes a direct approach by working both on the representations and on inferring them from data. In particular, we work with stochastic geometric and semantic models, learning their parameters from visual data. Making models paramount leads to challenging inference, which we implement using Markov chain Monte Carlo (sampling) methods. We have applied this approach to learning models of object structure, understanding indoor scenes, fitting biological structures (plants, filamentous fungi, neurons grown in vitro) to images, tracking pollen tubes grown in vitro, and 3D tracking of humans in video. Collaborators in these endeavors include Alon Efrat (Computer Science), Barry Pryor (Plant Sciences), Linda Restifo (Neurobiology), and Ravi Palanivelu (Plant Sciences).
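As a small concrete illustration, the core of a Markov chain Monte Carlo fit can be sketched as a basic Metropolis sampler. Everything here (the one-dimensional parameter, the Gaussian toy model, the data, and the step size) is an invented stand-in for the lab's far richer scene models:

```python
import math
import random

def log_posterior(theta, data):
    """Toy model: standard-normal prior on theta, unit-variance
    Gaussian likelihood for each observation."""
    log_prior = -0.5 * theta ** 2
    log_lik = sum(-0.5 * (x - theta) ** 2 for x in data)
    return log_prior + log_lik

def metropolis(data, n_iter=5000, step=0.5, seed=0):
    """Propose a Gaussian perturbation of the current state and accept
    it with probability min(1, posterior ratio)."""
    rng = random.Random(seed)
    theta, samples = 0.0, []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_posterior(proposal, data) - log_posterior(theta, data)
        if math.log(rng.random()) < log_ratio:
            theta = proposal
        samples.append(theta)
    return samples

data = [1.8, 2.1, 2.4, 1.9, 2.2]   # hypothetical measurements
samples = metropolis(data)
posterior_mean = sum(samples[1000:]) / len(samples[1000:])  # discard burn-in
```

The same accept/reject skeleton carries over to high-dimensional scene models; only the state representation, proposal moves, and posterior change.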
Research Project (1) A Bayesian approach to understanding indoor scenes.
An indoor scene has familiar components (e.g., tables, chairs), which are themselves made of semantically and geometrically salient parts (e.g., legs, tops). Hence we can describe such scenes as a top-down generative statistical process that draws samples from possible furniture objects and places them in the scene in likely configurations. Given an image of an indoor scene, our goal is to reverse this forward process to infer what is in the room, using the visual data as evidence that confirms or refutes hypotheses. Because we are model-driven, hypotheses are always plausible, and the representation is explicit and useful. Our approach provides state-of-the-art accuracy in indoor scene understanding.
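In miniature, Bayesian scoring of competing scene hypotheses looks as follows. The feature (a noisy "visible leg count" from a detector), the two hypotheses, the prior, and the noise model are all invented for illustration, not the lab's actual model:

```python
import math

# Two competing hypotheses about a piece of furniture, each predicting
# a value for one observed image feature (number of visible legs).
hypotheses = {"four_legged_table": 4, "pedestal_table": 1}
prior = {"four_legged_table": 0.6, "pedestal_table": 0.4}

def likelihood(observed, predicted, noise=0.5):
    """Gaussian noise model relating the detector output to the
    hypothesis's prediction."""
    return math.exp(-0.5 * ((observed - predicted) / noise) ** 2)

observed_legs = 3.7  # noisy detector output from the image

# Bayes' rule: posterior ∝ prior × likelihood, then normalize.
unnormalized = {h: prior[h] * likelihood(observed_legs, n)
                for h, n in hypotheses.items()}
z = sum(unnormalized.values())
posterior = {h: p / z for h, p in unnormalized.items()}
```

Here the evidence strongly favors the four-legged hypothesis; in the full system the "features" are image pixels and the hypotheses are complete 3D room configurations, but the confirm-or-refute logic is the same.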
Research Project (2) Learning models of object structure.
Consider half a dozen four-legged tables of different sizes and shapes. The key element they have in common is their topology (four symmetric, tall, skinny blocks supporting a top). It is relatively easy to build a geometric model for such an object by hand, but suppose we wanted to build a system that can learn this abstraction from six images, one of each table. The task is complicated by the fact that each image is taken from a different camera angle. In this work we have developed an approach to learning the topology, based on the assumptions that the objects are made of blocks and that the only thing the examples have in common is their topology. We simultaneously fit the camera parameters and the objects in each image, while searching for the best common topology.
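A toy version of the simultaneous fit can be sketched with a single shared shape parameter (a leg-to-width ratio standing in for the full block topology) and an unknown per-image camera scale; all numbers and the alternating least-squares scheme are illustrative assumptions, not the published method:

```python
# Each image i observes the table's projected leg length and top width
# under an unknown camera scale s_i. The leg-to-width ratio r is the
# shared, viewpoint-invariant structure common to all examples.
observations = [
    (30.0, 40.0),   # image 1: (projected leg length, projected top width)
    (15.2, 20.1),   # image 2: smaller camera scale, same underlying shape
    (45.3, 59.8),   # image 3
]

def fit_shared_shape(obs, n_iter=50):
    """Alternating least squares under leg_i ~ s_i * r, width_i ~ s_i:
    fix r and solve for each image's scale s_i, then fix the scales and
    solve for the shared ratio r."""
    r = 1.0
    for _ in range(n_iter):
        scales = [(leg * r + width) / (r * r + 1.0) for leg, width in obs]
        r = (sum(s * leg for s, (leg, _) in zip(scales, obs))
             / sum(s * s for s in scales))
    return r, scales

shared_ratio, scales = fit_shared_shape(observations)
```

The per-image scales absorb the viewpoint differences, and the fit recovers a ratio near 0.75 that all three observations share; in the full problem the shared structure is a discrete topology searched over rather than a single continuous parameter.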
Research Project (3) 3D tracking of humans in video.
In a video of a human walking away from the camera, the image of the human gets smaller due to perspective, yet we know the person is not actually shrinking. Hence we can use this to infer the camera. Knowing the camera and working in 3D means that prior knowledge about people's sizes and walking speeds can be used to advantage. The system is also not surprised when people occlude each other. Further, as we track people walking about, we can learn about the geometry of the scene as they disappear briefly behind objects (cars, statues), disappear for prolonged periods (doorways), and follow paths consistent with the vertical structure of the scene. These are some of the many reasons why we consider tracking to be yet another 3D model extraction problem, with visual data providing evidence about what is going on in both time and space.
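Under a simple pinhole camera model, the shrinking-person reasoning can be sketched as follows; the focal length, height prior, and measurements are hypothetical values chosen for illustration:

```python
# Pinhole projection: a person of true height H at depth Z projects to
# an image height of h = f * H / Z pixels.
f = 800.0        # focal length in pixels (assumed known here)
H_prior = 1.7    # prior mean adult height in meters

def depth_from_image_height(h_pixels):
    """Invert the projection using the height prior: Z = f * H / h."""
    return f * H_prior / h_pixels

# As the person walks away, the measured image height shrinks and the
# inferred depth grows, giving a 3D trajectory from 2D measurements.
image_heights = [200.0, 160.0, 133.3, 114.3]   # pixels, over time
depths = [depth_from_image_height(h) for h in image_heights]
```

Conversely, when the depth of a walking person can be pinned down (e.g., from a known ground plane), the same relation constrains the unknown camera parameters, which is how the shrinking image helps infer the camera.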