I’ve been interested for several years in computer vision - that is, having a computer look at a scene and gain some comprehension of what it is seeing. In particular, I have wanted software that can examine a scene in motion and compute the 3D structure of the scene as it changes over time.

What cues to structure are available? A digital video stream usually contains at least a sequence of images (the frames of video) and two separate sequences of audio samples (for left and right speakers). When we observe our surroundings, we have a pair of sequences of images (left and right eyes) and stereo audio, and we combine cues from all of these sources with our memories of the appearance and sound of objects to produce a best guess at the structure of our environment.


What applications does this have? Consider digital video. The most widely used video compression standards, the MPEG family, get much of their compression from the assumption that video generally doesn’t change much from one frame to the next. However, a stronger assumption can be made most of the time: the objects depicted generally don’t change much over the duration of the video, or even across multiple videos in the case of, say, a TV series. (Anyway, if an object does change significantly, for the purposes of compression there is no harm in representing it as multiple distinct objects.) I would expect much better compression to result from this stronger assumption.
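To make the frame-to-frame assumption concrete, here is a minimal Python sketch that stores the first frame in full and, for each later frame, only the pixels that changed. It is a toy, not MPEG: real MPEG codecs use motion-compensated prediction and transform coding. The function names and the flat-list frame layout are assumptions made purely for illustration.

```python
def delta_encode(frames):
    """Return (key_frame, deltas): the first frame in full, plus for each
    later frame a list of (pixel_index, new_value) pairs that changed."""
    if not frames:
        return None, []
    key = list(frames[0])
    deltas = []
    prev = key
    for frame in frames[1:]:
        # Record only the pixels that differ from the previous frame.
        changes = [(i, v) for i, (p, v) in enumerate(zip(prev, frame)) if p != v]
        deltas.append(changes)
        prev = frame
    return key, deltas


def delta_decode(key, deltas):
    """Rebuild every frame from the key frame and the per-frame changes."""
    frames = [list(key)]
    for changes in deltas:
        frame = list(frames[-1])
        for i, v in changes:
            frame[i] = v
        frames.append(frame)
    return frames
```

When little changes between frames, the delta lists are short or empty, which is where the savings come from. An object-level representation would go further by describing a moving object once instead of re-sending its pixels each time it shifts.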

Now imagine that you have a tool that converts 2D video into an animated 3D scene description. Is there any benefit besides better compression? Yes, there are all sorts of interesting applications.

Given multiple video streams depicting the same scene with different errors, the streams may be combined and some of the errors removed, producing a higher-quality stream. Think of old movies, with multiple film reels containing the same movie, each reel degrading differently.
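One simple way to picture this is a per-pixel median across copies, assuming the copies have already been aligned pixel for pixel (that alignment is the hard part and is not shown). A scratch or dropout that appears in only one copy is outvoted by the others. The name `combine_copies` and the flat-list frame layout are hypothetical, chosen only for this sketch.

```python
from statistics import median


def combine_copies(copies):
    """Per-pixel median across several aligned copies of the same frame.

    Each copy is a flat list of pixel intensities. Damage that affects a
    pixel in fewer than half of the copies does not survive the median.
    """
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("all copies must have the same number of pixels")
    # zip(*copies) groups the values that each copy has for the same pixel.
    return [median(values) for values in zip(*copies)]
```

With three copies, any pixel that is damaged in only one of them is recovered. A median is only one possible choice here; a scheme that knew how each medium tends to degrade could weight the copies differently.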

Even with a single video source, errors in the stream will not map to any object in the stream, and therefore could be removed. Analysis of these errors could lead to a model of error for different media - film from different manufacturing eras, for instance - enabling both the automatic removal of such errors from future video streams and the automatic generation of such errors for that old-school feel.

One could easily replace or remove characters or scenery, or re-render the scene in a different style - transforming live action into the appearance of hand-drawn animation, for instance. The artistic possibilities are remarkable.

When true 3D displays enter the market, automatic conversion to 3D would make a lot of existing media content immediately available for the new hardware.

Computer vision research has usually focused on real-time processing for the purpose of controlling a robot or computer using video of its environment. What I have described above is different in that the software can take a fairly arbitrary amount of time and space for its processing, and that it has access to the entire time-sequence of frames from the beginning of its processing. I am also interested in real-time robotic control, however, due to my work with the Portland State Aerospace Society (PSAS), which basically builds flying robots. Rocket guidance using an on-board camera and computer vision algorithms has not previously been discussed by our group, to the best of my knowledge.

Related work

I have done a little research into this topic, and found some interesting papers and other resources.

There are a ton of links to computer vision software at Carnegie Mellon University’s Computer Vision Homepage.

Here’s A Survey of Spatio-Temporal Grouping Techniques from 2002.

Researchers at Caltech produced the work I stumbled across first. See for summaries. I was particularly interested in the “Dynamic Vision” category, and their paper:

S. Soatto, R. Frezza, and P. Perona. Motion estimation on the essential manifold. In “Computer Vision ECCV 94, Lecture Notes in Computer Science vol. 801”, Springer-Verlag, May 1994.

This paper is about computing the camera’s motion through an unchanging scene; given the position and orientation of the camera in each frame, algorithms to find the shapes of the objects become much easier, and apparently such algorithms are described in other papers.
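To see why known camera poses help, consider a flat 2D toy version: if two camera positions are known, along with the direction in which each camera sees the same feature, the feature’s location is the intersection of two rays. The sketch below is only an illustration of that idea, not the method from the paper. The function name and the tuple-based representation are assumptions.

```python
def triangulate_2d(c1, d1, c2, d2, eps=1e-12):
    """Intersect two 2D rays: c1 + t1*d1 and c2 + t2*d2.

    c1, c2 are camera positions (x, y); d1, d2 are the directions in which
    each camera sees the same feature. Returns the feature position (x, y).
    """
    # Solve t1*d1 - t2*d2 = c2 - c1 for t1 using Cramer's rule.
    bx, by = c2[0] - c1[0], c2[1] - c1[1]
    det = d2[0] * d1[1] - d1[0] * d2[1]
    if abs(det) < eps:
        raise ValueError("rays are parallel; the feature cannot be located")
    t1 = (d2[0] * by - bx * d2[1]) / det
    return (c1[0] + t1 * d1[0], c1[1] + t1 * d1[1])
```

In 3D with noisy measurements the rays rarely meet exactly, so a real system would look for the point closest to both rays, but the principle is the same: once the camera motion is known, structure follows from geometry.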

Interestingly, Kalman filters show up here. This matters to me because the PSAS has known for some time that a Kalman filter would be a good way of combining the sensor data we have from accelerometers, gyroscopes, and GPS into position and orientation data - which is to say, exactly the kind of output the above paper computes. This suggests that a fourth input to our rocket’s navigation system, from a camera mounted on-board the rocket, would be both sensible and a fascinating research project.
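For readers who have not met Kalman filters, here is a minimal one-dimensional sketch of the idea of blending sensors by how much each is trusted. It is not the PSAS design: a real navigation filter would track position, velocity, and orientation together, and would likely need a nonlinear variant. The class name, method names, and the noise variances passed in are all assumptions for illustration.

```python
class ScalarKalman:
    """A one-dimensional Kalman filter tracking a single quantity."""

    def __init__(self, x0, p0):
        self.x = x0  # current estimate of the state
        self.p = p0  # variance (uncertainty) of that estimate

    def predict(self, dx, q):
        """Advance the estimate by a dead-reckoned change dx (for example,
        integrated accelerometer data) with process noise variance q."""
        self.x += dx
        self.p += q  # uncertainty grows while coasting on prediction alone

    def update(self, z, r):
        """Blend in a direct measurement z (for example, a GPS fix) whose
        noise variance is r. Returns the Kalman gain that was used."""
        k = self.p / (self.p + r)   # gain: how much to trust the measurement
        self.x += k * (z - self.x)  # move the estimate toward the measurement
        self.p *= (1.0 - k)         # uncertainty shrinks after a measurement
        return k
```

In this framing, a camera-derived position estimate would be one more call to `update` with its own variance, which is why adding it as a fourth input fits so naturally.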

Computer vision in video compression is not new either: Adelson presented Layered Representations for Vision and Video at the IEEE Computer Society Workshop: Representation of Visual Scenes in 1995.

There’s been some research into Parallel Scalable Libraries and Algorithms for Computer Vision.