Sunday, February 28, 2010

Preliminary pedestrian removal results

Currently, this is my pipeline to remove pedestrians (a rough code sketch of steps 1 and 3 follows the list):
  1. Compute a homography between the two views (I1 and I2) using SIFT and RANSAC.
  2. Detect pedestrians in both views (with bounding boxes).
  3. Warp the pedestrian bounding boxes using the homography from step 1 and determine their overlap (if any).
  4. Use the method proposed by James Davis to obtain a dividing boundary in the overlap region.
  5. Replace pixels where a pedestrian is detected with pixels from the other (warped) view, using the boundary from step 4.
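Roughly, steps 1 and 3 look like this in OpenCV (just a sketch, not my actual code; the ratio-test and RANSAC thresholds are typical defaults, not tuned values):

```python
import cv2
import numpy as np

def estimate_homography(img1, img2):
    """Step 1: SIFT correspondences + RANSAC homography from img1 to img2."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Match descriptors and keep the unambiguous ones (Lowe's ratio test).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # RANSAC rejects matches inconsistent with a single homography.
    H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0)
    return H, mask

def warp_box(H, box):
    """Step 3: map a pedestrian bounding box (x0, y0, x1, y1) into the other view."""
    x0, y0, x1, y1 = box
    corners = np.float32([[x0, y0], [x1, y0], [x1, y1], [x0, y1]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    # Return the axis-aligned bounding box of the warped corners.
    return warped[:, 0].min(), warped[:, 1].min(), warped[:, 0].max(), warped[:, 1].max()
```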
Some results.

Thursday, February 25, 2010

How google does surface normals

Apparently, google does use 3D laser point clouds to estimate surface normals (or building facades, as they call them); see this quote from http://google-latlong.blogspot.com/2009/06/introducing-smart-navigation-in-street.html :

We have been able to accomplish this by making a compact representation of the building facade and road geometry for all the Street View panoramas using laser point clouds and differences between consecutive pictures.

I've checked the google api to see if there is some way to extract the surface normal at each pixel, but so far I have not been able to find that information.

Monday, February 22, 2010

Pedestrian detection

Due to the lack of progress with the proposed method, I decided to simplify and restrict the problem to removing pedestrians only. From a privacy standpoint, this would be a step beyond the face blurring that google already does. I have been testing pedestrian detection code by Bastian Leibe. Some results on google streetview data can be seen here. These results are inconsistent, but I think I can do better by tuning the detector parameters to google streetview data.

I also tried pedestrian detection software by Liming Wang, but this proved to be too slow for my purposes (took about an hour on one image).

Thursday, February 18, 2010

A relevant paper

Found a paper that may be useful: "Piecewise Planar City 3D Modeling from Street View Panoramic Sequences". The focus of this paper is 3D modeling, but they mention a multi-view stereo technique for dense depth estimation. This may make it possible to remove foreground objects based on histogram analysis of pixel depths (similar to the way foreground objects were removed in Dr. Zakhor's work, which was the inspiration for this project).
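Just to record the idea: given a dense per-pixel depth map, the histogram-based removal might look something like this hypothetical numpy sketch (the mode-picking heuristic, bin count, and margin are all made up):

```python
import numpy as np

def foreground_mask(depth, n_bins=64, margin_frac=0.25):
    """Hypothetical sketch: flag pixels whose depth is well in front of the
    dominant depth mode (assumed to be the building facade)."""
    valid = np.isfinite(depth)
    hist, edges = np.histogram(depth[valid], bins=n_bins)
    facade_bin = np.argmax(hist)  # strongest mode, assumed to be the facade
    facade_depth = 0.5 * (edges[facade_bin] + edges[facade_bin + 1])
    # Anything significantly closer than the facade is a foreground candidate.
    return valid & (depth < (1.0 - margin_frac) * facade_depth)
```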

Wednesday, February 10, 2010

Hole filling (and reading)


As an initial experiment, I manually selected a rectangle (containing the person in one view) and filled it in with the corresponding pixels from the other view. The results are shown above. The window and wall-ground borders do not line up perfectly, so it seems that further refinement of the homography estimate is needed. Another problem in this particular set of images is that there are multiple "foreground" objects (the bike, the parking meter, the bike rack). Also, for some of the pixels in the manually selected rectangle, the corresponding pixels in the other view contained the foreground object (the person) I wanted to remove, so I need a way to detect and handle this case.
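For concreteness, the fill-in step amounts to something like the following OpenCV sketch (the filenames, rectangle coordinates, and saved homography are all placeholders):

```python
import cv2
import numpy as np

# Hypothetical filenames: view1 contains the person, view2 is the other view.
img1 = cv2.imread('view1.png')
img2 = cv2.imread('view2.png')
H = np.load('H.npy')  # 3x3 homography mapping view2 into view1 (SIFT + RANSAC)

h, w = img1.shape[:2]
warped = cv2.warpPerspective(img2, H, (w, h))

# Manually selected rectangle around the person (hand-picked coordinates).
x0, y0, x1, y1 = 100, 50, 220, 300
result = img1.copy()
result[y0:y1, x0:x1] = warped[y0:y1, x0:x1]
cv2.imwrite('filled.png', result)
```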

Since it is likely I will need to incorporate one more view of the scene, I have been reading about the trifocal tensor in Hartley and Zisserman and in MaSKS. I've also been reading some material on how to further refine the estimated homography here.

Monday, February 1, 2010

A new direction

Up to this point, I had been trying to make the code from the proposed method ("What went where?", referred to as WWW from now on) work with the google streetview images. After some discussion with Dr. Belongie, it is apparent that I need to try a slightly different approach. The main reason is that SIFT seems unable to find good matches on the foreground object, which makes it impossible for the WWW code to detect multiple motion layers. The correspondences that have been detected (see previous post) support only one motion layer, which corresponds to the motion of the streetview car.

The new approach I will try consists of:
  1. Compute correspondences using SIFT.
  2. Compute a homography using RANSAC.
  3. Detect pixels which do not agree with the homography.
  4. Apply graph cuts to obtain piecewise contiguous and smooth regions (a sketch of this step follows the list).
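For step 4, a binary graph cut with a Potts smoothness term is one standard formulation. Here is a sketch using the PyMaxflow package (which is not necessarily what I will end up using):

```python
import maxflow

def smooth_regions(cost_bg, cost_fg, smoothness=1.0):
    """Binary graph cut over the pixel grid: per-pixel data costs plus a
    Potts term on 4-connected neighbors, yielding piecewise contiguous,
    smooth foreground/background regions."""
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(cost_bg.shape)
    g.add_grid_edges(nodes, smoothness)         # neighbors prefer the same label
    g.add_grid_tedges(nodes, cost_fg, cost_bg)  # per-pixel label costs
    g.maxflow()
    return g.get_grid_segments(nodes)           # boolean label per pixel
```

The cost arrays would come from step 3, e.g. a low foreground cost wherever the warped and reference images disagree.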
For the image sequence I have been working on, here are the inliers used to compute the homography, and the warped images after applying it:



For step 3 above (detecting pixels that don't agree with the homography), my first guess was to simply compute the difference between the reference image and the warped image, but it appears the computed homography is not very precise, which results in a lot of noise:
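Concretely, the naive version of step 3 is just the following (a sketch; the threshold is a guess):

```python
import cv2

def homography_outliers(reference, warped, thresh=30):
    """Step 3, naive version: pixels where the warped view disagrees with the
    reference. With an imprecise homography this flags many spurious pixels,
    especially along strong edges."""
    diff = cv2.absdiff(reference, warped)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    return gray > thresh  # boolean outlier mask
```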
As in the WWW paper, I tried a second round of RANSAC with a tighter threshold. The inliers and difference (between reference image and warped image) are shown below.
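The second round simply re-fits the homography on the first round's inliers with a smaller reprojection threshold, roughly as follows (the threshold values are illustrative, not the ones from the WWW paper):

```python
import cv2

def two_round_homography(pts1, pts2):
    """Estimate a homography, then re-fit on the inliers with a tighter
    RANSAC threshold, as in the WWW paper."""
    H1, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0)
    inl1 = pts1[mask.ravel() == 1]
    inl2 = pts2[mask.ravel() == 1]
    H2, _ = cv2.findHomography(inl1, inl2, cv2.RANSAC, 1.0)
    return H2
```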