As an initial experiment, I manually selected a rectangle containing the person in one view and filled it in with the corresponding pixels from the other view. The results are shown above. The window and the wall-ground borders do not line up perfectly, which suggests that the homography estimate needs further refinement. Another problem with this particular set of images is that it contains multiple "foreground" objects (the bike, the parking meter, the bike rack). Also, for some pixels in the manually selected rectangle, the corresponding pixels in the other view belong to the very foreground object (the person) I wanted to remove, so I need a way to detect and handle this case.
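The rectangle fill described above can be sketched roughly as follows. This is a minimal illustration and not the code used for the results: the function name `fill_rectangle`, the `(x0, y0, x1, y1)` rectangle convention, and nearest-neighbor sampling are all assumptions made for the example, and `H` is assumed to map pixel coordinates in the view being repaired to pixel coordinates in the other view.

```python
import numpy as np

def fill_rectangle(target, source, H, rect):
    """Fill a rectangle in `target` with pixels looked up in `source`.

    H    : 3x3 homography, assumed to map target (x, y, 1) to source coords.
    rect : (x0, y0, x1, y1), half-open pixel bounds in the target image.
    Returns the filled copy and a boolean mask (rect-shaped) marking which
    rectangle pixels had a corresponding pixel inside the source image.
    """
    x0, y0, x1, y1 = rect
    xs, ys = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
    xs_flat, ys_flat = xs.ravel(), ys.ravel()

    # Homogeneous coordinates of every pixel in the rectangle (3 x N).
    pts = np.stack([xs_flat, ys_flat, np.ones(xs_flat.size)])
    mapped = H @ pts

    # Dehomogenize, guarding against points mapped to infinity.
    w = mapped[2]
    finite = np.abs(w) > 1e-12
    safe_w = np.where(finite, w, 1.0)
    u = np.rint(mapped[0] / safe_w).astype(int)  # source column
    v = np.rint(mapped[1] / safe_w).astype(int)  # source row

    # Keep only correspondences that land inside the source image.
    h_src, w_src = source.shape[:2]
    inside = finite & (u >= 0) & (u < w_src) & (v >= 0) & (v < h_src)

    filled = target.copy()
    filled[ys_flat[inside], xs_flat[inside]] = source[v[inside], u[inside]]
    return filled, inside.reshape(ys.shape)
```

The returned mask only reports pixels that fall outside the other view. It does not detect the case where the corresponding pixel is itself covered by the foreground object, which still needs separate handling.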
Since I will likely need to incorporate one more view of the scene, I have been reading about the trifocal tensor in Hartley and Zisserman and in MaSKS. I have also been reading some material on how to further refine the estimated homography here.