This is paper #2 in my 2020 paper reading goal.

Title:
Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations [arxiv] - by Vincent Sitzmann, Michael Zollhöfer and Gordon Wetzstein.

Why this is interesting:
It provides a new approach to "novel view synthesis": being able to create an image of how an object/scene would look from a new (previously unseen) vantage point.

Some highlights:

  • It exploits the underlying 3D geometry.
  • It could potentially scale to arbitrary resolutions, since it maps 3D space as a continuous function rather than discretizing it into grids. This makes it likely to be closer to a "true" representation of the object.

How it works:

There are different parts to this problem:

  1. Being able to create a scene representation that maps an (x,y,z) coordinate to a vector of the properties of that point (sketched below).
  2. Given a new camera vantage point, being able to map a pixel coordinate to a color.
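
Here is a minimal sketch of part 1 in PyTorch (not the authors' code; the layer sizes and feature dimension are my assumptions): a plain MLP that maps any continuous (x,y,z) coordinate to a feature vector describing that point.

```python
import torch
import torch.nn as nn

class SceneRepresentation(nn.Module):
    """Maps a 3D world coordinate to a feature vector of scene properties."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, xyz):
        # xyz: (num_points, 3) -> (num_points, feature_dim)
        return self.mlp(xyz)
```

Because the input is a raw coordinate rather than a grid index, the same network can be queried at any resolution.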

The system contains the following components:

  1. Scene representation mapping function
  2. Neural rendering function
  3. A hypernetwork that maps a scene's latent representation to the weights of the scene representation (see the sketch after this list).
  4. The per-scene latent representations themselves.
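
For components 3 and 4, here is a sketch of the hypernetwork idea, assuming a single generated linear layer for brevity (the real system generates the parameters of every layer of the scene representation):

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Maps a scene's latent code to the weights of one layer of the
    scene representation, then applies that generated layer."""
    def __init__(self, latent_dim=256, in_features=3, out_features=256):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.num_w = in_features * out_features
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, self.num_w + out_features),  # weights + biases
        )

    def forward(self, z, xyz):
        # z: (latent_dim,) scene latent code; xyz: (num_points, 3) query points
        params = self.net(z)
        w = params[:self.num_w].view(self.out_features, self.in_features)
        b = params[self.num_w:]
        return torch.relu(xyz @ w.t() + b)  # generated layer applied to xyz
```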

The paper uses a ray marching technique to go from pixel coordinate -> pixel color. For a given pixel coordinate, it iterates for a fixed number of steps, updating the depth along the camera ray; an LSTM "learns" the step size at each iteration. At the end of all the iterations, we obtain the (x,y,z) coordinates of the nearest visible surface along that camera ray, and a pixel generator network then translates the feature vector at that point into a color.
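
A sketch of that loop, reusing the (hypothetical) SceneRepresentation above; the hidden size, step count and initial depth are made up for illustration:

```python
import torch
import torch.nn as nn

class RayMarcher(nn.Module):
    """Walks along each camera ray with LSTM-predicted step sizes."""
    def __init__(self, phi, feature_dim=256, hidden_dim=16, num_steps=10):
        super().__init__()
        self.phi = phi                           # scene representation
        self.lstm = nn.LSTMCell(feature_dim, hidden_dim)
        self.to_step = nn.Linear(hidden_dim, 1)  # hidden state -> step length
        self.num_steps = num_steps

    def forward(self, origins, dirs, init_depth=0.05):
        # origins, dirs: (num_rays, 3); one ray per pixel
        depth = torch.full((origins.shape[0], 1), init_depth)
        state = None
        for _ in range(self.num_steps):
            points = origins + depth * dirs      # current sample on each ray
            h, c = self.lstm(self.phi(points), state)
            state = (h, c)
            depth = depth + self.to_step(h)      # learned step along the ray
        return origins + depth * dirs            # estimated surface points
```

The feature vector phi returns at those final points is what the pixel generator would turn into colors.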

All these components are jointly optimized with stochastic gradient descent over datasets of many object instances from a class (e.g., cars or chairs). The result is also multi-view consistent, because a given (x,y,z) will always map to the same feature vector.
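
Roughly, the training loop looks like the sketch below. Everything here is a hypothetical stand-in: `decoder`/`render` abbreviate the full hypernetwork + ray marcher + pixel generator pipeline, and the data is random just so the loop runs.

```python
import torch
import torch.nn as nn

num_scenes, latent_dim = 10, 256
latent_codes = nn.Embedding(num_scenes, latent_dim)  # one code per scene
decoder = nn.Linear(latent_dim, 64 * 64 * 3)         # stand-in renderer

def render(z):
    return decoder(z).view(-1, 64, 64, 3)

optimizer = torch.optim.Adam(
    list(latent_codes.parameters()) + list(decoder.parameters()), lr=1e-4)

for step in range(100):
    scene_ids = torch.randint(0, num_scenes, (4,))
    images = torch.rand(4, 64, 64, 3)                # dummy ground-truth views
    loss = ((render(latent_codes(scene_ids)) - images) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```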

Once training is complete, we can try one-shot, two-shot or multi-shot reconstruction. Given one or a few views of an unseen object, the trained network weights stay fixed and only a new latent code is optimized to match the given view(s); the fitted code can then be rendered to show what this particular object would look like from a brand new vantage point.
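
A sketch of few-shot reconstruction, again with a placeholder for the frozen renderer: only the fresh latent code receives gradients.

```python
import torch

def render(z):
    # Placeholder for the frozen, trained SRN pipeline.
    return torch.sigmoid(z[:3]).expand(64, 64, 3)

observed = torch.rand(64, 64, 3)            # the single given view
z = torch.zeros(256, requires_grad=True)    # new latent code to fit
optimizer = torch.optim.Adam([z], lr=1e-2)

for _ in range(100):
    loss = ((render(z) - observed) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# z now encodes the new object; rendering it from other camera poses
# yields the novel views.
```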

Another neat application mentioned in the paper is "non-rigid deformation". By conditioning on the identity parameters of the Basel face dataset, SRNs succeeded in reconstructing face geometry and appearance. More interestingly, by tweaking the (separate) expression parameters, they smoothly transitioned from one facial expression to another for entirely new faces. This demonstrates that the model actually learned the underlying 3D geometry.