LightSpeed: Light and Fast Neural Light Fields on Mobile Devices
NeurIPS 2023
Aarush Gupta (Carnegie Mellon University), Junli Cao* (Snap Inc.), Chaoyang Wang* (Snap Inc.), Ju Hu (Snap Inc.), Sergey Tulyakov (Snap Inc.), Jian Ren (Snap Inc.), László A. Jeni (Carnegie Mellon University)
Abstract
Real-time novel-view image synthesis on mobile devices is prohibitive due to limited computational power and storage. Volumetric rendering methods, such as NeRF and its derivatives, are unsuitable for mobile devices because of their high computational cost. In contrast, recent advances in neural light field representations have shown promising real-time view synthesis results on mobile devices. Neural light field methods learn a direct mapping from a ray representation to the pixel color. The current choices of ray representation are stratified ray sampling and Plücker coordinates, overlooking the classic light slab (two-plane) representation, the preferred parameterization for interpolating between light field views. In this work, we find that the light slab is an efficient representation for learning a neural light field. More importantly, it is a lower-dimensional ray representation, enabling us to learn the 4D ray space using feature grids, which are significantly faster to train and render. Although mostly designed for frontal views, we show that the light slab representation can be extended to non-frontal scenes using a divide-and-conquer strategy. Our method offers superior rendering quality compared to previous light field methods and achieves a significantly better trade-off between rendering quality and speed.
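The light slab (two-plane) parameterization mentioned above maps a ray to the 4D coordinates (u, v, s, t) of its intersections with two parallel planes. A minimal sketch of this mapping, with illustrative plane depths (the specific plane placement is an assumption, not a value from the paper):

```python
import numpy as np

def light_slab_coords(origin, direction, z_near=0.0, z_far=1.0):
    """Map a ray to 4D light-slab coordinates (u, v, s, t): the (x, y)
    intersections with two parallel planes z = z_near and z = z_far.
    Plane depths are illustrative choices, not values from the paper."""
    # Solve origin_z + t * direction_z = z_plane for the ray parameter t.
    t0 = (z_near - origin[2]) / direction[2]
    t1 = (z_far - origin[2]) / direction[2]
    u, v = (origin + t0 * direction)[:2]  # hit point on the near plane
    s, t = (origin + t1 * direction)[:2]  # hit point on the far plane
    return np.array([u, v, s, t])

# An axis-aligned ray hits both planes directly above its (x, y) origin.
print(light_slab_coords(np.array([0.5, 0.25, -1.0]),
                        np.array([0.0, 0.0, 1.0])))
# [0.5  0.25 0.5  0.25]
```

Because the two intersection points fully determine the ray, this is a 4D parameterization, versus the 6D Plücker coordinates, which is what makes compact 2D feature-grid factorizations feasible.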
Method
LightSpeed first renders a low-resolution ray feature map from the feature grids. This is accomplished by generating ray bundles at a reduced resolution, where each ray corresponds to a pixel in a downsampled image. We project each ray's 4D light-slab coordinates onto six 2D feature grids (as shown in the figure), one per 2D sub-space, and bilinearly interpolate each grid at the projected location, yielding six interpolated feature vectors per ray. These features are concatenated; since the feature grids are multi-resolution with L levels, features from all levels are concatenated into a single feature per ray. Stacking the features of all rays produces a low-resolution 2D feature map. To mitigate the approximation introduced by decomposing the 4D grid into 2D grids, the features undergo additional processing through an MLP, implemented as a series of 1 × 1 convolutional layers applied to the low-resolution feature map. Finally, the processed feature map is passed through a sequence of upsampling layers to generate the high-resolution image.
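The per-ray lookup can be sketched as follows. This is a single-level NumPy illustration, not the paper's implementation: the choice of the six axis pairs as all 2-combinations of (u, v, s, t), the grid resolution, and the function names are assumptions, and the multi-resolution concatenation and 1 × 1 convolutional MLP are omitted.

```python
import numpy as np
from itertools import combinations

def bilinear(grid, x, y):
    """Bilinearly interpolate an (R, R, F) feature grid at continuous
    coordinates (x, y) in [0, R-1]."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, grid.shape[0] - 1)
    y1 = min(y0 + 1, grid.shape[1] - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * grid[x0, y0]
            + dx * (1 - dy) * grid[x1, y0]
            + (1 - dx) * dy * grid[x0, y1]
            + dx * dy * grid[x1, y1])

def ray_feature(coords4d, grids):
    """coords4d: light-slab coordinates (u, v, s, t), assumed in [0, 1].
    grids: dict keyed by the six 2D sub-spaces (axis pairs), each an
    (R, R, F) array. Returns the concatenated per-ray feature for one
    resolution level."""
    feats = []
    for (i, j) in combinations(range(4), 2):  # the six axis pairs
        g = grids[(i, j)]
        r = g.shape[0] - 1
        feats.append(bilinear(g, coords4d[i] * r, coords4d[j] * r))
    return np.concatenate(feats)

# Toy grids: resolution 16, 4 features per grid entry.
R, F = 16, 4
rng = np.random.default_rng(0)
grids = {pair: rng.normal(size=(R, R, F))
         for pair in combinations(range(4), 2)}
feat = ray_feature(np.array([0.2, 0.7, 0.4, 0.9]), grids)
print(feat.shape)  # (24,): 6 grids x 4 features, before level concat
```

Running this over every ray in the downsampled bundle and stacking the results gives the low-resolution feature map that the 1 × 1 convolutional layers and upsampling decoder then turn into the full-resolution image.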