LightSpeed: Light and Fast Neural Light Fields on Mobile Devices

NeurIPS 2023

Abstract

Real-time novel-view image synthesis on mobile devices is prohibitive due to their limited computational power and storage. Volumetric rendering methods, such as NeRF and its derivatives, are ill-suited to mobile devices because of the high computational cost of volumetric rendering. On the other hand, recent advances in neural light field representations have shown promising real-time view synthesis results on mobile devices. Neural light field methods learn a direct mapping from a ray representation to the pixel color. The current choices of ray representation are stratified ray sampling and Plücker coordinates, overlooking the classic light slab (two-plane) representation, the preferred representation for interpolating between light field views. In this work, we find that the light slab is an efficient representation for learning a neural light field. More importantly, it is a lower-dimensional ray representation, enabling us to learn the 4D ray space using feature grids, which are significantly faster to train and render. Although originally designed for frontal views, we show that the light-slab representation can be extended to non-frontal scenes using a divide-and-conquer strategy. Our method offers superior rendering quality compared to previous light field methods and achieves a significantly improved trade-off between rendering quality and speed.
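As a concrete illustration of the light slab (two-plane) parameterization mentioned above, the sketch below converts a ray given by an origin and direction into 4D (u, v, s, t) coordinates by intersecting it with two parallel planes. This is a minimal sketch, not the paper's code; the function name, plane placement, and NumPy implementation are illustrative assumptions.

```python
# Minimal sketch: parameterize a ray by its intersections with two parallel
# planes z = z_uv and z = z_st (the "light slab" / two-plane representation),
# giving a 4D coordinate (u, v, s, t) instead of a 6D origin+direction or
# Plücker encoding. Plane placement (z_uv, z_st) is an assumed per-scene choice.
import numpy as np

def ray_to_light_slab(origin, direction, z_uv=0.0, z_st=1.0):
    """origin, direction: (..., 3) arrays; returns (..., 4) slab coordinates."""
    direction = direction / np.linalg.norm(direction, axis=-1, keepdims=True)
    # Distance along the ray to each plane (assumes the ray is not parallel
    # to the planes, which holds for frontal-facing captures).
    t_uv = (z_uv - origin[..., 2]) / direction[..., 2]
    t_st = (z_st - origin[..., 2]) / direction[..., 2]
    uv = origin[..., :2] + t_uv[..., None] * direction[..., :2]
    st = origin[..., :2] + t_st[..., None] * direction[..., :2]
    return np.concatenate([uv, st], axis=-1)  # (u, v, s, t)

# Example: a ray through the origin looking down +z hits both planes at (0, 0).
print(ray_to_light_slab(np.zeros(3), np.array([0.0, 0.0, 1.0])))  # [0. 0. 0. 0.]
```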

Method

LightSpeed first renders a low-resolution ray feature map from the feature grids. This is accomplished by generating ray bundles at a reduced resolution, where each ray corresponds to a pixel in a downsampled image. We project each ray's 4D coordinates onto six 2D feature grids, one per pairwise sub-space, to obtain feature vectors from the corresponding sub-spaces. The feature values are bilinearly interpolated from the 2D grids, resulting in six interpolated feature vectors. These features are subsequently concatenated; since the feature grids are multi-resolution with L levels, features from different levels are concatenated together to create a single feature per ray. Combining the features from all rays yields a low-resolution 2D feature map. To mitigate the approximation introduced by decomposing the 4D grid into 2D grids, the features undergo additional processing through an MLP, implemented as a series of 1 × 1 convolutional layers applied to the low-resolution feature map. The processed feature map is then passed through a sequence of upsampling layers to generate a high-resolution image.
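The PyTorch sketch below mirrors the pipeline described above: six 2D feature grids per level are sampled with bilinear interpolation, the per-level features are concatenated into a low-resolution feature map, processed by 1 × 1 convolutions, and decoded by upsampling layers into a high-resolution image. This is a sketch under stated assumptions; the module names, grid resolutions, feature dimensions, and upsampling factor are illustrative and do not reflect the paper's exact configuration.

```python
# Hedged sketch of the rendering path described above (not the official code).
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightSlabRenderer(nn.Module):
    def __init__(self, n_levels=4, base_res=16, feat_dim=2, hidden=64, upsample=4):
        super().__init__()
        # Six 2D grids per level: one for each pair of the 4 slab axes (u, v, s, t).
        self.pairs = list(itertools.combinations(range(4), 2))  # 6 axis pairs
        self.grids = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(1, feat_dim, base_res * 2**l, base_res * 2**l))
            for l in range(n_levels) for _ in self.pairs
        ])
        in_dim = 6 * n_levels * feat_dim
        # MLP realized as 1x1 convolutions over the low-resolution feature map.
        self.mlp = nn.Sequential(
            nn.Conv2d(in_dim, hidden, 1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 1), nn.ReLU(),
        )
        # Upsampling decoder: low-res features -> high-res RGB image
        # (number of 2x stages = log2(upsample)).
        ups = []
        for _ in range(int(upsample).bit_length() - 1):
            ups += [nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                    nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU()]
        ups += [nn.Conv2d(hidden, 3, 3, padding=1), nn.Sigmoid()]
        self.decoder = nn.Sequential(*ups)

    def forward(self, rays_uvst):
        # rays_uvst: (B, H, W, 4) light-slab coordinates in [-1, 1] for a
        # low-resolution ray bundle (one ray per low-res pixel).
        B, H, W, _ = rays_uvst.shape
        feats, gi = [], 0
        for _level in range(len(self.grids) // len(self.pairs)):
            for (a, b) in self.pairs:
                coords = rays_uvst[..., [a, b]]                   # (B, H, W, 2)
                grid = self.grids[gi].expand(B, -1, -1, -1)
                feats.append(F.grid_sample(grid, coords, mode='bilinear',
                                           align_corners=True))  # (B, C, H, W)
                gi += 1
        feat_map = torch.cat(feats, dim=1)       # concatenate levels and sub-spaces
        return self.decoder(self.mlp(feat_map))  # (B, 3, upsample*H, upsample*W)
```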

Results on Synthetic 360° Dataset.

Rendered results are shown for the scenes Chair, Drums, Ficus, Hotdog, Lego, Materials, Mic, and Ship.
Results on Forward-Facing Dataset.

Rendered results are shown for the scenes Fern, Flower, Fortress, Horns, Leaves, Orchids, Room, and Trex.
Results on Unbounded 360° Dataset.

Rendered results are shown for the scenes Kitchen, Counter, Bonsai, Bicycle, Garden, and Stump.
Citation