Photo-Realistic Continuous Image Super-Resolution with Implicit Neural Networks and Generative Adversarial Networks

Implicit neural networks (INNs) can represent images in the continuous domain: they consume raw (x, y) coordinates and output a color value. They can therefore represent and generate images at arbitrarily high resolutions, in contrast to convolutional neural networks (CNNs), which output a constant-sized array of pixels. In this work, we show how to super-resolve a single image using an INN to produce sharp and photo-realistic images. We employ a random patch-based coordinate sampling method to obtain patches with context and structure; we use these patches to train the INN in an adversarial setting. We demonstrate that the trained network retains the desirable properties of INNs while producing sharper output than previous work. We also show qualitative and quantitative comparisons with INN and CNN baselines on the benchmark datasets DIV2K, Set5, Set14, Urban100, and B100. Our code will be made public at https://github.com/iSarmad/CiSRGan .


Introduction
Image enhancement and super-resolution find applications in various consumer products such as smartphone photography, TV, and video. The advent of deep learning and neural networks has enabled advancements in single-image super-resolution (SISR). Convolutional neural networks (CNNs) are the most popular method for SISR [11]. However, the output of CNNs is an array of pixels with a fixed size; therefore, we need to train a new network for each scaling factor. This strategy can be very inconvenient and time-consuming.

* Corresponding Author: muhammad.sarmad@ntnu.no
Recently a class of neural networks called implicit neural networks (INNs) has gained attention [33,25,28]. These networks can represent an image by storing the color value of each pixel corresponding to a given pixel coordinate [26,31]. This image representation leads to a continuous model where one can zoom in to a single image arbitrarily by changing the discretization level of the input coordinates.
Chen et al. [8] proposed an INN-based method called the local implicit image function (LIIF) for SISR. They used a single INN to perform SISR for any scale and achieved arbitrary zooming capability: given a neural network trained for scales in the range of 1x to 4x (we refer to this range as in-scale), their model can perform super-resolution at 6x, 8x, etc. (out-of-scale). This ability to extrapolate makes LIIF very beneficial for super-resolution. Furthermore, LIIF is on par with CNNs in terms of distortion metrics such as PSNR [22]. Despite these advantages, LIIF suffers from blurry outputs for out-of-scale super-resolution due to its pixel-wise loss function. In this work, we propose the continuous image super-resolution generative adversarial network (CiSR-GAN), which trains an INN in an adversarial setting for super-resolution, thus improving the perceptual quality and photo-realism of the output for out-of-scale SISR. To the best of our knowledge, training an implicit network for out-of-scale single-image super-resolution in an adversarial setting has not been proposed before.
Implicit Neural Networks for SISR Implicit neural networks (INNs) have recently become popular as a way to represent continuous images and shapes [26,38,4,9,3,10]. Occupancy Networks [25] and DeepSDF [28] used INNs for 3D shape representation. Sitzmann et al. [31] and Tancik et al. [34] then showed that INNs can also represent images with high fidelity. Later works learned GANs using INNs [7,32,30,2]. The local implicit image function (LIIF) [8] recently showed that continuous representations can also be used to perform SISR. The resulting SISR model is agnostic to resolution, and a single model can super-resolve images to any required resolution. LIIF [8] uses the L1 loss to train the network, which renders the output blurry. In contrast, we train our model in an adversarial setting to perform photo-realistic SISR and achieve better results.

Method
Consider a low-resolution 2D image I↓s that consists of an array of pixels. The corresponding high-resolution 2D image is denoted I, and s is the scaling factor. Each pixel in I has coordinates x and y. Assume that a continuous image can be represented by a function f_θ. The discrete image I can then be represented as I(x, y) = f_θ(x, y).

Figure 1. The low-resolution image I↓s is passed through a CNN encoder to obtain a feature vector z. A random patch is selected from the coordinate space of the desired high-resolution image to obtain high-resolution coordinates x_hr. z and x_hr are passed through the local implicit image function (LIIF) generator to obtain the super-resolved output image I. This I is compared with I_GT using the adversarial loss ('Adv loss') and the perceptual loss ('VGG loss'), and with I_HR using the pixel loss L1.
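Because f_θ is queried per coordinate, the output resolution is determined entirely by how finely the coordinate space is discretized. The sketch below illustrates this idea; the function name and the convention of pixel-center coordinates in [-1, 1] are our illustrative assumptions, not the paper's exact code.

```python
import torch

def make_coord_grid(height, width):
    """Return an (H*W, 2) tensor of pixel-center (x, y) coordinates in (-1, 1)."""
    ys = (torch.arange(height) + 0.5) / height * 2 - 1
    xs = (torch.arange(width) + 0.5) / width * 2 - 1
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([grid_x, grid_y], dim=-1).view(-1, 2)

# The same trained network can be queried at any resolution:
coords_2x = make_coord_grid(128, 128)  # 2x grid for a 64x64 input
coords_8x = make_coord_grid(512, 512)  # out-of-scale 8x grid
```

Changing the grid size changes the output resolution without retraining, which is what makes a single INN model scale-agnostic.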
More specifically, for f θ we employ the local implicit image function (LIIF) with default configurations. For details, we refer to the paper [8].
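For a rough sense of the decoder's size, a 5-layer ReLU MLP with hidden dimension 256 (matching the configuration in our training details) could be sketched as follows. The latent feature dimension of 64 is an assumption for illustration; this is not the released LIIF implementation.

```python
import torch

class ImplicitMLP(torch.nn.Module):
    """Sketch of the INN f_theta: a 5-layer ReLU MLP with hidden dimension 256.
    It maps a latent feature z plus a 2D query coordinate to an RGB value."""
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        dims = [feat_dim + 2] + [hidden] * 4  # coordinate adds 2 input dims
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [torch.nn.Linear(d_in, d_out), torch.nn.ReLU()]
        layers.append(torch.nn.Linear(hidden, 3))  # 5th linear layer -> RGB
        self.net = torch.nn.Sequential(*layers)

    def forward(self, z, coords):
        return self.net(torch.cat([z, coords], dim=-1))

inn = ImplicitMLP()
rgb = inn(torch.randn(4096, 64), torch.randn(4096, 2))  # one RGB value per query
```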
Training LIIF in an Adversarial Setting An overview of our approach is shown in Figure 1. The input image is passed through a convolutional encoder to obtain a latent vector z. This latent vector z and the coordinates x_hr of image I are used to obtain the color values of the pixels at the input coordinates x_hr using the LIIF block [8]. Note that the INN consists of a few multilayer perceptron (MLP) layers inside the LIIF block. To train the INN with adversarial and perceptual losses, we need an output image patch. The previous method [8] samples a random set of coordinates from the image. This sampling works well when the objective is to minimize a pixel-wise loss such as L1; however, looking only at individual pixels discards contextual information. Therefore, we propose a random patch-based sampling procedure instead of random point-based sampling to retain contextual information. We first train LIIF [8] with random patches instead of random points using only the pixel-wise loss and find that patch-based sampling performs on par with coordinate-based sampling. Following previous work [8], we use the L1 loss; however, training with only the L1 objective leads to smooth images that blur textural information for out-of-scale super-resolution.
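The difference between the two sampling schemes can be sketched as follows; both function names are illustrative stand-ins, not the released implementation. The key point is that a contiguous patch of coordinates yields an output that can be treated as an image by a discriminator or a VGG feature extractor.

```python
import torch

def sample_random_points(coords, num_points):
    """Point-wise sampling (as in LIIF): scattered coordinates, no local structure."""
    idx = torch.randperm(coords.shape[0])[:num_points]
    return coords[idx]

def sample_random_patch(height, width, patch):
    """Patch-based sampling: a contiguous block of pixel indices that keeps
    local context, so the generated colors form a coherent image patch.
    (Indices would then be normalized to the INN's coordinate range.)"""
    top = torch.randint(0, height - patch + 1, (1,)).item()
    left = torch.randint(0, width - patch + 1, (1,)).item()
    ys = torch.arange(top, top + patch)
    xs = torch.arange(left, left + patch)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1).view(-1, 2)
```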
The use of a patch-based sampling procedure permits the use of an adversarial loss based on the generative adversarial network (GAN) framework [12]. A GAN consists of a generator and a discriminator that compete against each other: the generator aims to generate realistic images, whereas the discriminator aims to classify generated images as fake. Through this joint training, both improve, resulting in realistic image generation. Instead of the standard GAN formulation, we use a relativistic GAN formulation [16]. Unlike the standard discriminator, which estimates the probability that an input image is real, the relativistic discriminator predicts the probability that a real image is relatively more realistic than a fake one. We define a discriminator network D_θD, which is optimized in an alternating manner along with the generator network G_θG to solve the adversarial min-max problem. The relativistic discriminator is defined as D(I_GT, G(I↓s)) = σ(C(I_GT) − E_{G(I↓s)}[C(G(I↓s))]), where (I_GT, I↓s) ∼ (p_train(I_GT), p_G(I↓s)), E_{G(I↓s)}[·] is the mean over the generated data in the mini-batch, σ is the sigmoid activation function, and C is the output of the discriminator before the activation. For details, we refer to [16].
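Under these definitions, the relativistic average objectives can be sketched with binary cross-entropy on logit differences, a common implementation of [16]; the function names are ours, and this is a sketch rather than our exact training code.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Discriminator loss: real patches should look more realistic than the
    average fake patch, and fakes less realistic than the average real."""
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.zeros_like(fake_logits))
    return (loss_real + loss_fake) / 2

def relativistic_g_loss(real_logits, fake_logits):
    """Generator loss: the symmetric objective, pushing fake logits above the
    average real and real logits below the average fake."""
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.zeros_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.ones_like(fake_logits))
    return (loss_real + loss_fake) / 2
```

Here `real_logits` and `fake_logits` are the pre-sigmoid discriminator outputs C(·) for ground-truth and generated patches, respectively.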
We also use a perceptual loss, defined as the distance between the features of a pre-trained VGG network for the predicted image I and the ground-truth image I_GT [15]. The complete training objective for the generator is L_total = λ1 L1 + λ2 L_G + λ3 L_VGG, where L1, L_G, and L_VGG are the content, adversarial, and perceptual losses, respectively, and λ1, λ2, and λ3 are weighting hyperparameters for each objective. We set them following guidelines from previous work [37].
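The combined objective is a straightforward weighted sum; the default weights below follow the values reported in our training details, and the function is a sketch for illustration.

```python
def generator_objective(l1_loss, adv_loss, vgg_loss,
                        lam1=1e-2, lam2=5e-3, lam3=1.0):
    """Total generator loss: weighted content (L1), adversarial (L_G),
    and perceptual (L_VGG) terms."""
    return lam1 * l1_loss + lam2 * adv_loss + lam3 * vgg_loss
```

The perceptual term dominates (λ3 = 1), which biases training toward feature-space similarity rather than pixel-exact reconstruction.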

Experiments
We implemented all our models in PyTorch [29]. We trained all networks on an NVIDIA RTX Titan GPU. The code builds on open-source implementations [8,35].
Dataset and Metrics Like [8], we use the DIV2K dataset [1] with the standard split for training and validation for a fair comparison. Testing is performed on multiple test sets, including Set5, Set14, Urban100, and B100 [5,40,14,24]. The results for related works were generated for comparison using the pre-trained models provided by Chen et al. [8] and SPSR [23]. We use the peak signal-to-noise ratio (PSNR) as a metric for comparison. PSNR (measured in dB) is a measure of quality between the super-resolved image and the ground truth. Although it is a good measure of distortion, it is a poor indicator of perceptual quality [6]. Therefore, we additionally report the learned perceptual image patch similarity (LPIPS) metric [41] for comparison with previous works. LPIPS measures the distance in VGG [15] feature space between the super-resolved and the ground-truth image: the lower the distance, the more perceptually similar the super-resolved image is to the ground truth.
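As a reference for the distortion metric, PSNR can be computed directly from the mean squared error; this is the standard definition, sketched here for images with values in [0, 1].

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between images in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

hr = torch.rand(3, 64, 64)
noisy = (hr + 0.05 * torch.randn_like(hr)).clamp(0, 1)
print(float(psnr(noisy, hr)))  # roughly 26 dB for this noise level
```

Identical images give infinite PSNR; heavier distortion drives the value down, which is why a GAN-trained model trades some PSNR for perceptual quality.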
Training Details Similar to LIIF [8], we use RDN [42] as the encoder, which generates a feature map z with the same spatial size as the input image. The INN f_θ is a 5-layer MLP with ReLU activations and a hidden dimension of 256. The encoder and the INN act as the generator in our model. The discriminator is based on the architecture used by ESRGAN [37]. We use input patches of 64 x 64 during training. The generator's output has the same size as the input patch, i.e., 64 x 64; the discriminator is therefore adjusted to accept an image patch of this size. We use transfer learning and initialize the weights of our generator from a pre-trained RDN-LIIF [8]. We train all models for 75 epochs with batch size 16 on the DIV2K training set. We use the Adam [17] optimizer for both the generator and the discriminator with a learning rate of 10^-4. The weights λ1, λ2, and λ3 are set to 10^-2, 5 x 10^-3, and 1 [37]. For a fair comparison with LIIF, we also train our models on the 1x-4x scale range.

Figure: The reference image from DIV2K, the low-resolution input image (LR), the super-resolved image using LIIF [8], and our model's output (CiSR-GAN). LR images are 6x and 12x down-sampled from the ground-truth HR images and super-resolved to 6x and 12x in the top two and bottom two rows, respectively, demonstrating out-of-scale performance. All models were trained for 1x-4x only; we therefore refer to 6x and 12x as out-of-scale. LIIF has a smoothing effect that blurs out high-level detail, while our model clearly produces sharper results, retaining textural details such as water waves, the texture of butterfly wings, and the fine hair of animals.

Qualitative Analysis
Out-of-Scale: Qualitative results on the DIV2K validation set [1] and the Set14 [40] test set are shown in the figures. Our model retains the textural information while also maintaining all the desired properties of an implicit network, e.g., a single model can perform super-resolution at higher scales even if it was not trained for them. All results in the qualitative comparison are for 6x or 12x upsampling to compare with LIIF, whereas our models are trained on 1x-4x down-sampled images.

Figure: The high-resolution ground-truth image (HR), the low-resolution image (LR), the super-resolved image from the LIIF model [8], and our model's output (CiSR-GAN). All input images are 6x down-sampled from the ground-truth images and super-resolved to 6x. All models were trained for 1x-4x only. We observe the same smoothing effect in the LIIF outputs, where high-level details such as water waves and the texture of the fence are blurred, while our model retains these details and produces a much more realistic image.
In-Scale: Note that CNN-decoder-based models [37,27,21] are not direct competitors of our method, since they cannot perform out-of-scale super-resolution. However, for completeness we test their performance for in-scale super-resolution, i.e., for a 4x scaling factor. We compare with the best-performing recent CNN-based method, Structure-Preserving Super Resolution (SPSR) [23], which recently showed strong results in retrieving sharp lines and geometry. All images are 4x down-sampled from the ground-truth HR images and super-resolved to 4x. The results are shown in Figure 4. The SPSR model adds edge artifacts such as spurious lines or texture to the super-resolved image, whereas CiSR-GAN produces more realistic results.

Quantitative Results
CiSR-GAN vs LIIF We compare our model (CiSR-GAN) with previous work on the DIV2K dataset, as shown in Table 1. The perceptual similarity metric (LPIPS) is a distance, so lower values are better; for the peak signal-to-noise ratio (PSNR), higher is better. Blau et al. [6] previously showed that there is a trade-off between distortion and perception, and this can also be observed for our model: CiSR-GAN has lower PSNR values than LIIF [8] because it is trained with adversarial and perceptual losses. However, it consistently outperforms LIIF on the LPIPS metric. A lower LPIPS means we can expect aesthetically pleasing results from CiSR-GAN. Because it is based on an INN, CiSR-GAN can also be evaluated easily at out-of-scale factors, and it maintains its edge over LIIF in perceptual metrics at all evaluated scales.
In-Scale: We further compare performance with state-of-the-art methods, including SRGAN, ESRGAN, and SPSR [20,37,23]. CiSR-GAN outperforms all of them on LPIPS while maintaining a comparable PSNR, as shown in Table 2. Generally, there is a large gap between SPSR and CiSR-GAN on the LPIPS metric; however, the difference is small on the Urban100 [14] test set. This behavior is expected, as the gradient-guidance-based structure priors used in SPSR encourage the retrieval of the lines and geometry commonly found in that dataset.

Conclusion
In this work, we improved the perceptual quality of implicit-neural-network-based single-image super-resolution. The main hindrance to utilizing adversarial losses for continuous image representation models was the random coordinate-based sampling procedure adopted by previous works. We proposed a patch-based sampling method instead, and trained the implicit neural network with additional adversarial and perceptual objectives. We demonstrated that the resulting network produces sharp and photo-realistic images while maintaining the desirable properties of implicit neural networks, i.e., out-of-scale super-resolution. As future work, our method could also be trained with a gradient-guidance-based structure prior to improve PSNR.