Identifying and Mitigating Flaws of Deep Perceptual Similarity Metrics


  • Oskar Sjögren Luleå University of Technology
  • Gustav Grund Pihlgren Luleå University of Technology
  • Fredrik Sandin Luleå University of Technology
  • Marcus Liwicki Luleå University of Technology



Image Similarity Metrics, Perceptual Loss, Deep Perceptual Similarity, Deep Features


Measuring the similarity of images is a fundamental problem in computer vision for which no universal solution exists. While simple metrics such as the pixel-wise L2-norm have been shown to have significant flaws, they remain popular. One group of recent state-of-the-art metrics that mitigates some of those flaws is Deep Perceptual Similarity (DPS) metrics, where similarity is evaluated as the distance between the deep features of neural networks. However, DPS metrics themselves have been less thoroughly examined for their benefits and, especially, their flaws. This work investigates the most common DPS metric, where deep features are compared by spatial position, along with metrics comparing the averaged and sorted deep features. The metrics are analyzed in depth, using images designed specifically to challenge them, to understand their strengths and weaknesses. This work contributes new insights into the flaws of DPS and suggests improvements to the metrics. An implementation of this work is available online:

