Multi-modal data generation with a deep metric variational autoencoder


  • Josefine Vilsbøll Sundgaard Technical University of Denmark
  • Morten Rieger Hannemose Technical University of Denmark
  • Søren Laugesen Interacoustics Research Unit
  • Peter Bray Interacoustics A/S
  • James Harte Eriksholm Research Centre
  • Yosuke Kamide Kamide ENT clinic
  • Chiemi Tanaka Diatec Japan
  • Rasmus R. Paulsen Technical University of Denmark
  • Anders Nymark Christensen Technical University of Denmark



Keywords: deep metric learning, variational autoencoder, generative modelling, otoscopy, wideband tympanometry, otitis media


We present a deep metric variational autoencoder for multi-modal data generation. The variational autoencoder employs a triplet loss in the latent space, which allows for conditional data generation by sampling new embeddings within each class cluster. The approach is evaluated on a multi-modal dataset consisting of otoscopy images of the tympanic membrane with corresponding wideband tympanometry measurements. The modalities in this dataset are correlated, as they represent different aspects of the state of the middle ear, but they do not exhibit a direct pixel-to-pixel correlation. The approach shows promising results for the conditional generation of pairs of images and tympanograms, and will allow for efficient augmentation of data from multi-modal sources.
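
The two ingredients named above, a triplet loss on latent embeddings and class-conditional sampling inside a latent cluster, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the margin value, the toy 2-D latent space, and in particular the choice to fit a Gaussian to each class cluster are all assumptions made for the example. The encoder and decoder networks are omitted; in a full model the sampled embeddings would be passed to the decoder to produce an image and tympanogram pair.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on batches of embeddings of shape (n, d).

    Pulls each anchor towards a same-class embedding (positive) and pushes it
    away from a different-class embedding (negative) by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()


def sample_class_embeddings(latents, labels, target_class, n_samples, rng):
    """Draw new latent vectors for one class.

    Assumption for this sketch: each class cluster is modelled as a single
    Gaussian fitted to the embeddings of that class.
    """
    cluster = latents[labels == target_class]
    mean = cluster.mean(axis=0)
    cov = np.cov(cluster, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)


rng = np.random.default_rng(0)

# Hypothetical toy latent space: two well-separated classes in 2-D.
latents = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
])
labels = np.array([0] * 50 + [1] * 50)

# Triplets: anchors and positives from class 0, negatives from class 1.
separated_loss = triplet_loss(latents[:50], latents[:50][::-1], latents[50:])

# Conditional generation step: sample new embeddings inside the class-1 cluster.
new_z = sample_class_embeddings(latents, labels, target_class=1, n_samples=10, rng=rng)
```

Because the clusters in the toy example are already well separated, the triplet loss is inactive there, and the sampled embeddings stay close to the chosen class cluster, which is the property that makes the generation conditional.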

