Fast accuracy estimation of deep learning based multi-class musical source separation


  • Alexandru Mocanu Logitech Europe S.A., CH-1015, Lausanne, Switzerland
  • Benjamin Ricaud
  • Milos Cernak Logitech Europe S.A., CH-1015, Lausanne, Switzerland



Deep learning, audio source separation


Music source separation is the task of extracting every instrument from a given song. Recent breakthroughs on this task have centered on a single dataset, MUSDB, which is limited to four instrument classes. Building larger datasets that cover more instruments is costly and time-consuming, both for collecting the data and for training deep neural networks (DNNs). In this work, we propose a fast method to evaluate the separability of instruments in any dataset without training and tuning a DNN.
This separability measure helps to select appropriate samples for the efficient training of neural networks. Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches such as TasNet or Open-Unmix.
Our results reveal two essential points for audio source separation: 1) the ideal ratio mask, although light and straightforward, provides an accurate measure of the separation performance of recent neural networks, and 2) new end-to-end learning methods such as TasNet, which operate directly on waveforms, in fact internally build a time-frequency (TF) representation, so they encounter the same limitations as TF-based methods when separating audio patterns that overlap in the TF plane.
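The oracle principle with an ideal ratio mask can be illustrated with a short sketch. This is not the authors' implementation, only a minimal example under stated assumptions: the mask for each source is taken as its magnitude divided by the sum of all source magnitudes (one common definition), the TF representation is a toy transform made of non-overlapping rectangular frames rather than a proper windowed STFT, the score is a plain signal-to-distortion ratio, and the function names (`tf_transform`, `tf_inverse`, `irm_oracle`, `sdr_db`) and the synthetic sinusoids are hypothetical.

```python
import numpy as np

def tf_transform(x, frame=256):
    """Toy TF representation (assumption): FFT of non-overlapping frames."""
    n = (len(x) // frame) * frame
    return np.fft.rfft(x[:n].reshape(-1, frame), axis=1)

def tf_inverse(X, frame=256):
    """Exact inverse of tf_transform."""
    return np.fft.irfft(X, n=frame, axis=1).reshape(-1)

def irm_oracle(sources, frame=256, eps=1e-10):
    """Separate the sum of `sources` using masks built from the true sources."""
    specs = np.stack([tf_transform(s, frame) for s in sources])
    mixture = specs.sum(axis=0)              # the transform is linear
    mags = np.abs(specs)
    masks = mags / (mags.sum(axis=0) + eps)  # ratio mask, one per source
    return np.stack([tf_inverse(m * mixture, frame) for m in masks])

def sdr_db(reference, estimate, eps=1e-10):
    """Plain signal-to-distortion ratio in dB (assumed metric)."""
    n = min(len(reference), len(estimate))
    ref, est = reference[:n], estimate[:n]
    return 10 * np.log10((np.sum(ref ** 2) + eps) / (np.sum((ref - est) ** 2) + eps))

# Synthetic sources: one pair is disjoint in the TF plane, one pair overlaps.
sr, n = 8000, 8192
t = np.arange(n) / sr
a = np.sin(2 * np.pi * 500 * t)
b_far = np.sin(2 * np.pi * 2000 * t)               # different frequency bin
b_same = 0.5 * np.sin(2 * np.pi * 500 * t + 1.0)   # same bin, other phase

est_disjoint = irm_oracle([a, b_far])
est_overlap = irm_oracle([a, b_same])
sdr_disjoint = sdr_db(a, est_disjoint[0])
sdr_overlap = sdr_db(a, est_overlap[0])
print(f"SDR, disjoint sources:    {sdr_disjoint:.1f} dB")
print(f"SDR, overlapping sources: {sdr_overlap:.1f} dB")
```

In this toy setting the oracle score drops sharply once the two sources occupy the same TF bin, which is the kind of separability signal the abstract describes; no network is trained at any point.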


A. Défossez, N. Usunier, L. Bottou, and F. Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019. URL https://

D. FitzGerald. Harmonic/percussive separation using median filtering. In Proceedings of the International Conference on Digital Audio Effects (DAFx), volume 13, 2010. URL https: // DAFx10/DerryFitzGerald_DAFx10_P15.pdf.

D. FitzGerald. Vocal separation using nearest neighbours and median filtering. 2012. doi:10.1049/ic.2012.0225.

R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam. Spleeter: a fast and efficient music source separation tool with pre-trained models. Journal of Open Source Software, 5(50):2154, 2020. doi:10.21105/joss.02154.

Y.-N. Hung and A. Lerch. Multitask learning for instrument activation aware music source separation. arXiv preprint arXiv:2008.00616, 2020. URL https://program.ismir2020.net/static/final_papers/334.pdf.

A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde. Singing voice separation with deep U-Net convolutional networks. 2017. URL https://openaccess.

H. Kameoka, M. Nakano, K. Ochiai, Y. Imoto, K. Kashino, and S. Sagayama. Constrained and regularized variants of non-negative matrix factorization incorporating music-specific constraints. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5365–5368. IEEE, 2012. doi:10.1109/ICASSP.2012.6289133.

C. Laroche, M. Kowalski, H. Papadopoulos, and G. Richard. A structured nonnegative matrix factorization for source separation. In 2015 23rd European Signal Processing Conference (EUSIPCO), pages 2033–2037. IEEE, 2015. doi:10.1109/EUSIPCO.2015.7362741.

J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey. SDR – half-baked or well done? In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 626–630. IEEE, 2019. URL abstract/document/8683855/.

Y. Luo and N. Mesgarani. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019. doi:10.1109/TASLP.2019.2915167.

M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies. Sparse representations in audio and music: from coding to source separation. Proceedings of the IEEE, 98(6):995–1005, 2009. doi:10.1109/JPROC.2009.2030345.

Z. Rafii and B. Pardo. A simple music/voice separation method based on the extraction of the repeating musical structure. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 221–224. IEEE, 2011. doi:10.1109/ICASSP.2011.5946380.

Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner. The MUSDB18 corpus for music separation, Dec. 2017. URL https: //

F. Rigaud, A. Falaize, B. David, and L. Daudet. Does inharmonicity improve an NMF-based piano transcription model? In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 11–15. IEEE, 2013. doi:10.1109/ICASSP.2013.6637599.

D. Samuel, A. Ganeshan, and J. Naradowsky. Meta-learning extractors for music source separation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 816–820. IEEE, 2020. doi:10.1109/ICASSP40776.2020.9053513.

D. Stoller, S. Ewert, and S. Dixon. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185, 2018. URL 205_Paper.pdf.

F.-R. Stöter, A. Liutkus, and N. Ito. The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pages 293–305. Springer, 2018. doi:10.1007/978-3-319-93764-9_28.

F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji. Open-Unmix - a reference implementation for music source separation. Journal of Open Source Software, 4(41):1667, 2019. doi:10.21105/joss.01667. URL https://doi.org/10.21105/joss.01667.

N. Takahashi, N. Goswami, and Y. Mitsufuji. MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pages 106–110. IEEE, 2018. doi:10.1109/IWAENC.2018.8521383.