Learnable filter-banks for CNN-based audio applications


  • Helena Peic Tukuljac
  • Benjamin Ricaud (Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø NO-9037, Norway)
  • Nicolas Aspert
  • Laurent Colbois




Keywords: deep learning, filter, gammatone, audio


We investigate the design of a convolutional layer whose kernels are parameterized functions. The layer is intended as the input layer of convolutional neural networks for audio applications and other applications involving time series. The kernels are defined as one-dimensional functions with a band-pass filter shape and a limited number of trainable parameters. Building on the literature on this topic, we confirm that networks with such an input layer can achieve state-of-the-art accuracy on several audio classification tasks. We explore the effect of different parameters on the network's accuracy and learning ability. Because each kernel is described by a few parameters rather than by one weight per sample, this approach reduces the number of weights to be trained and allows larger kernel sizes, an advantage for audio applications. Furthermore, the learned filters provide additional interpretability and a better understanding of the audio properties exploited by the network.
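
To make the idea of a parameterized band-pass kernel concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes a gammatone-shaped kernel, g(t) = t^(n-1) exp(-2πbt) cos(2πft), with centre frequency f and bandwidth b as the only parameters that would be trained. The function names, the fixed order n = 4, the peak normalization, and the example frequencies are all illustrative assumptions. In an actual network, f and b would be updated by gradient descent inside a deep learning framework; here only the kernel construction is shown.

```python
import numpy as np


def gammatone_kernel(center_hz, bandwidth_hz, kernel_size, sample_rate, order=4):
    """Build one band-pass kernel from two parameters (hypothetical helper).

    g(t) = t^(order-1) * exp(-2*pi*bandwidth*t) * cos(2*pi*center*t)
    """
    t = np.arange(kernel_size) / sample_rate  # time axis in seconds
    envelope = t ** (order - 1) * np.exp(-2.0 * np.pi * bandwidth_hz * t)
    kernel = envelope * np.cos(2.0 * np.pi * center_hz * t)
    peak = np.max(np.abs(kernel))
    # Normalize to unit peak amplitude (an assumption made for this sketch).
    return kernel / peak if peak > 0 else kernel


def filterbank(center_freqs, bandwidths, kernel_size, sample_rate):
    """Stack one kernel per (center, bandwidth) pair: shape (n_filters, kernel_size)."""
    return np.stack([
        gammatone_kernel(f, b, kernel_size, sample_rate)
        for f, b in zip(center_freqs, bandwidths)
    ])


def peak_frequency(kernel, sample_rate):
    """Frequency (Hz) at which the kernel's magnitude spectrum is largest."""
    n_fft = sample_rate  # zero-pad so that FFT bins are 1 Hz apart
    spectrum = np.abs(np.fft.rfft(kernel, n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]


sample_rate = 16000                          # Hz, assumed for illustration
kernel_size = 401                            # about 25 ms at 16 kHz
centers = np.array([200.0, 1000.0, 3000.0])  # illustrative centre frequencies (Hz)
widths = np.array([50.0, 120.0, 300.0])      # illustrative bandwidths (Hz)

bank = filterbank(centers, widths, kernel_size, sample_rate)

# Parameter count: two values per filter versus one weight per kernel sample.
n_parametric = 2 * len(centers)
n_free = bank.size
print(bank.shape, n_parametric, n_free)
```

With this construction the number of trainable values depends only on the number of filters and not on the kernel length, which is why long kernels remain cheap to train.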

