Machine listening in spatial acoustic scenes with deep networks in different microphone geometries

  • Jörn Anemüller, University of Oldenburg
Keywords: acoustic source localization, microphone array processing, deep neural networks


Multi-channel acoustic source localization evaluates direction-dependent
inter-microphone differences in order to estimate the position of an acoustic
source embedded in an interfering sound field. Here we investigate a deep neural
network (DNN) approach to source localization that improves on previous work
with learned, linear support-vector-machine localizers. DNNs with depths
between 4 and 15 layers were trained to predict the azimuth direction of target
speech embedded in an isotropic, multi-speech-source noise field, with the
azimuth quantized into 72 directional bins of 5 degrees width each. Several
system parameters were varied; in particular, the number of microphones in the
bilateral hearing aid scenario was set to 2, 4, and 6, respectively.
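
The quantization of azimuth into 72 classes of 5 degrees can be illustrated with a minimal sketch. The abstract does not state where bin 0 starts or how localization error is scored, so the choices below (bin 0 covering 0 to 5 degrees, and an error measured as the shortest circular distance between bins) are assumptions, and the function names are hypothetical.

```python
import math

N_BINS = 72                      # number of directional classes (from the abstract)
BIN_WIDTH_DEG = 360.0 / N_BINS   # 5 degrees per bin


def azimuth_to_bin(azimuth_deg):
    """Map an azimuth in degrees to a class index in 0..71.

    Assumption: bin k covers [k * 5, (k + 1) * 5) degrees.
    """
    wrapped = azimuth_deg % 360.0            # wrap negative or >360 angles
    # The final modulo guards against a wrapped value that rounds up to 360.0.
    return int(math.floor(wrapped / BIN_WIDTH_DEG)) % N_BINS


def circular_bin_error_deg(pred_bin, true_bin):
    """Shortest angular distance in degrees between two bins on the circle."""
    diff = abs(pred_bin - true_bin) % N_BINS
    return min(diff, N_BINS - diff) * BIN_WIDTH_DEG


if __name__ == "__main__":
    print(azimuth_to_bin(12.4))              # bin index for 12.4 degrees
    print(circular_bin_error_deg(71, 0))     # neighbouring bins across the wrap
```

Treating the error as circular matters because bins 71 and 0 are adjacent directions, even though their class indices are far apart.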

Results show that DNNs provide a clear improvement in
localization performance over a linear classifier reference system.
Increasing the number of microphones from 2 to 4 yields a larger increase in
performance for the DNNs than for the linear system. However, going from 4 to 6
microphones provides only a small additional gain. The DNN architectures perform better
with 4 microphones than the linear approach does with 6 microphones, thus
indicating that location-specific information in source-interference scenarios
is encoded non-linearly in the sound field.

