Machine listening in spatial acoustic scenes with deep networks in different microphone geometries

Multi-channel acoustic source localization evaluates direction-dependent inter-microphone differences in order to estimate the position of an acoustic source. We here investigate a deep neural network (DNN) approach to source localization that improves on previous work with learned, linear localizers. DNNs with depths between 4 and 15 layers were trained to predict the direction of target speech in an isotropic, multi-speech-source noise field. Several system parameters were varied; in particular, the number of microphones in the bilateral hearing aid scenario was set to 2, 4, and 6, respectively. Results show that DNNs provide a clear improvement over the linear classifier reference system. Increasing the number of microphones from 2 to 4 yields a larger performance gain for the DNNs than for the linear system, while 6 microphones provide only a small additional gain. The DNN architectures perform better with 4 microphones than the linear approach does with 6, indicating that location-specific information in source-interference scenarios is encoded non-linearly in the sound field.


Introduction
The human auditory system routinely performs acoustic source localization, a task that is also important in technical systems since it permits the detection of relevant events such as speech, facilitates the reconfiguration of (auditory) spatial signal processing, and may trigger subsequent actions such as obstacle avoidance in robots.
Location-specific information as measured with multi-channel microphone arrays is encoded in relative transfer functions (RTFs, [8]), dominated by, but not limited to, time-differences of arrival (TDOA) of the direct-path component of the acoustic signal. Classic approaches for source localization are based on TDOA analysis which commonly uses the generalized cross-correlation (GCC) method to yield robust TDOA estimates [2,9].
Data-adaptive systems that form an implicit RTF representation through learning on training data have been proposed as systems that do not rely on direct TDOA estimation. In some real-world scenarios, e.g., when amplitude modulation is characteristically present in target and interference sources, they have shown robust localization performance [7,1,11,4].
The present work evaluates a non-linear extension of an earlier linear approach [5] by employing deep feed-forward networks that learn the transformation from multi-channel audio signals to a probabilistic location map. Specific emphasis is put on a systematic comparison across several deep network architectures and with a linear reference network that serves as baseline. We investigate to what extent the density of spatial sound field sampling, i.e., the number of microphone sensor channels, influences localization accuracy, and whether there might be a trade-off between the number of sensors and the complexity of the classifier architecture.
In conclusion, the results presented here for speech sources embedded in isotropic noise are indicative of a qualitative difference between non-linear (deep network) and linear localizers that cannot be overcome by the inclusion of additional sensor channels.

https://doi.org/10.7557/18.5151 © The author(s). Licensee Septentrio Academic Publishing, Tromsø, Norway. This is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

Probabilistic Acoustic Source Localization with Deep Nets
The discriminative approach to source localization builds on a standard classification framework that is employed to build decision models for directional sound source presence. Relevant acoustic parameters are learned implicitly; thus, no direct impulse response measurements and no additional assumptions on the acoustics are required. Source presence is indicated by cross-correlation function features ρ_ij(τ), cf. section 2.2, containing a main peak centered around the TDOA τ_ij(ζ) corresponding to location ζ. The cross-correlation functions should therefore permit a classifier to adaptively learn to discriminate patterns that imply source presence from those that occur when no source is active in the direction of interest. They are denoted by

φ = [ρ_ij(τ)], −D_ij ≤ τ ≤ D_ij, 1 ≤ i < j ≤ M,

where D_ij is the maximum absolute delay between two sensors that is included and φ denotes the feature vector concatenating all cross-correlation vectors from all pairs of M sensors i, j. During classifier training, example feature vectors φ are labeled as positive examples for their respective source direction ζ whenever a source is present at the corresponding location during the time frame across which the feature vector has been computed. We here employ deep feed-forward neural network classifiers in order to build implicit direction-dependent models during training. Their output layer contains a set of N output units, one for each direction ζ.
When trained with the categorical cross-entropy cost function, the network outputs converge to a-posteriori probability estimates for the respective classes. Hence, the output of a trained deep-network localization algorithm provides us with a spatio-temporal probabilistic localization map P_source(ζ, t) that indicates the probability of a source being active for each time frame t and each direction ζ. Maximum a-posteriori estimates are computed from the probabilistic location map according to

ζ̂(t) = argmax_ζ P_source(ζ, t).

Multi-source DOA estimation is achieved by evaluating the J most probable occurrences of sound source positions. Note that estimation of the number of sound sources is not attempted here, although the probabilistic information about the directional source distribution may lend itself to such an approach.
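In code, the maximum a-posteriori read-out amounts to picking the J highest-posterior direction bins of a frame. The following numpy sketch illustrates this on the 72-class azimuth grid used here; the helper name `map_directions` and the toy posterior values are ours, not the paper's implementation.

```python
import numpy as np

def map_directions(p_source, n_sources):
    """Return the indices of the n_sources most probable azimuth classes
    from one frame's posterior map p_source (shape: n_directions,)."""
    order = np.argsort(p_source)[::-1]   # directions sorted by posterior, descending
    return order[:n_sources]             # the J most probable directions

# toy posterior over 72 azimuth classes (5-degree grid)
posterior = np.full(72, 0.1 / 70)        # small background probability mass
posterior[12] = 0.6                      # strong source at 60 degrees
posterior[40] = 0.3                      # weaker source at 200 degrees
print(map_directions(posterior, 2) * 5)  # the two estimated azimuths in degrees
```

Note that this read-out assumes J is known; as stated above, estimating the number of sources is not attempted.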

Feature Extraction
Deep-network localization is based on input features that capture the spatial covariance structure of the sound field as observed at the microphones, using generalized cross-correlation phase transform (GCC-PHAT) [10] functions

ρ_ij(τ) = (1/2π) ∫ Ψ(ω) H_i(ω) H_j*(ω) e^{iωτ} dω,

where H_i(ω) denotes the short-time spectrum of microphone channel i, ω frequency, and Ψ(ω) a spectral weighting. The phase transform (PHAT) weighting Ψ(ω) = 1/|H_i(ω) H_j*(ω)| has been shown to be robust against noise and reverberation, and is often used in direction-of-arrival (DOA) estimation. Thus, Ψ(ω) equalizes the amplitude of the signals with a uniform spectral weighting.
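As an illustration, GCC-PHAT is conveniently computed in the frequency domain by normalizing the cross-spectrum to unit magnitude before transforming back. The following numpy sketch is our own (function name, FFT length, and regularization constant are choices for illustration, not taken from the paper):

```python
import numpy as np

def gcc_phat(x_i, x_j, max_delay=None):
    """GCC-PHAT cross-correlation of two real-valued channel frames.
    Returns lags centered on zero; optionally truncated to +-max_delay samples."""
    n = len(x_i) + len(x_j)
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep phase, discard magnitude
    rho = np.fft.irfft(cross, n=n)
    rho = np.roll(rho, n // 2)           # move zero lag to the center of the vector
    if max_delay is not None:
        mid = n // 2
        rho = rho[mid - max_delay: mid + max_delay + 1]
    return rho

# toy check: an impulse delayed by 5 samples yields a correlation peak at lag +5
x_ref = np.zeros(256); x_ref[10] = 1.0
x_del = np.zeros(256); x_del[15] = 1.0   # same impulse, 5 samples later
rho = gcc_phat(x_del, x_ref, max_delay=20)
print(np.argmax(rho) - 20)               # lag of the peak, in samples
```

The division by the cross-spectrum magnitude is exactly the amplitude equalization described above: only phase information survives.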

Training and Evaluation Data
Data for training and evaluation of the proposed algorithm was generated from a database of multi-channel head-geometry room impulse responses [6] and the TIMIT speech corpus. Target speech sources were placed at one of 72 azimuth angles in 5-degree intervals at a distance of 80 cm, for which impulse responses were available. Single-channel speech signals from the TIMIT corpus were convolved with the 6-channel behind-the-ear hearing aid impulse response set to obtain multi-channel sound field data for the corresponding speaker location. Depending on the experiment, between two and six channels were used during training and testing, resulting in three different array geometries G1, G2, G3, as described in Table 1. An isotropic speech-simulating noise field was generated by placing 72 randomly selected speech sources simultaneously at all 72 azimuth positions, ensuring the spectral and temporal properties of each interfering source to be (on average) identical to those of the target source. The 6-channel speech and noise fields were superimposed at signal-to-noise ratios (SNR) of clean (∞ dB), 20 dB, 10 dB, 0 dB, and −10 dB. In total, 10 hours of multi-channel training data were generated from the training portion of the TIMIT dataset for each SNR condition, comprising 144 unique speaker-utterance combinations (72 male, 72 female) per direction. Evaluation data amounted to 5 hours from 72 unique speaker-utterance combinations (36 male, 36 female) from the testing portion of TIMIT. Thus, the total amount of data for training and evaluation was of sufficient size to train large deep-network architectures.
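The superposition of speech and noise fields at a prescribed SNR can be sketched as below; scaling the noise from broadband signal powers is our simplifying assumption, and `mix_at_snr` is a hypothetical helper, not the authors' pipeline code.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise attains the requested broadband SNR.
    Sketch only: powers are computed over the whole signal, per channel."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

For a multi-channel field the same gain would be applied to every channel so that inter-channel level differences of the noise field are preserved.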

Algorithm setup
GCC-PHAT coefficients were computed from 10 ms windows and band-limited with an upper cut-off frequency of 4 kHz. A moderate window shift of 5 ms was chosen to generate training and test data for the evaluation setting. After reducing the length of the GCC-PHAT vectors to 4 ms around the center, limiting their maximum delay to ±2 ms, feature vectors with dimensionality between 193 and 1158 were obtained as input vectors for DNN processing. Depending on the number of microphones, we used 1, 3, or 6 pairwise cross-correlations that were subsequently arranged in a single feature vector, cf. Table 1 for a summary.
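Assembling the input vector can be sketched as follows. Note two simplifications relative to the paper: plain cross-correlation stands in for GCC-PHAT, and all microphone pairs are enumerated, whereas for 4 and 6 microphones the paper uses a selected subset of 3 and 6 pairs, respectively. The 193 lags per pair correspond to a truncation at ±96 samples.

```python
from itertools import combinations
import numpy as np

def localization_features(frames, max_lag):
    """Concatenate truncated cross-correlations over all microphone pairs.
    frames: array of shape (n_mics, n_samples). Illustrative sketch only."""
    feats = []
    for i, j in combinations(range(len(frames)), 2):
        full = np.correlate(frames[i], frames[j], mode="full")
        mid = len(full) // 2
        feats.append(full[mid - max_lag: mid + max_lag + 1])  # keep +-max_lag lags
    return np.concatenate(feats)

# 2 microphones (geometry G1), 480-sample frames, lags limited to +-96 samples
frames = np.zeros((2, 480))
print(localization_features(frames, 96).shape)   # one pair of 193 lags
```

With one pair this reproduces the 193-dimensional feature vector of geometry G1; 3 and 6 pairs give the 579- and 1158-dimensional vectors quoted above.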
A number of deep feed-forward network architectures were chosen with different depths and numbers of units per layer, while holding the total number of neuron units approximately constant. In total, four networks (Net 1, ..., Net 4), as indicated in Table 2, with parameters listed in Table 3, were evaluated for each scenario. A linear reference network Net R served as a baseline for comparison with linear discriminative source localization [5].
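A minimal forward pass through such a feed-forward localizer might look as follows; the layer widths, ReLU activations, and random weights are placeholders for illustration and do not reproduce the configurations of Tables 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

# toy feed-forward localizer: 579-dim GCC features -> 72 azimuth classes
sizes = [579, 256, 256, 72]              # placeholder widths, not the paper's
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[:-1], sizes[1:])]

def forward(phi):
    h = phi
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)       # ReLU hidden layers (activation choice is ours)
    return softmax(h @ weights[-1])      # posterior estimate over the 72 directions

p = forward(rng.standard_normal(579))
print(p.shape, float(p.sum()))
```

With cross-entropy training, the softmax outputs converge toward the a-posteriori class probabilities described in section 2, which is what makes the probabilistic location map interpretation possible.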

Performance evaluation
Performance of the trained localizers was evaluated as the F1 score, the harmonic mean of precision and recall, the latter being averaged across all azimuths:

F1 = 2 · (precision · recall) / (precision + recall).

To compare relative effects across architectures and geometries, van Rijsbergen's effectiveness E, defined as

E = 1 − F1,

was used, which attains a value of zero for perfect classification.
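Both measures are straightforward to compute from precision and recall; the helper names below are ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def effectiveness(precision, recall):
    """van Rijsbergen's effectiveness E = 1 - F1; zero for perfect classification."""
    return 1.0 - f1_score(precision, recall)

print(effectiveness(1.0, 1.0))   # perfect classifier
```

Because E decreases toward zero as performance improves, relative improvements between systems are reported below as relative reductions of E.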

Experiments
Experiments were carried out with the goal to systematically investigate the effect that the different sensor geometries and deep network architectures outlined above have on localization performance. Several additional parameters were varied in the experiments: the signal-to-noise ratio (SNR) ranged from clean to −10 dB. The maximum a-posteriori direction estimate was computed on (unaveraged) localization probability outputs of the networks on a 10 ms time scale, as well as after temporal pooling of probabilities across 100 ms frames. The spatial precision with which a correct vs. false localization decision was evaluated ranged from ±2.5°, i.e., within a single azimuth bin, to ±15°, the latter accepting estimates within an azimuth range around the true location as correct. Results from a subset of experiments are reported below, which highlight the observed effects in a number of typical acoustic scenarios.
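Temporal pooling across 100 ms can be sketched as averaging the posterior maps of ten consecutive 10 ms frames; averaging is our assumption for the pooling operator, which the paper does not specify in detail.

```python
import numpy as np

def pool_posteriors(p, factor=10):
    """Average localization posteriors p (shape: n_frames, n_directions) over
    non-overlapping blocks of `factor` frames, e.g., 10 ms -> 100 ms."""
    n = (len(p) // factor) * factor                    # drop incomplete trailing block
    return p[:n].reshape(-1, factor, p.shape[1]).mean(axis=1)
```

Pooling trades temporal resolution for robustness: transient errors of single 10 ms frames are averaged out before the maximum a-posteriori decision is taken.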

Results
Results obtained with the deep network architectures Net 1, ..., Net 3 and the linear reference network Net R are shown in Fig. 2 for the 2-microphone behind-the-ear geometry (G1) without temporal pooling, and in Fig. 3 for the 6-microphone behind-the-ear geometry (G3) with temporal pooling of 100 ms. These two scenarios also represent the hardest and least-hard settings in which the algorithms have been evaluated, with all other scenarios (data not presented here due to space limitations) achieving performance measures in between. Network Net 4, with the largest number of layers but the smallest number of units per layer, resulted in poor performance on the localization task (data not shown here), likely indicating that wider processing layers are required, given the parameters in Table 3, which include 50% dropout units. Thus, Net 4 was excluded from subsequent analysis. The dominant effect of increased performance shown in Fig. 3 is due to the combination of temporal pooling and an increased number of sensors. Note that chance level is at 1/72, thus most of the data points shown fall well above chance. Scenario G3 shows localization performance near or above 90% for SNRs at 10 dB or better, which may be the most relevant range for real-world applications. For the subsequent analysis, we have chosen to investigate more closely the results at 10 dB with an azimuth accuracy of ±5° due to its relevance in practice.

[Table caption: effectiveness at ±5° azimuth resolution, computed for all combinations of temporal resolution τ (10 ms, 100 ms), network architecture (Net 1, 2, 3, R), and microphone geometry (G1, G2, G3).]
Even with six recording channels (G3), the linear reference network Net R performs poorer than the DNN localizers in situation G2. Thus, information about source location in an interfering noise field may require non-linear processing for decoding, an effect that linear methods cannot compensate for by denser spatial sampling, cf. situation G3 with Net R. Table 5 investigates the effect of increasing the number of recording channels, showing the relative improvement of geometries G2 and G3 over the 2-microphone geometry G1 (with the respective network architecture and pooling time constant being held equal). The results show that DNN processing obtains a larger benefit from additional microphones than the linear network Net R.

Summary and Discussion
In the present contribution, we have proposed a deep network approach to acoustic source localization in a hearing aid scenario with multiple behind-the-ear microphones mounted bilaterally on a head. While our previous work has shown that source localization in this setup can be carried out with high accuracy using learned linear filters, the results presented here show that performance can be further increased through the use of non-linear learning algorithms such as deep feed-forward networks. While the specific network architecture appeared to be of lesser significance, it may be of interest that

Table 5: Effect of increasing the number of recording channels from 2 microphones (geometry G1) to 4 (G2) and 6 (G3), respectively. Relative improvement in effectiveness E compared to baseline geometry G1. Non-linear processing with DNNs extracts the information conveyed in the additional channels more effectively than the linear reference network Net R.
the improved performance of non-linear localization cannot be achieved with linear methods even if the number of sensors is increased further: linear models on 6-channel data were incapable of reaching the performance that non-linear networks achieved on 4-channel data. A saturation effect for 4-microphone setups that had been observed in previous studies with linear classifiers, i.e., that the use of 6 microphones leads to only a comparably small additional increase in localization performance, has been confirmed in the present study for non-linear networks. Thus, practical applications should consider a reasonable number of microphones and devote additional resources to non-linear signal processing approaches in order to achieve optimum performance. Doing so has been shown here to result in relative improvements of effectiveness E of up to 46.7% over linear approaches. Depending on the specific situation, this is approximately equivalent to a 5 dB increase in the signal-to-noise ratio (SNR) of the recording condition.