Consistent and accurate estimation of stellar parameters from HARPS-N Spectroscopy using Deep Learning

Consistent and accurate estimation of stellar parameters is of great importance for information retrieval in astrophysical research. The parameters span a wide range, from effective temperature to rotational velocity. We propose to estimate the stellar parameters directly from the spectral signals coming from the HARPS-N spectrograph pipeline, before any spectrum-processing steps are applied to extract the 1D spectrum. We use residual networks and an attention-based model to estimate the stellar parameters. The models estimate both the mean and the uncertainty of the stellar parameters through the parameters of a Gaussian distribution. The estimated distributions provide a basis for generating data-driven Gaussian confidence intervals for the estimated stellar parameters. We show that residual networks and attention-based models can estimate the stellar parameters with high accuracy at low signal-to-noise ratio (SNR) compared to previous methods. With an observation of the Sun from the HARPS-N spectrograph, we show that the models can estimate stellar parameters from real observational data.


Introduction
* Corresponding Author: fbohy@dtu.dk

There exists great variation in the techniques used to estimate stellar parameters, ranging from decision tree architectures to tailor-made algorithms for specific astrophysical surveys [14,15,22]. Previous research projects that have applied artificial neural networks [1] and deep learning [6] have focused on estimating effective temperature (T eff), surface gravity (log g), and metallicity (Z). Traditionally, the stellar parameters are estimated using 1D stellar spectra extracted from the original CCD spectral images [1,2,3,6]. The methods used for the extraction of the 1D spectrum introduce biases and assumptions into the spectrum, resulting in biased estimation of stellar parameters, which leads to different research groups obtaining different results when observing the same stars [18]. We argue that one should strive for an end-to-end deep learning approach, which can learn appropriate pre-processing steps as part of the modelling. However, the original CCD spectral images are inherently processed by the spectrograph to make them usable by scientists. The closest alternative to an end-to-end approach is using the data from the HARPS-N spectrograph pipeline [17]. In this paper, we propose to use the 2D spectral signal coming from the HARPS-N pipeline to estimate the stellar parameters and present methods for doing so. The main contributions of our approach are:
• The elimination of spectral pre-processing to extract the 1D spectra, as we apply our deep learning models directly to the 2D signal from the HARPS-N pipeline.
• Inclusion of stellar rotational velocity (V sin i) estimation, with prediction accuracy on par with other data-driven methods.
• Quantification of the uncertainty in the estimates of the stellar parameters. The estimated distribution provides a basis to create data-driven confidence intervals.
• An attention-based model which attends to the underlying elements of an input spectrum.

Data
Previous related research on data-driven stellar parameter estimation has generated synthetic spectra to train machine learning models and estimate their performance [1,2,6]. The use of synthetic spectra provides a unique opportunity to generate a large set of labelled data, as the details of the spectra are known a priori and the SNR can be varied to mimic different telescope exposure times [1,6]. A drawback of this approach is that the trained weights will be biased towards the physical model generating the data, as there exists a synthetic gap between the feature distributions of synthetic and observed spectra [6].

Data generation
We sample the synthetic spectra from a grid of model atmospheres using the ATLAS9 code [12]. The original code is described in detail in [13] and was updated to include new opacity distribution functions as outlined in [4]. The grid was extended by including different rotational velocities (V sin i) using Gray's methods [7]. The sampling from the grid can be seen in Table 1. We normalise each stellar parameter, so the parameters contribute equally to the loss.

Échelle orders. To generate synthetic spectral images similar to those coming from the HARPS-N pipeline, we split the spectra into their échelle orders and stitch them back together as an image. We limit the wavelength interval to between 5050 and 5350 Å, which corresponds to 8 different échelle orders. We interpolate the wavelengths of the 1D spectrum to match the 2D wavelengths, such that they correspond to the HARPS-N pipeline. We add a linear slant across each order to mimic the observations from the HARPS-N pipeline. We sample the model spectra without any noise, so we can vary the SNR by adding Gaussian noise during training and testing [6]. An example of a generated spectrum can be seen in Figure 1. The generated parameters are discrete, but we linearly interpolate between samples to create observations with continuous parameters [6]. The final dimensions of each spectral image are 8 rows of 4096 pixels.
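The stitching, slant, and noise-injection steps can be sketched as follows. This is a minimal numpy illustration under our own assumptions: the function name, the slant model, and the SNR definition (mean signal over noise standard deviation) are ours, not the paper's exact pipeline.

```python
import numpy as np

def make_order_image(flux_1d, n_orders=8, order_len=4096,
                     slant=0.05, target_snr=50.0, rng=None):
    """Stitch a 1D synthetic spectrum into an (n_orders x order_len) image,
    add a linear slant per order, and inject Gaussian noise for a target SNR.
    Parameter names and the slant model are illustrative assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    # Split the 1D flux into consecutive echelle orders stacked as rows.
    img = flux_1d[: n_orders * order_len].reshape(n_orders, order_len)
    # Linear slant across each order, mimicking the tilt of real orders.
    ramp = 1.0 + slant * np.linspace(-1.0, 1.0, order_len)
    img = img * ramp
    # Gaussian noise scaled so mean(signal) / sigma is roughly target_snr.
    sigma = img.mean() / target_snr
    return img + rng.normal(0.0, sigma, img.shape)
```

Because the noise is added on the fly, the same noiseless model spectrum can serve many SNR levels during training, as described above.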

Method
This section describes how we capture heteroscedastic uncertainty in the estimated stellar parameters, presents the attention-based architecture, and provides a framework to denoise samples with low SNR.

Heteroscedastic Uncertainty Estimation
Heteroscedastic regression assumes that the uncertainty of observations varies with the input x [10]. This uncertainty can be quantified by the distribution p(y|x), where the expected value is considered the best estimate of the parameters, and the variance of the distribution describes the uncertainty [19]. We learn the distribution using parameters θ and parameterize it as a Gaussian:

p(y|x) ≈ N(y; µ_θ(x), Σ_θ(x))    (1)

where µ_θ(x) is a vector of size 4, and Σ_θ(x) aims to learn the covariance of the 4 parameters. In order to estimate the parameters of the function in Equation 1, we minimise the negative log-likelihood [10].
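For the diagonal-covariance case, the negative log-likelihood objective can be sketched in a few lines of numpy. This is our own minimal illustration; predicting the log-variance (so the variance stays positive) is a common trick we assume here rather than a detail taken from the paper.

```python
import numpy as np

def gaussian_nll_diag(y, mu, log_var):
    """Negative log-likelihood of y under N(mu, diag(exp(log_var))).
    y, mu, log_var: arrays of shape (4,) for the 4 stellar parameters."""
    var = np.exp(log_var)
    # Sum over parameters of 0.5 * [log(2*pi*var) + (y - mu)^2 / var].
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var)
```

Minimising this loss pushes µ_θ(x) towards the labels while letting the predicted variance grow where the residuals are large, which is what yields the heteroscedastic uncertainty estimates.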

Attention-based model
The soft-attention model used in this work is inspired by the architecture from [9], in combination with the attention blocks presented in [20,21]. We construct an attention architecture which uses any number of intermediate feature maps x_n from a convolutional neural network, in combination with a global feature map g (the last layer of the convolutional neural network), to compute an attention map in each attention block [9]. The attention map α_n ∈ [0, 1] is used to identify salient features in the input, as the output of an attention block is the element-wise multiplication of the input feature map and the attention map: x̂_n = α_n · x_n. We denote the channels of a given feature map by F_x, and F_g as the channels of the global feature map. F_int is the number of channels of the convolutional weights in the attention block, which ensure g and x_n have the same number of channels. Formally, we compute the attention map α_n as follows:

α_n = σ_2(ψ^T σ_1(W_x^T x_n + W_g^T g + b_ψ))

where σ_1(x) is the activation function of the neural network, and σ_2(x) is the softmax operation, such that the attention map sums to one [20]. The set of parameters Θ_att contains the convolutional weights W_x ∈ R^(F_x×F_int) and W_g ∈ R^(F_g×F_int), which are used to linearly transform the input tensors using a channel-wise 1×1×1 convolution. The weights ψ ∈ R^(F_int×1) combine the features from all the channels into a 1-channel attention map. In addition, we include a bias term b_ψ ∈ R^(F_int) in Θ_att. The parameters of the attention block (the convolutional layers) are trained using standard back-propagation [20]. Each attention block outputs µ_θ(x) and Σ_θ(x).

Aggregation strategy. To ensure that all the attention maps α_n learn meaningful features, we average the outputs of the attention blocks. The overall attention architecture is presented in Figure 2.
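The gating computation can be sketched in numpy. This is our own illustration: with spatial positions flattened to length N, the channel-wise 1×1 convolutions reduce to matrix products; shapes and names are assumptions, and we use ReLU for σ_1.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_gate(x_n, g, Wx, Wg, psi, b):
    """Soft-attention gate over N spatial positions.
    x_n: (F_x, N) intermediate feature map, g: (F_g,) global feature vector,
    Wx: (F_x, F_int), Wg: (F_g, F_int), psi: (F_int,), b: (F_int,)."""
    # Project both inputs to F_int channels (the 1x1 convolutions), add, ReLU.
    h = np.maximum(Wx.T @ x_n + (Wg.T @ g)[:, None] + b[:, None], 0.0)
    # Squeeze to one channel with psi; softmax so the map sums to one.
    alpha = softmax(psi @ h)              # (N,), non-negative, sums to 1
    return alpha, x_n * alpha[None, :]    # attention map and gated feature map
```

The gated output x_n · α_n is what each attention block passes on, so positions with near-zero attention are effectively suppressed.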
We train an attention network with three attention blocks and 11 convolutional layers. After each attention block, we apply 2 × 2 max pooling. The pooling ensures that the later feature maps contain more local features: the initial block will find global features of the input, and subsequent blocks will attend to more local features [20]. We compare the attention model to a residual network consisting of 5 residual blocks [8] and three fully connected layers. We limit the convolutional kernel sizes to (1 × 7) and add zero padding, which confines the artefacts generated by the convolutional kernels to within the spectral image.

Denoising Auto-encoder
In practice, the obtained SNR of observations is often lower than expected, which can result in unsatisfactory performance on stellar label estimation. We propose to use a denoising auto-encoder (DAE) [23] to remove noise and ensure that samples with low SNR can be used in stellar label estimation. We assume the noise of a sample x is Gaussian white noise across the entire spectrum and denote the corrupted sample x′; this is a fair assumption given that Gaussian noise is used to vary the SNR of a sample [6]. We learn φ and θ by optimising the encoder f_φ(x′) and the decoder g_θ(z), where z is the latent code extracted by f_φ, minimising the reconstruction error between the reconstruction of the corrupted sample x′ and the original sample x [23].

Table 2: Mean absolute error based on the mean prediction from the models. Using the DAE we can remove noise and increase the performance of our models. † models are trained on a limited data set to match the parameter ranges presented in previous related work [6]. The limited data set only contains T eff between 4000 K and 6000 K, while the other parameters stay the same.
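The denoising objective can be sketched generically: corrupt the clean sample, encode and decode the corrupted version, and score against the clean original. This is a minimal sketch under our own assumptions; the function name is ours and mean squared error stands in for the paper's reconstruction error.

```python
import numpy as np

def dae_loss(x, encoder, decoder, noise_sigma, rng=None):
    """Denoising auto-encoder objective: corrupt x with Gaussian white
    noise, reconstruct from the corrupted sample, compare to the clean x."""
    rng = np.random.default_rng() if rng is None else rng
    x_corrupt = x + rng.normal(0.0, noise_sigma, x.shape)  # x' in the text
    z = encoder(x_corrupt)       # latent code from f_phi
    x_hat = decoder(z)           # reconstruction from g_theta
    return np.mean((x_hat - x) ** 2)
```

Because the target is the clean x rather than x′, the encoder-decoder pair is forced to learn noise removal rather than the identity map.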

Experiments and results
The results are obtained on a test set of 5667 synthetic model spectra. Gaussian noise is added to match the SNR of the training set. We trained the attention network and the residual network with both a diagonal covariance matrix and a full covariance matrix. The attention and residual networks are optimised using the Adam [11] variant of stochastic gradient descent (SGD) with a learning rate of 0.0001. The models are trained for 750 epochs. The DAE is trained for 1500 epochs using the AdamW [16] variant of SGD with a learning rate of 0.0003. For all models, the learning rate is decayed by a factor γ of 0.1 for the last 50 epochs.

Experimental Results
In Table 2, we present the model performance as the mean absolute error (MAE) on the 4 stellar parameters. The MAE obtained in this work is similar to the results obtained by other related work [3,6]; however, we have eliminated the need for extraction of the 1D spectra. The reader should be aware that the results obtained by related work are achieved with a different data pipeline. The DAE can remove noise from samples and has a positive impact on model performance when the SNR is low (Table 2). The incorporation of the DAE makes the model robust to noisy observations. This suggests that a DAE might also improve performance for traditional methods, which typically do not work well when data exhibits low SNR. When estimating a full covariance matrix, the model performance is slightly lower compared to learning a diagonal covariance (Table 2). Based on the results presented in Table 2, we will continue with the results from the attention model using a diagonal covariance matrix with an auto-encoder (DAE attention-network). We evaluate the uncertainty with Gaussian confidence intervals, using the diagonal elements of Σ_θ(x).
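The metric in Table 2 can be made explicit with a short sketch; the function name and the assumption that predictions are stacked as rows of a (n_samples, 4) array are ours.

```python
import numpy as np

def mae_per_parameter(y_true, y_pred):
    """Mean absolute error per stellar parameter over a test set.
    y_true, y_pred: arrays of shape (n_samples, 4), one column per
    parameter (e.g. Teff, log g, Z, V sin i)."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)), axis=0)
```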

Residuals
The residuals appear to be equally distributed around 0 (Figure 3). We detect slight bias in some of the estimates: the model appears to underestimate the temperature for very hot stars. The variance within each stellar parameter does not appear to be constant across the entire domain. For low values of Z, the models' estimates have high variance across all four labels; spectra with low Z show high variance in the residuals (Figure 3).

Standard deviations. The estimated standard deviations are obtained from the diagonal elements of Σ_θ(x). The models estimate high uncertainty for high-temperature stars (Figure 4), which is expected as the stellar features are sparse for hot stars [6]. The same pattern holds for low-metallicity stars (Figure 4), where the estimated uncertainty is higher for low values of metallicity compared to high values, which supports our assumption of heteroscedastic variance.

Figure 4: The estimated standard deviation from the residuals presented in Figure 3 as a function of the true parameters.

Uncertainty estimation
We assess the quality of the estimated standard deviations by evaluating the fraction of residuals within 1 or 2 standard deviations. The estimated distributions approximate the Gaussian theoretical values (Table 3). One might argue that the residual networks overestimate the variance of the learned distribution. We argue this happens because we train the residual networks with dropout in the fully connected layers, so the model is trained with uncertainty in x.

Table 3: Percentages of observations that are within µ ± Σ_θ(x) and µ ± 2Σ_θ(x). The models are trained with SNR ≈ 20.
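This calibration check amounts to computing empirical coverage, which can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def empirical_coverage(residuals, sigmas, k=1.0):
    """Fraction of residuals with |r| <= k * predicted sigma.
    For a well-calibrated Gaussian this approaches ~68.3% at k=1
    and ~95.4% at k=2."""
    residuals = np.asarray(residuals)
    return float(np.mean(np.abs(residuals) <= k * np.asarray(sigmas)))
```

Comparing these empirical fractions against the Gaussian theoretical values is exactly the comparison summarised in Table 3; coverage above the theoretical value indicates overestimated variance.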

Test on HARPS-N observation
In order to evaluate the synthetic gap, we assess the models on an observation of the Sun coming from the HARPS-N spectrograph. Due to the extreme apparent brightness of the Sun, these observations obtain a high SNR of ≈ 200 [5]. All real observations contain telluric lines (absorption lines originating in the Earth's atmosphere), which are not present in the synthetic data. The residual network can estimate the parameters of the Sun, and we use the estimated uncertainty to set up confidence intervals (Table 4).

Visual evaluation of attention
Visual evaluation of the estimated attention feature maps α shows that the model attends to some of the composite elements in the spectra (Figure 5). Magnesium b has spectral lines at 5172 Å and is often used by traditional methods when estimating the stellar parameters. Based on the high activation of the attention feature map α at this absorption line, the attention network attends to this element in the spectrum. The intermediate feature maps are up-sampled after the pooling layers to fit the input spectral image, which leads to attended features outside the signal.

Conclusion
We have focused on data-driven estimation of stellar parameters based on the spectral signal taken directly from the HARPS-N pipeline. Based on the results obtained using denoising models, such models should be applicable not only in a deep learning approach but also in more traditional physics-based methods, which typically underperform on data with low SNR. The estimation of a multivariate Gaussian also lays the groundwork for future research to explore full Bayesian approaches such as Markov chain Monte Carlo or variational inference methods. The attention models provide a way to reason about the importance of the different composite elements of a spectrum, as the models presented here attend to some of the salient underlying elements in a spectrum. Since the telluric lines are absent from our data set, future work could include these to reduce the synthetic gap. The physical knowledge required to analyse the spectra holds tremendous value, and it is as essential as the estimations. We therefore encourage future research to continue the path towards an end-to-end deep learning method, while acknowledging the importance of the physical composition of the underlying spectra.