Heterogeneous Change Detection on Remote Sensing Data with Self-Supervised Deep Canonically Correlated Autoencoders



Colophon
This master's thesis was submitted to the University of Genoa. However, the main part of the work was carried out during the author's Erasmus exchange at UiT The Arctic University of Norway, where he was supervised by members of the Machine Learning Group in the Department of Physics and Technology.
The thesis has later resulted in a conference paper presented at the IEEE International Geoscience and Remote Sensing Symposium 2020. It was also awarded the Premio Nazionale di Laurea "Eugenio Zilioli" ("Eugenio Zilioli" national thesis award) for 2020 by the Italian Association of Remote Sensing (AIT) and the Institute for Electromagnetic Sensing of the Environment of the Italian National Research Council (CNR-IREA).
The work is published to serve as a reference for further publications fostered through the collaboration between the author and the thesis supervisors at the University of Genoa and UiT The Arctic University of Norway.

Abstract
Change detection is a well-known topic of remote sensing. The goal is to track and monitor the evolution of changes affecting the Earth surface over time. The recently increased availability of remote sensing data for Earth observation and of computational power has raised the interest in this field of research. In particular, the keywords "multitemporal" and "heterogeneous" play prominent roles. The former refers to the availability and the comparison of two or more satellite images of the same place on the ground, in order to find changes and track the evolution of the observed surface, possibly with different temporal sensitivities. The latter refers to the capability of performing change detection with images coming from different sources, corresponding to different sensors, wavelengths, polarisations, acquisition geometries, etc.
This thesis addresses the challenging topic of multitemporal change detection with heterogeneous remote sensing images. It proposes a novel approach, taking inspiration from recent developments in the literature. The proposed method is based on deep learning, involving convolutional autoencoders, and represents an example of unsupervised change detection. A major novelty of the work consists in including a prior information model, used to make the method unsupervised, within a well-established algorithm such as canonical correlation analysis, and in combining these with a deep learning framework to give rise to an image translation method able to compare heterogeneous images regardless of their highly different domains.
The theoretical analysis is supported by experimental results, comparing the proposed methodology to the state of the art of this discipline. Two different datasets were used for the experiments, and the results obtained on both of them show the effectiveness of the proposed method.

Introduction
Nowadays, data are one of the leading assets in our society, and their analysis is a driver for new research and new investments. Thanks to the new generations of satellites and the increased capacity in storage and computation, remote sensing is gaining more and more importance in many fields of research. The number of satellites is continuously growing, and this leads to more acquisitions, which allow for easier Earth and environmental monitoring. Remote sensing images are catching on in our daily life, ranging from common web mapping and weather forecast services to advanced studies on climate change, environmental monitoring, disaster risk management, etc.

This thesis deals with the topic of multitemporal change detection with heterogeneous remote sensing images. It is an emerging and highly prominent topic in publications and journals. The attention to this matter is related to the vast availability of images acquired by different missions and sensors. In the past, almost only same-sensor (i.e. homogeneous) acquisitions were used for multitemporal change detection. However, it is becoming of the utmost importance to be able to compare heterogeneous images to take advantage of the variety of satellite observations: multispectral, panchromatic and radar with different wavelength bands, radar frequencies, polarisations, acquisition geometries, etc. Let us give a couple of examples. To track changes in time, it is necessary to ensure backward compatibility with older acquisition systems; old data may have been acquired by a retired satellite with outdated technology, and here comes the necessity of heterogeneous change detection. Furthermore, in case of disaster recovery, it is essential to use the first available image to assess the damage to roads and infrastructure, and it may not be possible to wait for the next same-satellite acquisition, which can be a few days later. Other socially valuable applications of change detection are land usage and urban monitoring, post-catastrophe assessments, crop monitoring and surveillance.

This hot research topic is also challenging: heterogeneous change detection aims to compare two acquisitions which are semantically different, for example an optical image and a synthetic aperture radar (SAR) image; but also two optical images acquired by different optical sensors with distinct channels are classified as heterogeneous. The core of the problem is to tackle the complexity of comparing two different physical quantities, because different sensors measure different quantities. The problem cannot be solved by visual inspection for many reasons: first, very specialised knowledge would be needed, and the quantity of data to be analysed would make a photointerpretation effort extremely time-consuming. An automatic approach is developed in this thesis.

The path chosen in this work falls entirely within the framework of unsupervised machine learning techniques. More specifically, some concepts of classical learning have been used in pair with deep learning strategies. The research has focused on bi-temporal acquisitions. The core idea of the proposed methodology is image translation across two domains, in order to bring the two heterogeneous acquisitions towards a common domain in which they can be compared. For this purpose, a deep neural network is deployed to learn the translation function from one domain to the other and vice versa.
The domain translation is guided by prior information extracted automatically off-line from the images through a graph-theoretic approach based on local affinity matrices. The proposed deep neural network is formed by a pair of autoencoders, coupled together by a processing block performing the canonical correlation analysis (CCA) in the latent space to force code space alignment.

Contribution
The candidate's contribution is both theoretical and practical. Firstly, this study investigates the literature about change detection in general and about CCA. Secondly, it proposes a new heterogeneous change detection method based on the integration of the CCA method and its derived techniques, of a deep learning architecture based on two autoencoders, and of a priori knowledge extracted through local affinity matrices. In this respect, the present work extends the approach developed in [Luppino et al., 2020], in which an adversarial approach was used to favour the alignment in a common domain. Moreover, the work conducted within the thesis activity also included the development and integration of the code to carry out the experiments, using different tools: Docker to create a virtual environment for testing the project; Python, TensorFlow and Keras to develop and integrate the machine learning code; experiment configuration and testing to run the experiments on a server. This thesis was carried out within an internship at UiT The Arctic University of Norway and resulted in the following publication: [Figari Tomenotti et al., submitted]

Outline
This thesis is organised into six chapters. Chapter 2 provides a general introduction to remote sensing, giving importance to data acquisition systems and providing detailed explanations of different methodologies of change detection. Chapter 3 presents the basic theory and technical background needed to understand the machine learning methodologies used: Canonical Correlation Analysis and some deep learning frameworks are presented. Chapter 4 explains the proposed methodology in detail. Chapter 5 presents and discusses the experimental results and the comparisons. In the last chapter, Chapter 6, conclusions are drawn.
Introduction to remote sensing and change detection

Remote Sensing
Remote sensing is the scientific and technical discipline whose aim is to acquire information about a target without accessing it directly. In other words, without touching or physically reaching the target, it is possible to retrieve parameters which allow determining some physical quantities of the object under analysis (such as shape, chemical composition, speed). All these methods take advantage of different electromagnetic techniques and data processing algorithms.
Despite the very generic name and the broad description given above, in this work we will refer in particular to remote sensing for Earth observation. Earth is the place where we live, and we extract our resources from it: food, fuels, water; therefore, monitoring our planet is of the utmost importance. The results of remote sensing for Earth observation are also essential in many research studies on climate change, and the major space agencies of the world play an active role in deploying new instruments and developing novel ways of studying these phenomena [NASA]. Moreover, industrial and agricultural applications of remote sensing are popular and already employed. Some of the most interesting and valuable applications of this discipline are briefly presented below.
• Land cover mapping. It permits monitoring urban development as well as farming lands. For example, searching for building alterations or new constructions is vital for authorities in order to collect taxes and monitor the security of the country. Besides, soil usage mapping allows for the observation of crop subdivision over a territory and for statistical purposes [Moser et al., 2012].
• Bio- and geophysical parameter retrieval. Very useful in environmental monitoring: biomass concentration retrieval in forests, analysis of plant species dispersion in a territory or surveillance of their health status, and ocean studies and supervision (such as chlorophyll density and temperature [Minnett et al., 2019]; Figure 2.1 shows an example). Other examples are mapping soil moisture and type for agricultural planning, and measuring wind speed or air temperature in a wide range of places, including the middle of the oceans (e.g. allowing feasibility studies for wind farms).
• Disaster management. Remote sensing permits authorities to have a clear idea of the extent of a natural (or anthropic) disaster just after it, by comparing images of the same zone before and after the event [Inglada and Giros, 2004]. Of course, at least one post-catastrophe image needs to be acquired.
• Arctic wildlife monitoring. Scientists have found it interesting to monitor animals, especially white animals living in the Arctic, which are easy to spot from satellites [Lavigne, 1976].
• Weather forecast. It is of the utmost importance both in the short term ("tomorrow there will be a hurricane") and in the long one ("temperature will increase by 2.5 K in the next 50 years") [Racah et al., 2016].

Figure 2.1: Example of remote sensing application: world sea surface temperature, November 2018. Credit: [Minnett et al., 2019].

All the mentioned applications are neither new nor impossible before the use of remote sensing techniques; however, they were too expensive or time-consuming to be accomplished as extensively as nowadays. In the last decades, remote sensing for Earth observation has become increasingly popular due to new data processing techniques and new capabilities in terms of available computational power. Furthermore, the number of satellites for this purpose has hugely increased, and some resources are now available for free. Remote sensing consists of two different operative moments, data acquisition and data processing, summed up in Figure 2.2.

Data Acquisition System
Remote sensing can use different means to acquire data; however, nowadays the majority of data are collected from satellites. These satellites are equipped with special sensors which permit scanning the Earth surface in many distinct ways and looking for various targets. An example of the working principle is illustrated in Figure 2.3. Technically speaking, passive sensors measure the electromagnetic radiation (related to a power density, in $\mathrm{W/m^2}$) reflected from the Earth surface or spontaneously emitted by the surface itself. Each satellite is equipped to capture the electromagnetic radiation in several bands, where each band is a determined interval of contiguous frequencies. The information carried by each frequency is different: the optical (visible) portion of the spectrum carries information about the colours of the target (the same structure we can appreciate with our eyes); further on, the thermal infrared frequencies give information about the temperature of the objects. (Many applications use this concept to study the temperatures of the oceans, or of the mainland in remote regions of the planet [Handcock et al., 2012].)

An acquisition system is very complex, and many challenges need to be overcome in order to have it in place and working. Disregarding the physical structure and the instruments, we would like to spend a few words explaining the difficulties of the information retrieval process. First of all, the atmosphere is formed by many elements in the gas state, and it is hundreds of kilometres thick. The electromagnetic radiation, while passing through it, interacts with these elements and can lose power due to absorption and distortion. Secondly, the behaviour of the atmosphere is not static, so it should be modelled as a dynamic system, applying the right corrections to the signal. The behaviour of the atmosphere with respect to the electromagnetic radiation is summed up in Figure 2.4; it is easy to see that specific frequencies are entirely absorbed, while others cross the atmosphere unchanged. This atmospheric behaviour forces us to use only a small portion of the electromagnetic spectrum for our purposes.

Sensor and characteristics
There are many types of sensors for remote sensing, and they can be classified in different ways. The broadest classification divides sensors into passive and active:

• passive: they measure the spectral signature of the electromagnetic radiation emitted or reflected by the target. The electromagnetic profile acts like a signature which allows identifying materials. More clearly, we can state that every material has its own reflectance properties, and by analysing its behaviour over a collection of frequencies it is possible to separate it from all the others. An example is represented in Figure 2.5.
• active: they illuminate the Earth with an electromagnetic source (usually in the microwaves) and measure the backscattered energy; this is known as the radar technique. The most advanced type in use is the SAR (synthetic aperture radar) system. Radar is extremely convenient because it uses longer wavelengths compared to optical sensors: the signal passes easily through clouds and smoke, and the system works day and night. However, it cannot rely on a specific spectral signature for all materials.

Optical sensors are also characterised by some quantities which define the quality of the final image: the spatial resolution (size of the smallest distinguishable target on Earth), the spectral resolution (width of the bandpass around each scanned frequency), the radiometric resolution (quantisation of each band), and the temporal resolution (revisiting time over the same zone).
It is also possible to classify passive sensors based on the number of bands used in the electromagnetic spectrum:

• panchromatic sensor: a single-channel detector which usually spans the whole visible range; the acquired images are black and white pictures from space. The actual spatial resolution can reach 0.3 meters.
• multispectral sensor: it is a multi-channel detector, with 5-7 bands; usually the visible region is included.
• superspectral sensor: it acquires an image which is a superposition of different intensity measures in many separate and narrow bands of the spectrum. This type of sensor usually has more than 10 bands.
• hyperspectral sensor: it is also known as imaging spectrometer, and it deploys many bands, usually hundreds, with a very narrow bandwidth.

Data processing
Remote sensing is not only data acquisition but, above all, data manipulation and processing. The acquired images undergo two different steps: the pre-processing phase and the processing proper. The pre-processing includes calibration, correction of geometric or radiometric distortions, and georeferencing. The processing phase, instead, aims to extract the useful and desired information, possibly combining it with ancillary information such as ground measurements or some a priori knowledge. Data processing for change detection makes use of machine learning algorithms; both supervised and unsupervised settings find their application in remote sensing. For now, let us only say that supervised algorithms need some extra input to reach their goal correctly. On the other hand, unsupervised ones do not need any input other than the satellite data.
In the data processing framework, there are many possibilities in order to achieve different goals. The target of this thesis is to perform change detection, which is described in the following sections.

Data types
Remote sensing is about acquiring data and processing them to get useful information. Before entering deeply into the processing part, we shall statistically characterise the data. First of all, data are always affected by errors; in this application, they are mainly due to noise during the acquisition process. In particular, optical data suffer from two major noise types: additive uniform noise and salt-and-pepper noise. On the contrary, radar images are affected by speckle, which is a multiplicative noise-like phenomenon. Properly speaking, speckle is not noise but an inherent result of the radar acquisition process; however, it makes images look noisy. The argument will be examined more in depth in the next chapter.

Change detection
This discipline aims at finding differences given a series of images of the same place, taken at different time instants. It is useful to highlight changes on the ground (e.g. new buildings, change of crop). The simplest case is when only two images are present, $X_{t_1}$ and $Y_{t_2}$, where $t_1$ and $t_2$ are two generic time instants with $t_1 < t_2$.
Having a couple of images representing the same place (e.g. an urban area), maybe in RGB colours or in black and white, spotting differences between them does not sound like a hard task. Even though our brain is capable of distinguishing differences, it performs this operation in a very sophisticated way. For example, it neglects features that we know are not proper to the terrain or the buildings, such as shadows. However, a machine does not know what a shadow is, that it turns with the Sun's movements, and that it is not a proper change on the ground. Taking pictures from a satellite implies accounting for differences in illumination, time of the day and angle of view. These are not hard tasks for our brain; however, they are for a computer.
On the other hand, a human can only analyse a few km² of terrain, whereas a machine can analyse entire regions in a small amount of time. This argument is even more persuasive if we think of comparing hyperspectral images, where the number of channels is quite high and the information carried in some bands, outside the visible region, can be meaningless to us, or rather, we are not able to appreciate changes in them. On the downside, a computer needs to know what it is looking for and what type of differences to neglect: as partially stated already, the Sun elevation, parallax effects, registration errors and noise can generate spectrally appreciable changes without belonging to a specific or semantic class transition [Volpi, 2013]. In other words, a critical point in change detection is the influence of image changes which do not represent real variations in the structure of the analysed environment; we have mentioned shadows, but we can also add clouds (in optical images) and cloud shadows on the terrain, or atmospheric interactions during different seasons or times of the day.
To cope with all these problems, some countermeasures have been adopted; the basic one is to rely on acquisitions where relevant changes are more significant in intensity than signal changes due to other reasons (e.g. atmospheric conditions). The next two sections investigate two different change detection frameworks: the homogeneous and the heterogeneous. The former utilises images acquired by the same sensor, and in similar conditions of light, orbit direction and angle. The latter, instead, is more challenging because it concerns images from different sensors and also from different domains, for example from optical and radar sensors. The final goal of change detection is a change map, that is, a two-class classification of the original image; in other words, each pixel must be labelled as changed or not changed.

Homogeneous Change Detection
Homogeneous change detection means combining and comparing information acquired by the same sensor, or at least the same sensor type. It deals with the comparison of images which lie in the same domain, acquired with the same frequency, polarisation, geometry, etc. The key point is to have a homogeneous domain where the measurements taken by the instruments represent the same quantity: intensity, reflectance, radiance. Different methodologies have been developed to obtain the change map in the homogeneous case. Nevertheless, the simplest way is through mathematical comparison operators: difference for optical images, and ratio for radar images. The approach is different because the two types of images suffer from different noise patterns. For homogeneous change detection there are two typical approaches, as highlighted in [Bovolo and Bruzzone, 2015]: fusion at the feature level and fusion at the decision level. Fusion at the feature level is intended as a comparison in the raw data domain. It is possible to extract the multitemporal information needed by analysing the different signatures at the two time instants. This class of techniques is mainly used with unsupervised algorithms. To cite some of them: differencing/ratioing (also known as Univariate Image Differencing or Change Vector Analysis for optical images) with thresholding and automatic thresholding algorithms [Moser and Serpico, 2006]; non-linear feature extraction is also feasible but more complex; further, Principal Component Analysis can be applied to the single-time image or to the stacked features as in [Fung and LeDrew, 1987]. Fusion at the decision level is quite different from the previous approach because it assumes classifying and segmenting the two images and then performing change detection on the result of the segmentation. In this case, the segmentation can be done relying on each image separately or exploiting the mutual information between them to construct the segmented images.
It is evident how the two methodologies are prone to errors in different cases; however, when well tuned and relying on good images (correctly registered, calibrated, etc.), they can achieve good performances. Moreover, there are many areas of interest where the homogeneous change detection framework is easily applicable and very convenient, for example to monitor medium- to long-term changes: even if the revisit time of a satellite is long or some acquisitions are useless due to weather conditions, it is possible to obtain an excellent final result. The main drawback of the homogeneous methodology, in the strictest circumstances, is that the algorithm can be applied only to measures taken by one sensor in one specific operational modality. In case a satellite (or a family of them) has been retired and substituted with a new one, the compatibility of the old method for comparing old and new images is not assured. Moreover, the instruments of many recent missions can be operated in a variety of modalities, which differ in their geometry, polarisation, or frequency. Concluding, it is essential to say that all the previous methods do not always fit very-high-resolution images.
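To make the feature-level comparison operators mentioned above concrete, the following is a minimal sketch (NumPy assumed; array and threshold names are hypothetical, not the thesis code) of the two classical change features: band-wise differencing for optical images and the log-ratio for radar intensities.

```python
import numpy as np

def difference_image(x_t1, x_t2):
    """Optical change feature: magnitude of the band-wise difference
    (Univariate Image Differencing / Change Vector Analysis)."""
    return np.linalg.norm(x_t2.astype(float) - x_t1.astype(float), axis=-1)

def log_ratio_image(s_t1, s_t2, eps=1e-6):
    """SAR change feature: the log-ratio turns multiplicative speckle
    into an additive disturbance, so ratioing is preferred over differencing."""
    return np.abs(np.log((s_t2 + eps) / (s_t1 + eps)))

# d = difference_image(optical_t1, optical_t2)  # H x W change feature
# change_map = d > some_threshold               # e.g. from an automatic algorithm
```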

Heterogeneous Change Detection
Heterogeneous change detection (HCD) is an emerging topic in Earth observation. It answers the increasing availability of remote sensing data by offering methods that allow combining images of radically different nature and still extracting reliable information about changes on the surface. The images could be acquired by multimodal sensors, such as optical instruments and synthetic aperture radar (SAR), or they can be recorded with different sensor parameters or under distinct environmental conditions, cases that would otherwise not be comparable unless possibly through meticulous pre-processing and co-calibration. In the bitemporal setting (two images available), HCD is particularly useful to obtain situational awareness after sudden change events such as natural disasters. That is when it is important to use the first image source of opportunity to map changes, instead of waiting for the next acquisition that permits a comparison of homogeneous images. Furthermore, for monitoring long-term trends, the joint analysis of heterogeneous sources allows us to extend the time frame of the analysis or to increase the temporal resolution. Lastly, SAR images are available also in case of cloud cover (tropical and sub-tropical areas are very prone to this phenomenon) or smoke cover, because microwaves penetrate them.
Regardless of the motivation, HCD relies on the fundamental assumption that the changed areas have a distinct signature for all the sensors involved, even though the physical origin of this signal may be different. Moreover, since an absolute reference is lacking when we contrast heterogeneous data, the problem is inherently ill-posed, and the labelling of pixels or segments as changed and unchanged is generally ambiguous. It is necessary to assume some additional prior information in order to discern the change class. A typical prior assumption is that the change concerns small regions or a minority of the pixels in an image or another one is when the characteristic signature of one of the classes involved in the transition is known. The mentioned minority assumption is common in generic methods, while signature assumptions can be advantageous to customise an algorithm for a thematic application.
While the first works on HCD were developed in the supervised setting, focus in recent years has turned to the unsupervised case [Mercier et al., 2008]. This makes the methods more suitable for practical cases, since ground truth in Earth observation is sparse and costly to collect. Another trend is that deep learning prevails more and more, as in other areas of computer vision and image analysis. Most current HCD approaches adopt transformations between the input domains, or from these to a common latent domain, to bring data to a space where they can be efficiently compared. Convolutional neural network (CNN) architectures such as autoencoders and generative adversarial networks are flexible and powerful tools that can accomplish these image translation tasks, as reviewed in [Luppino et al., 2019, 2020].

Chapter 3
Theory and technical background

Machine Learning Introduction
Machine learning is a discipline at the intersection of computer science and statistics. It is the ability to use data and models to predict some behaviour, or to use data to create a highly predictive model of a phenomenon. The core of machine learning is detecting patterns and regularities underneath the raw data. Machine learning is also part of the big world of artificial intelligence: to be intelligent, a system also needs the capability to adapt to changes in the environment. In short, machine learning is about creating models based on statistical and probabilistic rules. This thesis deploys some classical machine learning algorithms and methods as well as some modern deep learning ones. In the following, the basic knowledge needed to understand our methodology is presented; because this work is not a systematic dissertation on machine learning, nor on deep learning, only the necessary concepts will be illustrated.

Prior computation: the affinity matrix
An affinity matrix is a statistical object used to express the similarity between data points. It is constructed by setting a metric and looking for data points at minimum distance, representing them with a 1 in the matrix (a 0 means different data). It thus builds on the concept of distance, but in a reversed sense: when the distance between two instances is 0, the corresponding matrix entry is set to 1. Deploying this concept lets a machine mimic the human action of associating similar things, where similarity can follow any criterion, depending on the metric chosen. Specialising the concept, an affinity matrix can look for repetitive or similar patterns among pixels and groups of pixels. An extension of the binary affinity matrix is a matrix where each entry is calculated by applying a kernel to our data; in this case, the values can range, for example, in the interval [0, 1].
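As a minimal sketch of the kernel-based variant (assuming feature vectors stacked in a NumPy array and a user-chosen kernel width h, both illustrative), such a matrix can be computed as follows.

```python
import numpy as np

def affinity_matrix(features, h):
    """Kernel-based affinity: entries close to 1 for similar data points
    and close to 0 for dissimilar ones (a soft version of the 0/1 case)."""
    diff = features[:, None, :] - features[None, :, :]   # pairwise differences
    dist = np.linalg.norm(diff, axis=-1)                 # chosen metric: Euclidean
    return np.exp(-(dist / h) ** 2)                      # Gaussian kernel, in (0, 1]
```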

Canonical Correlation Analysis
Canonical Correlation Analysis (CCA) is a method for reducing the dimensionality of a couple of sets of samples taking into account their mutual correlation. It projects the samples in a common space where the correlation between them is maximized.
The next section introduces the CCA theory for vectors, following [Mardia et al., 1979]; the following one defines some operative rules.

Theory
Suppose to have two random vectors $\mathbf{x}$ and $\mathbf{y}$, respectively $q$-dimensional and $p$-dimensional: $\mathbf{x} \in \mathbb{R}^q$, $\mathbf{y} \in \mathbb{R}^p$. Suppose further that $\boldsymbol{\mu}_1 = E[\mathbf{x}]$ and $\boldsymbol{\mu}_2 = E[\mathbf{y}]$ are their means, and

$$\Sigma_{11} = \mathrm{Cov}(\mathbf{x}), \qquad \Sigma_{22} = \mathrm{Cov}(\mathbf{y}), \qquad \Sigma_{12} = \mathrm{Cov}(\mathbf{x}, \mathbf{y}) = \Sigma_{21}^T \qquad (3.1)$$

their covariance and cross-covariance matrices. Now consider two linear combinations

$$\eta = \mathbf{a}^T\mathbf{x}, \qquad \varphi = \mathbf{b}^T\mathbf{y}. \qquad (3.2)$$

They are projections of our vectors along the directions of $\mathbf{a}$ and $\mathbf{b}$. The correlation between $\eta$ and $\varphi$ is

$$\rho(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}^T \Sigma_{12} \mathbf{b}}{\sqrt{\mathbf{a}^T \Sigma_{11} \mathbf{a} \;\, \mathbf{b}^T \Sigma_{22} \mathbf{b}}}. \qquad (3.3)$$

Now we want to find the $\mathbf{a}$ and $\mathbf{b}$ for which the correlation is maximised. In other words, we try to solve the problem

$$(\mathbf{a}^*, \mathbf{b}^*) = \arg\max_{\mathbf{a}, \mathbf{b}} \rho(\mathbf{a}, \mathbf{b}) \quad \text{s.t.} \quad \mathbf{a}^T \Sigma_{11} \mathbf{a} = \mathbf{b}^T \Sigma_{22} \mathbf{b} = 1, \qquad (3.4)$$

because the objective in Equation 3.4 does not depend on the scaling of $\mathbf{a}$ and $\mathbf{b}$ (both the numerator and the denominator depend linearly on their magnitude); hence it is not restrictive to consider a unit-variance constraint on each projection [Alpaydin, 2014].
It is now possible to write our problem as a Lagrangian problem; we then take the partial derivatives with respect to $\mathbf{a}$ and $\mathbf{b}$ and set them to zero. After some calculation, we end up with an eigenproblem: $\mathbf{a}$ and $\mathbf{b}$ must be eigenvectors of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ and $\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$, respectively [Hardoon et al., 2004]. Because we are interested in maximising the correlation, we choose the two eigenvectors with the highest eigenvalues; let us define them as $\mathbf{a}_1$ and $\mathbf{b}_1$, of dimensions $q$ and $p$ respectively; the corresponding eigenvalue is actually just one, shared by the two matrices (the eigenvalues of $AB$ are the same as those of $BA$ [Alpaydin, 2014]). It is however possible to choose how many pairs of eigenvectors $(\mathbf{a}_i, \mathbf{b}_i)$ to use. If $k$ pairs of eigenvectors are in use, to project our data we take the $q \times k$ matrix whose columns are the $\mathbf{a}_i$, and respectively the $p \times k$ matrix whose columns are the $\mathbf{b}_i$. The new space is constituted by non-redundant features: all the $\mathbf{a}_i^T\mathbf{x}$ are mutually uncorrelated, and each $\mathbf{a}_i^T\mathbf{x}$ is uncorrelated with $\mathbf{b}_j^T\mathbf{y}$ for $i \neq j$.
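As an illustration of the eigenproblem above, here is a minimal NumPy sketch (not the thesis code) that extracts the first k canonical direction pairs from two data matrices.

```python
import numpy as np

def linear_cca(X, Y, k):
    """First k canonical direction pairs for data matrices X (N x q)
    and Y (N x p), via the standard CCA eigenproblem."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    N = X.shape[0]
    S11 = Xc.T @ Xc / (N - 1)
    S22 = Yc.T @ Yc / (N - 1)
    S12 = Xc.T @ Yc / (N - 1)
    # a_i are eigenvectors of S11^-1 S12 S22^-1 S21 (eigenvalues rho^2)
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)
    vals, vecs = np.linalg.eig(M)
    top = np.argsort(-vals.real)[:k]
    A = vecs[:, top].real                 # columns a_1, ..., a_k
    B = np.linalg.solve(S22, S12.T) @ A   # b_i proportional to S22^-1 S21 a_i
    return A, B                           # projections: X @ A and Y @ B
```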

Deep learning
Deep learning is quite a new approach to learning; however, it is based on some rather consolidated ideas, for example artificial neural networks. The term deep broadly indicates a huge neural network, and more precisely refers to a neural network with high depth, i.e., many hidden layers. It has attracted attention for some years now, because the computational power of our machines has become capable of coping with the complexity of managing very big artificial neural networks. Their fame is due to the optimal results obtained by deep nets in many applications in the most diverse fields.

Artificial Neural Networks
An artificial neural network is a collection of simple units called neurons. Each neuron is composed of a summing unit and an activation function. Suppose to have some inputs $x_j \in \mathbb{R}$, $j = 1, \dots, d$ and, for each of them, a connection weight $w_j \in \mathbb{R}$. The output in the simplest case is a weighted sum of the inputs:

$$y = \sum_{j=1}^{d} w_j x_j + w_0,$$

where $w_0$ is a bias. It is possible to write this in a more compact notation using the dot product, $y = \mathbf{w}^T\mathbf{x}$, once $\mathbf{x}$ is augmented with a constant 1 to include the bias. Learning is performed by looking for the correct vector $\mathbf{w}$. Let us also introduce the activation function $\varphi$, so that $y = \varphi(\mathbf{w}^T\mathbf{x})$. This is usually a non-linear function, e.g. a sigmoid or a ReLU (rectified linear unit), and it outputs a scalar result. Figure 3.1 helps to visually understand the concept.

A single layer of weights can approximate a linear function; instead, a connection between many neurons can also learn non-linear relations and is called a network. The simplest network is a feedforward neural network, which is built up using layers, each composed of many neurons; there are the input layer, the hidden layers and the output layer. The number of these layers represents the depth of the network: here comes the term deep learning.

The training of the network, as of the single neuron, is performed by feeding the network with data instances, one by one, and defining an error (loss) function to guide the procedure. When a datum traverses the net, the activation propagates in the forward direction, the output is calculated and the error function is evaluated. The goal is to minimise the error, which is done by calculating the derivatives of the loss with respect to all the parameters θ (weights and biases) of the network; based on this gradient, the parameters are updated. The derivatives are computed by applying the chain rule, and the result is propagated backwards along the chain to the input layer. In this way the error is propagated from output to input, and that is why the procedure is called backpropagation. The error is minimised iteration after iteration (some optimisation algorithm is used). This iterative behaviour suggests that training a network can be long and time-consuming, but it assures that the learning is continuous in time and the machine adapts to changes. The method presented is called stochastic gradient descent and starts by initialising the weights randomly [Allen-Zhu et al., 2019]. However, there are advanced methods for learning, for example batch learning procedures, where the parameters are updated after several input data and not after every sample; when the whole dataset has passed through the network, an epoch has passed.
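As a minimal illustration of the neuron and of one stochastic-gradient update (NumPy; a single sigmoid neuron with the chain rule written out by hand, not the networks used later in the thesis):

```python
import numpy as np

def neuron(x, w, w0):
    """Single neuron: weighted sum of the inputs plus the bias w0,
    followed by a sigmoid activation phi."""
    z = np.dot(w, x) + w0
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x, t, w, w0, lr=0.1):
    """One stochastic-gradient update on target t for the squared
    error L = (y - t)^2 / 2, applying the chain rule explicitly."""
    y = neuron(x, w, w0)
    delta = (y - t) * y * (1.0 - y)      # dL/dz through the sigmoid
    return w - lr * delta * x, w0 - lr * delta
```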
The interest in neural networks is also motivated by theorems which we now state; for their proofs see [Cybenko, 1989], [Csáji, 2001].
Theorem 1 Universal Approximation Theorem. A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function uniformly on a compact subset of R n , under mild assumptions on the activation function.
Even if this is possible in principle, it may not be practically feasible for an arbitrary function, due to the dimension of the network: the theorem does not say anything about constraints on the number of neurons with respect to the function complexity. However, under the assumption of a ReLU activation function, it has been demonstrated in [Lu et al., 2017] that any Lebesgue-integrable function f from $\mathbb{R}^n$ to $\mathbb{R}$ can be approximated to arbitrary accuracy by a fully connected ReLU network of width $(n + 4)$.

Convolutional Neural Networks
Convolutional neural networks are a specialised type of network for processing data that have a known grid-like topology [Goodfellow et al., 2016]. Examples are time-series data (1-D) and images (2-D grids of pixels). The name arises from the specific type of calculation performed in the CNN: a convolution. It is possible to imagine this type of network as a series of different stacks of finite impulse-response filters, each disposed in a different layer. In other words, each filter can be thought of as a convolution of the image with a kernel of smaller dimension. This type of network is extremely powerful in the image domain because it succeeds in capturing the intrinsic representation of images, as sets of edges, patterns and figures.
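For concreteness, a small stack of convolutional layers of this kind can be written in Keras, the library used in this thesis (the layer sizes here are illustrative assumptions, not the configuration of the proposed network):

```python
import tensorflow as tf

# A small stack of convolutional layers: each layer convolves its input
# with a bank of learned kernels, i.e. a stack of 2-D FIR filters.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 3, padding="same"),
])
```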

Autoencoders
An autoencoder is a specific type of network whose goal is to copy the input to the output while favouring some specific properties. It can be seen as a network composed of two parts: an encoder function e(·) and a decoder d(·). If an input x is provided, it tries to reproduce it at the output, $d(e(x)) \approx x$. In between the two parts, a code layer is present, which provides a transformed (often compressed) representation of the input data. However, this is not a complete description, because a network with exactly this behaviour would be useless. Actually, the aim of the network is to recover $\hat{x} \approx x$ by reconstructing only the interesting part of the input and discarding the features we want to get rid of. The topology of the network (Figure 3.2) is similar to a feedforward network, and the training also uses the same techniques, typically minibatch gradient descent following the gradients computed by backpropagation. Unlike the previous networks, re-circulation may also be used, that is, an output is re-used as input. The usual architecture has a code layer where the representation of the input information is compressed; this forces the network to extract only the useful features and prevents it from learning the identity transformation. If the encoder and decoder have too many parameters with respect to the problem dimension, the autoencoder may learn the useless identity transformation. To prevent this unwanted situation, some precautions have been adopted and further properties have been added to the autoencoder: sparsity of the representation, robustness to noise, smallness of the derivatives. To implement the first feature, in each learning cycle not all the samples are fed to the algorithm, but only a randomly chosen subset. For the second, some noise is added to the input samples during learning, and the outputs are compared with the original, noise-free ones. The last case, instead, is simply the trick of keeping the derivatives small enough, in order to better learn the features which are constant with respect to x.
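A minimal Keras autoencoder of this kind might look as follows (the input and code sizes are illustrative assumptions; the denoising variant is indicated in the comments):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(784,))
code = layers.Dense(32, activation="relu")(inputs)        # e(x): compressed code
outputs = layers.Dense(784, activation="sigmoid")(code)   # d(e(x)): reconstruction
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")         # minimise ||x - d(e(x))||^2

# Denoising variant: corrupt the inputs and reconstruct the clean data, e.g.
# autoencoder.fit(x_train + noise, x_train, epochs=10, batch_size=128)
```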

Deep Canonical Correlation Analysis
In the world of deep networks many different architectures have been proposed; among them, we recall here [Andrew et al., 2013] and [Wang et al., 2015]. The common ancestor of most of them is reported in Figure 3.3, and all the cited works deployed a supervised framework. In Deep Canonical Correlation Analysis (DCCA), the metric used to measure the extracted information and the performance is the "quantity of correlation": the sum of the correlations along the top most correlated directions. Indeed, [Andrew et al., 2013] focused on the quantity of correlation which can be extracted with different methodologies and demonstrated that DCCA extracts much more correlated features compared to CCA and Kernel CCA (KCCA). Training this network with a gradient descent algorithm needs a custom gradient function, which [Andrew et al., 2013] provided. Moreover, because the correlation objective is a function of the entire training set and cannot be decomposed into a sum over the data points, a fully stochastic approach is not feasible; the article instead proposes mini-batch or full-batch optimisation.

Figure 3.3: The common DCCA architecture [Wang et al., 2015]. It includes two neural network encoders, whose outputs are transformed into a maximally correlated domain through the U and V matrices.

Deep Canonically Correlated Autoencoders
Inspired by DCCA, [Wang et al., 2015] proposed the Deep Canonically Correlated AutoEncoder (DCCAE), see Figure 3.4. This network implements a trade-off: the autoencoders maximise the information retained between the inputs and the learned features, while the CCA maximises the mutual information between the two views. From this perspective, it can represent more sophisticated interactions between data than a simple DCCA, and also than autoencoders alone.
The DCCAE network has been adopted by [Zhou et al., 2019] to perform a heterogeneous change detection task, showing good results and low variance; in particular, it outperformed CCA and DCCA. The training has been carried out with mini-batch gradient descent, but with a batch size big enough to be representative in the calculation of the sample covariances.

Figure 3.4: The DCCAE architecture [Wang et al., 2015]. Two autoencoders are tied together by a CCA performed in the latent space.

Chapter 4
The Proposed Change Detection Method

General idea of the methodology
This chapter is devoted to explaining the proposed heterogeneous change detection method. The methodology proposed in this work aims to compare two different types of data which, on their own, could not be compared. Indeed, it is not possible to use a traditional change detection method in this setting; for instance, simply subtracting the two images does not have any meaning. As mentioned above, the two images lie in two different domains; hence we need to transform them into a common domain and then compare them. Denoting the two images to be compared as X and Y, and using the same names for their respective domains, we can summarise the scheme as in Figure 4.1. The figure shows that not only is it possible to convert the two images into a common domain, but it is also feasible to convert one image into the domain of the other. Theoretically, this allows comparing the two images in the domain of X, or of Y, or in the latent space Z. In this framework (Figure 4.1) the arrows represent regression functions. In particular, each of them is a neural network properly trained for the purpose.
At this stage, we have brought the two images into a shared (or common) space, where it makes sense to use some elementary change detection method (e.g. image differencing). However, we have introduced some neural networks, which need training to be used. In remote sensing, labelled samples are needed to train a network, and they are difficult and expensive to retrieve. Therefore, in our case, we want to train the networks to transform one image into the other using nothing more than the input data. What would happen if we trained the networks directly on our two images? We could do so, because they provide examples of the two distributions we would like the networks to learn. However, we must recall that the two images are taken at different times and generally exhibit changes, and this is a big issue for our learning: we want the networks to learn to individuate changes as abnormal patterns, and not as a rule. Thus, an innovative technique is used to automatically retrieve training samples located in likely unchanged areas from our data [Luppino et al., 2019], turning our procedure into a completely unsupervised method. This stage is conceived as a method to extract information and to return a probability-like score that expresses the chance that each pixel has changed from one acquisition to the other; this is explained in [Luppino et al., 2020].
The second big issue to solve is the problem of being sure that the latent space Z is unique and is a common transformed space for both mappings R (X) and P (Y ) (see Figure 4.1). To assure this consistency, a technique involving Canonical Correlation Analysis is proposed [Figari Tomenotti et al., submitted]. A very attractive feature of CCA is that "if there is noise in either view that is uncorrelated with the other view, the learned representation should not contain the noise in the uncorrelated dimension" [Andrew et al., 2013].
The change detection scheme applied in the rest of the chapter is explained in detail in the following sections.

Problem setting
Two different sensors scan a geographical area at two different moments in time. We denote the two sensors (and also their respective domains) as X and Y, and the respective acquisition times as $t_1$ and $t_2$. The two sensors generate two images with the same height H and width W (up to possible re-sampling and co-registration). The two images generally include different numbers of channels, identified as $C_1$ and $C_2$. Thus, the two images are respectively $X \in \mathbb{R}^{H \times W \times C_1}$ and $Y \in \mathbb{R}^{H \times W \times C_2}$. In the following, it is also assumed that only a limited part of the image contains changes; this is crucial because we need a reliable non-changed part to train our networks (regression functions).

Affinity-based Change Prior
Our prior information is an affinity-based cross-domain pixel distance proposed in [Luppino et al., 2020], which is interpreted as the probability of change of each pixel. The following procedure is applied to an image patch of dimension $k \times k$ and, once computed, the patch position is shifted in order to progressively cover all pixels of the entire image; it is applied to the images from both modalities. Firstly, we compute the domain-specific affinity matrices $A^X$ and $A^Y$, whose elements $A^X_{ij}$ and $A^Y_{ij}$ are pairwise affinities between pixels i and j belonging to the patch. These are computed from pairwise distance measures $d^X_{ij}$ and $d^Y_{ij}$ by use of the common Gaussian kernel function with kernel widths $h_X$ and $h_Y$. The two kernel widths are domain-specific, and set equal to the average distance to the K-th nearest neighbour, with $K = \frac{3}{4}k^2$. This method allows capturing an intrinsic distance inside the patch [Luppino et al., 2020]. Moreover, the distance measures d are computed as Euclidean distances. This choice is understandable considering the domain and the data distribution: optical images have a Gaussian behaviour (in intensity), whereas SAR images can be transformed by applying a logarithm, bringing them to near-Gaussianity [Zhan et al., 2018].
Noting that the matrices A are symmetric, the cross-domain pixel distance for pixel i is obtained as

$$\alpha_i = \frac{1}{n} \sum_{j=1}^{n} \left| A^X_{ij} - A^Y_{ij} \right|,$$

which is the average absolute affinity difference between pixel i and the n other pixels. This assures that $\alpha_i \in [0, 1]$, providing small values when the pixel relations, within the size-n image patch or neighbourhood, remain similar across image domains, and large values otherwise. This is reasonable because only changes between the images should produce large values in the difference matrix. This method is very powerful at finding changed patterns inside images and assigning to every single pixel a probability of being changed. Even if the method is not too heavy for modern computational power, it can be sped up using a sliding window which moves by more than one pixel at a time.
Of course, this comes with a resolution degradation. To examine the prior retrieval in depth, it is possible to refer to [Luppino et al., 2020], where a useful toy example is presented.
We will utilise $\alpha_i$ to suppress the influence of pixels with a high probability of change; therefore, we must define a weighting function $\Pi(\alpha) : [0, 1] \to [0, 1]$ that is monotonically decreasing. Hence, the higher $\Pi(\alpha)$, the lower the probability of that pixel having changed from one acquisition to the other, and the higher the confidence in using it as a learning sample. We use the simple function

$$\Pi(\alpha) = 1 - \alpha, \qquad (4.4)$$

however, other decreasing functions could be adapted and used. The computation of this matrix is meant to be performed offline: the $\Pi(\alpha_i)$ values can be calculated, stored and used when needed.
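Putting the pieces of this section together, a sketch of the prior computation for a single co-located patch pair might look as follows (NumPy; the nearest-neighbour rule for the kernel width follows the description above, while the remaining details are assumptions, not the thesis implementation).

```python
import numpy as np

def change_prior(patch_x, patch_y):
    """alpha_i for one co-located k x k patch pair: mean absolute
    difference between the two affinity matrices, row by row."""
    fx = patch_x.reshape(-1, patch_x.shape[-1]).astype(float)
    fy = patch_y.reshape(-1, patch_y.shape[-1]).astype(float)
    n = fx.shape[0]
    K = int(0.75 * n)                    # K = (3/4) k^2 nearest neighbour

    def affinity(f):
        d = np.linalg.norm(f[:, None] - f[None, :], axis=-1)
        h = np.sort(d, axis=1)[:, K].mean()   # avg distance to the K-th NN
        return np.exp(-(d / h) ** 2)          # Gaussian kernel

    alpha = np.abs(affinity(fx) - affinity(fy)).mean(axis=1)   # in [0, 1]
    return alpha, 1.0 - alpha            # alpha_i and the weights Pi(alpha_i)
```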

CCA formulation
The Canonical Correlation Analysis has been formulated as in [Wang et al., 2015], but adding the prior information to it. It is clear from Section 3.3 that CCA is a linear method and extracts the covariances $\Sigma_{11}$, $\Sigma_{22}$ and the cross-covariance $\Sigma_{12}$. The approach we choose is to insert here the result of 4.4. So, modifying the equations 3.1 by weighting each sample with its $\Pi(\alpha_i)$, and using N as the number of samples (pixels) taken into account, we obtain

$$\hat{\Sigma}_{11} = \frac{1}{N} \sum_{i=1}^{N} \Pi(\alpha_i)\, \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^T, \qquad \hat{\Sigma}_{22} = \frac{1}{N} \sum_{i=1}^{N} \Pi(\alpha_i)\, \tilde{\mathbf{y}}_i \tilde{\mathbf{y}}_i^T, \qquad \hat{\Sigma}_{12} = \frac{1}{N} \sum_{i=1}^{N} \Pi(\alpha_i)\, \tilde{\mathbf{x}}_i \tilde{\mathbf{y}}_i^T,$$

where $\tilde{\mathbf{x}}_i$ and $\tilde{\mathbf{y}}_i$ are the centred samples. Equivalently, the weighting can be written as a Hadamard product ∘ (element-wise multiplication) between the data matrices and the replicated weight vector; this changes neither the dimensions of the matrices nor the order of the main eigenvalues, provided that the matrices involved are positive-definite. The $\hat{\Sigma}$ are positive semi-definite by construction, but to avoid singularities in the computation of the inverses, a small δ has been introduced when needed. The results of the CCA block are the optimal projection matrices $U = [u_1, \dots, u_L]$ and $V = [v_1, \dots, v_L]$.
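A minimal sketch of this prior-weighted covariance estimation is given below (NumPy; the normalisation of the weights and the ridge term δ are implementation assumptions). The resulting matrices can then be fed to the CCA eigenproblem sketched in Chapter 3.

```python
import numpy as np

def weighted_covariances(X, Y, pi, delta=1e-6):
    """Prior-weighted (cross-)covariances: each pixel's contribution is
    scaled by Pi(alpha_i), so likely-changed pixels barely influence
    the canonical directions. X is N x C1, Y is N x C2, pi has length N."""
    w = pi / pi.sum()                         # normalised weights
    Xc = X - (w[:, None] * X).sum(axis=0)     # weighted centring
    Yc = Y - (w[:, None] * Y).sum(axis=0)
    Xw = Xc * np.sqrt(w)[:, None]             # element-wise (Hadamard) weighting
    Yw = Yc * np.sqrt(w)[:, None]
    S11 = Xw.T @ Xw + delta * np.eye(X.shape[1])   # small ridge against singularity
    S22 = Yw.T @ Yw + delta * np.eye(Y.shape[1])
    S12 = Xw.T @ Yw
    return S11, S22, S12                      # fed to the CCA eigenproblem
```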

Deep Canonical Correlation Analysis with Autoencoders

The network topology
The chosen topology is similar to the one in [Luppino et al., 2020], and it is inspired by [Wang et al., 2015] for what concerns the CCA block. In our methodology, we are interested in taking advantage of our prior information inside the just-mentioned framework. As far as we know, we are the first to deploy a DCCAE methodology in an unsupervised fashion. The architecture is composed of two autoencoders, coupled in a novel fashion through the computation of different losses. The four networks are deep convolutional neural networks, and they implement image regression functions. The encoders take images as input and transform them into a common domain called Z; the functions are $R(\cdot) : \mathbb{R}^{H \times W \times C_1} \to \mathbb{R}^{H \times W \times C_z}$ and $P(\cdot) : \mathbb{R}^{H \times W \times C_2} \to \mathbb{R}^{H \times W \times C_z}$, so the image dimensions are preserved in the latent space, whose feature dimension $C_z$ is a common parameter. The decoders $Q(\cdot)$ and $S(\cdot)$ perform the inverse transformations, taking the images from the Z domain back to the X and Y domains respectively. Additionally, the CCA block performs a linear correlation analysis between the outputs of the two encoders, thus highlighting the most canonically correlated features and calculating the correlation itself for each feature. Figure 4.3 presents the network topology.
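To make the four-network topology concrete, here is a minimal Keras sketch (layer counts and channel numbers are illustrative assumptions, not the thesis configuration) of the two encoders and two decoders used in the following sections.

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_net(c_in, c_out, name):
    """Fully convolutional block: height and width are preserved,
    only the channel (feature) dimension changes."""
    return tf.keras.Sequential([
        layers.Conv2D(32, 3, padding="same", activation="relu",
                      input_shape=(None, None, c_in)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(c_out, 3, padding="same"),
    ], name=name)

C1, C2, Cz = 3, 2, 8                  # e.g. optical, SAR and latent channels
R = make_net(C1, Cz, "encoder_X")     # R: X -> Z
P = make_net(C2, Cz, "encoder_Y")     # P: Y -> Z
Q = make_net(Cz, C1, "decoder_to_X")  # Q: Z -> X
S = make_net(Cz, C2, "decoder_to_Y")  # S: Z -> Y
# Reconstructions: Q(R(x)) ~ x, S(P(y)) ~ y; translations: S(R(x)), Q(P(y)).
```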

Training and losses definition
The training phase of the network is crucial for the system, and it has been studied in depth in order to assure fast and robust training. The training parameters are the network weights, collected in a vector called ϑ. The overall loss function has been designed ad hoc, and it consists of four loss terms with respective weights:
$$\mathcal{L}_{tot} = \lambda_{CCA}\,\mathcal{L}_{CCA} + \lambda_{Recon}\,\mathcal{L}_{Recon} + \lambda_{\alpha}\,\mathcal{L}_{\alpha} + \lambda_{Cross}\,\mathcal{L}_{Cross} \qquad (4.8)$$

Canonical Correlation. The canonical correlation loss is computed on the outputs of the encoders, and the loss term is defined as follows (analogous, but not identical, to the representation in 3.3):

$$\mathcal{L}_{CCA} = -\frac{1}{n}\,\mathrm{tr}\!\left( U^T R(x)\, P(y)^T V \right) \qquad (4.9)$$

where n is the total number of pixels in a patch, U and V are the optimal transformation matrices, x and y represent co-located patches of the respective images, and tr is the matrix trace. U and V are now matrices, no longer vectors as in Section 3.3, because x and y are multi-channel images. This term forces the two autoencoders to converge to the same latent space, which is the space where the correlation between the retrieved representations is maximised. It is possible to set the latent space dimension $C_z$ (feature dimension) as large as desired, respecting the constraint imposed by the CCA (the number L of canonical pairs cannot exceed $C_z$).

Reconstruction of the input. Obviously, having autoencoders, we want the outputs to be as similar to the inputs as possible; in other words, the reconstruction from the latent space should be as faithful as possible. That is, we would like to have $X \approx \hat{X} = Q(R(X))$, and analogously for Y. Recalling that ϑ is the weight vector of the entire network, and calling x and y the vectors collecting the data of an image patch centred on the same pixel in the two image domains, the loss term is defined as

$$\mathcal{L}_{Recon} = \mathbb{E}_X\!\left[ \left\| x - Q(R(x)) \right\|_2^2 \right] + \mathbb{E}_Y\!\left[ \left\| y - S(P(y)) \right\|_2^2 \right]. \qquad (4.10)$$

It is clear from 4.10 that the raw difference between input and output should be minimised. Figure 4.4 illustrates the operations yielding the terms of the loss.

Prior weighted similarity. This is one of the novelties recently proposed in [Luppino et al., 2020]: it encapsulates the use of the prior information about the probability of each pixel being changed in the translation of the images. In other words, we would like our network to learn the transformation from one domain to the other, so $\hat{X} \triangleq Q(P(Y)) \approx X$ must hold. However, the network must learn only from unchanged pixels, so during the learning phase a correction term is used, as stated in Equation 4.11. In order to define this loss, the following notation is needed:

$$\| a \|_\Pi^2 = \sum_i \Pi(\alpha_i)\, \| a_i \|_2^2,$$

where $a_i$ is a generic feature vector representing the i-th pixel in a patch, and its modulus is the sum of the squares of all the features (Euclidean metric). The weighting by Π is applied pixel-wise on the pixel plane within the patch represented by the vector a. In other words, $\| a \|_\Pi^2$ is the squared modulus of a, weighted by Π pixel-wise. The loss term is

$$\mathcal{L}_{\alpha} = \mathbb{E}_{X,Y}\!\left[ \left\| G(y) - x \right\|_\Pi^2 \right] + \mathbb{E}_{X,Y}\!\left[ \left\| F(x) - y \right\|_\Pi^2 \right] \qquad (4.11)$$

where $F(\cdot) \triangleq S(R(\cdot))$ and $G(\cdot) \triangleq Q(P(\cdot))$.

Consistency cycle. As pointed out in [Zhu et al., 2017], domain translations should maintain consistency cyclically; that is, after the data have been transformed once, they can be transformed back to their original domain without becoming meaningless or losing properties. If the regression functions are rightly tuned, $X \approx Q(P(\hat{Y})) = Q(P(S(R(X))))$ must hold, and to force our network to maintain this alignment we introduce

$$\mathcal{L}_{Cross} = \mathbb{E}_{X,Y}\!\left[ \left\| G(F(x)) - x \right\|_2^2 \right] + \mathbb{E}_{X,Y}\!\left[ \left\| F(G(y)) - y \right\|_2^2 \right] \qquad (4.12)$$

s.t.

$$U^T\!\left(\tfrac{1}{N}\tilde{H}_1\tilde{H}_1^T + r_1 I\right)\!U = I, \qquad V^T\!\left(\tfrac{1}{N}\tilde{H}_2\tilde{H}_2^T + r_2 I\right)\!V = I, \qquad u_i^T\hat{\Sigma}_{12}v_j = 0 \;\;\text{for}\; i \neq j,$$

where $r_1$, $r_2$ are regularisation parameters of the CCA and $\tilde{H}_1$, $\tilde{H}_2$ are the centred matrices of latent representations (defined in the next section). The constraints have been taken into account in the CCA evaluation, and they assure uncorrelated directions within each projection matrix; this leads to maximising the information kept in the transformed space. The expectations in the loss contributions 4.10, 4.11, and 4.12 are estimated as sample means over a random ensemble of fixed-size image patches drawn from the two image domains X and Y.
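As a sketch of how the four terms of Equation 4.8 can be combined (reusing the R, P, Q, S networks from the earlier sketch; `cca_loss` is an assumed callable returning the negative total correlation between the two latent representations, and all names are illustrative):

```python
import tensorflow as tf

def total_loss(x, y, pi_alpha, lam):
    """Weighted sum of the four terms of Equation 4.8. pi_alpha holds
    the per-pixel weights Pi(alpha_i); lam is a dict of the lambdas."""
    zx, zy = R(x), P(y)
    l_cca = cca_loss(zx, zy)                        # negative total correlation
    l_rec = (tf.reduce_mean((x - Q(zx)) ** 2) +     # reconstruction, Eq. 4.10
             tf.reduce_mean((y - S(zy)) ** 2))
    w = pi_alpha[..., None]                         # broadcast over channels
    l_alpha = (tf.reduce_mean(w * (x - Q(zy)) ** 2) +   # prior-weighted, Eq. 4.11
               tf.reduce_mean(w * (y - S(zx)) ** 2))
    l_cross = (tf.reduce_mean((x - Q(P(S(zx)))) ** 2) + # cycle, Eq. 4.12
               tf.reduce_mean((y - S(R(Q(zy)))) ** 2))
    return (lam["cca"] * l_cca + lam["rec"] * l_rec +
            lam["alpha"] * l_alpha + lam["cross"] * l_cross)
```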

The back-propagation
Back-propagation through the network is straightforward as far as the neural networks themselves are concerned; however, in order to minimise also with respect to α with a gradient-based optimisation, as done in this thesis, a manually written procedure had to be added. Indeed, the gradient of corr(H₁, H₂) is required, and the derivation by [Andrew et al., 2013] has been followed. The formula, whose demonstration can be found in that paper, reads

    ∂corr(H₁, H₂)/∂H₁ = (1/(m−1)) ( 2∇₁₁H̄₁ + ∇₁₂H̄₂ )    (4.14)

with ∇₁₂ = Σ̂₁₁^(−1/2) U V⊤ Σ̂₂₂^(−1/2) and ∇₁₁ = −(1/2) Σ̂₁₁^(−1/2) U D U⊤ Σ̂₁₁^(−1/2), where H₁ and H₂ are the transformed X and Y in the Z domain, H̄₁ = H₁ − (1/m) H₁1 is the centred data matrix (and analogously H̄₂ for H₂), and UDV⊤ is the singular value decomposition of the matrix T = Σ̂₁₁^(−1/2) Σ̂₁₂ Σ̂₂₂^(−1/2), which yields the solution of the canonical correlation.
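As an illustration of the quantities involved, the following NumPy sketch computes the total correlation corr(H₁, H₂) through the matrix T; the canonical correlations are the singular values of T, and their sum is the quantity whose gradient Equation 4.14 provides. Variable names (h1, h2, r1, r2) are chosen here for illustration, and m denotes the number of samples.

    import numpy as np

    def total_correlation(h1, h2, r1=1e-4, r2=1e-4):
        """Sum of canonical correlations between h1 and h2 (features x m)."""
        m = h1.shape[1]
        h1c = h1 - h1.mean(axis=1, keepdims=True)   # centred data matrices
        h2c = h2 - h2.mean(axis=1, keepdims=True)
        s11 = h1c @ h1c.T / (m - 1) + r1 * np.eye(h1.shape[0])  # regularised covariances
        s22 = h2c @ h2c.T / (m - 1) + r2 * np.eye(h2.shape[0])
        s12 = h1c @ h2c.T / (m - 1)
        # T = S11^(-1/2) S12 S22^(-1/2); its singular values are the canonical correlations
        e1, v1 = np.linalg.eigh(s11)
        e2, v2 = np.linalg.eigh(s22)
        s11_inv_sqrt = v1 @ np.diag(e1 ** -0.5) @ v1.T
        s22_inv_sqrt = v2 @ np.diag(e2 ** -0.5) @ v2.T
        t = s11_inv_sqrt @ s12 @ s22_inv_sqrt
        u, d, vt = np.linalg.svd(t)
        return d.sum()  # U and V^T give the projection directions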

Thresholding
The final result is obtained by differencing the transformed images and applying a suitable thresholding algorithm, in order to retrieve the two classes: changed and not-changed. The difference image is obtained as a weighted mean (over the number of channels) of |X̃ − X̂| and |Ỹ − Ŷ|. The algorithm used to detect the optimal threshold in an unsupervised way is Otsu's method [Otsu, 1979]. It is an iterative method which minimises the intra-class variance, defined as a weighted sum of the variances of the two classes, where the weights are the class probabilities induced by setting the threshold at a given grey level. It exploits the fact that minimising the intra-class variance is equivalent to maximising the inter-class variance, as stated in [Otsu, 1979].
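A compact NumPy sketch of this step follows; it assumes the two channel-averaged difference images dx and dy have already been computed, and finds the Otsu threshold by exhaustively maximising the inter-class variance over the grey-level histogram.

    import numpy as np

    def otsu_threshold(img, n_bins=256):
        """Return the threshold maximising the inter-class variance [Otsu, 1979]."""
        hist, edges = np.histogram(img.ravel(), bins=n_bins)
        p = hist.astype(float) / hist.sum()       # grey-level probabilities
        centers = 0.5 * (edges[:-1] + edges[1:])
        w0 = np.cumsum(p)                         # probability of the class below the threshold
        mu = np.cumsum(p * centers)               # cumulative mean
        mu_t = mu[-1]                             # global mean
        with np.errstate(divide="ignore", invalid="ignore"):
            sigma_b = (mu_t * w0 - mu) ** 2 / (w0 * (1.0 - w0))  # inter-class variance
        return centers[np.nanargmax(sigma_b)]

    # difference image as a weighted mean of the two domain differences,
    # then binarised into changed / not-changed:
    # d = 0.5 * (dx + dy)
    # change_map = d > otsu_threshold(d)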

Filtering
Lastly, to obtain a clean and meaningful result, a filtering algorithm is applied. It helps against spurious results, especially isolated pixels, i.e. pixels classified as changed within a contiguous, extended non-changed region or vice versa; this is done because it is very unlikely for a single pixel to be classified differently from all of its neighbours. The method proposed in [Krähenbühl et al., 2011] is used: it exploits spatial context by filtering with a fully connected conditional random field model, which defines the pairwise edge potentials between all pairs of pixels in the image by a linear combination of Gaussian kernels in an arbitrary feature space. Iterative optimisation of the random field has one main downside: it requires the propagation of all the potentials across the image. However, this highly efficient algorithm reduces the computational complexity from quadratic to linear in the number of pixels, by approximating the random field with a mean field whose iterative update can be computed using Gaussian filtering in the feature space. The number of iterations and the width of the Gaussian kernels are the only hyper-parameters set manually; we tuned them according to [Luppino et al., 2019]: 5 iterations and a kernel width of 0.1.
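The following is a deliberately simplified sketch of the mean-field update with a purely spatial Gaussian kernel; a full implementation would also use appearance (bilateral) kernels as in [Krähenbühl et al., 2011]. The names prob_change, n_iters, width and compat are assumptions for the example, and the mapping of the kernel width onto the image size is likewise illustrative.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def meanfield_smooth(prob_change, n_iters=5, width=0.1, compat=1.0):
        """Approximate CRF mean-field filtering of a change-probability map."""
        eps = 1e-8
        # class marginals: [not-changed, changed]
        q = np.stack([1.0 - prob_change, prob_change], axis=0)
        unary = -np.log(np.clip(q, eps, 1.0))     # unary potentials
        sigma = width * max(prob_change.shape)    # kernel width relative to image size
        for _ in range(n_iters):
            # message passing: Gaussian filtering of the current marginals
            msg = np.stack([gaussian_filter(q[k], sigma) for k in range(2)], axis=0)
            # Potts compatibility: each label is penalised by the other labels' messages
            energy = unary + compat * (msg.sum(axis=0) - msg)
            q = np.exp(-energy)
            q /= q.sum(axis=0, keepdims=True)     # normalise per pixel
        return q[1]  # smoothed probability of change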

Extensions of the proposed approach

Linear CCA
The simplest alternative that has been investigated is the use of plain linear Canonical Correlation Analysis, as it is. The procedure is: 1) feed the images to the CCA; 2) the CCA finds a representation where the correlation is maximised; 3) project the two images into the new space; 4) subtract one image from the other to obtain a difference image (a minimal sketch is given below). This alternative is expected to work if the two images under examination have some common or contiguous bands, whereas it is unlikely to work in a completely heterogeneous setting. This is because the transformation applied by the CCA is linear: it can compensate for differences in certain data types, or extract the most similar features from two images, but it cannot perform a non-linear transformation.
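A minimal sketch of this baseline with scikit-learn follows; img_x and img_y are assumed to be co-registered images of shape (H, W, channels), and n_components is the dimension of the projection space.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def linear_cca_difference(img_x, img_y, n_components=3):
        """Project both images onto their most correlated directions and difference them."""
        h, w, cx = img_x.shape
        x = img_x.reshape(-1, cx)
        y = img_y.reshape(-1, img_y.shape[-1])
        cca = CCA(n_components=n_components)
        x_c, y_c = cca.fit_transform(x, y)      # canonical scores of the two images
        d = np.linalg.norm(x_c - y_c, axis=1)   # per-pixel difference magnitude
        return d.reshape(h, w)                  # feed this to Otsu thresholding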

Variations on Deep Canonical Correlation Analysis
Another interesting network explored is the Deep Canonical Correlation Analysis proposed by [Andrew et al., 2013]. The CCA was of course modified to use our prior information as described above; the network does not include any feedback, and the only loss considered is the maximisation of the canonical correlation of the two views. One can therefore say that the two networks learn a non-linear variant of CCA, because the non-linearity comes from the neural networks themselves.

Variation of DCCAE
A variation of DCCAE was also implemented for testing; its main architecture is shown in Figure 4.9. It is very similar to the previous network, but the core difference is that the encoder outputs are projected along the directions of maximum correlation. The operation is performed by multiplying the encoder outputs by U and V, the optimal projection matrices. In theory, it is also possible to try this configuration with different dimensions for the latent space Z. Nevertheless, this alternative implementation has a drawback: the projection along the most correlated directions can change abruptly from one iteration to another, leaving encoders and decoders largely unpaired. This happens because the CCA is simply a linear method looking for correlation; during the training of the network, the directions which maximise the CCA functional can change too quickly for the network to adapt. This behaviour was easy to observe: during training, while the network was learning and the losses were decreasing, sudden peaks occurred, and these peaks were related to changes in the CCA transformation. Other approaches were also tried, such as training encoders and decoders separately, or performing the CCA only once per epoch; however, the limitation persists. This network was therefore abandoned, although it leaves an open question, namely finding a CCA-like algorithm whose representation changes slowly from one iteration to the next.

DCCAE with latent space differentiation
The last architecture we would like to mention is a DCCAE (as in Section 4.5.2) with the differencing procedure for change detection performed in the latent space Z, where the information should be maximally correlated. However, even when built as in Figure 4.3 or 4.9, it has some limitations, as shown in preliminary tests: the code space is not guaranteed to be perfectly aligned at all times. This deficiency is probably due to the weak constraints imposed on the network at that point, which do not enforce a robust and persistent representation in the Z domain.

Chapter 5
Experimental Results

Environment
All the experiments were run on a graphics processing unit (GPU) server of the Machine Learning Group at UiT. The code was implemented in Python and TensorFlow 2.0, and will be made available online soon.

Datasets description
Flood in California
Figure 5.1a displays the RGB channels of a Landsat 8 acquisition of Sacramento County, Yuba County and Sutter County, California, on 5 January 2017. The Operational Land Imager (OLI) and the Thermal InfraRed Sensor (TIRS) on Landsat 8 together acquire data in 11 channels, from deep blue up to thermal infrared. This area was affected by a flood in February of the same year; the second acquisition, in Figure 5.1b, was taken by the single C-band SAR of Sentinel-1A on 18 February 2017 and recorded in polarisations VV and VH. The ratio between the two intensities is included both as the blue component of the false-colour composite in Figure 5.1b and as the third channel provided as input to the networks. All the SAR channels were log-transformed. The ground truth in Figure 5.1c was provided by [Luppino et al., 2019] and was manually annotated. The original images were 3500 × 2000 pixels, but they were resampled to 850 × 500 pixels because of computational time constraints.

Forest fire in Texas
Bastrop County in Texas was struck by a forest fire during September–October 2011. The Landsat 5 Thematic Mapper (TM) acquired the pre-event image, a multispectral optical image with 7 bands. The Earth Observing-1 Advanced Land Imager (EO-1 ALI) acquired the post-event multispectral optical image with 10 bands. The resulting co-registered and cropped images of size 1520 × 800 are displayed in false colour in Figure 5.2a and Figure 5.2b. Some of the spectral bands of the two instruments (7 and 10 in total, respectively) overlap, so the signatures of the involved land covers are partly similar. Volpi et al. [Volpi et al., 2015] provided the ground truth shown in Figure 5.2c. Table 5.1 compares the bands of the two instruments.
Table 5.1: bands of Landsat-5 TM and EO-1 ALI. The bands partially overlap. λ stands for the wavelength, and GSD for the Ground Sample Distance, which is related to the resolution on the ground.

Affinity-based Prior
The prior was computed offline, even though it is not a heavy task. The procedure was explained in the previous chapter; here the results are presented in Figure 5.3, where the greyscale (black to white) indicates an increasing probability of not being changed. Keeping in mind the ground truths above (Figures 5.2c and 5.1c), it is possible to compare them with these images, which represent the prior information: how much to trust each pixel when learning the transformation. However, visual inspection may be misleading, because the histograms of the images have been stretched for visualisation; the real prior takes values which are far more similar to each other (refer to Figures 5.4 and 5.5) than the illustrated ones. Nevertheless, it is worth noting how, especially in the Texas case, the change is very well localised and precisely identified. This suggests that the prior information is very reliable, remembering that it was extracted from the images alone. The drawback of this prior is the large number of false alarms: pixels which are actually unchanged but are represented here in black, i.e. marked as likely changed. Since the prior is used to learn a transformation, a sensible choice is to rely only on highly trustworthy pixels, even if they are a minority.
However, not every class present in the image can be represented by highly reliable pixels. In more statistical terms, the prior procedure struggles to find a bi-modal representation of the data, where highly reliable pixels are far away from highly unreliable ones; the fact that not every class is surely represented in one of the two modes can be a weakness of the algorithm. It is noteworthy, from visual inspection of the real (non-stretched) histograms in Figures 5.4 and 5.5, that the Texas dataset shows a slightly bimodal distribution: the two classes, changed and unchanged, appear separable, with the changed pixels represented in black (values towards zero), and the histogram clearly shows that this class is a minority. On the contrary, in the California prior histogram it is not possible to infer the two classes by visual inspection; there is only a small asymmetry of the distribution towards the unchanged side. This prior is nevertheless really useful and powerful information. It is beyond its scope to be used as a stand-alone change detection procedure, yet in some cases it can provide immediate visual feedback to locate and identify the changes.

Evaluation metric
To evaluate the performance of the proposed methods, Cohen's Kappa (κ) was adopted along with other standard coefficients. Cohen's Kappa, first introduced by [Cohen, 1960], measures inter-rater reliability and is considered more robust than a simple percent agreement calculation. There are some criticisms about its use, due to the fact that it is not symmetrical and to the difficulty of interpreting its value in some situations. Nevertheless, κ was adopted here to measure the similarity between the retrieved solution of the change detection process and the ground truth, treating them as two distributions to be compared. The comparison considers four classes: true positive, true negative, false positive and false negative. The correctly classified pixels belong to the first two classes, whereas the wrongly classified ones fall in the other two. The confusion matrix is used for the calculation and is constructed as shown in Figure 5.6. The coefficient is defined as

    κ = (p_o − p_e) / (1 − p_e)

where p_o is the observed agreement (the fraction of correctly classified pixels), p_e is the agreement expected by chance, and N is the total number of observations (or pixels) used to estimate them. Thus −1 ≤ κ ≤ 1. This coefficient removes the agreement by chance from the percent agreement, ensuring a more balanced judgement.
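A small NumPy sketch of the computation from the four confusion-matrix entries follows; tp, tn, fp and fn are assumed to be pixel counts.

    def cohens_kappa(tp, tn, fp, fn):
        """Cohen's kappa from confusion-matrix counts."""
        n = tp + tn + fp + fn
        p_o = (tp + tn) / n                                # observed agreement
        # chance agreement from the marginals of the confusion matrix
        p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
        return (p_o - p_e) / (1.0 - p_e)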

Settings
The developed architecture was composed of encoders and decoders, each made of two hidden layers with 100 filters each. The chosen activation function was a Leaky ReLU [Maas et al., 2013] with a negative slope of γ = 0.3, except for the last layer of each network, which was a fully convolutional layer with a tanh(·) activation function. The optimisation method chosen was batch gradient descent, with a learning rate decreasing according to an exponential decay. During training, data augmentation (patch rotations and flipping) was applied; moreover, dropout with a rate of β = 0.2 was used to increase the generalisation capacity. During the optimisation, gradient clipping was set to 1, in order to always keep the gradient value under a given threshold even when it would otherwise be quite large.
The overall network was trained for 100 epochs, where each epoch was composed of 5 batches and each batch of 20 patches. Each square patch, in turn, was 100 pixels per side, so 10⁴ pixels in total. All these dimensions were fixed so as to have enough pixels to compute meaningful (and representative) sample covariances, while keeping the machine memory constraints in mind. The latent space dimension was fixed to 3, in order to be suitable for both datasets: the California dataset has an image with only 3 features, and for the Texas dataset it seemed a fair code space dimension, especially in light of Figure 5.18, which will be explained later on. All the other settings were tuned manually by trial and error on the two datasets at our disposal. Recalling the loss function 4.8, the λ coefficients were defined as in Table 5.2. Regarding the last three parameters, λ_regularisation1 and λ_regularisation2 are the CCA regularisation coefficients, which avoid singular covariance matrices and are set as found in the literature [Wang et al., 2015], while λ_l2 is the regularisation parameter that prevents overfitting in the gradient descent method.
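A minimal Keras sketch consistent with these settings follows. It is an illustration only: the kernel size (3 × 3), the initial learning rate and the input/output channel counts are assumptions, since only the layer count, filter number, activations, dropout rate and clipping are specified above.

    import tensorflow as tf
    from tensorflow.keras import layers

    def make_coder(out_channels, name):
        # two hidden convolutional layers with 100 filters, Leaky ReLU (slope 0.3)
        # and dropout 0.2, plus a final convolutional layer with tanh activation
        return tf.keras.Sequential([
            layers.Conv2D(100, 3, padding="same"),
            layers.LeakyReLU(0.3),
            layers.Dropout(0.2),
            layers.Conv2D(100, 3, padding="same"),
            layers.LeakyReLU(0.3),
            layers.Dropout(0.2),
            layers.Conv2D(out_channels, 3, padding="same", activation="tanh"),
        ], name=name)

    enc_x = make_coder(3, "encoder_x")   # latent space dimension fixed to 3
    dec_x = make_coder(11, "decoder_x")  # e.g. 11 channels for the Landsat 8 image

    # exponentially decaying learning rate and gradient clipping, as described above
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        1e-3, decay_steps=100, decay_rate=0.96)
    opt = tf.keras.optimizers.SGD(learning_rate=schedule, clipnorm=1.0)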

Results
The results of our experiments are shown in the following figures and tables. Data were gathered by running each experiment 100 times; this is not a huge sample, but large enough to be representative. Results for the California dataset are presented in Figures 5.8, 5.9 and 5.10, and for the Texas dataset in Figures 5.11, 5.12 and 5.13. More specifically, Figures 5.7a and 5.7b report the κ coefficient of our DCCAE compared to other networks considered as the state of the art (our implementations of the SCCN [Liu et al., 2018] and the cGAN [Niu et al., 2019]) and to the two networks developed in [Luppino et al., 2020]. The graphs are boxplots, whose boxes contain the 25th to 75th percentiles, whiskers extend to the 5th and 95th percentiles, and remaining data points are plotted as red +.
A brief description of the two reference networks, the SCCN and the cGAN, is provided here. The SCCN works with two CNNs (pretrained with a deep belief network) which learn a representation of the optical and SAR images in terms of common pixel-wise features; the change map is obtained by directly comparing these features. The cGAN, instead, is a method based on finding a common representation for SAR and optical images within a generative adversarial framework. It is evident how the ACE-Net reaches state-of-the-art performance, whereas the X-Net performs slightly worse, but more stably in terms of κ variance. The proposed DCCAE network with the affinity prior, labelled DCCAE, combines the best features of both the ACE-Net and the X-Net, reaching a very high κ as well as exhibiting a small variance, which indicates a stable performance. It is worth noting that the seemingly accurate performance of the SCCN algorithm on the California dataset is a side-effect of degenerate behaviour, as explained in [Luppino et al., 2020]: due to its simplicity (few parameters), the SCCN learns to recognise only the background in the first image, so when presented with an image containing a big flood it detects the changes in that area. The aim of the DCCAE network was, indeed, to achieve accurate peak performances like the ACE-Net, but with a smaller variance, like the X-Net. From these two boxplots, which summarise the behaviour of the networks, it is possible to state that the self-supervised DCCAE outperforms the state of the art, achieving very accurate performances both in peak classification values and in the stable behaviour highlighted by the small variance.
Moving on to another evaluation method adopted to assess the performance of the DCCAE network, confusion maps were built for both datasets as an immediate tool to visually evaluate the method. Recall the legend of the confusion maps: true positives (white), true negatives (black), false positives (green) and false negatives (red). Figure 5.8b shows the result obtained on the California dataset, which presents some false positives, especially around the contours, suggesting that a 'thinner' filter may lead to a further improvement. The other confusion map, relative to the Texas dataset (Figure 5.11b), highlights an excellent localisation of the changes and a very small impact of false negatives, which appear only on the contours of the changed area. One explanation is that the prior is calculated with a kernel which extracts spatial context: on an edge its response is not obvious, and its behaviour cannot be as fine as a single pixel, especially because natural damages are not bounded by a sharp line on the ground dividing burnt from non-burnt vegetation. It is also interesting to note how the algorithm found two preserved areas inside the fire scar, and a line (bottom left of the scar) which may be a road or a small river. The linear CCA was not able to detect these unchanged zones, but the non-linear method succeeded in extracting them.
Concerning the false positives, we also recall that the ground truth of each dataset focuses on the ground changes due to the corresponding major event (a flood and a forest fire) that happened between the two acquisitions, but does not acknowledge the possible presence of further changes (a farmer may have mowed a lawn or done some agricultural work). These further changes, if present, could be detected by the considered methods and, if so, they would erroneously be counted as false positives with respect to the ground truth map. To explore the results more in depth, Table 5.3 is proposed. It is worth noting the difference between the overall accuracy (OA), calculated as the number of correctly classified pixels over their total number, and the κ coefficient: in the California dataset we obtain a medium κ even though a large percentage of pixels are correctly classified.
Let us now discuss the intermediate results, namely the representations in the latent space and after the CCA projection. The images are at the end of this section, shown in false colour because pixel intensities have no physical meaning in this domain. It is interesting to note how the images in the latent space Z, labelled X_CODE and Y_CODE, are very well aligned, and the differences between the images remain significant here as well. The alignment can be inferred from the colours: an equal colour palette suggests that the images lie in the same domain. These images were fed to the decoders to retrieve the final result. Moving to the CCA domain, we can inspect the X_CCA and Y_CCA representations. They are not used inside the process, but only retrieved to check the behaviour of the network. They are clearly more similar to each other than the Z representations. This may not be evident here, because we show the last iteration of the network; during training, however, the CCA space is far better aligned than the Z space, especially in the first 10 to 20 epochs. Even though, as already mentioned, the CCA space can change its representation abruptly, this is not a drawback here, because the network only takes into account the maximisation of the correlation, which keeps increasing despite this non-linear behaviour.
We now consider the last representations used in the network, which are also the core of its image differencing part: X̃, X̂, Ỹ, Ŷ. These representations, explained in the previous chapter, lie in the X or Y domain. The interesting thing is that the changes are clearly visible in X̂ and Ŷ. This apparent asymmetry is easily understood using the schemes provided in Section 4.5.2: X̃ is the reconstruction of X, the image before the event, so it is the pre-event reconstruction, while X̂ is the post-event image coming from the Y domain and translated into the X domain. The method works in a parallel way for Y; the only difference is that Ỹ and Ŷ exchange their roles, because the Y domain contains the post-event image.

Comments
Some comments on our results, together with some proposals for improvement, are presented here.
First of all, the alternative implementation called DCCAE with latent space differentiation is not feasible, due to the imperfect alignment of the Z and CCA domains. This may lead the reader to doubt also the alignment needed to achieve good results in the DCCAE network; however, the misalignment is not large enough to be an obstacle for the training of the pair of autoencoders. To demonstrate this statement, consider Figure 5.14, which illustrates a DCCAE result on the Texas dataset: the change detection objective is reached, and indeed the confusion map is almost perfect. But if the differencing had been computed in the latent space, the result would not have been as good, as can be seen by visual inspection of the images extracted from the latent space Z. Figures 5.14b and 5.14c, representing the encoder outputs, are clearly not aligned at all (the colour palettes differ), and a subtraction between the two cannot lead to a good result. The second point to highlight is the presence, in Figure 5.7b, of many outliers, one of which has κ ≈ 0. In these cases, the algorithm typically does not learn how to reconstruct X̂ from the post-event image Y, even though it is able to retrieve Ŷ from X. This affects the cycle reconstruction, and the network cannot learn a proper mapping. A small anomaly is visible in the loss L_α, after which the network struggles to recover or cannot recover at all. To better understand the problem, refer to Figure 5.15, where panels 5.15b and 5.15c show the two components of the L_α loss (see 4.11): the mentioned anomaly happens at epoch 7, and a similar pattern is found in L_Cycle at epochs 8 and 9, proving that this small misalignment traverses the network and invalidates the training process.
Finally, a comment on a good property of the DCCAE: even if the training goes on for many more epochs than necessary, it does not need any early stopping criterion, highlighting the remarkable stability gained through the cross balancing of the losses. Once it reaches good performance, a decrease is very unlikely. Figure 5.16 provides an example, showing good learning capability in the first twenty to thirty epochs and then stable behaviour, without overfitting.

Linear CCA
This section provides results obtained by applying only the linear Canonical Correlation Analysis to the datasets, as a benchmark. Better results are to be expected for the Texas dataset, since its data were acquired with the same modality (optical), although at different frequencies; this makes it a heterogeneous change detection problem, but in a relatively simple form. The results for this dataset are shown in Figure 5.17. The transformation is performed by calculating all the correlations between features and taking the most correlated feature for each choice, as also displayed in the graph of Figure 5.19, which highlights that the first feature alone contains 50% of the total correlation, the subsequent three contain the remaining part, and the last three carry no correlation information.
The different experiments shown in Figure 5.17 were performed by giving as input the dimension of the transformed space. Canonical correlation analysis was performed, the two input images were projected into the space defined by the most correlated features, and the change map was then retrieved by subtracting the two images (as in standard procedures) and thresholding the result.
It is worth noting how the performance increases with the dimension of the projection space. In theory, a feature space of dimension 1, as illustrated in Figure 5.17a, is enough to represent two classes, but not to extract the information needed for the classification, even though it is the dimension carrying the most correlation. In fact, the table in Figure 5.18 shows κ ≈ 0, which indicates nearly chance-equivalent performance. Instead, when using more than 3 features (the most correlated of the 7), the κ value reaches about 0.9. On the California dataset, which is more "difficult" and not expected to be solvable with a linear transformation, the linear CCA does not obtain anything better than κ ≈ 0, so the corresponding figures are omitted for brevity. In the following, wherever CCA is mentioned, its behaviour is the same as explained here, because the same functional block is nested in the DCCAE architecture.

DCCA
The DCCA framework was explored and obtained rather accurate results, provided that an appropriate early stopping mechanism was applied. For completeness, the result on the Texas dataset in terms of κ is reported in Figure 5.20a. As can be seen, after 20 epochs the performance starts to decrease. Furthermore, looking at the loss, the moment when the network is learning is easily visible, but continuing the training leads to overfitting, which can be identified quite precisely in the Cohen's Kappa graph.

Conclusion
In the present thesis, the challenging problem of unsupervised change detection with heterogeneous remote sensing images has been addressed. After a first phase of detailed study of the related scientific literature, a novel approach rooted in deep learning has been developed. The proposed DCCAE network has been successfully implemented, and alternative configurations have been studied as well. Experimental results with two heterogeneous datasets, involving three optical sensors and one radar sensor, have demonstrated the capability of the method to achieve accurate detection of the changes regardless of the fundamentally different nature of the considered multitemporal acquisitions.
In particular, the goal of achieving accurate peak performances with a small accuracy variance across the random ensemble of output results was reached thanks to the proposed network topology, in which the CCA acts as an alignment block for the training of a pair of autoencoders. The prior information, extracted through local affinity matrices and incorporated into the method, proved vital in order to formulate an unsupervised technique.
The results of the experiments are fully satisfactory, and in the illustrated cases the proposed method outperforms the state of the art, which is based on deep architectures using generative adversarial and stacked autoencoder components. It is worth recalling that these results were obtained consistently in the validation on two very different datasets, with different features and complexity, suggesting the flexibility of the proposed approach.
Some weaknesses of the developed methodological solution are the usual objections raised against deep neural networks: they require time to be fine-tuned, their behaviour generally depends on several hyper-parameters, and we do not completely understand how they work from a purely methodological standpoint; this partial knowledge limits our ability to guard against possible misbehaviour. Furthermore, the computational power needed to train a deep network is considerable, yet nowadays easily affordable.
Further work should address an improvement of the prior information, in order to ensure a bimodal distribution also with difficult datasets such as the California one presented here. Another problem to address is the presence of many outliers in the presented experiment on the Texas dataset, whose cause is not fully understood.
One final comment is about the training policy of the proposed method: by its formulation, the network has to be trained anew every time a pair of images is to be used; it is not meant as a network that can be pretrained and reused again and again. On one hand, this yields a network trained only for the specific task at hand and consistently optimised for it. On the other hand, the need for training on every input multitemporal pair of images may be inconvenient. Yet, the possible integration with incremental learning or with further domain adaptation concepts may reduce this training requirement in future extensions of the method.