Reducing Objective Function Mismatch in Deep Clustering with the Unsupervised Companion Objective

Preservation of local similarity structure is a key challenge in deep clustering. Many recent deep clustering methods therefore use autoencoders to guide the model's neural network towards an embedding which better reflects the geometry of the input space. However, recent work has shown that autoencoder-based deep clustering models can suffer from objective function mismatch (OFM). To improve the preservation of local similarity structure while simultaneously keeping OFM low, we develop a new auxiliary objective function for deep clustering. Our Unsupervised Companion Objective (UCO) encourages a consistent clustering structure at intermediate layers in the network – helping the network learn an embedding which better reflects the similarity structure of the input space. Since a clustering-based auxiliary objective shares its goal with the main clustering objective, it is less prone to introducing objective function mismatch with the main objective. Our experiments show that attaching the UCO to a deep clustering model both improves the performance of the model and exhibits a lower OFM, compared to an analogous autoencoder-based model.


Introduction
Deep clustering is a subfield of deep learning [7] which considers the design of unsupervised loss functions for training deep learning models to cluster data. The loss functions developed in this field have made it possible to train deep architectures to discover the underlying group structure in large datasets containing data types with complex geometrical structure, such as images [22,2,5] and time series [18]. The ever-growing amount of unlabeled data has led to unsupervised learning being identified as a main next goal in machine learning research [7].
Many of the recent deep clustering models include deep neural networks that have been pre-trained as autoencoders [20,1,21,10]. In these models, the unsupervised clustering loss is attached to the code space of the autoencoder, and the model is fine-tuned using either the clustering loss alone, or both the clustering loss and the reconstruction loss from the autoencoder.
Deep Embedded Clustering (DEC) [20] is a cornerstone method in deep clustering. In DEC, the network is pre-trained as a denoising autoencoder. After pre-training, DEC fine-tunes a set of cluster centroids in order to compute the final cluster assignments. However, Guo et al. [1] argue that discarding DEC's decoder in the fine-tuning stage hinders the preservation of the local similarity structure between samples. They therefore propose Improved DEC (IDEC), which aims to alleviate this issue by retaining the decoder and reconstruction loss during fine-tuning.
Despite the popularity of the autoencoder approach, we hypothesize that the representation produced by the autoencoder does not necessarily emphasize properties which are desirable for clustering. These models can therefore suffer from objective function mismatch (OFM) [12]. OFM occurs when the optimization of an auxiliary objective (e.g. reconstruction) has a negative impact on the optimization of the main objective (e.g. clustering). In fact, Mrabah et al. [13,14] show that the aforementioned IDEC suffers from OFM during training, supporting our hypothesis that OFM occurs in autoencoder-based deep clustering models.
Deep Divergence-based Clustering (DDC) [5] is another deep clustering approach, which in contrast to DEC and IDEC, does not rely on an auxiliary autoencoder to train its neural network. In fact, DDC can be trained end-to-end from randomly initialized parameters using only its clustering loss, outperforming DEC on several deep clustering tasks [5,18]. Despite the promising performance of DDC, however, we do not know whether it is prone to the same issues regarding similarity preservation as those observed in DEC.
In this paper, we present an Unsupervised Companion Objective (UCO), whose task is to preserve the similarity structure between samples in deep clustering models. Inspired by Deeply-supervised Nets [9], the UCO consists of a set of auxiliary clustering objectives attached to intermediate layers in the neural network, which encourage a common cluster structure at the output of these layers. Since the UCO is based on clustering, we expect a low OFM between the UCO and the main clustering objective. Our experiments show that the UCO both exhibits reduced OFM when compared to a reconstruction objective, and improves the overall clustering performance of a deep clustering model.

Method
Throughout this paper, we will assume that the deep clustering model follows the general design

$$z = f_\theta(x), \qquad \alpha = g_\phi(z)$$

where $f_\theta$ denotes the neural network producing a learned representation $z$ from the input $x$, and $g_\phi$ denotes the clustering module producing the cluster membership vector $\alpha$. $\theta$ and $\phi$ represent the parameters of the neural network and the clustering module, respectively. See Figure 1 for an overview of the assumed clustering model. For generality, we define blocks to be generic computational units in a deep neural network. Layers are perhaps the most familiar examples of blocks, but a block can also represent, for instance, a collection of adjacent layers, or individual components within a specific layer. Since $f_\theta$ represents a neural network, it can be decomposed block-wise as

$$f_\theta = f^L_{\theta_L} \circ \dots \circ f^1_{\theta_1}$$

where $f^l_{\theta_l}$ is the mapping performed by block $l$, and $L$ is the number of blocks in $f$.
If we let $y^l$ be the output of block $l$, we have

$$y^l = f^l_{\theta_l}(y^{l-1}), \quad l = 1, \dots, L$$

with $y^0 = x$ and $y^L = z$. Finally, we assume that $f_\theta$ and $g_\phi$ are optimized jointly with a clustering loss, $\mathcal{L}_{\text{cluster}}$, which is computed by the clustering module. The focus of this work is on the proposed UCO, and the neural network $f$ to which it is attached. We therefore use the clustering module and loss from DDC [5], and treat this component as fixed.
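To make the block decomposition concrete, below is a minimal PyTorch sketch (the models in this paper are implemented in PyTorch [16]) of a network that exposes the intermediate outputs $y^1, \dots, y^L$ needed later by the UCO. The layer sizes are illustrative placeholders and do not reproduce the architecture in Table 1.

```python
import torch
from torch import nn

class BlockwiseNet(nn.Module):
    """A network f decomposed into L blocks; forward returns all
    intermediate outputs y^1, ..., y^L (so y^L = z)."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        outputs = []
        y = x
        for block in self.blocks:
            y = block(y)            # y^l = f^l_{theta_l}(y^{l-1})
            outputs.append(y)
        return outputs              # outputs[-1] is the representation z

# Illustrative two-block CNN (placeholder sizes, not Table 1):
net = BlockwiseNet([
    nn.Sequential(nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2)),
    nn.Sequential(nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2)),
])
ys = net(torch.randn(8, 1, 28, 28))   # [y^1, y^2]
```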

Unsupervised companion objective
The unsupervised companion objective, $\mathcal{L}_{\text{UCO}}$, consists of the terms $\mathcal{L}^1_{\text{UCO}}, \dots, \mathcal{L}^L_{\text{UCO}}$, which are added to the final clustering loss in order to help preserve the similarity structure between samples. All of these terms are designed to encourage the same cluster structure at their respective blocks as the one found by the clustering module. Following [5], we do this by maximizing the Cauchy-Schwarz divergence [4] between clusters in the space of intermediate representations. For block $l$, we have

$$\mathcal{L}^l_{\text{UCO}} = \frac{1}{k} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \frac{\sum_{a,b} \alpha_{ai}\, k^l_{ab}\, \alpha_{bj}}{\sqrt{\left( \sum_{a,b} \alpha_{ai}\, k^l_{ab}\, \alpha_{bi} \right) \left( \sum_{a,b} \alpha_{aj}\, k^l_{ab}\, \alpha_{bj} \right)}}$$

whose minimization maximizes the Cauchy-Schwarz divergence between the clusters at block $l$. Here, $k$ denotes the number of clusters, and $\alpha_{ai}$ is the soft assignment of sample $a$ to cluster $i$. $k^l_{ab}$ denotes a Gaussian kernel evaluated at $(y^l_a, y^l_b)$:

$$k^l_{ab} = \exp\left( -\frac{\| \mathrm{flat}(y^l_a) - \mathrm{flat}(y^l_b) \|^2}{2\sigma^2} \right)$$

where $\sigma$ is a hyperparameter. Since the intermediate representations can be arrays with more than one dimension (e.g. 3 for convolutional layers), we apply the $\mathrm{flat}(\cdot)$ function to reshape the arrays into vectors before evaluating the Gaussian kernel.
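As a concrete illustration, here is a minimal PyTorch sketch of how a single term $\mathcal{L}^l_{\text{UCO}}$ could be computed for a mini-batch, assuming the DDC-style Cauchy-Schwarz form reconstructed above; the function name and the numerical-stability epsilon are ours, not part of the original method description.

```python
import torch

def uco_term(alpha, y_l, sigma):
    """One UCO term L^l_UCO for a mini-batch.

    alpha: (n, k) soft cluster assignments from the clustering module.
    y_l:   intermediate output of block l, any shape with batch dim first.
    sigma: Gaussian kernel bandwidth (a hyperparameter).
    """
    n, k = alpha.shape
    flat = y_l.reshape(n, -1)                  # flat(y^l)
    d2 = torch.cdist(flat, flat) ** 2          # squared pairwise distances
    K = torch.exp(-d2 / (2 * sigma ** 2))      # kernel matrix k^l_{ab}

    M = alpha.T @ K @ alpha                    # M[i, j] = sum_ab a_ai k_ab a_bj
    diag = torch.sqrt(torch.diagonal(M))
    sim = M / (diag.unsqueeze(0) * diag.unsqueeze(1) + 1e-9)

    # Sum over cluster pairs i < j; minimizing this maximizes the
    # Cauchy-Schwarz divergence between clusters at block l.
    iu = torch.triu_indices(k, k, offset=1)
    return sim[iu[0], iu[1]].sum() / k
```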
The UCO can now be obtained as a weighted sum of the terms for each block:

$$\mathcal{L}_{\text{UCO}} = \sum_{l=1}^{L} w_l\, \mathcal{L}^l_{\text{UCO}}$$

where $w_l$ is the weight for the term attached to block $l$. In order to avoid the number of hyperparameters (weights) scaling linearly with the depth of the network, we adopt the alternative weighting strategy

$$w_l = \lambda\, \omega(l)$$

where $\lambda$ is a base weight, and $\omega : \{1, \dots, L\} \to [0, 1]$ is a function computing the relative weight for block $l$.
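Under the same assumptions, the full auxiliary loss is then a weighted sum over blocks. The sketch below reuses the hypothetical uco_term from above and hard-codes the exponential $\omega(l) = 10^{l-L}$ used in our experiments (see Setup).

```python
def uco_loss(alpha, intermediate, sigma, lam=0.1):
    """L_UCO = sum_l lambda * omega(l) * L^l_UCO, with omega(l) = 10^(l-L)."""
    L = len(intermediate)
    total = 0.0
    for l, y_l in enumerate(intermediate, start=1):
        w_l = lam * 10.0 ** (l - L)    # exponential relative weighting
        total = total + w_l * uco_term(alpha, y_l, sigma)
    return total
```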

Setup
Models. In order to evaluate the performance of the proposed unsupervised companion objective, we train the DDC model with and without the UCO, using the same network architecture and hyperparameter setup for both models. We refer to these configurations as DDC-UCO and DDC, respectively. Drawing inspiration from IDEC [1], we also train an autoencoder-based DDC model, which includes a decoder network whose task is to reconstruct the input data. We refer to this model as DDC-AE.

Table 1: Layers in the CNN used to train DDC, DDC-AE, and DDC-UCO. All Conv layers use ReLU activation functions. Layers using batch-norm apply the normalization before the activation function.

Implementation. As in [5], we use a small convolutional neural network for our experiments. Our CNN consists of two sequential blocks, each having two convolutional layers followed by a max pooling operation. An overview of the layers in the network is shown in Table 1. In DDC-AE, we create the decoder network by mirroring the CNN, replacing convolutions with transpose convolutions, and max-pooling operations with nearest-neighbor upsampling.
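As a rough sketch of this mirroring, with illustrative layer sizes (the exact configuration from Table 1 is not reproduced here): convolutions become transpose convolutions, and max-pooling becomes nearest-neighbor upsampling.

```python
from torch import nn

# One encoder block with two convolutions and max pooling (illustrative
# sizes, not the exact Table 1 configuration), and its mirrored decoder.
encoder_block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(2),
)
decoder_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),          # mirrors MaxPool2d
    nn.ConvTranspose2d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.ConvTranspose2d(32, 1, kernel_size=5, padding=2),  # back to input channels
)
```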
Name            Image size   Color   n       k
MNIST [8]       28 × 28      Gray    60000   10
USPS            16 × 16      Gray    9298    10
F-MNIST [19]    28 × 28      Gray    60000   10
COIL-100 [15]   128 × 128    RGB     7200    100

Table 2: Overview of the datasets used for evaluation. n and k denote the total number of images and the number of categories, respectively.

The models are implemented in the PyTorch framework [16]. All models are trained on stochastic mini-batches of size 120, using the Adam optimizer [6] with a learning rate of 0.0001. Following [5], we train each model 20 times, and report the accuracy (ACC) and normalized mutual information (NMI) of the run resulting in the lowest value of the total loss function. For each block, the σ hyperparameter is set to 15% of the median pairwise distance between the intermediate outputs for samples in the mini-batch [5].
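A small sketch of this σ rule, as it might be implemented in PyTorch; the helper name is ours.

```python
import torch

def sigma_for_block(y_l, rel=0.15):
    """sigma = 15% of the median pairwise distance between the flattened
    intermediate outputs in the mini-batch (treated as a constant)."""
    with torch.no_grad():
        flat = y_l.reshape(y_l.shape[0], -1)
        return rel * torch.pdist(flat).median()
```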
Since hyperparameter tuning can be difficult when labeled data is unavailable, we set $\lambda = 0.1$ and $\omega(l) = 10^{l-L}$ for all datasets. This choice is further investigated below.
Datasets. We test the models on the MNIST, USPS, Fashion-MNIST (F-MNIST), and COIL-100 datasets. These datasets represent clustering tasks which are often encountered in computer vision, and are thus widely used in the deep clustering literature [20,1,17,5,22]. An overview of the datasets can be found in Table 2.

Results
Quantitative results. Table 3 shows the clustering results for DDC, DDC-AE, and DDC-UCO on the baseline datasets. These results indicate that adding the UCO to DDC tends to improve the clustering performance of the model. Adding a decoder and a reconstruction loss, however, leads to a drop in performance on all datasets. Finally, we observe that DDC-UCO outperforms DEC and IDEC on MNIST and USPS.

Objective function mismatch. In order to demonstrate that the UCO leads to reduced OFM when compared to an autoencoder-based method, we use the following metric to measure the OFM during training:

$$\mathrm{OFM} = \frac{1 - \Delta_{FD}}{2}$$

where $\Delta_{FD}$ is the feature drift [13]:

$$\Delta_{FD} = \frac{\langle \nabla_\theta \mathcal{L}_{\text{cluster}},\, \nabla_\theta \mathcal{L}_{\text{pretext}} \rangle}{\| \nabla_\theta \mathcal{L}_{\text{cluster}} \|\, \| \nabla_\theta \mathcal{L}_{\text{pretext}} \|}.$$

$\Delta_{FD}$ measures the "agreement" between the gradient of the clustering objective, $\mathcal{L}_{\text{cluster}}$, and that of a pretext task, $\mathcal{L}_{\text{pretext}}$, with respect to the weights in the network, $\theta$. For DDC-AE, the pretext task is reconstruction, and for DDC-UCO, it is the set of auxiliary clustering objectives introduced by the UCO. OFM is a re-parametrization of $\Delta_{FD}$, which is scaled to be in the range $[0, 1]$ and negated, such that higher values indicate a larger degree of mismatch between the objectives.

Figure 2 shows the observed OFM when training DDC-AE and DDC-UCO. The models were trained on 2000 randomly selected images from the first four digits of the MNIST dataset. The plots indicate that the OFM is significantly lower in DDC-UCO than in DDC-AE at all epochs. This both confirms our hypothesis about the presence of OFM in autoencoder-based deep clustering models, and shows that replacing the reconstruction loss with the UCO reduces the OFM.
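For concreteness, the following sketch shows how this metric can be measured in PyTorch, under the assumption that $\Delta_{FD}$ is the cosine similarity between the two gradient vectors (consistent with the scaling and negation described above); the function name is ours.

```python
import torch

def measure_ofm(params, loss_cluster, loss_pretext):
    """OFM = (1 - Delta_FD) / 2, with Delta_FD the cosine similarity between
    the gradients of the two objectives w.r.t. the shared weights theta."""
    g_c = torch.autograd.grad(loss_cluster, params, retain_graph=True)
    g_p = torch.autograd.grad(loss_pretext, params, retain_graph=True)
    gc = torch.cat([g.reshape(-1) for g in g_c])
    gp = torch.cat([g.reshape(-1) for g in g_p])
    delta_fd = torch.dot(gc, gp) / (gc.norm() * gp.norm() + 1e-12)
    return (1.0 - delta_fd) / 2.0
```

Here, params would be the list of shared network parameters $\theta$, and both losses must be computed on the same mini-batch before calling the function.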
Interestingly, we observe that the OFM in DDC-UCO tends to increase during training. We hypothesize that this is because, in the early stages of training, the clustering loss and the UCO work together to strengthen the cluster structure at intermediate outputs throughout the network, and to determine the appropriate cluster memberships. Later in training, the clustering loss encourages the clusters to be as compact and separable as possible, which can be achieved by mapping one or more clusters to a single point. The UCO counteracts this behavior by enforcing a cluster structure that is more similar to the one found at earlier blocks in the network, and which is thus more reflective of the cluster structure in the input data. This, in turn, leads to a mismatch between the clustering objective and the UCO in the later stages of training. We will investigate this hypothesis further in future work.
UCO weighting scheme. The base weight $\lambda$ and the relative weighting function $\omega$ are important hyperparameters in the UCO. To investigate their impact, we vary $\lambda$ over 0.01, 0.1, and 1. Then, for each of these values, we train DDC-UCO with the following three $\omega$-functions: (i) Constant: $\omega(l) = 1$; (ii) Linear: $\omega(l) = l/L$; and (iii) Exp: $\omega(l) = 10^{l-L}$.
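For reference, the three ω-functions as simple Python callables (a direct transcription of the definitions above):

```python
OMEGA = {
    "constant": lambda l, L: 1.0,
    "linear":   lambda l, L: l / L,
    "exp":      lambda l, L: 10.0 ** (l - L),
}
```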
The results for the different configurations are shown in Table 4. These results show that the choice of $\lambda$ and $\omega$ does influence the performance of the model. Setting $\lambda = 1$, for instance, leads to slightly worse overall performance compared to $\lambda = 0.01$ and $\lambda = 0.1$. This shows that weighing $\mathcal{L}_{\text{cluster}}$ and $\mathcal{L}_{\text{UCO}}$ equally puts too much emphasis on $\mathcal{L}_{\text{UCO}}$, which in turn leads to a slightly worse final clustering. Despite these variations in performance, none of the configurations cause the model to fail completely. This is an important property, due to the difficulties of validating hyperparameters in a fully unsupervised setting. Lastly, we note that there is no single best configuration for all datasets. We could therefore improve the results of DDC-UCO in Table 3 by tuning the hyperparameters individually for each dataset. However, we refrain from dataset-specific hyperparameter tuning, as it would not be a viable option in a general unsupervised setting.

Separability of intermediate representations. Figure 3 shows the intermediate outputs from the first block in the neural network for DDC, DDC-AE, and DDC-UCO, trained on MNIST. The representations were projected to two dimensions using UMAP [11]. The plots show that both DDC-AE and DDC-UCO produce more separable clusters compared to DDC. This illustrates that DDC does indeed benefit from the improved similarity preservation introduced by the respective auxiliary objectives. When comparing the representations from DDC-AE and DDC-UCO, we observe that although both models are able to separate most of the digits from each other, the clusters corresponding to 4, 7, and 9 seem to be somewhat more separable in the representation produced by DDC-UCO.

Conclusion
We have presented the unsupervised companion objective (UCO), a new auxiliary objective for deep clustering, which consists of several clustering losses attached to different blocks in the model's neural network. By encouraging a consistent cluster structure throughout the network, we make the network embedding better preserve the local similarity structure of the input space. Furthermore, the clustering-based nature of the UCO makes the resulting model less prone to suffering from objective function mismatch, compared to deep clustering models based on autoencoders. Our experiments show that attaching the UCO to DDC results in an improvement in overall clustering performance. We also demonstrate that the UCO reduces the OFM in DDC, when compared to an analogous autoencoder-based approach.