Notes on the Symmetries of 2-Layer ReLU-Networks

Symmetries in neural networks allow different weight configurations to realize the same network function. For odd activation functions, the set of transformations mapping between such configurations has been studied extensively, but less is known for neural networks with ReLU activation functions. We give a complete characterization for fully connected networks with two layers. Apart from two well-known transformations, only degenerate situations admit additional transformations that leave the network function unchanged. Finally, we present a non-degenerate situation for deep neural networks in which transformations exist that leave the network function intact.


Introduction
Let f_W(x) = w_L(· · · (w_2(w_1 x + b_1)_+ + b_2)_+ · · ·)_+ + b_L denote a neural network with ReLU activation function (x)_+ = max{x, 0}. We consider the map φ : W → F from the set of weights to the set of realizable network functions; that is, for a given collection of weights W, we denote by φ(W) = f_W the neural network function defined by the weights W. Due to symmetries in the network, this map is not injective. Following the terminology of [3], we call transformations on the weight space that leave the network function identical equioutput transformations.*

Definition 1. Let W denote a weight selection of a neural network function f. A transformation α on the weight space is called an equioutput transformation if f_W(x) = f_α(W)(x) for all x and W.

Two such transformations are well known:

* Corresponding author: henning.petzka@math.lth.se. This work was supported by the European Research Council Consolidator grant SEED, CNCS-UEFISCDI (PN-III-P4-ID-PCE-2016-0535, PN-III-P4-ID-PCCF-2016-0180). In contrast to [3], our transformations are not necessarily analytic.
(1) π: A permutation π of neurons (together with all their in- and outgoing weights) within a layer.
(2) ρ_λ: A transformation ρ_λ given by multiplying all ingoing weights of a neuron by λ > 0 and multiplying all outgoing weights by 1/λ. This leaves the network function unchanged because of the positive homogeneity of the ReLU function: if λ > 0, then λ(x)_+ = (λx)_+, hence for any scalars a, b we have a(bx)_+ = (a/λ)(λbx)_+.

Knowledge of these transformations is important for understanding the loss function. For example, it follows that a global minimum never comes alone: for a given global minimum W, any equioutput transformation of W leads to a different global minimum. Further, knowledge of these equioutput transformations can be useful for studying properties of neural networks. Suppose a certain property should depend only on the function, not on its specific representation. If we aim to describe this property in terms of its representation in the form of weights, then the property must be invariant under all equioutput transformations; otherwise it is ill-defined. Such a dependence was observed by Dinh et al. [4] for generalization and Hessian-based measures of flatness: while generalization performance depends only on the network function, the common flatness measures depend on the representation given by specific weight values. Similarly, Neyshabur et al. [6] argue that for ReLU networks the usual gradient descent is not sensible, because the steepest direction is defined by a maximal reduction in loss for equal (infinitesimal) step length in the l2-norm, and this measure of steepness depends on the specific parameterization. Hence, in this example, the property of interest is the update rule during optimization (which should be independent of the representation), but the direction of gradient descent depends on the neuron-wise reparameterizations ρ_λ.
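The two transformations can be verified numerically. The following is a minimal sketch with invented weights (a random 2-layer ReLU network), checking that both π and ρ_λ leave the output unchanged:

```python
import numpy as np

# A toy 2-layer ReLU network f(x) = v @ relu(W @ x + b) + c.
# All weights below are made up for illustration.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 3)), rng.normal(size=5)   # hidden layer
v, c = rng.normal(size=5), rng.normal()              # output layer

def f(x, W, b, v, c):
    return v @ np.maximum(W @ x + b, 0.0) + c

x = rng.normal(size=3)
y0 = f(x, W, b, v, c)

# pi: permute hidden neurons together with their in- and outgoing weights.
perm = rng.permutation(5)
assert np.isclose(y0, f(x, W[perm], b[perm], v[perm], c))

# rho_lambda: scale neuron 0's ingoing weights (and bias) by lam,
# and its outgoing weight by 1/lam.
lam = 2.5
W2, b2, v2 = W.copy(), b.copy(), v.copy()
W2[0] *= lam; b2[0] *= lam; v2[0] /= lam
assert np.isclose(y0, f(x, W2, b2, v2, c))
```

Both assertions hold for any input x, reflecting that π and ρ_λ are equioutput for all parameter values.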
A natural question arises: are there any other equioutput transformations than the ones described above? This has been studied quite thoroughly for odd activation functions, which we recap below. In short, there are many situation-dependent equioutput transformations that exist only for specific parameter values, but there are only two types of equioutput transformations that are analytic and work equally well no matter the given parameter values. To the knowledge of the authors, there has been little work on the case of ReLU activation functions. Only recently, by considering hyperplane configurations, [8] showed that under certain conditions it is often possible to recover the network parameters from the network function up to a composition of π and ρ_λ. [2] study whether proximity of network functions implies proximity of their parameterizations. Finally, [7] consider networks of decreasing width.
Our work gives a complete characterization for fully connected 2-layer regression networks with ReLU activation functions. For these networks, we outline all situation-dependent equioutput transformations, and we show how networks can be reduced to an irreducible form. Up to degenerate cases, irreducible networks with identical network function have weights that differ only by a composition of the transformations π and ρ_λ. For deep neural networks, we describe additional situation-dependent equioutput transformations that are not degenerate, though unlikely to appear under proper initialization.

Odd activation functions
For neural networks with all activation functions given by the hyperbolic tangent tanh, equioutput transformations have been studied in several works. In this case, the well-known equioutput transformations are permutations π as above and sign flips. The latter is the negative-scalar variant ρ_{−1} and uses tanh(−x) = −tanh(x). Sussmann [9] considers 2-layer networks. For certain weight configurations, he identifies natural reduction steps that allow removing nodes while keeping the network function constant. Two irreducible networks then determine the same network function if and only if one weight configuration can be transformed into the other by a composition of permutations π and sign flips ρ_{−1}. Chen et al. [3] extended the result of Sussmann to deeper networks and showed the following theorem.
Theorem [3]. For parameter values W and input x, let f_W(x) denote a neural network function with all activation functions tanh. All analytic (i.e., expandable into a power series) equioutput transformations on the weight space W are compositions of interchanges of neurons π within a layer and sign-flip transformations ρ_{−1}.
The restriction to analytic transformations aims at excluding equioutput transformations that exist only for specific parameters W. We call transformations that rely on certain weight values situation-dependent. In Sussmann [9], these situation-dependent transformations were excluded by the reduction steps to irreducible form. Since the transformations π and ρ_{−1} always exist, they have relevant consequences in practical applications. Situation-dependent transformations that exist only in degenerate cases are less likely to play a significant role in practice. However, effects are still possible when the degenerate situations hold only approximately.
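The sign-flip transformation ρ_{−1} for tanh networks can also be checked numerically. A minimal sketch with invented weights, flipping the sign of one hidden neuron's ingoing and outgoing weights:

```python
import numpy as np

# Toy 2-layer tanh network; all weights are made up for illustration.
rng = np.random.default_rng(1)
W, b = rng.normal(size=(4, 2)), rng.normal(size=4)
v, c = rng.normal(size=4), rng.normal()

def f(x, W, b, v):
    return v @ np.tanh(W @ x + b) + c

# Sign flip rho_{-1} on neuron 1: flip ingoing weights, bias, AND the
# outgoing weight; the output is preserved since tanh(-z) = -tanh(z).
W2, b2, v2 = W.copy(), b.copy(), v.copy()
W2[1] *= -1; b2[1] *= -1; v2[1] *= -1

x = rng.normal(size=2)
assert np.isclose(f(x, W, b, v), f(x, W2, b2, v2))
```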
Considering activation functions other than tanh, Albertini et al. [1] extended the analysis to infinitely differentiable functions satisfying σ(0) = 0, σ′(0) ≠ 0, σ″(0) = 0. Kurková and Kainen [5] generalize the result to two-layer regression networks where the activation functions need not be continuous, but bounded and asymptotically constant (this includes sigmoids, for example).

Two-layer ReLU networks

We now consider fully connected ReLU regression networks with a single output neuron and one hidden layer. We call a neural network reducible if nodes can be removed without changing the network function, possibly after a redefinition of some weights. Two types of reducibility are as follows:

(R1) If all ingoing or outgoing weights of a neuron are zero, then this neuron is never active or never processed and can be removed without changing the network function. If all ingoing weights except for the bias are zero, then the neuron can be removed and its contribution to the next layer can be added to the biases of the following layer.
(R2) If n_2(x) = λ n_1(x) for some λ > 0 (two neurons in the hidden layer are equal up to a positive multiplicative factor), then one of the nodes can be removed using a linear combination of the corresponding outgoing weights, e.g., v_1,new = v_1 + λ v_2 for n_1(x) when removing the second one.
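The two reductions can be illustrated on a toy hidden layer. The weights below are our own invented example: neuron 1 is a positive multiple of neuron 0 (R2), and neuron 2 has only a bias as ingoing weight (R1):

```python
import numpy as np

def f(x, W, b, v, c):
    return v @ np.maximum(W @ x + b, 0.0) + c

# Invented weights: neuron 1 = 2 * neuron 0; neuron 2 has zero ingoing
# weights and positive bias (so it is constantly active with value b[2]).
W = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]])
b = np.array([0.5, 1.0, 0.3])
v = np.array([1.0, 3.0, 2.0])
c = 0.1

# R2: merge neuron 1 into neuron 0 with v0_new = v0 + lam * v1, lam = 2.
# R1: drop neuron 2 and fold its constant contribution into the bias.
W_red = W[:1]
b_red = b[:1]
v_red = np.array([v[0] + 2.0 * v[1]])
c_red = c + v[2] * max(b[2], 0.0)

rng = np.random.default_rng(2)
x = rng.normal(size=2)
assert np.isclose(f(x, W, b, v, c), f(x, W_red, b_red, v_red, c_red))
```

The reduced network has a single hidden neuron but realizes the same function.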
Many equioutput transformations exist for non-reduced neural networks. If all weights into a neuron are zero, then its outgoing weights can be changed arbitrarily. If λ · n_1(x) = n_2(x) for some λ > 0, then weight can be shifted from the outgoing weights of one neuron to the other (i.e., v_1(n_1(x))_+ + v_2(n_2(x))_+ = (v_1 − z)(n_1(x))_+ + (v_2 + z/λ)(n_2(x))_+ for all z). In this section we investigate whether, after reducing two-layer fully connected ReLU regression networks to an irreducible form, there exist symmetries other than the ones just named. Our working hypothesis, which later needs to be adjusted slightly, is that ReLU networks show the same behavior as neural networks with odd activation functions.
Working hypothesis: Let f_W1 and f_W2 denote a pair of (R1, R2)-irreducible two-layer fully connected ReLU regression networks. If f_W1(x) = f_W2(x) for all x, then W1 can be obtained from W2 by a composition of an arbitrary number of symmetries π and ρ_λ as above.
We will take advantage of two short lemmas and the definition of an activation pattern.
We show a strong form of linear independence for neuron functions with activation (n_i(x))_+, assuming a little more than pairwise different activation patterns. This strong form of linear independence was called the independence property (IP) in Albertini et al. [1]: the set of functions {(n_i(x))_+ | i = 1, . . . , m} has (IP) if pairwise inequality of the functions, (n_i(x))_+ ≠ (n_j(x))_+ for i ≠ j, implies linear independence of {1, (n_1(x))_+, . . . , (n_m(x))_+}. This property is required to find a pair of neurons with the same zero hyperplane whenever neurons are linearly dependent after activation. The proof of Theorem 1 takes advantage of this property by matching neurons in different representations of the same network function.
After excluding a degenerate case, we obtain property (IP) for {(n_i(x))_+} even when the constant function 1 is replaced by an arbitrary nonzero linear function. The degenerate case we exclude consists of two neurons n_k, n_l with identical zero hyperplanes, i.e., H_k = H_l with H_k = H(n_k).
We now give an example showing that the assumption of pairwise different hyperplanes H_j is necessary and that it would not have been sufficient to assume pairwise different activation patterns. If H_k = H_l for some k ≠ l, then we can still have pairwise different activation patterns when R+_k = R−_l. In words, neuron k is active exactly when neuron l is inactive and vice versa (except for the points on the hyperplane H_k = H_l, where neither neuron is active). We say that n_k, n_l are a pair of neurons with opposite activation pattern. Example 1. In the network in Figure 1, all neurons have nontrivial and pairwise different activation patterns, yet the network implements the constant zero function.
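A network of this kind is easy to verify numerically. The construction below is our own (not necessarily the one in Figure 1): three pairs of neurons with opposite activation patterns, exploiting (z)_+ − (−z)_+ = z so that all linear contributions cancel:

```python
import numpy as np

# Six neurons on inputs (x, y): the pairs (x, -x), (y, -y), (x+y, -(x+y))
# have opposite activation patterns, and all activation patterns differ.
W = np.array([[ 1.0,  0.0], [-1.0,  0.0],
              [ 0.0,  1.0], [ 0.0, -1.0],
              [ 1.0,  1.0], [-1.0, -1.0]])
v = np.array([1.0, -1.0, 1.0, -1.0, -1.0, 1.0])

def f(x):
    # f = x + y - (x + y) = 0 for every input, despite nontrivial neurons.
    return v @ np.maximum(W @ x, 0.0)

rng = np.random.default_rng(3)
for _ in range(5):
    assert abs(f(rng.normal(size=2))) < 1e-12
```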
This example shows two things: (i) in such a situation where H_k = H_l for some k ≠ l, new situation-dependent equioutput transformations can appear; (ii) single layers of neural networks with ReLU activation can implement linear functions.
If H_k = H_l for two neurons k, l with opposite activation pattern, then we can transform them into a linear neuron (no activation function) and a neuron with ReLU activation function as follows. Say n_l(x) = −λ n_k(x) for some λ > 0. With outgoing weights v_1 and v_2, we have v_1(n_k(x))_+ + v_2(−λ n_k(x))_+ = v_1 n_k(x) + (λ v_2 + v_1)(−n_k(x))_+. But we could equally well have decomposed the sum into a linear function plus rest to get −λ v_2 n_k(x) + (v_1 + λ v_2)(n_k(x))_+. More generally, we can flip the activation region of a neuron n_i(x) by introducing a linear neuron: (n_i(x))_+ = n_i(x) + (−n_i(x))_+. Accordingly, irreducible ReLU networks will only be defined up to the (optional) existence of a single non-constant linear neuron N(x) (i.e., the linear function may consist only of the bias of the output layer).
(R3) Write each pair of neurons with opposite activation pattern, as above, as the sum of a linear neuron N_i and a neuron with ReLU activation function. Add all linear neurons plus the output bias into a single linear neuron N(x) = αx + c, which is subsequently split into two ReLU neurons with opposite activation pattern, (αx)_+ − (−αx)_+, and bias c.
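The opposite-pattern decomposition underlying R3 can be checked numerically. A sketch with an invented neuron n(x) = w · x + b, verifying v_1(n)_+ + v_2(−λn)_+ = −λ v_2 n + (v_1 + λ v_2)(n)_+ (which follows from (−n)_+ = (n)_+ − n):

```python
import numpy as np

# Invented neuron and coefficients for illustration.
rng = np.random.default_rng(4)
w, b = rng.normal(size=3), rng.normal()
v1, v2, lam = 1.7, -0.4, 2.0

for _ in range(5):
    x = rng.normal(size=3)
    n = w @ x + b
    # pair with opposite activation pattern: n and -lam * n
    lhs = v1 * max(n, 0.0) + v2 * max(-lam * n, 0.0)
    # same function, written as linear part plus a single ReLU neuron
    rhs = -lam * v2 * n + (v1 + lam * v2) * max(n, 0.0)
    assert np.isclose(lhs, rhs)
```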
In the following, we represent an (R1, R2, R3)-irreducible network as f_W(x) = Σ_j v_j (n_j(x))_+ + N(x), with N(x) a linear function and all n_j(x) having pairwise different zero hyperplanes H_j. We also find an additional equioutput transformation ψ_−, which operates on a single neuron and the linear function: ψ_− multiplies all incoming weights of a neuron n_i(x) by −1, leaves the outgoing weights intact, and adds v_i n_i(x) to N(x). More generally, for each subset of indices J ⊆ {1, . . . , m}, we define ψ^J_− as the application of ψ_− to all neurons n_j(x) with j ∈ J. Note that ψ^J_− may add two neurons with opposite activation pattern (forming a linear neuron) to the hidden layer when it is applied to a network representation without opposite activation patterns (N(x) = b_L a constant), and the resulting larger network can still be (R1, R2, R3)-irreducible. We discuss later how we can recover a representation with constant N(x) as bias whenever it exists.
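The transformation ψ_− can be verified on a toy network with an explicit linear part. All weights below are invented; the check uses (n)_+ = n + (−n)_+:

```python
import numpy as np

# Toy irreducible-form network f(x) = sum_j v_j relu(n_j(x)) + N(x),
# with linear part N(x) = alpha . x + c. Weights are made up.
rng = np.random.default_rng(5)
W, b = rng.normal(size=(4, 2)), rng.normal(size=4)
v = rng.normal(size=4)
alpha, c = rng.normal(size=2), rng.normal()

def f(x, W, b, v, alpha, c):
    return v @ np.maximum(W @ x + b, 0.0) + alpha @ x + c

# psi_minus on neuron i: flip its ingoing weights (and bias), keep the
# outgoing weight, and add v_i * n_i(x) to the linear function N(x).
i = 2
W2, b2 = W.copy(), b.copy()
W2[i] *= -1; b2[i] *= -1
alpha2 = alpha + v[i] * W[i]
c2 = c + v[i] * b[i]

x = rng.normal(size=2)
assert np.isclose(f(x, W, b, v, alpha, c), f(x, W2, b2, v, alpha2, c2))
```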
The following theorem states that irreducible two-layer ReLU networks are unique up to the equioutput transformations π, ρ_λ, and ψ^J_−. The proof can be found in the appendix.
Theorem 1. Let f_W1 and f_W2 denote a pair of (R1, R2, R3)-irreducible two-layer fully connected ReLU regression networks. If f_W1(x) = f_W2(x) for all x, then W2 can be obtained from W1 by a composition of an arbitrary number of equioutput transformations π, ρ_λ, and ψ^J_− as above.
For any subset J of indices, applying ψ^J_− twice yields the identity. Further, ψ^J_− is the only equioutput transformation that can change the linear function N_1(x) in f_W1(x) = Σ_j v^j_1 (n^j_1(x))_+ + N_1(x). This implies that an (R1, R2, R3)-irreducible ReLU network function f_W1(x) has a representation f_W2(x) = Σ_j v^j_2 (n^j_2(x))_+ + N_2(x) with a constant N_2(x) = b_L if and only if there is a transformation ψ^J_− that cancels N_1(x) up to a constant. In other words, we can find a representation without opposite activation patterns if and only if there exists a subset J of indices such that N_1(x) + Σ_{j∈J} v^j_1 n^j_1(x) = c for a constant c. In this case, we apply ψ^J_− and change the bias at the output layer to c. (R1 subsequently removes the ReLU neurons that were used for the linear function N_1(x).) These observations prove the following result: apart from degenerate situations, π and ρ_λ are the only equioutput transformations.

[Figure 1: a network on inputs x, y implementing f(x, y) = 0; see Example 1.]

Corollary. Let f_Wi(x) = Σ_j v^j_i (n^j_i(x))_+ + b_i, i = 1, 2, denote a pair of (R1, R2)-irreducible two-layer fully connected ReLU regression networks such that, for i = 1, 2, (i) all n^j_i(x) have pairwise different zero hyperplanes; and (ii) there exists no non-empty subset J of indices such that Σ_{j∈J} v^j_i n^j_i(x) is constant. If f_W1(x) = f_W2(x) for all x, then W2 can be obtained from W1 by a composition of equioutput transformations π and ρ_λ as above.

Deep neural networks
In the proof for the 2-layer case, we take advantage of the assumption that the input ranges over all of R^d. In deeper layers, we lose this property, because ReLU activations restrict the domain of subsequent layers to the nonnegative orthant, where all coordinate values are nonnegative. Neurons for which all ingoing weights are positive can then behave like linear neurons in deeper layers. This considerably complicates the treatment of reducibility and symmetries in the case of deep neural networks. There are now more than merely degenerate situations in which different parameter choices induce the same network function.
Example 2. Suppose two neurons in a layer with index l > 1 (i.e., not the first hidden layer) have only positive weights (and nonnegative biases). Since the output of the previous ReLU layer l − 1 is always nonnegative, these two neurons are always active and hence behave like linear neurons. Let A ∈ R^{2×n_{l−1}} denote the incoming weights of these two neurons and C ∈ R^{n_{l+1}×2} their outgoing weights. Then for any positive invertible matrix B ∈ R^{2×2}_+, we can replace A by BA and C by CB^{−1}. Since B is positive, the neurons still have only positive incoming weights BA and still behave like linear neurons. Together, they implement the same function CAx = (CB^{−1})(BA)x.
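A sketch of Example 2 with invented dimensions and weights (two always-active neurons fed nonnegative input, reparameterized by a positive invertible B):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.uniform(0.1, 1.0, size=(2, 3))   # positive ingoing weights, 2 neurons
C = rng.normal(size=(1, 2))              # outgoing weights

def layer(x, A, C):
    # For x >= 0 and positive A, the ReLU is inactive: relu(A x) = A x.
    return C @ np.maximum(A @ x, 0.0)

B = np.array([[2.0, 1.0], [1.0, 1.0]])   # positive entries, invertible
x = rng.uniform(0.0, 1.0, size=3)        # nonnegative layer input

# Replacing A -> B A and C -> C B^{-1} leaves the function unchanged,
# since both parameterizations implement C A x on the nonnegative orthant.
assert np.allclose(layer(x, A, C), layer(x, B @ A, C @ np.linalg.inv(B)))
```

Since B ranges over a continuum, this gives a whole family of equioutput reparameterizations at a non-degenerate weight configuration.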
The question arises whether two neurons that are always active can be reduced to a single always-active neuron, just as we reduced degenerate cases in the 2-layer setting to obtain a minimal network. An example shows that this is not always possible. Example 3. We present an example where two always-active neurons cannot be combined into a single ReLU neuron. As before, let A ∈ R^{2×n_{l−1}} denote the incoming weights of two always-active neurons and B ∈ R^{n_{l+1}×2} their outgoing weights. For our example, it suffices to consider n_{l−1} = 2 and n_{l+1} = 1, with A and B chosen, e.g., such that BAx = x_1 − x_2. To combine the two hidden neurons into a single neuron, we would need D ∈ R^{1×2} and d ∈ R such that B(Ax)_+ = BAx = d(Dx)_+. The sign of d determines the sign of the contribution d(Dx)_+ to the output neuron independently of x ∈ R^2. But for x = (x_1, x_2), the contribution BAx = x_1 − x_2 is positive when x_1 > x_2 and negative when x_1 < x_2. Hence, no such D and d can exist.
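The sign argument in Example 3 can be illustrated numerically (the concrete numbers are ours): the combined linear map BAx = x_1 − x_2 takes both signs, while any single ReLU contribution d(Dx)_+ never changes sign:

```python
import numpy as np

# Combined map of the two always-active neurons (illustrative choice).
BA = np.array([1.0, -1.0])
assert BA @ np.array([2.0, 1.0]) > 0      # positive when x1 > x2
assert BA @ np.array([1.0, 2.0]) < 0      # negative when x1 < x2

# Any candidate single ReLU neuron d * relu(D . x) has values that are
# all >= 0 (if d > 0) or all <= 0 (if d < 0), so it cannot match BAx.
rng = np.random.default_rng(7)
for _ in range(100):
    d, D = rng.normal(), rng.normal(size=2)
    vals = [d * max(D @ x, 0.0) for x in rng.normal(size=(50, 2))]
    assert min(vals) >= 0.0 or max(vals) <= 0.0
```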
In the 2-layer case, all situation-dependent equioutput transformations appeared at degenerate weight constellations; hence, up to degenerate cases, the situation is quite simple. For deep neural networks, this changes drastically. Example 2 does not describe a degenerate condition: within a neighborhood of such a weight configuration, weights can be changed arbitrarily and the resulting network still possesses two neurons that are always active, and hence admits the described equioutput transformations. Empirically, however, neurons with only positive incoming weights are very unlikely to appear under proper initialization. We experimented with a simple MNIST network with He initialization. After training, we consistently observed for each neuron of the network that approximately half of the incoming weights are negative, which seems to empirically exclude neurons with only positive incoming weights. On the other hand, we note that there are other (more complex) ways to obtain always-active neurons, in which case equioutput transformations as in Example 2, different from π, ρ_λ, and ψ^J_−, exist.

Conclusion
Unlike the well-studied case of odd activation functions, the equioutput transformations of ReLU networks are more complicated. After removing neurons that are never active and merging neurons with identical activation pattern, we additionally have to consider neurons with opposite activation patterns. Such pairs of neurons can implement linear functions and cannot be completely removed, which allows equioutput transformations different from the well-known ones. Constructing linear neurons in deep networks by using positive weights shows that equioutput transformations exist not only in degenerate cases.
A Proof of Lemma 1

Therefore, we must have w_1 = λ w_2 for some λ. Assume λ < 0. We introduce µ ∈ R such that b_1 = µλ b_2. (Such a µ exists without loss of generality; if necessary, we can interchange the roles of b_1 and b_2.) If λ < 0, then R+(n_1) = {x : λ(w_2 · x + µ b_2) > 0} = {x : w_2 · x + µ b_2 < 0}. By the assumption that R+(n_1) = R+(n_2), we get that {x : w_2 · x + µ b_2 < 0} = {x : w_2 · x + b_2 > 0}. Since the first set contains γ w_2 for all sufficiently small γ, but the set of all γ such that γ w_2 is contained in the last set is bounded below, this leads to a contradiction for any µ. Hence λ cannot be negative. So we must have w_1 = λ w_2 for some λ > 0, and we again write b_1 = µλ b_2. Then R+(n_1) = {x : w_2 · x + µ b_2 > 0} = {x : w_2 · x + b_2 > 0} = R+(n_2), so we must have µ = 1, which completes the proof.

B Proof of Lemma 2
Proof. We consider the hyperplanes H_j, each dividing the space R^d into the two regions R+_j = R+(n_j) and R−_j, where n_j(x) > 0 and n_j(x) < 0, respectively. By assumption, the regions R+_j are pairwise different. Note that the union ∪_j H_j is a closed set of measure zero. Suppose that, for some scalar coefficients a_j and all x,

a_0 ℓ(x) + Σ_{j=1}^m a_j (n_j(x))_+ = 0,   (2)

where ℓ(x) = w_0 · x + c denotes the nonzero linear function replacing the constant 1. We need to show that a_j = 0 for all 0 ≤ j ≤ m.
For any z, we let I_z = {j ∈ {1, 2, . . . , m} | z ∈ R+_j} denote the set of indices of those R+_j that contain z. Then (2) reduces to a_0 ℓ(z) + Σ_{j∈I_z} a_j (w_j · z + b_j) = 0.
If z_0 ∉ ∪_j H_j, then by standard continuity arguments there is ε > 0 such that I_{z_1} = I_{z_0} for any z_1 with ||z_1 − z_0|| ≤ ε. Therefore, a_0 ℓ(z_1) + Σ_{j∈I_{z_0}} a_j (w_j · z_1 + b_j) = 0 for all such z_1. Since z_1 is arbitrary in the neighborhood of z_0, this affine identity in fact holds for all z ∈ R^d:

a_0 ℓ(z) + Σ_{j∈I_{z_0}} a_j (w_j · z + b_j) = 0.   (3)

Consider a hyperplane H_k and choose a point x_k ∈ H_k \ ∪_{i≠k} H_i. The latter set is non-empty under our assumption of pairwise distinct hyperplanes.
The R+_i are also pairwise different by assumption, so there is a sufficiently small ball B_ε(x_k) of radius ε > 0 around x_k that intersects no hyperplane other than H_k. Choose y_k ∈ B_ε(x_k) ∩ R+_k. Evaluating (2) at y_k and subtracting the identity (3) obtained from the points of B_ε(x_k) on the other side of H_k (for which the index set equals I_{y_k} \ {k}) yields 0 = 0 + a_k (w_k · y_k + b_k).
But y_k ∈ R+_k, so w_k · y_k + b_k > 0, and we must have a_k = 0. As k was arbitrary, this shows that a_k = 0 for all k ≥ 1. Then (3) immediately shows that also a_0 = 0, unless w_0 = 0.
If w_0 = 0, then c ≠ 0, as we assumed ℓ(x) to be nonzero. Choose an arbitrary x_0 with n_j(x_0) ≠ 0 for all j. Then (2) at x_0 reduces to a_0 c = 0, since a_j = 0 for all j ≥ 1. Since c ≠ 0, we must have a_0 = 0.

C Proof of Theorem 1
Proof. The functions we consider are (using R2, R3) of the form f_Wi(x) = Σ_j v^j_i (n^j_i(x))_+ + N_i(x), with neuron functions n^j_i(x) = w^j_i · x + b^j_i with pairwise different hyperplanes H^j_i for fixed i. Further, N_i(x) = α_i x + c_i is the linear function given by combining the bias of the output layer with a possible linear function implemented by neurons with opposite activation pattern.

D Example of a pair of degenerate (R1,R2,R3)-irreducible networks that implement the same network function

The following (R1,R2,R3)-irreducible networks f_W(x) = Σ_{j=1}^3 v_j (n_j(x))_+ each have pairwise different hyperplanes and implement the same network function. They are degenerate in the sense that the sum Σ_{j=1}^3 v_j n_j(x) = 0 adds up to the constant zero. One network is obtained from the other by a transformation ψ^J_−, which flips the sign of all weights of the first layer and adds a linear neuron with function x + y + (−x − y) = 0, which can then be removed by R1.
A more intricate example of two (R1,R2,R3)-irreducible networks f_W(x) = Σ_{j=1}^4 v_j (n_j(x))_+ + b with pairwise different hyperplanes implementing the same network function is the following. They are degenerate in the sense that the sum Σ_{j∈{1,2,4}} v_j n_j(x) adds up to a constant.