(Notes from a reading of Cohen and Welling: Steerable CNNs.)

Consider \mathbb{R}^2 as an affine hyperplane in \mathbb{R}^3, embedded via the map x\mapsto (x,1). Then the Euclidean motion group \tilde G = \mathbb{R}^2\rtimes O(2) has a convenient matrix representation. Let r be a rotation and t a translation. Then

\begin{split} r &= \begin{pmatrix} R & 0 \\ 0 & 1 \end{pmatrix} \\ t &= \begin{pmatrix} I & T \\ 0 & 1 \end{pmatrix} \end{split}

where R\in O(2) and T\in \mathbb{R}^2. Given x\in \mathbb{R}^2, we may identify it with a translation map in \tilde G via

\bar x = \begin{pmatrix} I & x \\ 0 & 1 \\ \end{pmatrix}.
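
To make the action concrete: multiplying out the matrix representation, the element tr sends the embedded point x to

\begin{pmatrix} R & T \\ 0 & 1 \end{pmatrix}\begin{pmatrix} x \\ 1 \end{pmatrix} = \begin{pmatrix} Rx + T \\ 1 \end{pmatrix},

so (tr)\cdot x = Rx + T.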

In image recognition, \mathbb{Z}^2 parametrizes the pixels of an image with infinite width and height, and a discrete subgroup G of \tilde G acts on this parametrization. In particular, G = \mathbb{Z}^2\rtimes D_4, so that our parametrization is the homogeneous space \mathbb{Z}^2 = G/D_4. Let’s call this parametrization the pixel space.
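
As a quick illustrative sketch (not part of the notes' formalism), the point stabilizer D_4 can be realized concretely as the eight integer matrices in O(2) that map \mathbb{Z}^2 to itself:

```python
import numpy as np

# D_4 as the eight integer matrices in O(2) preserving Z^2:
# rotations by multiples of 90 degrees and their compositions with a reflection.
r90 = np.array([[0, -1], [1, 0]])     # rotation by 90 degrees
flip = np.array([[1, 0], [0, -1]])    # reflection across the horizontal axis
identity = np.eye(2, dtype=int)

D4 = [np.linalg.matrix_power(r90, k) @ m for k in range(4) for m in (identity, flip)]

print(len(D4))                                              # 8 elements
print(all(np.array_equal(R @ R.T, identity) for R in D4))   # each is orthogonal: True
```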

Images are described by feature maps, which are functions f:G/D_4 \rightarrow \mathbb{R}^K, with each dimension of the target interpreted as a quantification of some feature for each pixel. Standard examples are given by feature maps that assemble various color channels. For instance, K=3 may correspond to an RGB image, and K=4 may correspond to a CMYK image. A representation of a real-world image would thus be a compactly supported feature map.

Over the pixel space, we have a homogeneous vector bundle G\times_{D_4} \mathbb{R}^K with an action by the discrete motion group given by g'\cdot(g,v) = (g'g, v). Let \mathcal{F} = \Gamma(G\times_{D_4} \mathbb{R}^K) be the space of all feature maps. G acts on it by left translation: \pi(tr) f(xD_4) = (tr)\cdot f((tr)^{-1}\cdot xD_4).

Let \Psi: \mathcal{F}\rightarrow \mathbb{R}^{K'} be a filter bank, thought of as a collection of K' linear functionals on the space of feature maps. Each functional operates locally around a fixed pixel (which we may as well assume is the origin), outputting a weighted sum of the values of the feature map at that pixel and its neighbors. By translating each pixel to the origin, we can construct a feature map \Psi\ast f\in \mathcal{F}', where \mathcal{F}' = \Gamma(G\times_{D_4} \mathbb{R}^{K'}) is another space of feature maps. This is defined as follows:

(\Psi\ast f)(x) = \Psi(\pi(\bar x)^{-1} f).
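
To see why this reproduces the familiar convolutional layer, note that a pure translation does not touch the fibers, so (\pi(\bar x)^{-1} f)(y) = f(x + y). If each functional of \Psi is supported on a finite neighborhood S of the origin with weight vectors \psi_{k'}(y)\in\mathbb{R}^K (notation introduced here only for illustration), the definition unwinds to

(\Psi\ast f)(x)_{k'} = \sum_{y\in S} \psi_{k'}(y)^\top f(x+y),

i.e. the usual cross-correlation of the feature map with a bank of K' filters.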

Thus we get a map \Phi:\mathcal{F}\rightarrow \mathcal{F}' via \Phi(f) = \Psi\ast f. This map is called a convolutional neural network.

Let (\mathbb{R}^{K'},\rho) be a representation of D_4 on the model fiber of \mathcal{F}'. Our objective is to construct a representation (\mathcal{F}',\pi') such that if the filter bank \Psi intertwines the dihedral group representations (\mathcal{F},\pi) and (\mathbb{R}^{K'},\rho), then the convolutional neural network \Phi intertwines the discrete motion group representations (\mathcal{F},\pi) and (\mathcal{F}',\pi'). First we need some algebra:

Lemma. Let r be a rotation, t a translation, and \bar x the translation corresponding to the position x\in\mathbb{R}^2. Then

(tr)^{-1} \bar x\, r = \overline{(tr)^{-1}\cdot x}.

The proof is a straightforward application of the given matrix representation. This result leads to the following equivariance rule.
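
Spelled out in matrices, using (tr)^{-1} = \begin{pmatrix} R^{-1} & -R^{-1}T \\ 0 & 1 \end{pmatrix}, the computation reads

\begin{split} (tr)^{-1}\,\bar x\, r &= \begin{pmatrix} R^{-1} & -R^{-1}T \\ 0 & 1 \end{pmatrix} \begin{pmatrix} I & x \\ 0 & 1 \end{pmatrix} \begin{pmatrix} R & 0 \\ 0 & 1 \end{pmatrix} \\ &= \begin{pmatrix} I & R^{-1}(x - T) \\ 0 & 1 \end{pmatrix} = \overline{(tr)^{-1}\cdot x}. \end{split}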

Proposition. Let r be a rotation, t a translation, and \bar x the translation corresponding to the position x\in\mathbb{Z}^2. If \Psi\pi(r) = \rho(r)\Psi for all r\in D_4, then

(\Psi\ast \pi(tr) f)(x) = \rho(r)(\Psi\ast f)((tr)^{-1}\cdot x).

Proof. Inserting the identity \pi(r)\pi(r)^{-1} and using the D_4-equivariance of \Psi, we obtain the equivariance law:

\begin{split} (\Psi\ast \pi(tr) f)(x) &= \Psi(\pi(\bar x)^{-1}\pi(tr)f)\\ &= \Psi(\pi(r)\pi(r)^{-1}\pi(\bar x)^{-1}\pi(tr)f) \\ &= \rho(r)\Psi(\pi(r)^{-1}\pi(\bar x)^{-1}\pi(tr)f) \\ &= \rho(r)\Psi(\pi(r^{-1}\bar x^{-1}tr)f)\\ &= \rho(r)\Psi(\pi((tr)^{-1}\bar x\, r)^{-1}f)\\ &= \rho(r)\Psi(\pi(\overline{(tr)^{-1}\cdot x})^{-1}f)\\ &= \rho(r)(\Psi\ast f)((tr)^{-1}\cdot x) \end{split}

With this calculation in mind, we are now in a position to define a representation of G on \mathcal{F}':

\pi'(tr)(\Psi\ast f)(x) = \rho(r)(\Psi\ast f)((tr)^{-1}\cdot x).

To verify that \pi' is indeed a representation, one checks the homomorphism property \pi'(t_1r_1t_2r_2) = \pi'(t_1r_1)\,\pi'(t_2r_2) by using the fact that the conjugate of a translation by a rotation is again a translation, and that

t_1r_1t_2r_2 = t_1r_1t_2r_1^{-1}r_1r_2.
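
Explicitly (a routine check, spelled out here for completeness), applying both sides to \Psi\ast f at a pixel x:

\begin{split} \pi'(t_1r_1)\,\pi'(t_2r_2)(\Psi\ast f)(x) &= \rho(r_1)\,\big(\pi'(t_2r_2)(\Psi\ast f)\big)((t_1r_1)^{-1}\cdot x) \\ &= \rho(r_1)\rho(r_2)\,(\Psi\ast f)((t_2r_2)^{-1}(t_1r_1)^{-1}\cdot x) \\ &= \rho(r_1r_2)\,(\Psi\ast f)((t_1r_1t_2r_2)^{-1}\cdot x) \\ &= \pi'(t_1r_1t_2r_2)(\Psi\ast f)(x), \end{split}

where the last equality uses the decomposition t_1r_1t_2r_2 = \big(t_1\,(r_1t_2r_1^{-1})\big)(r_1r_2) into a translation followed by a rotation.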

As a consequence, we get an intertwining property:

Corollary. If \Psi\pi(r) = \rho(r)\Psi for all r\in D_4, then \Phi\, \pi(g) = \pi'(g)\, \Phi for all g\in G.
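
Indeed, unwinding the definitions for g = tr and applying the Proposition:

\Phi(\pi(tr) f)(x) = (\Psi\ast \pi(tr) f)(x) = \rho(r)(\Psi\ast f)((tr)^{-1}\cdot x) = \pi'(tr)\,\Phi(f)(x).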

Whenever we can find representations (\mathcal{F},\pi) and (\mathcal{F}',\pi') for which \Phi is an intertwiner, we say that \Phi is a steerable convolutional neural network.

To determine the homogeneous vector bundle of which \Psi\ast f is a section, it is enough to calculate the action of D_4 on (\Psi\ast f)(0):

\pi'(r)(\Psi\ast f)(0) = \rho(r)(\Psi\ast f)(r^{-1}\cdot 0) = \rho(r)(\Psi\ast f)(0).

This means that \Psi\ast f\in \Gamma(G\times_\rho \mathbb{R}^{K'}). From representation theory, we know that

\Gamma(G\times_\rho \mathbb{R}^{K'}) \cong Ind_{D_4}^G(\rho)

thus we may interpret \Phi as a map into an induced representation of the discrete motion group. Moreover, we can treat \Gamma(G\times_{D_4} \mathbb{R}^K) as an induced representation as well.

On \mathcal{F}, the action of D_4 should only rotate the pixels of the image; there should be no linear transformations within fibers (i.e. no transformations of color channels). This means that D_4 acts trivially on the value of f at the origin. Hence we regard \mathcal{F} as \Gamma(G\times_{\rho_0} \mathbb{R}^K), where \rho_0 denotes the trivial representation of D_4 on \mathbb{R}^K. To summarize, \Gamma(G\times_{\rho_0} \mathbb{R}^K)\cong Ind_{D_4}^G(\rho_0)\cong C(\mathbb{Z}^2)\otimes\mathbb{R}^K. We can now restate this picture in a representation-theoretic context.

Theorem. Let (\mathbb{R}^K,\rho_0) be a trivial representation of D_4 and Ind_{D_4}^G(\rho_0) the corresponding space of feature maps. Let \Psi : Ind_{D_4}^G(\rho_0) \rightarrow \mathbb{R}^{K'} be a D_4-equivariant filter bank with respect to (\mathbb{R}^{K'},\rho). Then \Phi: Ind_{D_4}^G(\rho_0) \rightarrow Ind_{D_4}^G(\rho), \Phi(f) = \Psi\ast f, is a steerable convolutional neural network.
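
As a closing sanity check, here is a minimal numerical sketch of the theorem (not from the notes or the paper; the grid size, the choice \rho(r) = r as a two-dimensional representation of D_4 with K' = 2, the group-averaged filter bank, and all function names below are our own illustrative assumptions). We replace \mathbb{Z}^2 by a periodic N\times N grid, build a D_4-equivariant filter bank \Psi by group-averaging a random local filter, and verify the intertwining property \Phi\,\pi(g) = \pi'(g)\,\Phi on a random feature map.

```python
import numpy as np
from itertools import product

N, K = 8, 3                      # periodic N x N grid as a finite stand-in for Z^2; K input channels
rng = np.random.default_rng(0)

# D_4 as the eight integer matrices preserving Z^2; rho(r) is taken to be r itself (K' = 2).
r90 = np.array([[0, -1], [1, 0]])
flip = np.array([[1, 0], [0, -1]])
D4 = [np.linalg.matrix_power(r90, k) @ m for k in range(4)
      for m in (np.eye(2, dtype=int), flip)]

offsets = list(product([-1, 0, 1], repeat=2))    # 3 x 3 local support, closed under D_4

def act(R, T, x):
    """(tr) . x = R x + T on the periodic grid."""
    return tuple((R @ np.array(x) + np.array(T)) % N)

def inv(R, T):
    """(tr)^{-1} = (R^{-1}, -R^{-1} T)."""
    Rinv = np.round(np.linalg.inv(R)).astype(int)
    return Rinv, tuple((-Rinv @ np.array(T)) % N)

# A D_4-equivariant filter bank by group averaging: psi(y) = sum_s rho(s) psi0(s^{-1} y),
# which guarantees psi(R y) = rho(R) psi(y), i.e. Psi pi(r) = rho(r) Psi.
psi0 = {y: rng.normal(size=(2, K)) for y in offsets}
psi = {}
for y in offsets:
    acc = np.zeros((2, K))
    for S in D4:
        Sinv = np.round(np.linalg.inv(S)).astype(int)
        acc += S @ psi0[tuple((Sinv @ np.array(y)).tolist())]
    psi[y] = acc

def correlate(f):
    """(Psi * f)(x) = Psi(pi(xbar)^{-1} f): a weighted sum over the neighbours of x."""
    out = np.zeros((N, N, 2))
    for x in product(range(N), repeat=2):
        out[x] = sum(psi[y] @ f[tuple((np.array(x) + np.array(y)) % N)] for y in offsets)
    return out

def pi(R, T, f):
    """Left translation on input feature maps (trivial fibre action): (pi(tr) f)(x) = f((tr)^{-1} . x)."""
    Rinv, Tinv = inv(R, T)
    out = np.zeros_like(f)
    for x in product(range(N), repeat=2):
        out[x] = f[act(Rinv, Tinv, x)]
    return out

def pi_prime(R, T, h):
    """Induced action on output maps: (pi'(tr) h)(x) = rho(r) h((tr)^{-1} . x)."""
    Rinv, Tinv = inv(R, T)
    out = np.zeros_like(h)
    for x in product(range(N), repeat=2):
        out[x] = R @ h[act(Rinv, Tinv, x)]
    return out

# Verify the intertwining property Phi(pi(g) f) = pi'(g) Phi(f) for a random g and f.
f = rng.normal(size=(N, N, K))
R, T = D4[3], (2, 5)
print(np.allclose(correlate(pi(R, T, f)), pi_prime(R, T, correlate(f))))   # True
```

Group averaging is just one convenient way to manufacture an equivariant \Psi; the check illustrates that once \Psi\pi(r) = \rho(r)\Psi holds, equivariance of \Phi under the full discrete motion group follows automatically.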