The naming here is a bit misleading: cross-correlation and convolution are actually different operations in signal processing. However, in deep learning, convolutions are implemented as cross-correlations (not that it matters much, but it doesn’t hurt to clarify).
Given two 1-dimensional functions \(f\) and \(g\), we have:
Correlation
\[ (f \star g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t + \tau) d\tau \]
A measure of similarity between \(f\) and \(g\): for each shift \(t\), we compute the area under \(f\) weighted by a copy of \(g\) shifted by \(t\).
People distinguish two cases:
Cross-correlation: \(f \star g\), similarity between two different signals.
Auto-correlation: \(f \star f\), self-similarity of a signal (for every possible shift).
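As a discrete sanity check (a NumPy sketch; the signals here are made up for illustration), both operations can be computed with `np.correlate`:

```python
import numpy as np

f = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
g = np.array([1.0, 2.0, 1.0])

# Cross-correlation: slide g over f *without* flipping it,
# multiplying and summing at every shift t.
cross = np.correlate(f, g, mode="full")

# Auto-correlation: correlate the signal with itself.
# The peak sits at zero shift, where the signal matches itself best.
auto = np.correlate(f, f, mode="full")

print(cross)
print(auto)
```

With `mode="full"` the output covers every shift at which the signals overlap, so `auto` has length \(2\,\mathrm{len}(f) - 1\) and its maximum is at the center (zero shift).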
Convolution
\[ (f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau) d\tau \]
We are doing the same operation as cross-correlation, but “flipping” \(g\).
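The flip relation can be verified numerically (a NumPy sketch with arbitrary example signals): convolving with \(g\) equals cross-correlating with an already-flipped \(g\).

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])

# np.convolve flips g before sliding it over f...
conv = np.convolve(f, g, mode="full")

# ...so it matches np.correlate applied to g reversed.
corr_flipped = np.correlate(f, g[::-1], mode="full")

assert np.allclose(conv, corr_flipped)
```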
My interpretation1 is that we need an operation that considers this “flip” in time in order to process signals as they happened in time. I.e., if there is a cause -> effect type of relationship in the input data, the filter processes it in the correct temporal order:
1 Disclaimer: I know nothing about signal processing and this might be a terrible guess, but I think it makes sense.

Cross-correlation will make the filter “see” the effect and then the cause
Convolution will make the filter “see” the cause and then the effect
So, depending on the application we might consider one or the other.
It turns out that the Fourier transform of a convolution is the product of the Fourier transforms of each function individually (the convolution theorem):
\[ \mathcal{F}\{f(t) * g(t)\} = \mathcal{F}\{f(t)\} \cdot \mathcal{F}\{g(t)\} \]
Also:
\[ \mathcal{F}\{f(t) \cdot g(t)\} = \mathcal{F}\{f(t)\} * \mathcal{F}\{g(t)\} \]
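The first identity can be checked on discrete signals (a NumPy sketch with random example data; the zero-padding to length \(n\) is needed because the FFT otherwise computes a *circular* convolution):

```python
import numpy as np

f = np.random.default_rng(0).normal(size=16)
g = np.random.default_rng(1).normal(size=16)

# Pad both FFTs to the full linear-convolution length so that
# multiplying in the frequency domain matches np.convolve.
n = len(f) + len(g) - 1
conv_fft = np.fft.ifft(np.fft.fft(f, n) * np.fft.fft(g, n)).real

assert np.allclose(conv_fft, np.convolve(f, g))
```

This is also why FFT-based convolution is used in practice: for long signals, the \(O(n \log n)\) transform beats the \(O(n^2)\) direct sum.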
Summary
In deep learning, the name “convolution” became the popular one even though the operation actually being run is cross-correlation. Since the values of the kernel are learnt, it doesn’t matter whether they are flipped or not (as long as it is consistent): the network would simply learn the kernel in its flipped form.
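A minimal sketch of what a 1-D “convolutional” layer actually computes (the function name and data are mine, not from any framework): the kernel is applied as-is at each position, without the flip, i.e. a cross-correlation.

```python
import numpy as np

def conv1d_layer(x, kernel):
    """What DL frameworks call "convolution": the kernel slides
    over the input as-is (no flip), i.e. cross-correlation."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])  # asymmetric, so the flip matters

out = conv1d_layer(x, w)

# Matches np.correlate (no flip), not np.convolve (with flip).
assert np.allclose(out, np.correlate(x, w, mode="valid"))
assert not np.allclose(out, np.convolve(x, w, mode="valid"))
```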
This is a visualization of the cross-correlation / convolution operation 2:
2 In this case, since \(g\) is symmetric, it doesn’t matter whether it gets flipped.
