Fast Fourier Transform

2nd Oct. 2023 - 20m read

Polynomial multiplication is a mundane yet computationally expensive task: multiplying two polynomials of degree n traditionally takes O(n^2) time.

The FFT algorithm brings down the complexity to O(n\log n), almost magically, with the power of complex numbers and some neat little algebra hacks.

Representing a polynomial

Consider a polynomial of degree n-1,

A(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}

Writing out the algebraic representation of a polynomial is cumbersome; instead, we can represent A(x) in other, cleaner ways.

  1. Coefficient representation: The vector A containing every coefficient of A(x) in increasing order of power of x, A = \{a_0,\ a_1,\ a_2,\ \dots,\ a_{n-1}\} can uniquely represent the polynomial A(x).

  2. Sample representation: Given a set X of n distinct samples, X = \{x_0,\ x_1,\ x_2,\ \dots,\ x_{n-1}\}, we evaluate A(x) at each sample in X. The set S containing all the sample-value pairs is then called the sample representation, S = \{(x_i,\ A(x_i))\ |\ x_i \in X\} = \{(x_0,\ A(x_0)),\ (x_1,\ A(x_1)),\ (x_2,\ A(x_2)),\ \dots,\ (x_{n-1},\ A(x_{n-1}))\} This representation stems from the polynomial interpolation theorem, which guarantees that a polynomial of degree n-1 is uniquely specified by its values at n distinct samples. The proof of existence and uniqueness of A(x) is beyond the scope of this blog.
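For instance, take A(x) = 1 + 2x + 3x^2, so n = 3. Its coefficient representation is A = \{1,\ 2,\ 3\}, and choosing X = \{0,\ 1,\ 2\} gives the sample representation S = \{(0,\ 1),\ (1,\ 6),\ (2,\ 17)\}.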

Multiplying two polynomials

Consider two polynomials A(x) and B(x), both of degree n-1,

A(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}

B(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1} x^{n-1}

Our goal is to multiply them to obtain C(x),

C(x) = A(x)*B(x)

The most obvious approach is the traditional multiplication operation, but it’s boring, tedious, and takes O(n^2) time.

Instead let’s try to see what this operation looks like in the alternate polynomial representations that were discussed.

Clearly multiplication is crazy fast in the sample representation: since C(x_i) = A(x_i) \cdot B(x_i), we just multiply the two values at each sample, which takes only O(n) time. (One subtlety: C(x) has degree 2n-2, so we need at least 2n-1 samples to uniquely specify it, meaning A(x) and B(x) must be evaluated at that many samples to begin with.) However, more often than not we do not have the sample representation computed at hand.

We could leverage the speed of polynomial multiplication in sample representation, if we could:

  1. Convert both A(x) and B(x) to sample representation (in reasonable time).
  2. Do the multiplication in linear time.
  3. And convert back to coefficient representation.

Well, we’re in luck, because the FFT algorithm does exactly that: it provides an elegant O(n\log n) way to convert from the coefficient representation to the sample representation and back.
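To make the pipeline concrete, here is a minimal Python sketch of the three steps, using numpy’s built-in FFT routines as the conversion (the function name multiply_polynomials is just illustrative):

import numpy as np

def multiply_polynomials(a, b):
    # a, b: coefficient vectors in increasing order of power.
    # The product has degree len(a) + len(b) - 2, so it needs
    # len(a) + len(b) - 1 samples to be uniquely specified.
    size = len(a) + len(b) - 1

    # Step 1: coefficient -> sample representation, O(n log n).
    # np.fft.fft zero-pads the input up to `size` before evaluating.
    fa = np.fft.fft(a, n=size)
    fb = np.fft.fft(b, n=size)

    # Step 2: pointwise multiplication, O(n).
    fc = fa * fb

    # Step 3: sample -> coefficient representation, O(n log n).
    return np.fft.ifft(fc).real  # coefficients are real for real inputs

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
print(multiply_polynomials([1, 2], [3, 4]))  # ~[3. 10. 8.]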

The Vandermonde matrix

Let’s take a look at step 1: converting a polynomial from its coefficient form to its sample form. To help visualize this conversion, we use the Vandermonde matrix.

Given a set X of n distinct samples, the Vandermonde matrix is illustrated below.

V = \begin{pmatrix} 1 & x_0^1 & x_0^2 & \cdots & x_0^{n-1} \\ 1 & x_1^1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2^1 & x_2^2 & \cdots & x_2^{n-1} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{n-1}^1 & x_{n-1}^2 & \cdots & x_{n-1}^{n-1} \end{pmatrix}

Notice how each row in the Vandermonde matrix is just one sample raised to increasing powers. So any row of the above matrix, when multiplied with the coefficient vector, evaluates to the value of the polynomial at that particular sample, i.e.

V_i*A = A(x_i)

Where V_i represents the ith row of the Vandermonde matrix and A is the coefficient vector of A(x).

So when we multiply the Vandermonde matrix with A, the resultant vector will be A(x) evaluated at each sample in X, and that’s the conversion we want.

\begin{pmatrix} 1 & x_0^1 & x_0^2 & \cdots & x_0^{n-1} \\ 1 & x_1^1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2^1 & x_2^2 & \cdots & x_2^{n-1} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{n-1}^1 & x_{n-1}^2 & \cdots & x_{n-1}^{n-1} \end{pmatrix} \cdot \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_{n-1} \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ \end{pmatrix}
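As a quick sanity check, numpy can build this matrix directly with np.vander (increasing=True orders the columns by increasing power, matching the layout above), reusing the toy polynomial A(x) = 1 + 2x + 3x^2 from earlier:

import numpy as np

coeffs = np.array([1, 2, 3])    # A(x) = 1 + 2x + 3x^2
samples = np.array([0, 1, 2])   # the set X

# Row j of V is samples[j] raised to the powers 0 .. n-1.
V = np.vander(samples, len(coeffs), increasing=True)

# The matrix-vector product evaluates A(x) at every sample at once.
print(V @ coeffs)  # [ 1  6 17]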

But just taking one look at this operation, we can tell it is going to take O(n^2) time. How does the FFT even manage to bring this down to O(n\log n)? Let’s explore.

The Algorithm

In essence, FFT is a divide and conquer algorithm, and it’s one of the most beautiful applications of complex numbers. We start with the assumption that n, the number of coefficients of A(x), is a power of 2 (we’ll see why).

  1. Divide: The first step is to divide A(x) into two sub-polynomials.

A_{even}(x) = a_0 + a_2x + a_4x^2 + a_6x^3 + \dots

A_{odd}(x) = a_1 + a_3x + a_5x^2 + a_7x^3 + \dots

Notice that these terms do not add up to A(x); we are merely separating out the even- and odd-indexed coefficients into two polynomials of half the size.

  2. Conquer: This is the recursion step: we recursively compute A_{even}(y) and A_{odd}(y) for y\in X^2 = \{x^2\ |\ x\in X\}

We go down the recursion tree till the polynomial can no longer be divided (i.e. we are left with a single coefficient). But notice that we are changing the set of samples we evaluate at: we use the set X^2 instead of X.

The combine step requires the samples to be in their squared form, which is why the samples are squared at the recursion step before being passed down.

  3. Combine: At the end of the recursion step we are left with even and odd terms that need to be combined. Take a look at A_{even} and A_{odd} again: we can use the expression below to get back A(x),

A(x) = A_{even}(x^2) + xA_{odd}(x^2)

Observe that the samples need to be squared in the combine step, which is why we evaluate A_{even} and A_{odd} at the squared samples in the recursion step.
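These three steps translate almost line-for-line into code. Here is a minimal Python sketch of the recursion as described so far, with the sample set passed in explicitly (the function name evaluate is just illustrative). Note that it naively passes all the squared samples down, so it is not fast yet; the speedup only appears once X is chosen so that squaring halves it, as we will see below.

def evaluate(coeffs, samples):
    # Base case: a single coefficient is a constant polynomial,
    # so its value is the same at every sample.
    if len(coeffs) == 1:
        return [coeffs[0]] * len(samples)

    # Divide: separate the even- and odd-indexed coefficients.
    a_even, a_odd = coeffs[0::2], coeffs[1::2]

    # Conquer: recursively evaluate both halves at the squared samples.
    squared = [x * x for x in samples]
    even_vals = evaluate(a_even, squared)
    odd_vals = evaluate(a_odd, squared)

    # Combine: A(x) = A_even(x^2) + x * A_odd(x^2).
    return [e + x * o for x, e, o in zip(samples, even_vals, odd_vals)]

# The toy example from earlier: A(x) = 1 + 2x + 3x^2 at X = {0, 1, 2}.
print(evaluate([1, 2, 3], [0, 1, 2]))  # [1, 6, 17]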

Complexity analysis

The time complexity of this algorithm depends on two parameters: the number of coefficients of A(x) and the size of the sample set X.

So let’s say it takes T(n,\ |X|) time to run the algorithm, where n is the number of coefficients and |X| is the number of samples.

It takes O(n) time to perform a linear scan for the divide step and O(|X|) time to combine the results across all the samples in X. And clearly, in each recursion step we reduce the size of the polynomial by a factor of 2; however, we now have two separate polynomials to evaluate, so

T(n,\ |X|) = 2T(n/2,\ |X|) + O(n+|X|)

But this is a bad recurrence: assuming that |X| is equal to n at the start, |X| never shrinks, so the recursion tree branches in a binary fashion with each node taking O(n) time. The bottom level alone has n such nodes, so the overall complexity becomes O(n^2) and we’re back at square one. All that work for nothing?

There’s just one missing piece: if only the size of X could also shrink at the same rate as n. Assuming |X| is equal to n at the start, if both parameters shrink by the same factor at every level, we lose the dependency on |X|.

Which makes T(n,\ |X|) = T(n), and we would have a familiar equation

T(n) = 2T(n/2) + O(n)

which solves to O(n\log n) using the master method.
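For completeness, the bookkeeping: with a = 2 subproblems, input size shrinking by a factor of b = 2, and combine cost f(n) = O(n), we compare f(n) against n^{\log_b a} = n^{\log_2 2} = n. They match, which is the balanced case of the master method: each of the \log n levels of the recursion tree does O(n) total work, giving O(n\log n) overall.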

The Missing Piece

Okay, our task now is to find a set X which, when squared, becomes half the size it originally was, and which continues to hold this property when squared repeatedly. Let’s take an example, assuming n=8:

X = \{-4,\ -3,\ -2,\ -1,\ 1,\ 2,\ 3, 4\}

squaring all the terms,

Y = X^2 = \{1,\ 4,\ 9,\ 16\}

this is good, we managed to reduce the number of terms by half; however, squaring the new set Y again gives \{1,\ 16,\ 81,\ 256\}, which still has four elements, so the property no longer holds.

Looks like starting with a set of 8 numbers isn’t the right way to figure out this set. What if we started with the final set that has just one element in it and then worked our way backwards to find X? Let’s say,

X^8 = \{1\}

Now the question to ask here is: which numbers, when squared, become 1? The square roots of 1, of course, and conveniently every nonzero complex number has exactly two square roots. Two is good, because we need to double the size of our set each time.

X^4 = \{-1,1\}

Doing it again we get

X^2 = \{i,-i,-1,1\}

Ah! Complex numbers. But don’t be scared; in fact, they’re the bread and butter of FFT. And finally,

X = \left\{\frac{1+i}{\sqrt 2},\ -\frac{1+i}{\sqrt 2},\ \frac{1-i}{\sqrt 2},\ -\frac{1-i}{\sqrt 2},\ i,\ -i,\ 1,\ -1\right\}

See the pattern? What we ended up with here are the 8th roots of unity. And this is the reason why we chose n to be a power of 2: the set can keep halving cleanly all the way down to \{1\}.

In general, the nth roots of unity (writing \tau = 2\pi) are:

X = \{e^{i\tau/n},\ e^{2i\tau/n},\ e^{3i\tau/n}, \ \dots\ e^{(n-1)i\tau/n},\ 1\}
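This set has exactly the halving property we were after: squaring the kth element gives \left(e^{ik\tau/n}\right)^2 = e^{ik\tau/(n/2)}, the kth of the (n/2)th roots of unity; for k \ge n/2 the values simply wrap around and repeat, which is why the squared set is half the size.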

So the kth element in the jth row of our Vandermonde matrix would be

V_{jk} = \left(e^{ij\tau/n}\right)^k = e^{ijk\tau/n}

That’s the missing piece: our divide and conquer algorithm now runs in O(n\log n) time.
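Putting everything together, here is a minimal sketch of the full algorithm in Python, assuming the input length is a power of 2 (math.tau is the \tau = 2\pi used above):

import cmath
import math

def fft(coeffs):
    # Evaluates the polynomial at the nth roots of unity in O(n log n).
    # Assumes n = len(coeffs) is a power of 2.
    n = len(coeffs)
    if n == 1:
        return list(coeffs)

    # Divide and conquer: the squared samples are the (n/2)th roots of
    # unity, so each half is a subproblem of half the size in both n and |X|.
    even = fft(coeffs[0::2])
    odd = fft(coeffs[1::2])

    # Combine: x_k = e^{ik*tau/n} and x_{k + n/2} = -x_k, so each pair of
    # outputs reuses the same A_even(x_k^2) and A_odd(x_k^2).
    result = [0j] * n
    for k in range(n // 2):
        x = cmath.exp(1j * math.tau * k / n)
        result[k] = even[k] + x * odd[k]
        result[k + n // 2] = even[k] - x * odd[k]
    return result

# A(x) = 1 + 2x + 3x^2 at the 4th roots of unity {1, i, -1, -i}:
print(fft([1, 2, 3, 0]))  # ~[6, -2+2j, 2, -2-2j]

For step 3, converting back: the Vandermonde matrix of the roots of unity has the remarkably clean inverse (V^{-1})_{jk} = \frac{1}{n}e^{-ijk\tau/n}, so running the same algorithm with conjugated roots and dividing the result by n recovers the coefficients, again in O(n\log n).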
