Faster Than the Fastest FFT for Arduino

by klafyvel


Faster Than the Fastest FFT for Arduino


The Fast Fourier Transform (FFT) is an efficient algorithm to compute the discrete Fourier transform of a signal. If you search for implementations of the algorithm, you will find this great Instructable. However, while it provides an effective way to implement the FFT, it is possible to be even more efficient while keeping high precision. In the present Instructable, I propose to re-use the implementation I made in Julia here (English version available here on my blog) to beat ApproxFFT. Let's dig in!

Supplies

All you need to test the implementation is an Arduino board (I will use my old Arduino Uno) and a USB cable to upload the code.

What Is the Main Difference With ApproxFFT?

ApproxFFT relies on a central trick to compute the FFT: it approximates the sines and cosines needed for the calculation using a lookup table. In contrast, the code provided in this Instructable computes the FFT with close to zero approximation, using a trigonometric identity instead.
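
To give an idea of how an identity can replace a lookup table, here is a minimal sketch (not the exact code from the repository, whose identity may differ) that generates successive values of cosine and sine from the angle-addition formulas, so the trigonometric functions only need to be evaluated once per FFT stage:

#include <math.h>

// Hedged sketch: generate cos(k*delta) and sin(k*delta) iteratively with the
// angle-addition identities instead of reading them from a lookup table.
void twiddle_demo(float delta, int count) {
  float c = 1.0f, s = 0.0f;                  // cos(0) and sin(0)
  float cd = cosf(delta), sd = sinf(delta);  // evaluated once per stage
  for (int k = 0; k < count; ++k) {
    // ...use (c, s) as the twiddle factor for index k here...
    float c_next = c * cd - s * sd;          // cos(a + delta)
    float s_next = s * cd + c * sd;          // sin(a + delta)
    c = c_next;
    s = s_next;
  }
}

The recurrence trades the lookup table for a few multiplications and additions per factor; only rounding errors accumulate, instead of the quantisation error of a table.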

Another issue with ApproxFFT is that it does not perform the calculation in place! Indeed, take a look at lines 66 and 67 in the original file. You'll notice two array allocations that triple the program's memory footprint and prevent using arrays bigger than 256 cells on my Arduino Uno!

In this Instructable, I will provide three new implementations of the FFT on Arduino. The first uses the float type, thanks to some floating-point arithmetic tricks (described in this blog post - in French, sorry). The second and third use fixed-point arithmetic stored in integer types, just as in the original Instructable. However, the calculations are made faster thanks to the trigonometric identity, a handcrafted saturating addition routine, and a fixed-point multiplication based on the fmuls instruction of the AVR instruction set.
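
To make the fixed-point vocabulary concrete, here is a minimal C sketch (illustrative only; the repository implements these routines in hand-tuned AVR assembly) of a saturating 16-bit addition and a Q1.15 fixed-point multiplication:

#include <stdint.h>

// Saturating addition: clamp to the representable range instead of wrapping
// around on overflow, which would be catastrophic inside an FFT butterfly.
int16_t sat_add16(int16_t a, int16_t b) {
  int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}

// Q1.15 fixed-point multiplication: widen, multiply, then shift the result
// back so that 0x7FFF keeps meaning "almost one".
int16_t q15_mul(int16_t a, int16_t b) {
  return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
}

On AVR, the widening multiply and shift can be carried out with the hardware signed fractional multiply instructions (fmuls and friends), which is what the assembly routines in the repository rely on.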


All the code is hosted in the dedicated GitHub repository. There is also a blog post on my website explaining the inner workings of the implementation in detail. This Instructable focuses more on the benchmarks and how to reproduce them.

How to Compare the Two Programs?

test_signal.png

To compare the two implementations, we need to set up a somewhat systematic test method. In the Arduino sketch, I will use the following `setup` and `loop`:


int data[256]={};
int N = 128;
float frequency = 44000; // Actually useless for our test.

// our timer
unsigned long timestart;
unsigned long timestop;
unsigned long totaltime;

void setup()
{
 Serial.begin(250000); 
 while (!Serial) {
   ; // wait for serial port to connect. Needed for native USB port only
 }
}

void loop() {
 Serial.println("Ready");
 // First, we read the length of the test. An int is two bytes on an Arduino Uno.
 Serial.readBytes((char*)(&N), sizeof(int));
 Serial.println(N, DEC);
 // Then, we read the input data
 Serial.readBytes((char*)(data), N*sizeof(int));
 Serial.println("Received data. Starting FFT calculation.");
 // We can now start the FFT calculation !
 timestart = micros();

 // Here we perform the FFT calculation.

 timestop = micros();
 totaltime = timestop-timestart;
 Serial.println("FFT calculated, I will now send the result");

 // Then we send back the data to the computer.
 Serial.write((char*)(data), N*sizeof(int));

 // And finally we write the execution time (an unsigned long, i.e. four bytes).
 Serial.write((char*)(&totaltime), sizeof(totaltime));
 delay(1);
 Serial.println("And we're done !");

}


The data and timing benchmarks can then easily be retrieved in Julia over the serial link. If you are curious about how this is done, head to the blog post or read the program_tester.jl file on GitHub!

For comparison purposes, I also benchmarked the original ApproxFFT code, so I had to slightly modify its loop function and the end of its FFT code in order to retrieve the result and send it back to the computer.

Designing an FFT for Floats

results_FloatFFT.png

The main improvement in speed here clearly comes from the trigonometry trick. But I also added some fun IEEE-754 tricks to quickly compute multiplications, halvings, and complex moduli. These tricks all work on the same principle: use the IEEE-754 representation of a number as an approximation of its base-two logarithm. I also used the dedicated AVR instruction to multiply fractional numbers, mainly because it was fun.
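
As an illustration of that principle (a generic sketch, not necessarily the exact tricks used in FloatFFT): reinterpreting the bits of a positive, normal IEEE-754 float as an integer gives a cheap approximation of its base-two logarithm, and halving amounts to decrementing the exponent field:

#include <stdint.h>
#include <string.h>

// Read the raw IEEE-754 bit pattern of a float.
static uint32_t float_bits(float x) {
  uint32_t b;
  memcpy(&b, &x, sizeof(b));
  return b;
}

// Approximate log2(x) for x > 0: the exponent field plus the mantissa,
// read together as a fixed-point number, is close to log2(x) + 127.
float approx_log2(float x) {
  return (float)float_bits(x) / (float)(1UL << 23) - 127.0f;
}

// Halve a positive, normal float by decrementing its exponent field.
float fast_half(float x) {
  uint32_t b = float_bits(x) - (1UL << 23);
  float y;
  memcpy(&y, &b, sizeof(y));
  return y;
}

On an 8-bit AVR with no floating-point unit, this kind of byte shuffling is far cheaper than calling the full software floating-point routines.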


You can see in the attached plot that the computed FFT is correct, with an acceptable mean squared error.

Going Faster, Using a Fixed-Point FFT

results_Fixed8FFT.png
results_Fixed16FFT.png

The two other implementations use fixed-point arithmetic. The first one uses 16-bit numbers (which makes it equivalent to ApproxFFT), and the second one 8-bit numbers. That means you could theoretically use this code on arrays of size 512 for the former and 1024 for the latter. Theoretically, because you would need to make minor changes to the buffers and, most importantly, for 1024-point arrays you would need to implement 32-bit fixed-point arithmetic for the trigonometry to keep working. All this is detailed in the blog post.
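
For reference, here is what the 8-bit (Q1.7) flavour of the fixed-point multiplication looks like in plain C — a hedged sketch of the format Fixed8FFT works in. The AVR fmuls instruction computes this signed fractional product in hardware; the high byte of its 16-bit result is the value returned below.

#include <stdint.h>

// Q1.7 fixed-point multiplication: the 8-bit analogue of the 16-bit routine,
// trading precision for speed and halving the memory used by the data array.
int8_t q7_mul(int8_t a, int8_t b) {
  return (int8_t)(((int16_t)a * (int16_t)b) >> 7);
}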

Nonetheless, Fixed16FFT is about as accurate as ApproxFFT, and Fixed8FFT is not far behind.

Running All the Benchmarks

One nice thing about this project is that you should be able to reproduce the benchmarks using the script program_tester.jl. First, you need to download Julia. Although I don't personally use it, I know a lot of people like Julia for Visual Studio Code; if you don't already have an editor, use that one. You also need arduino-cli.


Then launch Julia and install the dependencies. It can take a while!

]add CairoMakie, DataFrames, FFTW, FileIO, FixedPointNumbers, LibSerialPort, LinearAlgebra, Sixel, Statistics, Printf

Note that if your platform does not support Sixel, you can ignore it: simply don't run lines 3 to 10 of the script. In VS Code in particular, you don't need it.


Then you can run the script. It might fail on the first run; simply run it again. You should see the different error comparison plots displayed, and then the benchmarks.

How Fast and Accurate Can the FFT Be?

error_comparison.png
execution_time_comparison.png

Hurray! We have two implementations of the FFT that are faster than the Fastest FFT! As you can see, Fixed16FFT only needs 30 ms to run a 256-point FFT, whereas ApproxFFT needs 50 ms. Even better, Fixed8FFT runs in 11 ms for the same input length! You can also notice that FloatFFT is not far behind. All these implementations have roughly the same mean squared error when compared to the reference implementation running on my computer. I have also included one last implementation, ExactFFT, which computes the FFT using floats without any trigonometry or floating-point tricks. You can see that although it is much slower than the others, it is billions of times more accurate.

Conclusion

This was a fun project. In the end, we get a five-fold improvement in execution time compared to ApproxFFT, with the same precision. It is also much more space-efficient. There are plenty of minor implementation gotchas that may interest you on the blog. For example, did you know you can calculate complex moduli using a chainsaw?

Now that all this code is public (see on GitHub), I hope someone will improve on it and make it even faster, so you can be faster than faster than the fastest FFT for Arduino!

Please reach out if you have comments or feedback, or need some clarification. And most of all, have fun with your current projects!