gcc – Cross compiling FFTW for ARM Neon

I am trying to compile FFTW3 to run on ARM NEON (more precisely, on a Cortex-A53). The build environment is x86_64-pokysdk-linux and the host environment is aarch64-poky-linux. I am using the aarch64-poky-linux-gcc compiler.
I used the following command at first:

./configure --prefix=/build_env/neon/neon_install_8 --host=aarch64-poky-linux --enable-shared --enable-single --enable-neon --with-sysroot=/opt/poky/2.5.3/sysroots/aarch64-poky-linux "CC=/opt/poky/2.5.3/sysroots/x86_64-pokysdk-linux/usr/bin/aarch64-poky-linux/aarch64-poky-linux-gcc -march=armv8-a+simd -mcpu=cortex-a53 -mfloat-abi=softfp -mfpu=neon"

The compiler rejected the -mfloat-abi=softfp and -mfpu=neon options. It also did not accept the sysroot path when passed via --with-sysroot.
I then used the following command:

./configure --prefix=/build_env/neon/neon_install_8 --host=aarch64-poky-linux --enable-shared --enable-single --enable-neon "CC=/opt/poky/2.5.3/sysroots/x86_64-pokysdk-linux/usr/bin/aarch64-poky-linux/aarch64-poky-linux-gcc" "CFLAGS=--sysroot=/opt/poky/2.5.3/sysroots/aarch64-poky-linux -mcpu=cortex-a53 -march=armv8-a+simd"

This command succeeded with this config log and this config.h. I then ran make followed by make install, copied the resulting shared library to my host environment, and switched my code base from the fftw_ functions to the fftw[f]_ single-precision ones (fftwf_ instead of fftw_). The final step was to recompile the program. I ran a test and compared the times of both versions using <sys/resource.h>. I also called fftw[f]_forget_wisdom() in both versions so that the comparison is fair. However, I am not getting a speedup. I expected that a SIMD build (NEON in this case) would make the FFTW library faster.
I would really appreciate it if anyone could point out what I am doing wrong, so that I can try a fix and see whether I get the performance boost I am looking for.
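
One check that can be made on the build described above is whether the generated config.h actually has NEON and single precision switched on. The following is a minimal sketch, not a definitive test: the helper name has_feature and the CONFIG_H path are hypothetical, and the macro names HAVE_NEON and FFTW_SINGLE are assumed to be the ones FFTW's configure script writes, so they should be confirmed against the real config.h.

```shell
#!/bin/sh
# Report whether a config.h has a given feature macro switched on.
# Usage: has_feature <path/to/config.h> <MACRO>
# (has_feature is a hypothetical helper, not part of FFTW.)
has_feature() {
    # Matches lines such as "#define HAVE_NEON 1"; a commented-out
    # "/* #undef HAVE_NEON */" line does not match.
    grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+$2[[:space:]]+1([[:space:]]|\$)" "$1"
}

# Assumed location: config.h at the top of the FFTW build tree.
CONFIG_H="${CONFIG_H:-./config.h}"

# Macro names are assumptions; check them against the actual file.
if [ -f "$CONFIG_H" ]; then
    for macro in HAVE_NEON FFTW_SINGLE; do
        if has_feature "$CONFIG_H" "$macro"; then
            echo "$macro: enabled"
        else
            echo "$macro: not enabled"
        fi
    done
fi
```

If either macro is reported as not enabled, the configure step did not pick up the corresponding --enable-neon or --enable-single option, and the library would have been built without it.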
