Poor perf from aocc+amd-fftw in linux with AMD Gen…

I have a very simple 3D in-place FFT code that uses FFTW with OpenMP multi-threading. I am trying to get the best performance on a Linux machine (Ubuntu, two sockets of AMD Genoa CPUs). I built it with the AMD compiler, AOCC 5.0, and AMD-FFTW (built with OpenMP and AVX-512 support) like this:

clang++ bench_fftw.cpp -o bench_fftw -fopenmp -march=znver4 -O3 -flto -mavx512 -ffast-math -L/opt/AMD/amd-fftw/lib -lfftw3f_omp -lfftw3f -lm -I/opt/AMD/amd-fftw/include

To run it, I typically set:

export OMP_NUM_THREADS=8 #for 1 CCD/NUMA

export OMP_PLACES=cores #use only the physical cores

export OMP_PROC_BIND=close
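
To check that these settings actually pin the 8 threads to a single CCD, the placement can be printed from inside the program. The helper below is only a rough sketch and is not part of bench_fftw.cpp: the names thread_cpus and print_thread_cpus are made up for illustration, and sched_getcpu is Linux/glibc-specific. It falls back to a single entry when built without -fopenmp.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>
#include <sched.h>   // sched_getcpu(): Linux/glibc-specific
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (not from the original benchmark): returns, per OpenMP
// thread, the logical CPU that thread is running on at the time of the call.
std::vector<int> thread_cpus() {
    std::vector<int> cpus(1, -1);
#ifdef _OPENMP
    #pragma omp parallel
    {
        // One thread sizes the vector; the implicit barrier at the end of
        // "single" ensures it is ready before any thread writes its slot.
        #pragma omp single
        cpus.assign(omp_get_num_threads(), -1);

        cpus[omp_get_thread_num()] = sched_getcpu();
    }
#else
    cpus[0] = sched_getcpu();   // serial fallback when OpenMP is disabled
#endif
    return cpus;
}

// Hypothetical helper: prints one "thread -> cpu" line per OpenMP thread.
void print_thread_cpus(const std::vector<int>& cpus) {
    for (std::size_t t = 0; t < cpus.size(); ++t)
        std::printf("thread %zu -> cpu %d\n", t, cpus[t]);
}
```

Calling print_thread_cpus(thread_cpus()) once before planning the FFT should list 8 distinct logical CPUs if the binding is in effect. Whether those CPUs belong to the same CCD depends on the machine's CPU numbering, so the output would need to be compared against the topology reported by a tool such as lscpu.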

I also have a version that uses the MKL FFT interface, built with the Intel compiler icpx and MKL:

icpx bench_mkl.cpp -o bench_mkl -qopenmp -O3 -Ofast -ffast-math -axCORE-AVX2,CORE-AVX512 -qmkl

The binary built with icpx+mkl-fft performs much better than the one built with aocc+amd-fftw: it is almost twice as fast.

Any advice on how to tune this code on AMD Genoa?

stackoverflow.com/questions/79410148/poor-perf-from-aoccamd-fftw-in-linux-with-amd-genoa-cpu…