Because you have to capture 2 s of audio before the first FFT can run, the window length sets a lower bound on latency. Even with your 50% overlap, the minimum latency is still 1 s. The FFT and other processing add to this, though hopefully not by a significant amount (if they do, use a faster FFT library). The only way to reduce this latency is to sacrifice frequency resolution.
Any FFT-based method involves a time-frequency trade-off. If you want lower latency, you have to use less data, and with an FFT (whether shorter or zero-padded) that gives you a less accurate frequency estimate.
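To put rough numbers on that trade-off, here is a minimal sketch. The function name `window_tradeoff` and the list of window lengths are my own choices for illustration, and it only models the capture side (bin spacing of an un-padded FFT is 1 / window duration, and a new result arrives every hop), not any processing time.

```python
def window_tradeoff(window_seconds, overlap=0.5):
    """Return (bin_spacing_hz, hop_seconds) for an FFT over window_seconds of audio.

    bin_spacing_hz: spacing of the un-padded FFT bins, fs / N = 1 / window_seconds.
    hop_seconds:    time between successive FFT frames at the given overlap.
    """
    if window_seconds <= 0 or not 0 <= overlap < 1:
        raise ValueError("need window_seconds > 0 and 0 <= overlap < 1")
    bin_spacing_hz = 1.0 / window_seconds
    hop_seconds = window_seconds * (1.0 - overlap)
    return bin_spacing_hz, hop_seconds


# Illustrative window lengths; 2.0 s with 50% overlap is the case in the question.
for seconds in (2.0, 1.0, 0.5):
    spacing, hop = window_tradeoff(seconds)
    print(f"{seconds:.1f} s window: {spacing:.2f} Hz bins, new frame every {hop:.2f} s")
```

Halving the window halves the wait between frames but doubles the bin spacing, which is the trade-off described above.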
Zero-padding only gives you a high-quality interpolation of the spectrum; it does not add any new information. Even so, it may give a better peak frequency estimate than just taking the center of the peak bin of a shorter FFT.
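As a rough illustration of that last point, the sketch below compares the peak-bin estimate from a short transform with the one from the same data zero-padded 16x. Everything here is an assumption made for the demo: the 1 kHz sample rate, the 103 Hz test tone, the 64-sample Hann-windowed frame, and the helper `peak_frequency`. It uses a plain (slow) DFT so that it runs without extra libraries; in practice you would use a real FFT, for example `numpy.fft.rfft(x, n=...)`, which zero-pads for you.

```python
import cmath
import math


def peak_frequency(samples, sample_rate, n_fft):
    """Return the center frequency of the largest-magnitude bin of an n_fft-point DFT.

    With n_fft > len(samples), this is the same as zero-padding the input
    to n_fft samples before transforming.
    """
    best_bin, best_mag = 0, -1.0
    for k in range(n_fft // 2 + 1):  # non-negative frequencies only
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n_fft


# Assumed demo values, not taken from the question.
SAMPLE_RATE = 1000.0   # Hz
TRUE_FREQ = 103.0      # Hz, deliberately between two coarse bins
N = 64                 # samples in the short frame (bin spacing 15.625 Hz)

# Hann-windowed sine; the window keeps leakage from skewing the peak.
tone = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N))
        * math.sin(2.0 * math.pi * TRUE_FREQ * n / SAMPLE_RATE)
        for n in range(N)]

coarse = peak_frequency(tone, SAMPLE_RATE, N)        # no padding
padded = peak_frequency(tone, SAMPLE_RATE, 16 * N)   # zero-padded 16x

print(f"true {TRUE_FREQ} Hz, short DFT {coarse} Hz, zero-padded {padded} Hz")
```

The padded estimate lands on a much finer grid (about 0.98 Hz here instead of 15.625 Hz), but the width of the spectral peak, and so the ability to separate two nearby tones, is still set by the 64 captured samples.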