audio – WebRTC VAD in Java

I am developing a Java application that handles voice interactions, typically in an AI-driven context where the application takes phone calls, transcribes the caller’s voice, sends the transcription to an AI model, and then sends the AI’s response back as audio to the caller.

I’m facing a challenge in detecting whether the person is currently speaking or has stopped, especially in noisy environments. Additionally, the phone calls are typically in low-quality WAV format, which makes accurate detection more difficult.

I’ve tried a few approaches in pure Java, but none of them worked well, particularly with noise. WebRTC VAD, on the other hand, performed well and was accurate when I tested it in C. However, integrating it with Java via JNI seems inefficient because of the high frequency of native function calls required, which would likely introduce latency.
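To make the call-frequency concern concrete: WebRTC VAD classifies 16-bit mono PCM in frames of 10, 20, or 30 ms, so one native call per frame means roughly 33 to 100 JNI crossings per second for each call. The sketch below shows one possible way to reduce that, by collecting several frames in a direct `ByteBuffer` and classifying them in a single call. It is only a sketch under assumptions: the 8 kHz sample rate is assumed for telephone audio, `BatchVad` is a hypothetical interface marking where a JNI-backed implementation could sit, and `EnergyVad` is a plain amplitude-threshold stand-in so the example runs on its own. It is not WebRTC VAD and would not cope with noise.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VadFraming {
    // Assumed telephony format: 8 kHz, 16-bit signed mono PCM, little-endian.
    static final int SAMPLE_RATE_HZ = 8000;
    static final int FRAME_MS = 30;          // WebRTC VAD accepts 10, 20, or 30 ms frames
    static final int BYTES_PER_SAMPLE = 2;

    /** Size of one frame in bytes for the assumed format. */
    static int frameBytes() {
        return SAMPLE_RATE_HZ * FRAME_MS / 1000 * BYTES_PER_SAMPLE;
    }

    /** Whole frames per second, i.e. native calls per second if each frame is one call. */
    static int framesPerSecond() {
        return 1000 / FRAME_MS;
    }

    /**
     * Hypothetical seam: a JNI-backed implementation could classify the whole
     * batch in one native call instead of one call per frame.
     */
    interface BatchVad {
        /** Fills isSpeech[i] for each complete frame in pcm; returns the frame count. */
        int classify(ByteBuffer pcm, int frameBytes, boolean[] isSpeech);
    }

    /** Pure-Java stand-in using mean absolute amplitude. NOT WebRTC VAD. */
    static final class EnergyVad implements BatchVad {
        private final int threshold;

        EnergyVad(int threshold) {
            this.threshold = threshold;
        }

        @Override
        public int classify(ByteBuffer pcm, int frameBytes, boolean[] isSpeech) {
            // duplicate() keeps the caller's position intact but resets byte order.
            ByteBuffer view = pcm.duplicate().order(ByteOrder.LITTLE_ENDIAN);
            int frames = Math.min(view.remaining() / frameBytes, isSpeech.length);
            int samplesPerFrame = frameBytes / BYTES_PER_SAMPLE;
            for (int f = 0; f < frames; f++) {
                long sum = 0;
                for (int s = 0; s < samplesPerFrame; s++) {
                    sum += Math.abs(view.getShort());
                }
                isSpeech[f] = sum / samplesPerFrame > threshold;
            }
            return frames;
        }
    }

    /** Builds a test batch: the first half of the frames silent, the second half loud. */
    static ByteBuffer syntheticBatch(int frames) {
        // A direct buffer can be read by native code without an extra copy.
        ByteBuffer buf = ByteBuffer.allocateDirect(frames * frameBytes())
                                   .order(ByteOrder.LITTLE_ENDIAN);
        int samplesPerFrame = frameBytes() / BYTES_PER_SAMPLE;
        for (int f = 0; f < frames; f++) {
            short amplitude = (short) (f < frames / 2 ? 0 : 1000);
            for (int s = 0; s < samplesPerFrame; s++) {
                buf.putShort((short) (s % 2 == 0 ? amplitude : -amplitude));
            }
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        int batchFrames = 10;                 // 300 ms of audio per classify() call
        boolean[] isSpeech = new boolean[batchFrames];
        int n = new EnergyVad(500).classify(syntheticBatch(batchFrames), frameBytes(), isSpeech);
        for (int i = 0; i < n; i++) {
            System.out.println("frame " + i + ": " + (isSpeech[i] ? "speech" : "silence"));
        }
    }
}
```

The trade-off in this design is that batching adds buffering delay: with 10 frames of 30 ms, a decision can arrive up to 300 ms after the audio, so the batch size would need to be weighed against how quickly end-of-speech has to be detected.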

My question is: What is an efficient way to use WebRTC VAD in Java?