c++ – Efficiency of using DirectX GPGPU for processing 1.2GB of image data per frame in a ring buffer
Problem Background: I am developing a high-throughput image inspection system. Currently, I manage 600MB image buffers in a ring buffer structure. Specifically:
- I have 3 slots in the ring buffer.
- Each slot contains two different types of images (600MB x 2 = 1.2GB per processing cycle).
- The total memory footprint of the buffer is 3.6GB.
- The system needs to cycle through these buffers continuously and perform various image processing algorithms (a rough sketch of this layout is shown below).
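For reference, this is a minimal sketch of the layout I am describing. The struct and field names (and the std::vector backing) are placeholders for illustration, not the actual implementation:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Sizes as described above: 600MB per image, two images per slot, 3 slots.
constexpr size_t kImageBytes = 600ull * 1024 * 1024;
constexpr size_t kSlotCount  = 3;

struct Slot {
    std::vector<uint8_t> imageA = std::vector<uint8_t>(kImageBytes); // first image type
    std::vector<uint8_t> imageB = std::vector<uint8_t>(kImageBytes); // second image type
};

struct RingBuffer {
    std::array<Slot, kSlotCount> slots; // 3 x 1.2GB = 3.6GB total
    size_t writeIndex = 0;              // slot currently being filled by acquisition
    size_t readIndex  = 0;              // slot currently being processed
};
```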
Current Issue: The image processing is currently handled by the CPU. When processing 1.2GB of data per cycle, CPU load spikes significantly, causing a drop in overall system performance and missing the required takt time.
Proposed Solution: I am considering migrating the entire image processing logic to the GPU using DirectX 11 Compute Shaders (GPGPU). My plan is:
- Allocate a 3.6GB buffer (as a StructuredBuffer or ByteAddressBuffer) directly in GPU-accessible memory.
- Use zero-copy (via an integrated GPU) or dynamic buffer mapping to minimize transfer overhead.
- Perform all projection and pixel-wise calculations on the GPU (a sketch of the setup I have in mind follows below).
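To make the plan concrete, this is roughly the setup I am considering for a single 600MB image (simplified, no error handling; the per-image split, variable names, and thread-group sizing are assumptions rather than a finished design):

```cpp
#include <d3d11.h>
#include <cstdint>

// Create a default-usage structured buffer large enough for one 600MB image,
// plus a UAV so a compute shader can read and write it.
void CreateImageBuffer(ID3D11Device* device,
                       ID3D11Buffer** outBuffer,
                       ID3D11UnorderedAccessView** outUav)
{
    const UINT imageBytes   = 600u * 1024u * 1024u;      // one image
    const UINT elementCount = imageBytes / sizeof(uint32_t);

    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth           = imageBytes;
    desc.Usage               = D3D11_USAGE_DEFAULT;       // GPU-accessible memory
    desc.BindFlags           = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(uint32_t);          // 4-byte elements
    device->CreateBuffer(&desc, nullptr, outBuffer);

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format              = DXGI_FORMAT_UNKNOWN;    // required for structured buffers
    uavDesc.ViewDimension       = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.FirstElement = 0;
    uavDesc.Buffer.NumElements  = elementCount;
    device->CreateUnorderedAccessView(*outBuffer, &uavDesc, outUav);
}

// Bind the compute shader and dispatch over the whole buffer. Each group has
// 256 threads and each thread is assumed to loop over 16 elements, which keeps
// the X group count below the 65535-per-dimension dispatch limit.
void RunComputePass(ID3D11DeviceContext* context,
                    ID3D11ComputeShader* shader,
                    ID3D11UnorderedAccessView* uav,
                    UINT elementCount)
{
    const UINT kThreadsPerGroup   = 256;
    const UINT kElementsPerThread = 16;
    const UINT elementsPerGroup   = kThreadsPerGroup * kElementsPerThread; // 4096

    context->CSSetShader(shader, nullptr, 0);
    context->CSSetUnorderedAccessViews(0, 1, &uav, nullptr);
    context->Dispatch((elementCount + elementsPerGroup - 1) / elementsPerGroup, 1, 1);
}
```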
Questions:
- Is migrating to DirectX/GPGPU a recommended approach for handling this scale of data (1.2GB/frame, 3.6GB total)?
- Given the large data size, I am concerned about PCIe transfer overhead (if using a discrete GPU) vs. memory bandwidth contention (if using an integrated GPU). Which architecture would be more efficient for this specific use case?
- In DirectX 11, what is the most efficient way to manage a 3.6GB ring buffer to avoid GPU stalls during Map/Unmap or Dispatch calls? (The naive per-frame loop I would otherwise write is sketched after this list.)
- Are there any specific pitfalls when the buffer size approaches the VRAM limit (e.g., on a 4GB GPU)?
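For context on the third question, this is the straightforward per-frame update I would otherwise write: map a dynamic upload buffer, copy one slot's image in, copy it to the default-usage buffer, and dispatch. All names are placeholders, and this is exactly the pattern I suspect will stall on buffers this large:

```cpp
#include <d3d11.h>
#include <cstring>

// Naive per-frame update for one 600MB image: the CPU writes into a dynamic
// buffer, which is then copied into the default-usage buffer the compute
// shader operates on. Mapping with WRITE_DISCARD every frame at this size is
// the part I expect to cause stalls or heavy memory traffic.
void UploadAndDispatch(ID3D11DeviceContext* context,
                       ID3D11Buffer* dynamicUpload,   // D3D11_USAGE_DYNAMIC, CPU-write
                       ID3D11Buffer* gpuImageBuffer,  // D3D11_USAGE_DEFAULT, UAV-bound
                       const void* cpuImage,          // one 600MB image from the ring slot
                       size_t imageBytes,
                       ID3D11ComputeShader* shader,
                       ID3D11UnorderedAccessView* uav,
                       UINT groupCountX)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(context->Map(dynamicUpload, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, cpuImage, imageBytes); // 600MB copy per image
        context->Unmap(dynamicUpload, 0);
    }

    // Move the data into the buffer the compute shader will read.
    context->CopyResource(gpuImageBuffer, dynamicUpload);

    context->CSSetShader(shader, nullptr, 0);
    context->CSSetUnorderedAccessViews(0, 1, &uav, nullptr);
    context->Dispatch(groupCountX, 1, 1);
}
```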
Environment:
