I wrote a compute shader that implements a convolution algorithm and ran it on an Intel Apollo Lake GPU through the Vulkan API. When the convolution workload is heavy, a GPU hang occurs.
==== Test environments:
Vulkan SDK: 18.104.22.168
CPU: Intel Celeron J3455
GPU: HD Graphics 500 (Apollo Lake, 12 EU)
==== Steps to reproduce:
git clone https://github.com/wzw-intel/vulkan_minimal_compute.git
==== What the test program does:
This program runs a convolution shader 10 times serially. Each run is synchronized with a dedicated VkFence object. A GPU hang may occur at any iteration, in which case the following log message is printed: "INTEL-MESA: error: vulkan/anv_device.c:2091: GPU hang on one of our command buffers (VK_ERROR_DEVICE_LOST)"
Not every run of the program triggers the GPU hang. If it does not hang, run it again.
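Since the hang is intermittent, the "try more" step can be automated. The following is only a minimal sketch, not part of the test repository: it assumes the built test binary is passed on the command line (its name and path are not stated in this report) and that the Mesa log line quoted above ends up on stdout or stderr. The function name `run_until_hang` and the attempt limit are hypothetical.

```python
import subprocess
import sys

# Substring of the Mesa log line quoted in this report.
HANG_MARKER = "VK_ERROR_DEVICE_LOST"


def run_until_hang(cmd, max_attempts=20, env=None):
    """Run cmd repeatedly.

    Returns the 1-based attempt on which the hang marker appeared in the
    combined stdout/stderr output, or None if it never appeared.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the Mesa message may go to stderr
            universal_newlines=True,
            env=env,
        )
        if HANG_MARKER in result.stdout:
            return attempt
    return None


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: repro.py <path-to-test-binary> [args...]")
    hit = run_until_hang(sys.argv[1:])
    print("hang on attempt %d" % hit if hit else "no hang observed")
```

Passing a modified environment through `env` (for example with LIGHT_WORKLOAD=1 set) would allow comparing the heavy and light configurations with the same loop.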
==== Other findings:
- Setting the "LIGHT_WORKLOAD=1" environment variable (which reduces the total GFLOPS by 50%) makes the GPU hang disappear. It seems that the GPU hang only occurs under a heavy workload.
- No GPU hang on a high-end Intel GPU.
I tested this program on an i7-6770HQ (GPU: Iris Pro Graphics 580, GT4e, 72 EU) and saw no GPU hang. But on the Intel Celeron J3455 (GPU: HD Graphics 500) and on an Intel SoC with an HD Graphics 530 GPU, the GPU hang occurs.
Created attachment 142731
A probable guess is that your shader is taking too long to complete, so the i915 driver declares that your workload has hung the GPU even though it is still in progress.
You can recompile your kernel with an adjusted value for DRM_I915_HANGCHECK_PERIOD, or disable the hangcheck by passing the i915.enable_hangcheck=0 parameter on the kernel command line.
If that solves your problem, I'll reassign the issue to i915.
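To confirm whether the hangcheck setting mentioned above actually took effect after a reboot, the module parameter can be read back from sysfs. This is a rough sketch under assumptions: it presumes the parameter is exposed at /sys/module/i915/parameters/enable_hangcheck (which depends on the kernel version and may require root to read) and that boolean parameters read back as "Y" or "N". The helper name `hangcheck_enabled` is hypothetical.

```python
from pathlib import Path

# Assumed sysfs location of the i915 module parameter; it may be absent
# on some kernels or readable only by root.
HANGCHECK_PARAM = Path("/sys/module/i915/parameters/enable_hangcheck")


def hangcheck_enabled(param_path=HANGCHECK_PARAM):
    """Return True or False for the parameter value, or None if it cannot
    be read or has an unexpected value."""
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        return None
    # Kernel bool parameters usually read back as "Y"/"N"; accept "1"/"0" too.
    if value in ("Y", "y", "1"):
        return True
    if value in ("N", "n", "0"):
        return False
    return None


if __name__ == "__main__":
    labels = {
        True: "hangcheck enabled",
        False: "hangcheck disabled",
        None: "hangcheck state unknown",
    }
    print(labels[hangcheck_enabled()])
```

A result of None only means the state could not be determined this way; it does not indicate whether the hangcheck is active.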
Problem solved. I tried the "i915.enable_hangcheck=0" kernel option, and the GPU hang no longer occurs. Thank you.
In that case, I'm closing this bug. The compute shader is just taking too long to run and triggering the kernel watchdog timer.
NOTOURBUG -> NOTABUG, as the described behavior is expected: the kernel treats loads that exceed the configured threshold as hung and aborts them by resetting the GPU, so that other processes using the GPU, such as the UI, can run.