Bug 108947

Summary: GPU hang when running heavy compute workload
Product: Mesa Reporter: Wu Zhiwen <zhiwen.wu>
Component: Drivers/Vulkan/intel Assignee: Intel 3D Bugs Mailing List <intel-3d-bugs>
Status: RESOLVED NOTOURBUG QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: critical    
Priority: medium CC: jason
Version: 18.3   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: vulkaninfo.log

Description Wu Zhiwen 2018-12-05 07:24:33 UTC
I wrote a compute shader that implements a convolution and ran it on an Intel Apollo Lake GPU through the Vulkan API. When the convolution is a heavy workload, a GPU hang occurs.

==== Test environments:
    Ubuntu 16.04
    Mesa 18.3
    Vulkan SDK: 1.1.85.0
    CPU: Intel Celeron J3455
    GPU: HD Graphics 500 (Apollo Lake, 12 EU)

==== Steps to reproduce:
    git clone https://github.com/wzw-intel/vulkan_minimal_compute.git
    cd vulkan_minimal_compute
    mkdir build
    cd build
    cmake ..
    make
    cd ../
    ./build/vulkan_minimal_compute

==== What the test program does:
    This program runs a convolution shader 10 times serially. Each run is synchronized with a dedicated VkFence object. A GPU hang may occur at any iteration, in which case the program prints the log message "INTEL-MESA: error: vulkan/anv_device.c:2091: GPU hang on one of our command buffers (VK_ERROR_DEVICE_LOST)"
    Not every run of the program triggers the GPU hang. If it does not hang, run it again.
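
Because the hang is intermittent, rerunning the program by hand can be tedious. The following is a minimal sketch of a retry loop, not part of the original test program. It assumes that a hang can be recognized by the string VK_ERROR_DEVICE_LOST appearing in the program's output, as in the log message quoted above. The function name repeat_until_hang is hypothetical.

```shell
#!/bin/sh
# Rerun a command until its output mentions VK_ERROR_DEVICE_LOST,
# or until MAX_RUNS attempts have been made.
# Usage: repeat_until_hang MAX_RUNS COMMAND [ARGS...]
# Assumption: the hang is detectable from the program's printed output.
repeat_until_hang() {
    max_runs="$1"
    shift
    run=1
    while [ "$run" -le "$max_runs" ]; do
        # Capture stdout and stderr together, since the driver message
        # may be written to either stream.
        output="$("$@" 2>&1)"
        case "$output" in
            *VK_ERROR_DEVICE_LOST*)
                echo "hang reported on run $run"
                return 0
                ;;
        esac
        run=$((run + 1))
    done
    echo "no hang reported in $max_runs runs"
    return 1
}
```

For example, from the repository root, `repeat_until_hang 20 ./build/vulkan_minimal_compute` would try up to 20 runs and stop at the first one that reports a lost device.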

==== Other findings:
    - Setting the "LIGHT_WORKLOAD=1" environment variable (which reduces the total GFLOPS by 50%) makes the GPU hang disappear. The hang seems to occur only under heavy workloads.
    
    - No GPU hang on a high-end Intel GPU.
      I tested this program on an i7-6770HQ (GPU: Iris Pro Graphics 580, GT4e, 72 EU) and saw no GPU hang. But on the Intel Celeron J3455 (GPU: HD Graphics 500) and on an Intel SoC with an HD Graphics 530 GPU, the GPU hang occurs.
Comment 1 Wu Zhiwen 2018-12-05 07:29:20 UTC
Created attachment 142731 [details]
vulkaninfo.log
Comment 2 Lionel Landwerlin 2018-12-05 12:38:19 UTC
A likely explanation is that your shader is taking too long to complete, so the i915 driver declares that your workload has hung the GPU even though it's still in progress.
You can recompile your kernel with an adjusted value for DRM_I915_HANGCHECK_PERIOD or disable the hangcheck by giving the i915.enable_hangcheck=0 parameter on the kernel command line.

If that solves your problem, I'll reassign the issue to i915.
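
After rebooting with the parameter, it can help to confirm that it actually took effect. The following is a rough sketch, assuming the i915 module exposes the setting at the usual sysfs location for module parameters (/sys/module/i915/parameters/enable_hangcheck). Whether that file exists, and whether it is readable without root, depends on the kernel version. The function name hangcheck_state is hypothetical.

```shell
#!/bin/sh
# Report the hangcheck state read from a module-parameter file.
# Assumption: the file holds a boolean printed as Y/N (or 1/0).
# The default path is the conventional sysfs location; it may be
# absent or root-only on some kernels, in which case "unknown" is printed.
hangcheck_state() {
    param_file="${1:-/sys/module/i915/parameters/enable_hangcheck}"
    if [ ! -r "$param_file" ]; then
        echo "unknown"
        return 1
    fi
    case "$(cat "$param_file")" in
        Y|y|1) echo "enabled" ;;
        N|n|0) echo "disabled" ;;
        *)     echo "unknown" ;;
    esac
}

# Print the current state; do not fail if it cannot be determined.
hangcheck_state || true
```

If this prints "disabled" after booting with i915.enable_hangcheck=0, the watchdog is off and long-running shaders should no longer be reported as hangs. Note that a shader that really is stuck will then not be detected either.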
Comment 3 Wu Zhiwen 2018-12-06 02:37:21 UTC
Problem solved. I tried the "i915.enable_hangcheck=0" kernel option, and the GPU hang no longer occurs. Thank you.
Comment 4 Jason Ekstrand 2018-12-06 03:42:12 UTC
In that case, I'm closing this bug.  The compute shader is just taking too long to run and triggering the kernel watchdog timer.
