This case runs gfxbench manhattan in a QEMU VM with virtio-gpu enabled, where QEMU uses "-display sdl,gl=es". It causes a host i915 GPU hang, although the GPU recovers. I captured an apitrace that reproduces the hang without needing to set up a QEMU guest: https://drive.google.com/open?id=1T6A2d--VnVZBMvM0CvjSa6ecWd50OCmi Any ideas or help on how to debug this would be appreciated. By the way, I'm not sure whether this is related to bug 108898.
Could you attach the error state generated after the hang? Also some details about the version of kernel & Mesa you're running on both host & guest would be helpful information.
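For anyone following along, here is a minimal sketch of one way to save the error state for attaching. It assumes the i915 error state is exposed at /sys/class/drm/card0/error and that the GPU is card0; both the card index and the helper name `save_error_state` are assumptions for illustration, and reading the node usually needs root.

```python
from pathlib import Path

# Assumption: the Intel GPU is card0; adjust the index on multi-GPU hosts.
DEFAULT_ERROR_NODE = Path("/sys/class/drm/card0/error")


def save_error_state(dest, node=DEFAULT_ERROR_NODE):
    """Copy the error state from `node` to `dest`.

    Returns the number of bytes written so the caller can spot an
    empty dump (no hang recorded yet).
    """
    data = Path(node).read_bytes()
    Path(dest).write_bytes(data)
    return len(data)

# Example (run as root right after the hang, before the state is cleared):
#   save_error_state("i915_error_state.txt")
```

The saved file can then be attached to the bug as-is.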
Created attachment 144807: error dump
This can still be reproduced on the latest Mesa tip, 9c611fb38119d308c73dc777a1d7d1336b22fab5. The host kernel is drm-tip from several days ago:

commit 43aa8c3633274d7cf0a6dca4b8734d84d9928cf9 (HEAD -> drm-tip-0711, drm-tip/drm-tip)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 11 07:50:17 2019 +0100

    drm-tip: 2019y-07m-11d-06h-49m-27s UTC integration manifest

Replaying the trace triggers the hang.
Thanks a lot, reproduced locally.
Do other gfxbench workloads (egypt, car chase, trex, aztec) generate a GPU hang on the same platform?
Currently we can only reproduce the hang with the manhattan case.
Any findings?
The first hang seems to happen in call 233679, glDispatchCompute(...)
Then it may potentially be a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=110228
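To look at the state around that call, a small helper like the following can build an `apitrace dump` command limited to a call range. This is only a sketch: the file name `manhattan.trace` and the helper name `dump_calls_cmd` are hypothetical, and it assumes apitrace is installed and on PATH.

```python
import subprocess


def dump_calls_cmd(trace_path, first, last=None):
    """Build an `apitrace dump` command for one call or a call range."""
    callset = str(first) if last is None else f"{first}-{last}"
    return ["apitrace", "dump", f"--calls={callset}", trace_path]


def dump_calls(trace_path, first, last=None):
    """Run the command and return the dumped calls as text."""
    result = subprocess.run(
        dump_calls_cmd(trace_path, first, last),
        check=True, capture_output=True, text=True,
    )
    return result.stdout

# Example: show the suspect dispatch and the 20 calls leading up to it.
#   print(dump_calls("manhattan.trace", 233659, 233679))
```

Dumping a short window before the dispatch shows which buffers and uniforms were bound for it.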
I'm not sure, as that one seems to be a Vulkan issue. For this one, if you download my trace file, it actually hangs at call 233679, which is a compute shader program dispatch. Although this program has been dispatched several times before, perhaps some uniform or shader storage buffer input causes trouble in this shader; as you can see, it contains a loop. Another thing worth checking is whether any SSBO write overflows. Is there any way to check that from the i965 backend?
(In reply to Wang Zhenyu from comment #10)
> I'm not sure, as that one seems to be vulkan issue.
>
> For this one, if you download my trace file, it actually hangs at call
> 233679, which is a compute shader program dispatch. Although this program
> has been dispatched for several times, maybe somehow uniform or shader
> storage buffer input has caused trouble in this shader, as you can see that
> it has loop internal.
>
> And another thing worth check is if there's any overflow of ssbo write. Is
> there anyway to check that from i965 backend?

You may try the 'intel_sanitize_gpu' tool to detect out-of-bounds GPU writes; the tool can be built as part of Mesa.
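As a conceptual illustration of out-of-bounds write detection, the sketch below pads a buffer with a known canary pattern and checks afterwards whether the padding was overwritten. It simulates the idea on a plain Python bytearray; the pattern, the padding size, and the function names are all hypothetical, and this is not a description of how intel_sanitize_gpu is actually implemented.

```python
# Hypothetical canary: 64 bytes of 0xA5 placed right after the payload.
CANARY = bytes([0xA5]) * 64


def alloc_with_canary(size):
    """Return a buffer of `size` payload bytes followed by the canary."""
    return bytearray(size) + bytearray(CANARY)


def canary_intact(buf, size):
    """True if nothing wrote past the first `size` bytes of `buf`."""
    return bytes(buf[size:size + len(CANARY)]) == CANARY


# Simulated workload: one in-bounds write, then one overflowing write.
ssbo = alloc_with_canary(16)
ssbo[15] = 0xFF                 # last valid byte, canary untouched
ok_before = canary_intact(ssbo, 16)
ssbo[16] = 0x00                 # one byte past the end, corrupts the canary
ok_after = canary_intact(ssbo, 16)
```

A check like this only catches writes that land in the padding, so a write far past the end of the buffer would go unnoticed.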
I worked with Scott (the author of intel_sanitize_gpu) yesterday. It doesn't catch the hang, even after we enabled it by disabling softpin in Mesa. I shared some updates in the original ChromeOS bug for reference: https://bugs.chromium.org/p/chromium/issues/detail?id=959370
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1820.