111153 – run gfxbench manhattan in virtio-gpu guest causes host i915 recoverable gpu hang

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

Reported:	2019-07-17 05:17 UTC by Wang Zhenyu
Modified:	2019-09-25 20:33 UTC
CC List:	0 users

Attachments
error dump (70.63 KB, text/plain) 2019-07-17 06:18 UTC, Wang Zhenyu

Description Wang Zhenyu 2019-07-17 05:17:00 UTC

Running gfxbench manhattan in a QEMU VM with virtio-gpu enabled, where QEMU is started with "-display sdl,gl=es", causes a host i915 GPU hang. The hang is recoverable.

I captured an apitrace that reproduces this hang without needing to set up a QEMU guest: https://drive.google.com/open?id=1T6A2d--VnVZBMvM0CvjSa6ecWd50OCmi

Any ideas or help on how to debug this would be appreciated.

btw, I'm not sure if this is related to bug 108898.

Comment 1 Lionel Landwerlin 2019-07-17 05:41:23 UTC

Could you attach the error state generated after the hang?
Also, details about the kernel and Mesa versions you're running on both host and guest would be helpful.
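
For reference, on i915 the error state is typically read from sysfs after a hang. The following is a minimal Python sketch for saving it to a file. The path `/sys/class/drm/card0/error`, the card index, the "No error state collected" placeholder text, and the helper name `save_error_state` are assumptions for illustration; reading the file usually requires root.

```python
from pathlib import Path
from typing import Optional

# Assumed default location of the i915 error state; the card index may differ.
DEFAULT_ERROR_PATH = Path("/sys/class/drm/card0/error")


def save_error_state(dest: Path, src: Path = DEFAULT_ERROR_PATH) -> Optional[int]:
    """Copy the error state from src to dest.

    Returns the number of bytes saved, or None if nothing was captured.
    """
    if not src.exists():
        return None
    data = src.read_bytes()
    stripped = data.strip()
    # Assumption: the kernel reports this placeholder when no hang was recorded.
    if not stripped or stripped.startswith(b"No error state collected"):
        return None
    dest.write_bytes(data)
    return len(data)
```

Calling `save_error_state(Path("i915_error_state.txt"))` right after the hang would produce a file suitable for attaching to the bug, provided the assumed path matches the system.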

Comment 2 Wang Zhenyu 2019-07-17 06:18:22 UTC

Created attachment 144807
error dump

Comment 3 Wang Zhenyu 2019-07-17 06:20:55 UTC

This can still be reproduced on the latest Mesa tip, 9c611fb38119d308c73dc777a1d7d1336b22fab5. The host kernel is drm-tip from several days ago:
commit 43aa8c3633274d7cf0a6dca4b8734d84d9928cf9 (HEAD -> drm-tip-0711, drm-tip/drm-tip)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 11 07:50:17 2019 +0100

    drm-tip: 2019y-07m-11d-06h-49m-27s UTC integration manifest

Replaying the trace triggers the hang.

Comment 4 Lionel Landwerlin 2019-07-17 08:33:56 UTC

Thanks a lot, reproduced locally.

Comment 5 Mark Janes 2019-07-17 15:22:38 UTC

Do other gfxbench workloads (egypt, car chase, trex, aztec) cause a GPU hang on the same platform?

Comment 6 Wang Zhenyu 2019-07-22 07:14:55 UTC

Currently we can only reproduce the hang with the manhattan case.

Comment 7 Wang Zhenyu 2019-08-06 02:19:37 UTC

Any findings?

Comment 8 Wang Zhenyu 2019-08-06 06:31:06 UTC

The first hang seems to happen in call 233679, glDispatchCompute(...).
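
To narrow the reproduction to the calls up to the suspected dispatch, the trace could be trimmed and replayed. The sketch below only builds the command lines in Python. The file names and helper names are hypothetical, and the `--calls` and `-o` options should be checked against `apitrace trim --help` for the installed apitrace version.

```python
import subprocess
from typing import List


def trim_command(trace: str, last_call: int, out: str) -> List[str]:
    """Build an `apitrace trim` command that keeps calls 0..last_call."""
    if last_call < 0:
        raise ValueError("last_call must be non-negative")
    return ["apitrace", "trim", f"--calls=0-{last_call}", "-o", out, trace]


def replay_command(trace: str) -> List[str]:
    """Build an `apitrace replay` command for the given trace file."""
    return ["apitrace", "replay", trace]


def run(cmd: List[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(cmd, check=False).returncode
```

For example, `run(trim_command("manhattan.trace", 233679, "short.trace"))` followed by `run(replay_command("short.trace"))` would replay only the calls up to and including call 233679, assuming apitrace is installed and the file names match.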

Comment 9 Sergii Romantsov 2019-08-07 10:09:32 UTC

Then it may potentially be a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=110228

Comment 10 Wang Zhenyu 2019-08-07 10:22:37 UTC

I'm not sure, as that one seems to be a Vulkan issue.

For this one, if you download my trace file, it actually hangs at call 233679, which is a compute shader program dispatch. Although this program has already been dispatched several times, maybe some uniform or shader storage buffer input has caused trouble in this shader, as you can see that it contains a loop.

Another thing worth checking is whether any SSBO write overflows. Is there any way to check that from the i965 backend?

Comment 11 Tapani Pälli 2019-08-07 10:49:04 UTC

(In reply to Wang Zhenyu from comment #10)
> I'm not sure, as that one seems to be vulkan issue.
> 
> For this one, if you download my trace file, it actually hangs at call
> 233679, which is a compute shader program dispatch. Although this program
> has been dispatched for several times, maybe somehow uniform or shader
> storage buffer input has caused trouble in this shader, as you can see that
> it has loop internal. 
> 
> And another thing worth check is if there's any overflow of ssbo write. Is
> there anyway to check that from i965 backend?

You could try the 'intel_sanitize_gpu' tool to detect out-of-bounds GPU writes; the tool can be built as part of Mesa.
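
The general idea of catching out-of-bounds writes with canary padding can be sketched as follows. This is a CPU-side illustration of the technique in Python only, not a description of how intel_sanitize_gpu is implemented; the fill byte, padding size, and all function names are hypothetical.

```python
CANARY = 0xA5  # arbitrary fill byte chosen for this sketch
PAD = 64       # number of guard bytes placed after the usable region


def make_padded(size: int) -> bytearray:
    """Allocate `size` usable zero bytes followed by PAD canary bytes."""
    return bytearray(size) + bytearray([CANARY] * PAD)


def simulated_write(buf: bytearray, offset: int, data: bytes) -> None:
    """Write data byte by byte, standing in for a shader's buffer store."""
    for i, value in enumerate(data):
        buf[offset + i] = value


def overflowed(buf: bytearray, size: int) -> bool:
    """Return True if any byte in the padding region was modified."""
    return any(b != CANARY for b in buf[size:size + PAD])


buf = make_padded(16)
simulated_write(buf, 0, b"\x01" * 16)   # stays inside the 16 usable bytes
in_bounds_result = overflowed(buf, 16)
simulated_write(buf, 12, b"\x01" * 8)   # spills 4 bytes into the padding
out_of_bounds_result = overflowed(buf, 16)
```

A check of this kind cannot see writes that land beyond the padding or that happen to store the canary value itself, so a clean result does not prove the absence of an overflow.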

Comment 12 Jianxun Zhang 2019-09-10 21:40:08 UTC

I worked with Scott (author of intel_sanitize_gpu) yesterday. The tool doesn't catch the hang, even after we enabled it by disabling softpin in Mesa.

I shared some updates in the original ChromeOS bug for reference:

https://bugs.chromium.org/p/chromium/issues/detail?id=959370

Comment 13 GitLab Migration User 2019-09-25 20:33:53 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1820.
