Bug 108461

Summary:	[kbl] GPU hang on Mad Max vulkan
Product:	Mesa	Reporter:	Vova <vova7890>
Component:	Drivers/Vulkan/intel	Assignee:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Status:	RESOLVED FIXED	QA Contact:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity:	normal
Priority:	medium	CC:	intel-gfx-bugs, jason
Version:	unspecified
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	gpu error dump

Description Vova 2018-10-16 19:36:19 UTC

Created attachment 142050 [details]
gpu error dump

I was just running Mad Max on the integrated GPU and got a segfault. dmesg says to file a bug here, so here we are :)

[ 2125.632326] [drm] GPU HANG: ecode 9:0:0x84d77efc, reason: no progress on rcs0, action: reset
[ 2125.632329] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2125.632330] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2125.632330] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2125.632330] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2125.632331] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 2125.632354] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0
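
When comparing hangs across reports it can help to pull the fields out of the "GPU HANG" line mechanically. The following is only a minimal sketch in Python, based on the line format seen in this report; the helper name `parse_gpu_hang` is hypothetical, and the optional "in <process> [<pid>]" part is an assumption drawn from the dmesg output quoted in Comment 2. It does not interpret the ecode fields, it only extracts them.

```python
import re

# Pattern for the i915 "GPU HANG" dmesg line as it appears in this report.
# The "in <process> [<pid>]" part is optional: it is absent in the line
# above but present in the dmesg output quoted in Comment 2.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<ecode>[0-9a-fx:]+)"
    r"(?:, in (?P<process>.+?) \[(?P<pid>\d+)\])?"
    r", reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict of fields from a GPU HANG line, or None if it does not match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["time"] = float(info["time"])  # seconds since boot, as printed by dmesg
    if info["pid"] is not None:
        info["pid"] = int(info["pid"])
    return info


if __name__ == "__main__":
    sample = ("[ 2125.632326] [drm] GPU HANG: ecode 9:0:0x84d77efc, "
              "reason: no progress on rcs0, action: reset")
    print(parse_gpu_hang(sample))
```

The parsed fields are only a convenience for triage; the GPU crash dump itself is still what the kernel message asks for.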

Comment 1 Denis 2018-10-18 10:15:56 UTC

Hi Vova, was it a one-time segfault, or can you easily reproduce it?

Comment 2 Denis 2018-10-18 16:08:19 UTC

I'm not sure I got the same crash, but I got something. All outputs are below:

>game log:
MadMax: dumped to "/home/den/.local/share/feral-interactive/Mad Max/crashes/3b8053c2-376a-1c03-55c46659-1ad90210.dmp"
MadMax: crash reporter "/home/den/.steam/steam/steamapps/common/Mad Max/bin/feral_linux_crash_reporter" launching
Game crashed with signal 6
Vulkan call failed: -4

If possible, launch Steam from command line to check the output when the game is run.
Then, contact support@feralinteractive.com with the details of the output, your Steam System Info, as well as the dump file:
/home/den/.local/share/feral-interactive/Mad Max/crashes/3b8053c2-376a-1c03-55c46659-1ad90210.dmp


>dmesg output:
[ 6334.854969] [drm] GPU HANG: ecode 9:0:0x85d7fcfb, in WinMain [6644], reason: No progress on rcs0, action: reset
[ 6334.855042] i915 0000:00:02.0: Resetting rcs0 after gpu hang


I will continue investigating (by the way, it looks like there is no crash when using OpenGL).

Comment 3 Denis 2018-10-19 11:06:31 UTC

OK, here is a new piece of information:

I checked 3 Mesa versions:
>mesa-vulkan-drivers from the repository (18.0.5-0ubuntu0~16.04.1)
works fine

>Mesa built from git as of 21.09
hangs occur

>latest Mesa built from git (as of 19.10)
works fine again

So, to summarize: with the latest Mesa from git the game should work fine. Vova, could you please clarify which Mesa version you have?

On my side, to double-check, I will wait for the next Mesa release and test it again.

Comment 4 Vova 2018-10-19 18:28:48 UTC

It reproduces reliably; all Vulkan games hang the GPU. I think it is more of a kernel-space problem, because userspace should not be able to hang the GPU.

Comment 5 Vova 2018-10-19 18:30:33 UTC

Mesa version: 18.2.2-1

Comment 6 Vova 2018-10-21 20:32:06 UTC

I can confirm that Mesa from git does not crash the GPU.

Comment 7 Jason Ekstrand 2018-10-22 00:58:13 UTC

Best guess, it was fixed by this: https://gitlab.freedesktop.org/mesa/mesa/commit/0fa9e6d7b304f6a8064ed78a4b9c557e1026e7e5

Comment 8 Denis 2018-10-22 17:00:22 UTC

I tried Mesa without that commit, and it still worked fine.
Are we interested in bisecting it? It is not straightforward, I think, because of the different branches:

The merge base 8d3ccdbb9ba480dfe435023b747714cd517e5028 is bad.
This means the bug has been fixed between 8d3ccdbb9ba480dfe435023b747714cd517e5028 and [2bb05d70afe82fdc5e6d1d7c7bcbd8dc28df4b82].
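
The message above indicates that the search here is for the commit that fixed the hang rather than the one that introduced it. As a rough illustration of that idea only, here is a minimal Python sketch of a binary search for the first fixed commit. It assumes a linear history (the real Mesa history involves several branches, as noted above), and the function name `first_fixed_commit` and the `is_fixed` callback are hypothetical. With git itself, `git bisect start --term-old=broken --term-new=fixed` lets the same search be run without inverting the meaning of good and bad.

```python
def first_fixed_commit(commits, is_fixed):
    """Return the first commit (oldest first) for which is_fixed(commit) is true.

    Assumptions of this sketch: the history is linear, commits[0] still
    hangs, commits[-1] is fixed, and the fix is never reverted in between.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] hangs, commits[hi] is fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # the fix landed at mid or earlier
        else:
            lo = mid  # the fix landed after mid
    return commits[hi]


if __name__ == "__main__":
    history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
    # Pretend the hang disappears starting with "c5".
    print(first_fixed_commit(history, lambda c: history.index(c) >= 5))
```

Each `is_fixed` call corresponds to building that revision and running the game, so the number of builds grows only logarithmically with the number of commits in the range.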

Comment 9 Jason Ekstrand 2018-10-22 17:09:28 UTC

If it's easy to reproduce, that'd be good.  I don't like magically fixed bugs. :(

Comment 10 Denis 2018-10-23 09:12:38 UTC

Here it is, I think: the commit that provided the fix for the game:



f5bab06428fc7ca6116cf0daf1c237eb86202e7a is the first bad commit
commit f5bab06428fc7ca6116cf0daf1c237eb86202e7a
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Tue Oct 2 17:19:32 2018 -0500

    anv/batch_chain: Don't start a new BO just for BATCH_BUFFER_START
    
    Previously, we just went ahead and emitted MI_BATCH_BUFFER_START as
    normal.  If we are near enough to the end, this can cause us to start a
    new BO just for the MI_BATCH_BUFFER_START which messes up chaining.  We
    always reserve enough space at the end for an MI_BATCH_BUFFER_START so
    we can just increment cmd_buffer->batch.end prior to emitting the
    command.
    
    Fixes: a0b133286a3 "anv/batch_chain: Simplify secondary batch return..."
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107926
    Tested-by: Alex Smith <asmith@feralinteractive.com>
    Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>

:040000 040000 37d291419a86e6fca5d872b7b53974d72167c57b 5dba5eacf4dcb36ccc35224107fa5d0a2806a937 M	src
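
To make the commit message above easier to follow, here is a toy Python model of the idea it describes: space for the chaining command is always reserved at the tail of a buffer, so the final MI_BATCH_BUFFER_START can be emitted by first moving the usable end forward instead of going through the normal path that may allocate a new BO. This is not the anv driver code; the class `ToyBatch`, its methods, and the sizes are hypothetical and chosen only for illustration.

```python
BBS_SIZE = 3  # hypothetical size of the chaining command, in arbitrary units


class ToyBatch:
    """Toy model of a batch made of fixed-size buffers (BOs)."""

    def __init__(self, bo_size):
        self.bo_size = bo_size
        self.bos = [[]]                  # each BO is a list of command names
        self.used = 0                    # space used in the current BO
        self.end = bo_size - BBS_SIZE    # usable end; the tail stays reserved

    def _grow(self):
        # Chain to a fresh BO; the chaining command lives in the reserved tail.
        self.bos[-1].append("BBS(next BO)")
        self.bos.append([])
        self.used = 0
        self.end = self.bo_size - BBS_SIZE

    def emit(self, name, size):
        if self.used + size > self.end:
            self._grow()
        self.bos[-1].append(name)
        self.used += size

    def emit_return_old(self):
        # Old behaviour: treat the final command like any other one, which
        # can start a new BO that contains nothing but this command.
        self.emit("BBS(return)", BBS_SIZE)

    def emit_return_fixed(self):
        # Fixed behaviour: release the reserved tail first, so the command
        # always fits in the current BO.
        self.end += BBS_SIZE
        self.emit("BBS(return)", BBS_SIZE)


if __name__ == "__main__":
    old, fixed = ToyBatch(10), ToyBatch(10)
    for batch in (old, fixed):
        batch.emit("draw", 6)            # leaves the batch close to its usable end
    old.emit_return_old()
    fixed.emit_return_fixed()
    print(len(old.bos), len(fixed.bos))
```

In this model the old path ends up with a second BO that holds only the final command, while the fixed path keeps everything in one BO, which is the situation the commit message describes as messing up chaining.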

Comment 11 Jason Ekstrand 2018-10-23 13:12:17 UTC

That also makes sense. Thanks for bisecting!
