Bug 92623

Summary:	Differences in prog_data ignored when caching fragment programs (causes hangs)
Product:	Mesa	Reporter:	Pierre Bourdon <delroth>
Component:	Drivers/DRI/i965	Assignee:	Kenneth Graunke <kenneth>
Status:	RESOLVED FIXED	QA Contact:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity:	normal
Priority:	medium	CC:	intel-gfx-bugs, kanepyork
Version:	unspecified
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	Xorg configuration used for headless tests /sys/class/drm/card0/error Hackish testcase that switches between two FS programs that differ only in prog_data

Description Pierre Bourdon 2015-10-22 23:12:45 UTC

Created attachment 119120 [details]
Xorg configuration used for headless tests

[root@duke ~]# pacman -Q mesa linux xorg-server libdrm
mesa 11.0.3-1
linux 4.2.3-1
xorg-server 1.17.2-4
libdrm 2.4.65-1

I was setting up an Intel NUC to run dolphin-emu.org graphical regression tests (https://fifoci.dolphin-emu.org/). The setup is simple: "headless" Xorg (xorg.conf attached) and dolphin-emu running in that DISPLAY rendering tests.

Two of our tests consistently end up segfaulting on Intel GPUs with the following log:

    intel_do_flush_locked failed: Input/output error

Accompanied by the following in dmesg:

    [  215.432142] [drm] stuck on render ring
    [  215.433001] [drm] GPU HANG: ecode 8:0:0x84df3c1c, in FIFO-GPU thread [394], reason: Ring hung, action: reset
    [  215.433003] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
    [  215.433004] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
    [  215.433005] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
    [  215.433006] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
    [  215.433007] [drm] GPU crash dump saved to /sys/class/drm/card0/error
    [  215.435124] drm/i915: Resetting chip after gpu hang
    [  221.455258] [drm] stuck on render ring
    [  221.456246] [drm] GPU HANG: ecode 8:0:0x84df3c1c, in FIFO-GPU thread [394], reason: Ring hung, action: reset
    [  221.458382] drm/i915: Resetting chip after gpu hang

The issue happens consistently when running these tests here. Unfortunately an apitrace does not reproduce the problem, so here are some slightly involved repro instructions using our software:

    $ git clone https://github.com/dolphin-emu/dolphin
    $ cd dolphin && mkdir build && cd build
    $ cmake -DDISABLE_WX=ON ..
    $ make -j4
    $ rm -rf ~/.dolphin-emu
    $ mkdir -p ~/.dolphin-emu/Config
    $ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/Dolphin.ini -O ~/.dolphin-emu/Config/Dolphin.ini
    $ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/gfx_opengl.ini -O ~/.dolphin-emu/Config/gfx_opengl.ini
    $ wget https://fifoci.dolphin-emu.org/media/dff/MarioSuperSluggersStarBar.dff
    $ Binaries/dolphin-emu-nogui -e MarioSuperSluggersStarBar.dff  # <- consistently crashes

I couldn't reproduce the same issue on an Haswell Intel HD chip, but that's with slightly different software versions too.

Comment 1 Pierre Bourdon 2015-10-22 23:14:59 UTC

Of course I haven't tested these repro instructions as well as I thought I did...

Replace:

    $ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/Dolphin.ini -O ~/.dolphin-emu/Config/Dolphin.ini
    $ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/gfx_opengl.ini -O ~/.dolphin-emu/Config/gfx_opengl.ini

With:

    $ wget https://raw.githubusercontent.com/dolphin-emu/fifoci/master/runner/linux/Config-ogl/Dolphin.ini -O ~/.dolphin-emu/Config/Dolphin.ini
    $ wget https://raw.githubusercontent.com/dolphin-emu/fifoci/master/runner/linux/Config-ogl/gfx_opengl.ini -O ~/.dolphin-emu/Config/gfx_opengl.ini

Comment 2 Pierre Bourdon 2015-10-22 23:25:42 UTC

Created attachment 119121 [details]
/sys/class/drm/card0/error

Comment 3 Markus Wick 2015-10-23 21:33:31 UTC

I've tried this hang again on mesa-master debug build, 70b06fb: Same freeze as on 11.0, no assert.

But good luck, this hang is reproduce-able with apitrace:
http://markus.members.selfnet.de/dolphin/i965-broadwell-freeze.trace

Everything else but mesa is the same as Pierre's.

Comment 4 Kenneth Graunke 2015-10-24 05:00:19 UTC

FWIW, enabling the "Always re-emit all state" block in brw_state_upload.c:708 seems to fix this problem.  always_flush_batch=true and always_flush_cache=true (way more flushing) have no impact.  This makes me suspect missing dirty flags...

Comment 5 Kenneth Graunke 2015-10-24 05:22:41 UTC

I decided to try and do a binary search across the atoms, and narrowed the range - always re-emitting atoms 44-57 (inclusive) makes the hang go away.  Notably, not forcing 57 (gen8_pma_fix) returns it to hanging.

Disabling the Depth PMA fix also makes the hang go away.

Comment 6 Kenneth Graunke 2015-10-24 06:06:47 UTC

Apparently re-emitting exactly 3DSTATE_WM and the PMA fix works.

Comment 7 Kenneth Graunke 2015-10-24 06:27:47 UTC

Also, not setting the early depth/stencil mode to EDSC_PREPS works around the problem.  (We set this based on ARB_conservative_depth settings...)

Unfortunately, I've really no idea why...

Comment 8 Markus Wick 2015-10-24 10:53:53 UTC

Another dolphin game is also known to freeze the GPU, maybe it helps to debug this issue: http://markus.members.selfnet.de/dolphin/i965-broadwell-freeze-2.trace

Comment 9 Pierre Bourdon 2015-10-26 14:41:13 UTC

Disabling ARB_shader_image_load_store in Dolphin seems to fix the crash, which concurs with your diagnostics. We use ARB_shader_image_load_store to force early depth test in some shaders.

Comment 10 Kenneth Graunke 2015-10-27 18:50:30 UTC

Right, I misspoke - ARB_conservative_depth has a similarly named feature, but it's definitely the ARB_image_load_store early depth test layout qualifier that's tripping us up here.

I still have no idea why.

Comment 11 Pierre Bourdon 2015-10-28 00:53:22 UTC

Another data point, which may or many not be useful (it might be obvious already from what you've debugged): disabling HiZ through ~/.drirc makes the crash disappear.

Comment 12 Pierre Bourdon 2015-10-28 02:44:34 UTC

So I've been looking at this for the past N hours in the hope of learning more about how Mesa and i965 work (and I've already learned a lot \o/). I noticed something that I find interesting though. It might be completely normal and uninteresting, but if it is I'm curious about the explanation.

I randomly figured out that adding BRW_NEW_FRAGMENT_PROGRAM to the dirty bits that trigger upload_wm_state and gen8_emit_pma_stall_workaround works around the crash bug. But both of these should already trigger on BRW_NEW_FS_PROG_DATA being dirty.

In what condition should BRW_NEW_FRAGMENT_PROGRAM be dirty but not BRW_NEW_FS_PROG_DATA? It sounds like getting a new fragment program should in practice always come with a new prog_data as well?

Comment 13 Pierre Bourdon 2015-10-28 03:55:22 UTC

Examples captured in the wild:

+       if (prev_prog_data && brw->wm.prog_data && prev_prog_data->early_fragment_tests != brw->wm.prog_data->early_fragment_tests) {
+               printf("Old early fragments: %d new: %d\n", prev_prog_data->early_fragment_tests, brw->wm.prog_data->early_fragment_tests);
+               printf("Dirty: %d %d | %016lx\n", !!(ctx->NewDriverState & BRW_NEW_FS_PROG_DATA), !!(ctx->NewDriverState & BRW_NEW_FRAGMENT_PROGRAM), ctx->NewDriverState);
+       }

Old early fragments: 0 new: 1
Dirty: 0 1 | 0000000200600a08

Old early fragments: 1 new: 0
Dirty: 0 1 | 0000000200600a00

Old early fragments: 0 new: 1
Dirty: 1 1 | 0000000200600a09

Old early fragments: 1 new: 0
Dirty: 1 1 | 0000000220600a09

Old early fragments: 0 new: 1
Dirty: 1 1 | 0000000220600a09

Old early fragments: 1 new: 0
Dirty: 0 1 | 0000000200600a00

Comment 14 Pierre Bourdon 2015-10-28 06:30:14 UTC

And just confirmed with a small test program (really hacky, code attached): if you switch between two almost identical shaders with early_fragment_tests being the only difference, NEW_FS_PROG_DATA is never set as dirty!

Old early fragments: 1 new: 0
Dirty: 0 1 | 0000000000600a00
Old early fragments: 0 new: 1
Dirty: 0 1 | 0000000000600a00

Now I'm not completely sure how that should be fixed. But I'm sure you will have an idea.

Comment 15 Pierre Bourdon 2015-10-28 06:31:26 UTC

Created attachment 119240 [details]
Hackish testcase that switches between two FS programs that differ only in prog_data

Comment 16 Pierre Bourdon 2015-10-28 07:17:45 UTC

This is indeed confirmed as the source of the problem. It also happens to be a regression introduced by 1bba29ed403e735ba0bf04ed8aa2e571884fcaaf. The commit just before r1bba29ed works, then it starts hanging.

It impacts at least ARB_shader_image_load_store (early_fragment_tests) and ARB_conservative_depth (other depth-related layout() declarations).

Comment 17 Kenneth Graunke 2015-10-28 09:00:33 UTC

Thanks a ton for tracking this down!  The demo application made it really easy to see what's going on.

The fix turned out to be pretty straightforward:
http://lists.freedesktop.org/archives/mesa-dev/2015-October/098490.html

Comment 18 Kenneth Graunke 2015-10-28 09:00:44 UTC

Thanks a ton for tracking this down!  The demo application made it really easy to see what's going on.

The fix turned out to be pretty straightforward:
http://lists.freedesktop.org/archives/mesa-dev/2015-October/098490.html

Comment 19 Kenneth Graunke 2015-10-29 04:35:33 UTC

Fixed on master with:

commit bf05af3f0e8769f417bbd995470dc1b8083a0df9
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Wed Oct 28 00:53:20 2015 -0700

    i965: Fix missing BRW_NEW_*_PROG_DATA flagging caused by cache reuse.

It's nominated for 11.0, too, so Emil will most likely pick it up for the next release.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.