Created attachment 119120 [details]
Xorg configuration used for headless tests

[root@duke ~]# pacman -Q mesa linux xorg-server libdrm
mesa 11.0.3-1
linux 4.2.3-1
xorg-server 1.17.2-4
libdrm 2.4.65-1

I was setting up an Intel NUC to run dolphin-emu.org graphical regression tests (https://fifoci.dolphin-emu.org/). The setup is simple: "headless" Xorg (xorg.conf attached) and dolphin-emu running in that DISPLAY rendering tests.

Two of our tests consistently end up segfaulting on Intel GPUs with the following log:

intel_do_flush_locked failed: Input/output error

Accompanied by the following in dmesg:

[ 215.432142] [drm] stuck on render ring
[ 215.433001] [drm] GPU HANG: ecode 8:0:0x84df3c1c, in FIFO-GPU thread [394], reason: Ring hung, action: reset
[ 215.433003] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 215.433004] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 215.433005] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 215.433006] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 215.433007] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 215.435124] drm/i915: Resetting chip after gpu hang
[ 221.455258] [drm] stuck on render ring
[ 221.456246] [drm] GPU HANG: ecode 8:0:0x84df3c1c, in FIFO-GPU thread [394], reason: Ring hung, action: reset
[ 221.458382] drm/i915: Resetting chip after gpu hang

The issue happens consistently when running these tests here. Unfortunately an apitrace does not reproduce the problem, so here are some slightly involved repro instructions using our software:

$ git clone https://github.com/dolphin-emu/dolphin
$ cd dolphin && mkdir build && cd build
$ cmake -DDISABLE_WX=ON ..
$ make -j4
$ rm -rf ~/.dolphin-emu
$ mkdir -p ~/.dolphin-emu/Config
$ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/Dolphin.ini -O ~/.dolphin-emu/Config/Dolphin.ini
$ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/gfx_opengl.ini -O ~/.dolphin-emu/Config/gfx_opengl.ini
$ wget https://fifoci.dolphin-emu.org/media/dff/MarioSuperSluggersStarBar.dff
$ Binaries/dolphin-emu-nogui -e MarioSuperSluggersStarBar.dff  # <- consistently crashes

I couldn't reproduce the same issue on a Haswell Intel HD chip, but that's with slightly different software versions too.
Of course I haven't tested these repro instructions as well as I thought I did...

Replace:

$ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/Dolphin.ini -O ~/.dolphin-emu/Config/Dolphin.ini
$ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/gfx_opengl.ini -O ~/.dolphin-emu/Config/gfx_opengl.ini

With:

$ wget https://raw.githubusercontent.com/dolphin-emu/fifoci/master/runner/linux/Config-ogl/Dolphin.ini -O ~/.dolphin-emu/Config/Dolphin.ini
$ wget https://raw.githubusercontent.com/dolphin-emu/fifoci/master/runner/linux/Config-ogl/gfx_opengl.ini -O ~/.dolphin-emu/Config/gfx_opengl.ini
Created attachment 119121 [details]
/sys/class/drm/card0/error
I've tried this hang again with a mesa master debug build (70b06fb): same freeze as on 11.0, no assert.

Luckily, this hang is reproducible with apitrace:
http://markus.members.selfnet.de/dolphin/i965-broadwell-freeze.trace

Everything else but mesa is the same as Pierre's setup.
FWIW, enabling the "Always re-emit all state" block in brw_state_upload.c:708 seems to fix this problem. always_flush_batch=true and always_flush_cache=true (way more flushing) have no impact. This makes me suspect missing dirty flags...
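For context, the block being referred to looks roughly like this (quoted from memory of the brw_upload_state() code of that era, so treat the exact member names as an assumption): a normally-dead branch that marks all Mesa and driver state dirty so every atom gets re-emitted on each draw.

   if (0) {
      /* Always re-emit all state. */
      brw->NewGLState = ~0;
      ctx->NewDriverState = ~0ull;
   }

Flipping the condition to 1 is a blunt instrument, but it is a useful way to test a "missing dirty flag" theory: if the hang disappears with everything re-emitted, some atom is not being told about a state change it depends on.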
I decided to try a binary search across the atoms and narrowed the range: always re-emitting atoms 44-57 (inclusive) makes the hang go away. Notably, not forcing atom 57 (gen8_pma_fix) brings the hang back. Disabling the Depth PMA fix also makes the hang go away.
Apparently re-emitting exactly 3DSTATE_WM and the PMA fix works.
Also, not setting the early depth/stencil mode to EDSC_PREPS works around the problem. (We set this based on ARB_conservative_depth settings...) Unfortunately, I've really no idea why...
Another Dolphin game is also known to freeze the GPU; maybe it helps to debug this issue:
http://markus.members.selfnet.de/dolphin/i965-broadwell-freeze-2.trace
Disabling ARB_shader_image_load_store in Dolphin seems to fix the crash, which agrees with your diagnosis. We use ARB_shader_image_load_store to force early depth testing in some shaders.
Right, I misspoke - ARB_conservative_depth has a similarly named feature, but it's definitely the ARB_shader_image_load_store early depth test layout qualifier that's tripping us up here. I still have no idea why.
Another data point, which may or may not be useful (it might be obvious already from what you've debugged): disabling HiZ through ~/.drirc makes the crash disappear.
So I've been looking at this for the past N hours in the hope of learning more about how Mesa and i965 work (and I've already learned a lot \o/). I noticed something that I find interesting, though. It might be completely normal and uninteresting, but if it is, I'm curious about the explanation.

I randomly figured out that adding BRW_NEW_FRAGMENT_PROGRAM to the dirty bits that trigger upload_wm_state and gen8_emit_pma_stall_workaround works around the crash bug. But both of these should already trigger on BRW_NEW_FS_PROG_DATA being dirty. In what condition should BRW_NEW_FRAGMENT_PROGRAM be dirty but not BRW_NEW_FS_PROG_DATA? It sounds like getting a new fragment program should in practice always come with a new prog_data as well?
Examples captured in the wild:

+  if (prev_prog_data && brw->wm.prog_data && prev_prog_data->early_fragment_tests != brw->wm.prog_data->early_fragment_tests) {
+     printf("Old early fragments: %d new: %d\n", prev_prog_data->early_fragment_tests, brw->wm.prog_data->early_fragment_tests);
+     printf("Dirty: %d %d | %016lx\n", !!(ctx->NewDriverState & BRW_NEW_FS_PROG_DATA), !!(ctx->NewDriverState & BRW_NEW_FRAGMENT_PROGRAM), ctx->NewDriverState);
+  }

Old early fragments: 0 new: 1
Dirty: 0 1 | 0000000200600a08
Old early fragments: 1 new: 0
Dirty: 0 1 | 0000000200600a00
Old early fragments: 0 new: 1
Dirty: 1 1 | 0000000200600a09
Old early fragments: 1 new: 0
Dirty: 1 1 | 0000000220600a09
Old early fragments: 0 new: 1
Dirty: 1 1 | 0000000220600a09
Old early fragments: 1 new: 0
Dirty: 0 1 | 0000000200600a00
And just confirmed with a small test program (really hacky, code attached): if you switch between two almost identical shaders with early_fragment_tests being the only difference, NEW_FS_PROG_DATA is never set as dirty!

Old early fragments: 1 new: 0
Dirty: 0 1 | 0000000000600a00
Old early fragments: 0 new: 1
Dirty: 0 1 | 0000000000600a00

Now I'm not completely sure how that should be fixed. But I'm sure you will have an idea.
Created attachment 119240 [details]
Hackish testcase that switches between two FS programs that differ only in prog_data
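For anyone who wants to poke at this without downloading the attachment, here is a rough, hypothetical reconstruction of such a testcase (the GLFW window setup, shader text and file name are my own assumptions, not the attached code; the essential part is alternating every draw between two fragment shaders whose only difference is the early_fragment_tests layout qualifier from ARB_shader_image_load_store):

/* switch_eft.c -- hypothetical reconstruction, not the attached file.
 * Build with something like: gcc switch_eft.c -o switch_eft -lglfw -lGL
 */
#define GL_GLEXT_PROTOTYPES
#include <GLFW/glfw3.h>
#include <GL/glext.h>
#include <stdio.h>

static const char *vs_src =
    "#version 330\n"
    "in vec4 pos;\n"
    "void main() { gl_Position = pos; }\n";

/* Identical fragment shaders except for the early_fragment_tests qualifier:
 * they compile to the same assembly, but their prog_data differs. */
static const char *fs_src[2] = {
    "#version 330\n"
    "#extension GL_ARB_shader_image_load_store : require\n"
    "layout(early_fragment_tests) in;\n"
    "out vec4 color;\n"
    "void main() { color = vec4(1.0); }\n",

    "#version 330\n"
    "#extension GL_ARB_shader_image_load_store : require\n"
    "out vec4 color;\n"
    "void main() { color = vec4(1.0); }\n",
};

static GLuint build_program(const char *fs)
{
    GLuint vs = glCreateShader(GL_VERTEX_SHADER);
    glShaderSource(vs, 1, &vs_src, NULL);
    glCompileShader(vs);

    GLuint fsh = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(fsh, 1, &fs, NULL);
    glCompileShader(fsh);

    GLuint prog = glCreateProgram();
    glAttachShader(prog, vs);
    glAttachShader(prog, fsh);
    glLinkProgram(prog);
    return prog;
}

int main(void)
{
    if (!glfwInit())
        return 1;
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    GLFWwindow *win = glfwCreateWindow(256, 256, "early_fragment_tests switch", NULL, NULL);
    if (!win)
        return 1;
    glfwMakeContextCurrent(win);

    GLuint prog[2] = { build_program(fs_src[0]), build_program(fs_src[1]) };

    /* One full-screen triangle. */
    const GLfloat verts[] = { -1, -1, 3, -1, -1, 3 };
    GLuint vao, vbo;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);

    for (int frame = 0; frame < 100; frame++) {
        glClear(GL_COLOR_BUFFER_BIT);

        /* Alternate programs every draw: only early_fragment_tests changes,
         * so the driver keeps switching prog_data on otherwise identical
         * programs. */
        GLuint p = prog[frame & 1];
        glUseProgram(p);
        GLint loc = glGetAttribLocation(p, "pos");
        glVertexAttribPointer((GLuint)loc, 2, GL_FLOAT, GL_FALSE, 0, 0);
        glEnableVertexAttribArray((GLuint)loc);
        glDrawArrays(GL_TRIANGLES, 0, 3);

        glfwSwapBuffers(win);
        glfwPollEvents();
    }

    glfwTerminate();
    return 0;
}

Running something like this with the printf hack from the earlier comment applied to the driver should show early_fragment_tests flipping back and forth while the FS_PROG_DATA dirty bit stays clean on the affected Mesa versions.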
This is indeed confirmed as the source of the problem. It also happens to be a regression introduced by 1bba29ed403e735ba0bf04ed8aa2e571884fcaaf. The commit just before 1bba29ed works; from that commit on, it starts hanging. It impacts at least ARB_shader_image_load_store (early_fragment_tests) and ARB_conservative_depth (other depth-related layout() declarations).
Thanks a ton for tracking this down! The demo application made it really easy to see what's going on. The fix turned out to be pretty straightforward: http://lists.freedesktop.org/archives/mesa-dev/2015-October/098490.html
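To make the failure mode concrete for anyone skimming the thread, here is a toy model (hypothetical names and structures, not Mesa code and not the actual patch) of how a deduplicating program cache can lose a dirty flag when two shaders share the same compiled binary but carry different metadata, and how also comparing the metadata restores the flag:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-ins for the driver's compiled binary and its metadata
 * (in i965 terms: the program binary and its prog_data). */
struct prog_data { bool early_fragment_tests; };

struct cache_item {
    const char *assembly;          /* the compiled shader binary */
    struct prog_data prog_data;    /* per-variant metadata */
};

/* "Bound" state plus the dirty flag that dependent packets
 * (3DSTATE_WM, the PMA fix) would listen to. */
static const char *bound_assembly;
static struct prog_data bound_prog_data;
static bool dirty_fs_prog_data;

/* Buggy cache reuse: flag the dirty bit only when the binary changes.
 * If two shaders deduplicate to the same assembly but carry different
 * prog_data, the flag is never raised and dependent state is never
 * re-emitted. */
static void bind_buggy(const struct cache_item *item)
{
    if (bound_assembly != item->assembly)
        dirty_fs_prog_data = true;
    bound_assembly = item->assembly;
    bound_prog_data = item->prog_data;
}

/* Fixed version: also compare the metadata before deciding nothing changed. */
static void bind_fixed(const struct cache_item *item)
{
    if (bound_assembly != item->assembly ||
        memcmp(&bound_prog_data, &item->prog_data, sizeof(bound_prog_data)) != 0)
        dirty_fs_prog_data = true;
    bound_assembly = item->assembly;
    bound_prog_data = item->prog_data;
}

int main(void)
{
    /* Two "shaders" that compiled to the same assembly but differ only in
     * the early_fragment_tests metadata -- the situation shown above. */
    const char *shared_asm = "mov ...; sendc ...";
    struct cache_item a = { shared_asm, { .early_fragment_tests = true  } };
    struct cache_item b = { shared_asm, { .early_fragment_tests = false } };

    bind_buggy(&a);
    dirty_fs_prog_data = false;            /* state emitted, flag consumed */
    bind_buggy(&b);
    printf("buggy: dirty after switch = %d\n", dirty_fs_prog_data);   /* 0 */

    bind_fixed(&a);
    dirty_fs_prog_data = false;
    bind_fixed(&b);
    printf("fixed: dirty after switch = %d\n", dirty_fs_prog_data);   /* 1 */

    return 0;
}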
Fixed on master with:

commit bf05af3f0e8769f417bbd995470dc1b8083a0df9
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Wed Oct 28 00:53:20 2015 -0700

    i965: Fix missing BRW_NEW_*_PROG_DATA flagging caused by cache reuse.

It's nominated for 11.0, too, so Emil will most likely pick it up for the next release.