Summary: | Differences in prog_data ignored when caching fragment programs (causes hangs) | ||
---|---|---|---|
Product: | Mesa | Reporter: | Pierre Bourdon <delroth> |
Component: | Drivers/DRI/i965 | Assignee: | Kenneth Graunke <kenneth> |
Status: | RESOLVED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs, kanepyork |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Xorg configuration used for headless tests
/sys/class/drm/card0/error Hackish testcase that switches between two FS programs that differ only in prog_data |
Description
Pierre Bourdon
2015-10-22 23:12:45 UTC
Of course I haven't tested these repro instructions as well as I thought I did... Replace: $ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/Dolphin.ini -O ~/.dolphin-emu/Config/Dolphin.ini $ wget https://github.com/dolphin-emu/fifoci/blob/master/runner/linux/Config-ogl/gfx_opengl.ini -O ~/.dolphin-emu/Config/gfx_opengl.ini With: $ wget https://raw.githubusercontent.com/dolphin-emu/fifoci/master/runner/linux/Config-ogl/Dolphin.ini -O ~/.dolphin-emu/Config/Dolphin.ini $ wget https://raw.githubusercontent.com/dolphin-emu/fifoci/master/runner/linux/Config-ogl/gfx_opengl.ini -O ~/.dolphin-emu/Config/gfx_opengl.ini Created attachment 119121 [details]
/sys/class/drm/card0/error
I've tried this hang again on mesa-master debug build, 70b06fb: Same freeze as on 11.0, no assert. But good luck, this hang is reproduce-able with apitrace: http://markus.members.selfnet.de/dolphin/i965-broadwell-freeze.trace Everything else but mesa is the same as Pierre's. FWIW, enabling the "Always re-emit all state" block in brw_state_upload.c:708 seems to fix this problem. always_flush_batch=true and always_flush_cache=true (way more flushing) have no impact. This makes me suspect missing dirty flags... I decided to try and do a binary search across the atoms, and narrowed the range - always re-emitting atoms 44-57 (inclusive) makes the hang go away. Notably, not forcing 57 (gen8_pma_fix) returns it to hanging. Disabling the Depth PMA fix also makes the hang go away. Apparently re-emitting exactly 3DSTATE_WM and the PMA fix works. Also, not setting the early depth/stencil mode to EDSC_PREPS works around the problem. (We set this based on ARB_conservative_depth settings...) Unfortunately, I've really no idea why... Another dolphin game is also known to freeze the GPU, maybe it helps to debug this issue: http://markus.members.selfnet.de/dolphin/i965-broadwell-freeze-2.trace Disabling ARB_shader_image_load_store in Dolphin seems to fix the crash, which concurs with your diagnostics. We use ARB_shader_image_load_store to force early depth test in some shaders. Right, I misspoke - ARB_conservative_depth has a similarly named feature, but it's definitely the ARB_image_load_store early depth test layout qualifier that's tripping us up here. I still have no idea why. Another data point, which may or many not be useful (it might be obvious already from what you've debugged): disabling HiZ through ~/.drirc makes the crash disappear. So I've been looking at this for the past N hours in the hope of learning more about how Mesa and i965 work (and I've already learned a lot \o/). I noticed something that I find interesting though. It might be completely normal and uninteresting, but if it is I'm curious about the explanation. I randomly figured out that adding BRW_NEW_FRAGMENT_PROGRAM to the dirty bits that trigger upload_wm_state and gen8_emit_pma_stall_workaround works around the crash bug. But both of these should already trigger on BRW_NEW_FS_PROG_DATA being dirty. In what condition should BRW_NEW_FRAGMENT_PROGRAM be dirty but not BRW_NEW_FS_PROG_DATA? It sounds like getting a new fragment program should in practice always come with a new prog_data as well? Examples captured in the wild: + if (prev_prog_data && brw->wm.prog_data && prev_prog_data->early_fragment_tests != brw->wm.prog_data->early_fragment_tests) { + printf("Old early fragments: %d new: %d\n", prev_prog_data->early_fragment_tests, brw->wm.prog_data->early_fragment_tests); + printf("Dirty: %d %d | %016lx\n", !!(ctx->NewDriverState & BRW_NEW_FS_PROG_DATA), !!(ctx->NewDriverState & BRW_NEW_FRAGMENT_PROGRAM), ctx->NewDriverState); + } Old early fragments: 0 new: 1 Dirty: 0 1 | 0000000200600a08 Old early fragments: 1 new: 0 Dirty: 0 1 | 0000000200600a00 Old early fragments: 0 new: 1 Dirty: 1 1 | 0000000200600a09 Old early fragments: 1 new: 0 Dirty: 1 1 | 0000000220600a09 Old early fragments: 0 new: 1 Dirty: 1 1 | 0000000220600a09 Old early fragments: 1 new: 0 Dirty: 0 1 | 0000000200600a00 And just confirmed with a small test program (really hacky, code attached): if you switch between two almost identical shaders with early_fragment_tests being the only difference, NEW_FS_PROG_DATA is never set as dirty! Old early fragments: 1 new: 0 Dirty: 0 1 | 0000000000600a00 Old early fragments: 0 new: 1 Dirty: 0 1 | 0000000000600a00 Now I'm not completely sure how that should be fixed. But I'm sure you will have an idea. Created attachment 119240 [details]
Hackish testcase that switches between two FS programs that differ only in prog_data
This is indeed confirmed as the source of the problem. It also happens to be a regression introduced by 1bba29ed403e735ba0bf04ed8aa2e571884fcaaf. The commit just before r1bba29ed works, then it starts hanging. It impacts at least ARB_shader_image_load_store (early_fragment_tests) and ARB_conservative_depth (other depth-related layout() declarations). Thanks a ton for tracking this down! The demo application made it really easy to see what's going on. The fix turned out to be pretty straightforward: http://lists.freedesktop.org/archives/mesa-dev/2015-October/098490.html Thanks a ton for tracking this down! The demo application made it really easy to see what's going on. The fix turned out to be pretty straightforward: http://lists.freedesktop.org/archives/mesa-dev/2015-October/098490.html Fixed on master with: commit bf05af3f0e8769f417bbd995470dc1b8083a0df9 Author: Kenneth Graunke <kenneth@whitecape.org> Date: Wed Oct 28 00:53:20 2015 -0700 i965: Fix missing BRW_NEW_*_PROG_DATA flagging caused by cache reuse. It's nominated for 11.0, too, so Emil will most likely pick it up for the next release. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.