I see a GPU hang on SKL GT4 with DiRT Rally. The hang only happens during gameplay, not in the menus. I always see it hang fairly quickly (within about a minute of play). I bisected it to: commit 7c5988e615e580f771f6f478631d63aaada872a6 Author: Kenneth Graunke <kenneth@whitecape.org> Date: Tue Sep 5 15:14:18 2017 -0700 i965: Disentangle batch and state buffer flushing.
Created attachment 134664 [details] error state
Does Topi's GEN9+ GPU hang fix affect this in any way? (It's still not in, and I was wondering whether Kenneth's commit just exposes better the issue that Topi's patch fixes.)
No, "intel/compiler/gen9: Pixel shader header only workaround" did not help with the GPU hang.
Some debug info: Increasing STATE_SZ from 16k to 32k allows the "-benchmark" to run for about 30~60s before the GPU hang. At 16k, it hangs within ~10s. Increasing STATE_SZ from 16k to 64k allowed it to run for about 30~60s before hitting an assert: DirtRally: intel_batchbuffer.c:1063: brw_state_batch: Assertion `offset + size < batch->state_bo->size' failed. I also tried increasing BATCH_SZ to 32k, but it did not appear to make a difference.
Patches on the list: https://patchwork.freedesktop.org/series/34575/
Those apparently don't help at all. The game's scripts muck with LD_LIBRARY_PATH and I think I was running the wrong build at some point.
(In reply to Kenneth Graunke from comment #6) > The game's scripts muck with > LD_LIBRARY_PATH and I think I was running the wrong build at some point. My script for DiRT only sets LIBGL_DRIVERS_PATH which luckily they don't seem to mess with.
Okay, those patches fix bugs. Just not the one super obvious bug. :) This one applies on master before the others and by itself fixes the issue: https://patchwork.freedesktop.org/patch/190968/
Fixed by: commit cfc5af588cf8e0cfb41ea907a7da3cca676be1c2 Author: Kenneth Graunke <kenneth@whitecape.org> AuthorDate: Wed Nov 29 00:27:18 2017 -0800 Commit: Kenneth Graunke <kenneth@whitecape.org> CommitDate: Wed Nov 29 11:48:29 2017 -0800 i965: Program the dynamic state heap size to MAX_STATE_SIZE. STATE_BASE_ADDRESS specifies a maximum size of the dynamic state section, beyond which data supposedly reads back as 0. On Gen8+, we were programming it to the size of the buffer. This worked fine until we started growing the state buffer in commit 2dfc119f22f25708. When the state buffer grows, the value in STATE_BASE_ADDRESS becomes too small, and our state beyond STATE_SZ bytes would read back as 0. To avoid having to update the value, we program it to MAX_STATE_SIZE. We used to program the upper bound to the maximum on older hardware anyway, so programming it too large isn't a big deal. Bogus SURFACE_STATE can easily lead to GPU hangs and misrendering. DiRT Rally was hitting the statebuffer growth path, and suffered from bad texture corruption and GPU hangs (usually around the same time). This patch fixes both issues. Fixes: 2dfc119f22f257082ab0 "i965: Grow the batch/state buffers if we need space and can't flush." Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103101 Tested-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net> These patches are also relevant stable candidates for fixing related bugs: commit 2af70854609509adf5dc92af2fcf1c30938e2a5d Author: Kenneth Graunke <kenneth@whitecape.org> Date: Tue Nov 28 08:30:50 2017 -0800 i965: Use old_bo->align when growing batch/state buffer instead of 4096. The intention here is make the new BO use the same alignment as the old BO. This isn't strictly necessary, but we would have to update the 'alignment' field in the validation list when swapping it out, and we don't bother today. The batch and state buffers use an alignment of 4096, so this should be equivalent - it's just clearer than cut and pasting a magic constant. Fixes: 2dfc119f22f257082ab0 "i965: Grow the batch/state buffers if we need space and can't flush." Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> commit 52d32917e1f3f70abcbcff5508f7423e94626b41 Author: Kenneth Graunke <kenneth@whitecape.org> Date: Tue Nov 28 08:59:07 2017 -0800 i965: Preserve EXEC_OBJECT_CAPTURE when growing the BO. The original state buffer was marked with EXEC_OBJECT_CAPTURE. When growing it, we want to preserve that flag so we continue to capture it in GPU hang reports. Fixes: 2dfc119f22f257082ab0 "i965: Grow the batch/state buffers if we need s pace and can't flush." Reviewed-by: Ian Romanick <ian.d.romanick@intel.com> commit ca4361658635f2b401e9793c0b982721998ecb70 Author: Kenneth Graunke <kenneth@whitecape.org> Date: Tue Nov 28 08:20:39 2017 -0800 i965: Don't grow batch/state buffer on every emit after an overflow. Once we reach the intended size of the buffer (BATCH_SZ or STATE_SZ), we try and flush. If we're not allowed to flush, we resort to growing the buffer so that there's space for the data we need to emit. We accidentally got the threshold wrong. The first non-wrappable call beyond (e.g.) STATE_SZ would grow the buffer to floor(1.5 * STATE_SZ), The next call would see we were beyond STATE_SZ and think we needed to grow a second time - when the buffer was already large enough. We still want to flush when we hit STATE_SZ, but for growing, we should use the actual size of the buffer as the threshold. This way, we only grow when actually necessary. v2: Simplify the control flow (suggested by Jordan) Fixes: 2dfc119f22f257082ab0 "i965: Grow the batch/state buffers if we need s pace and can't flush." Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> One more related fix remains unreviewed: i965: Avoid problems from referencing orphaned BOs after growing.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.