Bug 103101 - [SKL][bisected] DiRT Rally GPU hang
Summary: [SKL][bisected] DiRT Rally GPU hang
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965 (show other bugs)
Version: git
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Kenneth Graunke
QA Contact: Intel 3D Bugs Mailing List
Depends on:
Blocks: mesa-17.3
  Show dependency treegraph
Reported: 2017-10-04 17:08 UTC by Jordan Justen
Modified: 2017-11-30 01:33 UTC (History)
3 users (show)

See Also:
i915 platform:
i915 features:

error state (9.52 KB, application/x-bzip)
2017-10-04 17:10 UTC, Jordan Justen

Description Jordan Justen 2017-10-04 17:08:03 UTC
I see a GPU hang on SKL GT4 with DiRT Rally. The hang only
happens during gameplay, not in the menus. I always see it
hang fairly quickly (within about a minute of play).

I bisected it to:

commit 7c5988e615e580f771f6f478631d63aaada872a6
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Tue Sep 5 15:14:18 2017 -0700

    i965: Disentangle batch and state buffer flushing.
Comment 1 Jordan Justen 2017-10-04 17:10:20 UTC
Created attachment 134664 [details]
error state
Comment 2 Eero Tamminen 2017-10-10 11:40:13 UTC
Does Topi's GEN9+ GPU hang fix affect this in any way?

(It's still not in, and I was wondering whether Kenneth's commit just exposes better the issue that Topi's patch fixes.)
Comment 3 Jordan Justen 2017-10-10 17:00:40 UTC
No, "intel/compiler/gen9: Pixel shader header only workaround"
did not help with the GPU hang.
Comment 4 Jordan Justen 2017-11-06 02:08:23 UTC
Some debug info:

Increasing STATE_SZ from 16k to 32k allows the "-benchmark"
to run for about 30~60s before the GPU hang. At 16k, it hangs
within ~10s.

Increasing STATE_SZ from 16k to 64k allowed it to run for
about 30~60s before hitting an assert:
DirtRally: intel_batchbuffer.c:1063: brw_state_batch: Assertion `offset + size < batch->state_bo->size' failed.

I also tried increasing BATCH_SZ to 32k, but it did not appear
to make a difference.
Comment 5 Kenneth Graunke 2017-11-29 00:22:03 UTC
Patches on the list:

Comment 6 Kenneth Graunke 2017-11-29 05:59:17 UTC
Those apparently don't help at all.  The game's scripts muck with LD_LIBRARY_PATH and I think I was running the wrong build at some point.
Comment 7 Jordan Justen 2017-11-29 06:25:33 UTC
(In reply to Kenneth Graunke from comment #6)
> The game's scripts muck with
> LD_LIBRARY_PATH and I think I was running the wrong build at some point.

My script for DiRT only sets LIBGL_DRIVERS_PATH which luckily
they don't seem to mess with.
Comment 8 Kenneth Graunke 2017-11-29 09:11:00 UTC
Okay, those patches fix bugs.  Just not the one super obvious bug. :)

This one applies on master before the others and by itself fixes the issue:

Comment 9 Kenneth Graunke 2017-11-30 01:33:41 UTC
Fixed by:

commit cfc5af588cf8e0cfb41ea907a7da3cca676be1c2
Author:     Kenneth Graunke <kenneth@whitecape.org>
AuthorDate: Wed Nov 29 00:27:18 2017 -0800
Commit:     Kenneth Graunke <kenneth@whitecape.org>
CommitDate: Wed Nov 29 11:48:29 2017 -0800

    i965: Program the dynamic state heap size to MAX_STATE_SIZE.
    STATE_BASE_ADDRESS specifies a maximum size of the dynamic state
    section, beyond which data supposedly reads back as 0.  On Gen8+,
    we were programming it to the size of the buffer.  This worked fine
    until we started growing the state buffer in commit 2dfc119f22f25708.
    When the state buffer grows, the value in STATE_BASE_ADDRESS becomes
    too small, and our state beyond STATE_SZ bytes would read back as 0.
    To avoid having to update the value, we program it to MAX_STATE_SIZE.
    We used to program the upper bound to the maximum on older hardware
    anyway, so programming it too large isn't a big deal.
    Bogus SURFACE_STATE can easily lead to GPU hangs and misrendering.
    DiRT Rally was hitting the statebuffer growth path, and suffered from
    bad texture corruption and GPU hangs (usually around the same time).
    This patch fixes both issues.
    Fixes: 2dfc119f22f257082ab0 "i965: Grow the batch/state buffers if we need space and can't flush."
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103101
    Tested-by: Jordan Justen <jordan.l.justen@intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

These patches are also relevant stable candidates for fixing related bugs:

commit 2af70854609509adf5dc92af2fcf1c30938e2a5d
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Tue Nov 28 08:30:50 2017 -0800

    i965: Use old_bo->align when growing batch/state buffer instead of 4096.
    The intention here is make the new BO use the same alignment as the old
    BO.  This isn't strictly necessary, but we would have to update the
    'alignment' field in the validation list when swapping it out, and we
    don't bother today.
    The batch and state buffers use an alignment of 4096, so this should be
    equivalent - it's just clearer than cut and pasting a magic constant.
    Fixes: 2dfc119f22f257082ab0 "i965: Grow the batch/state buffers if we need space and can't flush."
    Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
    Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>

commit 52d32917e1f3f70abcbcff5508f7423e94626b41
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Tue Nov 28 08:59:07 2017 -0800

    i965: Preserve EXEC_OBJECT_CAPTURE when growing the BO.
    The original state buffer was marked with EXEC_OBJECT_CAPTURE.  When
    growing it, we want to preserve that flag so we continue to capture it
    in GPU hang reports.
    Fixes: 2dfc119f22f257082ab0 "i965: Grow the batch/state buffers if we need s
pace and can't flush."
    Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>

commit ca4361658635f2b401e9793c0b982721998ecb70
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Tue Nov 28 08:20:39 2017 -0800

    i965: Don't grow batch/state buffer on every emit after an overflow.
    Once we reach the intended size of the buffer (BATCH_SZ or STATE_SZ), we
    try and flush.  If we're not allowed to flush, we resort to growing the
    buffer so that there's space for the data we need to emit.
    We accidentally got the threshold wrong.  The first non-wrappable call
    beyond (e.g.) STATE_SZ would grow the buffer to floor(1.5 * STATE_SZ),
    The next call would see we were beyond STATE_SZ and think we needed to
    grow a second time - when the buffer was already large enough.
    We still want to flush when we hit STATE_SZ, but for growing, we should
    use the actual size of the buffer as the threshold.  This way, we only
    grow when actually necessary.
    v2: Simplify the control flow (suggested by Jordan)
    Fixes: 2dfc119f22f257082ab0 "i965: Grow the batch/state buffers if we need s
pace and can't flush."
    Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>

One more related fix remains unreviewed:

    i965: Avoid problems from referencing orphaned BOs after growing.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.