Bug 110131

Summary: [GEN9] random/rare GPU hangs in tessellation tests
Product: Mesa
Reporter: Eero Tamminen <eero.t.tamminen>
Component: Drivers/DRI/i965
Assignee: Intel 3D Bugs Mailing List <intel-3d-bugs>
Status: VERIFIED WORKSFORME
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium    
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform:
i915 features:
Attachments: SKL GT4e SynMark OglTerrainFlyTess hang error state

Description Eero Tamminen 2019-03-15 12:02:28 UTC
Created attachment 143679
SKL GT4e SynMark OglTerrainFlyTess hang error state

Setup:
* Ubuntu 18.04
* v5.0+ drm-tip kernel & git version of Xserver
* Mesa git version

In the last few days I've seen a couple of random GPU hangs in tessellation-related tests:

- on SKL GT4e, a recoverable hang once in SynMark2 v7 OglTerrainFlyTess, and once in GfxBench v5 GL Aztec Ruins normal (which also does a lot of other things besides tessellation)
- one system hang in the GfxBench tessellation test on KBL GT2 the day before

It's possible that the first item is related to starting to use Weston/XWayland instead of normal X:
----------------------------------------------------
[ 8231.866172] i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in  [0], hang on rcs0
[ 8231.866174] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 8231.866174] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 8231.866175] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 8231.866175] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 8231.866175] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 8231.867183] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 8239.858844] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 8243.313841] Asynchronous wait on fence i915:weston[643]/1:5eb9e timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
[ 8247.858844] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----------------------------------------------------

See attachment for error state.
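
For reference, a minimal sketch of how the error state named in the log could be copied somewhere persistent before a reboot loses it. The default path is the one the kernel message prints; the function name, the output file name pattern, and the empty-file check are assumptions for illustration, and reading the sysfs file may require root privileges.

```python
import time
from pathlib import Path


def save_error_state(src="/sys/class/drm/card0/error", dest_dir="."):
    """Copy the GPU error state to a timestamped file.

    Returns the destination path, or None if the source was empty.
    The file naming scheme is hypothetical, chosen only for this sketch.
    """
    data = Path(src).read_bytes()
    if not data.strip():
        # Nothing useful to keep.
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir) / f"i915-error-{stamp}.txt"
    dest.write_bytes(data)
    return dest
```

Reading the whole file in one call keeps the sketch short; a multi-megabyte error state is still small enough for that.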

(Another possibility is that the hangs started, or became more visible, after the "intel/nir: Vectorize all IO" fix for bug 107510, as that improved the tessellation tests.)

Note: I don't actively read dmesg output, so I may have missed most of the recoverable GPU hangs unless they've been serious enough to hang the system, fail the test, or at least slow it down enough to significantly impact performance.  I'll add some better tracking for that.
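
One way such tracking might look, as a rough sketch only: scan captured dmesg text for the hang line and pull out the timestamp, ecode and engine. The regular expression is modeled on the single log line quoted above and is an assumption; other kernel versions may word the message differently, and `find_gpu_hangs` is a made-up name, not part of any existing tool.

```python
import re

# Modeled on the i915 hang line in the log above; the exact wording
# is kernel-version dependent, so treat this pattern as an assumption.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+i915 \S+: GPU HANG: "
    r"ecode (?P<ecode>\S+?), .*hang on (?P<engine>\w+)"
)


def find_gpu_hangs(dmesg_text):
    """Return a list of (timestamp, ecode, engine), one per hang line."""
    hangs = []
    for line in dmesg_text.splitlines():
        m = HANG_RE.search(line)
        if m:
            hangs.append(
                (float(m.group("time")), m.group("ecode"), m.group("engine"))
            )
    return hangs
```

Run against the text captured after each test, a non-empty result would flag a recoverable hang even when the test itself passed.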

No idea whether these are related to compute hangs bug 108820, or Heaven hang bug 103556.
Comment 1 Eero Tamminen 2019-03-18 11:18:09 UTC
(In reply to Eero Tamminen from comment #0)
> No idea whether these are related to compute hangs bug 108820, or Heaven
> hang bug 103556.

There were Heaven hangs during the weekend, but no tessellation test hangs.  If I don't happen to notice more of these by the end of the month, I'll close this as WORKSFORME.
Comment 2 Lionel Landwerlin 2019-03-26 12:25:56 UTC
(In reply to Eero Tamminen from comment #0)
> Created attachment 143679
> SKL GT4e SynMark OglTerrainFlyTess hang error state
> 
> No idea whether these are related to compute hangs bug 108820, or Heaven
> hang bug 103556.

This error state has consistent HS/TE/DS stage programming, so it would seem to be a different issue from the Unigine bug.
Comment 3 Eero Tamminen 2019-03-26 13:53:23 UTC
(In reply to Lionel Landwerlin from comment #2)
> This error state has consistent HS/TE/DS stage programming so it would seem
> to be a different issue from the unigine bug.

Thanks!

I now have tracking for GPU resets, and I haven't seen any tessellation test hangs since I filed this bug (only bug 108820 & bug 103556 hangs).  If they don't appear by next week, I'll close this as WORKSFORME.
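
As an illustration of what per-test reset tracking could look like (not the actual tracking used here), the sketch below counts the "Resetting ... for hang on ..." lines per engine, optionally only those at or after a given dmesg timestamp such as the start of a test. The pattern follows the log in comment #0; `count_resets` and its `since` parameter are hypothetical names.

```python
import re
from collections import Counter

# Matches reset lines as they appear in the log quoted in comment #0.
RESET_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+i915 \S+: Resetting (\w+) for hang on \w+"
)


def count_resets(dmesg_text, since=0.0):
    """Count engine resets per engine at or after timestamp `since`."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = RESET_RE.search(line)
        if m and float(m.group(1)) >= since:
            counts[m.group(2)] += 1
    return counts
```

Recording the dmesg timestamp at test start and passing it as `since` attributes resets to the test that was running when they happened.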
Comment 4 Eero Tamminen 2019-04-09 11:30:54 UTC
(In reply to Eero Tamminen from comment #3)
> I now have tracking for GPU resets, and I haven't seen any tessellation test
> hangs since I filed this bug (only bug 108820 & bug 103556 hangs).  If they
> don't appear by next week, I'll close this as WORKSFORME.

-> WORKSFORME.

I'm seeing (recoverable) GEN9+ GPU hangs only in Manhattan 3.1, CarChase and AztecRuins (and all of those could be compute issues, bug 108820), not in the tests listed in this bug.
