Bug 111424

Summary: (Only partially recoverable) GPU hangs in (trivial) GpuTest Triangle test
Product: DRI
Reporter: Eero Tamminen <eero.t.tamminen>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED MOVED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs, kenny, leho
Version: DRI git
Keywords: regression
Hardware: Other
OS: All
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=111385
https://bugs.freedesktop.org/show_bug.cgi?id=112169
Whiteboard: Triaged, ReadyForDev
i915 platform: BXT, KBL, SKL
i915 features: GPU hang
Attachments:
* SKL GT4e hang, with drm-tip git from few days ago and fullscreen Triangle benchmark (flags: none)
* SKL GT2 hang, with drm-tip v5.2 and windowed/composited Triangle benchmark (flags: none)

Description Eero Tamminen 2019-08-19 10:11:44 UTC
Created attachment 145093
SKL GT4e hang, with drm-tip git from few days ago and fullscreen Triangle benchmark

Setup:
* HW: SKL GT2, SKL GT4e
* Display: FullHD
* Kernel: drm-tip v5.2 and drm-tip git from a few days ago
* OS: Ubuntu 18.04
* Desktop: lightdm/Unity/compiz
* Test: GpuTest v0.7 (latest)

Use-case:
* Trivial windowed/composited use-case (SKL GT2 hang, with drm-tip v5.2):
  GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
* Trivial fullscreen use-case (SKL GT4e hang, with drm-tip git from few days ago):
  GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000

Expected output:
* No hangs

Actual output:
* (Recoverable) GPU hang.

Notes:
* Hang is rare
* The test case is a trivial 3D one and therefore runs at very high FPS.  It does a (fast) clear and draws a triangle half the size of the window (the benchmark's framework also draws a small progress bar and blits a penguin image to indicate Linux, but those are so small that they don't impact performance)
* Given the simplicity of the benchmark and the resulting high FPS, this looks like a race condition, probably in the kernel
Comment 1 Eero Tamminen 2019-08-19 10:13:33 UTC
Created attachment 145094
SKL GT2 hang, with drm-tip v5.2 and windowed/composited Triangle benchmark
Comment 2 Chris Wilson 2019-08-19 10:33:22 UTC
The current sampling is ACTHD (batch/ring head address, updates with batch buffer) + RING_START (to spot context switches) + RING_HEAD (includes wrap counter so effectively ~24b). Only if they all match plus some fuzzing of INSTDONE do we worry about the GPU being hung.

We could now throw in an engine->serial [submission serial] and only declare a hang if we stop submitting (which would be CS freeze on execlists, and ring full on ringbuffer). Or we can go with the heartbeat scheme.

If the drm-tip result is typical, the engine reset should be harmless.
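
The sampling check described in this comment can be sketched roughly as follows. This is a simplified illustration in Python, not the i915 implementation: the class and field names, the `stalled_samples_needed` threshold, and the plain equality test on INSTDONE (standing in for the "fuzzing") are all assumptions made for the example. The optional `serial` field models the proposed engine->serial, so that a hang is only declared once submission has also stopped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineSample:
    """One periodic snapshot of engine state (field names are illustrative)."""
    acthd: int       # batch/ring head address
    ring_start: int  # changes on a context switch
    ring_head: int   # ring head, including the wrap counter
    instdone: int    # instruction-done bits (compared exactly here; a simplification)
    serial: int = 0  # hypothetical submission serial, modeled on engine->serial


class HangChecker:
    """Flags a suspected hang after several consecutive identical samples."""

    def __init__(self, stalled_samples_needed: int = 3, use_serial: bool = False):
        # Both parameters are assumptions for this sketch, not driver tunables.
        self.stalled_samples_needed = stalled_samples_needed
        self.use_serial = use_serial
        self.prev: Optional[EngineSample] = None
        self.stalled = 0

    def observe(self, sample: EngineSample) -> bool:
        """Record a sample; return True if the engine now looks hung."""
        prev, self.prev = self.prev, sample
        if prev is None:
            self.stalled = 0
            return False

        unchanged = (
            sample.acthd == prev.acthd
            and sample.ring_start == prev.ring_start
            and sample.ring_head == prev.ring_head
            and sample.instdone == prev.instdone
        )
        if self.use_serial:
            # Only count as stalled if nothing new was submitted either.
            unchanged = unchanged and sample.serial == prev.serial

        self.stalled = self.stalled + 1 if unchanged else 0
        return self.stalled >= self.stalled_samples_needed
```

With `use_serial=True`, a workload that keeps submitting new work while the sampled registers happen to match between samples would not be flagged, which is the behaviour the engine->serial suggestion is aiming for.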
Comment 3 Eero Tamminen 2019-08-23 07:23:08 UTC
> If the drm-tip result is typical, the engine reset should be harmless.

Last night SKL GT4e didn't recover from the Triangle hang, but got stuck.
Comment 4 Eero Tamminen 2019-08-28 08:25:52 UTC
Yesterday this happened with KBL GT3e when using an older drm-tip v5.2 kernel -> added KBL to platforms.

Btw. maybe these are related to the SynMark TerrainFlyInst hangs I commented on in bug 111453, as those also happen with drm-tip v5.2?
Comment 5 Eero Tamminen 2019-09-06 07:45:08 UTC
This is rare, but now it happened also on BXT, i.e. it happens on all GEN9 devices I have (don't have any newer).
Comment 6 Eero Tamminen 2019-09-06 08:03:00 UTC
(In reply to Eero Tamminen from comment #5)
> This is rare, but now it happened also on BXT, i.e. it happens on all GEN9
> devices I have (don't have any newer).

This time it didn't recover properly either.  Many other tests failed after it, and I didn't get i915 error data because testing timed out due to the large slowdown.
Comment 7 Eero Tamminen 2019-09-16 15:33:03 UTC
(In reply to Eero Tamminen from comment #3)
> > If the drm-tip result is typical, the engine reset should be harmless.
> 
> Last night SKL GT4e didn't recover from the Triangle hang, but got stuck.

E.g. last night, GPU hangs on KBL GT3e added 20s of extra run-time to the 40s GpuTest Triangle runs, although the following tests worked fine.  In summary, it doesn't seem harmless.

[ 4866.563824] Iteration 3/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4902.983138] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
[ 4910.877780] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[ 4910.877782] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 4910.877783] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 4910.877784] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 4910.877785] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 4910.877786] GPU crash dump saved to /sys/class/drm/card0/error
[ 4910.878798] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 4910.879554] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
...
[ 4924.831512] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 4924.833273] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4924.834022] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4926.579189] Iteration 1/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4926.878687] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 4926.879436] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
...
[ 4974.877657] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
[ 4974.879438] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4974.880194] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4974.880895] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 4986.598036] Iteration 2/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
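
To quantify how long recovery takes in a log like the one above, the hang and reset messages can be extracted and timed. The following is a minimal sketch that relies only on the message texts visible in this log; the function names and event labels are made up for the example, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches "[ 4910.877780] message" style kernel log lines.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# (label, substring) pairs, checked in order; labels are illustrative.
PATTERNS = [
    ("hang", "GPU HANG:"),
    ("reset_timeout", "reset request timed out"),
    ("recovery_timeout", "GPU recovery timed out"),
    ("chip_reset", "Resetting chip"),
    ("engine_reset", "Resetting "),  # any other "Resetting <engine>" line
]


def classify(message):
    """Return the event label for a log message, or None if unrelated."""
    for kind, needle in PATTERNS:
        if needle in message:
            return kind
    return None


def parse_events(lines):
    """Return a list of (timestamp_seconds, label) for hang-related lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        kind = classify(match.group(2))
        if kind is not None:
            events.append((float(match.group(1)), kind))
    return events


def summarize(lines):
    """Count events and measure seconds from the first hang to the last event."""
    events = parse_events(lines)
    counts = Counter(kind for _, kind in events)
    hang_times = [t for t, kind in events if kind == "hang"]
    span = events[-1][0] - hang_times[0] if hang_times else None
    return counts, span
```

Feeding it the output of `dmesg` (one line per list element) gives a per-event count plus the time between the first "GPU HANG" line and the last reset-related line, which is a rough measure of how long the driver spent trying to recover.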
Comment 8 Chris Wilson 2019-09-20 19:22:38 UTC
*** Bug 111760 has been marked as a duplicate of this bug. ***
Comment 9 Eero Tamminen 2019-09-26 09:07:01 UTC
Got this again with SKL GT2.  Raising to major, as the kernel driver doesn't recover to a working state. Nowadays, typically all 3D tests after the Triangle test fail (and take much longer than they normally would).
Comment 10 Lakshmi 2019-10-03 07:49:43 UTC
Considering the reproduction rate, setting the priority and severity of this issue.
Comment 11 Eero Tamminen 2019-10-03 13:04:03 UTC
(In reply to Lakshmi from comment #10)
> Considering the reproduction rate, setting the priority and severity of this
> issue.

I tried to trigger this on SKL GT2 just by running the Triangle test-case after boot, but could not reproduce it.  Apparently it requires running some other 3D benchmarks first (we run Unigine Heaven & Valley, all GfxBench and GpuTest tests).
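
Since the hang only shows up after other benchmarks have run, and error data was lost in at least one case when testing timed out, one option is to save the i915 error state right after every benchmark in the sequence. The following is a minimal sketch of that idea. The error-state path is the one printed in the kernel log in comment 7; the "No error state collected" marker is an assumption about what the file contains when no hang has been recorded, and the function names and output file naming are hypothetical.

```python
import subprocess
from pathlib import Path

ERROR_STATE = Path("/sys/class/drm/card0/error")  # path from the kernel log
NO_ERROR_MARKER = "No error state collected"      # assumed "empty" content


def read_error_state(path=ERROR_STATE):
    """Return the error state as bytes, or None if nothing was recorded."""
    try:
        data = Path(path).read_bytes()
    except OSError:
        return None
    head = data[:4096].decode("utf-8", errors="replace")
    if not data.strip() or NO_ERROR_MARKER in head:
        return None
    return data


def run_and_capture(commands, out_dir, error_path=ERROR_STATE, run=subprocess.run):
    """Run each command in order and save any new error state after it.

    `commands` is a list of argv lists. `run` is injectable so the loop can
    be exercised without a GPU. Returns the list of files that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    last_saved = None
    for index, argv in enumerate(commands):
        run(argv, check=False)  # a failing benchmark should not stop the sequence
        data = read_error_state(error_path)
        if data is not None and data != last_saved:
            dest = out_dir / f"error-after-run-{index}.txt"
            dest.write_bytes(data)
            saved.append(dest)
            last_saved = data
    return saved
```

The command list would hold the benchmark invocations in the order they normally run, ending with the Triangle command lines from the description. The sketch does not clear the error state between runs; it only avoids saving the same content twice.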
Comment 12 Martin Peres 2019-11-29 19:24:05 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new issue via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/376.