Bug 111424 - Random recoverable GPU hangs in trivial GpuTest Triangle test
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords: regression
Depends on:
Blocks:
 
Reported: 2019-08-19 10:11 UTC by Eero Tamminen
Modified: 2019-09-16 15:33 UTC

See Also:
i915 platform: BXT, KBL, SKL
i915 features: GPU hang


Attachments
SKL GT4e hang, with drm-tip git from few days ago and fullscreen Triangle benchmark (4.74 KB, text/plain)
2019-08-19 10:11 UTC, Eero Tamminen
SKL GT2 hang, with drm-tip v5.2 and windowed/composited Triangle benchmark (16.04 KB, text/plain)
2019-08-19 10:13 UTC, Eero Tamminen

Description Eero Tamminen 2019-08-19 10:11:44 UTC
Created attachment 145093 [details]
SKL GT4e hang, with drm-tip git from few days ago and fullscreen Triangle benchmark

Setup:
* HW: SKL GT2, SKL GT4e
* Display: FullHD
* kernel: drm-tip git v5.2 and drm-tip git from a few days ago
* OS: Ubuntu 18.04
* Desktop: lightdm/Unity/compiz
* Test: GpuTest v0.7 (latest)

Use-case:
* Trivial windowed/composited use-case (SKL GT2 hang, with drm-tip v5.2):
  GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
* Trivial fullscreen use-case (SKL GT4e hang, with drm-tip git from few days ago):
  GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000

Expected output:
* No hangs

Actual output:
* (Recoverable) GPU hang.

Notes:
* Hang is rare
* Test-case is a trivial 3D one, and therefore runs at very high FPS.  It does a (fast) clear and draws a triangle half the size of the window (the benchmark's framework also draws a small progress bar and blits a penguin image to indicate Linux, but those are so small that they don't impact perf)
* Due to the simplicity of the benchmark and its resulting high FPS, this looks like a race condition, in the kernel I think
Comment 1 Eero Tamminen 2019-08-19 10:13:33 UTC
Created attachment 145094 [details]
SKL GT2 hang, with drm-tip v5.2 and windowed/composited Triangle benchmark
Comment 2 Chris Wilson 2019-08-19 10:33:22 UTC
The current sampling is ACTHD (batch/ring head address, updates with batch buffer) + RING_START (to spot context switches) + RING_HEAD (includes wrap counter so effectively ~24b). Only if they all match plus some fuzzing of INSTDONE do we worry about the GPU being hung.

We could now throw in an engine->serial [submission serial] and only declare a hang if we stop submitting (which would be CS freeze on execlists, and ring full on ringbuffer). Or we can go with the heartbeat scheme.

If the drm-tip result is typical, the engine reset should be harmless.
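
A minimal sketch of the sampling comparison described above, written as plain Python rather than the actual i915 hangcheck code. The names `EngineSample` and `looks_hung` are hypothetical, the INSTDONE "fuzzing" is simplified to an exact comparison, and the `serial` field only models one reading of the proposed engine->serial check (declare a hang only if submission has also stopped):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineSample:
    """One periodic snapshot of the registers named in the comment."""
    acthd: int        # batch/ring head address
    ring_start: int   # changes on context switch
    ring_head: int    # includes the wrap counter
    instdone: int     # simplified: the real check applies some fuzzing
    serial: Optional[int] = None  # hypothetical submission serial


def looks_hung(prev: EngineSample, cur: EngineSample,
               use_serial: bool = False) -> bool:
    """Return True if two consecutive samples show no forward progress."""
    stalled = (
        prev.acthd == cur.acthd
        and prev.ring_start == cur.ring_start
        and prev.ring_head == cur.ring_head
        and prev.instdone == cur.instdone
    )
    if not stalled:
        return False
    if use_serial:
        # Assumed variant: only worry if nothing new was submitted either.
        return prev.serial == cur.serial
    return True
```

With `use_serial=True`, two identical register snapshots are not treated as a hang as long as the serial advanced between them, which is the false-positive case a very high FPS workload might plausibly trigger.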
Comment 3 Eero Tamminen 2019-08-23 07:23:08 UTC
> If the drm-tip result is typical, the engine reset should be harmless.

Last night SKL GT4e didn't recover from the Triangle hang, but got stuck.
Comment 4 Eero Tamminen 2019-08-28 08:25:52 UTC
Yesterday this happened with KBL GT3e when using older drm-tip v5.2 kernel -> added KBL to platforms.

Btw. maybe these are related to the SynMark TerrainFlyInst hangs I've commented on in bug 111453, as those also happen with drm-tip v5.2?
Comment 5 Eero Tamminen 2019-09-06 07:45:08 UTC
This is rare, but now it happened also on BXT, i.e. it happens on all GEN9 devices I have (don't have any newer).
Comment 6 Eero Tamminen 2019-09-06 08:03:00 UTC
(In reply to Eero Tamminen from comment #5)
> This is rare, but now it happened also on BXT, i.e. it happens on all GEN9
> devices I have (don't have any newer).

This time it didn't recover properly either.  Many other tests failed after it and I didn't get i915 error data because testing timed out due to a large slowdown.
Comment 7 Eero Tamminen 2019-09-16 15:33:03 UTC
(In reply to Eero Tamminen from comment #3)
> > If the drm-tip result is typical, the engine reset should be harmless.
> 
> Last night SKL GT4e didn't recover from the Triangle hang, but got stuck.

E.g. last night KBL GT3e GPU hangs caused 20s of extra run-time for the 40s GpuTest Triangle runs, although the tests that followed worked fine.  In summary, it doesn't seem harmless.

[ 4866.563824] Iteration 3/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4902.983138] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
[ 4910.877780] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[ 4910.877782] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 4910.877783] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 4910.877784] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 4910.877785] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 4910.877786] GPU crash dump saved to /sys/class/drm/card0/error
[ 4910.878798] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 4910.879554] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
...
[ 4924.831512] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 4924.833273] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4924.834022] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4926.579189] Iteration 1/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4926.878687] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 4926.879436] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
...
[ 4974.877657] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
[ 4974.879438] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4974.880194] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4974.880895] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 4986.598036] Iteration 2/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
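
The extra run-time can be read off a log like the one above by comparing the timestamps of consecutive "Iteration" marker lines. The following is a rough sketch, assuming dmesg-style `[ seconds ]` prefixes and the "Iteration N/M:" lines that the test harness writes into the kernel log; the function names are hypothetical and the sample lines are abbreviated from the log above:

```python
import re

# Matches "[ 4866.563824] message" and captures timestamp and message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse(lines):
    """Yield (timestamp_seconds, message) for lines with a dmesg prefix."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            yield float(m.group(1)), m.group(2)


def iteration_gaps(lines):
    """Seconds between consecutive 'Iteration' marker lines."""
    starts = [t for t, msg in parse(lines) if msg.startswith("Iteration ")]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


def count_reset_timeouts(lines):
    """Number of 'reset request timed out' errors in the log."""
    return sum(1 for _, msg in parse(lines)
               if "reset request timed out" in msg)


if __name__ == "__main__":
    # Abbreviated excerpt; the command lines are shortened for readability.
    sample = [
        "[ 4866.563824] Iteration 3/3: GpuTest /test=triangle (windowed)",
        "[ 4910.877780] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0",
        "[ 4910.879554] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}",
        "[ 4926.579189] Iteration 1/3: GpuTest /test=triangle (fullscreen)",
        "[ 4986.598036] Iteration 2/3: GpuTest /test=triangle (fullscreen)",
    ]
    for gap in iteration_gaps(sample):
        print(f"iteration took {gap:.1f}s")
    print("reset timeouts:", count_reset_timeouts(sample))
```

The gaps measure start-to-start time, so they include any harness overhead between runs in addition to the benchmark itself.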

