Created attachment 145093 [details]
SKL GT4e hang, with drm-tip git from a few days ago and fullscreen Triangle benchmark

Setup:
* HW: SKL GT2, SKL GT4e
* Display: FullHD
* kernel: drm-tip git v5.2 and git from a few days ago
* OS: Ubuntu 18.04
* Desktop: lightdm/Unity/compiz
* Test: GpuTest v0.7 (latest)

Use-case:
* Trivial windowed/composited use-case (SKL GT2 hang, with drm-tip v5.2):
  GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
* Trivial fullscreen use-case (SKL GT4e hang, with drm-tip git from a few days ago):
  GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000

Expected output:
* No hangs

Actual output:
* (Recoverable) GPU hang.

Notes:
* The hang is rare
* The test-case is a trivial 3D one, and therefore runs at very high FPS. It does a (fast) clear and draws a triangle half the size of the window (the benchmark framework also draws a small progress bar and blits a penguin image to indicate Linux, but those are so small that they don't impact perf)
* Given the simplicity of the benchmark and its resulting high FPS, this looks like a race condition, I think in the kernel
Created attachment 145094 [details] SKL GT2 hang, with drm-tip v5.2 and windowed/composited Triangle benchmark
The current sampling is ACTHD (batch/ring head address, updates with batch buffer) + RING_START (to spot context switches) + RING_HEAD (includes wrap counter so effectively ~24b). Only if they all match plus some fuzzing of INSTDONE do we worry about the GPU being hung. We could now throw in an engine->serial [submission serial] and only declare a hang if we stop submitting (which would be CS freeze on execlists, and ring full on ringbuffer). Or we can go with the heartbeat scheme. If the drm-tip result is typical, the engine reset should be harmless.
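To make the comparison described above concrete, here is a minimal sketch in Python (not the i915 code). It assumes the "fuzzing" of INSTDONE can be simplified to "no bits changed", and the `serial` field and the `use_serial` switch are hypothetical names standing in for the proposed engine->serial check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hangcheck sample of an engine (simplified model)."""
    acthd: int        # batch/ring head address
    ring_start: int   # changes on context switch
    ring_head: int    # includes the wrap counter
    instdone: int     # execution-unit "done" bits
    serial: int = 0   # hypothetical submission serial (engine->serial)


def registers_stalled(prev: Sample, cur: Sample) -> bool:
    """True if no sampled register shows progress between two samples."""
    same_position = (
        (prev.acthd, prev.ring_start, prev.ring_head)
        == (cur.acthd, cur.ring_start, cur.ring_head)
    )
    # Assumption: any change in INSTDONE counts as progress.
    return same_position and prev.instdone == cur.instdone


def looks_hung(prev: Sample, cur: Sample, use_serial: bool = False) -> bool:
    """Current scheme, optionally gated on the submission serial."""
    if not registers_stalled(prev, cur):
        return False
    if use_serial and cur.serial != prev.serial:
        # Submissions are still flowing, so do not declare a hang.
        return False
    return True


# Two samples that happen to read identical registers while the
# workload is still submitting new requests.
a = Sample(0x1000, 0x2000, 0x30, 0xFFFF, serial=10)
b = Sample(0x1000, 0x2000, 0x30, 0xFFFF, serial=25)
print(looks_hung(a, b), looks_hung(a, b, use_serial=True))
```

In this model, a workload that keeps submitting but happens to be sampled at the same addresses is flagged by the register-only check and not by the serial-gated one. Whether that is what happens with the Triangle benchmark is not established here.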
> If the drm-tip result is typical, the engine reset should be harmless.

Last night SKL GT4e didn't recover from the Triangle hang, but got stuck.
Yesterday this happened with KBL GT3e when using the older drm-tip v5.2 kernel -> added KBL to platforms.

Btw. maybe these are related to the SynMark TerrainFlyInst hangs I've commented on in bug 111453, as those also happen with drm-tip v5.2?
This is rare, but now it happened also on BXT, i.e. it happens on all GEN9 devices I have (don't have any newer).
(In reply to Eero Tamminen from comment #5)
> This is rare, but now it happened also on BXT, i.e. it happens on all GEN9
> devices I have (don't have any newer).

This time it didn't recover properly either. Many other tests failed after it, and I didn't get i915 error data because testing timed out due to the large slowdown.
(In reply to Eero Tamminen from comment #3)
> > If the drm-tip result is typical, the engine reset should be harmless.
>
> Last night SKL GT4e didn't recover from the Triangle hang, but got stuck.

E.g. last night KBL GT3e GPU hangs caused 20s of extra run-time for the 40s GpuTest Triangle runs, although following tests worked fine. In summary, it doesn't seem harmless.

[ 4866.563824] Iteration 3/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4902.983138] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun
[ 4910.877780] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[ 4910.877782] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 4910.877783] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 4910.877784] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 4910.877785] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 4910.877786] GPU crash dump saved to /sys/class/drm/card0/error
[ 4910.878798] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 4910.879554] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
...
[ 4924.831512] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 4924.833273] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4924.834022] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4926.579189] Iteration 1/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4926.878687] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 4926.879436] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
...
[ 4974.877657] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
[ 4974.879438] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4974.880194] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4974.880895] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 4986.598036] Iteration 2/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
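The "20s of extra run-time" figure can be cross-checked from the kernel timestamps in the log above. The following is a rough sketch, assuming each "Iteration" marker is logged at the start of a run and that the gap between consecutive markers approximates one run's wall-clock time. The function names are made up for illustration, and only a few lines of the log are reproduced in `SAMPLE`.

```python
import re

# Kernel log prefix: "[ 4866.563824] message"
LINE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")

SAMPLE = """\
[ 4866.563824] Iteration 3/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4910.877780] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[ 4926.579189] Iteration 1/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4986.598036] Iteration 2/3: /opt/benchmarks/GpuTest07/GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
"""


def iteration_starts(log: str) -> list[float]:
    """Return the timestamps of all 'Iteration N/M' marker lines."""
    starts = []
    for line in log.splitlines():
        m = LINE.match(line.strip())
        if m and m.group(2).startswith("Iteration "):
            starts.append(float(m.group(1)))
    return starts


def gaps(starts: list[float]) -> list[float]:
    """Seconds between consecutive iteration markers, rounded to 0.1s."""
    return [round(b - a, 1) for a, b in zip(starts, starts[1:])]


print(gaps(iteration_starts(SAMPLE)))  # [60.0, 60.0]
```

Both gaps come out at about 60s, which is consistent with roughly 20s of overhead on top of the 40s that a Triangle run is said to take normally.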
*** Bug 111760 has been marked as a duplicate of this bug. ***
Got this again with SKL GT2. Raising to major as kernel driver doesn't recover working state. Nowadays typically all 3D tests after Triangle test fail (and take much longer than they normally would).
Considering the reproduction rate, setting the priority and severity of this issue.
(In reply to Lakshmi from comment #10)
> Considering the reproduction rate, setting the priority and severity of this
> issue.

I tried to trigger this on SKL GT2 just by running the Triangle test-case after boot, but could not reproduce it. Apparently it requires running some other 3D benchmarks first (we run Unigine Heaven & Valley, all GfxBench and GpuTest tests).
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/376.