111936 – [GEN9+] Non-recoverable GPU hangs in SynMark2 OglTerrainFly* with Iris

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high critical
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged
Keywords:

Depends on:
Blocks:

Reported:	2019-10-09 10:59 UTC by Eero Tamminen
Modified:	2019-11-29 19:38 UTC
CC List:	2 users

See Also:	111385 111453
i915 platform:	ICL, KBL, SKL
i915 features:	GPU hang

Attachments
TerrainFlyTess ICL error state (2019-10-07 drm-tip) (27.71 KB, text/plain) 2019-10-09 10:59 UTC, Eero Tamminen
TerrainFlyTess ICL error state (2019-10-09 drm-tip) (15.71 KB, text/plain) 2019-10-09 14:29 UTC, Eero Tamminen
TerrainFlyTess SKL GT4e error state (2019-10-20 drm-tip) (39.12 KB, text/plain) 2019-10-21 09:42 UTC, Eero Tamminen
TerrainFlyInst SKL GT4e error state (drm-tip 2019-11-05) (5.18 KB, text/plain) 2019-11-06 09:42 UTC, Eero Tamminen

Description Eero Tamminen 2019-10-09 10:59:00 UTC

Created attachment 145683
TerrainFlyTess ICL error state (2019-10-07 drm-tip)

Setup:
* HW: ICL-U D1
* OS: Ubuntu 18.04 with Unity desktop (compiz)
* SW: git versions of drm-tip 5.4-rc2 kernel, X server & Mesa
* Desktop uses i965, benchmarks use Iris

Use-case:
* Run SynMark TerrainFlyTess with Iris:
  MESA_LOADER_DRIVER_OVERRIDE=iris ./synmark2 OglTerrainFlyTess

Expected outcome:
* Like on GEN9, no GPU hangs

Actual outcome:
* Recoverable GPU hangs, see attachment
* Reproducibility: always

Notes:
* This test case exercises CPU<->GPU synchronization: terrain data is generated on the fly in 4 CPU threads with AVX, and the GPU then tessellates & renders it
* No idea whether these hangs are a regression. They seem to have happened already more than 2 weeks ago on drm-tip v5.3, when I first did ICL testing


Kenneth from Mesa team already looked at the error state and commented that:

"This error state makes no sense, ACTHD points at the very start of the batch and IPEHR is 0x18800101 which never appears in the error dump at all. Sounds like a kernel bug to me."

Comment 1 Chris Wilson 2019-10-09 11:07:03 UTC

The capture was taken as the GPU was fetching the first bytes of the batch. Either it took a page fault, or we have a novel means of dying. Note that the GPU did not send the completion event for the context switch in the previous 6s, so I'm erring on the side of novel death throes.

Comment 2 Chris Wilson 2019-10-09 11:45:54 UTC

Since it is reproducible, one useful thing would be to enable CONFIG_DRM_I915_DEBUG_GEM and attach the dmesg captured with drm.debug=0x2.

Comment 3 Eero Tamminen 2019-10-09 12:00:46 UTC

Machine was on loan from Jani and I need to give it back now, so unfortunately I cannot provide that.  Test-case & fault should be very easy to reproduce though (pretty much same as the HDRBloom case).

Chris, mail me directly if you don't have SynMark, or would like to get a pre-built 3D user-space stack from latest git.

Comment 4 Chris Wilson 2019-10-09 12:12:09 UTC

It's Icelake that continues to be a myth. I shall pester Francesco if we can at least get icl-gem.

Comment 5 Eero Tamminen 2019-10-09 14:29:51 UTC

Created attachment 145684
TerrainFlyTess ICL error state (2019-10-09 drm-tip)

Jani extended the ICL loan.


Unfortunately neither drm.debug nor CONFIG_LOCKDEP=y shows anything:
-----------------------------------------------------------------------
[  151.523178] [drm:intel_combo_phy_init [i915]] Combo PHY A already enabled, won't reprogram it.
[  151.523211] [drm:intel_combo_phy_init [i915]] Combo PHY B already enabled, won't reprogram it.
[  153.874368] Iteration 1/3: synmark2 OglTerrainFlyTess
[  180.033099] Iteration 2/3: synmark2 OglTerrainFlyTess
[  187.993335] i915 0000:00:02.0: GPU HANG: ecode 11:1:0x00000000, hang on rcs0
[  187.993338] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  187.993339] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  187.993340] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  187.993341] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[  187.993342] GPU crash dump saved to /sys/class/drm/card0/error
[  187.993406] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  208.151967] Iteration 3/3: synmark2 OglTerrainFlyTess
-----------------------------------------------------------------------

No idea whether the worse reproducibility (1/3 instead of 1/1), is due to using latest git kernel, or debug options.  Attached is new error state.
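
When comparing reproducibility across kernels or debug options, it can help to pull the hang banners out of a saved dmesg log automatically. The following is a minimal Python sketch, assuming the banner format shown in the excerpt above ("GPU HANG: ecode ..., <reason>"). The names `HANG_RE` and `find_gpu_hangs` are made up for illustration and are not part of any i915 or Mesa tooling.

```python
import re

# Assumption: the hang banner looks like the lines quoted in this report, e.g.
# "[  187.993335] i915 0000:00:02.0: GPU HANG: ecode 11:1:0x00000000, hang on rcs0"
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915 (?P<dev>\S+): GPU HANG: "
    r"ecode (?P<ecode>\S+), (?P<reason>.+)$"
)


def find_gpu_hangs(dmesg_text):
    """Return one dict per 'GPU HANG' banner found in the given dmesg text."""
    hangs = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            hangs.append({
                "timestamp": float(match.group("ts")),
                "device": match.group("dev"),
                "ecode": match.group("ecode"),
                "reason": match.group("reason").strip(),
            })
    return hangs
```

Feeding it the log above should yield a single entry for the hang in iteration 2/3; the number of entries divided by the number of "Iteration" lines gives a rough per-run hang rate.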

Comment 6 Eero Tamminen 2019-10-11 10:39:02 UTC

(In reply to Eero Tamminen from comment #5)
> No idea whether the worse reproducibility (1/3 instead of 1/1), is due to
> using latest git kernel, or debug options.

With the latest drm-tip & Mesa versions, this no longer hangs on every run, maybe just once in every 5 runs.

Since the CSDof and HDRBloom GPU hangs with Iris continue to happen on every run of the test, and HDRBloom can easily be reproduced also on other platforms, not just ICL, I would concentrate on that (bug 111385).

Comment 7 Eero Tamminen 2019-10-21 09:41:45 UTC

With the latest drm-tip git kernel (from last evening), this happened also on SKL & KBL, so I assume it's a GEN9+ Core issue.

Comment 8 Eero Tamminen 2019-10-21 09:42:21 UTC

Created attachment 145785
TerrainFlyTess SKL GT4e error state (2019-10-20 drm-tip)

Comment 9 Eero Tamminen 2019-10-22 08:15:35 UTC

(Recoverable) GPU hangs have started to happen now also in the TerrainFlyInst tests (on SKL) and in TerrainPanTess (on SKL GT4e).

Comment 10 Eero Tamminen 2019-10-23 08:17:44 UTC

Last night the GPU hang in the OglTerrainFlyInst test no longer recovered; instead the whole system hung on SKL GT4e.

Comment 11 Eero Tamminen 2019-11-05 08:43:55 UTC

After this test fails, the screen shows the last working frame from the test. It is still possible to start 3D & Media test cases through ssh, but they all fail.

*However*, at some point after that, the machine will freeze; the network goes down and the machine no longer reacts to any input other than SysRq keys.

So far, these bad recovery failures have happened only on SkullCanyon, not on KBL (I don't test ICL anymore).

Comment 12 Eero Tamminen 2019-11-06 09:42:11 UTC

Created attachment 145899
TerrainFlyInst SKL GT4e error state (drm-tip 2019-11-05)

What happens with Iris is a bit odd:
* First SynMark2 Multithread fails, but there's no GPU hang
* A bit later TerrainFlyInst doesn't fail, but triggers the attached GPU hang [1]
* After a few successful runs of other tests, screen updates stop and all further GPU tests fail, but there's no indication of any problem in dmesg

[1] dmesg:
[ 4859.448337] Iteration 2/3: synmark2 OglTerrainFlyInst
[ 4876.890967] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4876.891740] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4883.930958] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4883.931728] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4886.938950] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4886.939713] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4889.883259] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, stopped heartbeat on rcs0
[ 4889.883262] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 4889.883264] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 4889.883265] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 4889.883266] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 4889.883267] GPU crash dump saved to /sys/class/drm/card0/error
[ 4889.984912] i915 0000:00:02.0: Resetting rcs0 for stopped heartbeat on rcs0
[ 4889.985681] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4889.986705] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4889.987465] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4889.987735] i915 0000:00:02.0: Resetting chip for stopped heartbeat on rcs0
[ 4890.088868] [drm] GuC communication stopped
[ 4890.089600] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4890.090325] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 4890.091936] [drm] GuC communication enabled
[ 4890.091978] i915 0000:00:02.0: GuC firmware i915/skl_guc_33.0.0.bin version 33.0 submission:disabled
[ 4890.091980] i915 0000:00:02.0: HuC firmware i915/skl_huc_2.0.0.bin version 2.0 authenticated:yes
[ 4890.323457] Iteration 3/3: synmark2 OglTerrainFlyInst
[ 4891.401947] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4891.514932] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4916.787572] Iteration 1/3: synmark2 OglTerrainPanInst
[ 4943.096357] Iteration 2/3: synmark2 OglTerrainPanInst
...
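
The log above mixes three kinds of messages: per-engine reset attempts, engine reset requests that timed out, and an escalation to a full chip reset. As a rough aid for telling a recovered hang from a failed recovery, the sketch below tallies them from a saved dmesg. It assumes the message wording seen in this excerpt; `summarize_resets` and the regular expressions are hypothetical helpers, not existing tooling, and a nonzero timeout count together with a chip reset is only one possible hint that engine-level recovery did not work.

```python
import re
from collections import Counter

# Assumption: wording matches the dmesg lines quoted in this report.
RESET_RE = re.compile(r"Resetting (?P<target>\w+) for (?P<reason>.+)$")
RESET_TIMEOUT_RE = re.compile(
    r"\*ERROR\* (?P<engine>\w+) reset request timed out"
)


def summarize_resets(dmesg_text):
    """Count engine resets (by engine and reason), reset timeouts and chip resets."""
    summary = {
        "engine_resets": Counter(),  # keyed by (engine, reason)
        "reset_timeouts": 0,
        "chip_resets": 0,
    }
    for line in dmesg_text.splitlines():
        reset = RESET_RE.search(line)
        if reset:
            if reset.group("target") == "chip":
                summary["chip_resets"] += 1
            else:
                key = (reset.group("target"), reset.group("reason").strip())
                summary["engine_resets"][key] += 1
            continue
        if RESET_TIMEOUT_RE.search(line):
            summary["reset_timeouts"] += 1
    return summary
```

Run over the excerpt above, this would separate the "preemption time out" resets from the "stopped heartbeat" ones and show that the reset requests kept timing out before the driver fell back to resetting the whole chip.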

Comment 13 Chris Wilson 2019-11-09 12:38:23 UTC

Device: Mesa DRI Intel(R) Iris(R) Plus Graphics (Ice Lake 8x8 GT2)  (0x8a52)
OpenGL renderer string: Mesa DRI Intel(R) Iris(R) Plus Graphics (Ice Lake 8x8 GT2) 
$ ./synmark2 OglTerrainFlyTess
is proving unproblematic.

I don't have a skl gt4e; but I do have a kbl gt4e. Close enough?

Comment 14 Eero Tamminen 2019-11-11 09:27:24 UTC

(In reply to Chris Wilson from comment #13)
> Device: Mesa DRI Intel(R) Iris(R) Plus Graphics (Ice Lake 8x8 GT2)  (0x8a52)
> OpenGL renderer string: Mesa DRI Intel(R) Iris(R) Plus Graphics (Ice Lake
> 8x8 GT2) 
> $ ./synmark2 OglTerrainFlyTess
> is proving unproblematic.

It's not 100% reproducible; you need to run it several times, e.g. 10-20 times.

(Test is run 3x every night for latest 3D stack.  I don't see it every night, but it does happen several times a week.)


> I don't have a skl gt4e; but I do have a kbl gt4e. Close enough?

Yes, the issue does happen more often with higher GT versions (faster -> triggers some race-condition more easily?).

Comment 15 Eero Tamminen 2019-11-11 09:29:35 UTC

And on GEN9, you still need MESA_LOADER_DRIVER_OVERRIDE=iris to use Iris instead of i965.
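
The two hints above (run the test 10-20 times, and force Iris with MESA_LOADER_DRIVER_OVERRIDE=iris) can be combined into a small repro loop. This is only a sketch under stated assumptions: `run_repeatedly` and `count_hang_lines` are illustrative names, hang detection simply counts "GPU HANG:" lines in the output of `dmesg` before and after each run (which may need root and can miscount if the kernel ring buffer wraps), and the synmark2 path is whatever it is on the test machine.

```python
import os
import subprocess


def count_hang_lines():
    """Count 'GPU HANG:' banners currently visible in dmesg (may require root)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout.count("GPU HANG:")


def run_repeatedly(cmd, runs, env_overrides=None, count_hangs=count_hang_lines):
    """Run cmd `runs` times and return the 1-based indices of runs that hung.

    A run is flagged when the hang count reported by count_hangs() grows
    while the command is running.
    """
    env = dict(os.environ)
    env.update(env_overrides or {})
    hung_runs = []
    for index in range(1, runs + 1):
        before = count_hangs()
        # The benchmark's own exit status is ignored; only new hangs matter here.
        subprocess.run(cmd, env=env, check=False)
        if count_hangs() > before:
            hung_runs.append(index)
    return hung_runs
```

A call along the lines of run_repeatedly(["./synmark2", "OglTerrainFlyTess"], runs=20, env_overrides={"MESA_LOADER_DRIVER_OVERRIDE": "iris"}) would then report which of the 20 iterations triggered a hang.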

Comment 16 Eero Tamminen 2019-11-20 11:31:28 UTC

In the last few days, I've seen this only on SkullCanyon (SKL GT4e), not on KBL GT3e.

Note: SkullCanyon nowadays runs Weston, KBL GT3e is still running X11/Unity.

(I no longer have the ICL; it was on loan from Jani.  Let's see when I can borrow it again.)

Comment 17 Eero Tamminen 2019-11-20 12:12:49 UTC

(In reply to Eero Tamminen from comment #16)
> In the last few days, I've seen this only on SkullCanyon (SKL GT4e), not on
> KBL GT3e.

I mean, the GPU hangs still happen on all machines, but the recovery fails only on SkullCanyon, not KBL GT3e (or SKL GT2).

Comment 18 Martin Peres 2019-11-29 19:38:29 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/487.
