Bug 111936 - [ICL] (Recoverable) GPU hangs in TerrainFlyTess with Iris
Summary: [ICL] (Recoverable) GPU hangs in TerrainFlyTess with Iris
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-09 10:59 UTC by Eero Tamminen
Modified: 2019-10-11 10:39 UTC

See Also:
i915 platform: ICL
i915 features: GPU hang


Attachments
TerrainFlyTess ICL error state (2019-10-07 drm-tip) (27.71 KB, text/plain)
2019-10-09 10:59 UTC, Eero Tamminen
TerrainFlyTess ICL error state (2019-10-09 drm-tip) (15.71 KB, text/plain)
2019-10-09 14:29 UTC, Eero Tamminen

Description Eero Tamminen 2019-10-09 10:59:00 UTC
Created attachment 145683
TerrainFlyTess ICL error state (2019-10-07 drm-tip)

Setup:
* HW: ICL-U D1
* OS: Ubuntu 18.04 with Unity desktop (compiz)
* SW: git versions of drm-tip 5.4-rc2 kernel, X server & Mesa
* Desktop uses i965, benchmarks use Iris

Use-case:
* Run SynMark TerrainFlyTess with Iris:
  MESA_LOADER_DRIVER_OVERRIDE=iris ./synmark2 OglTerrainFlyTess

Expected outcome:
* Like on GEN9, no GPU hangs

Actual outcome:
* Recoverable GPU hangs, see attachment
* Reproducibility: always

Notes:
* This test-case tests CPU<->GPU synchronization by generating the terrain data on the fly in 4 CPU threads with AVX, for GPU tessellation & rendering
* No idea whether these hangs are a regression. They seem to have happened already >2 weeks ago on drm-tip v5.3, when I first did ICL testing


Kenneth from the Mesa team has already looked at the error state and commented:

"This error state makes no sense, ACTHD points at the very start of the batch and IPEHR is 0x18800101 which never appears in the error dump at all. Sounds like a kernel bug to me."
Comment 1 Chris Wilson 2019-10-09 11:07:03 UTC
The capture was taken as the GPU was fetching the first bytes of the batch. Either it took a page fault, or we have a novel means of dying. Note that the GPU did not send the completion event for the context switch in the previous 6s, so I'm erring on the side of novel death throes.
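
For reference, the IPEHR value 0x18800101 quoted in the description can be split into its command fields. This is only a minimal sketch: it assumes the usual Intel MI command header layout (command type in bits 31:29, opcode in bits 28:23, the same encoding as the i915 `MI_INSTR(opcode, flags)` macro), and the function name is made up for illustration.

```python
def decode_mi_header(dword):
    """Split a 32-bit command header into (command type, opcode, low bits).

    Assumed layout (MI commands): bits 31:29 = command type (0 means MI),
    bits 28:23 = opcode, bits 22:0 = flags and length.
    """
    cmd_type = (dword >> 29) & 0x7
    opcode = (dword >> 23) & 0x3F
    low_bits = dword & 0x7FFFFF
    return cmd_type, opcode, low_bits


# IPEHR value taken from the error state quoted in the description.
cmd_type, opcode, low_bits = decode_mi_header(0x18800101)
print(f"type={cmd_type} opcode={opcode:#x} low_bits={low_bits:#x}")
```

Under that assumption the header is an MI command with opcode 0x31, which the i915 headers define as MI_BATCH_BUFFER_START. That would at least be consistent with ACTHD pointing at the very start of the batch, but it is a hint rather than a diagnosis.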
Comment 2 Chris Wilson 2019-10-09 11:45:54 UTC
Since it is reproducible, one useful thing would be to enable CONFIG_DRM_I915_DEBUG_GEM and attach the dmesg captured with drm.debug=0x2.
Comment 3 Eero Tamminen 2019-10-09 12:00:46 UTC
The machine was on loan from Jani and I need to give it back now, so unfortunately I cannot provide that. The test-case & fault should be very easy to reproduce though (pretty much the same as the HDRBloom case).

Chris, mail me directly if you don't have SynMark, or would like to get a pre-built 3D user-space stack from latest git.
Comment 4 Chris Wilson 2019-10-09 12:12:09 UTC
It's Icelake that continues to be a myth. I shall pester Francesco if we can at least get icl-gem.
Comment 5 Eero Tamminen 2019-10-09 14:29:51 UTC
Created attachment 145684
TerrainFlyTess ICL error state (2019-10-09 drm-tip)

Jani extended the ICL loan.


Unfortunately neither drm.debug nor CONFIG_LOCKDEP=y shows anything:
-----------------------------------------------------------------------
[  151.523178] [drm:intel_combo_phy_init [i915]] Combo PHY A already enabled, won't reprogram it.
[  151.523211] [drm:intel_combo_phy_init [i915]] Combo PHY B already enabled, won't reprogram it.
[  153.874368] Iteration 1/3: synmark2 OglTerrainFlyTess
[  180.033099] Iteration 2/3: synmark2 OglTerrainFlyTess
[  187.993335] i915 0000:00:02.0: GPU HANG: ecode 11:1:0x00000000, hang on rcs0
[  187.993338] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  187.993339] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  187.993340] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  187.993341] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[  187.993342] GPU crash dump saved to /sys/class/drm/card0/error
[  187.993406] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  208.151967] Iteration 3/3: synmark2 OglTerrainFlyTess
-----------------------------------------------------------------------

No idea whether the worse reproducibility (1/3 instead of 1/1), is due to using latest git kernel, or debug options.  The new error state is attached.
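
To track how reproducibility changes between kernel versions, the hang rate can be counted from a dmesg capture like the one above. The following is a rough sketch, not a finished tool: it assumes the test harness writes `Iteration N/M: ...` markers into the kernel log (as in the excerpt) and that each hang shows up as a `GPU HANG: ecode ...` line; the function and variable names are hypothetical.

```python
import re

# Marker lines written by the test harness, e.g. "Iteration 2/3: synmark2 ...".
ITERATION_RE = re.compile(r"Iteration (\d+)/(\d+): (.+)$")
# i915 hang report, e.g. "GPU HANG: ecode 11:1:0x00000000, hang on rcs0".
HANG_RE = re.compile(r"GPU HANG: ecode (\S+), hang on (\w+)")


def hangs_per_iteration(dmesg_lines):
    """Return [(iteration label, hang count), ...] in log order."""
    results = []
    for line in dmesg_lines:
        marker = ITERATION_RE.search(line)
        if marker:
            results.append([f"{marker.group(1)}/{marker.group(2)}", 0])
            continue
        # Attribute a hang to the most recent iteration marker, if any.
        if HANG_RE.search(line) and results:
            results[-1][1] += 1
    return [tuple(entry) for entry in results]


# Lines copied from the dmesg excerpt above.
SAMPLE = """\
[  153.874368] Iteration 1/3: synmark2 OglTerrainFlyTess
[  180.033099] Iteration 2/3: synmark2 OglTerrainFlyTess
[  187.993335] i915 0000:00:02.0: GPU HANG: ecode 11:1:0x00000000, hang on rcs0
[  187.993406] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  208.151967] Iteration 3/3: synmark2 OglTerrainFlyTess
"""

counts = hangs_per_iteration(SAMPLE.splitlines())
hung = sum(1 for _, n in counts if n > 0)
print(f"{hung} of {len(counts)} iterations hung: {counts}")
```

For the excerpt above this reports a hang in iteration 2/3 only, which matches the 1/3 reproducibility mentioned in this comment.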
Comment 6 Eero Tamminen 2019-10-11 10:39:02 UTC
(In reply to Eero Tamminen from comment #5)
> No idea whether the worse reproducibility (1/3 instead of 1/1), is due to
> using latest git kernel, or debug options.

With the latest drm-tip & Mesa versions, this no longer hangs on every run, maybe only on one run in five.

Since the CSDof and HDRBloom GPU hangs with Iris continue to happen on every run of the test, and HDRBloom can also be easily reproduced on other platforms, not just ICL, I would concentrate on that (bug 111385).

