Bug 112169 - [regression] (recoverable) GPU resets & fails in slower 3D benchmarks
Summary: [regression] (recoverable) GPU resets & fails in slower 3D benchmarks
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-29 13:00 UTC by Eero Tamminen
Modified: 2019-11-11 10:27 UTC (History)
1 user (show)

See Also:
i915 platform: BXT, KBL, SKL, TGL
i915 features: GPU hang


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Eero Tamminen 2019-10-29 13:00:06 UTC
Setup:
* HW: e.g. J4205 (BXT) and 7567U-i7 (KBL GT3e)
* OS: Ubuntu 18.04
* SW: latest Git versions of drm-tip kernel, X server and Mesa

Between following drm-tip commits:
* 2019-10-23 16:27:36 863a8a1bef: drm-tip: 2019y-10m-23d-16h-23m-44s UTC integration manifest
* 2019-10-27 20:46:21 54520983c6: drm-tip: 2019y-10m-27d-20h-45m-34s UTC integration manifest

Some of the slower 3D benchmarks started to GPU hang:
------------------------------------------
[ 4937.365687] Iteration 1/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4937.904763] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.016748] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.128707] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.213360] Iteration 2/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4938.736916] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.848767] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.960794] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4939.013764] Iteration 3/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4939.536891] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4939.648780] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4940.560756] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4940.623625] Iteration 1/3: GpuTest /test=pixmark_volplosion /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4941.112755] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4941.224772] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4941.504756] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4941.569851] Iteration 2/3: GpuTest /test=pixmark_volplosion /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4942.032771] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4942.600708] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4942.864828] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4942.920061] Iteration 3/3: GpuTest /test=pixmark_volplosion /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4943.408815] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4943.520753] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4943.632747] i915 0000:00:02.0: Resetting rcs0 for preemption time out
------------------------------------------

These are mainly Piano & Volplosion in public GpuTest v0.7 test-suite, and most of the tests in the internal GPU MemBW test-suite.

Notes:
* There's no i915 error state
* Screen updates in the failing benchmarks happen on average at single digit (2-8) FPS on BXT
* These aren't slowest benchmarks, e.g. offscreen GfxBench tests run at <1 FPS, and don't have GPU resets
Comment 1 Chris Wilson 2019-10-29 13:04:04 UTC
QoS; you can poke around with the timeout setting in config (via sysfs in later patches) to try and determine the right ballpark. Current was chosen by looking at piglit and declaring the test that tried to keep preemption off for 1s a dos.
Comment 2 Eero Tamminen 2019-10-29 13:23:16 UTC
(In reply to Chris Wilson from comment #1)
> QoS; you can poke around with the timeout setting in config (via sysfs in
> later patches)

When do you think those config patches will land, and from which sysfs file one can configure it?


> to try and determine the right ballpark. Current was chosen
> by looking at piglit and declaring the test that tried to keep preemption
> off for 1s a dos.

While these tests don't have slowest frame updates, they probably have slowest individual draw calls.

GpuTest is public, so it can be easily tested.  Piano & Volplosion are slowest tests in it, and slow linearly down with larger resolutions.

Marking this for all platforms I have, as it happens on all of them.
Comment 3 Chris Wilson 2019-10-29 15:26:00 UTC
So far with a bump to 250ms, gputest fails to cause iris to loop unpreemptibly. The only other question is whether or not iris itself can avoid such long denial of service.
Comment 4 Eero Tamminen 2019-10-29 16:05:31 UTC
(In reply to Chris Wilson from comment #3)
> So far with a bump to 250ms, gputest fails to cause iris to loop
> unpreemptibly. The only other question is whether or not iris itself can
> avoid such long denial of service.

Why that would be a problem for Iris, when it AFAIK has never been a problem for i965?
Comment 5 Chris Wilson 2019-10-29 23:08:31 UTC
(In reply to Eero Tamminen from comment #4)
> (In reply to Chris Wilson from comment #3)
> > So far with a bump to 250ms, gputest fails to cause iris to loop
> > unpreemptibly. The only other question is whether or not iris itself can
> > avoid such long denial of service.
> 
> Why that would be a problem for Iris, when it AFAIK has never been a problem
> for i965?

It's a problem in both (unbounded execution latency affecting higher priority clients); I was just using iris as an example as being more convenient to play with.

./scripts/config --set-val CONFIG_DRM_I915_PREEMPT_TIMEOUT 250
Comment 6 Eero Tamminen 2019-11-08 11:01:12 UTC
On TGL, this happens in many additional tests, including 2 different Automotive customer benchmarks.

In the SynMark benchmark, it happens in the following tests (when using last evening drm-tip):
[  778.513869] Iteration 1/3: synmark2 OglCSCloth
[  778.849082] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  778.898748] Iteration 2/3: synmark2 OglCSCloth
[  779.097064] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  779.134308] Iteration 3/3: synmark2 OglCSCloth
[  779.329093] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  779.374631] Iteration 1/3: synmark2 OglCSDof
[  780.000965] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  780.764332] Iteration 2/3: synmark2 OglCSDof
[  781.001096] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  781.699362] Iteration 3/3: synmark2 OglCSDof
[  781.928953] i915 0000:00:02.0: Resetting rcs0 for preemption time out
...
[ 1445.167415] Iteration 1/3: synmark2 OglPSPom
[ 1445.377107] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1470.346698] Iteration 2/3: synmark2 OglPSPom
[ 1470.553096] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1495.481206] Iteration 3/3: synmark2 OglPSPom
[ 1495.681099] i915 0000:00:02.0: Resetting rcs0 for preemption time out
...
[ 1876.058405] Iteration 3/3: synmark2 OglTerrainFlyTess
[ 1898.265122] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1901.588742] Iteration 1/3: synmark2 OglTerrainPanTess
[ 1927.116433] Iteration 2/3: synmark2 OglTerrainPanTess
[ 1952.646441] Iteration 3/3: synmark2 OglTerrainPanTess
[ 1969.121125] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1969.233106] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1969.345105] i915 0000:00:02.0: Resetting rcs0 for preemption time out
...
[ 2074.882611] Iteration 2/3: synmark2 OglVSDiffuse8
[ 2093.057110] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 2093.169108] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 2093.281111] i915 0000:00:02.0: Resetting rcs0 for preemption time out

(On ICL, Terrain tests run at >100FPS, CSCloth & VSDiffuse8 at >200FPS, so it seems unlikely that draw calls in these test would be particularly slow on TGL either, unless there's some other TGL specific bug in upstream Mesa / drm-tip kernel.  SynMark above is Wayland version because X server crashes on TGL when something tries to use it.)
Comment 7 Chris Wilson 2019-11-08 11:10:24 UTC
GPU reclocking is dysfunctional on Tigerlake (at least as far as observation goes and the randomly reported frequencies), so not too surprising as other observations back up that Tigerlake is clocking lower than intended.
Comment 8 Eero Tamminen 2019-11-08 12:31:03 UTC
(In reply to Chris Wilson from comment #7)
> GPU reclocking is dysfunctional on Tigerlake (at least as far as observation
> goes and the randomly reported frequencies), so not too surprising as other
> observations back up that Tigerlake is clocking lower than intended.

Looking at:
- /sys/class/drm/card0/gt_act_freq_mhz
- i915:intel_gpu_freq_change tracepoint values

For first minutes of running 3D benchmarks without a pause, tracepoint reports i915 to have requested 300MHz speed from GPU, but sampling gt_act_freq_mhz tells GPU to run at 400MHz.

After that tracepoint tells 900Mhz to be requested, but gt_act_freq_mhz tells GPU to run at 700MHz.

That indeed seems bogus power management behavior, but some of the tests where the TGL pre-emption reset issues are seen, are still running at > 100FPS, so they're definitely not too slow from the user point of view.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.