Bug 112169 - [regression] (recoverable) GPU resets & fails in slower 3D benchmarks
Summary: [regression] (recoverable) GPU resets & fails in slower 3D benchmarks
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
: high major
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2019-10-29 13:00 UTC by Eero Tamminen
Modified: 2019-11-29 19:45 UTC (History)
4 users (show)

See Also:
i915 platform: BXT, KBL, SKL, TGL
i915 features: GPU hang


Attachments

Description Eero Tamminen 2019-10-29 13:00:06 UTC
Setup:
* HW: e.g. J4205 (BXT) and 7567U-i7 (KBL GT3e)
* OS: Ubuntu 18.04
* SW: latest Git versions of drm-tip kernel, X server and Mesa

Between following drm-tip commits:
* 2019-10-23 16:27:36 863a8a1bef: drm-tip: 2019y-10m-23d-16h-23m-44s UTC integration manifest
* 2019-10-27 20:46:21 54520983c6: drm-tip: 2019y-10m-27d-20h-45m-34s UTC integration manifest

Some of the slower 3D benchmarks started to GPU hang:
------------------------------------------
[ 4937.365687] Iteration 1/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4937.904763] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.016748] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.128707] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.213360] Iteration 2/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4938.736916] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.848767] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4938.960794] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4939.013764] Iteration 3/3: GpuTest /test=pixmark_piano /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4939.536891] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4939.648780] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4940.560756] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4940.623625] Iteration 1/3: GpuTest /test=pixmark_volplosion /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4941.112755] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4941.224772] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4941.504756] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4941.569851] Iteration 2/3: GpuTest /test=pixmark_volplosion /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4942.032771] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4942.600708] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4942.864828] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4942.920061] Iteration 3/3: GpuTest /test=pixmark_volplosion /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000
[ 4943.408815] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4943.520753] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 4943.632747] i915 0000:00:02.0: Resetting rcs0 for preemption time out
------------------------------------------

These are mainly Piano & Volplosion in public GpuTest v0.7 test-suite, and most of the tests in the internal GPU MemBW test-suite.

Notes:
* There's no i915 error state
* Screen updates in the failing benchmarks happen on average at single digit (2-8) FPS on BXT
* These aren't slowest benchmarks, e.g. offscreen GfxBench tests run at <1 FPS, and don't have GPU resets
Comment 1 Chris Wilson 2019-10-29 13:04:04 UTC
QoS; you can poke around with the timeout setting in config (via sysfs in later patches) to try and determine the right ballpark. Current was chosen by looking at piglit and declaring the test that tried to keep preemption off for 1s a dos.
Comment 2 Eero Tamminen 2019-10-29 13:23:16 UTC
(In reply to Chris Wilson from comment #1)
> QoS; you can poke around with the timeout setting in config (via sysfs in
> later patches)

When do you think those config patches will land, and from which sysfs file one can configure it?


> to try and determine the right ballpark. Current was chosen
> by looking at piglit and declaring the test that tried to keep preemption
> off for 1s a dos.

While these tests don't have slowest frame updates, they probably have slowest individual draw calls.

GpuTest is public, so it can be easily tested.  Piano & Volplosion are slowest tests in it, and slow linearly down with larger resolutions.

Marking this for all platforms I have, as it happens on all of them.
Comment 3 Chris Wilson 2019-10-29 15:26:00 UTC
So far with a bump to 250ms, gputest fails to cause iris to loop unpreemptibly. The only other question is whether or not iris itself can avoid such long denial of service.
Comment 4 Eero Tamminen 2019-10-29 16:05:31 UTC
(In reply to Chris Wilson from comment #3)
> So far with a bump to 250ms, gputest fails to cause iris to loop
> unpreemptibly. The only other question is whether or not iris itself can
> avoid such long denial of service.

Why that would be a problem for Iris, when it AFAIK has never been a problem for i965?
Comment 5 Chris Wilson 2019-10-29 23:08:31 UTC
(In reply to Eero Tamminen from comment #4)
> (In reply to Chris Wilson from comment #3)
> > So far with a bump to 250ms, gputest fails to cause iris to loop
> > unpreemptibly. The only other question is whether or not iris itself can
> > avoid such long denial of service.
> 
> Why that would be a problem for Iris, when it AFAIK has never been a problem
> for i965?

It's a problem in both (unbounded execution latency affecting higher priority clients); I was just using iris as an example as being more convenient to play with.

./scripts/config --set-val CONFIG_DRM_I915_PREEMPT_TIMEOUT 250
Comment 6 Eero Tamminen 2019-11-08 11:01:12 UTC
On TGL, this happens in many additional tests, including 2 different Automotive customer benchmarks.

In the SynMark benchmark, it happens in the following tests (when using last evening drm-tip):
[  778.513869] Iteration 1/3: synmark2 OglCSCloth
[  778.849082] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  778.898748] Iteration 2/3: synmark2 OglCSCloth
[  779.097064] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  779.134308] Iteration 3/3: synmark2 OglCSCloth
[  779.329093] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  779.374631] Iteration 1/3: synmark2 OglCSDof
[  780.000965] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  780.764332] Iteration 2/3: synmark2 OglCSDof
[  781.001096] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[  781.699362] Iteration 3/3: synmark2 OglCSDof
[  781.928953] i915 0000:00:02.0: Resetting rcs0 for preemption time out
...
[ 1445.167415] Iteration 1/3: synmark2 OglPSPom
[ 1445.377107] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1470.346698] Iteration 2/3: synmark2 OglPSPom
[ 1470.553096] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1495.481206] Iteration 3/3: synmark2 OglPSPom
[ 1495.681099] i915 0000:00:02.0: Resetting rcs0 for preemption time out
...
[ 1876.058405] Iteration 3/3: synmark2 OglTerrainFlyTess
[ 1898.265122] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1901.588742] Iteration 1/3: synmark2 OglTerrainPanTess
[ 1927.116433] Iteration 2/3: synmark2 OglTerrainPanTess
[ 1952.646441] Iteration 3/3: synmark2 OglTerrainPanTess
[ 1969.121125] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1969.233106] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 1969.345105] i915 0000:00:02.0: Resetting rcs0 for preemption time out
...
[ 2074.882611] Iteration 2/3: synmark2 OglVSDiffuse8
[ 2093.057110] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 2093.169108] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 2093.281111] i915 0000:00:02.0: Resetting rcs0 for preemption time out

(On ICL, Terrain tests run at >100FPS, CSCloth & VSDiffuse8 at >200FPS, so it seems unlikely that draw calls in these test would be particularly slow on TGL either, unless there's some other TGL specific bug in upstream Mesa / drm-tip kernel.  SynMark above is Wayland version because X server crashes on TGL when something tries to use it.)
Comment 7 Chris Wilson 2019-11-08 11:10:24 UTC
GPU reclocking is dysfunctional on Tigerlake (at least as far as observation goes and the randomly reported frequencies), so not too surprising as other observations back up that Tigerlake is clocking lower than intended.
Comment 8 Eero Tamminen 2019-11-08 12:31:03 UTC
(In reply to Chris Wilson from comment #7)
> GPU reclocking is dysfunctional on Tigerlake (at least as far as observation
> goes and the randomly reported frequencies), so not too surprising as other
> observations back up that Tigerlake is clocking lower than intended.

Looking at:
- /sys/class/drm/card0/gt_act_freq_mhz
- i915:intel_gpu_freq_change tracepoint values

For first minutes of running 3D benchmarks without a pause, tracepoint reports i915 to have requested 300MHz speed from GPU, but sampling gt_act_freq_mhz tells GPU to run at 400MHz.

After that tracepoint tells 900Mhz to be requested, but gt_act_freq_mhz tells GPU to run at 700MHz.

That indeed seems bogus power management behavior, but some of the tests where the TGL pre-emption reset issues are seen, are still running at > 100FPS, so they're definitely not too slow from the user point of view.
Comment 9 Chris Wilson 2019-11-20 17:23:45 UTC
Did you ever manage to try with the 250ms preemption reset timeout? That cleared up the benchmarks you highlighted for me, but my sample size is limited.
Comment 10 Eero Tamminen 2019-11-21 15:15:55 UTC
(In reply to Chris Wilson from comment #5)
> ./scripts/config --set-val CONFIG_DRM_I915_PREEMPT_TIMEOUT 250

(In reply to Chris Wilson from comment #9)
> Did you ever manage to try with the 250ms preemption reset timeout? That
> cleared up the benchmarks you highlighted for me, but my sample size is
> limited.

~/intel-gpu-tools$ git pull
Already up to date.
~/intel-gpu-tools$ ls  scripts/
convert_itp.py  intel-gfx-trybot  list-workarounds  Makefile.am  media-bench.pl  quick-testlist.py  run-tests.sh  throttle.py  trace.pl  who.sh

-> there's no script called "config" that I could use for trying that.
Comment 11 Chris Wilson 2019-11-21 15:32:41 UTC
My apologies, I was referring to the kernel build config helper as currently the control knob (CONFIG_DRM_I915_PREEMPT_TIMEOUT) is only configurable at build time.
Comment 12 Eero Tamminen 2019-11-22 12:14:27 UTC
(In reply to Chris Wilson from comment #11)
> My apologies, I was referring to the kernel build config helper as currently
> the control knob (CONFIG_DRM_I915_PREEMPT_TIMEOUT) is only configurable at
> build time.

Just adding this to config:
CONFIG_DRM_I915_PREEMPT_TIMEOUT=250

wasn't enough.  After olddefconfig, values had reverted to:
CONFIG_DRM_I915_USERFAULT_AUTOSUSPEND=250
CONFIG_DRM_I915_HEARTBEAT_INTERVAL=2500
CONFIG_DRM_I915_PREEMPT_TIMEOUT=100
CONFIG_DRM_I915_STOP_TIMEOUT=100

What else is needed to get new preempt timeout through olddefconfig?

(olddefconfig is what our build machinery uses for kernel build, I can't change that.)
Comment 13 Chris Wilson 2019-11-22 13:19:08 UTC
I would have though olddefconfig would have worked. Nevermind, just patch the default then:

diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile
index 1799537a3228..082c409134af 100644
--- a/drivers/gpu/drm/i915/Kconfig.profile
+++ b/drivers/gpu/drm/i915/Kconfig.profile
@@ -25,7 +25,7 @@ config DRM_I915_HEARTBEAT_INTERVAL
 
 config DRM_I915_PREEMPT_TIMEOUT
        int "Preempt timeout (ms, jiffy granularity)"
-       default 100 # milliseconds
+       default 250 # milliseconds
        help
          How long to wait (in milliseconds) for a preemption event to occur
          when submitting a new context via execlists. If the current context
Comment 14 Eero Tamminen 2019-11-25 11:34:18 UTC
From quick test on SKL GT2, 250ms fixes hangs on MemBW blit & blend tests, but not in the texture BW test.  It fixes hang in GpuTest Volplosion, but not in the slower Piano test.

I'm currently trying few longer timeouts on J4205 (i.e. slower device than GT2), and a larger set of tests.
Comment 15 Eero Tamminen 2019-11-25 16:03:34 UTC
On SKL GT2, 450ms is enough for everything, but on J4205, GpuTest Piano still GPU resets with 450ms.

(450ms should have been enough for Piano based on average frame timings, but maybe some draws / frames are considerably longer than the average.)

With 550ms everything goes through without resets also on J4205.  Note that I'm running GpuTest Piano test in 1366x768 resolution.  With FullHD resolution (i.e. 2x more pixels), I assume it would still be reseted.

PS. I have hard time testing timeouts on J4205, because it system hangs so easily with drm-tip (in GpuTest Piano, in SynMark Trrain tests...).
Comment 16 Chris Wilson 2019-11-26 09:45:05 UTC
Applied

commit 5766a5ffc6a69595903865518c43636bde0e4ac4 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 25 16:27:37 2019 +0000

    drm/i915: Default to a more lenient forced preemption timeout
    
    Based on a sampling of a number of benchmarks across platforms, by
    default opt for a much more lenient timeout so that we should not
    adversely affect existing "good" clients.
    
    640ms ought to be enough for anyone.

to keep the wolves at bay. Thanks for the feedback. I'll leave this as NEEDINFO for a bit in case we need to refine that value again.
Comment 17 Eero Tamminen 2019-11-26 11:19:57 UTC
(In reply to Chris Wilson from comment #16)
> I'll leave this as NEEDINFO for a bit in case we need to refine that value again.

(In reply to Eero Tamminen from comment #15)
> On SKL GT2, 450ms is enough for everything, but on J4205, GpuTest Piano
> still GPU resets with 450ms.
...
> With 550ms everything goes through without resets also on J4205.  Note that
> I'm running GpuTest Piano test in 1366x768 resolution.

Tested also 500ms on J4205.  There are no GPU reset in GpuTest tests with 500ms either, when running test in that resolution (or in other tests I run regularly).

As this 3D benchmark is EU & memory bandwidth bound, BXT with minimal 12EUs instead 18EUs in J4205, would need larger timeout, but the new 640 value is probably enough for that too.

However, FullHD resolution would need 2x larger timeout, as it has 2x more pixels and GpuTest Piano is pixel shader bound.

While I'm not running GpuTest (Piano) in that large resolution, Phoronix does:
https://www.phoronix.com/scan.php?page=article&item=intel-iris-uhd620&num=4

(I'm not sure how much 3D benchmarks Phoronix runs on low end iGPU machines though.)
Comment 18 Martin Peres 2019-11-29 19:45:02 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/562.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.