drm-tip RC kernel integration drops offscreen 3D test performance to 1/3rd:

kernel git://anongit.freedesktop.org/drm-tip at 10de1e17faaab452782e5a1baffd1b30a639a261
2017-07-18_10-09-14 drm-tip: 2017y-07m-18d-10h-08m-42s UTC integration manifest

This happens because the kernel no longer raises GPU speed from minimum (to maximum) with those tests, as it should. These tests include offscreen versions of the following:
* GfxBench 4.0 ALU2, Tessellation, T-Rex, CarChase (neither Manhattan nor the CPU-bound driver tests were affected)
* GLBenchmark 2.7 Egypt, T-Rex and Fill tests (Fill least)

This issue is BYT specific.
Maxing out the GPU (at least GAM) results in 50% C0 activity. I'm guessing our czclk is incorrect, or slightly different from the C0 cycles:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 32c62442c9d8..4b36ed2290b9 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1076,7 +1076,7 @@ static u32 vlv_wa_c0_ei(struct drm_i915_private *dev_priv, u32 pm_iir)
 	time = ktime_us_delta(now.ktime, prev->ktime);
-	time *= dev_priv->czclk_freq;
+	time *= dev_priv->czclk_freq / 2;
 	/* Workload can be split between render + media,
 	 * e.g. SwapBuffers being blitted in X after being rendered in
In the last nightly [1] things got worse:

* The Unigine (onscreen) Heaven & Valley demos now also run mostly at minimum GPU frequency. I think that as a result the CPU side also runs slower, as demo startup and scene changes take longer, but it's hard to say which is cause and which is effect.
* There are now GPU hangs (in 2 different test-cases).

The last nightly was run with the modesetting driver instead of the Intel DDX, but I don't think that's related, as there's no similar Unigine demo issue in Mesa testing.

[1] I.e. from changes between:

kernel git://anongit.freedesktop.org/drm-tip at dbfb2f62576e1c3550d10398b097589959356db3
2017-08-21_08-14-04 drm-tip: 2017y-08m-21d-08h-13m-34s UTC integration manifest

kernel git://anongit.freedesktop.org/drm-tip at 017fec5c2e57672a8c2a350376070e6c6a5ae950
2017-08-22_16-23-32 drm-tip: 2017y-08m-22d-16h-23m-11s UTC integration manifest
Mika spotted that tkr_raw was ticking 2x faster than tkr_mono. The culprit is:

commit fc6eead7c1e2e5376c25d2795d4539fdacbc0648
Author: John Stultz <john.stultz@linaro.org>
Date:   Mon May 22 17:20:20 2017 -0700

    time: Clean up CLOCK_MONOTONIC_RAW time handling

    Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW,
    remove the duplicitive tk->raw_time.tv_nsec, which can be
    stored in tk->tkr_raw.xtime_nsec (similarly to how its handled
    for monotonic time).

    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: Miroslav Lichvar <mlichvar@redhat.com>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Cc: Prarit Bhargava <prarit@redhat.com>
    Cc: Stephen Boyd <stephen.boyd@linaro.org>
    Cc: Kevin Brodsky <kevin.brodsky@arm.com>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: Daniel Mentz <danielmentz@google.com>
    Tested-by: Daniel Mentz <danielmentz@google.com>
    Signed-off-by: John Stultz <john.stultz@linaro.org>
Applied to topic/core-for-CI:

commit 5567b808e5681f742856245bc1e34d40475cb89d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 25 14:46:41 2017 +0100

    Revert "time: Clean up CLOCK_MONOTONIC_RAW time handling"

    This reverts commit fc6eead7c1e2e5376c25d2795d4539fdacbc0648.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102336
Potential fix posted for testing here: https://lkml.org/lkml/2017/8/25/792
Swapped out the revert for

commit 177776dba04e4e02d46ec46d7927580eaeb106b6
Author: John Stultz <john.stultz@linaro.org>
Date:   Fri Aug 25 15:57:04 2017 -0700

    time: Fix ktime_get_raw() issues caused by incorrect base accumulation

    In commit fc6eead7c1e2 ("time: Clean up CLOCK_MONOTONIC_RAW time
    handling"), I mistakenly added the following:

        /* Update the monotonic raw base */
        seconds = tk->raw_sec;
        nsec = (u32)(tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift);
        tk->tkr_raw.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);

    Which adds the raw_sec value and the shifted down raw xtime_nsec to
    the base value. This is problematic, as when calling ktime_get_raw(),
    we add the tk->tkr_raw.xtime_nsec and current offset, shift it down
    and add it to the raw base. This results in the shifted down
    tk->tkr_raw.xtime_nsec being added twice.

    My mistake was that I was matching the monotonic base logic above:

        seconds = (u64)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec);
        nsec = (u32) tk->wall_to_monotonic.tv_nsec;
        tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);

    Which adds the wall_to_monotonic.tv_nsec value, but not the
    tk->tkr_mono.xtime_nsec value, to the base.

    The result of this is that ktime_get_raw() users (which are all
    internal users) see the raw time move faster than it should (the
    rate at which can vary with the current size of tkr_raw.xtime_nsec),
    which has resulted in at least problems with graphics rendering
    performance.

    To fix this, we simplify the tkr_raw.base accumulation to only
    accumulate the raw_sec portion, and do not include the
    tkr_raw.xtime_nsec portion, which will be added at read time.

in topic/core-for-CI.
Now fixed in x86/tip.

If you want to, report some details of the hang; otherwise there's not much I can do.

The reason why it suddenly became worse is all down to switching to modesetting, which only used rcs, and so we end up with interesting behaviour: X gets stuck behind extra frames and doesn't report the buffer as idle as early (so the client would be waiting on the CPU fence being signaled rather than waiting on the GPU, and so be invisible to the waitboost mechanism). Interesting effect.
(In reply to Chris Wilson from comment #7)
> Now fixed in x86/tip.

Verified, all the regressions are now fixed, thanks!

> If you want to report some details of the hang, otherwise there's not much
> I can do.

Was the issue BYT specific, like the related perf drop was?

> The reason why it suddenly became worse is all due to switching to
> modesetting, which only used rcs and so we end up with interesting
> behaviour with X getting stuck behind extra frames, and not reporting the
> buffer as idle as early (so the client would be waiting on the CPU fence
> being signaled and not waiting on the GPU, so being invisible to the
> waitboost mechanism). Interesting effect.

The modesetting drop isn't BYT specific.

When Martin tested modesetting ~1 year ago against the Intel DDX, modesetting was the same speed or slightly better in 3D cases (and lost in 2D ones).

However, now switching from the Intel DDX to modesetting drops onscreen 3D test-case perf on all platforms: for Unigine demos (on anything after BYT), GfxBench test-cases and high-FPS cases.

Any idea why?
(In reply to Eero Tamminen from comment #8)
> (In reply to Chris Wilson from comment #7)
> > Now fixed in x86/tip.
>
> Verified, all the regressions are now fixed, thanks!
>
> > If you want to report some details of the hang, otherwise there's not
> > much I can do.
>
> Was the issue BYT specific, like the related perf drop was?

The hangs? Haven't seen any, so I don't yet know what's going on there. The perf issue here is for BYT only (as it is the only one that cannot use the HW EI thresholds, so we calculate those manually).

> > The reason why it suddenly became worse is all due to switching to
> > modesetting, which only used rcs and so we end up with interesting
> > behaviour with X getting stuck behind extra frames, and not reporting
> > the buffer as idle as early (so the client would be waiting on the CPU
> > fence being signaled and not waiting on the GPU, so being invisible to
> > the waitboost mechanism). Interesting effect.
>
> Modesetting drop isn't BYT specific.
>
> When Martin tested modesetting ~1 year ago against Intel DDX, modesetting
> was same speed or slightly better in 3D cases (and lost in 2D ones).
>
> However, now switching from Intel DDX to modesetting drops onscreen 3D
> test-cases perf on all platforms for Unigine demos (on anything after
> BYT), GfxBench test-cases and high FPS cases.
>
> Any idea why?

Hmm, from my inspection on e.g. Unigine, -modesetting was and still is around 10% slower. As you are very aware, there are so many different factors at play, but under ideal circumstances the DDX is irrelevant to game throughput/latency. Do you have a specific workload that I can use as an example to see what changes may have impacted it?
(In reply to Chris Wilson from comment #9)
> (In reply to Eero Tamminen from comment #8)
> > Was the issue BYT specific, like the related perf drop was?

Sorry, I meant the fix.

> The hangs? Haven't seen any, so I don't yet know what's going on there.
> The perf issue here is for byt only (as it is the only one that cannot use
> the HW EI thresholds and so we calculate those manually).

Thanks, so it really was BYT specific. That wasn't clear from the fix description. :-)

> > When Martin tested modesetting ~1 year ago against Intel DDX,
> > modesetting was same speed or slightly better in 3D cases (and lost in
> > 2D ones).
> >
> > However, now switching from Intel DDX to modesetting drops onscreen 3D
> > test-cases perf on all platforms for Unigine demos (on anything after
> > BYT), GfxBench test-cases and high FPS cases.
> >
> > Any idea why?
>
> Hmm, from my inspection on e.g. Unigine -modesetting was and still is
> around 10% slower.

Good to know. Interesting that Martin got different results back then.

> As you are very aware, there are so many different factors at play -- but
> under ideal circumstances, the ddx is irrelevant to game
> throughput/latency.

Yes, the only thing it does is the copy for non-vsynced fullscreen content and some vblank etc. synchronization.

> Do you have a specific workload that I can use as an example to see what
> changes may have impacted it?

GpuTest Triangle and SynMark Batch0 are the simplest cases. Now that I have BXT data, Triangle actually went up on BYT & BXT with modesetting, although on the other tested machines it went down. However, the cases I'm more concerned about are the Unigine tests: because they are low FPS, the issue looks more like a MOCS / tiling issue than just some (less interesting) high-FPS overhead. I'd suggest checking things with either SKL or KBL (say, GT2).
Let's close this.