Bug 108598

Summary:	[GEN9] 20% perf drop in windowed/composited GpuTest Triangle
Product:	DRI	Reporter:	Eero Tamminen <eero.t.tamminen>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	RESOLVED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	minor
Priority:	medium	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:	BXT, CFL, KBL, SKL	i915 features:

Description Eero Tamminen 2018-10-30 11:55:53 UTC

HW setup:
* GEN9 HW (BXT, SKL, KBL...)
* Full HD monitor

SW setups:
* Ubuntu 18.04 with Unity desktop (= compiz compositor)
* Git versions of drm-tip kernel, X server and Mesa, so that modifiers work with X
* Modifier support enabled in X server (i.e. Mesa can use end-to-end render buffer compression):
----------------------
Section "ServerFlags"
    Option "Debug" "dmabuf_capable"
EndSection
----------------------
* Monitor native Full HD resolution


Between following drm-tip commits/dates:
a3d29ccd2c: drm-tip: 2018y-09m-30d-06h-55m-08s UTC integration manifest
26e7a7d954: drm-tip: 2018y-10m-02d-12h-38m-20s UTC integration manifest

There were following performance regressions:
* 20% in GpuTest v0.7 Triangle windowed (1366x768, composited)
* 1-2% SynMark v7 OglCSCloth, OglBatch1-3 (all fullscreen)
* 1% SynMark v7 OglGeom*, GpuTest v0.7 Triangle fullscreen

SynMark use-cases have all about 2M of vertices with pass-through shaders, so at FullHD they're memory bandwidth limited.  All the tests have high FPS, especially the trivial triangle tests.

Interestingly, higher FPS windowed Triangle case perf drop is much higher than the fullscreen Triangle one, so very high FPS can be one trigger for this.

Above regression percentages are from SKL i5 GT2, where the regressions are most visible.  Regressions are visible on all of our GEN9 platforms, but smaller on some of them.

On SKL GT2, in the windowed Triangle case:
* memory and GPU power usage decreased
* CPU power usage increased
So that in total ~4% more power is used in this case (according to RAPL).

SKL GT2 isn't TDP limited, so CPU side just using more power doesn't explain the drop.


Median power usage in our full benchmarks test set didn't change, regressions in all the other tests, than the very high FPS windowed Triangle one, are so small that they can't be bisected, and the trivial windowed Triangle case isn't that interesting use-case.

-> WONTFIX sounds fine as long as somebody checks what commit caused the perf drop.

(I'm not myself looking at GFX perf anymore, filing this was the last remaining item.)

Comment 1 Chris Wilson 2018-10-30 12:06:49 UTC

That metric may not include the frequency of screen updates, which is the likely change here in the composited case. My suspicion is that we interject more work from the xserver resulting in more frames being blitted from individual batches as opposed to several frames being amalgamated into a single batch. Just a hunch.

Comment 2 Chris Wilson 2018-10-30 12:07:52 UTC

Assuming it's from b16c765122f987056e1dc9ef6c214571bb5bd694 or e9eaf82d97a2b05460ff5ef6a3e07446f7d049fe

Comment 3 Chris Wilson 2018-10-30 12:35:53 UTC

However, there is a small merge between those commits so worth double checking against external changes.

Comment 4 Eero Tamminen 2018-10-30 12:54:05 UTC

(In reply to Chris Wilson from comment #1)
> That metric may not include the frequency of screen updates, which is the
> likely change here in the composited case. My suspicion is that we interject
> more work from the xserver resulting in more frames being blitted from
> individual batches as opposed to several frames being amalgamated into a
> single batch. Just a hunch.

Triangle test FPS is *very* high (it does just fast clear and draws
half window sized triangle with trivial shader), whereas Compiz updates
are limited to monitor frequency.

Looking at the collected data:

* Compiz updates at 60Hz, that didn't change.

* GpuTest glXBufferSwap timing distribution changed so that:
  - slowest frames are now 7x slower,
  - fastest frames are 10% faster,
  - median frames are 1-2% slower,
  - which results in average FPS being ~20% slower.

Maybe kernel reacts now worse to Mesa frame throttling behavior?

(Controlled by Mesa "disable_throttling" env var.)


If you're interested, I can send you scripts to track & visualize
individual buffer swap timings and GPU & CPU speeds.

Comment 5 Chris Wilson 2018-11-13 16:23:34 UTC

Except for using -modesetting since that failed to even modeset my kbl, I've not seen any significant variation  in gputest:triangle:window + compiz between the two commits using dri2/dri3, tearfree on/of.

Comment 6 Chris Wilson 2018-11-14 08:38:35 UTC

Repeated now with i965_dri.so + dmabuf_capable, still no significant variation between the two commits for gputest:triangle:window

i5-7260U / 640 kbl gt3e
Compiz 0.9.13.1

bxt is the other I can conveniently try and replicate on, guess she's up next.

Comment 7 Eero Tamminen 2018-11-14 10:59:44 UTC

I don't see this regression on BXT devices (nor GEN6-GEN8), or on KBL GT2 device (i7-7500U).

I do see it in all SKL devices (GT2, GT3e, GT4e).

On KBL GT3e (i7-7567U) and CFL-S GT2 the drop is smaller than on the SKL devices.

Comment 8 Eero Tamminen 2018-11-14 11:20:48 UTC

(In reply to Eero Tamminen from comment #7)
> I don't see this regression on BXT devices

I mean, it's not consistently visible on all BXT devices, and drop is much smaller than variation in Triangle test on these TDP limited devices i.e. the drop is visible only when looking at variance range change in a trend, you won't get conclusive results with normal perf comparison.

Also, when looking at perf trend, there's a drop visible in one J4205 device, but not in the other one (they have small perf diff in general, although they should be identical). Perf has likely dropped a bit on A3960 & J3455 too, but there it's even smaller.

-> best comparison HW is SKL.

Comment 9 Chris Wilson 2018-11-18 12:10:02 UTC

Had a stable 2% drop on bxt. The effect was due to reordering the requests to make i915_spin_request() more likely to be taken. The cost was not from reordering the requests themselves, but the act of busywaiting.

i.e.

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index abd4dacbab8e..f5d4659a4aa0 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1230,6 +1230,11 @@ long i915_request_wait(struct i915_request *rq,
        if (!timeout)
                return -ETIME;
 
+       /* Optimistic short spin before touching IRQs */
+       wait.seqno = i915_request_global_seqno(rq);
+       if (wait.seqno && __i915_spin_request(rq, wait.seqno, state, 5))
+               return timeout;
+
        trace_i915_request_wait_begin(rq, flags);
 
        add_wait_queue(&rq->execute, &exec);
@@ -1266,10 +1271,6 @@ long i915_request_wait(struct i915_request *rq,
        GEM_BUG_ON(!intel_wait_has_seqno(&wait));
        GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
 
-       /* Optimistic short spin before touching IRQs */
-       if (__i915_spin_request(rq, wait.seqno, state, 5))
-               goto complete;
-

Comment 10 Chris Wilson 2019-01-30 11:09:08 UTC

(In reply to Chris Wilson from comment #9)
> Had a stable 2% drop on bxt. The effect was due to reordering the requests
> to make i915_spin_request() more likely to be taken. The cost was not from
> reordering the requests themselves, but the act of busywaiting.
> 
> i.e.
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c
> b/drivers/gpu/drm/i915/i915_request.c
> index abd4dacbab8e..f5d4659a4aa0 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1230,6 +1230,11 @@ long i915_request_wait(struct i915_request *rq,
>         if (!timeout)
>                 return -ETIME;
>  
> +       /* Optimistic short spin before touching IRQs */
> +       wait.seqno = i915_request_global_seqno(rq);
> +       if (wait.seqno && __i915_spin_request(rq, wait.seqno, state, 5))
> +               return timeout;
> +
>         trace_i915_request_wait_begin(rq, flags);
>  
>         add_wait_queue(&rq->execute, &exec);
> @@ -1266,10 +1271,6 @@ long i915_request_wait(struct i915_request *rq,
>         GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>         GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
>  
> -       /* Optimistic short spin before touching IRQs */
> -       if (__i915_spin_request(rq, wait.seqno, state, 5))
> -               goto complete;
> -

Fwiw, this made it into

commit 52c0fdb25c7c919334b97976d05096b441a3eada
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 29 20:52:29 2019 +0000

    drm/i915: Replace global breadcrumbs with per-context interrupt tracking

So that should be my small reproducer fixed, but the larger drop is still a mystery.

Comment 11 Eero Tamminen 2019-01-30 13:16:43 UTC

There was a windowed Triangle perf improvement in early December, along with large reduction in variance.

Another clear perf improvement (and variance reduction) happened recently, between:
2019-01-22 14:34:34 af90b519a6: drm-tip: 2019y-01m-22d-14h-33m-29s UTC integration manifest
2019-01-23 16:59:31 198addb18e: drm-tip: 2019y-01m-23d-16h-58m-20s UTC integration manifest

I think windowed GpuTest Triangle perf is still a bit below what it was originally (I need to check that from older data, I don't see it anymore in the trends).

Comment 12 Chris Wilson 2019-01-30 15:41:09 UTC

(In reply to Eero Tamminen from comment #11)
> There was a windowed Triangle perf improvement in early December, along with
> large reduction in variance.
> 
> Another clear perf improvement (and variance reduction) happened recently,
> between:
> 2019-01-22 14:34:34 af90b519a6: drm-tip: 2019y-01m-22d-14h-33m-29s UTC
> integration manifest
> 2019-01-23 16:59:31 198addb18e: drm-tip: 2019y-01m-23d-16h-58m-20s UTC
> integration manifest

Only one thing of significance there,

commit 6e062b60b0b1bd82cac475e63cdb8c451647182b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 23 13:51:55 2019 +0000

    drm/i915/execlists: Mark up priority boost on preemption

which should help prevent some surplus preemption loops and could definitely have a small, variable impact on performance.

> I think windowed GpuTest Triangle perf is still a bit below what it was
> originally (I need to check that from older data, I don't see it anymore in
> the trends).

There's a tiny bit more work to come tweaking need_preempt() to avoid excess preempt-to-idle cycles, but it should be diminishing returns.

Confidence is high that preemption was causing at least part of the problem though.

Comment 13 Eero Tamminen 2019-01-30 17:08:43 UTC

(In reply to Eero Tamminen from comment #11)
> I think windowed GpuTest Triangle perf is still a bit below what it was
> originally (I need to check that from older data, I don't see it anymore in
> the trends).

Nope, everything is with variance compared to original perf -> FIXED?

PS. I really like variance reductions (as long as they don't noticeably regress perf).

Comment 14 Chris Wilson 2019-01-30 17:13:39 UTC

Aye, let's presume it's righted itself (rather than the original regression being masked by a separate improvement). We'll know soon enough if there's worse to come.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.