HW setup:
* GEN9 HW (BXT, SKL, KBL...)
* Full HD monitor

SW setups:
* Ubuntu 18.04 with Unity desktop (= compiz compositor)
* Git versions of drm-tip kernel, X server and Mesa, so that modifiers work with X
* Modifier support enabled in X server (i.e. Mesa can use end-to-end render buffer compression):
----------------------
Section "ServerFlags"
  Option "Debug" "dmabuf_capable"
EndSection
----------------------
* Monitor native Full HD resolution

Between the following drm-tip commits/dates:
a3d29ccd2c: drm-tip: 2018y-09m-30d-06h-55m-08s UTC integration manifest
26e7a7d954: drm-tip: 2018y-10m-02d-12h-38m-20s UTC integration manifest

There were the following performance regressions:
* 20% in GpuTest v0.7 Triangle windowed (1366x768, composited)
* 1-2% SynMark v7 OglCSCloth, OglBatch1-3 (all fullscreen)
* 1% SynMark v7 OglGeom*, GpuTest v0.7 Triangle fullscreen

The SynMark use-cases all have about 2M vertices with pass-through shaders, so at Full HD they're memory bandwidth limited. All the tests have high FPS, especially the trivial triangle tests.

Interestingly, the perf drop in the higher FPS windowed Triangle case is much larger than in the fullscreen Triangle one, so very high FPS can be one trigger for this.

The above regression percentages are from SKL i5 GT2, where the regressions are most visible. Regressions are visible on all of our GEN9 platforms, but smaller on some of them.

On SKL GT2, in the windowed Triangle case:
* memory and GPU power usage decreased
* CPU power usage increased

So in total ~4% more power is used in this case (according to RAPL). SKL GT2 isn't TDP limited, so the CPU side just using more power doesn't explain the drop.

Median power usage in our full benchmark test set didn't change, regressions in all tests other than the very high FPS windowed Triangle one are so small that they can't be bisected, and the trivial windowed Triangle case isn't that interesting a use-case.

-> WONTFIX sounds fine as long as somebody checks which commit caused the perf drop.
(I'm not myself looking at GFX perf anymore, filing this was the last remaining item.)
That metric may not include the frequency of screen updates, which is the likely change here in the composited case. My suspicion is that we interject more work from the xserver resulting in more frames being blitted from individual batches as opposed to several frames being amalgamated into a single batch. Just a hunch.
Assuming it's from b16c765122f987056e1dc9ef6c214571bb5bd694 or e9eaf82d97a2b05460ff5ef6a3e07446f7d049fe
However, there is a small merge between those commits so worth double checking against external changes.
(In reply to Chris Wilson from comment #1)
> That metric may not include the frequency of screen updates, which is the
> likely change here in the composited case. My suspicion is that we interject
> more work from the xserver resulting in more frames being blitted from
> individual batches as opposed to several frames being amalgamated into a
> single batch. Just a hunch.

The Triangle test FPS is *very* high (it does just a fast clear and draws a half-window-sized triangle with a trivial shader), whereas Compiz updates are limited to the monitor frequency.

Looking at the collected data:
* Compiz updates at 60Hz, that didn't change.
* GpuTest glXSwapBuffers timing distribution changed so that:
  - slowest frames are now 7x slower,
  - fastest frames are 10% faster,
  - median frames are 1-2% slower,
  - which results in average FPS being ~20% slower.

Maybe the kernel now reacts worse to Mesa frame throttling behavior? (Controlled by Mesa "disable_throttling" env var.)

If you're interested, I can send you scripts to track & visualize individual buffer swap timings and GPU & CPU speeds.
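To illustrate why a 1-2% median change can coexist with a ~20% average FPS drop: average FPS over a run is the frame count divided by total elapsed time, so it follows the mean frame time, which a few very slow frames can dominate. A minimal sketch, assuming made-up swap intervals (the numbers and helper names below are hypothetical, not the collected data or the actual tracking scripts):

```python
from statistics import mean, median


def average_fps(frame_times_ms):
    """Average FPS over a run: frames rendered divided by total elapsed time."""
    return 1000.0 * len(frame_times_ms) / sum(frame_times_ms)


def summarize(frame_times_ms):
    """Return (median frame time, mean frame time, average FPS)."""
    return (median(frame_times_ms), mean(frame_times_ms),
            average_fps(frame_times_ms))


# Hypothetical swap intervals in milliseconds, NOT measured data.
before = [0.200] * 98 + [0.40] * 2   # tight distribution
after = [0.204] * 98 + [2.80] * 2    # median 2% slower, slowest frames 7x slower

for name, times in (("before", before), ("after", after)):
    med, avg, fps = summarize(times)
    print(f"{name}: median {med:.3f} ms, mean {avg:.3f} ms, average {fps:.0f} FPS")
```

With these assumed numbers, two slow frames out of a hundred are enough to pull the average FPS down by roughly a fifth while the median barely moves, which is why per-swap timings are more informative here than the FPS figure alone.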
Except for using -modesetting since that failed to even modeset my kbl, I've not seen any significant variation in gputest:triangle:window + compiz between the two commits using dri2/dri3, tearfree on/off.
Repeated now with i965_dri.so + dmabuf_capable, still no significant variation between the two commits for gputest:triangle:window

i5-7260U / 640 kbl gt3e
Compiz 0.9.13.1

bxt is the other I can conveniently try and replicate on, guess she's up next.
I don't see this regression on BXT devices (nor GEN6-GEN8), or on KBL GT2 device (i7-7500U). I do see it in all SKL devices (GT2, GT3e, GT4e). On KBL GT3e (i7-7567U) and CFL-S GT2 the drop is smaller than on the SKL devices.
(In reply to Eero Tamminen from comment #7)
> I don't see this regression on BXT devices

I mean, it's not consistently visible on all BXT devices, and the drop is much smaller than the variation in the Triangle test on these TDP limited devices, i.e. the drop is visible only when looking at the change of the variance range in a trend; you won't get conclusive results with a normal perf comparison.

Also, when looking at the perf trend, there's a drop visible in one J4205 device, but not in the other one (they have a small perf diff in general, although they should be identical). Perf has likely dropped a bit on A3960 & J3455 too, but there it's even smaller.

-> best comparison HW is SKL.
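The point about trends versus one-off comparisons can be sketched as follows. When the drop is smaller than the run-to-run spread, a single before/after pair can land either way, while the shift of the whole window is still visible. The FPS values and function names are hypothetical illustrations, not results from these devices:

```python
from statistics import mean


def single_run_is_conclusive(before, after):
    """True only if every 'after' sample is below every 'before' sample,
    i.e. any one-vs-one comparison would show the drop."""
    return max(after) < min(before)


def trend_shift(before, after):
    """Relative change of the window mean, e.g. about -0.01 for a 1% drop."""
    return mean(after) / mean(before) - 1.0


# Hypothetical FPS results from consecutive daily builds, NOT measured data.
before = [1000, 1040, 980, 1020, 960]   # spread of about +-4% around 1000
after = [990, 1030, 970, 1010, 950]     # same spread, shifted down by 10 FPS

print("single comparison conclusive:", single_run_is_conclusive(before, after))
print(f"trend shift: {trend_shift(before, after):+.1%}")
```

Here the ranges overlap almost completely, so comparing one run from each side proves nothing, yet the window mean has moved by about 1%.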
Had a stable 2% drop on bxt. The effect was due to reordering the requests to make i915_spin_request() more likely to be taken. The cost was not from reordering the requests themselves, but the act of busywaiting.

i.e.

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index abd4dacbab8e..f5d4659a4aa0 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1230,6 +1230,11 @@ long i915_request_wait(struct i915_request *rq,
 	if (!timeout)
 		return -ETIME;
 
+	/* Optimistic short spin before touching IRQs */
+	wait.seqno = i915_request_global_seqno(rq);
+	if (wait.seqno && __i915_spin_request(rq, wait.seqno, state, 5))
+		return timeout;
+
 	trace_i915_request_wait_begin(rq, flags);
 
 	add_wait_queue(&rq->execute, &exec);
@@ -1266,10 +1271,6 @@ long i915_request_wait(struct i915_request *rq,
 		GEM_BUG_ON(!intel_wait_has_seqno(&wait));
 		GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
 
-		/* Optimistic short spin before touching IRQs */
-		if (__i915_spin_request(rq, wait.seqno, state, 5))
-			goto complete;
-
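For readers unfamiliar with the pattern in the diff, here is a user-space analogy of "spin briefly, then sleep". It is only a sketch and not kernel code. `wait_with_spin` and its parameters are hypothetical names, and the 5 microsecond budget in the usage note is an assumption loosely mirroring the final argument of `__i915_spin_request()` above:

```python
import threading
import time


def wait_with_spin(done: threading.Event, spin_us: float, timeout_s: float) -> str:
    """Busy-poll `done` for up to spin_us microseconds, then block.

    Returns "spin" if completion was seen while polling, "sleep" if it
    was seen after blocking, and "timeout" otherwise.
    """
    deadline = time.perf_counter() + spin_us / 1e6
    while True:
        if done.is_set():
            return "spin"        # cheap path: no sleep/wakeup round trip
        if time.perf_counter() >= deadline:
            break                # spin budget used up, stop burning CPU
    # Slow path: give up the CPU until signalled or timed out.
    return "sleep" if done.wait(timeout_s) else "timeout"
```

Calling `wait_with_spin(event, 5, 1.0)` returns "spin" when the event is already set, and otherwise falls back to blocking. The trade-off is that every cycle spent polling is CPU time (and power) that a blocking wait would not have used, so making the spin path more likely to be taken can cost more than it saves.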
(In reply to Chris Wilson from comment #9)
> Had a stable 2% drop on bxt. The effect was due to reordering the requests
> to make i915_spin_request() more likely to be taken. The cost was not from
> reordering the requests themselves, but the act of busywaiting.
> 
> i.e.
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c
> b/drivers/gpu/drm/i915/i915_request.c
> index abd4dacbab8e..f5d4659a4aa0 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1230,6 +1230,11 @@ long i915_request_wait(struct i915_request *rq,
>         if (!timeout)
>                 return -ETIME;
>  
> +       /* Optimistic short spin before touching IRQs */
> +       wait.seqno = i915_request_global_seqno(rq);
> +       if (wait.seqno && __i915_spin_request(rq, wait.seqno, state, 5))
> +               return timeout;
> +
>         trace_i915_request_wait_begin(rq, flags);
>  
>         add_wait_queue(&rq->execute, &exec);
> @@ -1266,10 +1271,6 @@ long i915_request_wait(struct i915_request *rq,
>                 GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>                 GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
>  
> -               /* Optimistic short spin before touching IRQs */
> -               if (__i915_spin_request(rq, wait.seqno, state, 5))
> -                       goto complete;
> -

Fwiw, this made it into

commit 52c0fdb25c7c919334b97976d05096b441a3eada
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 29 20:52:29 2019 +0000

    drm/i915: Replace global breadcrumbs with per-context interrupt tracking

So that should be my small reproducer fixed, but the larger drop is still a mystery.
There was a windowed Triangle perf improvement in early December, along with a large reduction in variance.

Another clear perf improvement (and variance reduction) happened recently, between:
2019-01-22 14:34:34 af90b519a6: drm-tip: 2019y-01m-22d-14h-33m-29s UTC integration manifest
2019-01-23 16:59:31 198addb18e: drm-tip: 2019y-01m-23d-16h-58m-20s UTC integration manifest

I think windowed GpuTest Triangle perf is still a bit below what it was originally (I need to check that from older data, I don't see it anymore in the trends).
(In reply to Eero Tamminen from comment #11)
> There was a windowed Triangle perf improvement in early December, along with
> large reduction in variance.
> 
> Another clear perf improvement (and variance reduction) happened recently,
> between:
> 2019-01-22 14:34:34 af90b519a6: drm-tip: 2019y-01m-22d-14h-33m-29s UTC
> integration manifest
> 2019-01-23 16:59:31 198addb18e: drm-tip: 2019y-01m-23d-16h-58m-20s UTC
> integration manifest

Only one thing of significance there,

commit 6e062b60b0b1bd82cac475e63cdb8c451647182b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 23 13:51:55 2019 +0000

    drm/i915/execlists: Mark up priority boost on preemption

which should help prevent some surplus preemption loops and could definitely have a small, variable impact on performance.

> I think windowed GpuTest Triangle perf is still a bit below what it was
> originally (I need to check that from older data, I don't see it anymore in
> the trends).

There's a tiny bit more work to come tweaking need_preempt() to avoid excess preempt-to-idle cycles, but it should be diminishing returns. Confidence is high that preemption was causing at least part of the problem though.
(In reply to Eero Tamminen from comment #11)
> I think windowed GpuTest Triangle perf is still a bit below what it was
> originally (I need to check that from older data, I don't see it anymore in
> the trends).

Nope, everything is within variance compared to the original perf.

-> FIXED?

PS. I really like variance reductions (as long as they don't noticeably regress perf).
Aye, let's presume it's righted itself (rather than the original regression being masked by a separate improvement). We'll know soon enough if there's worse to come.