| Summary: | [BDW] OglBatch7 performance reduced by ~7% after enable execlists | | |
|---|---|---|---|
| Product: | DRI | Reporter: | wendy.wang |
| Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Status: | CLOSED WONTFIX | QA Contact: | Jairo Miramontes <jairo.daniel.miramontes.caton> |
| Severity: | critical | | |
| Priority: | high | CC: | christophe.prigent, eero.t.tamminen, intel-gfx-bugs |
| Version: | unspecified | | |
| Hardware: | Other | | |
| OS: | All | | |
| Whiteboard: | ReadyForDev | | |
| i915 platform: | BDW | i915 features: | GEM/execlists |
Description (wendy.wang, 2014-12-26 01:03:12 UTC)
The Batch7 test does an (unrealistically) large number of batches within a frame, so it is completely CPU bound. It's something of a worst case for this. Wendy, are there any tests which improved with this commit?

Did not observe perf data improving in other cases on BDW GT2 (E2 or F0).

Wendy or Eero, do we have a profile for this case? If the test is CPU bound, does that mean our new execlist submission path has a lot more overhead than our ring buffer one?

(In reply to Jesse Barnes from comment #3)
> Wendy or Eero, do we have a profile for this case? If the test is CPU
> bound, does that mean our new execlist submission path has a lot more
> overhead than our ring buffer one?

Sorry Jesse, I do not have a profile for the OglBatch7 case.

Sounds like this is wontfix.

Why not? It's the extra interrupt overhead in the driver, which we can do our best to reduce. In the ideal case, execlists only increases CPU overhead and does not introduce GPU stalls like here. I don't understand why we wouldn't be interested in fixing the inherent issues with execlists.

Looking at my bdw-u system, the effect here seems to be attributable to the increased latency of execlist submission to an idle GPU compared to legacy - and to the fact that we do just enough work inside the execlist handler for the GPU to become idle between batches. On my machine I can get comparable timings between execlists and legacy by tuning the driver, but if my hypothesis is valid, such tuning will be highly dependent upon the system.

So are we not submitting work to the second submit port? If we do, and we re-submit it on the first port, I'm assuming it won't preempt it, dump the state, and then re-load it when we submit new work to the second port. With some more detail and experimentation, hopefully we can get this down to 0%, or at least into the noise.

There isn't enough work to submit pairs of batches - we spend, between SynMark, Mesa and the driver, longer creating a batch for execution than it takes the GPU to execute it. This test really does focus on submission overhead.

I think I have it mostly down to the noise now, or at least down to the level where even I won't whinge too much. (There is definitely room for improvement w.r.t. the hardware responding to ELSP, so hopefully it will be less of an issue in future gens.)

Wendy provided me data with execlists enabled (default) and disabled (i915.enable_execlists=0 kernel command line option), for the full SynMark test-set with a recent 3D stack.

Execlists is in no case faster. It's ~50% slower in the GL context re-creation test, and in other CPU bound tests it's 5-10% slower:
- Batch5-7 (batching a huge triangle strip with an ever increasing number of draw calls)
- DrvState (mostly CPU bound GL state change test)
- Multithread (GL calls from multiple threads)

The same difference is seen on both BSW & BDW.
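For reference, a minimal sketch of how such an execlists on/off comparison could be set up on kernels of that era, while the i915.enable_execlists parameter still existed; the GRUB file location and update command are assumptions and vary by distribution:

    # Check the current submission mode via the module parameter
    # (-1 = per-platform default, 0 = legacy ringbuffer, 1 = execlists).
    cat /sys/module/i915/parameters/enable_execlists

    # Force legacy ringbuffer submission on the next boot by adding the
    # parameter to the kernel command line (GRUB shown as an example).
    sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&i915.enable_execlists=0 /' /etc/default/grub
    sudo update-grub && sudo reboot

    # After rebooting, confirm the override took effect, then re-run the
    # SynMark cases (Batch5-7, DrvState, Multithread) against the default run.
    cat /proc/cmdline
    cat /sys/module/i915/parameters/enable_execlists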
(In reply to Eero Tamminen from comment #11)
> Wendy provided me data with execlists enabled (default) and disabled
> (i915.enable_execlists=0 kernel command line option), for full SynMark
> test-set with recent 3D stack.
>
> Execlists is in no case faster.
>
> It's ~50% slower in GL context re-creation test, and in other CPU bound

Hmm, not looked at that. The cause is fairly obvious, since execlists implies even more state to create per context. I think I can safely eliminate most of the extra overhead. Worth filing that as a separate bug (I guess the report got overlooked due to the first regression from enabling ppgtt).

> tests it's 5-10% slower:
> - Batch5-7 (batching huge triangle strip with ever increasing number of draw calls)

These are what I've been looking at. There is a lot of low hanging fruit we can eliminate in execlist submission that brings us back on par. However, looking at the profile of igt/gem_exec_nop and igt/gem_exec_blt, I came to the conclusion that the largest effect occurs when the GPU is idle (i.e. the GPU takes longer to respond to an ELSP write than to a TAIL update). Therefore the impact of execlists will differ between machines, depending on just how frequently the GPU stalls waiting for the driver.

> - DrvState (mostly CPU bound GL state change test)
> - Multithread (GL calls from multiple threads)

I presume these show more or less the same execlist submission overheads as above, except that we will be exercising context switching (and thus ring switching) much more. In the driver we can amalgamate consecutive batches from the same context into a single submission; with more contexts active we will not only have to do more ELSP submissions (i.e. more irq handling and general driver overhead) but also incur more latency from hw context switches.

Do we have any update on this old regression? It looks like it may require some non-negligible work.

It's more or less a fundamental difference in hw implementation between execlists and legacy submission. For whatever reason, execution latency from idle is much worse under execlists, and OglBatch7 hits that path continuously (since the overhead in the driver is sufficient that Mesa cannot keep the GPU busy). However, some of this is self-inflicted, as execlists incurs greater execution overhead, both from per-bb/per-ctx instructions and from being irq driven. Those we can improve and have been improving, but that execution latency is hard to hide.

Reference to Chris' patch set: https://patchwork.freedesktop.org/series/21821/

To put the difference into perspective, a couple of microbenchmarks (bdw i7-5557u):

| | ring | execlists |
|---|---|---|
| exec continuous nops on all rings | 1.491us | 2.223us |
| exec sequential nops on each ring | 12.508us | 53.682us |
| single nop + sync | 9.272us | 30.291us |
| vblank_mode=0 glxgears | ~11000fps | ~9000fps |
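The numbers above come from Chris' own micro-benchmarks; a rough sketch of how comparable figures could be gathered with generally available tools (the igt binary locations are assumptions and depend on how intel-gpu-tools was built and installed):

    # Swap-bound throughput: let glxgears run unthrottled and report FPS
    # (Mesa DRI drivers honour the vblank_mode environment variable).
    vblank_mode=0 glxgears

    # Submission-overhead micro-benchmarks referenced in this bug; run them
    # from the intel-gpu-tools build tree (exact paths vary by checkout).
    ./gem_exec_nop
    ./gem_exec_blt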
Priority changed to High+Critical.

Adding tag into "Whiteboard" field - ReadyForDev.

The bug is still active:
* Status is correct
* Platform is included
* Feature is included
* Priority and Severity correctly set
* Logs included

I removed the regression-related tags from this issue. Based on comment 16 and measurements done with OglBatch7 on Ubuntu kernels, one can state that the gap caused by the change from ring to execlists still exists. If one sets Aug-2015+fixes as the baseline (100%), then:
- 4.2.0-42-generic (Aug-2015+fixes): 100%
- 4.10.0-19-generic (Feb-2017+fixes): ~80%

This is in line with the glxgears numbers presented by Chris in comment 16.

Hello everyone, this seems to need more time to make progress, so I'll change the status to NEEDINFO until any new update is added to this case. Once updated, please change it to REOPEN. Thank you.

Minor improvements, but the disparity still remains. However:

    commit 79e6770cb1f5e32eb49590edbb794a97d0113aed
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Mon Nov 20 20:55:01 2017 +0000

        drm/i915: Remove obsolete ringbuffer emission for gen8+

        Since removing the module parameter to force selection of ringbuffer
        emission for gen8, the code is defunct. Remove it.

        To put the difference into perspective, a couple of microbenchmarks
        (bdw i7-5557u, 20170324):
                                               ring         execlists
        exec continuous nops on all rings:    1.491us        2.223us
        exec sequential nops on each ring:   12.508us       53.682us
        single nop + sync:                    9.272us       30.291us

        vblank_mode=0 glxgears:             ~11000fps       ~9000fps

        Since the earlier submission, gen8 ringbuffer submission has fallen
        further and further behind in features. So while ringbuffer may hold
        the throughput crown, in terms of interactive latency, execlists is
        much better. Alas, we have no convenient metrics for such, other
        than demonstrating things we can do with execlists but can not using
        legacy ringbuffer submission.

        We have made a few improvements to lowlevel execlists throughput,
        and ringbuffer currently panics on boot! (bdw i7-5557u, 20171026):
                                               ring         execlists
        exec continuous nops on all rings:      n/a           1.921us
        exec sequential nops on each ring:      n/a          44.621us
        single nop + sync:                      n/a          21.953us

        vblank_mode=0 glxgears:                 n/a         ~18500fps

        References: https://bugs.freedesktop.org/show_bug.cgi?id=87725
        Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
        Once-upon-a-time-Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
        Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
        Link: https://patchwork.freedesktop.org/patch/msgid/20171120205504.21892-2-chris@chris-wilson.co.uk

    commit fb5c551ad510e4a408c105670f89c725ebbfe6c8
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Mon Nov 20 20:55:00 2017 +0000

        drm/i915: Remove i915.enable_execlists module parameter

        Execlists and legacy ringbuffer submission are no longer feature
        comparable (execlists now offer greater functionality that should
        overcome their performance hit) and obsoletes the unsafe module
        parameter, i.e. comparing the two modes of execution is no longer
        useful, so remove the debug tool.

        Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
        Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
        Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
        Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> #i915_perf.c
        Link: https://patchwork.freedesktop.org/patch/msgid/20171120205504.21892-1-chris@chris-wilson.co.uk

I wonder what impact the resource streamer would have on this? (This is not a request to add RS, as it goes away in newer GENs, but I know that Windows uses it, so I was just wondering how much impact you think it would have had.)

(In reply to Eero Tamminen from comment #22)
> I wonder what impact resource streamer would have on this?
>
> (This is not a request to add RS, as it goes away in newer GENs, but I know
> that Windows uses it, so I was just wondering how much impact you think it
> would have had.)

Overall, none. RS afaik is the same for both submission paths; the effect causing the slowdown is the order-of-magnitude increase in latency, both in the initial submission and in effectively doing a context switch every time (waking up, reprogramming ELSP, which in turn triggers another wakeup a few microseconds later, repeat until a steady state is achieved, then start again at the end of the batch).
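As a quick way to tell whether a given kernel still offers the ring/execlists comparison at all, one can check for the module parameter removed by the commit quoted above; a sketch, assuming the i915 module is present on the system:

    # On kernels that still carry the parameter these print its description
    # and current value; once the removal commit above is included, both
    # come back empty.
    modinfo i915 | grep -i enable_execlists
    cat /sys/module/i915/parameters/enable_execlists 2>/dev/null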