Created attachment 111341 [details]
dmesg file

Environment:
-----------------------------------
Platform: BDW
Libdrm: (master)libdrm-2.4.58-19-gf99522e678dbbaffeca9462a8edcbe900574dc12
Mesa: (master)ff9653775962ab661c4891721b1b93d077d1f2db
Xserver: (master)xorg-server-1.16.99.901-85-g91651e7c15892aa846fc406fbb13b37f094dd3f0
Xf86_video_intel: (master)2.99.916-172-g04a09d353fc5ad8a55eb2457dc8bb43638ba879e
Cairo: (master)4a225fca5f121c31ddaa0af80a13bf95a7f21a5b
Libva: (master)cdfd3d50d80c092ca3ae8914418be628d5a80832
Libva_intel_driver: (master)f2a34f94c57e1f7ce975b2068fb087df84b92e3a
Kernel: (drm-intel-nightly)a243fbbcfd348ec042b100f3ce02f215252d177d

Bug detailed description:
---------------------------------------------
OglBatch7 performance is reduced by ~7% after enabling execlists.

Bisecting the drm-intel-next-queued kernel branch shows the first bad commit is:

commit 27401d126b5b1c8dd4df98bbb60b09ff2b3d5e60
Author: Thomas Daniel <thomas.daniel@intel.com>
Date:   Thu Dec 11 12:48:35 2014 +0000

    drm/i915/bdw: Enable execlists by default where supported

    Execlist support in the i915 driver is now considered good enough for the
    feature to be enabled by default on Gen8 and later and routinely tested.
    Adjusted i915 parameters structure initialization to reflect this and
    updated the comment in intel_sanitize_enable_execlists().

    There's still work to do before we can let the wider massive onto it, but
    there's still time left before the 3.20 cutoff.

    v2: Update the MODULE_PARM_DESC too.

    Issue: VIZ-2020
    Signed-off-by: Thomas Daniel <thomas.daniel@intel.com>
    [danvet: Add note that there's still some work left to do.]
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Reproduce steps:
---------------------------------------------
1. xinit &
2. cd /usr/local/games/OpenGL/SynMark2_6
3. ./synmark2 OglBatch7

The dmesg file has been attached.
The Batch7 test does an (unrealistically) large number of batches within a frame, so it's completely CPU bound. It's something of a worst case for this. Wendy, are there any tests which improved with this commit?
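For readers unfamiliar with the test: SynMark is closed source, so the following is a purely illustrative sketch of what a Batch-style workload looks like (the draw-call count and all structure are assumptions, not taken from SynMark). A trivial scene is split into very many tiny draw calls, so per-draw CPU and submission overhead dominates and the GPU is never kept busy:

/*
 * Hypothetical sketch of a "Batch" style GL test, NOT the actual
 * SynMark source. One tiny triangle per draw call: the GPU finishes
 * each batch long before the CPU has built the next one.
 *
 * Build (assumed): gcc batch_sketch.c -lglut -lGL -o batch_sketch
 */
#include <GL/glut.h>

#define DRAW_CALLS 16384   /* assumption; the real count is unknown */

static const GLfloat tri[] = { -0.01f, -0.01f, 0.01f, -0.01f, 0.0f, 0.01f };

static void display(void)
{
    glClear(GL_COLOR_BUFFER_BIT);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, tri);

    /* Many minimal draws per frame. (The real test presumably also
     * forces enough state changes/flushes that draws reach the kernel
     * as separate batches; this only shows the draw-call pattern.) */
    for (int i = 0; i < DRAW_CALLS; i++)
        glDrawArrays(GL_TRIANGLES, 0, 3);

    glutSwapBuffers();
    glutPostRedisplay();   /* re-render continuously, as a benchmark does */
}

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
    glutCreateWindow("batch-overhead sketch");
    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}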
I did not observe performance improving in any other cases on BDW GT2 (E2 or F0).
Wendy or Eero, do we have a profile for this case? If the test is CPU bound, does that mean our new execlist submission path has a lot more overhead than our ring buffer one?
(In reply to Jesse Barnes from comment #3)
> Wendy or Eero, do we have a profile for this case? If the test is CPU
> bound, does that mean our new execlist submission path has a lot more
> overhead than our ring buffer one?

Sorry Jesse, I do not have the profile for the OglBatch7 case.
Sounds like this is wontfix.
Why not? It's the extra interrupt overhead in the driver, which we can do our best to reduce. In the ideal case, execlists only increases CPU overhead and does not introduce GPU stalls like the ones here.
I don't understand why we wouldn't be interested in fixing the inherent issues with execlists.
Looking at my bdw-u system, the effect here seems to be attributable to the increased latency of execlist submission to an idle GPU compared to legacy - and that we are doing just enough work inside the execlist handler for the GPU to become idle between batches. On my machine I can get comparable timings between execlists and legacy by tuning the driver, but if my hypothesis is valid such tuning will be highly dependent upon the system.
So are we not submitting work to the second submit port? If we do, and we re-submit it on the first port, I'm assuming it won't preempt it, dump the state, and then re-load it when we submit new work to the second port. With some more detail and experimentation hopefully we can get this down to 0% or at least in the noise.
There isn't enough work to submit pairs of batches - between synmark, mesa and the driver, we spend longer creating the batch for execution than it takes the GPU to execute it. This test really does focus on submission overhead. I think I have it mostly down to the noise now, or at least down to the level where even I won't whinge too much. (There is definitely room for improvement wrt the hardware responding to ELSP, so hopefully it will be less of an issue in future gens.)
Wendy provided me data with execlists enabled (default) and disabled (i915.enable_execlists=0 kernel command line option), for full SynMark test-set with recent 3D stack.

Execlists is in no case faster. It's ~50% slower in GL context re-creation test, and in other CPU bound tests it's 5-10% slower:
- Batch5-7 (batching huge triangle strip with ever increasing number of draw calls)
- DrvState (mostly CPU bound GL state change test)
- Multithread (GL calls from multiple threads)

Same difference on both BSW & BDW.
(In reply to Eero Tamminen from comment #11)
> Wendy provided me data with execlists enabled (default) and disabled
> (i915.enable_execlists=0 kernel command line option), for full SynMark
> test-set with recent 3D stack.
>
> Execlists is in no case faster.
>
> It's ~50% slower in GL context re-creation test, and in other CPU bound

Hmm, I've not looked at that. The cause is fairly obvious, since execlists implies even more state to create per context. I think I can safely eliminate most of the extra overhead. Worth filing that as a separate bug (I guess the report got overlooked due to the first regression from enabling ppgtt).

> tests it's 5-10% slower:
> - Batch5-7 (batching huge triangle strip with ever increasing number of draw
> calls)

These are what I've been looking at. There is a lot of low-hanging fruit we can eliminate in execlist submission that brings us back on par. However, looking at the profile of igt/gem_exec_nop and igt/gem_exec_blt, I came to the conclusion that the largest effect is when the GPU is idle (i.e. the GPU takes longer to respond to ELSP than to a TAIL update). Therefore the biggest impact of execlists will differ between machines, depending on just how frequently the GPU stalls waiting for the driver.

> - DrvState (mostly CPU bound GL state change test)
> - Multithread (GL calls from multiple threads)

I presume these will show more or less the same execlist submission overheads as above, except that we will be exercising context switching (and thus ring switching) much more. In the driver we can amalgamate consecutive batches from the same context into a single submission; with more contexts active we will not only have to do more ELSP submissions (i.e. more irq handling and general driver overheads) but also incur more latency from hw context switches.
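A minimal sketch of the amalgamation idea mentioned above - hypothetical code, loosely modelled on what an execlists dequeue could do, not the actual i915 implementation. Consecutive pending requests from the same context collapse into a single ELSP port entry, so a run of N batches from one GL context costs one ELSP write instead of N (and avoids N hardware context switches):

/* Hypothetical sketch, not i915 code: merge consecutive same-context
 * requests from a pending FIFO into the two ELSP ports. */
struct context { int id; };

struct request {
    struct context *ctx;
    struct request *next;   /* singly-linked FIFO of pending requests */
    unsigned int tail;      /* ring tail after this request's commands */
};

struct port { struct context *ctx; unsigned int tail; };

static int dequeue(struct request **queue, struct port elsp[2])
{
    int used = 0;

    while (*queue) {
        struct request *rq = *queue;

        if (used && elsp[used - 1].ctx == rq->ctx) {
            /* Same context as the last port entry: just extend its
             * tail; no extra ELSP slot or context switch needed. */
            elsp[used - 1].tail = rq->tail;
        } else if (used < 2) {
            /* A new context claims the next port. */
            elsp[used].ctx = rq->ctx;
            elsp[used].tail = rq->tail;
            used++;
        } else {
            break;  /* both ports occupied by other contexts */
        }
        *queue = rq->next;
    }
    return used;    /* caller would then write elsp[] to the hardware */
}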
Do we have any update on this old regression? It looks like it may require some non-negligible work.
It's more or less a fundamental difference in hw implementation between execlists and legacy submission. For whatever reason, execution latency from idle is much worse under execlists, and OglBatch7 hits that path continuously (since the overhead in the driver is sufficient that mesa cannot keep the GPU busy). However, some of this is self inflicted as execlists incurs greater execution overhead, both from per-bb/per-ctx instructions and from being irq driven. Those we can improve and have been improving, but that execution latency is hard to hide.
Reference to Chris' patch set: https://patchwork.freedesktop.org/series/21821/
To put the difference into perspective, a couple of microbenchmarks (bdw i7-5557u):

                                        ring    execlists
exec continuous nops on all rings:   1.491us      2.223us
exec sequential nops on each ring:  12.508us     53.682us
single nop + sync:                   9.272us     30.291us
vblank_mode=0 glxgears:            ~11000fps     ~9000fps
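As a quick back-of-the-envelope check (my arithmetic, not part of the original comment), the per-frame cost execlists adds in glxgears lines up with the extra latency of a single synchronous nop, supporting the idea that this is per-submission latency rather than throughput:

#include <stdio.h>

int main(void)
{
    /* Numbers from the table above. */
    double ring_fps = 11000.0, execlists_fps = 9000.0;
    double ring_sync_us = 9.272, execlists_sync_us = 30.291;

    /* Extra frame time under execlists vs. extra sync latency. */
    double frame_delta_us = (1e6 / execlists_fps) - (1e6 / ring_fps);
    double sync_delta_us  = execlists_sync_us - ring_sync_us;

    printf("extra us/frame (glxgears): %.1f\n", frame_delta_us); /* ~20.2 */
    printf("extra us/sync  (nop):      %.1f\n", sync_delta_us);  /* ~21.0 */
    return 0;
}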
Priority changed to High+Critical.
Adding tag into "Whiteboard" field - ReadyForDev

The bug is still active:
* Status is correct
* Platform is included
* Feature is included
* Priority and Severity correctly set
* Logs included
I removed the regression-related tags from this issue. Based on comment 16 and measurements done with OglBatch7 on Ubuntu kernels, one can state that the gap caused by the change from ring to execlists still exists. If one sets Aug-2015+fixes as the baseline (100%), then:

4.2.0-42-generic  - Aug-2015+fixes - 100%
4.10.0-19-generic - Feb-2017+fixes ~  80%

This is in line with the glxgears numbers presented by Chris in comment 16 (~9000fps / ~11000fps ≈ 82%).
Hello everyone, this seems to need more time to make progress, so I'll change the status to NEEDINFO until there is a new update on this case. Once updated, please change it to REOPEN. Thank you.
Minor improvements, but still the disparity remains. However,

commit 79e6770cb1f5e32eb49590edbb794a97d0113aed
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 20 20:55:01 2017 +0000

    drm/i915: Remove obsolete ringbuffer emission for gen8+

    Since removing the module parameter to force selection of ringbuffer
    emission for gen8, the code is defunct. Remove it.

    To put the difference into perspective, a couple of microbenchmarks
    (bdw i7-5557u, 20170324):

                                            ring    execlists
    exec continuous nops on all rings:   1.491us      2.223us
    exec sequential nops on each ring:  12.508us     53.682us
    single nop + sync:                   9.272us     30.291us
    vblank_mode=0 glxgears:            ~11000fps     ~9000fps

    Since the earlier submission, gen8 ringbuffer submission has fallen
    further and further behind in features. So while ringbuffer may hold
    the throughput crown, in terms of interactive latency, execlists is
    much better. Alas, we have no convenient metrics for such, other than
    demonstrating things we can do with execlists but can not using
    legacy ringbuffer submission.

    We have made a few improvements to lowlevel execlists throughput, and
    ringbuffer currently panics on boot! (bdw i7-5557u, 20171026):

                                            ring    execlists
    exec continuous nops on all rings:       n/a      1.921us
    exec sequential nops on each ring:       n/a     44.621us
    single nop + sync:                       n/a     21.953us
    vblank_mode=0 glxgears:                  n/a    ~18500fps

    References: https://bugs.freedesktop.org/show_bug.cgi?id=87725
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Once-upon-a-time-Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120205504.21892-2-chris@chris-wilson.co.uk

commit fb5c551ad510e4a408c105670f89c725ebbfe6c8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 20 20:55:00 2017 +0000

    drm/i915: Remove i915.enable_execlists module parameter

    Execlists and legacy ringbuffer submission are no longer feature
    comparable (execlists now offer greater functionality that should
    overcome their performance hit) and obsoletes the unsafe module
    parameter, i.e. comparing the two modes of execution is no longer
    useful, so remove the debug tool.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> #i915_perf.c
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120205504.21892-1-chris@chris-wilson.co.uk
I wonder what impact resource streamer would have on this? (This is not a request to add RS, as it goes away in newer GENs, but I know that Windows uses it, so I was just wondering how much impact you think it would have had.)
(In reply to Eero Tamminen from comment #22)
> I wonder what impact resource streamer would have on this?
>
> (This is not a request to add RS, as it goes away in newer GENs, but I know
> that Windows uses it, so I was just wondering how much impact you think it
> would have had.)

Overall, none. RS afaik is the same for both submission paths; the effect causing the slowdown is the order-of-magnitude increase in latency, both in the initial submission and in effectively doing a context switch every time (waking up, reprogramming ELSP, which in turn triggers another wakeup a few microseconds later; repeat until steady state is achieved, then start again at the end of the batch).
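To make the mechanism described above concrete, here is a simplified pseudo-driver sketch. All types and register accessors are hypothetical stand-ins, not the i915 API: legacy submission is a single tail write that the GPU polls, while execlists needs an ELSP write plus an interrupt round trip back into the driver for every port refill:

#include <stdbool.h>

struct port { int ctx; unsigned int tail; };

struct engine {
    int pending;             /* requests waiting in the driver */
    int ports_in_flight;     /* ELSP entries the GPU is working on */
};

/* Hypothetical MMIO stubs. */
static void write_tail_register(struct engine *e, unsigned int tail) { (void)e; (void)tail; }
static void write_elsp_register(struct engine *e, struct port *p) { (void)e; (void)p; }
static bool read_context_switch_event(struct engine *e) { (void)e; return false; }

/* Legacy ringbuffer: the GPU polls the ring, so giving it more work is
 * just advancing the tail pointer - no interrupt round trip. */
static void legacy_submit(struct engine *e, unsigned int tail)
{
    write_tail_register(e, tail);
}

/* Execlists: fill up to two ports and write them to ELSP; the hardware
 * will not take more until it reports a context switch back to us. */
static void execlists_submit(struct engine *e)
{
    struct port elsp[2];
    int n = 0;

    /* Fill up to two ports from the pending queue (merging of
     * same-context requests omitted here - see the earlier sketch). */
    while (n < 2 && e->pending) {
        elsp[n].ctx = n + 1;       /* placeholder context id */
        elsp[n].tail = 0;          /* placeholder ring tail */
        e->pending--;
        e->ports_in_flight++;
        n++;
    }

    /* Ports are written high element first; the final write triggers
     * the hardware to load the contexts. */
    while (n--)
        write_elsp_register(e, &elsp[n]);
}

/* Every completed or switched-out context raises an interrupt that
 * brings us back here to refill the ports: wake up, read the event,
 * reprogram ELSP, which triggers another wakeup a few microseconds
 * later - the "repeat until steady state" cost described above. */
static void execlists_irq_handler(struct engine *e)
{
    while (read_context_switch_event(e))
        e->ports_in_flight--;

    if (e->pending)
        execlists_submit(e);
}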