Created attachment 142419 [details]
parameter file for sample_media_transcode

Test setup:
* Ubuntu 18.04
* git head build of drm-tip kernel
* git head build of Mesa & X and their main deps
* git head build of Intel MediaSDK and its main dependencies

Good drm-tip version:
a4e9f377a9: 2018-11-03 01:29:29: drm-tip: 2018y-11m-03d-01h-28m-29s UTC integration manifest

Bad drm-tip version:
1a4a6dafa1: 2018-11-05 16:07:52: drm-tip: 2018y-11m-05d-16h-07m-05s UTC integration manifest

Test-case 1:
* Run (mostly) CPU bound GfxBench v4 Driver2 test

Test-case 2:
* Run MediaSDK provided tool with the attached parameter file (does 50 streams which lower H264 video frame & bit rates, size and adds filtering):
  sample_multi_transcode -par inputs.par
* Sum FPS of all streams together

Outcome on HW that is TDP limited:
* Test-case 1 performance drops 15%
* Test-case 2 performance drops 5%
* Performance of other CPU bound GPU tests also regresses, but less

Outcome on HW that isn't TDP limited:
* RAPL reports marginally larger CPU power consumption for test-case 1
* RAPL reports 1.5-2.5x higher CPU power consumption for test-case 2

There were no performance improvements in other tests we run on these devices.

Large CPU power usage increase without perf change is visible on:
* SKL i5-6600K (GT2)
* KBL i7-7500U (GT2)
* KBL i7-8809G (GT2)
(And on pre-production CFL-S device we had)

TDP limit caused performance to drop (with increased CPU usage) on:
* KBL i7-7567U (GT3e)
* SKL i7-6770HQ (GT4e)

There was one device where performance increases with the much higher CPU power usage, but it's only by 1-2% and only in test-case 2:
* SKL i5-6260U (GT3e)

Neither perf nor power usage changed on BXT devices, so I guess this change concerns only Core devices.

On BDW GT2 the CPU usage increase was clearly smaller than on GEN9 Core devices (and there was no noticeable performance change). MediaSDK doesn't support older devices, so I don't have data from them.
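The RAPL figures above are derived from CPU energy counters. A minimal sketch of how average power can be computed from two counter readings, assuming the Linux powercap sysfs interface (`energy_uj` and `max_energy_range_uj` under `/sys/class/powercap/intel-rapl:0`). The function names and the choice of package domain 0 are illustrative assumptions, not necessarily how the numbers in this report were collected:

```python
from pathlib import Path

# Assumption: RAPL package domain 0 as exposed by the powercap framework.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(name: str) -> int:
    """Read one integer value (microjoules) from the RAPL sysfs directory."""
    return int((RAPL_DIR / name).read_text().strip())


def average_power_w(start_uj: int, end_uj: int, max_range_uj: int, seconds: float) -> float:
    """Average power in watts between two energy counter readings.

    The counter wraps around at roughly max_range_uj, so an end value
    smaller than the start value is treated as a single wraparound.
    """
    delta = end_uj - start_uj
    if delta < 0:
        delta += max_range_uj
    return delta / 1e6 / seconds
```

In use, one would call `read_uj("energy_uj")` before and after the test run, time the run, and pass `read_uj("max_energy_range_uj")` as the wrap limit. Reading these files may require root on recent kernels.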
Drm-tip seems to have rebased from v4.19 to v4.20-rc1 during that 1 day interval.
Yeah, can you bisect Eero ;)
(In reply to Jani Saarinen from comment #2)
> Yeah, can you bisect Eero ;)

I don't have anything set up that would automate bisecting the kernel well enough (reboots, boot failures, handling drm-tip rebases etc).

However, if you have a few commits in that range in mind, I could manually check whether they give good or bad performance.

And I can of course (internally) provide a ready-made SW setup and reserve suitable HW for whoever is going to look into this.
I haven't yet tried the exact tests as cited here, all I've found so far is a remarkable improvement from 08e3e21a24d23db6a4adca90f7cb40d69e09d35c ("drm/i915: kill resource streamer support") in the -rc1 merge. The report would suggest we were looking for a pstate or scheduler change.
(In reply to Chris Wilson from comment #4)
> I haven't yet tried the exact tests as cited here, all I've found so far is
> a remarkable improvement from 08e3e21a24d23db6a4adca90f7cb40d69e09d35c
> ("drm/i915: kill resource streamer support") in the -rc1 merge.

In CPU bound Driver2 GL tests? On which device?
(In reply to Eero Tamminen from comment #5)
> (In reply to Chris Wilson from comment #4)
> > I haven't yet tried the exact tests as cited here, all I've found so far is
> > a remarkable improvement from 08e3e21a24d23db6a4adca90f7cb40d69e09d35c
> > ("drm/i915: kill resource streamer support") in the -rc1 merge.
>
> In CPU bound Driver2 GL tests? On which device?

kbl + glxgears; basic context switch exercise.

In light of the rc1 controversy, do you have spectre/meltdown mitigations enabled on your test systems?
We don't specifically enable any mitigations, just use drm-tip kernel defaults. It seems to have enabled an additional one when it was rebased to 4.20-rc1:
  Spectre V2 : Spectre v2 cross-process SMT mitigation: Enabling STIBP

Threading in the listed test-cases:
* MediaSDK 50 stream transcode case has 250 threads
* I thought GfxBench Driver2 doesn't thread, as only a single CPU is busy, but it actually uses 3 threads of which 2 use as much CPU as they can, and apparently the kernel just sticks them to the same core, so they seem hyperthreaded

-> I think that the SMT mitigation is a very likely cause for the drop, instead of i915.

Could you point out suitable drm-tip commit IDs before and after enabling the mitigation so that I could verify it?
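Besides the boot log, whether STIBP is active can be checked from the kernel's sysfs vulnerability report. A minimal sketch, assuming the status line in `/sys/devices/system/cpu/vulnerabilities/spectre_v2` is a comma-separated list of fields. The exact wording of that line differs between kernel versions, so the parsing below (and the helper name `stibp_state`) is an assumption for illustration:

```python
from pathlib import Path

# Sysfs file where the kernel reports its Spectre v2 mitigation status.
SPECTRE_V2 = Path("/sys/devices/system/cpu/vulnerabilities/spectre_v2")


def stibp_state(status_line: str) -> str:
    """Return the STIBP setting found in a spectre_v2 status line.

    Handles both a bare "STIBP" field (reported as "enabled") and a
    "STIBP: <mode>" field; returns "not reported" if neither is present.
    """
    for field in status_line.split(","):
        field = field.strip()
        if field.startswith("STIBP"):
            mode = field[len("STIBP"):].lstrip(":").strip()
            return mode or "enabled"
    return "not reported"


if __name__ == "__main__" and SPECTRE_V2.exists():
    print(stibp_state(SPECTRE_V2.read_text()))
```

Comparing this output between the good and bad kernels would show whether the mitigation state differs, independent of any i915 change.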
STIBP fixes in drm-tip v4.20-rc5 fix the performance of the CPU bound 3D cases (test-case 1).

However, neither those fixes nor disabling Spectre mitigation completely from the kernel command line (checked by David) has any impact on the Media performance regression (test-case 2).

David will try to bisect the Media perf regression.
Reverting 01bad1c6896db021db82042e71c2bf1f97cc026b seems to resolve at least part of the performance regression; the CPU power usage seems to have been a separate issue.
(In reply to David Weinehall from comment #9)
> Reverting 01bad1c6896db021db82042e71c2bf1f97cc026b seems to resolve at least
> part of the performance regression

https://cgit.freedesktop.org/drm/drm-tip/commit/?id=01bad1c6896db021db82042e71c2bf1f97cc026b
----------------------------------------------------------------
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Committer: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpuidle: poll_state: Revise loop termination condition

If need_resched() returns "false", breaking out of the loop in poll_idle() will cause a new idle state to be selected, so in fact it usually doesn't make sense to spin in it longer than the target residency of the second state. [Note that the "polling" state is used only if there is at least one "real" state defined in addition to it, so the second state is always there.] On the other hand, breaking out of it early (say in case the next state is disabled) shouldn't hurt as it is polling anyway.

For this reason, make the loop in poll_idle() break if the CPU has been spinning longer than the target residency of the second state (the "polling" state can only be state[0]).
----------------------------------------------------------------

> the CPU power usage seems to have been a separate issue.

Before the issue, CPU core(s) power usage was less than GPU power usage, and afterwards it was >2x GPU power usage, in a (mostly) *GPU limited* Media test-case.

The STIBP fix didn't improve the >2x CPU power usage increase in the Media test-case (test-case 2) at all, so I think at this point it's more interesting than the small perf drop in it. It's likely to explain the rest of the perf drop too.
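The loop termination condition described in the commit message quoted above can be modelled roughly as follows. This is a toy simulation in Python, not the kernel's C code: `poll_idle_model`, its parameters and the fixed time step are all hypothetical names and simplifications chosen for illustration.

```python
def poll_idle_model(wakeup_at_ns: int, target_residency_ns: int, step_ns: int = 100):
    """Simulate an idle polling loop and report why it stopped.

    wakeup_at_ns        -- simulated time at which need_resched() would turn true
    target_residency_ns -- target residency of the second ("real") idle state,
                           used as the polling time limit
    step_ns             -- simulated duration of one polling iteration

    Returns (elapsed_ns, reason), where reason is "need_resched" if work
    arrived while polling, or "time_limit" if the loop gave up so that a
    new idle state can be selected.
    """
    elapsed = 0
    while True:
        if elapsed >= wakeup_at_ns:
            return elapsed, "need_resched"
        if elapsed > target_residency_ns:
            return elapsed, "time_limit"
        elapsed += step_ns
```

With a 2000 ns limit, a wakeup after 500 ns ends the loop by `need_resched`, while a wakeup after 10000 ns ends it by `time_limit`, after which the CPU may enter a deeper idle state and pay that state's exit latency on wakeup.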
Sorry for the delay... Rafael, any comments from you?
Note: the kernel had been scheduling parallel media processes very unevenly (I think most visible on GT4e SkullCanyon). This appears to have been fixed somewhere in early February (and AFAIK didn't happen earlier in 2018). If one just sums the FPS reported by the individual processes together (as suggested in the use-case of the first comment), not fully parallel scheduling could distort the results significantly.
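To illustrate the distortion: if each process reports FPS over its own runtime, summing those values overstates throughput whenever the processes did not actually run side by side. A minimal sketch with made-up numbers. The `(frames, start_s, end_s)` tuples and both function names are assumptions for illustration, not the output format of the MediaSDK tool nor the fix actually applied to the measurements:

```python
def summed_fps(runs):
    """Sum of per-process FPS, each computed over that process's own runtime."""
    return sum(frames / (end - start) for frames, start, end in runs)


def wall_clock_fps(runs):
    """Total frames divided by the time from the first start to the last end."""
    total_frames = sum(frames for frames, _, _ in runs)
    first_start = min(start for _, start, _ in runs)
    last_end = max(end for _, _, end in runs)
    return total_frames / (last_end - first_start)


# Two processes, 300 frames each, as (frames, start_s, end_s).
parallel = [(300, 0.0, 10.0), (300, 0.0, 10.0)]
serialized = [(300, 0.0, 5.0), (300, 5.0, 10.0)]

print(summed_fps(parallel), wall_clock_fps(parallel))
print(summed_fps(serialized), wall_clock_fps(serialized))
```

Both schedules finish 600 frames in 10 seconds, but the summed figure doubles in the serialized case, which is the kind of distortion uneven scheduling can introduce.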
Comments from Rafael J. Wysocki are added here:

On 3/7/2019 2:00 PM, Vudum, Lakshminarayana wrote:
> Hi, Can you comment on this bug? Is your commit causing performance issues?

It shouldn't.

That said, if reverting it causes the perf numbers to go up, then there is some impact, but it can only be very indirect.

This commit only affects idle CPUs and what it does is to cause them to spend less time doing idle polling in one go. Effectively, this may allow them to enter real idle (non-polling) states more often and spend more time in non-polling idle states overall. If they are woken up from those idle states (instead of just interrupting idle polling), you may see some increased latency in very specific workloads. Maybe the workload in question is one of these.

However, if it affects perf adversely, it will also improve energy-efficiency.

To verify, it might be instructive to run the workload under turbostat (as "turbostat <command>") and compare the results with and without the commit in question reverted.
(In reply to Eero Tamminen from comment #8)
> STIBP fixes in drm-tip v4.20-rc5 fix the CPU bound 3D cases performance
> (test-case 1).
>
> However, those fixes, nor disabling Spectre mitigation completely from
> kernel command line (checked by David), do NOT have any impact on the Media
> performance regression (test-case 2).

After the above, the regressed Media test-case "perf" started to fluctuate between the original and regressed performance.

At the start of January "perf" stopped fluctuating:
* KBL GT3e - perf stays at what it was before the regression
* SKL GT4e - perf stays in the regressed state

At the start of February, KBL GT3e "perf" dropped to lower than it was before, while SKL GT4e perf remained at the January level. HOWEVER, that was because of:

(In reply to Eero Tamminen from comment #12)
> Note: kernel had been scheduling parallel media processes very badly/unevenly
> (I think most visible on GT4e SkullCanyon). This appears to have been fixed
> somewhere in early February (and AFAIK didn't happen earlier in 2018). If
> one just sums FPS reported by the individual processes together (like is
> suggested in first comment use-case), not fully parallel scheduling could
> distort the results significantly.

I.e. this issue could actually have been the result of the kernel messing up media thread scheduling, and that being hidden by the STIBP mess, but I don't have data on when the media thread scheduling actually broke, so maybe we'll just forget about this bug? => WONTFIX/WORKSFORME?

(I've of course fixed perf measuring to be less impacted by broken scheduling, but I'm not going to track down when it changed, as I can't easily automate bisecting it, and it works OK now. If somebody else is interested in doing that, I can help with the test-cases though.)
> I.e. this issue could actually have been result of kernel messing up media
> thread scheduling, and that being hidden by the STIBP mess, but I don't have
> data when the media thread scheduling actually broke, so maybe we'll just
> forget about this bug? => WONTFIX/WORKSFORME?

Sounds reasonable, closing as WONTFIX.