Bug 108700 - 2x Media CPU power usage, or 15% perf drop when CPU bound GPU use-case is TDP limited
Status: VERIFIED WONTFIX
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-09 14:06 UTC by Eero Tamminen
Modified: 2019-05-02 10:22 UTC

See Also:
i915 platform: CFL, KBL, SKL
i915 features:


Attachments
parameter file for sample_media_transcode (7.47 KB, text/plain)
2018-11-09 14:06 UTC, Eero Tamminen

Description Eero Tamminen 2018-11-09 14:06:55 UTC
Created attachment 142419
parameter file for sample_media_transcode

Test setup:
* Ubuntu 18.04
* git head build of drm-tip kernel
* git head build of Mesa & X and their main deps
* git head build of Intel MediaSDK and their main dependencies

Good drm-tip version:
a4e9f377a9: 2018-11-03 01:29:29: drm-tip: 2018y-11m-03d-01h-28m-29s UTC integration manifest

Bad drm-tip version:
1a4a6dafa1: 2018-11-05 16:07:52: drm-tip: 2018y-11m-05d-16h-07m-05s UTC integration manifest

Test-case 1:
* Run (mostly) CPU bound GfxBench v4 Driver2 test

Test-case 2:
* Run the MediaSDK-provided tool with the attached parameter file (it transcodes 50 H264 streams, lowering video frame rate, bit rate and size, and adding filtering):
  sample_multi_transcode -par inputs.par
* Sum FPS of all streams together
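
Summing the per-stream results can be scripted. The following is a minimal sketch only: it assumes each stream's result has been captured as a text line containing a number followed by "fps". That line format and the function name are assumptions made for illustration, not the tool's documented output.

```python
import re

# Hypothetical line format: each result line carries one "<number> fps" figure.
FPS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*fps", re.IGNORECASE)


def sum_stream_fps(lines):
    """Return the sum of the first FPS figure found on each line."""
    total = 0.0
    for line in lines:
        match = FPS_PATTERN.search(line)
        if match:  # lines without an FPS figure are ignored
            total += float(match.group(1))
    return total


if __name__ == "__main__":
    # Made-up sample lines, not real tool output.
    sample = ["stream 0: 29.5 fps", "stream 1: 30.5 fps", "no result here"]
    print(sum_stream_fps(sample))
```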

Outcome on HW that is TDP limited:
* Test-case 1 performance drops 15%
* Test-case 2 performance drops 5%
* Performance of other CPU bound GPU tests regress also, but less

Outcome on HW that isn't TDP limited:
* RAPL reports marginally larger CPU power consumption for test-case 1
* RAPL reports 1.5-2.5x higher CPU power consumption for test-case 2
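
The report does not say how the RAPL values were collected. As a hedged illustration, average power can be derived from two readings of a cumulative energy counter, such as the `energy_uj` file that the Linux powercap sysfs interface exposes, with `max_energy_range_uj` giving the counter range. The function name and the sample numbers below are assumptions, and reading the sysfs files is left out of the sketch.

```python
def average_power_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power between two cumulative energy readings (microjoules)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # Assume the counter wrapped around exactly once during the interval.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds


if __name__ == "__main__":
    # Made-up readings: 30 J consumed over 2 s.
    print(average_power_watts(1_000_000, 31_000_000, 2.0, 2**32))
```

Comparing this figure for the good and bad kernels over the same test run gives the kind of ratio quoted above.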

There were no performance improvements in other tests we run on these devices.

Large CPU power usage increase without perf change is visible on:
* SKL i5-6600K (GT2)
* KBL i7-7500U (GT2)
* KBL i7-8809G (GT2)
(And on pre-production CFL-S device we had)

TDP limit caused performance to drop (with increased CPU usage) on:
* KBL i7-7567U (GT3e)
* SKL i7-6770HQ (GT4e)

There was one device where performance increases with the much higher CPU power usage, but it's only by 1-2% and only in test-case 2:
* SKL i5-6260U (GT3e)

Neither perf nor power usage changed on BXT devices, so I guess this change concerns only Core devices.

On BDW GT2 the CPU usage increase was clearly smaller than on GEN9 Core devices (and there was no noticeable performance change).  MediaSDK doesn't support older devices, so I don't have data from them.
Comment 1 Eero Tamminen 2018-11-09 14:10:52 UTC
Drm-tip seems to have rebased from v4.19 to v4.20-rc1 during that 1 day interval.
Comment 2 Jani Saarinen 2018-11-09 14:14:07 UTC
Yeah, can you bisect Eero ;)
Comment 3 Eero Tamminen 2018-11-09 14:27:35 UTC
(In reply to Jani Saarinen from comment #2)
> Yeah, can you bisect Eero ;)

I don't have anything set up that would automate bisecting kernel well enough (reboots, boot failures, handling drm-tip rebases etc).

However, if you have in mind few commits in that range, I could manually check whether they give good or bad performance.

And I can of course (internally) provide ready-made SW setup and reserve suitable HW for whomever is going to look into this.
Comment 4 Chris Wilson 2018-11-13 16:26:35 UTC
I haven't yet tried the exact tests as cited here, all I've found so far is a remarkable improvement from 08e3e21a24d23db6a4adca90f7cb40d69e09d35c ("drm/i915: kill resource streamer support") in the -rc1 merge.

The report would suggest we were looking for a pstate or scheduler change.
Comment 5 Eero Tamminen 2018-11-14 11:57:09 UTC
(In reply to Chris Wilson from comment #4)
> I haven't yet tried the exact tests as cited here, all I've found so far is
> a remarkable improvement from 08e3e21a24d23db6a4adca90f7cb40d69e09d35c
> ("drm/i915: kill resource streamer support") in the -rc1 merge.

In CPU bound Driver2 GL tests?  On which device?
Comment 6 Chris Wilson 2018-11-19 14:20:26 UTC
(In reply to Eero Tamminen from comment #5)
> (In reply to Chris Wilson from comment #4)
> > I haven't yet tried the exact tests as cited here, all I've found so far is
> > a remarkable improvement from 08e3e21a24d23db6a4adca90f7cb40d69e09d35c
> > ("drm/i915: kill resource streamer support") in the -rc1 merge.
> 
> In CPU bound Driver2 GL tests?  On which device?

kbl + glxgears; basic context switch exercise.

In light of the rc1 controversy, do you have spectre/meltdown mitigations enabled on your test systems?
Comment 7 Eero Tamminen 2018-11-19 17:36:06 UTC
We don't specifically enable any mitigations, just use drm-tip kernel defaults.
It seems to have enabled an additional one when it was rebased to 4.20-rc1:
  Spectre V2 : Spectre v2 cross-process SMT mitigation: Enabling STIBP


Threading in the listed test-cases:

* MediaSDK 50 stream transcode case has 250 threads

* I thought GfxBench Driver2 doesn't thread, as only a single CPU is busy, but it actually uses 3 threads, 2 of which use as much CPU as they can. Apparently the kernel just sticks them on the same core, so they seem to run as hyperthreads.

-> I think the SMT mitigation is very likely the cause of the drop, rather than i915.


Could you point out suitable drm-tip commit IDs before and after enabling the mitigation so that I could verify it?
Comment 8 Eero Tamminen 2018-12-05 16:28:04 UTC
STIBP fixes in drm-tip v4.20-rc5 fix the CPU bound 3D cases performance (test-case 1).

However, neither those fixes nor disabling the Spectre mitigation completely from the kernel command line (checked by David) has ANY impact on the Media performance regression (test-case 2).

David will try to bisect the Media perf regression.
Comment 9 David Weinehall 2018-12-11 15:42:00 UTC
Reverting 01bad1c6896db021db82042e71c2bf1f97cc026b seems to resolve at least part of the performance regression; the CPU power usage seems to have been a separate issue.
Comment 10 Eero Tamminen 2018-12-13 14:02:51 UTC
(In reply to David Weinehall from comment #9)
> Reverting 01bad1c6896db021db82042e71c2bf1f97cc026b seems to resolve at least
> part of the performance regression

https://cgit.freedesktop.org/drm/drm-tip/commit/?id=01bad1c6896db021db82042e71c2bf1f97cc026b

----------------------------------------------------------------
Author:Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Committer: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpuidle: poll_state: Revise loop termination condition

If need_resched() returns "false", breaking out of the loop in
poll_idle() will cause a new idle state to be selected, so in fact
it usually doesn't make sense to spin in it longer than the target
residency of the second state.  [Note that the "polling" state is
used only if there is at least one "real" state defined in addition
to it, so the second state is always there.]  On the other hand,
breaking out of it early (say in case the next state is disabled)
shouldn't hurt as it is polling anyway.

For this reason, make the loop in poll_idle() break if the CPU has
been spinning longer than the target residency of the second state
(the "polling" state can only be state[0]).
----------------------------------------------------------------
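
The termination condition described in the commit message can be modelled roughly as follows. This is a simplified Python model written for illustration, not the kernel's C code: the function signature, the injected clock and the return value are all assumptions.

```python
def poll_idle(need_resched, clock_ns, limit_ns):
    """Spin until need_resched() is true or limit_ns has elapsed.

    limit_ns stands in for the target residency of the second idle state.
    Returns True if a reschedule request ended the loop, False on timeout.
    """
    start = clock_ns()
    while not need_resched():
        if clock_ns() - start > limit_ns:
            # Spun longer than the next state's target residency: give up,
            # so that a new idle state gets selected.
            return False
    return True


if __name__ == "__main__":
    # Fake clock advancing 10 ns per call; no reschedule is ever requested.
    ticks = iter(range(0, 1000, 10))
    print(poll_idle(lambda: False, lambda: next(ticks), 50))
```

With the time limit in place, a CPU that is never asked to reschedule leaves the polling loop and can be moved to a real idle state instead of spinning indefinitely.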


> the CPU power usage seems to have been a separate issue.

Before the issue, CPU core(s) power usage was less than GPU power usage, and afterwards it was >2x GPU power usage, in a (mostly) *GPU limited* Media test-case.

The STIBP fix didn't improve the >2x CPU power usage increase in the Media test-case (2) at all, so I think at this point it's more interesting than the small perf drop in it.  It's likely to explain the rest of the perf drop too.
Comment 11 Lakshmi 2019-02-22 08:08:20 UTC
Sorry for the delay...
Rafael, any comments from you?
Comment 12 Eero Tamminen 2019-03-13 17:09:28 UTC
Note: the kernel had been scheduling parallel media processes very unevenly (I think most visible on GT4e SkullCanyon).  This appears to have been fixed somewhere in early February (and AFAIK didn't happen earlier in 2018). If one just sums the FPS reported by the individual processes together (as the use-case in the first comment suggests), scheduling that is not fully parallel could distort the results significantly.
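
A small sketch of why the summed figure can mislead, using made-up numbers. Each run is a hypothetical `(frames, start_s, end_s)` tuple; the function names and the alternative metric (total frames over the wall-clock span) are illustrative assumptions, not necessarily how the measurement was later fixed.

```python
def summed_fps(runs):
    """Sum of per-process FPS; each run is (frames, start_s, end_s)."""
    return sum(frames / (end - start) for frames, start, end in runs)


def aggregate_fps(runs):
    """Total frames divided by the wall-clock span covering all runs."""
    total_frames = sum(frames for frames, _, _ in runs)
    span = max(end for _, _, end in runs) - min(start for _, start, _ in runs)
    return total_frames / span


if __name__ == "__main__":
    # Same work (200 frames in 10 s of wall-clock time) scheduled two ways.
    parallel = [(100, 0.0, 10.0), (100, 0.0, 10.0)]
    serialized = [(100, 0.0, 5.0), (100, 5.0, 10.0)]
    for name, runs in (("parallel", parallel), ("serialized", serialized)):
        print(name, summed_fps(runs), aggregate_fps(runs))
```

In this example the summed FPS doubles when the two processes run one after the other, although the overall throughput is unchanged.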
Comment 13 Lakshmi 2019-03-27 10:19:10 UTC
Comments from Rafael J. Wysocki are added here:

On 3/7/2019 2:00 PM, Vudum, Lakshminarayana wrote:
> Hi, Can you comment on this bug? Is your commit causing performance issues?

It shouldn't.

That said if reverting it causes the perf numbers to go up, then there is some impact, but it can only be very indirect.

This commit only affects idle CPUs and what it does is to cause them to spend less time doing idle polling in one go. Effectively, this may allow them to enter real idle (non-polling) states more often and spend more time in non-polling idle states overall.  If they are woken up from those idle states (instead of just interrupting idle polling), you may see some increased latency in very specific workloads.  Maybe the workload in question is one of these.

However, if it affects perf adversely, it will also improve energy-efficiency.

To verify, it might be instructive to run the workload under turbostat (as "turbostat <command>") and compare the results with and without the commit in question reverted.
Comment 14 Eero Tamminen 2019-03-27 13:55:15 UTC
(In reply to Eero Tamminen from comment #8)
> STIBP fixes in drm-tip v4.20-rc5 fix the CPU bound 3D cases performance
> (test-case 1).
> 
> However, those fixes, nor disabling Spectre mitigation completely from
> kernel command line (checked by David), do NOT have any impact on the Media
> performance regression (test-case 2).

After the above, the regressed Media test-case "perf" started to fluctuate between the original and regressed performance.

At the start of January, "perf" stopped fluctuating:
* KBL GT3e - perf stays at what it was before the regression
* SKL GT4e - perf stays in the regressed state

At start of February, KBL GT3e "perf" dropped to lower than it was before, while SKL GT4e perf remained at January level.  HOWEVER, that was because of:

(In reply to Eero Tamminen from comment #12)
> Note: kernel had been scheduling parallel media processes very badly/unevenly
> (I think most visible on GT4e SkullCanyon).  This appears to have been fixed
> somewhere in early February (and AFAIK didn't happen earlier in 2018). If
> one just sums FPS reported by the individual processes together (like is
> suggested in first comment use-case), not fully parallel scheduling could
> distort the results significantly.

I.e. this issue could actually have been result of kernel messing up media thread scheduling, and that being hidden by the STIBP mess, but I don't have data when the media thread scheduling actually broke, so maybe we'll just forget about this bug?  => WONTFIX/WORKSFORME?

(I've of course fixed perf measuring to be less impacted by broken scheduling, but I'm not going to track down when it changed, as I can't easily automate its bisecting, and it works OK now. If somebody else is interested to do that, I can help with the test-cases though.)
Comment 15 Francesco Balestrieri 2019-04-29 07:03:15 UTC
> I.e. this issue could actually have been result of kernel messing up media 
> thread scheduling, and that being hidden by the STIBP mess, but I don't have
> data when the media thread scheduling actually broke, so maybe we'll just 
> forget about this bug?  => WONTFIX/WORKSFORME?

Sounds reasonable, closing as WONTFIX.

