| Field | Value |
|---|---|
| Summary | [SKL] Up to 20% performance regression in GpuTest Triangle, due to 2x higher CPU power usage |
| Product | DRI |
| Component | DRM/Intel |
| Status | CLOSED FIXED |
| Severity | normal |
| Priority | medium |
| Version | DRI git |
| Hardware | Other |
| OS | All |
| Whiteboard | |
| i915 platform | SKL |
| i915 features | |
| Reporter | Eero Tamminen <eero.t.tamminen> |
| Assignee | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| QA Contact | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| CC | david.weinehall, intel-gfx-bugs, srinivas.pandruvada |
Description
Eero Tamminen
2018-07-27 14:28:30 UTC
That window pulled in 4.18-rc1. It is likely to be the changes around setting cpu frequencies around iowaits. Of particular interest, I guess,

commit d09fcecb0c797b884ce65daa37c121a2786bb17b
Merge: f5b7769eb040 6a900f884e3e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed Jun 13 07:24:18 2018 -0700

    Merge tag 'pm-4.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

    Pull more power management updates from Rafael Wysocki:
     "These revert a recent PM core change that introduced a regression,
      fix the build when the recently added Kryo cpufreq driver is selected,
      add support for devices attached to multiple power domains to the
      generic power domains (genpd) framework, add support for iowait
      boosting on systens with hardware-managed P-states (HWP) enabled to
      the intel_pstate driver, modify the behavior of the wakeup_count
      device attribute in sysfs, fix a few issues and clean up some
      ugliness, mostly in cpufreq (core and drivers) and in the cpupower
      utility.

      Specifics:

       - Revert a recent PM core change that attempted to fix an issue
         related to device links, but introduced a regression (Rafael
         Wysocki)

       - Fix build when the recently added cpufreq driver for Kryo
         processors is selected by making it possible to build that driver
         as a module (Arnd Bergmann)

       - Fix the long idle detection mechanism in the out-of-band (ondemand
         and conservative) cpufreq governors (Chen Yu)

       - Add support for devices in multiple power domains to the generic
         power domains (genpd) framework (Ulf Hansson)

       - Add support for iowait boosting on systems with hardware-managed
         P-states (HWP) enabled to the intel_pstate driver and make it use
         that feature on systems with Skylake Xeon processors as it is
         reported to improve performance significantly on those systems
         (Srinivas Pandruvada)

       - Fix and update the acpi_cpufreq, ti-cpufreq and imx6q cpufreq
         drivers (Colin Ian King, Suman Anna, Sébastien Szymanski)

       - Change the behavior of the wakeup_count device attribute in sysfs
         to expose the number of events when the device might have aborted
         system suspend in progress (Ravi Chandra Sadineni)

       - Fix two minor issues in the cpupower utility (Abhishek Goel, Colin
         Ian King)"

    * tag 'pm-4.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
      Revert "PM / runtime: Fixup reference counting of device link suppliers at probe"
      cpufreq: imx6q: check speed grades for i.MX6ULL
      cpufreq: governors: Fix long idle detection logic in load calculation
      cpufreq: intel_pstate: enable boost for Skylake Xeon
      PM / wakeup: Export wakeup_count instead of event_count via sysfs
      PM / Domains: Add dev_pm_domain_attach_by_id() to manage multi PM domains
      PM / Domains: Add support for multi PM domains per device to genpd
      PM / Domains: Split genpd_dev_pm_attach()
      PM / Domains: Don't attach devices in genpd with multi PM domains
      PM / Domains: dt: Allow power-domain property to be a list of specifiers
      cpufreq: intel_pstate: New sysfs entry to control HWP boost
      cpufreq: intel_pstate: HWP boost performance on IO wakeup
      cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
      cpufreq: ti-cpufreq: Use devres managed API in probe()
      cpufreq: ti-cpufreq: Fix an incorrect error return value
      cpufreq: ACPI: make function acpi_cpufreq_fast_switch() static
      cpufreq: kryo: allow building as a loadable module
      cpupower : Fix header name to read idle state name
      cpupower: fix spelling mistake: "logilename" -> "logfilename"

All of these (Ubuntu 16.04) machines use P-state with the "powersave" governor, so the P-state ioboost indeed sounds like a good candidate.

The issue happening only on specific versions of SKL might help narrow the change down better, but they aren't the Xeon processors mentioned in the above commit description.

    +static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] = {
    +       ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
    +       ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
    +       {}
    +};

That's all but mobile parts. Try

    echo 0 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost

Created attachment 140870

Try these steps

1. First check if you have
   /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost == 1
   Unfortunately some Xeon E3 reused DESKTOP CPU ID.

2. If yes, as Chris suggested, try
   #echo 0 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost

If this fixes, try the attached patch.
After this patch should be
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost == 0 if your platform is not a E3.

If you still see it, attach the output of
#acpidump > acpi.out

Also please attach
#cat /proc/cpuinfo

Also do test on clear Linux, which always runs in performance mode?

(In reply to Srinivas Pandruvada from comment #6)
> First check if you have
> /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost == 1
> Unfortunately some Xeon E3 reused DESKTOP CPU ID.

It's "1" on the regressing SKL machines, and "0" on the "U" one that didn't regress.

> If yes, as Chris suggested, try
> #echo 0 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>
> If this fixes

At least the largest GpuTest Triangle regression disappears.

> try the attached patch.

I'm running the full set of benchmarks with this, so it takes a while. I'll report results tomorrow.

> After this patch should be
> /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost == 0 if your platform
> is not a E3.

With the patch it's "0".

(In reply to Srinivas Pandruvada from comment #7)
> Also do test on clear Linux, which always runs in performance mode?

Why? I assume this is not just to test performance governor [1]...

1. We're tracking upstream Git versions, not older distro versions of SW, so we build our own git versions of the (drm-tip) kernel and the rest of the 3D stack, and use those also on Clear Linux (currently we use Clear only on BXT).

2. You need to define what you mean by "Clear Linux". It updates frequently and at least I haven't yet found a way to determine what its packages have actually been built from/with. Its package management doesn't seem to provide the basic information of:
   - which exact sources
   - what patches
   - compile options
   are used to build a package.

[1] We switched to the powersave governor a while ago because it gives slightly better performance in (3D) benchmarks than the Clear Linux default "performance" governor, and didn't regress in any of our (3D) benchmarks. Besides, with Francisco's P-state improvements applied on top of it, (3D / IGP) perf improves very noticeably (only) with powersave.

I tested Unigine-Heaven on SKL 6600 Gaming system, with liquid cooling. I didn't see any impact with the change. I will try GPUTest and check today.

Heaven wasn't impacted in my tests either. Only simpler tests were.

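For anyone repeating the check-then-disable steps from the comments above on several machines, they could be scripted roughly as follows. This is only a sketch: the function names `hwp_boost_state` and `hwp_boost_disable` and the `HWP_BOOST_PATH` override are made up for illustration, and only the sysfs path itself comes from this report.

```shell
#!/bin/sh
# sysfs attribute named in the comments above. The environment override
# is an assumption added so the functions can be tried on a plain file.
HWP_BOOST_PATH="${HWP_BOOST_PATH:-/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost}"

# Print "enabled", "disabled", "unknown" or "unsupported" for the attribute.
hwp_boost_state() {
    path="${1:-$HWP_BOOST_PATH}"
    if [ ! -r "$path" ]; then
        # Attribute missing: kernel without HWP boost, or HWP not in use.
        echo "unsupported"
        return 0
    fi
    case "$(cat "$path")" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}

# Write 0 to the attribute (needs root when used on the real sysfs file).
hwp_boost_disable() {
    path="${1:-$HWP_BOOST_PATH}"
    echo 0 > "$path"
}

# Example use (hypothetical):
#   [ "$(hwp_boost_state)" = "enabled" ] && hwp_boost_disable
```

Writing to the attribute only lasts until reboot, so a benchmark run would need to apply it on each boot before starting the tests.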
(In reply to Eero Tamminen from comment #0)
> There were following 3D benchmark performance regressions on SKL GT4e:
> - 20% GpuTest Triangle (windowed & composited, half screen size)

    ./GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000

(Composited with Unity desktop compositor as it's windowed.)

> - 10% GpuTest Triangle (FullHD fullscreen i.e. slower)

    ./GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000

The test does just a clear and draws a single triangle covering half a window, so this test is as trivial as glxgears.

GpuTest is the benchmark from here:
https://www.geeks3d.com/gputest/

(The score it reports is just the number of frames rendered.)

> - 1-6% SynMark Batch[2-4]

Batch0-7 are a series of tests where about 2M triangles are drawn. In Batch0 they're drawn with a single draw call, in Batch1 they're split to two calls, in Batch2 to 4 calls etc. At or after Batch5, i.e. 2M triangles being split to 32 draw calls, tests start to become CPU bound.

> - 1-4% GfxBench Tessellation (onscreen only), GpuTest Julia32 (windowed), SynMark GSCloth & ShMapPcf (fullscreen)

GSCloth does 72 draw calls per frame, the Tessellation test 50, ShMapPcf 7, and Julia32 does 2 of them. Julia32 is the fastest of these and GSCloth the slowest (GSCloth may be slightly CPU bound).

(In reply to Eero Tamminen from comment #8)
> (In reply to Srinivas Pandruvada from comment #6)
> > If this fixes, try the attached patch.
>
> I'm running the full set of benchmarks with this, so it takes a while.
> I'll report results tomorrow.

Yes, the attached patch fixed all the regressions.

(In reply to Eero Tamminen from comment #0)
> There were no improvements in any 3D benchmarks from this, but I noticed
> small increase in SIMD CPU copy and large increase in SIMD CPU read
> performance. However, that was only on SKL GT2, not GT4e, so it may be
> unrelated.

The fix patch regressed Unigine Valley slightly. On closer look, Valley had actually improved with the original change (by 1.0-1.5%), but only on SKL GT2.

I.e. there was actually one platform with one 3D benchmark where the initial change improved perf slightly (without any noticeable increase in CPU power usage).

-> I assume the SKL GT2 CPU read performance (25%) improvement was also due to the original change, but the improvement is very much a corner case [1].

[1] SKL-i5 6600K has 4 real cores. CPU read performance with a 64MB block improved with the original change *only* in the following cases:
- SSE2 with 4 threads
- SSE4.1 with 6 threads
- AVX1/2 with 6 threads

It didn't improve when the number of running SIMD threads was smaller or larger than those.

commit 01e61a42a5d345a4c0205889498f0c9a0fb9ee8c
Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Date: Mon Jul 30 15:00:29 2018 -0700

    cpufreq: intel_pstate: Limit the scope of HWP dynamic boost platforms

    Dynamic boosting of HWP performance on IO wake showed significant
    improvement to IO workloads. This series was intended for Skylake Xeon
    platforms only and feature was enabled by default based on CPU model
    number.

    But some Xeon platforms reused the Skylake desktop CPU model number. This
    caused some undesirable side effects to some graphics workloads. Since
    they are heavily IO bound, the increase in CPU performance decreased the
    power available for GPU to do its computing and hence decrease in graphics
    benchmark performance.

    For example on a Skylake desktop, GpuTest benchmark showed average FPS
    reduction from 529 to 506.

    This change makes sure that HWP boost feature is only enabled for Skylake
    server platforms by using ACPI FADT preferred PM Profile. If some desktop
    users wants to get benefit of boost, they can still enable boost from
    intel_pstate sysfs attribute "hwp_dynamic_boost".

    Fixes: 41ab43c9c89e (cpufreq: intel_pstate: enable boost for Skylake Xeon)
    Link: https://bugs.freedesktop.org/show_bug.cgi?id=107410
    Reported-by: Eero Tamminen <eero.t.tamminen@intel.com>
    Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
    Reviewed-by: Francisco Jerez <currojerez@riseup.net>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Verified, the large regressions are gone.

(And the small Unigine Valley perf improvement also disappeared as expected.)

Closed as it is verified.
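The fix commit above keys the default on the ACPI FADT preferred PM profile. A rough userspace approximation of that decision, for checking what a given machine would report, might look like the sketch below. It is not the kernel's code: the function names are hypothetical, the sysfs file `/sys/firmware/acpi/pm_profile` is assumed to be exposed by the running kernel, and the mapping of values 4 (Enterprise Server) and 7 (Performance Server) to "server" is taken from the ACPI specification rather than from this report.

```shell
#!/bin/sh
# Classify an ACPI FADT Preferred_PM_Profile value.
# Assumption: 4 = Enterprise Server, 7 = Performance Server (ACPI spec);
# every other value (desktop, mobile, workstation, ...) is "other".
pm_profile_class() {
    case "$1" in
        4|7) echo "server" ;;
        *)   echo "other" ;;
    esac
}

# Read the profile from sysfs (path is an assumption) and classify it.
# Prints "unknown" when the file cannot be read.
pm_profile_class_from_sysfs() {
    path="${1:-/sys/firmware/acpi/pm_profile}"
    if [ ! -r "$path" ]; then
        echo "unknown"
        return 0
    fi
    pm_profile_class "$(cat "$path")"
}

# Example use (hypothetical):
#   pm_profile_class_from_sysfs
```

On a desktop SKL system like the ones in this report, the expectation would be "other", which matches hwp_dynamic_boost reading "0" with the patch applied.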