Bug 100572 - [SKL dmc] Headless mode media transcoding is 20-30% slower comparing to connected monitor use case
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other Linux (All)
Importance: medium major
Assignee: Imre Deak
QA Contact: Intel GFX Bugs mailing list
URL: https://patchwork.freedesktop.org/pat...
Whiteboard: ReadyForDev
Keywords:
Duplicates: 102563 102589
Depends on:
Blocks:
 
Reported: 2017-04-05 00:06 UTC by Dmitry Rogozhkin
Modified: 2018-04-20 11:11 UTC (History)
CC List: 6 users

See Also:
i915 platform: BXT, KBL, SKL
i915 features: display/HDMI, firmware/dmc


Attachments
dmesg.0.after_boot (153.32 KB, text/plain)
2017-04-05 00:25 UTC, Dmitry Rogozhkin
dmesg.1.headless (161.93 KB, text/plain)
2017-04-05 00:26 UTC, Dmitry Rogozhkin
dmesg.2.hdmi-a-ON (193.02 KB, text/plain)
2017-04-05 00:26 UTC, Dmitry Rogozhkin

Description Dmitry Rogozhkin 2017-04-05 00:06:26 UTC
Running a single media transcoding workload with yamitranscode in headless mode and with a connected monitor gives the following results (4 consecutive runs each):

# <elapsed time>;<user-space time>;<kernel-space time>;<CPU%>
# connected monitor:
18.41;3.54;3.29;37%
17.09;3.53;3.08;38%
18.29;3.55;3.06;36%
17.06;3.55;3.04;38%
# headless:
25.86;3.55;3.38;26%
22.40;3.53;3.35;30%
23.51;3.56;3.28;29%
22.33;3.55;3.31;30%

So, headless mode is ~25% slower. Note that the issue disappears in headless mode if the connector is forced on:
echo on | sudo tee /sys/class/drm/card0-HDMI-A-1/status
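
As a rough check of the ~25% figure, the elapsed times listed above can be averaged and compared. This is only a minimal sketch: the numbers are copied from the two result tables, while the function and variable names are illustrative and not part of any tool used in this report.

```python
from statistics import mean

# Elapsed times in seconds (first column), copied from the results above.
connected = [18.41, 17.09, 18.29, 17.06]
headless = [25.86, 22.40, 23.51, 22.33]


def throughput_drop(baseline, slower):
    """Fraction of throughput lost when the same work takes longer."""
    return 1.0 - mean(baseline) / mean(slower)


def elapsed_increase(baseline, slower):
    """Fractional increase in mean elapsed time."""
    return mean(slower) / mean(baseline) - 1.0


print(f"throughput drop:  {throughput_drop(connected, headless):.1%}")
print(f"elapsed increase: {elapsed_increase(connected, headless):.1%}")
```

Expressed as lost throughput, headless mode comes out roughly a quarter slower, which matches the ~25% quoted; expressed as elapsed time, the same runs take about a third longer.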

Command line used to reproduce:
yamitranscode \
  -i 1920x1080p_29.97_10mb_h264_cabac.264 \
  -o out.264 \
  -W 1920 -H 1080 \
  -c AVC \
  -ipperiod 1 \
  -intraperiod 30 \
  --rcmode CBR \
  -b 5000

System configuration used:
- SKL NUC6i7KYK (Skull Canyon)
- CPU Turbo OFF in BIOS
- CentOS 7.3 x64, minimal
- cmdline options: intel_pstate=disable i915.enable_rc6=0 intel_idle.max_cstate=1
- Pinned frequency settings:
cpupower frequency-set --governor performance
cpupower frequency-set --min 2000000
cpupower frequency-set --max 2000000
echo 700 > /sys/class/drm/card0/gt_min_freq_mhz
echo 700 > /sys/class/drm/card0/gt_max_freq_mhz
echo 700 > /sys/class/drm/card0/gt_boost_freq_mhz

- Linux kernel built from drm-tip taken on March 10, 2017:
commit 2095bbc9d234d71fa44fd9181597431e2653058c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 10 15:03:46 2017 +0000

    drm-tip: 2017y-03m-10d-15h-03m-17s UTC integration manifest
Comment 1 Dmitry Rogozhkin 2017-04-05 00:23:50 UTC
I am attaching 3 dmesg shots corresponding to the following command sequence:

# dmesg > dmesg.0.after_boot
#
# ./run_libyami.sh
3596 frame decoded, fps = 143.44. fps after 5 frames = 143.55.
transcode done
# ./run_libyami.sh
3596 frame decoded, fps = 160.02. fps after 5 frames = 160.18.
transcode done
#
# dmesg > dmesg.1.headless
#
# cat /sys/class/drm/card0-HDMI-A-1/status
disconnected
#
# echo on > /sys/class/drm/card0-HDMI-A-1/status
# cat /sys/class/drm/card0-HDMI-A-1/status
connected
#
# ./run_libyami.sh
3596 frame decoded, fps = 197.47. fps after 5 frames = 197.67.
transcode done
# ./run_wrap.libyami.sh
3596 frame decoded, fps = 209.68. fps after 5 frames = 209.91.
transcode done
#
# dmesg > dmesg.2.hdmi-a-ON
Comment 2 Dmitry Rogozhkin 2017-04-05 00:25:40 UTC
Created attachment 130680 [details]
dmesg.0.after_boot
Comment 3 Dmitry Rogozhkin 2017-04-05 00:26:22 UTC
Created attachment 130681 [details]
dmesg.1.headless
Comment 4 Dmitry Rogozhkin 2017-04-05 00:26:44 UTC
Created attachment 130682 [details]
dmesg.2.hdmi-a-ON
Comment 5 Dmitry Rogozhkin 2017-04-05 00:27:58 UTC
The theory is that the bug is related to the DMC FW. We need to check whether the bug disappears if the DMC is not loaded. I did not try that myself, sorry.
Comment 6 Tvrtko Ursulin 2017-04-05 06:57:26 UTC
I tried not loading the DMC firmware and can confirm that the issue is not present in that case.

Also, it is possible to reproduce this with the default kernel config (no pinning is required) simply by running igt/benchmarks/gem_latency -n 0, in which case the perf difference between the two setups was ~8x in my testing.
Comment 7 Imre Deak 2017-04-07 09:09:04 UTC
(In reply to Tvrtko Ursulin from comment #6)
> I tried not loading the DMC firmware and can confirm that the issue is not
> present in that case.
> 
> Also, it is possible to reproduce this in the default kernel config (no
> pinning is required) simply with igt/benchmarks/gem_latency -n 0 in which
> case the perf difference between the two setups was ~8x in my testing.

One possibility is that DC6 enables deeper system level power states and this causes latency elsewhere. What are the PC state residencies shown by powertop or the kernel's tools/power/x86/turbostat when DMC is loaded and when it is not?

What's the effect of limiting max_cstates to 0 (and having DMC loaded)?

Another problem could be that the GPU is trying to access the display (maybe checking scan line counts or something?).

Does /sys/kernel/debug/dri/0/i915_dmc_info show any transitions during the test when DMC is loaded?
Comment 8 Imre Deak 2017-04-07 10:41:24 UTC
(In reply to Imre Deak from comment #7)
> (In reply to Tvrtko Ursulin from comment #6)
> > I tried not loading the DMC firmware and can confirm that the issue is not
> > present in that case.
> > 
> > Also, it is possible to reproduce this in the default kernel config (no
> > pinning is required) simply with igt/benchmarks/gem_latency -n 0 in which
> > case the perf difference between the two setups was ~8x in my testing.
> 
> One possibility is that DC6 enables deeper system level power states and
> this causes latency elsewhere. What are the PC state residencies shown by
> powertop or the kernel's tools/power/x86/turbostat when DMC is loaded and
> not?
> 
> What's the effect of limiting max_cstates to 0 (and having DMC loaded)?
> 
> An other problem could be that the GPU is trying to access the display,
> (maybe checking scan line counts or something?).
> 
> Does /sys/kernel/debug/dri/0/i915_dmc_info show any transitions during the
> test when DMC is loaded?

Also the actual RC6 residency reported by powertop/turbostat would be interesting. (even though it's disabled)
Comment 9 Tvrtko Ursulin 2017-04-07 10:59:43 UTC
(In reply to Imre Deak from comment #7)
> (In reply to Tvrtko Ursulin from comment #6)
> > I tried not loading the DMC firmware and can confirm that the issue is not
> > present in that case.
> > 
> > Also, it is possible to reproduce this in the default kernel config (no
> > pinning is required) simply with igt/benchmarks/gem_latency -n 0 in which
> > case the perf difference between the two setups was ~8x in my testing.
> 
> One possibility is that DC6 enables deeper system level power states and
> this causes latency elsewhere. What are the PC state residencies shown by
> powertop or the kernel's tools/power/x86/turbostat when DMC is loaded and
> not?

1. With DMC, idle system, no displays:

PKG is in PC2, CPU is in C7, GPU is in RC6.

When looking in i915_dmc_info I can see that the "DC3 -> DC5" transition counter increases exactly by one each second. The "DC5 -> DC6" counter is zero.

If I now run gem_latency -n 0:

"DC3 -> DC5" counter starts increasing by ~2k per second.

PKG is not in any deeper states now.
CPU split between C2/C3/C6/C7 is approx. 42/2/10/40%.
GPU is 0% RC6.

Benchmark goes slow.

2. Now I force turn on a display (echo on | tee /sys/class/drm/card0-HDMI-A-1/status).

"DC3 -> DC5" transition counter stops increasing.

PKG is still in PC2, CPU in C7 and GPU in RC6.

Benchmark is now normal speed and while it is running PKG is not in any low power states, RC6 is 0% and CPU C2/C3/C6/C7 is approx. 52/0/0/25%.

3. DMC not loaded, idle system, no displays

PKG is now in PC7 (not PC2 as above!), CPU is C7, GPU is RC6.

gem_latency is now normal speed with power states like above.

Out of curiosity I tried forcing the display on in this config. That makes the PKG go to ~3% PC2, rest in PC7. Turning it off again brings it back to <0.5% PC2 and the rest in PC7.
 
> What's the effect of limiting max_cstates to 0 (and having DMC loaded)?

No effect on benchmark speed or reported "DC3 -> DC5" transitions.

> An other problem could be that the GPU is trying to access the display,
> (maybe checking scan line counts or something?).

You mean something behind the covers or explicitly by i915?
 
> Does /sys/kernel/debug/dri/0/i915_dmc_info show any transitions during the
> test when DMC is loaded?

Yes, see above. :)
Comment 10 Tvrtko Ursulin 2017-04-07 11:01:35 UTC
(In reply to Tvrtko Ursulin from comment #9)
> (In reply to Imre Deak from comment #7)
> > (In reply to Tvrtko Ursulin from comment #6)
> > > I tried not loading the DMC firmware and can confirm that the issue is not
> > > present in that case.
> > > 
> > > Also, it is possible to reproduce this in the default kernel config (no
> > > pinning is required) simply with igt/benchmarks/gem_latency -n 0 in which
> > > case the perf difference between the two setups was ~8x in my testing.
> > 
> > One possibility is that DC6 enables deeper system level power states and
> > this causes latency elsewhere. What are the PC state residencies shown by
> > powertop or the kernel's tools/power/x86/turbostat when DMC is loaded and
> > not?
> 
> 1. With DMC, idle system, no displays:
> 
> PKG is in PC2, CPU is in C7, GPU is in RC6.
> 
> When looking in i915_dmc_info I can see that the "DC3 - > DC5" transition
> counter increases exactly by one each second. "DC5 -> DC6 counter is zero".
> 
> If I now run gem_latency -n 0:
> 
> "DC3 -> DC5" counter starts increasing by ~2k per second.
> 
> PKG is not any deeper states now.

"not in" !

> CPU split between C2/C3/C6/C7 is approx. 42/2/10/40%.
> GPU is 0% RC6.
> 
> Benchmark goes slow.
> 
> 2. Now I force turn on a display (echo on | 
> tee /sys/class/drm/card0-HDMI-A-1/status).
> 
> "DC3 -> DC5" transition counter stops increasing.
> 
> PKG is still in PC2, CPU in C7 and GPU in RC6.
> 
> Benchmark is not normal speed and while it is running PKG is not in any low

s/not/now/ :( So it is normal speed now!

> power states, RC6 is 0% and CPU C2/C3/C6/C7 is approx 52/0/0/25%.
> 
> 3. DMC not loaded, idle system, no displays
> 
> PKG is now in PC7 (not PC2 as above!), CPU is C7, GPU is RC6.
> 
> gem_latency is now normal speed with power states like above.
> 
> Out of curiosity I tried forcing the display on in this config. That makes
> the PKG go to ~3% PC2, rest in PC7. Turning it off again brings it back to
> <0.5% PC2 and the rest in PC7.
>  
> > What's the effect of limiting max_cstates to 0 (and having DMC loaded)?
> 
> No effect on benchmark speed or reported "DC3 -> DC5" transitions.
> 
> > An other problem could be that the GPU is trying to access the display,
> > (maybe checking scan line counts or something?).
> 
> You mean something behind the covers or explicitly by i915?
>  
> > Does /sys/kernel/debug/dri/0/i915_dmc_info show any transitions during the
> > test when DMC is loaded?
> 
> Yes, see above. :)
Comment 11 Imre Deak 2017-04-07 12:30:03 UTC
(In reply to Tvrtko Ursulin from comment #9)
> (In reply to Imre Deak from comment #7)
> > (In reply to Tvrtko Ursulin from comment #6)
> > > I tried not loading the DMC firmware and can confirm that the issue is not
> > > present in that case.
> > > 
> > > Also, it is possible to reproduce this in the default kernel config (no
> > > pinning is required) simply with igt/benchmarks/gem_latency -n 0 in which
> > > case the perf difference between the two setups was ~8x in my testing.
> > 
> > One possibility is that DC6 enables deeper system level power states and
> > this causes latency elsewhere. What are the PC state residencies shown by
> > powertop or the kernel's tools/power/x86/turbostat when DMC is loaded and
> > not?
> 
> 1. With DMC, idle system, no displays:
> 
> PKG is in PC2, 

PC2 vs. PC7 without DMC is weird, no idea what the reason is. Normally you should reach PC8+ with display off, but for that you'd also need to enable power saving for other devices too.

> CPU is in C7, GPU is in RC6.

Was this also when booting with 'intel_idle.max_cstate=1 i915.enable_rc6=0'? Those should prevent C7 and RC6. Dmitry saw the problem even with these settings, but it would be good to double check on your side too, since RC6 would be the most logical root cause. Did you check the CPU cstate also when you ran with max_cstate=0?

> When looking in i915_dmc_info I can see that the "DC3 - > DC5" transition
> counter increases exactly by one each second. "DC5 -> DC6 counter is zero".

Err, forgot to say that reading that file itself increases the counter (if DC states are enabled, i.e. the display is off) :/ So you should sample only at the beginning and end of the test and deduct the increment caused by the sampling.
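
The sampling advice above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `DC3 -> DC5 count: N` line format is assumed rather than taken from this report, the helper names are hypothetical, and the number of counter increments caused by the reads themselves is left as a parameter because the thread does not pin it down.

```python
import re

# Assumed line format of the debugfs file, e.g. "DC3 -> DC5 count: 123".
COUNTER_RE = re.compile(r"DC3\s*->\s*DC5 count:\s*(\d+)")


def parse_dc3_dc5(text):
    """Extract the DC3 -> DC5 transition counter from a dump of the file."""
    match = COUNTER_RE.search(text)
    if match is None:
        raise ValueError("DC3 -> DC5 counter not found")
    return int(match.group(1))


def corrected_delta(start_text, end_text, sampling_increments=1):
    """Transitions during the test, minus those caused by reading the file.

    sampling_increments is an assumption: how many increments the
    samples themselves contributed between the two readings.
    """
    raw = parse_dc3_dc5(end_text) - parse_dc3_dc5(start_text)
    return max(raw - sampling_increments, 0)


# Hypothetical dumps taken before and after a test run.
before = "fw loaded: yes\nDC3 -> DC5 count: 100\nDC5 -> DC6 count: 0\n"
after = "fw loaded: yes\nDC3 -> DC5 count: 2101\nDC5 -> DC6 count: 0\n"
print(corrected_delta(before, after))
```

On a real system the two dumps would come from reading /sys/kernel/debug/dri/0/i915_dmc_info once before and once after the benchmark, rather than polling it while the test runs.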

> 
> If I now run gem_latency -n 0:
> 
> "DC3 -> DC5" counter starts increasing by ~2k per second.

Same here as above, in case you now sampled with higher freq.

> 
> PKG is not any deeper states now.
> CPU split between C2/C3/C6/C7 is approx. 42/2/10/40%.
> GPU is 0% RC6.
> 
> Benchmark goes slow.
> 
> 2. Now I force turn on a display (echo on | 
> tee /sys/class/drm/card0-HDMI-A-1/status).
> 
> "DC3 -> DC5" transition counter stops increasing.

Right, display-on keeps it in DC0.

> 
> PKG is still in PC2, CPU in C7 and GPU in RC6.
> 
> Benchmark is not normal speed and while it is running PKG is not in any low
> power states, RC6 is 0% and CPU C2/C3/C6/C7 is approx 52/0/0/25%.

Hm, so now we are constantly in DC0 and so DMC should be completely inactive (it only ever activates when either entering DC5 or DC6). Yet there is a slow-down, seemingly caused by it.

> 
> 3. DMC not loaded, idle system, no displays
> 
> PKG is now in PC7 (not PC2 as above!), CPU is C7, GPU is RC6.
> 
> gem_latency is now normal speed with power states like above.
> 
> Out of curiosity I tried forcing the display on in this config. That makes
> the PKG go to ~3% PC2, rest in PC7. Turning it off again brings it back to
> <0.5% PC2 and the rest in PC7.
>  
> > What's the effect of limiting max_cstates to 0 (and having DMC loaded)?
> 
> No effect on benchmark speed or reported "DC3 -> DC5" transitions.

As above, did you double check if the cstate limit is really in effect?

> 
> > An other problem could be that the GPU is trying to access the display,
> > (maybe checking scan line counts or something?).
> 
> You mean something behind the covers or explicitly by i915?

It was just a wild guess, not sure at all if it's possible. The kernel shouldn't do anything while the display is off, unless you have runtime PM enabled (if /sys/bus/pci/devices/0000\:00\:02.0/power/control contains 'auto'). Ville said that X does the scan line readout when rendering to the front buffer, but that shouldn't be the case here. Yeah, it could still be something under the hood by the HW itself; DC transitions would be an indication for that.

>  
> > Does /sys/kernel/debug/dri/0/i915_dmc_info show any transitions during the
> > test when DMC is loaded?
> 
> Yes, see above. :)

So no good idea still. One other thing to try would be to limit the package state to PC2 in BIOS if there is an option for that and boot with DMC; would show if somehow the PC7 vs. PC2 difference itself would be the cause.
Comment 12 Chris Wilson 2017-04-07 12:35:49 UTC
(In reply to Imre Deak from comment #11)
>  Ville said that X does the scan line readout when rendering to the
> front buffer, but that shouldn't be the case here.

Only if root and gen <= 8 && !(vlv || chv), fwiw.
Comment 13 Imre Deak 2017-04-07 12:42:12 UTC
(In reply to Chris Wilson from comment #12)
> (In reply to Imre Deak from comment #11)
> >  Ville said that X does the scan line readout when rendering to the
> > front buffer, but that shouldn't be the case here.
> 
> Only if root and gen <= 8 && !(vlv || chv), fwiw.

Ah ok, so that's ruled out then.
Comment 14 Tvrtko Ursulin 2017-04-07 13:00:09 UTC
(In reply to Imre Deak from comment #11)
> (In reply to Tvrtko Ursulin from comment #9)
> > (In reply to Imre Deak from comment #7)
> > > (In reply to Tvrtko Ursulin from comment #6)
> > > > I tried not loading the DMC firmware and can confirm that the issue is not
> > > > present in that case.
> > > > 
> > > > Also, it is possible to reproduce this in the default kernel config (no
> > > > pinning is required) simply with igt/benchmarks/gem_latency -n 0 in which
> > > > case the perf difference between the two setups was ~8x in my testing.
> > > 
> > > One possibility is that DC6 enables deeper system level power states and
> > > this causes latency elsewhere. What are the PC state residencies shown by
> > > powertop or the kernel's tools/power/x86/turbostat when DMC is loaded and
> > > not?
> > 
> > 1. With DMC, idle system, no displays:
> > 
> > PKG is in PC2, 
> 
> PC2 vs. PC7 without DMC is weird, no idea for the reason. Normally you
> should reach PC8+ with display off, but for that you'd also need to enable
> power saving for other devices too.
> 
> > CPU is in C7, GPU is in RC6.
> 
> Was this also by booting with 'intel_idle.max_cstate=1 i915.enable_rc6=0'?
> Those should prevent C7 and RC6.. Dmitry saw the problem even with these
> settings, but would be good to double check on your side too, since RC6

I did not bother running with disabled rc6 since that does not seem to have any effect on all this.

> would be the most logical root cause. Did you check the CPU cstate also when
> you ran with max_cstate=0?

Yeah I did check, PKG and CPU were both in top states then.

> > When looking in i915_dmc_info I can see that the "DC3 - > DC5" transition
> > counter increases exactly by one each second. "DC5 -> DC6 counter is zero".
> 
> Err, forgot to say that reading that file itself increases the counter (if
> DC states are enabled, so display is off):/ So you should sample only at the
> beginning and end of the test and deduct the increment caused by the
> sampling.
> 
> > 
> > If I now run gem_latency -n 0:
> > 
> > "DC3 -> DC5" counter starts increasing by ~2k per second.
> 
> Same here as above, in case you now sampled with higher freq.

I was sampling once per second so the ~2k per second increase still sounds valid.
 
> > PKG is not any deeper states now.
> > CPU split between C2/C3/C6/C7 is approx. 42/2/10/40%.
> > GPU is 0% RC6.
> > 
> > Benchmark goes slow.
> > 
> > 2. Now I force turn on a display (echo on | 
> > tee /sys/class/drm/card0-HDMI-A-1/status).
> > 
> > "DC3 -> DC5" transition counter stops increasing.
> 
> Right, display-on keeps it in DC0.
> 
> > 
> > PKG is still in PC2, CPU in C7 and GPU in RC6.
> > 
> > Benchmark is not normal speed and while it is running PKG is not in any low
> > power states, RC6 is 0% and CPU C2/C3/C6/C7 is approx 52/0/0/25%.
> 
> Hm, so now we are constantly in DC0 and so DMC should be completely inactive
> (it only ever activates when either entering DC5 or DC6). Yet there is a
> slow-down, seemingly caused by it.
> 
> > 
> > 3. DMC not loaded, idle system, no displays
> > 
> > PKG is now in PC7 (not PC2 as above!), CPU is C7, GPU is RC6.
> > 
> > gem_latency is now normal speed with power states like above.
> > 
> > Out of curiosity I tried forcing the display on in this config. That makes
> > the PKG go to ~3% PC2, rest in PC7. Turning it off again brings it back to
> > <0.5% PC2 and the rest in PC7.
> >  
> > > What's the effect of limiting max_cstates to 0 (and having DMC loaded)?
> > 
> > No effect on benchmark speed or reported "DC3 -> DC5" transitions.
> 
> As above, did you double check if the cstate limit is really in effect?

Yep.
 
> > > An other problem could be that the GPU is trying to access the display,
> > > (maybe checking scan line counts or something?).
> > 
> > You mean something behind the covers or explicitly by i915?
> 
> It was just a wild guess, not sure at all if it's possible. The kernel
> shouldn't do anything while the display is off, unless you have runtime PM
> enabled (if /sys/bus/pci/devices/0000\:00\:02.0/power/control contains
> 'auto') Ville said that X does the scan line readout when rendering to the
> front buffer, but that shouldn't be the case here. Yea, could be still
> something under the hood by the HW itself, DC transitions would be an
> indication for that.

I got 'on' in /sys/bus/pci/devices/0000\:00\:02.0/power/control. And no X running or anything. Just fbcon but no displays connected. Should I try without fbcon perhaps?
 
> > > Does /sys/kernel/debug/dri/0/i915_dmc_info show any transitions during the
> > > test when DMC is loaded?
> > 
> > Yes, see above. :)
> 
> So no good idea still. One other thing to try would be to limit the package
> state to PC2 in BIOS if there is an option for that and boot with DMC; would
> show if somehow the PC7 vs. PC2 difference itself would be the cause.

Will try.
Comment 15 Tvrtko Ursulin 2017-04-07 13:35:03 UTC
More datapoints, but no idea if they will provide any clues.

I tried compiling out fbcon and it does change things:

  1. DMC loaded, idle system, no display, fbcon: PKG is in PC2
  2. DMC loaded, idle system, no display, no fbcon: PKG is in PC7

Continuing with no fbcon and trying the HDMI force on - this now does not restore the performance to normal.

And the i915_dmc_info "DC3 -> DC5" counter is still increasing rapidly as in the case with DMC, fbcon and no displays.

Without DMC and no fbcon performance is good. (PKG PC7)
Comment 16 Tvrtko Ursulin 2017-04-07 13:49:13 UTC
i915.disable_power_wells=0 also fixes the performance.
Comment 17 Tvrtko Ursulin 2017-04-13 12:13:34 UTC
Workaround:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 33fb11cc5acc..b5c262f629f7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3203,6 +3203,10 @@ i915_gem_idle_work_handler(struct work_struct *work)
 
        if (INTEL_GEN(dev_priv) >= 6)
                gen6_rps_idle(dev_priv);
+
+       if (IS_SKYLAKE(dev_priv) && dev_priv->csr.dmc_payload)
+               intel_display_power_put(dev_priv, POWER_DOMAIN_MODESET);
+
        intel_runtime_pm_put(dev_priv);
 out_unlock:
        mutex_unlock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 313cdff7c6dd..d5ea4ce47306 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -838,6 +838,9 @@ static void i915_gem_mark_busy(const struct intel_engine_cs *engine)
        if (INTEL_GEN(dev_priv) >= 6)
                gen6_rps_busy(dev_priv);
 
+       if (IS_SKYLAKE(dev_priv) && dev_priv->csr.dmc_payload)
+               intel_display_power_get(dev_priv, POWER_DOMAIN_MODESET);
+
        queue_delayed_work(dev_priv->wq,
                           &dev_priv->gt.retire_work,
                           round_jiffies_up_relative(HZ));
Comment 18 Tvrtko Ursulin 2017-04-13 14:44:48 UTC
And a first draft of an IGT: https://patchwork.freedesktop.org/patch/150314/
Comment 19 Dmitry Rogozhkin 2017-04-13 20:41:49 UTC
It looks to me that the workaround:
  echo on | tee /sys/class/drm/card0-HDMI-A-1/status
works only for HDMI-A-1. I have tried to fake DP-1 and performance did not go up. And in automated validation on our side someone faked HDMI-A-2 and performance also did not go up. Maybe this will give some clue...
Comment 20 Sunil Kamath 2017-04-18 08:26:32 UTC
Some details and further tests to be tried:

Although avoiding the DMC fw load makes the issue go away, the issue is not limited to DMC fw (as shown by other test results).
The real way to address this is to gracefully handle the power infrastructure change for a headless system.

Does “i915.disable_power_wells=0” fix the performance issue on DP too?
Comment 21 Tvrtko Ursulin 2017-04-18 08:45:01 UTC
(In reply to Sunil Kamath from comment #20)
> Some details and further tests to be tried are:
> 
> Though by avoiding DMC fw load, issue is going away issue is not lilted just
> to DMC fw. (proved by other tests results)

Which results are you referring to? So far I haven't been able to reproduce the performance regression without the DMC loaded. But yeah, it is possible it is a driver-firmware interaction of some sort.

> Real way to address this is to gracefully handle power infra change for a
> headless system.
> 
> “i915.disable_power_wells=0” fixes performance issue on DP too?

It fixes the issue in headless mode and headless mode cannot have DP connected so not sure what you mean by this?

What Dmitry observed in #19 is that forcing the DP connector on does not have the same workaround effect as forcing the HDMI on does. As far as I could see, there is some difference in the display code paths, where forcing the DP on does not actually turn on the power well(s), whereas forcing the HDMI connector on does. There is also a difference in system behaviour depending on the presence or absence of fbcon, but this is just a secondary effect of a modeset happening or not when the connector is forced.
Comment 22 Sunil Kamath 2017-04-18 09:52:21 UTC
> > Some details and further tests to be tried are:
> > 
> > Though by avoiding DMC fw load, issue is going away issue is not lilted just
> > to DMC fw. (proved by other tests results)
> 
> Which results are you referring to? So far I haven't been able to reproduce
> the performance regression without the DMC loaded. But yeah, it is possible
> it is an driver - firmware interaction of some sorts.

I am referring to the results with “i915.disable_power_wells=0”.

> > Real way to address this is to gracefully handle power infra change for a
> > headless system.
> > 
> > “i915.disable_power_wells=0” fixes performance issue on DP too?
> 
> It fixes the issue in headless mode and headless mode cannot have DP
> connected so not sure what you mean by this?

Here I'm referring to comment #19: whether the same scenario was tested with i915.disable_power_wells=0. But I got further clarity from your comments below.

> What Dmitry observed in #19 is that forcing the DP connector on hasn't got
> the same workaround effect as forcing the HDMI on does. As far as I could
> see, there is some difference in the display code paths, where forcing the
> DP on does not actually turn on the power well(s). Contrary to forcing the
> HDMI connector to on which does this. There is also a difference in system
> behaviour depending on presence or absence of fbcon, but this is just a
> secondary effect of modeset happening or not when the connector is forced.

This is the area where I was seeking input from Imre, to clarify further the power infrastructure handling for a headless system.
Comment 23 Dmitry Rogozhkin 2017-04-18 21:42:46 UTC
WA from comment 17 works for me performance-wise: performance restored to the expected level.
Comment 24 Sunil Kamath 2017-05-01 11:17:14 UTC
Summarizing various experiments:
- Animesh/Tvrtko tried various experiments to root cause the problem.
- As discussed before, the major issues seem to be:
1. Handling the power infrastructure for a headless system.
2. How to make a headless system truly headless.

For the 2nd item, the option below was tried:
i915.disable_display=1, and with this the issue goes away.

In addition to the above, various other experiments were done.
Imre to confirm whether the experiments already done are sufficient to answer both 1 and 2.
Comment 25 Tvrtko Ursulin 2017-05-02 13:18:45 UTC
One of the experiments Animesh asked me to do was to look at the state of the DC_STATE_EN register at runtime.

It looked like it had reverted to a value other than what was programmed by i915. On top of that, it looked impossible to manually modify it at runtime.

The value it is reverting to was DC_STATE_EN_UPTO_DC6, regardless of whether the driver has programmed DC_STATE_EN_UPTO_DC5 or DC_STATE_EN_UPTO_DC5 | DC_STATE_EN_UPTO_DC6.

This, coupled with a comment in gen9_write_dc_state, makes me suspect that the DMC is doing things behind the driver's back.
Comment 26 Imre Deak 2017-05-02 16:00:05 UTC
(In reply to Tvrtko Ursulin from comment #25)
> One of the experiments Animesh asked me to do was to look at the state of
> the DC_STATE_EN register at runtime.
> 
> It looked that it had reverted to a value other than what was programmed by
> i915. On top of that, it looked impossible to manually modify it at runtime.
> 
> The value it is reverting to was DC_STATE_EN_UPTO_DC6, regardless of whether
> the driver has programmed DC_STATE_EN_UPTO_DC5 or DC_STATE_EN_UPTO_DC5 |
> DC_STATE_EN_UPTO_DC6.
> 
> This coupled with a comment in gen9_write_dc_state makes me suspicious
> whether DMC is not doing things behind the drivers back?

We don't normally use DC5 on SKL, only DC6. Did you boot with i915.enable_dc set to something non-default?
Comment 27 Tvrtko Ursulin 2017-05-02 17:42:16 UTC
(In reply to Imre Deak from comment #26)
> (In reply to Tvrtko Ursulin from comment #25)
> > One of the experiments Animesh asked me to do was to look at the state of
> > the DC_STATE_EN register at runtime.
> > 
> > It looked that it had reverted to a value other than what was programmed by
> > i915. On top of that, it looked impossible to manually modify it at runtime.
> > 
> > The value it is reverting to was DC_STATE_EN_UPTO_DC6, regardless of whether
> > the driver has programmed DC_STATE_EN_UPTO_DC5 or DC_STATE_EN_UPTO_DC5 |
> > DC_STATE_EN_UPTO_DC6.
> > 
> > This coupled with a comment in gen9_write_dc_state makes me suspicious
> > whether DMC is not doing things behind the drivers back?
> 
> We don't use DC5 on SKL normally only DC6. Did you boot with i915.enable_dc
> set to something non-default?

Not via the param, but on Animesh's suggestion I had gen9_write_dc_state modified to only ever program DC_STATE_EN_UPTO_DC5 if a non-zero state was passed in. Even after that, the read-back from DC_STATE_EN at runtime was DC_STATE_EN_UPTO_DC6. But during programming the read-back was getting the programmed value. So I guess some time after the initial programming it gets modified by someone. And I couldn't find any other place in i915 which would do it. Which is why I thought it could be DMC.
Comment 28 Imre Deak 2017-05-02 18:06:20 UTC
(In reply to Tvrtko Ursulin from comment #27)
> (In reply to Imre Deak from comment #26)
> > (In reply to Tvrtko Ursulin from comment #25)
> > > One of the experiments Animesh asked me to do was to look at the state of
> > > the DC_STATE_EN register at runtime.
> > > 
> > > It looked that it had reverted to a value other than what was programmed by
> > > i915. On top of that, it looked impossible to manually modify it at runtime.
> > > 
> > > The value it is reverting to was DC_STATE_EN_UPTO_DC6, regardless of whether
> > > the driver has programmed DC_STATE_EN_UPTO_DC5 or DC_STATE_EN_UPTO_DC5 |
> > > DC_STATE_EN_UPTO_DC6.
> > > 
> > > This coupled with a comment in gen9_write_dc_state makes me suspicious
> > > whether DMC is not doing things behind the drivers back?
> > 
> > We don't use DC5 on SKL normally only DC6. Did you boot with i915.enable_dc
> > set to something non-default?
> 
> Not via the param but on Animesh'es suggestion I had the gen9_write_dc_state
> modified to only ever program DC_STATE_EN_UPTO_DC5, if non zero state was
> passed in. Even after that the read back from DC_STATE_EN at runtime was
> DC_STATE_EN_UPTO_DC6. But during programming the read-back was getting the
> programmed value. So I guess some time after the initial programming it gets
> modified by someone. And I couldn't find any other place in i915 which would
> do it. Which is why I thought it could be DMC.

Yes, as I understand there is a firmware bug that prevents using DC5 on SKL. After exiting a low-power DC state it will always "restore" DC_STATE_EN_UPTO_DC6 to DC_STATE_EN regardless of what was programmed there.
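
The read-back mismatch discussed above can be illustrated with a small decoder. This is a hedged sketch only: the bit values are assumptions recalled from the i915 register definitions of that era and should be checked against i915_reg.h, and the helper names are hypothetical.

```python
# Assumed field encoding of DC_STATE_EN (verify against i915_reg.h).
DC_STATE_EN_UPTO_DC5 = 1 << 0
DC_STATE_EN_UPTO_DC6 = 2 << 0
DC_STATE_EN_UPTO_DC5_DC6_MASK = 0x3

_NAMES = {
    0: "DC states disabled",
    DC_STATE_EN_UPTO_DC5: "up to DC5",
    DC_STATE_EN_UPTO_DC6: "up to DC6",
    DC_STATE_EN_UPTO_DC5 | DC_STATE_EN_UPTO_DC6: "DC5 and DC6 bits both set",
}


def describe_dc_state(value):
    """Name the DC5/DC6 field of a raw DC_STATE_EN value."""
    return _NAMES[value & DC_STATE_EN_UPTO_DC5_DC6_MASK]


def was_overridden(programmed, read_back):
    """True if the DC5/DC6 field read back differs from what was written."""
    mask = DC_STATE_EN_UPTO_DC5_DC6_MASK
    return (programmed & mask) != (read_back & mask)


# The case described above: DC5 programmed, DC6 read back later.
print(describe_dc_state(DC_STATE_EN_UPTO_DC6))
print(was_overridden(DC_STATE_EN_UPTO_DC5, DC_STATE_EN_UPTO_DC6))
```

A check like this only reports that the field changed; it cannot tell whether the firmware or some other agent rewrote the register.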
Comment 29 Sunil Kamath 2017-05-04 07:58:04 UTC
Quick query:
when no display is connected, isn't the display expected to go to the lowest possible power state, i.e. DC6?
Comment 30 Imre Deak 2017-05-04 10:08:40 UTC
(In reply to Sunil Kamath from comment #29)
> quick query:
> when we do not have any display connected, isn't it expected that display
> goes to lowest possible power state? that's DC6?

Yes, if all the conditions allow it: DC6 is allowed in DC_STATE_EN, PW2 is disabled, and all display outputs are disabled. Note that DMC will signal an actual DC6 transition (via its DC6 transition debug counter) only if other peripherals on the system also allow it. That is, you need to run powertop --auto-tune and make sure no other device blocks deeper PC states (it's PC9+ on SKL AFAIR). Examples of such blockers are an SSD without ALPM enabled, an active network link, or a USB device.
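
One way to check whether DMC is actually reporting DC6 entries is to sample its transition counters before and after an idle period. The following is a minimal sketch, assuming the counters are exposed through an i915 debugfs file in lines of the form `DC5 -> DC6 count: N`. The path, the file name and the line format are assumptions that may differ between kernel versions, and the helper names are hypothetical.

```python
import re

# Assumed debugfs location; needs root and a mounted debugfs.
DMC_INFO_PATH = "/sys/kernel/debug/dri/0/i915_dmc_info"

# Assumed line format, e.g. "DC5 -> DC6 count: 12".
_COUNT_RE = re.compile(r"^(DC\d)\s*->\s*(DC\d) count:\s*(\d+)\s*$")


def parse_dc_counts(text):
    """Return {(from_state, to_state): count} for every counter line found."""
    counts = {}
    for line in text.splitlines():
        m = _COUNT_RE.match(line.strip())
        if m:
            counts[(m.group(1), m.group(2))] = int(m.group(3))
    return counts


def dc_count_delta(before, after):
    """Per-transition difference between two parsed samples."""
    return {key: after[key] - before.get(key, 0) for key in after}


def read_dc_counts(path=DMC_INFO_PATH):
    """Read and parse the counters from the (assumed) debugfs file."""
    with open(path) as f:
        return parse_dc_counts(f.read())
```

If the DC5 -> DC6 delta stays at zero across an idle period with no outputs enabled, something on the system is probably still blocking the deeper package states.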
Comment 31 Tvrtko Ursulin 2017-05-04 10:53:51 UTC
We also seem to keep DPLL0 enabled at all times since, according to Ville, if we turn it off DMC turns it back on again. Is it possible that DPLL0 being on prevents DC6?
Comment 32 Imre Deak 2017-05-04 12:01:28 UTC
(In reply to Tvrtko Ursulin from comment #31)
> We also seem to keep DPLL0 enabled at all times since according to Ville if
> we turn it off, DMC turns it back on again. It is possible DPLL0 being on
> prevents DC6?

No, DPLL0 being on all the time from the driver's POV is normal. DMC will turn it off when transitioning to DC5/6 and turn it back on when exiting these states.
Comment 33 Tvrtko Ursulin 2017-05-05 11:03:47 UTC
It seems that the number of DC state transitions done by the DMC (0x80030) is correlated to the activity on the GT IIR. 

Initially I was trying to correlate with the number of GT interrupts, which had a correlation, but also strangely the number of DC transitions could be higher than the number of interrupts (up to exactly double!). But if we consider that one GT interrupt can have multiple GT IIR accesses (one write from GT, one write from the CPU to clear it), then it starts making more sense.

And in fact nosing around the DMC fw I can spot MMIO addresses of the GT IIR registers (among other things).

So the question, I think, is: why is the DMC fw looking at the GT IIR registers, and even if it has to for some reason, why is it triggering DC state transitions in the case of pure GT command submission with no display activity whatsoever?
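
The "up to exactly double" observation fits a simple bound. If each GT interrupt causes at most two GT IIR accesses (one write from the GT, one from the CPU to clear it) and each access triggers at most one DC transition, then the number of transitions cannot exceed twice the number of interrupts. A small sketch of that consistency check follows; the function name is hypothetical and the one-transition-per-access assumption is only the hypothesis under discussion, not an established fact.

```python
def consistent_with_iir_hypothesis(gt_interrupts, dc_transitions,
                                   accesses_per_irq=2):
    """True if the transition count fits the bound implied by the hypothesis.

    Assumes at most `accesses_per_irq` GT IIR accesses per interrupt and
    at most one DC state transition per access.
    """
    return 0 <= dc_transitions <= accesses_per_irq * gt_interrupts
```

A measured transition count above the bound would argue against GT IIR accesses being the only trigger.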
Comment 34 Elizabeth 2017-06-22 19:54:17 UTC
Is there any update in this case? If so, could you share the information. Thank you.
Comment 35 Imre Deak 2017-06-26 09:22:03 UTC
(In reply to Elizabeth from comment #34)
> Is there any update in this case? If so, could you share the information.
> Thank you.

This is currently blocked on a DMC register context corruption issue during DC state transitions, which is tracked in the internal bugtracker.
Comment 36 Elizabeth 2017-06-28 16:44:30 UTC
(In reply to Imre Deak from comment #35)
> (In reply to Elizabeth from comment #34)
> > Is there any update in this case? If so, could you share the information.
> > Thank you.
> 
> This is currently blocked on a DMC register context corruption issue during
> DC state transitions, which is tracked in the internal bugtracker.

Changing to REOPEN. Thanks.
Comment 37 Chris Wilson 2017-09-07 17:47:59 UTC
*** Bug 102589 has been marked as a duplicate of this bug. ***
Comment 38 Chris Wilson 2017-09-07 19:32:50 UTC
*** Bug 102563 has been marked as a duplicate of this bug. ***
Comment 39 Chris Wilson 2017-09-12 13:21:28 UTC
See also https://patchwork.freedesktop.org/series/30196/ for the same patch after seeing CI hit this problem on bxt.
Comment 40 Dmitry Rogozhkin 2017-09-12 15:40:51 UTC
Chris,

Do you know which other platforms are affected? For now we are aware of SKL and BXT, what about KBL and CFL?

Another question: does the patch we are considering merging disable DMC entirely, or does it do something else? Why don't we want to disable it fully until the DMC FW is fixed?
Comment 41 Chris Wilson 2017-09-12 15:54:37 UTC
Honestly, the presumption is that the dmc must be good for something. Completely blacklisting the firmware until it is fixed is one of the options. The question is just whether anyone will notice if we do use the nuclear option.
Comment 42 valtteri.rantala 2017-09-22 08:23:26 UTC
This issue is also reproducible on KBL. On CFL it did not occur.
Comment 43 valtteri.rantala 2017-09-25 11:26:33 UTC
Correcting my last comment: it also happens on CFL.
Comment 44 Jani Saarinen 2017-12-05 15:11:12 UTC
Now that SKL DMC 1.27 has been merged, do we still need a patch on top of it?
Comment 45 Dmitry Rogozhkin 2017-12-05 20:34:59 UTC
Yes, and we already have this patch on the mailing list: https://patchwork.freedesktop.org/series/24017/#rev6. This revision fixes the perf issue on all platforms where we are aware of it (the previous revision excluded SKL).
Comment 46 Chris Wilson 2017-12-08 10:34:11 UTC
commit b68763741aa29f2541c7ca58bcb0c2bb6cb5f449
Author:     Tvrtko Ursulin <tvrtko.ursulin@intel.com>
AuthorDate: Tue Dec 5 13:28:54 2017 +0000
Commit:     Imre Deak <imre.deak@intel.com>
CommitDate: Fri Dec 8 12:23:07 2017 +0200

    drm/i915: Restore GT performance in headless mode with DMC loaded
    
    It seems that the DMC likes to transition between the DC states a lot when
    there are no connected displays (no active power domains) during command
    submission.
    
    This activity on DC states has a negative impact on the performance of the
    chip with huge latencies observed in the interrupt handlers and elsewhere.
    Simple tests like igt/gem_latency -n 0 are slowed down by a factor of
    eight.
    
    Work around it by introducing a new power domain named
    POWER_DOMAIN_GT_IRQ, associated with the "DC off" power well, which is
    held for the duration of command submission activity.
    
    CNL has the same problem which will be addressed as a follow-up. Doing
    that requires a fix for a DC6 context corruption problem in the CNL DMC
    firmware which is yet to be released.
    
    v2:
     * Add commit text as comment in i915_gem_mark_busy. (Chris Wilson)
     * Protect macro body with braces. (Jani Nikula)
    
    v3:
     * Add dedicated power domain for clarity. (Chris, Imre)
     * Commit message and comment text updates.
     * Apply to all big-core GEN9 parts apart from Skylake which is pending DMC
       firmware release.
    
    v4:
     * Power domain should be inner to device runtime pm. (Chris)
     * Simplify NEEDS_CSR_GT_PERF_WA macro. (Chris)
     * Handle async DMC loading by moving the GT_IRQ power domain logic into
       intel_runtime_pm. (Daniel, Chris)
     * Include small core GEN9 as well. (Imre)
    
    v5:
     * Special handling for async DMC load is not needed since on failure the
       power domain reference is kept permanently taken. (Imre)
    
    v6:
     * Drop the NEEDS_CSR_GT_PERF_WA macro since all firmwares have now been
       deployed. (Imre, Chris)
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100572
    Testcase: igt/gem_exec_nop/headless
    Cc: Imre Deak <imre.deak@intel.com>
    Acked-by: Chris Wilson <chris@chris-wilson.co.uk> (v2)
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> (v5)
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    [Imre: Add note about applying the WA on CNL as a follow-up]
    Signed-off-by: Imre Deak <imre.deak@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171205132854.26380-1-tvrtko.ursulin@linux.intel.com

Unless we really want to keep this open for cnl?
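
The workaround in the commit message above amounts to holding a reference on a power domain from the moment the GT becomes busy until it goes idle again, so that the "DC off" power well stays enabled during command submission. The following toy model sketches that reference-counting idea only. It is not the i915 implementation, and all class and method names are hypothetical.

```python
class ToyPowerDomain:
    """Reference-counted stand-in for a power domain (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0

    def get(self):
        self.refcount += 1

    def put(self):
        if self.refcount == 0:
            raise RuntimeError("unbalanced put on " + self.name)
        self.refcount -= 1

    @property
    def dc_states_allowed(self):
        # While anyone holds the domain, the "DC off" well stays on and
        # the firmware is kept out of the low-power DC states.
        return self.refcount == 0


class ToyGt:
    """Takes the domain on the idle-to-busy edge, drops it on busy-to-idle."""

    def __init__(self, gt_irq_domain):
        self.domain = gt_irq_domain
        self.active_requests = 0

    def submit(self):
        if self.active_requests == 0:
            self.domain.get()
        self.active_requests += 1

    def retire(self):
        if self.active_requests == 0:
            raise RuntimeError("retire without submit")
        self.active_requests -= 1
        if self.active_requests == 0:
            self.domain.put()
```

Taking the reference only on the idle-to-busy edge keeps the per-request cost low: the domain is acquired once per busy period rather than once per request.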
Comment 47 Dmitry Rogozhkin 2017-12-08 15:18:01 UTC
I suggest opening another bug for CNL if we spot the issue there, and linking it to this one.

There is also a need to have this issue fixed in the 4.14 LTS kernel. How do you track backports of fixes? Should this bug be used, a separate one, or something else?
Comment 48 Jani Saarinen 2018-04-20 11:11:08 UTC
Closing, please re-open if still occurs.

