Bug 102199 - Kabylake has poor performance, doesn't upclock during activity quickly with single display configurations
Summary: Kabylake has poor performance, doesn't upclock during activity quickly with single display configurations
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-13 17:47 UTC by Lyude Paul
Modified: 2018-04-26 07:05 UTC
CC List: 9 users

See Also:
i915 platform: KBL
i915 features: display/Other


Attachments
dmidecode output (17.35 KB, text/plain)
2017-08-13 17:47 UTC, Lyude Paul
dmesg from Kabylake machine (139.37 KB, text/plain)
2017-08-13 17:57 UTC, Lyude Paul
v4.9 backport (5.92 KB, patch)
2017-09-06 16:53 UTC, Chris Wilson
v4.9 backport (5.90 KB, patch)
2017-09-06 16:54 UTC, Chris Wilson
v4.9.50 backport (5.90 KB, patch)
2017-09-18 09:27 UTC, Chris Wilson

Description Lyude Paul 2017-08-13 17:47:37 UTC
Created attachment 133474
dmidecode output

So, I've been trying to figure out why gnome-shell has been running slowly on my laptop, and unfortunately one of my coworkers and I have traced this down to a rather interesting power issue: i915 doesn't upclock Kabylake GPUs quickly enough to run gnome-shell's animations smoothly, but only when running with a single display. For instance, when continuously entering and exiting the activities overview, /sys/kernel/debug/dri/0/i915_frequency_info shows the GPU almost never going above 500MHz while running animations on a single screen. The shell animations run at about 20 FPS, and if I force the GPU's frequency up using i915_min_freq, animations run at 60 FPS again.
When running with two or more displays however, this problem never seems to happen. The GPU upclocks as expected when shell animations start, and everything runs fluidly at ~55-60 FPS.

The machine in question is a Razer Blade Stealth running Fedora 26. The display used in the single-monitor configuration is the built-in 4K eDP panel. I've tried the latest 4.13-rc4 kernel, and unfortunately the performance isn't much better.

I've attached my dmidecode output, and will reboot and attach a dmesg with drm.debug=0x6 in just a moment. Let me know if you need any more information.
Comment 1 Lyude Paul 2017-08-13 17:57:55 UTC
Created attachment 133475
dmesg from Kabylake machine
Comment 2 Jani Nikula 2017-08-14 08:59:11 UTC
Semi-clueless shot in the dark: Does loading vs. not loading the DMC firmware make a difference?
Comment 3 Chris Wilson 2017-08-14 09:22:31 UTC
(In reply to Jani Nikula from comment #2)
> Semi-clueless shot in the dark: Does loading vs. not loading the DMC
> firmware make a difference?

That's what the last round of discussion implied. Furthermore, if you watch i915_rps_boost_info, it will show you whether we are trying to boost, as well as the condensed requested vs actual frequency. And there's intel-gpu-overlay, but that ideally requires Xv overlay planes to run under X, plus i915-perf integration.
Comment 4 Lyude Paul 2017-08-14 18:55:55 UTC
(In reply to Jani Nikula from comment #2)
> Semi-clueless shot in the dark: Does loading vs. not loading the DMC
> firmware make a difference?

No unfortunately. In fact the performance almost seems to be worse, and it doesn't look like it chooses to boost very often.

Regarding i915_rps_boost_info, I see some very interesting patterns when gnome-shell is animating. The GPU almost never boosts up even when continuously replaying the activity overlay animations, and stays at 300 MHz. Very, very rarely it will decide to boost at which point I can see mutter reporting the FPS doubling. The boost doesn't last very long though, and shortly after goes away. Even more interestingly, when the GPU is rendering the animations it is marked as busy in i915_rps_boost_info despite never actually boosting.

Now, I didn't notice this until just now but something else that's rather interesting happens when I stop replaying the animations. The second I stop, the GPU immediately boosts for a split second, then stops. This happens consistently as well.

If I monitor i915_rps_boost_info while running two displays + gnome-shell though, the GPU boosts almost immediately when the animations start, and mutter reports FPS counts much closer to what I would expect. Additionally, the boost count for gnome-shell's process (marked as system-logind) actually starts going up during animations and labeling itself as actively boosting. Unplugging the display again makes the GPU stop boosting like before.
Comment 5 Elizabeth 2017-08-14 19:17:57 UTC
Adding tag into "Whiteboard" field - ReadyForDev
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included
Comment 6 Chris Wilson 2017-08-14 21:26:50 UTC
(In reply to Lyude Paul from comment #4)
> (In reply to Jani Nikula from comment #2)
> > Semi-clueless shot in the dark: Does loading vs. not loading the DMC
> > firmware make a difference?
> 
> No unfortunately. In fact the performance almost seems to be worse, and it
> doesn't look like it chooses to boost very often.
> 
> Regarding i915_rps_boost_info, I see some very interesting patterns when
> gnome-shell is animating. The GPU almost never boosts up even when
> continuously replaying the activity overlay animations, and stays at 300
> MHz. Very, very rarely it will decide to boost at which point I can see
> mutter reporting the FPS doubling. The boost doesn't last very long though,
> and shortly after goes away. Even more interestingly, when the GPU is
> rendering the animations it is marked as busy in i915_rps_boost_info despite
> never actually boosting.

Ok, so it's only very rarely waitboosting (i.e. it doesn't wait). We used to also boost if we detected a missed vblank interval for a pageflip; that got lost in the atomic transition... And what is the typical load? The last chunk of i915_rps_boost_info should give the %load during the EI interval and how far off upclocking it is.

> Now, I didn't notice this until just now but something else that's rather
> interesting happens when I stop replaying the animations. The second I stop,
> the GPU immediately boosts for a split second, then stops. This happens
> consistently as well.

That'll be a sync somewhere triggering a waitboost.
 
> If I monitor i915_rps_boost_info while running two displays + gnome-shell
> though, the GPU boosts almost immediately when the animations start, and
> mutter reports FPS counts much closer to what I would expect. Additionally,
> the boost count for gnome-shell's process (marked as system-logind) actually
> starts going up during animations and labeling itself as actively boosting.
> Unplugging the display again makes the GPU stop boosting like before.

So there's a difference in the boost counter between the two? In that case the second load is just slow enough that we are stalling for a sync (glFinish, throttling on SwapBuffers) and triggering enough of a waitboost to improve performance.

As an experiment, try

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c47ade3cb786..b88bfef368a3 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -478,6 +478,8 @@ static void __fence_set_priority(struct dma_fence *fence, int prio)
                return;
 
        rq = to_request(fence);
+       gen6_rps_boost(rq, NULL);
+
        engine = rq->engine;
        if (!engine->schedule)
                return;

That's total overkill!
Comment 7 Lyude Paul 2017-08-15 21:19:01 UTC
(In reply to Chris Wilson from comment #6)
> And what is the typical load? The last chunk of i915_rps_boost_info should
> give the %load during the EI interval and how far off upclocking it is.
> 
> As an experiment, try
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c
> b/drivers/gpu/drm/i915/i915_gem.c
> index c47ade3cb786..b88bfef368a3 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -478,6 +478,8 @@ static void __fence_set_priority(struct dma_fence
> *fence, int prio)
>                 return;
>  
>         rq = to_request(fence);
> +       gen6_rps_boost(rq, NULL);
> +
>         engine = rq->engine;
>         if (!engine->schedule)
>                 return;
> 
> That's total overkill!

So doing this fixed the problem entirely. As for the average load%: the "down" load seems to usually be around 50% (after applying the patches you mentioned in IRC), but the "up" percentage constantly fluctuates between 0 and 100%.
Comment 8 Chris Wilson 2017-08-17 08:59:34 UTC
Ok, the more refined version we previously used is at https://patchwork.freedesktop.org/series/28905/

It only boosts after a missed vblank, which should be a reasonable compromise between smooth UX and power.
Comment 9 Chris Wilson 2017-08-17 12:39:16 UTC
Better patch: https://patchwork.freedesktop.org/series/28921/
Comment 10 Lyude Paul 2017-08-17 23:46:00 UTC
(In reply to Chris Wilson from comment #9)
> Better patch: https://patchwork.freedesktop.org/series/28921/

Tested this on my machine and everything runs smoothly and quickly! Thank you a ton for the help.

Tested-by: Lyude Paul <lyude@redhat.com>
Comment 11 Ray Strode [halfline] 2017-08-18 13:47:49 UTC
is there a way the compositor can request a boost before it misses a vblank?
Comment 12 Chris Wilson 2017-08-18 13:58:32 UTC
(In reply to Ray Strode [halfline] from comment #11)
> is there a way the compositor can request a boost before it misses a vblank?

Not yet. One thing we could try is a context parameter that says "always boost batches from this context". Given that we want the compositor to be running at high priority, it would then preempt the existing workload and execute at full speed.

The only interface I'm aware of is EGL_IMG_context_priority, and we plan to restrict high priority contexts to CAP_SYS_NICE clients (with the pencilled-in plan of having logind grant that privilege to e.g. gnome-shell). We could overload that and say that highest priority contexts are naturally boosted.

In that scenario gnome-shell would do all of its compositing from that special context, but it may also use a normal context for "background" rendering. With priority inheritance, everything that is required by the high priority context is bumped to high priority as well.

Let me go float the idea of boosting high priority contexts...
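
For reference, a sketch of what requesting such a context via EGL_IMG_context_priority could look like (the attribute names are from the extension spec; the helper name is made up here, and whether the driver honours or silently demotes the request is up to the implementation):

/* Sketch: ask for a high-priority GLES context via EGL_IMG_context_priority.
 * Assumes dpy is an initialized EGLDisplay and cfg a chosen EGLConfig. */
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <string.h>

EGLContext create_high_priority_context(EGLDisplay dpy, EGLConfig cfg)
{
        const char *exts = eglQueryString(dpy, EGL_EXTENSIONS);

        if (!exts || !strstr(exts, "EGL_IMG_context_priority"))
                return EGL_NO_CONTEXT; /* extension missing: fall back to a normal context */

        const EGLint attribs[] = {
                EGL_CONTEXT_CLIENT_VERSION, 2,
                /* The driver may silently demote this if the client lacks
                 * the required privilege (e.g. CAP_SYS_NICE, as above). */
                EGL_CONTEXT_PRIORITY_LEVEL_IMG, EGL_CONTEXT_PRIORITY_HIGH_IMG,
                EGL_NONE
        };
        return eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, attribs);
}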
Comment 13 Chris Wilson 2017-08-18 14:07:07 UTC
(In reply to Ray Strode [halfline] from comment #11)
> is there a way the compositor can request a boost before it misses a vblank?

Actually, odd as it seems, a glFinish() before SwapBuffers would do the trick. It's certainly not as fancy as my context priority shenanigans.
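
Roughly, the idea is this (a sketch assuming an EGL/GLES client; the same ordering applies under GLX):

/* Sketch: stall on the GPU before presenting, so the wait itself can
 * trigger the kernel's waitboost. */
#include <GLES2/gl2.h>
#include <EGL/egl.h>

void swap_with_waitboost(EGLDisplay dpy, EGLSurface surf)
{
        /* ... issue the frame's draw calls here ... */
        glFinish();                /* CPU blocks on the GPU -> waitboost */
        eglSwapBuffers(dpy, surf); /* present the already-completed frame */
}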
Comment 14 Ray Strode [halfline] 2017-08-18 15:04:06 UTC
wouldn't the boosting need to happen before the drawing was started, not after its drawn and before its swapped ?

can the driver just boost whenever an input device fires off an event ? keep interactive performance up when the user is interacting...
Comment 15 Chris Wilson 2017-08-18 15:48:46 UTC
(In reply to Ray Strode [halfline] from comment #14)
> wouldn't the boosting need to happen before the drawing was started, not
> after its drawn and before its swapped ?

With the glFinish, the boost would be applied to the rendering. It's just that you end up with a synchronous step prior to SwapBuffers.

> can the driver just boost whenever an input device fires off an event ? keep
> interactive performance up when the user is interacting...

How to determine which request corresponds to the interactive stream? As a gfx driver we have no knowledge of input devices. The system as a whole struggles to identify interactive processes! The easiest way for us to identify user tasks is by looking at what affects the output. We could look for special objects (i.e. framebuffers) in the command stream and use that as a flag, but really we have to balance perf/power and waitboosting is a policy that I would much rather punt to userspace. (For me having a flag on a context that says all work submitted via that context is interactive seems a sensible compromise.)

My ideal interface for submitting tasks would include a deadline; almost as a rule user tasks have a clear deadline and opencl/transcode does not. But we are at very early stages of a GPU scheduler, a long way off SCHED_DEADLINE.
Comment 16 Ray Strode [halfline] 2017-08-18 18:10:00 UTC
(In reply to Chris Wilson from comment #15)
> (In reply to Ray Strode [halfline] from comment #14)
> > wouldn't the boosting need to happen before the drawing was started, not
> > after its drawn and before its swapped ?
> 
> With the glFinish, the boost would be applied to the rendering. It's just
> that you end up with a synchronous step prior to SwapBuffers.
oh okay.  So I gave Lyude a build with this patch as an experiment:

https://git.gnome.org/browse/mutter/commit/?h=wip/rstrode/finish-to-boost&id=4035543d4f73b9f2571c1f2dea67de5d0b1352f0

and she says it does help, but your kernel fix is all-around smoother feeling. I don't know if that's because the introduced blocking stalls other parts of the compositor or what...

> > can the driver just boost whenever an input device fires off an event ? keep
> > interactive performance up when the user is interacting...
> 
> How to determine which request corresponds to the interactive stream?
why do you need to ?  at the end of the day the question is whether the gpu is 500mhz or 1300mhz right?  just make it 1300mhz if there's input and let it downscale on its own when it stops animating and the user stops typing and swiping.

> As a
> gfx driver we have no knowledge of input devices. The system as a whole
> struggles to identify interactive processes! The easiest way for us to
> identify user tasks is by looking at what affects the output.
But why do we need to know what processes are interactive ?
Comment 17 Chris Wilson 2017-08-18 18:20:38 UTC
(In reply to Ray Strode [halfline] from comment #16)
> (In reply to Chris Wilson from comment #15)
> > > can the driver just boost whenever an input device fires off an event ? keep
> > > interactive performance up when the user is interacting...
> > 
> > How to determine which request corresponds to the interactive stream?
> why do you need to ?  at the end of the day the question is whether the gpu
> is 500mhz or 1300mhz right?  just make it 1300mhz if there's input and let
> it downscale on its own when it stops animating and the user stops typing
> and swiping.

The gfx driver doesn't have that information. I'm not sure if the kernel knows what is an active hid, and certainly there's no link to it to either the cpu scheduler or to us. From our perspective the only way we can identify such streams is by looking at what ends up on the display, and backtracking from there. The piece that pulls all of this together is the compositor.
 
> > As a
> > gfx driver we have no knowledge of input devices. The system as a whole
> > struggles to identify interactive processes! The easiest way for us to
> > identify user tasks is by looking at what affects the output.
> But why do we need to know what processes are interactive ?

Because we are looking for anything that interacts with the user to preferentially give those low latency. We identify these by contexts which are tied to files and processes. On the other hand, we can remove that guesswork if userspace is able to tell us what is high priority and what is not.
Comment 18 Ray Strode [halfline] 2017-08-18 19:35:20 UTC
(In reply to Chris Wilson from comment #17)
> The gfx driver doesn't have that information. I'm not sure if the kernel
> knows what is an active hid, and certainly there's no link to it to either
> the cpu scheduler or to us. 
I don't understand, the kernel sends input events from keyboard/mice/touchscreens to userspace via /dev/input/eventN interface, so why couldn't the clock speed get boosted any time an event got posted to one of those devices?

git grep seems to suggest there's a function called input_register_handler() that lets various parts of the kernel hook into the event stream for all devices (granted this was just 60 seconds of searching, so I may be missing something).
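
For illustration, a minimal sketch of what such a hook could look like (hypothetical: i915 has no such code today, and gpu_boost() below is a stand-in for whatever reclocking entry point would be used; the handler boilerplate mirrors evdev's):

#include <linux/input.h>
#include <linux/module.h>
#include <linux/slab.h>

/* Hypothetical stand-in: a real patch would kick the GPU's RPS worker. */
static void gpu_boost(void)
{
}

static void boost_event(struct input_handle *handle, unsigned int type,
                        unsigned int code, int value)
{
        gpu_boost(); /* any input event requests an upclock */
}

static int boost_connect(struct input_handler *handler, struct input_dev *dev,
                         const struct input_device_id *id)
{
        struct input_handle *handle;
        int error;

        handle = kzalloc(sizeof(*handle), GFP_KERNEL);
        if (!handle)
                return -ENOMEM;

        handle->dev = dev;
        handle->handler = handler;
        handle->name = "gpu-boost";

        error = input_register_handle(handle);
        if (error)
                goto err_free;

        error = input_open_device(handle);
        if (error)
                goto err_unregister;

        return 0;

err_unregister:
        input_unregister_handle(handle);
err_free:
        kfree(handle);
        return error;
}

static void boost_disconnect(struct input_handle *handle)
{
        input_close_device(handle);
        input_unregister_handle(handle);
        kfree(handle);
}

static const struct input_device_id boost_ids[] = {
        { .driver_info = 1 }, /* matches all input devices */
        { },
};

static struct input_handler boost_handler = {
        .event      = boost_event,
        .connect    = boost_connect,
        .disconnect = boost_disconnect,
        .name       = "gpu-boost",
        .id_table   = boost_ids,
};

static int __init boost_init(void)
{
        return input_register_handler(&boost_handler);
}
module_init(boost_init);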

> From our perspective the only way we can
> identify such streams is by looking at what ends up on the display, and
> backtracking from there. The piece that pulls all of this together is the
> compositor.
But wouldn't it better if we could ramp up the clockspeed before firefox started rerendering a page getting drag scrolled, not after the compositor is responding to damage requests?

> Because we are looking for anything that interacts with the user to
> preferentially give those low latency. We identify these by contexts which
> are tied to files and processes. On the other hand, we can remove that
> guesswork if userspace is able to tell us what is high priority and what is
> not.
Okay, so it's not just video card in turbo mode or power save mode, but also certain GL contexts getting preferential access to the card.  Seems like a worthwhile thing to have, I guess, but giving gnome-shell preferential access doesn't help firefox rerender faster, right ?
Comment 19 Chris Wilson 2017-08-18 20:08:47 UTC
(In reply to Ray Strode [halfline] from comment #18)
> (In reply to Chris Wilson from comment #17)
> > The gfx driver doesn't have that information. I'm not sure if the kernel
> > knows what is an active hid, and certainly there's no link to it to either
> > the cpu scheduler or to us. 
> I don't understand, the kernel sends input events from
> keyboard/mice/touchscreens to userspace via /dev/input/eventN interface, so
> why couldn't the clock speed get boosted any time an event got posted to one
> of those devices?
> 
> git grep seems to suggest there's a function called input_register_handler()
> that lets various parts of the kernel hook into the event stream for all
> devices (granted this was just 60 seconds of searching, so I may be missing
> something).

There is no way we can predict which of those will result in a requirement for low latency output. You do not want to bake policy into the kernel; on the other hand it may be a worthwhile experiment for userspace to construct that link (i.e. pass an input fd to the gfx driver that it then watches, and ????). Realistically that will just result in the GPU always running at max frequency, or idling. We may as well just set a flag on the fd (that the compositor hands to its clients) that says render-as-fast-as-possible, or just switch off reclocking altogether.

> > From our perspective the only way we can
> > identify such streams is by looking at what ends up on the display, and
> > backtracking from there. The piece that pulls all of this together is the
> > compositor.
> But wouldn't it better if we could ramp up the clockspeed before firefox
> started rerendering a page getting drag scrolled, not after the compositor
> is responding to damage requests?
> 
> > Because we are looking for anything that interacts with the user to
> > preferentially give those low latency. We identify these by contexts which
> > are tied to files and processes. On the other hand, we can remove that
> > guesswork if userspace is able to tell us what is high priority and what is
> > not.
> Okay, so it's not just video card in turbo mode or power save mode, but also
> certain GL contexts getting preferential access to the card.  Seems like a
> worthwhile thing to have, I guess, but giving gnome-shell preferential
> access doesn't help firefox rerender faster, right ?

What we are talking about here is a mechanism to override the workload based reclocking of the GPU for a much faster initial ramp (otherwise it uses an exponential curve, reclocking every 10-30ms); it's a power-performance tradeoff. The gpu can decode a video at minimal clocks, there is no reason for that workload to be consuming more power than required. So the key is detecting what workload is going to miss a latency deadline and only supercharging the gpu for that task. The difference is 20+W, and growing with larger and larger embedded gpus. There is the race-to-idle argument, but if the workload is not bursty like a compositor, you will be costing too much power by continually over-estimating the clocks.

There is a reasonable argument that we need to track workloads by context and reclock based on what contexts are active, rather than the current technique of just sampling the workload as a single entity on the gpu. That would give us the ability to remember that every time firefox renders a frame it will need X% of the gpu and adjust accordingly. Sadly the RPS unit is not part of the context, otherwise that would be much easier.

Also not everything firefox is doing is urgent, it will be a mix of laying out and rendering the page, plus some interactive responses that we want to minimise the latency for. How do we identify those different cases?
Comment 20 Ray Strode [halfline] 2017-08-18 20:49:09 UTC
(In reply to Chris Wilson from comment #19)
> There is no way we can predict which of those will result in a requirement
> for low latency output.
Input usually leads to screen updates...  Granted not all screen updates take longer than a vblank period when the gpu is downclocked... Still, it would certainly ramp up less often than glFinish before every swap buffer in the compositor right? 

> You do not want to bake policy into the kernel, 
But isn't everything we're talking about here heuristics (missed vblank, wait boosting)?

> the other hand it may be a worthwhile experiment for userspace to construct
> that link (i.e. pass an input fd to the gfx driver that it then watches, and ????)
not sure what you mean by "pass an input fd to the graphics."  certainly if there was an ioctl "PLEASE BOOST THE CLOCKSPEED OF THE GPU NOW", gnome-shell could call it before starting big animations, and whenever the user clicked or typed.

> Realistically that will just result in the GPU always running at max
> frequency, or idling.
why? if the user is watching a video they aren't using the mouse or typing...  Likewise, if they're reading a webpage, or if they're not using the computer.

But if they're moving the mouse around, they're probably going to click something. If they click something, it's probably going to fade open or scroll or whatever. It seems like that would be the right time to scale up the clock?

> We may as well just set a flag on the fd (that the
> compositor hands to its clients) that says render-as-fast-as-possible, or
> just switch off reclocking all together.
not following this.

> What we are talking about here is a mechanism to override the workload based
> reclocking of the GPU for a much faster initial ramp (otherwise it uses an
> exponential curve, reclocking every 10-30ms); it's a power-performance
> tradeoff. The gpu can decode a video at minimal clocks, there is no reason
> for that workload to be consuming more power than required. So the key is
> detecting what workload is going to miss a latency deadline and only
> supercharging the gpu for that task.
Sure, this makes sense to me. But do you think there are a lot of rendering tasks that result directly from user input, that wouldn't benefit from faster initial ramp?

Thinking about it, I guess an obvious one would be typing a document. hmm. so that makes me think we shouldn't do it for keyboard events. But of course, hitting the super key should still open the shell overview fast. hmm.

> Also not everything firefox is doing is urgent, it will be a mix of laying
> out and rendering the page, plus some interactive responses that we want to
> minimise the latency for. How do we identify those different cases?
fair point.
Comment 21 Chris Wilson 2017-08-21 11:17:21 UTC
(In reply to Ray Strode [halfline] from comment #20)
> (In reply to Chris Wilson from comment #19)
> > There is no way we can predict which of those will result in a requirement
> > for low latency output.
> Input usually leads to screen updates...  Granted not all screen updates
> take longer than a vblank period when the gpu is downclocked... Still, it
> would certainly ramp up less often than glFinish before every swap buffer in
> the compositor right?

I would have said that most input events do not result in a new request. My gut says it is the reverse: that we would end up boosting for longer, inappropriately, by responding to each input event. Still, it's a mechanism worth thinking about.

 > > You do not want to bake policy into the kernel, 
> But isn't everything we're talking about here heuristics (missed vblank,
> wait boosting)?

Yes, it doesn't belong here; it's just the current hack. One side of the spectrum is having the autotune work well for the vast majority, though even that follows a rough policy that can be adjusted (min/max, but we don't allow the aggressiveness to be tuned), and the other side is letting userspace override it when necessary.
 
> > the other hand it may be a worthwhile experiment for userspace to construct
> > that link (i.e. pass an input fd to the gfx driver that it then watches, and ????)
> not sure what you mean by "pass an input fd to the graphics."  certainly if
> there was an ioctl "PLEASE BOOST THE CLOCKSPEED OF THE GPU NOW", gnome-shell
> could call it before starting big animations, and whenever the user clicked
> or typed.

The "pass an fd" was about tieing an input device to the gfx so it could autoboost on an event, the rest of this has been about sketching out how to make an ioctl that could do force boosts.

> > Realistically that will just result in the GPU always running at max
> > frequency, or idling.
> why? if the user is watching a video they aren't using the mouse or
> typing...  Likewise, if they're reading a webpage, or if they're not using
> the computer.
> 
> But if they're moving the mouse around, they're probably going to click
> something. if they click something, it's probably going to fade open or
> scroll or whatever.  it seems like that would the right to scale up the
> clock?

Maybe, maybe not. It's not a decision I want to make, but I don't begrudge providing a mechanism for someone else to dictate that policy.
 
> > We may as well just set a flag on the fd (that the
> > compositor hands to its clients) that says render-as-fast-as-possible, or
> > just switch off reclocking all together.
> not following this.

Earlier you pointed out that the gfx pipeline is long, and that to deliver the input-to-output results as early as possible we should have started that pipeline at max clocks, not just when we realise that we are about to write to the framebuffer. The alternative is that when delivering a screen update you only include those bits that are completed; then your high priority update can skip to the head of the queue -- and should be able to perform well enough without an explicit boost, just autotuning of the workload as a whole.

> > What we are talking about here is a mechanism to override the workload based
> > reclocking of the GPU for a much faster initial ramp (otherwise it uses an
> > exponential curve, reclocking every 10-30ms); it's a power-performance
> > tradeoff. The gpu can decode a video at minimal clocks, there is no reason
> > for that workload to be consuming more power than required. So the key is
> > detecting what workload is going to miss a latency deadline and only
> > supercharging the gpu for that task.
> Sure, this makes sense to me. But do you think there are a lot of rendering
> tasks that result directly from user input, that wouldn't benefit from
> faster initial ramp?

Yes. Interactive latency is important, but we should be able to deliver it at a better point on the power curve than max clocks. 
> 
> Thinking about it, I guess an obvious one would be typing a document. hmm.
> so that makes me think we shouldn't do it for keyboard events. But of
> course, hitting the super key should still open the shell overview fast. hmm.

Yup, measuring the output delay of something well defined like the super-key, and setting an expectation on that, is exactly what we need to do. Is this what is covered by gnome-shell-perf-tool?

Going beyond that, tracking an input event through to a client output request is more tricky, but it could be done for simple applications (or at least mock apps).

But if you can think of more tools like gnome-shell-perf-tool that set expectations we can measure ourselves against, that will be most useful. We are not as good as we should be at integrating such tests into our CI, but that we can improve.
Comment 22 Lyude Paul 2017-08-21 17:17:44 UTC
(In reply to Ray Strode [halfline] from comment #18)
> (In reply to Chris Wilson from comment #17)
> > The gfx driver doesn't have that information. I'm not sure if the kernel
> > knows what is an active hid, and certainly there's no link to it to either
> > the cpu scheduler or to us. 
> I don't understand, the kernel sends input events from
> keyboard/mice/touchscreens to userspace via /dev/input/eventN interface, so
> why couldn't the clock speed get boosted any time an event got posted to one
> of those devices?
Mentioned this to halfline over the RH IRC already, but putting it on the bz for the record: I'm with ickle on this; while the patch for this is technically kernel heuristics, it is heuristics based on data the driver knows far more about than something like an input device. Especially since the events that come out of an evdev node are usually not going to be 1:1 with the events that userspace ends up receiving, since we have layers like libinput doing additional filtering and heuristics on the input events before handing them to the compositor.

Plus, it's certainly possible that we could have various render jobs that need to be completed by the scanout period without actually having any user input.
> 
> git grep seems to suggest there's a function called input_register_handler()
> that lets various parts of the kernel hook into the event stream for all
> devices (granted this was just 60 seconds of searching, so I may be missing
> something).
> 
> > From our perspective the only way we can
> > identify such streams is by looking at what ends up on the display, and
> > backtracking from there. The piece that pulls all of this together is the
> > compositor.
> But wouldn't it better if we could ramp up the clockspeed before firefox
> started rerendering a page getting drag scrolled, not after the compositor
> is responding to damage requests?
I think the issue here is that we really can't predict the future of render jobs unless we're the ones creating those jobs. The only thing we really know is when those jobs must be completed by, which for any application drawing to the screen (including gnome-shell, firefox, etc.) is the next vblank.

To be honest, I very much like ickle's idea of having render batch deadlines, since we could dynamically adjust the boost curve based on the amount of time left between the start of the render job and its deadline, assuming we mark all render jobs destined to be scanned out to the screen as having a vblank deadline. I'm thinking something that would look like this:

(each tick = 1ms)
(U = boost up) (D = boost down)
(S = start) (E = end)
(dividers going from top to bottom indicate the start of a new scanout
period)
(frequencies in MHz, and are all guesstimates)

So, let's assume we're looking at the scanout timeline for when the user starts
interacting with say, Firefox
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
|         |S|   |U|   |U| |E|   | |D|   |D|               |S| |E|
|                               |                          U    |
300             500  1000         500   300              1000

So, the idea would basically be: the closer we are to the next vblank when the time-critical job starts, the faster we ramp up with boosts. We could probably optimize this even more by holding a boost that's done in anticipation of the next scanout across the next vblank. This might need a bit more consideration for edge cases such as a render job that is destined to end up on more than one display, but I think just falling back to "boost more" in those scenarios like we do right now would be sufficient, since power isn't as important with >1 monitor. Additionally, we could combine this with ickle's suggestion of putting high priority jobs at the top of the queue and ditching everything else, doing that for the duration of the vblank or up until the point at which all time-sensitive batches are complete. (A rough sketch of the ramp policy follows below.)
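
A hypothetical sketch of that ramp policy (illustration only, not i915 code; the thresholds are guesstimates like the frequencies above):

/* Hypothetical: pick a frequency based on how much of the remaining time
 * to the vblank deadline the job is expected to need. */
static int pick_boost_freq_mhz(int cur_mhz, int max_mhz,
                               int us_to_deadline, int us_expected_work)
{
        if (us_expected_work < us_to_deadline / 2)
                return cur_mhz;                 /* comfortable margin: no boost */
        if (us_expected_work < us_to_deadline)
                return (cur_mhz + max_mhz) / 2; /* tight: ramp halfway */
        return max_mhz;                         /* likely to miss: go straight to max */
}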

Of course, ickle has already mentioned that we're a long way off from having SCHED_DEADLINE, but maybe we could at least use this logic for any EGL_IMG_context_priority jobs instead of just blindly boosting ahead of time?
> 
> > Because we are looking for anything that interacts with the user to
> > preferentially give those low latency. We identify these by contexts which
> > are tied to files and processes. On the other hand, we can remove that
> > guesswork if userspace is able to tell us what is high priority and what is
> > not.
> Okay, so it's not just video card in turbo mode or power save mode, but also
> certain GL contexts getting preferential access to the card.  Seems like a
> worthwhile thing to have, I guess, but giving gnome-shell preferential
> access doesn't help firefox rerender faster, right ?
Comment 23 Chris Wilson 2017-08-22 16:13:26 UTC
Applied commit 74d290f845d0736bf6b9dd22cd28dd87b270c65f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 17 13:37:06 2017 +0100

    drm/i915: Boost GPU clocks if we miss the pageflip's vblank
    
    If we miss the current vblank because the gpu was busy, that may cause a
    jitter as the frame rate temporarily drops. We try to limit the impact
    of this by then boosting the GPU clock to deliver the frame as quickly
    as possible. Originally done in commit 6ad790c0f5ac ("drm/i915: Boost GPU
    frequency if we detect outstanding pageflips") but was never forward
    ported to atomic and finally dropped in commit fd3a40242e87 ("drm/i915:
    Rip out legacy page_flip completion/irq handling").
    
    One of the most typical use-cases for this is a mostly idle desktop.
    Rendering one frame of the desktop's frontbuffer can easily be
    accomplished by the GPU running at low frequency, but often exceeds
    the time budget of the desktop compositor. The result is that animations
    such as opening the menu, doing a fullscreen switch, or even just trying
    to move a window around are slow and jerky. We need to respond within a
    frame to give the best impression of a smooth UX, as a compromise we
    instead respond if that first frame misses its goal. The result should
    be a near-imperceivable initial delay and a smooth animation even
    starting from idle. The cost, as ever, is that we spend more power than
    is strictly necessary as we overestimate the required GPU frequency and
    then try to ramp down.
    
    This of course is reactionary, too little, too late; nevertheless it is
    surprisingly effective.

It should provide us with a good safety net. Ideally we still want to tell in advance when we need to boost the clocks to hit a user deadline.
Comment 24 Chris Wilson 2017-09-06 16:53:04 UTC
Created attachment 134016
v4.9 backport
Comment 25 Chris Wilson 2017-09-06 16:54:18 UTC
Created attachment 134017
v4.9 backport
Comment 26 Harish 2017-09-17 20:07:35 UTC
Just curious, how do you check the current GPU frequency to see if it is ramping up properly?

Also, I've tried to apply the patch to the 4.9.50 LTS kernel but it's failing, so I'm guessing I'm using the wrong kernel version for this patch.
Comment 27 Chris Wilson 2017-09-18 09:27:05 UTC
Created attachment 134314
v4.9.50 backport

The gpu should reclock so quickly that it will be tricky to catch the idle -> max transition for the missed vblank, as it should idle again shortly afterwards (within 100ms). You can cat /sys/class/drm/card0/gt_act_freq_mhz; my recommended tool would be intel-gpu-overlay with the i915/pmu patches...
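
If you want something scriptable, a minimal sketch that just polls that sysfs file (assuming card0 is the Intel GPU):

/* Poll the actual GPU frequency every 100ms; boosts can decay within
 * ~100ms, so finer polling may be needed to catch one. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        for (;;) {
                FILE *f = fopen("/sys/class/drm/card0/gt_act_freq_mhz", "r");
                int mhz;

                if (f && fscanf(f, "%d", &mhz) == 1)
                        printf("%d MHz\n", mhz);
                if (f)
                        fclose(f);
                usleep(100 * 1000);
        }
        return 0;
}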
Comment 28 Harish 2017-10-17 11:34:07 UTC
(In reply to Chris Wilson from comment #27)
> Created attachment 134314
> v4.9.50 backport
> 
> The gpu should reclock so quickly that it will be tricky to catch the idle ->
> max transition for the missed vblank, as it should idle again shortly
> afterwards (within 100ms). You can cat /sys/class/drm/card0/gt_act_freq_mhz;
> my recommended tool would be intel-gpu-overlay with the i915/pmu patches...

Hi Chris, thanks for the backport. But when I apply the patch I don't get any message saying it was successful; it just says the files were patched. Is this supposed to happen? Normally I get a "hunk succeeded" message.
Comment 29 Chris Wilson 2017-10-17 14:05:49 UTC
Iirc, git apply is quiet; git am would report the new commit id, and patch itself would report success.

When in doubt just grab drm-tip...
Comment 30 Chris Wilson 2018-01-18 17:20:56 UTC
Note, the rps boost was slightly reduced in ferocity in 

commit e9af4ea2b9e7e5d3caa6354be14de06b678ed0fa
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jan 18 13:16:09 2018 +0000

    drm/i915: Avoid waitboosting on the active request
    
    Watching a light workload on Baytrail (running glxgears and a 1080p
    decode), instead of the system remaining at low frequency, the glxgears
    would regularly trigger waitboosting after which it would have to spend
    a few seconds throttling back down. In this case, the waitboosting is
    counter productive as the minimal wait for glxgears doesn't prevent it
    from functioning correctly and delivering frames on time. In this case,
    glxgears happens to almost always be waiting on the current request,
    which we already expect to complete quickly (see i915_spin_request) and
    so avoiding the waitboost on the active request and spinning instead
    provides the best latency without overcommitting to upclocking.
    However, if the system falls behind we still force the waitboost.
    Similarly, we will also trigger upclocking if we detect the system is
    not delivering frames on time - again using a mechanism that tries to
    detect a miss and not preemptively upclock.
    
    v2: Also skip boosting for after missed vblank if the desired request is
    already active.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Radoslaw Szwichtenberg <radoslaw.szwichtenberg@intel.com>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180118131609.16574-1-chris@chris-wilson.co.uk
Comment 31 Jani Saarinen 2018-03-29 07:10:46 UTC
First of all, sorry about the spam; this is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid or not. If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this is no longer valid, please comment on the bug so it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), please do that as well to see if the issue is still present there; if you cannot reproduce it there, please comment on the bug.
Comment 32 Jani Saarinen 2018-04-25 06:56:45 UTC
Paul, was this fixed or is it still an issue? See comment #30.
Comment 33 Daniel van Vugt 2018-04-26 05:38:47 UTC
Sounds like the original bug was solved in August 2017 (see earlier comments) and most of us have a kernel with the fix by now.

So further discussion should probably go in new bugs.
Comment 34 Jani Saarinen 2018-04-26 07:05:07 UTC
Thanks; based on this, resolving.
Comment 35 Jani Saarinen 2018-04-26 07:05:24 UTC
Closing; please re-open if the issue still exists.

