Bug 81682 - [snb] Frequent framerate drops (cpufreq?)
Summary: [snb] Frequent framerate drops (cpufreq?)
Status: CLOSED WONTFIX
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-23 17:50 UTC by Greg Sutcliffe
Modified: 2017-07-24 22:52 UTC

See Also:
i915 platform: SNB
i915 features: power/Other


Attachments
Xorg config and i915 module params (1.31 KB, text/plain), 2014-07-25 10:56 UTC, Greg Sutcliffe
FPS drop 1 with overlay (2.23 MB, image/jpeg), 2014-07-25 10:58 UTC, Greg Sutcliffe
FPS drop 2 with overlay (2.12 MB, image/jpeg), 2014-07-25 10:59 UTC, Greg Sutcliffe
dmesg with new kernel (14.05 KB, text/plain), 2014-07-25 13:47 UTC, Greg Sutcliffe
Overlay during intermittent stall (2.23 MB, image/jpeg), 2014-07-28 13:21 UTC, Greg Sutcliffe
perf record -g -a sleep 2 during OK period (396.17 KB, application/gzip), 2014-07-28 13:32 UTC, Greg Sutcliffe
perf record -g -a sleep 2 during stalling period (1.17 MB, application/gzip), 2014-07-28 13:33 UTC, Greg Sutcliffe
Perf text for Ok period (114.08 KB, text/plain), 2014-07-28 20:48 UTC, Greg Sutcliffe
Perf text for stall period (116.09 KB, text/plain), 2014-07-28 20:49 UTC, Greg Sutcliffe
Overlay during full stall (2.45 MB, image/jpeg), 2014-07-28 21:58 UTC, Greg Sutcliffe
Perf data / no stall / governor=performance (113.53 KB, text/plain), 2014-08-03 17:55 UTC, Greg Sutcliffe
Perf data / stalling / governor=performance (118.89 KB, text/plain), 2014-08-03 17:55 UTC, Greg Sutcliffe

Description Greg Sutcliffe 2014-07-23 17:50:17 UTC
This was originally reported as #81402 but I've realized that I conflated two issues, and the logs I attached were for a separate issue. Thus I'm separately raising my core issue for further debugging.

I get intermittent FPS drops on the Intel GPU on my laptop, usually every 5-15 minutes while gaming, during which the whole window manager is very slow. This lasts up to a minute before generally returning to normal. Even low-intensity games like Faster Than Light will trigger it, although it seems to take less time to trigger with more demanding use (e.g. FTL took nearly 2 hours to trigger it, while running something like Anomaly Defenders took less than an hour, and Path of Exile (via wine and the bumblebee-enabled nVidia card) can trigger it in less than 10 minutes).

Hardware: Dell L702x laptop with i915 chip. lspci -v reports:
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Dell Device 0571

OS is Archlinux 64bit, running mesa 10.2.4-1 and xf86-video-intel-git 2.99.914-1, although similar behaviour was seen on xf86-video-intel 2.99.912-2.

The trouble I'm having is figuring out what's causing it. Nothing is logged to /sys/kernel/debug/dri/0/i915_error_state, dmesg, syslog or Xorg.0.log. I can see no unusual processes running in htop, and I can reproduce it with cron disabled, so it's not something firing up in the background. Temperature according to lm_sensors doesn't seem different when the issue is or isn't happening.

I'm totally at a loss as to how to debug the issue further, so any pointers are *gratefully* welcomed, since I've been trying to track this issue down for the better part of 4 months. I'm available on Freenode (in #intel-gfx as 'gwmngilfen') if any real-time debugging can be done.
Comment 1 Chris Wilson 2014-07-23 17:57:10 UTC
Grab and install a kernel from http://cgit.freedesktop.org/~ickle/linux-2.6/ #master and build intel-gpu-tools/overlay/intel-gpu-overlay. Use SNA and run the overlay (use "intel-gpu-overlay --position bottom-right" to place it somewhere convenient). That should show the frequency and power consumption of the GPU (along with usage and frames/30s) that will be interesting to correlate with the periods of slow behaviour if there are no obvious errors in the log.
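
Where the overlay is not available, the GPU frequency can also be logged from the i915 sysfs files and lined up against the slow periods afterwards. This is a minimal sketch, not part of the original advice: the function name `gpu_freq_line` and the `DRM_DIR` variable are made up for illustration, and `card0` is assumed to be the Intel GPU.

```shell
#!/bin/sh
# Sketch: print one timestamped line with the i915 GPU frequency values.
# DRM_DIR is an assumption; the card index may differ on other systems.
DRM_DIR="${DRM_DIR:-/sys/class/drm/card0}"

gpu_freq_line() {
    # gt_cur_freq_mhz: current GPU frequency as reported by the driver
    # gt_min_freq_mhz / gt_max_freq_mhz: the software frequency limits
    cur=$(cat "$DRM_DIR/gt_cur_freq_mhz")
    min=$(cat "$DRM_DIR/gt_min_freq_mhz")
    max=$(cat "$DRM_DIR/gt_max_freq_mhz")
    printf '%s cur=%sMHz min=%sMHz max=%sMHz\n' "$(date +%T)" "$cur" "$min" "$max"
}
```

Running `while gpu_freq_line; do sleep 1; done >> gpufreq.log` in a terminal would then give a once-per-second log to compare against the times the framerate dropped.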
Comment 2 Greg Sutcliffe 2014-07-25 10:55:45 UTC
Thanks Chris, with your help on IRC I got a custom kernel and the overlay working.
In case it matters, I'm using 3.16.0-rc6-290399-gd4990be with i915 builtin.

Running Space Hulk, typically the GPU seems to sit at 1000-1100MHz and consumes ~1100mW power. When the issue strikes, all the graphs show a substantial drop - power drops to 600mW, freq to 650MHz and frame rate goes through the floor. Interestingly the purple CPU graph in the top right shows an uptick in usage at the same time (typically jumping from 100-120% busy to 200-240% busy, but it has been as high as 500% busy), but I can't find a separate process that might be responsible for that. Top reports the game, firefox, and intel-gpu-overlay as the top three processes during such a slowdown.

I've taken a few screenshots using my mobile phone - terrible, I know, but since the overlay is outside the normal display system, my usual screenshot tools can't capture it. Happy to re-do them if there's a better way to capture the images. I've also attached my xorg config and kernel module config for i915, in case it's useful.

What's next to investigate?
Comment 3 Greg Sutcliffe 2014-07-25 10:56:40 UTC
Created attachment 103437 [details]
Xorg config and i915 module params
Comment 4 Greg Sutcliffe 2014-07-25 10:58:16 UTC
Created attachment 103438 [details]
FPS drop 1 with overlay
Comment 5 Greg Sutcliffe 2014-07-25 10:59:40 UTC
Created attachment 103440 [details]
FPS drop 2 with overlay
Comment 6 Greg Sutcliffe 2014-07-25 11:00:54 UTC
> the purple CPU graph in the top right

Sorry, I meant top-left here, obviously.
Comment 7 Chris Wilson 2014-07-25 12:57:21 UTC
That's different to what I was expecting. I thought that the gpufreq would drop, but the game would still be struggling to render. That it is still rendering excludes the possibility of a stall (either pageflip or an outright GPU hang), but you might as well double-check that the kernel you compiled didn't detect something and attach the dmesg.

Now, the next challenge is to grab a perf profile from the good/bad periods. (That's the type of integration that we want in the GPU monitoring tools, but we don't have today.)
Comment 8 Greg Sutcliffe 2014-07-25 13:46:58 UTC
(In reply to comment #7)
> That's different to what I was expecting. I thought that the gpufreq would
> drop, but the game would still be struggling to render. 

Maybe I'm misunderstanding, but that is what I'm seeing - the gpufreq drops to minimum, and the framerate drops with it. Both seem to recover at the same time too. It's not just the game either - the whole desktop (both monitors, and multiple workspaces) becomes very jerky too. Or do you mean complete hangs when you say "struggling to render"?

> That it is still
> rendering excludes the possibility of a stall (either pageflip or an
> outright GPU hang), but you might as well double-check that the kernel you
> compiled didn't detect something and attach the dmesg.

I did a reboot, and a dmesg -c before starting the game. Attached is what's in dmesg after the first fps drop. Nothing in i915_error_state though.

> Now, the next challenge is to grab a perf profile from the good/bad periods.
> (That's the type of integration that we want into the GPU monitoring tools,
> but we don't have today.)

Do you have any docs I can follow on how to do that manually then, since it's not already in the tools?
Comment 9 Greg Sutcliffe 2014-07-25 13:47:48 UTC
Created attachment 103445 [details]
dmesg with new kernel
Comment 10 Greg Sutcliffe 2014-07-27 00:17:24 UTC
I've just noticed slightly different behaviour in the gpu overlay for the game I'm really interested in fixing (Path of Exile, which is mostly rendered on the nVidia card and shipped to the Intel display via the Primus VirtualGL layer).

In this case, the gpufreq always remains constant, around 650MHz, even when the frame rate is fine. It's the power consumption that changes when the fps drops - it's typically idling around 600mW, but during an fps drop, the power consumption drops to 200mW. No idea how to interpret that, but it seemed like relevant data, so there it is.
Comment 11 Chris Wilson 2014-07-27 06:08:57 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > That's different to what I was expecting. I thought that the gpufreq would
> > drop, but the game would still be struggling to render. 
> 
> Maybe I'm misunderstanding, but that is what I'm seeing - the gpufreq drops
> to minimum, and the framerate drops with it. Both seem to recover at the
> same time too. It's not just the game either - the whole desktop (both
> > monitors, and multiple workspaces) becomes very jerky too. Or do you mean
> complete hangs when you say "struggling to render"?

What I was expecting for a normal thermal throttling issue would be that the actual GPU frequency would be much less than the requested GPU frequency, while the GPU stays busy and lots of commands are still being sent. In your case, both the requested frequency and busyness of the GPU drop to nearly zero (but not actually zero). That suggests the system is stalling. Can you grab a photo of the overlay during the stall? What I am looking for this time is whether it is recording long waits and the status of rc6 during that period. That the cpu goes up during that time is interesting.

> > Now, the next challenge is to grab a perf profile from the good/bad periods.
> > (That's the type of integration that we want into the GPU monitoring tools,
> > but we don't have today.)
> 
> Do you have any docs I can follow on how to do that manually then, since
> it's not already in the tools?

Go to your kernel source, linux/tools/perf and run make. You can try "make prefix=/usr/local install" (iirc). If you find all of the extras it asks for (most importantly libunwind and libdw/libelf) when you run "./perf top" you will see what functions are taking the most cpu time. The tricky part is that you want to know this during the stall. I would suggest on a second computer (or smartphone) to log in and run "sudo perf record -g -a sleep 2" during the stall.
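
Since the stall can end before a command is typed out, it may help to have the capture wrapped up ahead of time. The following is only a sketch of the `perf record` invocation above: the function name `capture_profile`, the `PERF` variable and the `perf-<label>.data` file naming are illustrative assumptions, not part of perf itself.

```shell
#!/bin/sh
# Sketch: capture a 2 second system-wide profile into a labelled file.
# PERF can be overridden (for example PERF=echo) to preview the command line.
PERF="${PERF:-perf}"

capture_profile() {
    # $1 is a hypothetical label for the capture, such as "ok" or "stall"
    label=$1
    # -g records call graphs, -a samples all CPUs, -o names the output file
    "$PERF" record -g -a -o "perf-$label.data" sleep 2
}
```

With this sourced in a root shell over ssh, `capture_profile stall` during a slowdown and `capture_profile ok` during normal play would leave two files that can be compared later.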

(In reply to comment #10)
> I've just noticed slightly different behaviour in the gpu overlay for the
> game I'm really interested in fixing (Path of Exile, which is mostly
> rendered on the nVidia card and shipped to the Intel display via the Primus
> VirtualGL layer).
> 
> In this case, the gpufreq always remains constant, around 650MHz, even when
> the frame rate is fine. It's the power consumption that changes when the fps
> drops - it's typically idling around 600mW, but during an fps drop, the
> power consumption drops to 200mW. No idea how to interpret that, but it
> seemed like relevant data, so there it is.

Again, that seems like the system just stops sending lots of frames to show. Do you see a noticeable fluctuation in CPU activity during that time as well? Could you also watch powertop and see how cpufreq fluctuates? I guess that is something I could incorporate into the overlay...
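
As an alternative to watching powertop, the per-CPU frequency can be sampled directly from the cpufreq sysfs files. This is a rough sketch under the assumption that the standard `scaling_cur_freq` files (which report kHz) are present; `cpu_freqs_mhz` and `CPU_ROOT` are names invented for this example.

```shell
#!/bin/sh
# Sketch: print the current frequency of every CPU on a single line.
# CPU_ROOT is an assumption that makes the path easy to override.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

cpu_freqs_mhz() {
    for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        cpu=${f#"$CPU_ROOT"/}   # drop the leading directory
        cpu=${cpu%%/*}          # keep only the "cpuN" component
        khz=$(cat "$f")         # scaling_cur_freq is reported in kHz
        printf '%s=%sMHz ' "$cpu" "$((khz / 1000))"
    done
    printf '\n'
}
```

Calling `cpu_freqs_mhz` once a second in a loop, with a timestamp, gives a log that can be lined up against the overlay photos.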
Comment 12 Greg Sutcliffe 2014-07-28 13:21:36 UTC
Created attachment 103588 [details]
Overlay during intermittent stall
Comment 13 Greg Sutcliffe 2014-07-28 13:32:04 UTC
Created attachment 103589 [details]
perf record -g -a sleep 2 during OK period
Comment 14 Greg Sutcliffe 2014-07-28 13:33:04 UTC
Created attachment 103590 [details]
perf record -g -a sleep 2 during stalling period
Comment 15 Greg Sutcliffe 2014-07-28 13:37:47 UTC
(In reply to comment #11)
> Can you grab a photo of the overlay during the stall? What I am looking for
> this time is whether it is recording long waits and the status of rc6 during
> that period. That the cpu goes up during that time is interesting.

Last night it seemed very unwilling to reproduce properly, but I did get a period of repeatedly stalling/unstalling so I grabbed a shot of that. You can see RC6 is around 95%. I'll try to get a proper shot next time it reproduces correctly.

> Now, the next challenge is to grab a perf profile from the good/bad periods.

Conveniently, the one time last night it reproduced the bug properly, I had *just* finished compiling perf. So, I've attached the output of "perf record -g -a sleep 2" (gzipped for size) as well as a repeat of that command later on when it was behaving itself.

> Do you see a noticeable fluctuation in CPU activity during that time as well?
> Could you also watch powertop and see how cpufreq fluctuates?

Sadly, as discussed above, it refused to reproduce properly after I installed powertop. I'll try to watch for this tonight.
Comment 16 Chris Wilson 2014-07-28 15:05:42 UTC
Oops I forgot to mention that the perf output needs to be decoded on your machine. Use "perf report -g -i <perf.record.dat> | head -1500".
Comment 17 Greg Sutcliffe 2014-07-28 20:48:39 UTC
Created attachment 103612 [details]
Perf text for Ok period
Comment 18 Greg Sutcliffe 2014-07-28 20:49:31 UTC
Created attachment 103613 [details]
Perf text for stall period
Comment 19 Greg Sutcliffe 2014-07-28 20:50:12 UTC
Uploaded the decoded text for the perf dumps. However, I'm seeing a lot of

Failed to open /tmp/perf-5956.map, continuing without symbols

Which I assume you'll need - how do I fix that?
Comment 20 Greg Sutcliffe 2014-07-28 21:58:28 UTC
Created attachment 103617 [details]
Overlay during full stall
Comment 21 Greg Sutcliffe 2014-07-28 22:01:45 UTC
So Path of Exile finally did another fps drop, so I've grabbed and added a screenshot of it - as before, you can see RC6 is ~90%. I also didn't see any major CPU usage changes during the stall.

Interesting, when it recovered, for a short time it went up to 1100MHz gpufreq, and the fps suddenly became smooth as silk, 50fps at least. Whilst it may not be the proper solution, is there a re to force the gpu up to a high freq while I'm gaming? I'm happy to continue debugging with you, but at least I could choose when I'm going to get the fps drops :)
Comment 22 Greg Sutcliffe 2014-07-28 22:02:31 UTC
> is there a re to force

Ugh, typo. Should be ... is there a way to force ...
Comment 23 Chris Wilson 2014-07-29 06:42:52 UTC
(In reply to comment #21)
> Interesting, when it recovered, for a short time it went up to 1100MHz
> gpufreq, and the fps suddenly became smooth as silk, 50fps at least. Whilst
> it may not be the proper solution, is there a re to force the gpu up to a
> high freq while I'm gaming? I'm happy to continue debugging with you, but at
> least I could choose when I'm going to get the fps drops :)

You can force the requested frequency to max with "cat /sys/class/drm/card0/gt_RP0_freq_mhz > /sys/class/drm/card0/gt_max_freq_mhz". It should not be any better than the autotuning (and likely to be worse). However, that will not stop the hardware from throttling down to meet thermal limits.

Looking at the perf output, the unknown symbols are just from a lack of debug packages, but I think we have enough to suggest that cpufreq is a major factor in the error. During the stall, it is going mad. Disabling cpufreq (echo performance > /sys/.../scaling_governor) would be a good test.
Comment 24 Greg Sutcliffe 2014-07-29 15:21:23 UTC
So I did

for n in `seq 0 7` ; do echo performance > /sys/devices/system/cpu/cpu$n/cpufreq/scaling_governor ; done

and then left the game running for a few hours while working today, and didn't see any slowdowns. Obviously it's impossible to prove a negative, but so far it's looking good. I'll give it some serious playtime over the next few days and see if it recurs at all. Progress though, so thanks!
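
For anyone repeating this test on a machine with a different number of cores, the same change can be written with a glob instead of a fixed `seq 0 7`, plus a check that the setting took effect. This is a sketch only; `set_governor`, `show_governors` and `CPU_ROOT` are hypothetical names, and writing to the real sysfs files needs root.

```shell
#!/bin/sh
# Sketch: set and display the cpufreq governor on every CPU that has one.
# CPU_ROOT is an assumption that makes the path easy to override.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

set_governor() {
    # $1 is the governor name, for example "performance"
    gov=$1
    for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue
        echo "$gov" > "$f"
    done
}

show_governors() {
    for f in "$CPU_ROOT"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        cpu=${f#"$CPU_ROOT"/}   # drop the leading directory
        cpu=${cpu%%/*}          # keep only the "cpuN" component
        printf '%s: %s\n' "$cpu" "$(cat "$f")"
    done
}
```

Running `set_governor performance` followed by `show_governors` should list every CPU with the new governor; the setting does not persist across reboots.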
Comment 25 Chris Wilson 2014-07-30 06:22:56 UTC
Looks like we are making progress, but just to correct one thing I said earlier: to force the gpu to request maximum frequency all the time use "cat /sys/class/drm/card0/gt_RP0_freq_mhz > /sys/class/drm/card0/gt_min_freq_mhz"
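
Because this overrides the driver's autotuning, it can be useful to remember the previous minimum and put it back after a gaming session. A minimal sketch of the corrected command with a save and restore step follows; `pin_gpu_max`, `restore_gpu_min` and `DRM_DIR` are names invented here, `card0` is assumed to be the Intel GPU, and root is required.

```shell
#!/bin/sh
# Sketch: raise the i915 minimum frequency to RP0 and restore it later.
# DRM_DIR is an assumption; the card index may differ on other systems.
DRM_DIR="${DRM_DIR:-/sys/class/drm/card0}"

pin_gpu_max() {
    # Print the previous minimum so the caller can save it
    cat "$DRM_DIR/gt_min_freq_mhz"
    # Request the maximum (RP0) frequency at all times
    cat "$DRM_DIR/gt_RP0_freq_mhz" > "$DRM_DIR/gt_min_freq_mhz"
}

restore_gpu_min() {
    # $1 is the value previously printed by pin_gpu_max
    echo "$1" > "$DRM_DIR/gt_min_freq_mhz"
}
```

Typical use would be `old=$(pin_gpu_max)` before starting the game and `restore_gpu_min "$old"` afterwards. As noted in comment 23, this does not stop the hardware from throttling to meet thermal limits.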
Comment 26 Chris Wilson 2014-07-31 13:25:23 UTC
Another thing to compare is the acpi cpufreq driver vs intel_pstate. That might need different kernel compilations options, or at least manually controlling which module is loaded.
Comment 27 Greg Sutcliffe 2014-07-31 17:28:58 UTC
(In reply to comment #26)
> Another thing to compare is the acpi cpufreq driver vs intel_pstate. That
> might need different kernel compilations options, or at least manually
> controlling which module is loaded.

Sorry, I'm going to be dense here - what sort of comparison are we talking about? Trying different versions of drivers, or just observing them as the fps drops happen?

I've been playing heavily in the last few days, and setting the governor to performance has made a big difference, but hasn't eliminated the problem - it's occurred (properly) just once, and it corrected itself before I got perf loaded up to capture it. I'm awaiting it doing it again so I can grab it.
Comment 28 Chris Wilson 2014-07-31 17:40:00 UTC
intel_pstate is a replacement cpufreq driver for recent Intel CPUs. So it should be possible to use either and so we can see if this behaviour is confined to one cpufreq driver.
Comment 29 Greg Sutcliffe 2014-08-03 17:55:07 UTC
Created attachment 103936 [details]
Perf data / no stall / governor=performance
Comment 30 Greg Sutcliffe 2014-08-03 17:55:38 UTC
Created attachment 103937 [details]
Perf data / stalling / governor=performance
Comment 31 Greg Sutcliffe 2014-08-03 17:57:07 UTC
Just for interest, I've managed to capture another stall. I've added 2 new perf dumps from ok and stalled sessions while using the performance cpufreq governor on all cpus. It's definitely happening less frequently now though. I'll try to investigate intel_pstate in the next few days and report back.
Comment 32 Greg Sutcliffe 2014-08-14 10:21:22 UTC
Sorry for the delay, life got in the way.

I've had a look through the kernel config I have available to me. I seem to have this:

Power management and ACPI options
| CPU Frequency scaling
| | x86 CPU frequency scaling drivers
| | | [*] Intel P state control (X86_INTEL_PSTATE [=y])
| | | <M> ACPI Processor P-States driver (X86_ACPI_CPUFREQ [=m])

I don't see acpi_cpufreq loaded in lsmod, so as far as I can tell, that means I'm defaulting to the pstate driver here? If so, I'll recompile the kernel without it and let acpi_cpufreq pick it up and retest.
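
The driver in use can also be read back from sysfs rather than inferred from lsmod. The sketch below assumes the standard cpufreq sysfs layout; `cpufreq_driver_info` and `CPU_ROOT` are made-up names. As far as I know, the kernel also accepts `intel_pstate=disable` on the command line, which may avoid the need for a recompile, but treat that as something to verify against the kernel documentation for the version in use.

```shell
#!/bin/sh
# Sketch: report which cpufreq driver and governor cpu0 is using.
# CPU_ROOT is an assumption that makes the path easy to override.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

cpufreq_driver_info() {
    dir="$CPU_ROOT/cpu0/cpufreq"
    # scaling_driver names the active driver, e.g. intel_pstate or acpi-cpufreq
    printf 'driver: %s\n' "$(cat "$dir/scaling_driver")"
    # scaling_governor names the policy currently applied by that driver
    printf 'governor: %s\n' "$(cat "$dir/scaling_governor")"
}
```

Running `cpufreq_driver_info` before and after switching drivers confirms which one each set of observations belongs to.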
Comment 33 Greg Sutcliffe 2014-08-26 14:35:58 UTC
Ok, so after a few days testing each variant, I've noticed the following behaviours:

With intel_pstate:

As discussed previously, setting the governor to "performance" (from "powersave") dramatically improves the framerate. Slowdowns still happen, but are less frequent. When they do happen, they last for some considerable time (minutes or 10s of minutes, rather than seconds). Gaming feels smooth apart from during the slowdowns.

With acpi_cpufreq:

Slowdowns still happen with this driver, but last for a much shorter timeframe (a few seconds, typically). Gameplay thus feels less interrupted, but stutters more as the slowdowns kick in and then stop again. Changing the cpufreq governor to "performance" from "ondemand" had no effect on the frequency or duration of the slowdowns. Frequency of slowdown occurrence was higher than pstate-with-performance, but lower than pstate-without-performance.

Certainly, it doesn't appear to be confined to one driver, although the precise symptoms vary slightly. Chris, any further ideas for info I can get for you?
Comment 34 Chris Wilson 2014-08-26 14:49:47 UTC
Do you have any control over cooling? It might be interesting to force the fans to maximum and then see if the throttling changes. I think the slowdowns are due to thermal issues, just the recovery is different between the two cpufreq drivers.
Comment 35 Greg Sutcliffe 2014-08-26 15:03:41 UTC
I'll see what I can do. Sensors isn't reporting any fan data, so it's hard to be sure. Next time I reboot, I'll check the BIOS options (and revision, maybe there's an upgrade).

However, it doesn't feel especially hot. I've been stuck at 7fps in town for the last 20min, yet sensors is reporting this:

[root@emerald ~]# sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +63.0°C  (high = +86.0°C, crit = +100.0°C)
Core 0:         +58.0°C  (high = +86.0°C, crit = +100.0°C)
Core 1:         +62.0°C  (high = +86.0°C, crit = +100.0°C)
Core 2:         +59.0°C  (high = +86.0°C, crit = +100.0°C)
Core 3:         +58.0°C  (high = +86.0°C, crit = +100.0°C)

Doesn't seem especially high... or am I misreading that?
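
Besides the temperatures, the kernel keeps per-CPU counters of thermal throttle events on Intel hardware, which can show whether throttling happened even if the sensors look fine at the moment they are read. This is a sketch that assumes the `thermal_throttle` sysfs directory is present on this kernel; `throttle_counts` and `CPU_ROOT` are names invented for the example.

```shell
#!/bin/sh
# Sketch: print the thermal throttle event counters for each CPU.
# CPU_ROOT is an assumption that makes the path easy to override.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

throttle_counts() {
    for d in "$CPU_ROOT"/cpu[0-9]*/thermal_throttle; do
        [ -d "$d" ] || continue
        cpu=${d#"$CPU_ROOT"/}   # drop the leading directory
        cpu=${cpu%%/*}          # keep only the "cpuN" component
        # Fall back to "n/a" where a counter file is missing
        core=$(cat "$d/core_throttle_count" 2>/dev/null || echo n/a)
        pkg=$(cat "$d/package_throttle_count" 2>/dev/null || echo n/a)
        printf '%s core=%s package=%s\n' "$cpu" "$core" "$pkg"
    done
}
```

Comparing the output of `throttle_counts` before and after a slowdown would show whether the counters moved during it.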
Comment 36 Chris Wilson 2014-08-26 15:06:41 UTC
Looks fine to me as well. :|
Comment 37 Greg Sutcliffe 2014-08-26 15:38:25 UTC
Sadly the BIOS does not report any fan control, and I can't seem to make sensors report anything about the fans at all. There are custom BIOSs available for the L702x, so I could try that, I guess, if I can recall how to apply things that are invariably designed for running from Windows :)
Comment 38 Jesse Barnes 2014-12-05 19:48:32 UTC
Asking Dirk to take a look from a cpufreq perspective.  We may need some APIs here between the GPU driver and cpufreq to negotiate the right perf level on each.
Comment 39 Greg Sutcliffe 2014-12-06 16:24:17 UTC
Hey, I'd forgotten about this - glad to see there's still action on it. I've mostly been playing non-intensive games recently, so I've not seen the issue for a while, but I'm likely to be returning to Path of Exile (my main offender) in a week or two, so I'll keep an eye out for it.

Still happy to test things if needed, just let me know.
Comment 40 Jesse Barnes 2015-03-30 20:59:14 UTC
Any news?  Have newer kernels helped at all, maybe with some changes to cpufreq?
Comment 41 Greg Sutcliffe 2015-07-14 11:22:14 UTC
(In reply to Jesse Barnes from comment #40)
> Any news?  Have newer kernels helped at all, maybe with some changes to
> cpufreq?

Ok, so having returned to more intensive gaming, I've been keeping an eye out, and it just happened again. Framerate dropped to 3fps for no apparent reason that I can see. Cpufreq was already manually set to 'performance' via

for n in `seq 0 7` ; do echo performance > /sys/devices/system/cpu/cpu$n/cpufreq/scaling_governor ; done

Once it kicks in, it affects the whole desktop - even killing and restarting the game, I'm still stuck at 3fps.

To be honest, I've tweaked and twiddled this system so much trying to troubleshoot this in the past that I'm not even sure what my configuration is at the moment (although it's probably not much different from the above config from last year). Do you have any recommended examples for Xorg config, module settings etc, that I could try?
Comment 42 Elio 2017-02-10 20:16:15 UTC
Closing this bug since it is too old; please re-open if needed.

