44006 – [snb turbo] Gnome Shell animations slow down at random

Bug 44006 - [snb turbo] Gnome Shell animations slow down at random

Summary: [snb turbo] Gnome Shell animations slow down at random

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Ben Widawsky
QA Contact:

URL:
Whiteboard:
Keywords:

Duplicates (2):	49811 50073 (view as bug list)
Depends on:
Blocks:

Reported:	2011-12-21 03:29 UTC by Alessandro Crismani
Modified:	2017-07-24 23:03 UTC (History)
CC List:	11 users (show)

See Also:
i915 platform:
i915 features:

Attachments
Perf report head (22.74 KB, text/plain) 2011-12-22 00:44 UTC, Alessandro Crismani	no flags	Details
Perf report head - errors (2.71 KB, text/plain) 2011-12-22 01:05 UTC, Alessandro Crismani	no flags	Details
Perf report head with FAST data (31.61 KB, text/plain) 2011-12-22 06:08 UTC, Alessandro Crismani	no flags	Details
Perf record of event i915:i915_gem_request_wait_begin with FAST anmations (1.72 KB, text/plain) 2011-12-22 06:13 UTC, Alessandro Crismani	no flags	Details
Perf record of event i915:i915_gem_request_wait_begin with SLOW anmations (1.72 KB, text/plain) 2011-12-24 01:42 UTC, Alessandro Crismani	no flags	Details
Perf record of all the system with SLOW animations (101.01 KB, text/plain) 2011-12-24 01:46 UTC, Alessandro Crismani	no flags	Details
Dump of intel_reg_dumper with slow animations and GPU stuck at 650MHz (12.13 KB, text/plain) 2012-03-21 01:05 UTC, Alessandro Crismani	no flags	Details
Dump of intel_reg_dumper with fast animations and GPU scaling between 1300MHz and 650MHz (12.13 KB, text/plain) 2012-03-21 01:06 UTC, Alessandro Crismani	no flags	Details
Diff between reg dumps (482 bytes, text/plain) 2012-03-21 01:07 UTC, Alessandro Crismani	no flags	Details
Always unsuppress down interrupts (2.91 KB, patch) 2012-04-25 11:15 UTC, Chris Wilson	no flags	Details \| Splinter Review
Always unsuppress down interrupts (3.52 KB, patch) 2012-04-26 01:26 UTC, Chris Wilson	no flags	Details \| Splinter Review
Periodically sanity check powermanagement (4.70 KB, patch) 2012-04-26 01:37 UTC, Chris Wilson	no flags	Details \| Splinter Review
Periodically sanity check powermanagement (4.70 KB, patch) 2012-04-26 02:03 UTC, Chris Wilson	no flags	Details \| Splinter Review
Periodically sanity check powermanagement (4.93 KB, patch) 2012-04-26 02:18 UTC, Chris Wilson	no flags	Details \| Splinter Review
Always reset interrupts along with setting freqeuncy (5.51 KB, patch) 2012-04-27 05:55 UTC, Chris Wilson	no flags	Details \| Splinter Review
Show Obsolete (4) View All

Description Alessandro Crismani 2011-12-21 03:29:25 UTC

Hi everybody,
I am experiencing a performance downgrade in Gnome Shell over time. More
explicitly, after some time using Gnome Shell, the scaling effect "gets
corrupted" and slows down. To get fast animations back I have to reboot the system. The slowing down happens almost always when I leave the laptop idle for some time, from about an hour to longer idle periods (I am tempted to say it always happens when the laptop is idle since it never did otherwise in the last three weeks, but I kind of remember it happening when I was working on my laptop quite some time ago).

I have first reported it on Gnome Bugzilla (see https://bugzilla.gnome.org/show_bug.cgi?id=665846), where devs suggested to open a bug report here, since it seems to pertain to Intel drivers. I've tried many software combinations, including (on Arch Linux):

Kernel 3.0.x (where x was the stable minor version at different points in
time)
Kernel 3.1.rcx
Kernel 3.1.x
Kernel 3.2.rcx

Mesa 7.10.4
Mesa 7.11.0
Mesa from the master tree as of last week (and some other older snapshots).

xf86-video-intel 2.16
xf86-video-intel 2.17
xf86-video-intel git from master tree as of last week
xf86-video-intel enabling SNA, git from master tree as of last week

I've also tried Fedora 16, with similar results.

The problem is that I have no clue about how to debug this. When animations are slow I cannot see a higher CPU or memory usage. Somebody suggested perf and I recorded the system both before corruption and after it, but results seem similar. The perf data are available at:

https://dl.dropbox.com/s/nkl5llmht85o81z/fast.data?dl=1
https://dl.dropbox.com/s/96qcbbhklb2d1a8/slow.data?dl=1

I've tried monitoring the Xorg resources usage and it seems to be under control. Furthermore, killing Xorg doesn't help returning to fast animations, which are restored only upon a reboot.

Can anybody help me getting useful information on this? I feel sorry for this vague and information-less bug report, but I am lost in the dark. Besides, this is my first report here, so please be kind if something is wrong with it :)

Alessandro

Comment 1 Chris Wilson 2011-12-21 15:21:50 UTC

I think there is a clue in the slow data in that we are hitting some pixman paths. As I can't resolve that without your debugs symbols, can you do perf report -i slow.data --comm=Xorg | head -1000?

If I am the right track, perf record -f -g -a -c 1 -e i915:i915_gem_request_wait_begin will show hits during the slow periods and none during the fast.

Comment 2 Alessandro Crismani 2011-12-22 00:44:52 UTC

Created attachment 54669 [details]
Perf report head

Output of report -i slow.data --comm=Xorg | head -1000

Comment 3 Alessandro Crismani 2011-12-22 01:05:49 UTC

Created attachment 54671 [details]
Perf report head - errors

Errors on standard output when running report -i slow.data --comm=Xorg | head -1000

Am I missing debug symbols for some components? I'll gladly recompile them with debug if it's needed.

Comment 4 Chris Wilson 2011-12-22 02:09:53 UTC

Yes, you are indeed missing the debug symbols for the xserver, xf86-video-intel and libdrm/libdrm_intel. If you are going to recompile, add ./configure --enable-sna to xf86-video-intel.

Comment 5 Alessandro Crismani 2011-12-22 06:08:53 UTC

Created attachment 54695 [details]
Perf report head with FAST data

Hi Chris,

I am attaching the output of
report -i fast.data --comm=Xorg | head -1000,

where fast.data has been obtained with perf record -a -g sleep 30, after recompiling xorg-server, xf86-video-intel-git, mesa-git, libdrm 2.4.29 with debug symbols. This report is obtained with fast animations, I'll attach another one obtained with slow animations as soon as I hit that situation.

I am still getting some errors for missing symbols that I think are from the kernel. Should I recompile that as well or reports like this are enough? The errors are:

No kallsyms or vmlinux with build-id 85e2030964f6c266a13a3e4341e8eb4394c2a6b2 was found
[i915] with build id 85e2030964f6c266a13a3e4341e8eb4394c2a6b2 not found, continuing without symbols
No kallsyms or vmlinux with build-id 9a0592ddb0090f6dbab3ef9860bf5b2261a9c1c0 was found
[drm] with build id 9a0592ddb0090f6dbab3ef9860bf5b2261a9c1c0 not found, continuing without symbols
No kallsyms or vmlinux with build-id 0621f08ae28b945229f032a721c7831853282d4f was found
[drm_kms_helper] with build id 0621f08ae28b945229f032a721c7831853282d4f not found, continuing without symbols
No kallsyms or vmlinux with build-id f56d9fbb972752b03ce39d1731d4186e4a28ce2a was found
[evdev] with build id f56d9fbb972752b03ce39d1731d4186e4a28ce2a not found, continuing without symbols
No kallsyms or vmlinux with build-id 81adf06cf0545c68b41b86f5cf7821a6d8ec91bf was found
[hid] with build id 81adf06cf0545c68b41b86f5cf7821a6d8ec91bf not found, continuing without symbols
No kallsyms or vmlinux with build-id b87e9c844eea506f7c82a83fe866b54b17608f72 was found
[usbhid] with build id b87e9c844eea506f7c82a83fe866b54b17608f72 not found, continuing without symbols
No kallsyms or vmlinux with build-id 48d4b06c2a6398c859c112d0b4734e7192152022 was found
[usbcore] with build id 48d4b06c2a6398c859c112d0b4734e7192152022 not found, continuing without symbols
No kallsyms or vmlinux with build-id b8bd13c3edbb8eb39a065a3bd77aacd489934cf9 was found
[ehci_hcd] with build id b8bd13c3edbb8eb39a065a3bd77aacd489934cf9 not found, continuing without symbols
No kallsyms or vmlinux with build-id ed67840d791d608ab97312604526da6802511832 was found
[e1000e] with build id ed67840d791d608ab97312604526da6802511832 not found, continuing without symbols
No kallsyms or vmlinux with build-id ca5217db63f6f2107bda1a042e86c82b9d716426 was found
[intel_gtt] with build id ca5217db63f6f2107bda1a042e86c82b9d716426 not found, continuing without symbols
No kallsyms or vmlinux with build-id c29171663a3e4cbe082fd749dd2cb668cb7e0dc9 was found
[scsi_mod] with build id c29171663a3e4cbe082fd749dd2cb668cb7e0dc9 not found, continuing without symbols
No kallsyms or vmlinux with build-id c197dc73c319d875e19626530f56d764a5baaf68 was found
[ext4] with build id c197dc73c319d875e19626530f56d764a5baaf68 not found, continuing without symbols
No kallsyms or vmlinux with build-id c8b4e182ab878a7b55f9bd67c48aae9b1fb943e5 was found
[libata] with build id c8b4e182ab878a7b55f9bd67c48aae9b1fb943e5 not found, continuing without symbols
No kallsyms or vmlinux with build-id 2d45008140e2aed83b4a4ebf2f28d390947346a7 was found
[libahci] with build id 2d45008140e2aed83b4a4ebf2f28d390947346a7 not found, continuing without symbols


Thanks a lot!

Comment 6 Alessandro Crismani 2011-12-22 06:13:53 UTC

Created attachment 54696 [details]
Perf record of event i915:i915_gem_request_wait_begin with FAST anmations

Output of:
perf report -i fast_i915_gem_request_debug.data | head -1000

where fast_i915_gem_request_debug.data has been obtained with:
sudo perf record -f -g -a -c 1 -e i915:i915_gem_request_wait_begin

The perf record was run before hitting slow animations

Comment 7 Chris Wilson 2011-12-22 07:26:20 UTC

I think I have enough of the stacks there to be able to find the critical paths. When you hit the slow can you use head -3000, thanks.

Comment 8 Alessandro Crismani 2011-12-24 01:42:05 UTC

Created attachment 54781 [details]
Perf record of event i915:i915_gem_request_wait_begin with SLOW anmations

Obtained from perf record -f -g -a -c 1 -e i915:i915_gem_request_wait_begin

with slowed down animations

Comment 9 Alessandro Crismani 2011-12-24 01:46:04 UTC

Created attachment 54782 [details]
Perf record of all the system with SLOW animations

Obtained using perf report -i slow.data --comm=Xorg | head 4000 > slow_head.txt,

where slow.data is the output of perf record -a -g sleep 30

Comment 10 Alessandro Crismani 2011-12-24 01:51:47 UTC

Hi Chris,

I've attached the requested files, hope they are ok.

Is there any reason why this happens if I leave the laptop idling and not otherwise? Furthermore, I've noticed that while a killall Xorg doesn't help, a suspend/resume cycle gets rid of slow animations. I hope these little details might be of some help.

Lastly, since we are almost there, I'd like to wish you a happy holiday and a merry Christmas.

Alessandro

Comment 11 Alessandro Crismani 2012-01-05 05:24:07 UTC

I take back that it happens only when idle. It happened at least once when I was actively using the laptop during the holidays. Despite that, I can usually replicate it by leaving the computer sitting for about an hour.

Comment 12 Alessandro Crismani 2012-01-23 09:09:48 UTC

Chris, do you need more logs? I'll try to provide as much information as possible, since this lowers my productivity using Gnome Shell as I have to go through sleep/resume cycles.

As an update, I see it happening using the release candidate of mesa 8.0 and xf86-video-intel from git with --enable-sna=gen6. It also happened more times when I was actively working on the laptop.

Hope this isn't too much noise.

Comment 13 Chris Wilson 2012-01-26 12:53:42 UTC

Sorry for the delay, but no, there is nothing obviously wrong with the above. So if we conclude that is not the software that is running that is anomalous, we turn to the hw. Does adjusting the powersaving make any difference? In particular setting the CPU governors to 'performance', or disabling rc6 and fbc (append i915.i915_enable_fbc=0 i915.i915_enable_rc6 to your kernel boot parameters)?

Comment 14 Alessandro Crismani 2012-01-26 13:06:55 UTC

Sorry for pushing the bug report, I realised that I was impatient after clicking "save changes". I'll try to figure out if power-saving features matter in the next days and I'll report back!

Besides, if it may help you I might set up a remote access to the machine for debugging purposes. I don't know if that could be of any help, but it seems that nobody in any forum on the net has the same problem as me.

Comment 15 Chris Wilson 2012-01-26 14:09:24 UTC

If you can get access to the machine over ssh whilst it is being slow, please come on to irc and poke me, ickle in #intel-gfx on irc.freenode.net.

Comment 16 Alessandro Crismani 2012-02-05 03:39:58 UTC

Hi Chris,

Sorry for not having got in touch lately. I've been running many trials and I am convinced that the culprit here is RC6. I have never seen a slowdown when booting with i915.i915_enable_rc6=0, while they happen when the parameter is set to 1.

Note that echoing 1 to /sys/module/i915/parameters/i915_enable_rc6 when the system already slowed down doesn't help, I have to suspend and resume.

Would it be useful for you having access to the machine when it slowed down after having enabled RC6? If so, let me know which packages should have debugging symbols so I can rebuild them.

Cheers,
Alessandro

Comment 17 Alessandro Crismani 2012-02-05 03:54:43 UTC

(In reply to comment #16)

> Note that echoing 1 to /sys/module/i915/parameters/i915_enable_rc6 when the
> system already slowed down doesn't help, I have to suspend and resume.
> 

Obviously that was echoing 0.
Sorry for the added noise

Comment 18 Chris Wilson 2012-02-05 03:58:06 UTC

Yes having access to the machine, even via a proxy such as yourself, when it slows down would be extremely useful. At this moment in time, I don't have a good idea as to what we are looking for, so I expect many shots in the dark until we strike upon a clue. Having the debug symbols for the entire gfx stack, (kernel, X, xf86-video-intel, libdrm, mesa, libc etc) available would be a good start. Also installing trace-cmd alongside perf so that we can watch what is happening inside i915.ko would be useful.

As you have noticed, rc6 is only enabled/disabled at module initialisation (such as upon load or after resume).

Comment 19 Chris Wilson 2012-02-08 07:06:09 UTC

Also under suspicion is a MCE event with rc6 enabled. I've asked Alessandro to see if they correlate with the slowdowns.

Comment 20 Alessandro Crismani 2012-02-14 04:44:32 UTC

Hi Chris,

I've monitored "Machine check events" in dmesg trying to correlate them with slowdowns, but they don't seem to match.

For example, I had an handful of MCEs this morning, but the slowdown happened much later than the last MCE.

To be more clear, this is the ouptut for today of cat /var/log/kernel.log | grep "Machine check events":
Feb 14 09:26:32 localhost kernel: [  299.790478] [Hardware Error]: Machine check events logged
Feb 14 09:31:32 localhost kernel: [  599.121260] [Hardware Error]: Machine check events logged
Feb 14 09:41:32 localhost kernel: [ 1197.772901] [Hardware Error]: Machine check events logged
Feb 14 09:46:32 localhost kernel: [ 1497.103691] [Hardware Error]: Machine check events logged

The slow happened between 11:15 and 12:30, when I was away and the laptop was left idle at my lab.

I've seen a similar behaviour during these days, where I had some MCEs, but their timestamps didn't agree at all with slowdowns.

Comment 21 Eugeni Dodonov 2012-02-14 07:14:44 UTC

Could you please try with the patches at http://lists.freedesktop.org/archives/intel-gfx/2012-February/015004.html and different i915_enable_rc6 parameter values, and check if any of those improve the situation for you?

Comment 22 Alessandro Crismani 2012-02-14 07:16:56 UTC

Sure, I'll give it a go as soon as possible and I'll report back results

Thanks a lot for trying to help, it is really appreciated here!

Comment 23 Alessandro Crismani 2012-02-27 01:09:33 UTC

Here I am with some results, unfortunately not very good ones.

I've tried separately enabling RC6, deep RC6 and deepest RC6, and I've noticed slowdowns using each configuration. On the contrary, I've run four days (and idle nights) without slowdowns using i915_enable_rc6=0.

This weekend I've left the laptop downloading a huge file and it didn't slowdown, even though it was running with i915_enable_rc6=1. May the download have prevented idling (or going into RC6) and hence somewhat prevented a slowdown? Is this nonsense or does it make some sense?

Can I be more helpful in any way?
Alessandro

Comment 24 Daniel Vetter 2012-03-02 08:03:35 UTC

On a blind guess might be related to #45289 but hard to tell what's actually going on. One thing to look out for is whether we fail to ramp up the gpu clock, see the CAGF field in i915_cur_delayinfo in debugfs.

Comment 25 Alessandro Crismani 2012-03-02 09:00:42 UTC

I have indeed kind of different results with the tests of #45289.

This is when the system slowed down:

Fri Mar  2 17:48:32 2012
Sync time adjustment is 0.0549 msecs.

  80000 reps @   0.0746 msec ( 13400.0/sec): GetImage 10x10 square
  80000 reps @   0.0773 msec ( 12900.0/sec): GetImage 10x10 square
  80000 reps @   0.0880 msec ( 11400.0/sec): GetImage 10x10 square
  80000 reps @   0.1002 msec (  9980.0/sec): GetImage 10x10 square
  80000 reps @   0.0728 msec ( 13700.0/sec): GetImage 10x10 square
 400000 trep @   0.0826 msec ( 12100.0/sec): GetImage 10x10 square

    800 reps @   7.8304 msec (   128.0/sec): GetImage 500x500 square
    800 reps @   7.9460 msec (   126.0/sec): GetImage 500x500 square
    800 reps @   8.0146 msec (   125.0/sec): GetImage 500x500 square
    800 reps @   7.9976 msec (   125.0/sec): GetImage 500x500 square
    800 reps @   8.1326 msec (   123.0/sec): GetImage 500x500 square
   4000 trep @   7.9842 msec (   125.0/sec): GetImage 500x500 square

Then I put the laptop to sleep, resumed, and run the same test:

Fri Mar  2 17:51:34 2012

Sync time adjustment is 0.0319 msecs.

 160000 reps @   0.0375 msec ( 26700.0/sec): GetImage 10x10 square
 160000 reps @   0.0376 msec ( 26600.0/sec): GetImage 10x10 square
 160000 reps @   0.0379 msec ( 26400.0/sec): GetImage 10x10 square
 160000 reps @   0.0373 msec ( 26800.0/sec): GetImage 10x10 square
 160000 reps @   0.0399 msec ( 25100.0/sec): GetImage 10x10 square
 800000 trep @   0.0380 msec ( 26300.0/sec): GetImage 10x10 square

    800 reps @   7.1345 msec (   140.0/sec): GetImage 500x500 square
    800 reps @   7.0899 msec (   141.0/sec): GetImage 500x500 square
    800 reps @   7.1427 msec (   140.0/sec): GetImage 500x500 square
    800 reps @   7.1551 msec (   140.0/sec): GetImage 500x500 square
    800 reps @   7.1397 msec (   140.0/sec): GetImage 500x500 square
   4000 trep @   7.1324 msec (   140.0/sec): GetImage 500x500 square

The value of CAGF stays at 650MHz, ever (at least I've tried monitoring it and it never changed before and after the sleep/resume cycle).

Comment 26 Daniel Vetter 2012-03-02 09:30:33 UTC

> --- Comment #25 from Alessandro Crismani <alessandro.crismani@gmail.com> 2012-03-02 09:00:42 PST ---
> I have indeed kind of different results with the tests of #45289.

That's intriguing. Please recheck when your system is slow again that this
does indeed correlate (or whether it's just a one-time concidence).

> The value of CAGF stays at 650MHz, ever (at least I've tried monitoring it and
> it never changed before and after the sleep/resume cycle).

I'll only go up if something is actually being rendered, i.e. you need to
look out for whether it's fixed at 650 when the system is slow or not.

Comment 27 Alessandro Crismani 2012-03-04 15:05:19 UTC

Daniel,

It seems you were on spot :)

When I see slow downs the system is indeed unable to ramp up the GPU clock. I've tried running the x11perf test and monitoring the CAGF value and it stays constant at 650 MHz. On the other hand, after a sleep/resume cycle (that is, getting back to fast operations), as soon as I trigger the test the CAGF raches 1300 MHz.

Videos showing the behaviour are here:
slow -> https://www.dropbox.com/sh/c3fk0v8r2b5sj05/EJ5AeYjHoq/slow-x11.webm
fast -> https://www.dropbox.com/sh/bdohawi2bcac9x3/oQzbFQ6IhL/fast-x11.webm

Can I be of more help isolating why the frequency is stuck and doesn't go up?

Comment 28 Chris Wilson 2012-03-04 15:11:47 UTC

As a side note, current GPU frequency (at least as reported by i915_cur_delayinfo) does not correlate with fast/slow getimage500.

Comment 29 Chris Wilson 2012-03-04 15:16:05 UTC

Foot-in-mouth time. I was expecting to see current frequency jump, but instead the change is in render P-state. For slow getimage500, rps is 133, and for fast we jump to 131.

Comment 30 Alessandro Crismani 2012-03-05 00:53:43 UTC

Does it mean that the GPU clock is not related to slowdowns? I thought it was a good candidate.

If you need other logs or information please tell me, I'd love to help as much as possible, even tough I tend to get badly lost when it comes to technical discussion on GPU features.

Comment 31 Alessandro Crismani 2012-03-20 02:03:28 UTC

Hi everybody,

I've tried monitoring the situation for some time and the slow downs perfectly correlate with the frequency of the GPU not going up. Every time I see slowdowns the CAGF is stuck at 650 MHz. As soon as I go through sleep/resume the slowdown disappears, while the CAGF again varies between 650MHz and 1300MHz. This happened many times during the past weeks, and I always saw the same symptoms. 

I remember trying disabling RC6 and I recall that the frequency never went under 900 MHz, which might explain why I can't replicate the problem without enabling RC6.

How may I help further for identifying why the frequency does not ramp up when needed?

Comment 32 Eugeni Dodonov 2012-03-20 11:11:11 UTC

This is the same issue Eric was having I suspect, and I noticed it a couple of times as well on IVB as well.

Just to confirm, could you please attach the diff between 'intel_reg_dumper' outputs from the latest intel-gpu-tools package from git with both good and bad configuration?

Thanks!
Eugeni

Comment 33 Alessandro Crismani 2012-03-21 01:05:32 UTC

Created attachment 58796 [details]
Dump of intel_reg_dumper with slow animations and GPU stuck at 650MHz

Comment 34 Alessandro Crismani 2012-03-21 01:06:18 UTC

Created attachment 58797 [details]
Dump of intel_reg_dumper with fast animations and GPU scaling between 1300MHz and 650MHz

Comment 35 Alessandro Crismani 2012-03-21 01:07:11 UTC

Created attachment 58798 [details]
Diff between reg dumps

Comment 36 Alessandro Crismani 2012-03-21 01:09:30 UTC

Here are the outputs of intel_reg_dumper in both situations and their difference. The diff seems to be quite silent, hope it helps. Again, when having slow animations the frequency was bound to 650 MHz.

Thanks a lot for keeping up with this!

Comment 37 ValdikSS 2012-03-26 02:59:11 UTC

Alessandro Crismani, you're not alone! I'm experiencing this bug too, but I'm using kde and kwin and have no noticeable slowdowns on desktop, but when I run any game I can see fps degradation by 2 times usually. Have to suspend/resume as do you. And I definitely have this bug with rc6=0 too. Tested on both uxa and sna.
By the way, what laptop do you have? Thinkpad x220 here.

Comment 38 Alessandro Crismani 2012-03-26 03:08:35 UTC

Thinkpad T420s.

I thought it didn't happen with RC6 disabled. I noticed that the min frequency with i915_enabl_rc6=0 case is 900 MHz, which might be enough for not noticing a slow down using Gnome Shell, but may impact games. However, I'll go through a deeper testing using rc6=0 since my old analysis might have been wrong. Glad to see I'm not the only one.

Comment 39 ValdikSS 2012-03-26 04:15:31 UTC

Alessandro Crismani, what software are you using to monitor gpu frequency?

Comment 40 Alessandro Crismani 2012-03-26 05:30:34 UTC

As per comment 24, I am monitoring the value of CAGF in
/sys/kernel/debug/dri/64/i915_cur_delayinfo

Comment 41 Alessandro Crismani 2012-04-23 01:51:51 UTC

For your information,

I am seeing this on another laptop equipped with a SNB GPU. My own laptop is a Lenovo T420s, and the one I am using now comes from my lab and is a Dell Latitude E5520.

Both get stuck at 650 MHz after some time, hence it seems that the behaviour is not bound to a specific machine.

Comment 42 Chris Wilson 2012-04-25 11:15:23 UTC

Created attachment 60586 [details] [review]
Always unsuppress down interrupts

Sounds strange to believe that failing to receive the down interrupt is responsible for being stuck at minimum frequency...

Comment 43 Alessandro Crismani 2012-04-25 11:25:21 UTC

I'll try this as soon as my laptop comes back from Lenovo's support center. Does it apply to a particular kernel version? Particulalry, am I ok using it together with the last 3.4-rc?

Cheers,
Alessandro

Comment 44 Chris Wilson 2012-04-25 11:27:15 UTC

It should apply to any recent kernel, we haven't poked around in that function for a few months at least ;)

Comment 45 ValdikSS 2012-04-25 12:08:42 UTC

Patched 3.3.3, will post result tomorrow

Comment 46 ValdikSS 2012-04-25 23:29:32 UTC

No, the problem is still exists.

Comment 47 Chris Wilson 2012-04-26 01:17:57 UTC

With in the stuck condition, try: sudo intel_reg_write 0xa014 0x1a000000

(And please attach out what the value was before.)

Comment 48 Chris Wilson 2012-04-26 01:26:13 UTC

Created attachment 60601 [details] [review]
Always unsuppress down interrupts

Slightly improved patch; sanity-checker coming up.

Comment 49 ValdikSS 2012-04-26 01:32:59 UTC

(In reply to comment #47)
> With in the stuck condition, try: sudo intel_reg_write 0xa014 0x1a000000
> 
> (And please attach out what the value was before.)

$ sudo intel_reg_write 0xa014 0x1a000000
Value before: 0x0
Value after: 0x1A000000

And it's not stuck now.
Should I compile a kernel with new patch?

Comment 50 Chris Wilson 2012-04-26 01:37:49 UTC

Created attachment 60602 [details] [review]
Periodically sanity check powermanagement

Hopefully this should fixup the bug when it happens and at least leave noise in the dmesg.

Comment 51 ValdikSS 2012-04-26 01:45:18 UTC

Should I apply 60601 and 60602 patches or just 60602?

Comment 52 Chris Wilson 2012-04-26 02:00:18 UTC

(In reply to comment #51)
> Should I apply 60601 and 60602 patches or just 60602?

Both patches. 60602 should workaround it ever getting stuck again, and maybe 60601 will just work! But 60602 will leave some *ERROR* in the dmesg if the fixup is required, so we should be able to tell if 60601 is sufficient even if you also apply 60602.

I hope that was clear. :)

Comment 53 Chris Wilson 2012-04-26 02:03:30 UTC

Created attachment 60603 [details] [review]
Periodically sanity check powermanagement

The new 60602. Oops.

Comment 54 ValdikSS 2012-04-26 02:16:14 UTC

Oh, sorry please. It seems that I didn't apply the patch so don't count Comment 46.
And patch from Comment 53 should be applied on 3.4-rc kernels I believe, because there is no intel-pm.c in 3.3.3.
Applied patch only from Comment 48, will report soon.

Comment 55 Chris Wilson 2012-04-26 02:18:27 UTC

Created attachment 60605 [details] [review]
Periodically sanity check powermanagement

I will get this patch right eventually.

Comment 56 Chris Wilson 2012-04-26 02:25:46 UTC

Ho hum:

[   37.812915] [drm:gen6_sanitize_pm] *ERROR* Power management discrepancy: GEN6_RP_INTERRUPT_LIMITS expected 19000000, was 12060000
[   47.232592] [drm:gen6_sanitize_pm] *ERROR* Power management discrepancy: GEN6_RP_INTERRUPT_LIMITS expected 19000000, was 00070000

Comment 57 ValdikSS 2012-04-26 04:20:53 UTC

This time with patch from Comment 48 I got stuck again and
sudo intel_reg_write 0xa014 0x1a000000
Value before: 0xD0000
Value after: 0x1A000000
Should I apply patch from the Comment 55 to the latest 3.4-rc?

Comment 58 Chris Wilson 2012-04-26 04:26:54 UTC

(In reply to comment #57)
> This time with patch from Comment 48 I got stuck again and
> sudo intel_reg_write 0xa014 0x1a000000
> Value before: 0xD0000
> Value after: 0x1A000000
> Should I apply patch from the Comment 55 to the latest 3.4-rc?

Please do. It should automate the process of manually poking that register.

Comment 59 Alessandro Crismani 2012-04-26 05:42:39 UTC

Laptop finally back.

I can't apply the sanity check patch to 3.4-rc4, due to a missing intel_pm.c, should I fetch a kernel from some git repository?

Comment 60 Chris Wilson 2012-04-26 05:45:12 UTC

The split is currently in:

http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next

Comment 61 ValdikSS 2012-04-27 01:10:05 UTC

I compiled latest kernel from http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next, applied 2 patches and got no stucks last night because sanity check does what it should.
There are 16 errors in dmesg from sanity check
[17178.081019] [drm:gen6_sanitize_pm] *ERROR* Power management discrepancy: GEN6_RP_INTERRUPT_LIMITS expected 18000000, was 12060000
[17233.325635] [drm:gen6_sanitize_pm] *ERROR* Power management discrepancy: GEN6_RP_INTERRUPT_LIMITS expected 18000000, was 000d0000
[18048.631388] [drm:gen6_sanitize_pm] *ERROR* Power management discrepancy: GEN6_RP_INTERRUPT_LIMITS expected 000d0000, was 00000000
There is another problem which I don't think to be related and likely because of git kernel, but i'll post it anyway. I got 4 oopese this night like this
[17174.628145] ------------[ cut here ]------------
[17174.628155] WARNING: at drivers/gpu/drm/i915/i915_drv.c:465 gen6_gt_check_fifodbg.isra.3+0x40/0x50 [i915]()
[17174.628156] Hardware name: 4286CTO
[17174.628157] MMIO read or write has been dropped ffffffff
[17174.628158] Modules linked in: tp_smapi(O) thinkpad_ec(O) cpufreq_powersave tun snd_hda_codec_hdmi snd_hda_codec_conexant hid_a4tech usbhid i915 sdhci_pci arc4 thinkpad_acpi sdhci iwlwifi microcode aesni_intel aes_x86_64 aes_generic tpm_tis ghash_clmulni_intel hid mmc_core snd_hda_intel snd_hda_codec cryptd intel_agp snd_pcm snd_page_alloc snd_timer tpm tpm_bios snd_hwdep mac80211 nvram i2c_algo_bit drm_kms_helper coretemp cfg80211 drm crc32c_intel snd intel_gtt i2c_i801 acpi_cpufreq mperf i2c_core button thermal battery video ac processor mei(C) soundcore rfkill wmi evdev ext4 crc16 jbd2 mbcache ehci_hcd usbcore usb_common sd_mod ahci libahci libata scsi_mod
[17174.628186] Pid: 10315, comm: kworker/u:7 Tainted: G        WC O 3.4.0-1-intel-ga85d4bc-dirty #1
[17174.628188] Call Trace:
[17174.628193]  [<ffffffff8105062f>] warn_slowpath_common+0x7f/0xc0
[17174.628195]  [<ffffffff81050726>] warn_slowpath_fmt+0x46/0x50
[17174.628198]  [<ffffffffa0418480>] gen6_gt_check_fifodbg.isra.3+0x40/0x50 [i915]
[17174.628202]  [<ffffffffa041887e>] __gen6_gt_force_wake_put+0x1e/0x20 [i915]
[17174.628206]  [<ffffffffa0418dd8>] i915_read32+0x148/0x160 [i915]
[17174.628210]  [<ffffffffa041bd7d>] i915_update_gfx_val+0x8d/0xf0 [i915]
[17174.628215]  [<ffffffffa043bd4d>] intel_idle_update+0x6d/0x1a0 [i915]
[17174.628217]  [<ffffffff8106c57a>] process_one_work+0x12a/0x440
[17174.628222]  [<ffffffffa043bce0>] ? intel_enable_pipe+0x140/0x140 [i915]
[17174.628224]  [<ffffffff8106cdde>] worker_thread+0x12e/0x2d0
[17174.628225]  [<ffffffff8106ccb0>] ? manage_workers.isra.25+0x1f0/0x1f0
[17174.628228]  [<ffffffff810722b3>] kthread+0x93/0xa0
[17174.628231]  [<ffffffff81476c24>] kernel_thread_helper+0x4/0x10
[17174.628233]  [<ffffffff81072220>] ? kthread_freezable_should_stop+0x70/0x70
[17174.628235]  [<ffffffff81476c20>] ? gs_change+0x13/0x13
[17174.628236] ---[ end trace 12e744fba80ad589 ]---

Comment 62 Chris Wilson 2012-04-27 05:55:53 UTC

Created attachment 60664 [details] [review]
Always reset interrupts along with setting freqeuncy

I'm not mentioning the elimination of the forcewake dance for this path...

Comment 63 Chris Wilson 2012-04-27 05:56:49 UTC

Can you try the new patch along with the periodic sanity check? If we get no slow downs and no *ERROR*, maybe we are on to a winner.

Comment 64 ValdikSS 2012-04-27 07:01:51 UTC

Just 60664 and 60605? Ok.

Comment 65 Alessandro Crismani 2012-04-27 09:30:36 UTC

Hi Chris,

I run all night long, rebooted and run all day long with the previous two patches applied.

First thing, no slow downs, yay :)

I've seen some errors thrown to dmesg, but only at the very early stages after power on (I think they appeared during boot). Apart from those, my dmesg was clean.

I'll try applying the new patch as soon as possible. By the way, any chance of having a sanitize patch for 3.3.3? I am using my laptop on the go for the weekend, and I am seeing a rather high power consumption using 3.4 (not sure what causes it, I only got my laptop back yesterday and I had some major upgrades which may have broken things), hence I'll benefit from using the stable kernel.

I'll come back with more results!

Alessandro

Comment 66 ValdikSS 2012-04-27 23:52:18 UTC

Chris, everything works as expected and nice without any errors or stucks with 60664 and 60605. Thank you very much! Hope this patch to be included in 3.4.

Comment 67 Chris Wilson 2012-05-06 03:16:59 UTC

Even with the improved gen6_enable_rps() I'm still seeing the warnings from sanity check, so Daniel queued up

commit 9104183dad6314c55344d65738cd719b909a3e0a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Apr 26 11:28:42 2012 +0100

    drm/i915: Periodically sanity check power management
    
    Every time we use the device after a period of idleness, check that the
    power management setup is still sane. This is to workaround a bug
    whereby it seems that we begin suppressing power management interrupts,
    preventing SandyBridge+ from going into turbo mode.
    
    This patch does have a side-effect. It removes the mark-busy for just
    moving the cursor - we don't want to increase the render clock just for
    the sprite, though we may want to bump the display frequency. I'd argue
    that we do not, and certainly don't want to take the struct_mutex here
    due to the large latencies that introduces.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=44006
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

so that we don't suffer the abysmal performance in the interim.

Comment 68 Alessandro Crismani 2012-05-07 01:50:14 UTC

I am having trouble with the intel-drm-next kernel, in the sense that wifi keeps disconnecting and I can't work without it.

I'll wait till a new Linus's rc get sync'ed in and I'll give the new patches a round of tests as soon as that happens.

By the way, any chance of having the previous commit in 3.4?

Comment 69 Daniel Vetter 2012-05-07 02:06:01 UTC

On Mon, May 7, 2012 at 10:50 AM,  <bugzilla-daemon@freedesktop.org> wrote:
> --- Comment #68 from Alessandro Crismani <alessandro.crismani@gmail.com> 2012-05-07 01:50:14 PDT ---
> I am having trouble with the intel-drm-next kernel, in the sense that wifi
> keeps disconnecting and I can't work without it.
>
> I'll wait till a new Linus's rc get sync'ed in and I'll give the new patches a
> round of tests as soon as that happens.

You can try the drm-intel-testing branch, that contains everything
from linus up to -rc5. If you need a newer branch, I could stitch
together a quick branch for you.

> By the way, any chance of having the previous commit in 3.4?

Nope, 3.4 is for regression/severe stability fixes only now, so it
will land in 3.5.

Comment 70 Alessandro Crismani 2012-05-07 11:46:36 UTC

Still getting random disconnects with intel-drm-testing. It appears that iwlwifi has some problem under 3.4.

Despite that, I managed to work through the day using intel-drm-testing together with an ethernet cable at work, unfortunately that is not possible at home.

Again, I saw three error reports at the very first stage of system startup (I think they were spawn when booting). Then, dmesg stayed silent for the whole afternoon and I didn't experience slowdowns, which is great :)

I'll get back with a deeper analysis as soon as I find how to fix my wireless connectivity.

Comment 71 Chris Wilson 2012-05-11 11:02:08 UTC

*** Bug 49811 has been marked as a duplicate of this bug. ***

Comment 72 Daniel Vetter 2012-05-22 12:05:30 UTC

Patch merged in -fixes:

commit aa00e53b0a5afdabd5f2c4ec6a92a131b1b8618f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Apr 28 08:56:39 2012 +0100

    drm/i915: Always update RPS interrupts thresholds along with frequency

Comment 73 Chris Wilson 2012-05-22 12:17:42 UTC

*** Bug 50073 has been marked as a duplicate of this bug. ***

Comment 74 Daniel Vetter 2012-06-06 06:27:01 UTC

Reassigning to ben, he's working the hsd treadmill for us ...

Comment 75 Ben Widawsky 2012-06-12 12:07:01 UTC

For the people who've seen/hit this bug. Can you help me describe the easiest steps to reproduce?

We have some theories to test, but I am unable to reproduce this with a repeatable simple test case.

The test can require hours, but it must be simple and require no user interaction.

Comment 76 Alessandro Crismani 2012-06-12 12:12:17 UTC

My way of reproducing it was to leave the laptop idle for a certain amount of time, no user interaction, just left the notebook sitting there. Unfortunately, the amount of time was highly random, sometimes a lunch break, sometimes more than a whole night.

The other news is that I finally got around at using a newer kernel, that is 3.5-rc2, after many troubles with wireless on 3.4 and a broken UEFI on 3.5-rc1 which kept me from testing.

Using 3.5-rc2 I am not seeing slowdowns, and dmesg is free from *sanitise* errors, hence I believe that the patch that was pushed (was it pushed, right?) at least mitigated my situation.

Comment 77 Chris Wilson 2012-06-15 04:33:35 UTC

(In reply to comment #76)
> Using 3.5-rc2 I am not seeing slowdowns, and dmesg is free from *sanitise*
> errors, hence I believe that the patch that was pushed (was it pushed, right?)
> at least mitigated my situation.

Right the sanitize patch was indeed pushed along with a second patch to make it quiet. The bug is still there, just hopefully papered over now.

Comment 78 Florian Mickler 2012-07-01 03:44:17 UTC

A patch referencing this bug report has been merged in Linux v3.5-rc1:

commit 7b9e0ae6da0a7eaf2680a1a788f08df123724f3b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Apr 28 08:56:39 2012 +0100

    drm/i915: Always update RPS interrupts thresholds along with frequency

Comment 79 Daniel Vetter 2012-07-01 03:54:43 UTC

I guess until we figure out a new clue, let's just close this and hope the duct tape holds.

Comment 80 Florian Mickler 2012-10-15 20:49:35 UTC

A patch referencing a commit referencing this bug report has been merged in Linux v3.7-rc1:

commit 20b46e59dd102665ce7168baa215e5b1ee66b69b
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jul 26 11:16:14 2012 +0200

    drm/i915: Only set the down rps limit when at the loweset frequency

Comment 81 Florian Mickler 2012-10-15 21:17:54 UTC

A patch referencing a commit referencing this bug report has been merged in Linux v3.7-rc1:

commit 65bccb5c708bd9f00d24f041f4f7c45130359448
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Aug 8 17:42:52 2012 +0200

    drm/i915: fixup desired rps frequency computation

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.