Bug 27941

Summary: [gm45] hotplug storm after S3 resume (udev spins on drm device after wakeup)
Product: DRI
Reporter: Tony Mantler <nicoya>
Component: DRM/Intel
Assignee: Daniel Vetter <daniel>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: jani.nikula, jrnieder
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments: kern.log with drm.debug=0xe

Description Tony Mantler 2010-05-02 11:13:06 UTC
It seems that a few minutes after waking my laptop from sleep I'll often
notice a sudden jump in system activity. I've traced this back to udev
looping on the graphics drm device for some reason. Running udevd --debug
results in the following messages looped over and over:

1267729775.453592 [13925] event_queue_insert: seq 189967 queued, 'change'
'drm'
1267729775.453639 [13925] udev_monitor_send_device: passed 200 bytes to
monitor 0x23d72d0
1267729775.453725 [13926] worker_new: seq 189967 running
1267729775.453782 [13926] udev_device_new_from_syspath: device 0x23e4950
has devpath '/devices/pci0000:00/0000:00:02.0/drm/card0'
1267729775.453894 [13926] udev_device_read_db: device 0x23e4950 filled
with db file data
1267729775.453923 [13926] udev_rules_apply_to_event: LINK 'char/226:0'
/lib/udev/rules.d/50-udev-default.rules:2
1267729775.453951 [13926] udev_rules_apply_to_event: NAME 'dri/card0'
/lib/udev/rules.d/50-udev-default.rules:48
1267729775.454015 [13926] udev_rules_apply_to_event: RUN 'udev-acl
--action=$env{ACTION} --device=$env{DEVNAME}'
/lib/udev/rules.d/70-acl.rules:81
1267729775.454040 [13926] udev_rules_apply_to_event: RUN
'socket:@/org/freedesktop/hal/udev_event' /lib/udev/rules.d/90-hal.rules:2
1267729775.454061 [13926] udev_rules_apply_to_event: GROUP 44
/lib/udev/rules.d/91-permissions.rules:61
1267729775.454180 [13926] udev_device_update_db: created db file for
'/devices/pci0000:00/0000:00:02.0/drm/card0' in '/dev/.udev/db/drm:card0'
1267729775.454201 [13926] udev_node_add: creating device node
'/dev/dri/card0', devnum=226:0, mode=0660, uid=0, gid=44
1267729775.454221 [13926] udev_node_mknod: preserve file '/dev/dri/card0',
because it has correct dev_t
1267729775.454254 [13926] node_symlink: preserve already existing symlink
'/dev/char/226:0' to '../dri/card0'
1267729775.454289 [13926] util_run_program: 'udev-acl --action=change
--device=/dev/dri/card0' started
1267729775.458003 [13926] util_run_program: 'udev-acl --action=change
--device=/dev/dri/card0' returned with exitcode 0
1267729775.458074 [13926] udev_monitor_send_device: passed 261 bytes to
monitor 0x23e4950
1267729775.458128 [13926] udev_monitor_send_device: passed -1 bytes to
monitor 0x23e4f00
1267729775.458151 [13926] worker_new: seq 189967 processed with 0
1267729775.458221 [13925] event_queue_delete: seq 189967 done with 0
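
To put a number on the loop, something like the following could tally how many events udevd queues per second from saved `udevd --debug` output. This is only a rough sketch: the regular expression assumes the line layout shown in the excerpt above (timestamp, PID in brackets, `event_queue_insert`), and the function name `events_per_second` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   1267729775.453592 [13925] event_queue_insert: seq 189967 queued, 'change'
# The layout is assumed from the log excerpt in this report.
EVENT_RE = re.compile(
    r"^(\d+)\.\d+ \[\d+\] event_queue_insert: seq \d+ queued, '(\w+)'"
)


def events_per_second(lines):
    """Count queued udev events, keyed by (whole second, action)."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.match(line)
        if match:
            second = int(match.group(1))
            action = match.group(2)
            counts[(second, action)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "1267729775.453592 [13925] event_queue_insert: seq 189967 queued, 'change'",
        "'drm'",
        "1267729775.453639 [13925] udev_monitor_send_device: passed 200 bytes to",
    ]
    for (second, action), n in sorted(events_per_second(sample).items()):
        print(second, action, n)
```

A steady stream of 'change' entries in every second would be consistent with the storm described here, while a healthy system should show only occasional events.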

Putting the laptop to sleep and then waking it up again "warm" seems to
clear the problem, and it does not recur until the next time
I wake the system "cold".

Clearly this activity has an adverse effect on battery life.

Hardware is a Lenovo X200 laptop

penelope:/home/nicoya# lspci
00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
00:03.2 IDE interface: Intel Corporation Mobile 4 Series Chipset PT IDER Controller (rev 07)
00:03.3 Serial controller: Intel Corporation Mobile 4 Series Chipset AMT SOL Redirection (rev 07)
00:19.0 Ethernet controller: Intel Corporation 82567LM Gigabit Network Connection (rev 03)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M-E LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation ICH9M/M-E SATA AHCI Controller (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
03:00.0 Network controller: Intel Corporation Wireless WiFi Link 5300

The kernel is the stock Debian amd64 kernel version 2.6.32-9 based on upstream 2.6.32.9.

This bug is reported in Debian's bug tracking system here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=572537
Comment 1 Jesse Barnes 2010-07-01 15:45:14 UTC

*** This bug has been marked as a duplicate of bug 25259 ***
Comment 2 Jonathan Nieder 2012-03-28 15:23:18 UTC
Tony writes[1]:

> Yes, I'm still getting the issue, sort of. My understanding is that
> some parts of the hotplug pipeline have moved around, but I still
> get a spam of hotplug events from the graphics that nearly kills the
> system after waking up. The most reliable way to trigger it is to
> wake from sleep, open firefox (iceweasel), and play a youtube video
> (flash). The symptoms will often occur without going through these
> exact steps though.
>
> Executing the command "intel_reg_write 0x61110 0x0" as root stops
> the hotplug spam and restores system functionality, though this also
> apparently stops all hotplug events so the system won't detect
> attaching an external monitor or something to the VGA port.
>
> I'm currently running linux-image-3.2.0-2-amd64 package version 3.2.12-1.
>
> I could certainly test patched kernels, as I can very reliably
> reproduce the problem.

Since 3.2.12 is newer than 2.6.35, it sounds like the fix from bug 25259
doesn't take care of these symptoms.

Any hints for tracking this down?

[1] http://bugs.debian.org/572537
Comment 3 Daniel Vetter 2012-05-11 06:49:52 UTC
Ok, to dig into this one we need full dmesg with drm.debug=0xe added to the kernel cmdline. Please also grab the dmesg with the added debug options while the problem is happening.
Comment 4 Tony Mantler 2012-05-20 22:22:58 UTC
Created attachment 61900: kern.log with drm.debug=0xe

Hotplug storm occurs upon wake on May 20th, after approx 23hrs in S3 sleep.

It's worth noting that the storm does *not* occur after a few much shorter S3 sleeps on May 19th. Makes me wonder if some register isn't getting reinitialized upon wake and is resuming with a decayed value after being powered down.
Comment 5 Chris Wilson 2012-10-21 14:30:14 UTC
Timeout. Please do reopen if you can still reproduce the issue and help us diagnose the problem, thanks.
Comment 6 Jonathan Nieder 2012-10-21 19:10:19 UTC
Um, if I understand correctly then Tony replied with the log Daniel requested and then there was no reply. What did I miss?
Comment 7 Chris Wilson 2012-10-21 20:03:01 UTC
(In reply to comment #6)
> Um, if I understand correctly then Tony replied with the log Daniel
> requested and then there was no reply. What did I miss?

Just left in NEEDINFO and I did a mass-close of unchanged bug reports in that state...
Comment 8 Chris Wilson 2012-10-22 17:00:24 UTC
All indications point towards flaky hardware, as it alternates on suspend&resume cycles between different HDMI/DP ports. I'm not aware of any particular erratum concerning gm45 hotplug detection that hasn't already been implemented, so unless this is widespread across many different manufacturers I would say it was a model, even machine, specific defect.
Comment 9 Tony Mantler 2012-10-27 19:19:55 UTC
If the exact cause can't be narrowed down, is there any sort of mitigation that might be appropriate? Rate limiting duplicate hotplug events maybe?
Comment 10 Jani Nikula 2012-12-10 14:04:23 UTC
(In reply to comment #9)
> If the exact cause can't be narrowed down, is there any sort of mitigation
> that might be appropriate? Rate limiting duplicate hotplug events maybe?

Have you tried recent kernels? In comment #1, this has already been resolved dupe of bug 25259, which in turn has been resolved dupe of bug 25327, which has been fixed. There are also plenty of other irq/hotplug related changes in recent kernels; 2.6.32.9 isn't exactly new.
Comment 11 Jonathan Nieder 2012-12-10 14:39:50 UTC
(In reply to comment #10)

> Have you tried recent kernels? In comment #1, this has already been resolved
> dupe of bug 25259,

Trying newer kernels is generally good advice for the restless, but I also
want to point your attention to existing data in this same bug:

| Since 3.2.12 is newer than 2.6.35, it sounds like the fix from bug 25259
| doesn't take care of these symptoms.

Thanks and hope that helps,
Jonathan
Comment 12 Jani Nikula 2012-12-10 15:02:54 UTC
(In reply to comment #11)
> Trying newer kernels is generally good advice for the restless, but I also
> want to point your attention to existing data in this same bug:
> 
> | Since 3.2.12 is newer than 2.6.35, it sounds like the fix from bug 25259
> | doesn't take care of these symptoms.

Thanks, I missed that somehow.

Even so, the kernel is a fast moving target, and IMHO trying, say, current upstream master is much more productive than trying to go through all the changes since 3.2 in our irq/hotplug/suspend code that might have affected this bug. Also, if the problem still persists, I think debugging on current kernels that we work on is more likely to lead to correct conclusions anyway.
Comment 13 Chris Wilson 2012-12-10 16:05:18 UTC
Tasking to Daniel; far out on his hotplug todo list is interrupt mitigation for naughty hardware.
Comment 14 Tony Mantler 2012-12-11 07:44:21 UTC
Bug/behaviour is still present as of Debian kernel package 3.2.0-4, which corresponds to 3.2.35 mainline apparently.
Comment 15 Chris Wilson 2013-05-21 15:55:51 UTC
This should be fixed by the interrupt storm detection in 3.10.
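
For readers wondering what storm detection means in practice, the general idea can be sketched as below. This is an illustrative model only, not the i915 code: the class name `HotplugStormDetector`, the threshold of 5 and the one-second window are assumptions chosen for the example. The idea is to count hotplug interrupts inside a short window and stop reacting to them once the count passes a threshold. A real driver would presumably fall back to polling the connector and re-enable the interrupt later, which this sketch leaves out.

```python
class HotplugStormDetector:
    """Toy model of per-connector hotplug storm detection.

    Threshold and window length are illustrative assumptions,
    not values taken from the kernel source.
    """

    def __init__(self, threshold=5, period_s=1.0):
        self.threshold = threshold
        self.period_s = period_s
        self.window_start = None
        self.count = 0
        self.disabled = False

    def on_interrupt(self, now):
        """Return True if the event should be handled, False if suppressed."""
        if self.disabled:
            return False
        # Start a new window on the first event, or when the old one expired.
        if self.window_start is None or now - self.window_start > self.period_s:
            self.window_start = now
            self.count = 0
            return True
        self.count += 1
        if self.count > self.threshold:
            # Too many events in one window: treat it as a storm.
            self.disabled = True
            return False
        return True


if __name__ == "__main__":
    detector = HotplugStormDetector()
    # Ten interrupts 100 ms apart, roughly what a storm would look like.
    handled = [detector.on_interrupt(0.1 * i) for i in range(10)]
    print(handled, detector.disabled)
```

Isolated events, such as plugging in a monitor now and then, never fill the window and so are handled normally; only a rapid burst trips the detector.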
Comment 16 Chris Wilson 2013-06-07 18:18:24 UTC
Presumed now fixed.
Comment 17 Tony Mantler 2013-06-07 19:14:43 UTC
Sorry for the delay, 3.10 hasn't filtered its way down to my laptop just yet. When it does I'll check the behaviour and reopen if it's still being annoying.

As of 3.8 at least the problem still existed, and the intel_reg_write command also stopped working (just reported invalid argument when trying to run or some such) so that was a bit of a disaster. Not sure if I should open a bug for the command not working, as I theoretically won't need it once this bug is fixed.
Comment 18 Ben Hutchings 2013-06-11 01:22:04 UTC
(In reply to comment #17)
> Sorry for the delay, 3.10 hasn't filtered its way down to my laptop just
> yet. When it does I'll check the behaviour and reopen if it's still being
> annoying.

It's in Debian's experimental suite.
Comment 19 Tony Mantler 2013-06-16 20:20:43 UTC
Ok, looks like this is working now. I've seen a few kernel messages indicating the interrupt storm code is triggering, and the system appears to be remaining responsive with low CPU usage.

Thanks guys!
