Summary: | [ILK] Multiple hibernate/thaw cycles cause kernel errors with Intel KMS (Ironlake graphics on ThinkPad T510)
---|---
Product: | DRI
Component: | DRM/Intel
Status: | CLOSED FIXED
Severity: | major
Priority: | medium
Version: | unspecified
Hardware: | x86-64 (AMD64)
OS: | Linux (All)
Reporter: | Bojan Smojver <bojan>
Assignee: | Chris Wilson <chris>
CC: | ben, chris, daniel, eugeni, jbarnes

Description
Bojan Smojver
2011-10-11 20:08:53 UTC
I tested hibernation/thaw cycles on one of my old machines, an HP Pavilion ZE4201 notebook, which has integrated Radeon graphics (IGP 340M, which is RS200 chip). This box is a 32-bit machine, as opposed to my ThinkPad T510, which is running 64-bit stuff. Similar behaviour on hibernate/thaw - with nomodeset, no trouble. With KMS, trouble after a few hibernate/thaw cycles (NULL pointers and other kernel dumps onto the console). Interesting. Maybe the problem is not Intel specific after all. Also, removing intel_ips module does not help.

Looks like bug #40241, which is also caused by KMS.

(In reply to comment #3)
> Looks like bug #40241, which is also caused by KMS.

Yeah, it is. You can see my posts there too (unfortunately). I opened a separate bug, because I was asked to do so on the intel-gfx list. As you can see from comment #2, I am suspecting now that this is not Intel specific, but rather something common to KMS code, because I got very similar symptoms on a machine that uses radeon driver (also integrated graphics, BTW). Of course, I have no proof of this, just a feeling.

If it is coming through KMS, it could probably be not intel-specific indeed. I think Jesse is working on a patch to disable KMS before suspending, this way we'll be able to rule out this possibility. Thanks for those reports, I hope we'll be able to fix it soon!

(In reply to comment #5)
> If it is coming through KMS, it could probably be not intel-specific indeed.

Obviously, I do not understand this code enough to tell - I'm just guessing. So, use salt in abundance. :-) All I know is that on two of my machines, with different graphics hardware, hibernation works properly when I pass nomodeset to the kernel. If I leave that option out, I get trouble after several hibernate/thaw cycles.

> I think Jesse is working on a patch to disable KMS before suspending, this way
> we'll be able to rule out this possibility.
>
> Thanks for those reports, I hope we'll be able to fix it soon!

Cool.
I'm ready to test anything you may have!

In an effort to confirm that this really is a regression, I remembered an old bug, that was fixed during Fedora 13:

https://bugzilla.redhat.com/show_bug.cgi?id=537494

You will see from https://bugzilla.redhat.com/show_bug.cgi?id=537494#c67 there that another person confirmed that the problem was indeed fixed with a particular kernel release.

So, I downloaded kernel-2.6.34.9-69.fc13.x86_64.rpm from Fedora 13 updates, installed it (into Fedora 15 - yeah, I know crazy) and did over 50 hibernate/thaw cycles on my ThinkPad T510. I am writing these comments from that session. No kernel errors, no segfaults - works fine.

So, it does look like a regression, or at least looks like things have gotten worse (much worse) with newer kernels.

Just to follow up on this a bit, I cloned Linus' tree as of today (i.e. currently staged stuff for 3.2) then pulled Keith's tree (git://people.freedesktop.org/~keithp/linux drm-intel-next) over the top and compiled. Did 26 hibernate/thaw cycles and then went to check the machine.
Unfortunately, I then got:

---------------------------
[ 729.195407] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 729.199345] IP: [<ffffffff8125121d>] __list_add+0x14/0x7f
[ 729.203288] PGD 0
[ 729.207051] Oops: 0000 [#1] SMP
[ 729.210874] CPU 0
[ 729.210901] Modules linked in: fuse ppdev parport_pc lp parport bnep bluetooth sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm thinkpad_acpi snd_timer e1000e uvcvideo rfkill snd videodev media v4l2_compat_ioctl32 qcserial usb_wwan mxm_wmi wmi snd_page_alloc microcode i2c_i801 iTCO_wdt iTCO_vendor_support pcspkr intel_ips soundcore joydev ipv6 firewire_ohci firewire_core crc_itu_t sdhci_pci sdhci mmc_core i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
[ 729.231392]
[ 729.235351] Pid: 897, comm: dbus-daemon Tainted: G W 3.1.0+ #2 LENOVO 4313CTO/4313CTO
[ 729.239316] RIP: 0010:[<ffffffff8125121d>] [<ffffffff8125121d>] __list_add+0x14/0x7f
[ 729.243143] RSP: 0018:ffff88022ce57d60 EFLAGS: 00010286
[ 729.246941] RAX: ffff8801ab1454d0 RBX: 0000000000000000 RCX: 0000000000000054
[ 729.250770] RDX: 0000000000000000 RSI: ffff880229777100 RDI: ffff8801ab145520
[ 729.254604] RBP: ffff88022ce57d80 R08: ffff88020c7f28e8 R09: 00007f0aaeda4000
[ 729.258294] R10: 0000000000015ff8 R11: 0000000000015fa8 R12: ffff880229777100
[ 729.261820] R13: ffff8801ab145520 R14: ffff8802297770b0 R15: ffff8801ab1450b0
[ 729.265401] FS: 00007f0aaed80800(0000) GS:ffff88023bc00000(0000) knlGS:0000000000000000
[ 729.269032] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 729.272674] CR2: 0000000000000008 CR3: 000000022d275000 CR4: 00000000000006f0
[ 729.276338] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 729.279988] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 729.283616] Process dbus-daemon (pid: 897, threadinfo ffff88022ce56000, task ffff88022ca55c80)
[ 729.287278] Stack:
[ 729.290885] ffff88020c7f28e8 ffff88022c192d80 ffff88022a0bd500 ffff8801ab1454d0
[ 729.294593] ffff88022ce57d90 ffffffff810f1fe5 ffff88022ce57e10 ffffffff81055b6b
[ 729.298289] 0000000000000000 ffff8801ab1450e8 ffff8801ab1450f0 ffff8801ab1450c8
[ 729.301966] Call Trace:
[ 729.305664] [<ffffffff810f1fe5>] vma_prio_tree_add+0x81/0x95
[ 729.309405] [<ffffffff81055b6b>] dup_mm+0x2f3/0x488
[ 729.313143] [<ffffffff810566db>] copy_process+0x9b1/0x119c
[ 729.316888] [<ffffffff811fdca6>] ? security_file_alloc+0x16/0x18
[ 729.320631] [<ffffffff81129db5>] ? get_empty_filp+0xa4/0x133
[ 729.324351] [<ffffffff81056ff0>] do_fork+0xef/0x22d
[ 729.328029] [<ffffffff813daf9e>] ? sock_alloc_file+0xb3/0x114
[ 729.331672] [<ffffffff810440eb>] ? should_resched+0xe/0x2d
[ 729.335300] [<ffffffff8149a9bd>] ? _cond_resched+0xe/0x22
[ 729.338914] [<ffffffff813d9388>] ? might_fault+0xe/0x10
[ 729.342532] [<ffffffff81016336>] sys_clone+0x28/0x2a
[ 729.346128] [<ffffffff814a2ae3>] stub_clone+0x13/0x20
[ 729.349661] [<ffffffff814a27c2>] ? system_call_fastpath+0x16/0x1b
[ 729.353249] Code: ad de 48 b9 00 02 20 00 00 00 ad de 48 89 13 48 89 4b 08 5e 5b 5d c3 55 48 89 e5 41 55 49 89 fd 41 54 49 89 f4 53 48 89 d3 41 50 <4c> 8b 42 08 49 39 f0 74 20 49 89 d1 48 89 f1 48 c7 c2 98 15 7e
[ 729.361220] RIP [<ffffffff8125121d>] __list_add+0x14/0x7f
[ 729.365218] RSP <ffff88022ce57d60>
[ 729.369206] CR2: 0000000000000008
[ 729.439968] ---[ end trace a0f13f2533f6746a ]---
---------------------------

Followed by machine becoming weird and throwing a whole lot more kernel errors, which I could not capture any more.

The only other error was unrelated.
It looked like this:

---------------------------
[ 272.029435] ------------[ cut here ]------------
[ 272.029441] WARNING: at drivers/net/ethernet/intel/e1000e/ich8lan.c:870 e1000_acquire_swflag_ich8lan+0x4f/0x143 [e1000e]()
[ 272.029443] Hardware name: 4313CTO
[ 272.029445] e1000e: eth0: contention for Phy access
[ 272.029446] Modules linked in: fuse ppdev parport_pc lp parport bnep bluetooth sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm thinkpad_acpi snd_timer e1000e uvcvideo rfkill snd videodev media v4l2_compat_ioctl32 qcserial usb_wwan mxm_wmi wmi snd_page_alloc microcode i2c_i801 iTCO_wdt iTCO_vendor_support pcspkr intel_ips soundcore joydev ipv6 firewire_ohci firewire_core crc_itu_t sdhci_pci sdhci mmc_core i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
[ 272.029480] Pid: 5712, comm: kworker/1:4 Tainted: G W 3.1.0+ #2
[ 272.029482] Call Trace:
[ 272.029485] [<ffffffff81057a36>] warn_slowpath_common+0x83/0x9b
[ 272.029488] [<ffffffff81057af1>] warn_slowpath_fmt+0x46/0x48
[ 272.029492] [<ffffffff81084f41>] ? smp_call_function_single+0x97/0xfd
[ 272.029499] [<ffffffffa0225767>] e1000_acquire_swflag_ich8lan+0x4f/0x143 [e1000e]
[ 272.029508] [<ffffffffa022e5d0>] __e1000_read_phy_reg_hv+0x4d/0x157 [e1000e]
[ 272.029518] [<ffffffffa022ee8a>] e1000_read_phy_reg_hv+0x13/0x15 [e1000e]
[ 272.029527] [<ffffffffa02327b3>] e1000_phy_read_status+0xf6/0x163 [e1000e]
[ 272.029537] [<ffffffffa0236df2>] e1000_watchdog_task+0x104/0x5d2 [e1000e]
[ 272.029540] [<ffffffff8149a93e>] ? __schedule+0x63b/0x669
[ 272.029550] [<ffffffffa0236cee>] ? e1000_update_mng_vlan+0x68/0x68 [e1000e]
[ 272.029554] [<ffffffff8106eab0>] process_one_work+0x176/0x2a9
[ 272.029559] [<ffffffff8106f5be>] worker_thread+0xda/0x15d
[ 272.029562] [<ffffffff8106f4e4>] ? manage_workers+0x176/0x176
[ 272.029565] [<ffffffff81072a0b>] kthread+0x84/0x8c
[ 272.029568] [<ffffffff814a4934>] kernel_thread_helper+0x4/0x10
[ 272.029572] [<ffffffff81072987>] ? kthread_worker_fn+0x148/0x148
[ 272.029575] [<ffffffff814a4930>] ? gs_change+0x13/0x13
[ 272.029577] ---[ end trace a0f13f2533f67469 ]---
---------------------------

So, yeah, still there with the latest code.

Please try this patch

https://bugs.freedesktop.org/attachment.cgi?id=57170

(In reply to comment #9)
> Please try this patch
>
> https://bugs.freedesktop.org/attachment.cgi?id=57170

Same. See:

https://bugs.freedesktop.org/show_bug.cgi?id=40241#c14

Most likely this is caused by fbcon writes after devices suspend.

Fixed with

commit 3fa016a0b5c5237e9c387fc3249592b2cb5391c6
Author: Dave Airlie <airlied@redhat.com>
Date: Wed Mar 28 10:48:49 2012 +0100

    drm/i915: suspend fbdev device around suspend/hibernate

... which is included in 3.4-rc1. Please test that and reopen if you still experience issues.
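
A note for readers decoding the first oops in this report: the faulting address 0000000000000008, together with RDX: 0000000000000000, appears consistent with `__list_add` reading `next->prev` through a NULL `next` pointer, because `prev` is the second pointer in `struct list_head`. The sketch below is a user-space illustration only, not kernel code. The struct mirrors the kernel's two-pointer layout, and the helper names `next_prev_load_address` and `consistent_with_null_next` are hypothetical, introduced just for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same two-pointer layout as the kernel's struct list_head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/*
 * Hypothetical helper: given the raw pointer value of `next`, return the
 * address that a load of `next->prev` would touch.
 */
uintptr_t next_prev_load_address(uintptr_t next)
{
    /* Sanity check on the layout assumed above: `next` is the first field. */
    assert(offsetof(struct list_head, next) == 0);
    return next + offsetof(struct list_head, prev);
}

/*
 * Hypothetical helper: returns 1 when a reported fault address (CR2)
 * matches a `next->prev` load through a NULL `next`, 0 otherwise.
 */
int consistent_with_null_next(uintptr_t cr2)
{
    return cr2 == next_prev_load_address(0);
}
```

On an x86-64 build, where pointers are 8 bytes, `consistent_with_null_next(0x8)` returns 1, which matches the CR2 value in the trace. This only narrows down which pointer was NULL; it says nothing about what corrupted the list in the first place.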