Summary: | [HSW i915 MSI-7817] S4 resume on Haswell causes memory corruption (OOM, ext4_, ...) | ||
---|---|---|---|
Product: | DRI | Reporter: | Jens <jens-bugs.freedesktop.org> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED WORKSFORME | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | chris, intel-gfx-bugs, mauromol |
Version: | XOrg git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Description
Jens
2014-08-20 11:46:24 UTC
Created attachment 104975 [details]
3.17-rc1 crash "Watchdog detected hard LOCKUP on CPU #x"
Here is another (it seems) non-OOM related crash on the mentioned kernel during resume after hibernate which I was only able to catch using a digicam (thus sorry for the file format).
This was also tracked at: https://bugzilla.kernel.org/show_bug.cgi?id=59321

Waiting now for feedback from Jens.

In response to the comment posted at bugzilla.kernel.org:

> Jens, have you seen the problem since your last report (with or w/o the fixes)?

No.

> Could you still try if you can reproduce the problem with the latest -nightly
> kernel and the same tree with the fixes reverted (resetting to 598ae05fd937 -
> "drm/i915: Emit even number of dwords when emitting LRIs").

I have checked out the linux-drm-nightly tree as of yesterday (3.18rc1, 88a443f45) and have tried several suspend/resume on one machine during "make -j4". Except for one watchdog timeout (networking) I have not experienced any problems except for the fact that the resume seems to take a little longer(?) than with 3.17rc1.

Regarding 598ae05fd937, "git show" does not find it, but I found this:
http://cgit.freedesktop.org/drm-intel/commit/?id=22a916aaa187946e8df724ab7838a0c13b45a9f4
Is this the same commit? Do you want me to reverse patch it onto 598ae05fd937 and retest?

Regards

> Jens, have you seen the problem since your last report (with or w/o the fixes)?

No. Specifically, not since I compiled 3.17rc1 (see https://bugzilla.kernel.org/show_bug.cgi?id=59321). It was reproducible with all kernels I tested before.

> Jens, have you seen the problem since your last report (with or w/o the fixes)?
No. But that's because I had a solution that worked (the patch for 3.17rc1 from the BKO bug report) and thus I did not test any later kernels.
I'm happy to test any other kernels and/or patches though, just tell me.
/* Note to self: think before posting. Sorry for answering this three times. */
(In reply to Jens from comment #3)
> In response to the comment posted at bugzilla.kernel.org:
>
> > Jens, have you seen the problem since your last report (with or w/o the fixes)?
>
> No.
>
> > Could you still try if you can reproduce the problem with the latest -nightly
> > kernel and the same tree with the fixes reverted (resetting to 598ae05fd937 -
> > "drm/i915: Emit even number of dwords when emitting LRIs").
>
> I have checked out the linux-drm-nightly tree as of yesterday (3.18rc1,
> 88a443f45) and have tried several suspend/resume on one machine during "make
> -j4". Except for one watchdog timeout (networking) I have not experienced
> any problems except for the fact that the resume seems to take a little
> longer(?) than with 3.17rc1.

Ok. Not sure about the extra delay. Perhaps connected to the network timeout? But for me the important thing now is that with -nightly you don't see the original (more serious) problem.

> Regarding 598ae05fd937, "git show" does not find it, but I found this:
> http://cgit.freedesktop.org/drm-intel/commit/?id=22a916aaa187946e8df724ab7838a0c13b45a9f4
> Is this the same commit? Do you want me to reverse patch it onto
> 598ae05fd937 and retest?

Yes, that's the right commit, just before the suspend-fix patchset went in. Maybe I gave the wrong SHA1, or -nightly got rebased (it gets rebased regularly). It would help if you could git reset to that commit and see if you can reproduce the problem. I think I will close this bug in any case, but it would help to know if it got fixed by the suspend-fix patchset (that you revert with the git reset), or something else since 3.17rc1.

Thanks!

(In reply to Imre Deak from comment #6)
> (In reply to Jens from comment #3)
> > In response to the comment posted at bugzilla.kernel.org:
> >
> > > Jens, have you seen the problem since your last report (with or w/o the fixes)?
> >
> > No.
> >
> > > Could you still try if you can reproduce the problem with the latest -nightly
> > > kernel and the same tree with the fixes reverted (resetting to 598ae05fd937 -
> > > "drm/i915: Emit even number of dwords when emitting LRIs").
> >
> > I have checked out the linux-drm-nightly tree as of yesterday (3.18rc1,
> > 88a443f45) and have tried several suspend/resume on one machine during "make
> > -j4". Except for one watchdog timeout (networking) I have not experienced
> > any problems except for the fact that the resume seems to take a little
> > longer(?) than with 3.17rc1.
>
> Ok. Not sure about the extra delay. Perhaps connected to the network
> timeout? But for me the important thing now is that with -nightly you don't see
> the original (more serious) problem.

Btw, if you want to further debug this delay, you could boot with initcall_debug, which shows you how long the resume/suspend handler for each driver ran, and see if anything took unusually long. Or compare the times with those you get running on 3.17rc1.

Created attachment 108530 [details]
dmesg output of "preload: Corrupted page table at address 1dd4146"
Unfortunately I have just experienced another error upon resume. This prevented the machine from shutting down cleanly (although it was still working when I resumed). This is the first time this happened, though.
[32938.478751] video LNXVIDEO:00: Restoring backlight state
[32993.327894] preload: Corrupted page table at address 1dd4146
[32993.327919] PGD 36141067 PUD c66a8067 PMD 4341434143414341 BAD
[32993.327936] Bad pagetable: 000b [#1] SMP
... (see attachment)
I will try the above commit tonight and see if I can reproduce the above error.
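A minimal sketch, assuming the dmesg output captured after each resume is available as plain text, of how the signatures quoted in this report ("Corrupted page table", "Bad pagetable", "NETDEV WATCHDOG", kernel "WARNING:" lines) could be flagged automatically between suspend/resume cycles. The function name `scan_dmesg` and the pattern list are hypothetical and not part of any kernel tooling.

```python
import re

# Signatures taken from the log excerpts in this report. The list is an
# assumption about what is worth flagging, not an exhaustive catalogue.
SIGNATURES = {
    "corrupted_page_table": re.compile(r"Corrupted page table at address"),
    "bad_pagetable": re.compile(r"Bad pagetable:"),
    "netdev_watchdog": re.compile(r"NETDEV WATCHDOG:"),
    "kernel_warning": re.compile(r"WARNING: CPU: \d+ PID: \d+ at "),
}

# dmesg timestamp prefix such as "[32993.327894] " or "[   55.521833] ".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*")


def scan_dmesg(text):
    """Return (timestamp, signature_name, message) for each matching line."""
    hits = []
    for line in text.splitlines():
        m = TIMESTAMP.match(line)
        stamp = float(m.group(1)) if m else None
        message = line[m.end():] if m else line
        for name, pattern in SIGNATURES.items():
            if pattern.search(message):
                hits.append((stamp, name, message))
    return hits


if __name__ == "__main__":
    # Sample lines copied from the excerpt above.
    sample = (
        "[32938.478751] video LNXVIDEO:00: Restoring backlight state\n"
        "[32993.327894] preload: Corrupted page table at address 1dd4146\n"
        "[32993.327919] PGD 36141067 PUD c66a8067 PMD 4341434143414341 BAD\n"
        "[32993.327936] Bad pagetable: 000b [#1] SMP\n"
    )
    for stamp, name, message in scan_dmesg(sample):
        print(stamp, name, message)
```

In practice the text could come from a saved copy of the dmesg output taken after each resume, so that a failing cycle is noticed even when the machine appears to keep working.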
With 3.17rc5+ (the above commit) I cannot reproduce the same error as above, but after a couple of attempts the system simply refuses to suspend - the 'pm-hibernate' process freezes. Also, it is no longer possible to shut down cleanly.

There is nothing in the logs or in dmesg to hint at what might be happening (that I could recognize).

I will try again tomorrow to see if I can reproduce this.

(In reply to Jens from comment #8)
> Created attachment 108530 [details]
> dmesg output of "preload: Corrupted page table at address 1dd4146"
>
> Unfortunately I have just experienced another error upon resume. This
> prevented the machine from shutting down cleanly (although it was still
> working when I resumed). This is the first time this happened, though.
>
> [32938.478751] video LNXVIDEO:00: Restoring backlight state
> [32993.327894] preload: Corrupted page table at address 1dd4146
> [32993.327919] PGD 36141067 PUD c66a8067 PMD 4341434143414341 BAD
> [32993.327936] Bad pagetable: 000b [#1] SMP
> ... (see attachment)

Ok, this doesn't look good. Could you try if you can reproduce it with the same kernel booting with nomodeset?

(In reply to Jens from comment #9)
> With 3.17rc5+ (the above commit) I cannot reproduce the same error as
> above, but after a couple of attempts the system simply refuses to suspend -
> the 'pm-hibernate' process freezes. Also, it is no longer possible to shut
> down cleanly.
>
> There is nothing in the logs or in dmesg to hint at what might be happening
> (that I could recognize).
>
> I will try again tomorrow to see if I can reproduce this.

Actually, trying this commit is now less important, since even -nightly has the problem, so let's try to narrow things down in -nightly.

I have not yet managed to reproduce it with "nomodeset", but I also cannot start X with this parameter, no matter if cleanly or after a resume. The lightdm background starts up, but that is it.
Even more strange, when I switch to VT1 (Ctrl-Alt-F1) and back, the VT1 text with black background stays on the screen but I can reveal the X lightdm wallpaper bit by bit by "drawing" on the screen with the mouse cursor.

Restarting lightdm after a resume crashes the system (keyboard frozen). No errors in dmesg.

Correction: keyboard is frozen, but system can still be shut down by pressing the power button (i.e. kernel seems to be unharmed).

(In reply to Jens from comment #11)
> I have not yet managed to reproduce it with "nomodeset", but I also cannot
> start X with this parameter, no matter if cleanly or after a resume. The
> lightdm background starts up, but that is it.
>
> Even more strange, when I switch to VT1 (Ctrl-Alt-F1) and back, the VT1 text
> with black background stays on the screen but I can reveal the X lightdm
> wallpaper bit by bit by "drawing" on the screen with the mouse cursor.
>
> Restarting lightdm after a resume crashes the system (keyboard frozen). No
> errors in dmesg.

Ok, but I assume you tried to reproduce it by suspending while running something exercising the VM/filesystem like the make -j4 you did earlier?

(In reply to Jens from comment #12)
> Correction: keyboard is frozen, but system can still be shut down by
> pressing the power button (i.e. kernel seems to be unharmed).

Could you still get the dmesg somehow at this point, by ssh'ing in or using netconsole?

Summary:

1) I could not reproduce the above corruption error with "nomodeset" and only using the console (no X) after about 10 suspend/resume cycles during a parallel make -j4.

2) Upon restarting lightdm and logging in, this is printed to the console when using "nomodeset":
[+21.57s] WARNING: Error activating login1 session: GDBus.Error:org.freedesktop.DBus.Error.Failed: Operation not supported

I do not get this when not using this boot parameter.
Full boot command line: BOOT_IMAGE=/vmlinuz-3.18.0-rc1+ root=/dev/mapper/ubuntu--vg-root ro quiet splash no_console_suspend=1 netconsole=6666@192.168.178.59/eth0,6666@192.168.178.62/ crashkernel=384M-:128M vt.handoff=7

3) After console freezes, there is nothing strange in the logs (at least no backtrace, oops or anything obvious.)

(In reply to Jens from comment #14)
> Summary:
>
> 1)
> I could not reproduce the above corruption error with "nomodeset" and only
> using the console (no X) after about 10 suspend/resume cycles during a
> parallel make -j4.
>
> 2)
> Upon restarting lightdm and logging in, this is printed to the console when
> using "nomodeset":
> [+21.57s] WARNING: Error activating login1 session:
> GDBus.Error:org.freedesktop.DBus.Error.Failed: Operation not supported
>
> I do not get this when not using this boot parameter.
>
> Full boot command line: BOOT_IMAGE=/vmlinuz-3.18.0-rc1+
> root=/dev/mapper/ubuntu--vg-root ro quiet splash no_console_suspend=1
> netconsole=6666@192.168.178.59/eth0,6666@192.168.178.62/
> crashkernel=384M-:128M vt.handoff=7
>
> 3)
> After console freezes, there is nothing strange in the logs (at least no
> backtrace, oops or anything obvious.)

Ok. At the moment I don't have many ideas about what could cause this; I'll try to reproduce it on my Haswell here. One more thing: is i915 built-in or a module for you? We end up with a different resume sequence in the two cases, so trying to reproduce the bug both ways could be useful.

On 3.18.0rc1 (as of 2014.10.25, 88a443f454a4d): I just got another watchdog failure for eth0 (r8169) upon resume. This time, though, the ethernet managed to reconnect after some delay. After the previous two resumes the machine had to be rebooted because the network was dead.

[13729.752788] ------------[ cut here ]------------ [13729.752796] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x24f/0x260() [13729.752807] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out [13729.752807] Modules linked in: bnep(E) rfcomm(E) bluetooth(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_hda_controller(E) snd_hda_codec(E) snd_hwdep(E) snd_pcm(E) snd_seq_midi(E) snd_seq_midi_event(E) intel_rapl(E) snd_rawmidi(E) snd_seq(E) snd_seq_device(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_timer(E) shpchp(E) coretemp(E) lpc_ich(E) mei_me(E) snd(E) mei(E) kvm_intel(E) soundcore(E) kvm(E) serio_raw(E) tpm_infineon(E) 8250_fintek(E) intel_smartconnect(E) mac_hid(E) parport_pc(E) ppdev(E) lp(E) parport(E) dm_crypt(E) netconsole(E) configfs(E) hid_generic(E) usbhid(E) hid(E) mxm_wmi(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) i915(E) ahci(E) libahci(E) i2c_algo_bit(E) drm_kms_helper(E) r8169(E) mii(E) drm(E) wmi(E) video(E) [13729.752846] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G E 3.18.0-rc1+ #1 [13729.752847] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014 [13729.752847] 0000000000000009 ffff88021eb83d48 ffffffff8176091c 000000000000c56a [13729.752851] ffff88021eb83d98 ffff88021eb83d88 ffffffff8106e1c1 0000000000000005 [13729.752852] 0000000000000000 ffff8802130e4000 0000000000000001 ffff8800c67db280 [13729.752855] Call Trace: [13729.752856] <IRQ> [<ffffffff8176091c>] dump_stack+0x46/0x58 [13729.752864] [<ffffffff8106e1c1>] warn_slowpath_common+0x81/0xa0 [13729.752867] [<ffffffff8106e226>] warn_slowpath_fmt+0x46/0x50 [13729.752869] [<ffffffff8168246f>] dev_watchdog+0x24f/0x260 [13729.752872] [<ffffffff81682220>] ? dev_graft_qdisc+0x80/0x80 [13729.752874] [<ffffffff810d271a>] call_timer_fn+0x3a/0x110 [13729.752877] [<ffffffff81682220>] ? 
dev_graft_qdisc+0x80/0x80 [13729.752879] [<ffffffff810d3ebf>] run_timer_softirq+0x20f/0x310 [13729.752882] [<ffffffff81072295>] __do_softirq+0xf5/0x2d0 [13729.752884] [<ffffffff81072765>] irq_exit+0x115/0x120 [13729.752887] [<ffffffff8176bc3a>] smp_apic_timer_interrupt+0x4a/0x60 [13729.752890] [<ffffffff81769d1d>] apic_timer_interrupt+0x6d/0x80 [13729.752891] <EOI> [<ffffffff81608470>] ? cpuidle_enter_state+0x70/0x170 [13729.752899] [<ffffffff8160845d>] ? cpuidle_enter_state+0x5d/0x170 [13729.752902] [<ffffffff81608627>] cpuidle_enter+0x17/0x20 [13729.752904] [<ffffffff810adb65>] cpu_startup_entry+0x2f5/0x390 [13729.752908] [<ffffffff81046446>] start_secondary+0x156/0x180 [13729.752910] ---[ end trace 2316ebb6a7713aa7 ]--- This happens quite often, more often than the machine crashing or failing to resume properly now. I compiled 3.18.0rc6+ / linux-drm-nightly as of yesterday (a834a782adf3ab4b508cd80e9082960263bcc4ed) and did one pm-hibernate/resume cycle during "make -j4" in the kernel tree. Upon resume I get this: [ 40.501301] init: samba-ad-dc main process (1405) terminated with status 1 [ 55.521833] ------------[ cut here ]------------ [ 55.521853] WARNING: CPU: 3 PID: 1943 at drivers/gpu/drm/i915/i915_gem_execbuffer.c:125 eb_lookup_vmas.isra.15+0x363/0x400 [i915]() [ 55.521854] GPU use of dumb buffer is illegal. 
[ 55.521855] Modules linked in: bnep(E) rfcomm(E) bluetooth(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_hda_controller(E) snd_hda_codec(E) snd_hwdep(E) intel_rapl(E) snd_pcm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_seq_midi(E) snd_seq_midi_event(E) coretemp(E) snd_rawmidi(E) snd_seq(E) kvm_intel(E) snd_seq_device(E) kvm(E) snd_timer(E) snd(E) soundcore(E) mei_me(E) shpchp(E) mei(E) lpc_ich(E) serio_raw(E) tpm_infineon(E) intel_smartconnect(E) mac_hid(E) parport_pc(E) ppdev(E) lp(E) parport(E) dm_crypt(E) netconsole(E) configfs(E) hid_generic(E) usbhid(E) hid(E) mxm_wmi(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) i915(E) ahci(E) i2c_algo_bit(E) libahci(E) drm_kms_helper(E) r8169(E) mii(E) drm(E) wmi(E) video(E) [ 55.521873] CPU: 3 PID: 1943 Comm: Xorg Tainted: G E 3.18.0-rc6+ #7 [ 55.521874] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014 [ 55.521875] 0000000000000009 ffff8802108efb48 ffffffff81762cfc 0000000000000000 [ 55.521876] ffff8802108efb98 ffff8802108efb88 ffffffff8106f0b1 ffff8802108efc18 [ 55.521877] ffff8802108efc38 ffff880210e73780 0000000000000001 ffff880210e737b8 [ 55.521879] Call Trace: [ 55.521882] [<ffffffff81762cfc>] dump_stack+0x46/0x58 [ 55.521885] [<ffffffff8106f0b1>] warn_slowpath_common+0x81/0xa0 [ 55.521887] [<ffffffff8106f116>] warn_slowpath_fmt+0x46/0x50 [ 55.521896] [<ffffffffa00e56b3>] eb_lookup_vmas.isra.15+0x363/0x400 [i915] [ 55.521904] [<ffffffffa00e5c6d>] i915_gem_do_execbuffer.isra.22+0x51d/0xd90 [i915] [ 55.521906] [<ffffffff811bf12c>] ? kmem_cache_alloc_trace+0x3c/0x1f0 [ 55.521915] [<ffffffffa00eca05>] ? i915_gem_object_get_pages+0x45/0xc0 [i915] [ 55.521923] [<ffffffffa00e7601>] i915_gem_execbuffer2+0xb1/0x2c0 [i915] [ 55.521930] [<ffffffffa001aa54>] drm_ioctl+0x1a4/0x630 [drm] [ 55.521933] [<ffffffff81123f0c>] ? 
acct_account_cputime+0x1c/0x20 [ 55.521934] [<ffffffff811f0520>] do_vfs_ioctl+0x2e0/0x4c0 [ 55.521937] [<ffffffff8109e304>] ? vtime_account_user+0x54/0x60 [ 55.521938] [<ffffffff811f0781>] SyS_ioctl+0x81/0xa0 [ 55.521940] [<ffffffff8176b3b4>] ? int_check_syscall_exit_work+0x34/0x3d [ 55.521942] [<ffffffff8176b12d>] system_call_fastpath+0x16/0x1b [ 55.521943] ---[ end trace 853866804709104b ]--- [ 55.832915] init: plymouth-upstart-bridge main process ended, respawning [ 55.835816] init: plymouth-upstart-bridge main process (2918) terminated with status 1 [ 55.835831] init: plymouth-upstart-bridge main process ended, respawning [ 58.563397] audit: type=1400 audit(1416991047.231:77): apparmor="STATUS" operation="profile_replace" name="/usr/lib/cups/backend/cups-pdf" pid=2981 comm="apparmor_parser" [ 58.563401] audit: type=1400 audit(1416991047.231:78): apparmor="STATUS" operation="profile_replace" name="/usr/sbin/cupsd" pid=2981 comm="apparmor_parser" [ 58.563595] audit: type=1400 audit(1416991047.231:79): apparmor="STATUS" operation="profile_replace" name="/usr/sbin/cupsd" pid=2981 comm="apparmor_parser" [ 815.742431] init: anacron main process (1210) killed by TERM signal [ 819.770858] PM: Syncing filesystems ... done. [ 820.315110] Freezing user space processes ... (elapsed 0.001 seconds) done. However, no more crashes, freezes or Oopses. Also, after a few suspend/resume cycles (twice in 12) I still have the problem that the network does not come up again after a resume. When it does, I get [ 3846.934341] r8169 0000:02:00.0 eth0: link up in dmesg. When it doesn't, I get [ 6221.007206] show_signal_msg: 120 callbacks suppressed [ 6221.007209] Watchdog[2700]: segfault at 0 ip 00007ffe51c623e8 sp 00007ffe41dc7560 error 6 in libcontent.so[7ffe513e8000+11d8000] [ 6243.712345] Watchdog[29313]: segfault at 0 ip 00007f49e1a3d3e8 sp 00007f49d1ba2560 error 6 in libcontent.so[7f49e11c3000+11d8000] but I don't know if these are related. 
I also occasionally get this [ 6520.964686] Restarting tasks ... [ 6520.964841] pci_bus 0000:04: Allocating resources [ 6520.964855] pci 0000:03:00.0: PCI bridge to [bus 04] [ 6520.964859] pci 0000:03:00.0: bridge window [io 0x3000-0x3fff] [ 6520.964866] pci 0000:03:00.0: bridge window [mem 0xdf600000-0xdf7fffff] [ 6520.964870] pci 0000:03:00.0: bridge window [mem 0xdf800000-0xdf9fffff 64bit pref] [ 6520.968218] done. [ 6520.968224] video LNXVIDEO:00: Restoring backlight state [ 6528.107156] r8169 0000:02:00.0 eth0: link down [ 6528.107204] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready [ 6528.107448] r8169 0000:02:00.0 eth0: link down [ 6531.536977] r8169 0000:02:00.0 eth0: link up [ 6531.536983] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready [ 6543.357696] ------------[ cut here ]------------ [ 6543.357703] WARNING: CPU: 0 PID: 20681 at net/sched/sch_generic.c:303 dev_watchdog+0x24f/0x260() [ 6543.357704] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out [ 6543.357705] Modules linked in: bnep(E) rfcomm(E) bluetooth(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_hda_controller(E) snd_hda_codec(E) snd_hwdep(E) intel_rapl(E) snd_pcm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_seq_midi(E) snd_seq_midi_event(E) coretemp(E) snd_rawmidi(E) snd_seq(E) kvm_intel(E) snd_seq_device(E) kvm(E) snd_timer(E) snd(E) soundcore(E) mei_me(E) shpchp(E) mei(E) lpc_ich(E) serio_raw(E) tpm_infineon(E) intel_smartconnect(E) mac_hid(E) parport_pc(E) ppdev(E) lp(E) parport(E) dm_crypt(E) netconsole(E) configfs(E) hid_generic(E) usbhid(E) hid(E) mxm_wmi(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) i915(E) ahci(E) i2c_algo_bit(E) libahci(E) drm_kms_helper(E) r8169(E) mii(E) drm(E) wmi(E) video(E) [ 6543.357738] CPU: 0 PID: 20681 Comm: cc1 Tainted: G W E 3.18.0-rc6+ #7 [ 6543.357739] Hardware name: MSI 
MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014 [ 6543.357740] 0000000000000009 ffff88021ea03d48 ffffffff81762cfc 0000000000000000 [ 6543.357741] ffff88021ea03d98 ffff88021ea03d88 ffffffff8106f0b1 ffff88021ea03d70 [ 6543.357743] 0000000000000000 ffff88020fb08000 0000000000000001 ffff8800c65e1e80 [ 6543.357744] Call Trace: [ 6543.357745] <IRQ> [<ffffffff81762cfc>] dump_stack+0x46/0x58 [ 6543.357751] [<ffffffff8106f0b1>] warn_slowpath_common+0x81/0xa0 [ 6543.357753] [<ffffffff8106f116>] warn_slowpath_fmt+0x46/0x50 [ 6543.357755] [<ffffffff8168469f>] dev_watchdog+0x24f/0x260 [ 6543.357756] [<ffffffff81684450>] ? dev_graft_qdisc+0x80/0x80 [ 6543.357759] [<ffffffff810d39fa>] call_timer_fn+0x3a/0x110 [ 6543.357760] [<ffffffff81684450>] ? dev_graft_qdisc+0x80/0x80 [ 6543.357762] [<ffffffff810d519f>] run_timer_softirq+0x20f/0x310 [ 6543.357763] [<ffffffff810731b5>] __do_softirq+0xf5/0x2d0 [ 6543.357764] [<ffffffff81073685>] irq_exit+0x115/0x120 [ 6543.357766] [<ffffffff8176dfaa>] smp_apic_timer_interrupt+0x4a/0x60 [ 6543.357769] [<ffffffff8176c07d>] apic_timer_interrupt+0x6d/0x80 [ 6543.357769] <EOI> [ 6543.357770] ---[ end trace 853866804709104c ]--- [ 6543.375603] r8169 0000:02:00.0 eth0: link up after which the network works again. Is the network issue being worked on actively? If so, I can try on a second machine and report back. (In reply to Jens from comment #17) > I compiled 3.18.0rc6+ / linux-drm-nightly as of yesterday > (a834a782adf3ab4b508cd80e9082960263bcc4ed) and did one pm-hibernate/resume > cycle during "make -j4" in the kernel tree. Upon resume I get this: > > [ 40.501301] init: samba-ad-dc main process (1405) terminated with status 1 > [ 55.521833] ------------[ cut here ]------------ > [ 55.521853] WARNING: CPU: 3 PID: 1943 at > drivers/gpu/drm/i915/i915_gem_execbuffer.c:125 > eb_lookup_vmas.isra.15+0x363/0x400 [i915]() > [ 55.521854] GPU use of dumb buffer is illegal. 
> [ 55.521855] Modules linked in: bnep(E) rfcomm(E) bluetooth(E) > snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) > snd_hda_intel(E) snd_hda_controller(E) snd_hda_codec(E) snd_hwdep(E) > intel_rapl(E) snd_pcm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) > snd_seq_midi(E) snd_seq_midi_event(E) coretemp(E) snd_rawmidi(E) snd_seq(E) > kvm_intel(E) snd_seq_device(E) kvm(E) snd_timer(E) snd(E) soundcore(E) > mei_me(E) shpchp(E) mei(E) lpc_ich(E) serio_raw(E) tpm_infineon(E) > intel_smartconnect(E) mac_hid(E) parport_pc(E) ppdev(E) lp(E) parport(E) > dm_crypt(E) netconsole(E) configfs(E) hid_generic(E) usbhid(E) hid(E) > mxm_wmi(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) > aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) > ablk_helper(E) cryptd(E) i915(E) ahci(E) i2c_algo_bit(E) libahci(E) > drm_kms_helper(E) r8169(E) mii(E) drm(E) wmi(E) video(E) > [ 55.521873] CPU: 3 PID: 1943 Comm: Xorg Tainted: G E > 3.18.0-rc6+ #7 > [ 55.521874] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 > 05/30/2014 > [ 55.521875] 0000000000000009 ffff8802108efb48 ffffffff81762cfc > 0000000000000000 > [ 55.521876] ffff8802108efb98 ffff8802108efb88 ffffffff8106f0b1 > ffff8802108efc18 > [ 55.521877] ffff8802108efc38 ffff880210e73780 0000000000000001 > ffff880210e737b8 > [ 55.521879] Call Trace: > [ 55.521882] [<ffffffff81762cfc>] dump_stack+0x46/0x58 > [ 55.521885] [<ffffffff8106f0b1>] warn_slowpath_common+0x81/0xa0 > [ 55.521887] [<ffffffff8106f116>] warn_slowpath_fmt+0x46/0x50 > [ 55.521896] [<ffffffffa00e56b3>] eb_lookup_vmas.isra.15+0x363/0x400 > [i915] > [ 55.521904] [<ffffffffa00e5c6d>] > i915_gem_do_execbuffer.isra.22+0x51d/0xd90 [i915] > [ 55.521906] [<ffffffff811bf12c>] ? kmem_cache_alloc_trace+0x3c/0x1f0 > [ 55.521915] [<ffffffffa00eca05>] ? 
i915_gem_object_get_pages+0x45/0xc0 > [i915] > [ 55.521923] [<ffffffffa00e7601>] i915_gem_execbuffer2+0xb1/0x2c0 [i915] > [ 55.521930] [<ffffffffa001aa54>] drm_ioctl+0x1a4/0x630 [drm] > [ 55.521933] [<ffffffff81123f0c>] ? acct_account_cputime+0x1c/0x20 > [ 55.521934] [<ffffffff811f0520>] do_vfs_ioctl+0x2e0/0x4c0 > [ 55.521937] [<ffffffff8109e304>] ? vtime_account_user+0x54/0x60 > [ 55.521938] [<ffffffff811f0781>] SyS_ioctl+0x81/0xa0 > [ 55.521940] [<ffffffff8176b3b4>] ? int_check_syscall_exit_work+0x34/0x3d > [ 55.521942] [<ffffffff8176b12d>] system_call_fastpath+0x16/0x1b > [ 55.521943] ---[ end trace 853866804709104b ]--- > [ 55.832915] init: plymouth-upstart-bridge main process ended, respawning > [ 55.835816] init: plymouth-upstart-bridge main process (2918) terminated > with status 1 > [ 55.835831] init: plymouth-upstart-bridge main process ended, respawning > [ 58.563397] audit: type=1400 audit(1416991047.231:77): apparmor="STATUS" > operation="profile_replace" name="/usr/lib/cups/backend/cups-pdf" pid=2981 > comm="apparmor_parser" > [ 58.563401] audit: type=1400 audit(1416991047.231:78): apparmor="STATUS" > operation="profile_replace" name="/usr/sbin/cupsd" pid=2981 > comm="apparmor_parser" > [ 58.563595] audit: type=1400 audit(1416991047.231:79): apparmor="STATUS" > operation="profile_replace" name="/usr/sbin/cupsd" pid=2981 > comm="apparmor_parser" > [ 815.742431] init: anacron main process (1210) killed by TERM signal > [ 819.770858] PM: Syncing filesystems ... done. > [ 820.315110] Freezing user space processes ... (elapsed 0.001 seconds) > done. This looks like a problem in X, trying to use an invalid GEM buffer for rendering. Does it really happen only after S4 resume, or also during normal booting? CC'ing Chris. > However, no more crashes, freezes or Oopses. > > Also, after a few suspend/resume cycles (twice in 12) I still have the > problem that the network does not come up again after a resume. 
When it > does, I get > > [ 3846.934341] r8169 0000:02:00.0 eth0: link up > > in dmesg. When it doesn't, I get > > [ 6221.007206] show_signal_msg: 120 callbacks suppressed > [ 6221.007209] Watchdog[2700]: segfault at 0 ip 00007ffe51c623e8 sp > 00007ffe41dc7560 error 6 in libcontent.so[7ffe513e8000+11d8000] > [ 6243.712345] Watchdog[29313]: segfault at 0 ip 00007f49e1a3d3e8 sp > 00007f49d1ba2560 error 6 in libcontent.so[7f49e11c3000+11d8000] > > but I don't know if these are related. I also occasionally get this > > [ 6520.964686] Restarting tasks ... > [ 6520.964841] pci_bus 0000:04: Allocating resources > [ 6520.964855] pci 0000:03:00.0: PCI bridge to [bus 04] > [ 6520.964859] pci 0000:03:00.0: bridge window [io 0x3000-0x3fff] > [ 6520.964866] pci 0000:03:00.0: bridge window [mem 0xdf600000-0xdf7fffff] > [ 6520.964870] pci 0000:03:00.0: bridge window [mem 0xdf800000-0xdf9fffff > 64bit pref] > [ 6520.968218] done. > [ 6520.968224] video LNXVIDEO:00: Restoring backlight state > [ 6528.107156] r8169 0000:02:00.0 eth0: link down > [ 6528.107204] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready > [ 6528.107448] r8169 0000:02:00.0 eth0: link down > [ 6531.536977] r8169 0000:02:00.0 eth0: link up > [ 6531.536983] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready > [ 6543.357696] ------------[ cut here ]------------ > [ 6543.357703] WARNING: CPU: 0 PID: 20681 at net/sched/sch_generic.c:303 > dev_watchdog+0x24f/0x260() > [ 6543.357704] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out > [ 6543.357705] Modules linked in: bnep(E) rfcomm(E) bluetooth(E) > snd_hda_codec_realtek(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) > snd_hda_intel(E) snd_hda_controller(E) snd_hda_codec(E) snd_hwdep(E) > intel_rapl(E) snd_pcm(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) > snd_seq_midi(E) snd_seq_midi_event(E) coretemp(E) snd_rawmidi(E) snd_seq(E) > kvm_intel(E) snd_seq_device(E) kvm(E) snd_timer(E) snd(E) soundcore(E) > mei_me(E) shpchp(E) mei(E) lpc_ich(E) 
serio_raw(E) tpm_infineon(E) > intel_smartconnect(E) mac_hid(E) parport_pc(E) ppdev(E) lp(E) parport(E) > dm_crypt(E) netconsole(E) configfs(E) hid_generic(E) usbhid(E) hid(E) > mxm_wmi(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) > aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) > ablk_helper(E) cryptd(E) i915(E) ahci(E) i2c_algo_bit(E) libahci(E) > drm_kms_helper(E) r8169(E) mii(E) drm(E) wmi(E) video(E) > [ 6543.357738] CPU: 0 PID: 20681 Comm: cc1 Tainted: G W E > 3.18.0-rc6+ #7 > [ 6543.357739] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 > 05/30/2014 > [ 6543.357740] 0000000000000009 ffff88021ea03d48 ffffffff81762cfc > 0000000000000000 > [ 6543.357741] ffff88021ea03d98 ffff88021ea03d88 ffffffff8106f0b1 > ffff88021ea03d70 > [ 6543.357743] 0000000000000000 ffff88020fb08000 0000000000000001 > ffff8800c65e1e80 > [ 6543.357744] Call Trace: > [ 6543.357745] <IRQ> [<ffffffff81762cfc>] dump_stack+0x46/0x58 > [ 6543.357751] [<ffffffff8106f0b1>] warn_slowpath_common+0x81/0xa0 > [ 6543.357753] [<ffffffff8106f116>] warn_slowpath_fmt+0x46/0x50 > [ 6543.357755] [<ffffffff8168469f>] dev_watchdog+0x24f/0x260 > [ 6543.357756] [<ffffffff81684450>] ? dev_graft_qdisc+0x80/0x80 > [ 6543.357759] [<ffffffff810d39fa>] call_timer_fn+0x3a/0x110 > [ 6543.357760] [<ffffffff81684450>] ? dev_graft_qdisc+0x80/0x80 > [ 6543.357762] [<ffffffff810d519f>] run_timer_softirq+0x20f/0x310 > [ 6543.357763] [<ffffffff810731b5>] __do_softirq+0xf5/0x2d0 > [ 6543.357764] [<ffffffff81073685>] irq_exit+0x115/0x120 > [ 6543.357766] [<ffffffff8176dfaa>] smp_apic_timer_interrupt+0x4a/0x60 > [ 6543.357769] [<ffffffff8176c07d>] apic_timer_interrupt+0x6d/0x80 > [ 6543.357769] <EOI> > [ 6543.357770] ---[ end trace 853866804709104c ]--- > [ 6543.375603] r8169 0000:02:00.0 eth0: link up > > after which the network works again. > > Is the network issue being worked on actively? If so, I can try on a second > machine and report back. 
I'm not sure, but this is a network driver problem, so could you let the maintainers of it know about this? IIRC you opened a bug about this already.

I also have this problem with thawing after hibernation (using Linux Mint 17, based on Ubuntu 14.04, hence kernel version is 3.13.0-39-generic #66-Ubuntu SMP). I read the whole original bug report on kernel.org and this one, but I'm not sure I understand the state of this bug. Is it supposed to be fixed in some newer kernel versions? If not, could the network problem be somehow related to the memory corruption problem mentioned here? In my case I see random processes failing after thawing, before the system completely crashes or freezes, so the network layer may also be impacted. Or is the original memory corruption problem fixed? If so, why is this "reopened"?

@Mauro, the thawing issues were resolved for me after 3.17rc1 only. All kernels before that were unstable, especially when hibernating under load. Check which kernel Linux Mint 17 is based on and upgrade to a newer version if necessary.

@Imre: the "i915_gem" bug only occurred on the first resume after the first hibernation, and it was the first time I saw it. However, it was also the first time I booted this kernel. I will keep my eyes open. Will update the bko report at https://bugzilla.kernel.org/show_bug.cgi?id=84681.

Another issue occurred (something that has been occurring since 3.17rc1+): Occasionally the system will simply refuse to hibernate when calling "sudo pm-hibernate". No error, no syslog or dmesg output, nothing. When hibernating via the desktop, the screensaver will enable but the system won't hibernate. This happened about once in ten hibernation attempts so far. Looking at the processes I see stuck processes:

16859 ? S 0:00 sh -c /usr/sbin/pm-hibernate
16860 ? S 0:00 /bin/sh /usr/sbin/pm-hibernate

Waiting for ten minutes doesn't change anything.
But yesterday, the moment I called 'sudo strace -p 16860' to look at what was happening, the hibernation process woke up and continued 8-) I'll try and reproduce this. Here's the strace output (just the beginning):

jens@linuxkiste:~$ sudo strace -p 16860
Process 16860 attached
dup2(11, 1) = 1
close(11) = 0
open("/sys/power/disk", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
fcntl(1, F_DUPFD, 10) = 11
close(1) = 0
fcntl(11, F_SETFD, FD_CLOEXEC) = 0
dup2(4, 1) = 1
close(4) = 0
dup2(11, 1) = 1
close(11) = 0
pipe([4, 5]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa4f1076a10) = 24552
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24552, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn() = 24552
close(5) = 0
read(4, "Fri Nov 28 08:53:51 CET 2014\n", 128) = 29
read(4, "", 128) = 0
close(4) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24552
write(1, "Fri Nov 28 08:53:51 CET 2014: Aw"..., 37) = 37
pipe([4, 5]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa4f1076a10) = 24554
close(5) = 0
read(4, "Fri Nov 28 08:53:51 CET 2014\n", 128) = 29
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24554, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn() = 29
read(4, "", 128) = 0
close(4) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24554
write(1, "Fri Nov 28 08:53:51 CET 2014: Ru"..., 53) = 53
open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
fcntl(1, F_DUPFD, 10) = 11
close(1) = 0
fcntl(11, F_SETFD, FD_CLOEXEC) = 0
dup2(4, 1) = 1
close(4) = 0
fcntl(2, F_DUPFD, 10) = 12
close(2) = 0
fcntl(12, F_SETFD, FD_CLOEXEC) = 0
dup2(1, 2) = 2
write(1, "before_hooks is a shell function"..., 33) = 33
dup2(11, 1) = 1
close(11) = 0
dup2(12, 2) = 2
close(12) = 0
pipe([4, 5]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa4f1076a10) = 24555
close(5) = 0
read(4, "novatel_3g_suspend\n99video\n99_ch"..., 128) = 128
read(4, "60_wpa_supplicant\n50unload_alx\n1"..., 128) = 128
read(4, "change\n", 128) = 7
read(4, "", 128) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=24555, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn() = 0
close(4) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 24555
stat("/var/run/pm-utils/pm-suspend/storage/parameters.new", 0x7fff52a38ad0) = -1 ENOENT (No such file or directory)
stat("/etc/pm/sleep.d/novatel_3g_suspend", {st_mode=S_IFREG|0755, st_size=1260, ...}) = 0
write(1, "Running hook /etc/pm/sleep.d/nov"..., 64) = 64
stat("/var/run/pm-utils/pm-suspend/storage/disable_hook:novatel_3g_suspend", 0x7fff52a38530) = -1 ENOENT (No such file or directory)
stat("/var/run/pm-utils/pm-suspend/storage/disable_hook:novatel_3g_suspend", 0x7fff52a38560) = -1 ENOENT (No such file or directory)
geteuid() = 0
stat("/etc/pm/sleep.d/novatel_3g_suspend", {st_mode=S_IFREG|0755, st_size=1260, ...}) = 0
faccessat(AT_FDCWD, "/etc/pm/sleep.d/novatel_3g_suspend", X_OK) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fa4f1076a10) = 24559
...
...

Does this make any sense to you at all? Thank you!

I just experienced another crash after about 20 suspend/resume cycles. I don't know if the crash is related to this bug, so I submitted it at https://bugzilla.kernel.org/show_bug.cgi?id=89321. Please have a look. Thanks!

Cannot reproduce any more with 3.19.0-rc2+ (drm-intel-nightly as of Jan 3, 2015). Closing resolved+worksforme.

Verified by Reporter.
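For anyone who runs into the same stuck pm-hibernate symptom described above: attaching strace seemed to wake the process up, so a passive look at /proc may be a less intrusive first step. The following is only a minimal sketch, not something used in this thread. It assumes a Linux /proc filesystem, and the helper names parse_stat_state and describe are made up for illustration.

```python
import sys
from pathlib import Path


def parse_stat_state(stat_line: str) -> tuple:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the first '(' and the LAST ')'.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    # The process state (R, S, D, Z, ...) is the first field after the name.
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def describe(pid: int) -> str:
    """Summarise state and kernel wait channel of one process (hypothetical helper)."""
    proc = Path("/proc") / str(pid)
    _, comm, state = parse_stat_state((proc / "stat").read_text())
    try:
        # wchan names the kernel function the task sleeps in, when available.
        wchan = (proc / "wchan").read_text().strip() or "?"
    except OSError:
        wchan = "?"
    return f"{pid} {comm} state={state} wchan={wchan}"


if __name__ == "__main__":
    # Usage: python3 procstate.py 16859 16860
    for arg in sys.argv[1:]:
        print(describe(int(arg)))
```

Run it with the PIDs of the stuck processes as arguments. This only reports what the kernel exposes; it does not explain why the hibernation script stalls.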