Bug 40241

Summary: [ILK] Suspend to disk: Random (frequent) reboots at resume
Product: DRI
Reporter: Nicolas FRANÇOIS <nicolas.mb.francois>
Component: DRM/Intel
Assignee: Jesse Barnes <jbarnes>
Status: CLOSED FIXED
QA Contact:
Severity: major
Priority: medium
CC: ben, bojan, chris, eugeni, jbarnes, jrnieder
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments (description, flags):
kernel log (flags: none)
Stacktrace while hibernating (not at resume) (flags: none)
Use suspend/resume routines instead of hibernate/thaw (flags: none)
freeze workqueue on suspend (flags: none)

Description Nicolas FRANÇOIS 2011-08-19 10:44:29 UTC
Hi,

I'm currently on Debian wheezy (kernel 3.0.0, xserver-xorg-video-intel 2:2.15.0-3, xserver-xorg 1:7.6+8) and suspend to disk causes frequent reboots at resume. DRI appears to be implicated, because s2disk works perfectly from runlevel 1.

When resume does succeed, I often get a lot of relocation errors and segfaults afterwards.

Please see the attachments, and let me know if you need more details.

Cheers,
Nicolas FRANÇOIS

lspci:
00:00.0 Host bridge: Intel Corporation Core Processor DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02)
00:16.0 Communication controller: Intel Corporation 5 Series/3400 Series Chipset HECI Controller (rev 06)
00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
00:1b.0 Audio device: Intel Corporation 5 Series/3400 Series Chipset High Definition Audio (rev 05)
00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 (rev 05)
00:1c.1 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 2 (rev 05)
00:1c.2 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 3 (rev 05)
00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation Mobile 5 Series Chipset LPC Interface Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 5 Series/3400 Series Chipset 4 port SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 5 Series/3400 Series Chipset SMBus Controller (rev 05)
00:1f.6 Signal processing controller: Intel Corporation 5 Series/3400 Series Chipset Thermal Subsystem (rev 05)
04:00.0 System peripheral: JMicron Technology Corp. SD/MMC Host Controller (rev 80)
04:00.2 SD Host controller: JMicron Technology Corp. Standard SD Host Controller (rev 80)
04:00.3 System peripheral: JMicron Technology Corp. MS Host Controller (rev 80)
04:00.5 Ethernet controller: JMicron Technology Corp. JMC250 PCI Express Gigabit Ethernet Controller (rev 03)
05:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8191SEvB Wireless LAN Controller (rev 10)
ff:00.0 Host bridge: Intel Corporation Core Processor QuickPath Architecture Generic Non-core Registers (rev 02)
ff:00.1 Host bridge: Intel Corporation Core Processor QuickPath Architecture System Address Decoder (rev 02)
ff:02.0 Host bridge: Intel Corporation Core Processor QPI Link 0 (rev 02)
ff:02.1 Host bridge: Intel Corporation Core Processor QPI Physical 0 (rev 02)
ff:02.2 Host bridge: Intel Corporation Core Processor Reserved (rev 02)
ff:02.3 Host bridge: Intel Corporation Core Processor Reserved (rev 02)
Comment 1 Nicolas FRANÇOIS 2011-08-19 10:46:38 UTC
Created attachment 50382 [details]
kernel log
Comment 2 Gordon Jin 2011-08-20 18:30:01 UTC
looks like bug#36071
Comment 3 Nicolas FRANÇOIS 2011-08-21 02:15:33 UTC
Hi Gordon,

Yes, this looks similar.
At resume, everything happens as expected until memory pages are loaded up to 100%, then the screen flickers, and finally the kernel reboots the machine.

It never returns to userspace. I also tried netconsole, but the kernel complains that my network card doesn't support polling, so no luck...

Cheers

(In reply to comment #2)
> looks like bug#36071
Comment 4 Nicolas FRANÇOIS 2011-08-21 07:50:52 UTC
Created attachment 50426 [details]
Stacktrace while hibernating (not at resume)

(Sorry for the ugly JPEG; I caught this by chance.)

This is weird: it looks like it is already thawing.
I did a hard reset after this, and it rebooted after loading pages.
Comment 5 Jesse Barnes 2011-08-22 10:36:22 UTC
Looks like the monitor thread runs after we remove the IPS driver and references something it shouldn't...  Can you gdb your i915.ko and do a "list *i915_chipset_val+0xbc" and also gdb your intel_ips.ko and do a "list *ips_monitor+0x341"?
Comment 6 Nicolas FRANÇOIS 2011-08-23 02:15:34 UTC
Hi,

Here it is:
(gdb) list *i915_chipset_val+0xbc
0x23c8 is in i915_chipset_val (/tmp/buildd/linux-2.6-3.0.0/debian/build/source_amd64_none/include/linux/math64.h:18).
13      in /tmp/buildd/linux-2.6-3.0.0/debian/build/source_amd64_none/include/linux/math64.h

(gdb) list *ips_monitor+0x341
0xd6c is in ips_monitor (/tmp/buildd/linux-2.6-3.0.0/debian/build/source_amd64_none/drivers/platform/x86/intel_ips.c:943).
938     in /tmp/buildd/linux-2.6-3.0.0/debian/build/source_amd64_none/drivers/platform/x86/intel_ips.c

Cheers,
NicolaF



(In reply to comment #5)
> Looks like the monitor thread runs after we remove the IPS driver and
> references something it shouldn't...  Can you gdb your i915.ko and do a "list
> *i915_chipset_val+0xbc" and also gdb your intel_ips.ko and do a "list
> *ips_monitor+0x341"?
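
For readers following the thread, the gdb output in this comment can be cross-checked with a little arithmetic: subtracting the offset in each `symbol+0xNN` expression from the address gdb resolved gives the address at which the function starts. The sketch below is only an illustration in Python; the helper names `parse_symbol_offset` and `function_start` are invented here, and it assumes that the addresses gdb prints for an unloaded .ko are relative to the start of the module's code section rather than runtime addresses.

```python
def parse_symbol_offset(expr: str) -> tuple[str, int]:
    """Split a gdb-style '*symbol+0xNN' expression into (symbol, offset)."""
    name, sep, offset = expr.lstrip("*").partition("+")
    return name, (int(offset, 16) if sep else 0)


def function_start(resolved_addr: int, offset: int) -> int:
    """Address of the function entry, given the resolved address and offset."""
    return resolved_addr - offset


# Expressions and resolved addresses copied from the gdb session above.
sessions = [
    ("*i915_chipset_val+0xbc", 0x23C8),
    ("*ips_monitor+0x341", 0xD6C),
]

for expr, resolved in sessions:
    name, offset = parse_symbol_offset(expr)
    start = function_start(resolved, offset)
    print(f"{name}: entry at {start:#x}, faulting instruction {offset:#x} bytes in")
```

The same subtraction is handy when comparing a stack trace against `objdump -d` output for the module, since the trace reports offsets from the function entry.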
Comment 7 Nicolas FRANÇOIS 2011-08-29 02:08:19 UTC
Created attachment 50648 [details] [review]
Use suspend/resume routines instead of hibernate/thaw

Hi,

After further investigation, I found this bug, reported on the kernel side:
https://bugzilla.kernel.org/show_bug.cgi?id=37142

The symptoms are quite similar (memory corruption, which I think may lead to the reboot problems I experience), and the proposed patch (thanks to Rafael J. Wysocki), which I re-attach here, works perfectly for me. It is a bit dirty (there must be good reasons to do different things when suspending and hibernating), but I have seen no reboots or memory corruption in tens of hibernate/thaw cycles.

However, the thread synchronization problem is still there: I got that null pointer dereference stacktrace once again.

Cheers,
NicolaF
Comment 8 Bojan Smojver 2011-09-19 23:18:57 UTC
Any new developments here?

Although you cannot see this (because bugzilla.kernel.org is down), that patch from Rafael (essentially a replacement of freeze/thaw with suspend/resume) did not work for everyone in kernel bug #37142.

I tried kernels as recent as 3.1.0-rc6 without any luck. There is still memory corruption after several hibernate/thaw cycles.
Comment 9 Bojan Smojver 2011-09-22 23:46:53 UTC
(In reply to comment #8)
 
> I tried as recent as 3.1.0-rc6 without any luck. Still memory corruption after
> several hibernate/thaw cycles.

Also, rc7.
Comment 10 Eugeni Dodonov 2011-10-10 05:24:39 UTC
(Updating)
After investigation by Bojan Smojver on the intel-gfx mailing list, the problem seems to happen only when modeset is enabled. When booting with 'nomodeset', the issue does not happen [1].

Could someone affected by this issue confirm this please?

[1] http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/6173
Comment 11 Nicolas FRANÇOIS 2011-10-10 06:16:00 UTC
(In reply to comment #10)
> (Updating)
> After investigation by Bojan Smojver on intel-gfx mailing list, the problem
> seems to only happen when modeset is enabled. When booting with 'nomodeset',
> the issue does not happen [1].
> 
> Could someone affected by this issue confirm this please?
> 
> [1] http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/6173

Hi,
It seems to work: I just performed about 10 hibernate/thaw cycles (with some suspend to disk, for fun), and there have been no problems so far.

Cheers,
NicolaF
Comment 12 arne_woerner 2012-02-03 09:50:29 UTC
But my X won't start when I use the "nomodeset" kernel boot option...
[    22.637] (EE) open /dev/fb0: No such device
...
[    22.839] (II) VESA(0): VBESetVBEMode failed

How can I have X-ability _and_ suspend-ability?

-arne
Comment 13 Eugeni Dodonov 2012-02-16 10:01:48 UTC
Created attachment 57169 [details] [review]
freeze workqueue on suspend

Could those affected by this issue please try this patch?
Comment 14 Bojan Smojver 2012-02-21 13:33:23 UTC
(In reply to comment #13)
> Created attachment 57169 [details] [review]
> freeze workqueue on suspend
> 
> For the ones affected by this issue, could you please try with this patch?

The patch did not help on my ThinkPad T510. I got segfaults, just like before, after 20-odd hibernate/thaw cycles. They looked like this:

[  723.970911] pm-hibernate[8884]: segfault at 0 ip 0000000000477900 sp 00007fff674d1730 error 6 in bash[400000+da000]
[  727.545054] pm-hibernate[8894]: segfault at 0 ip 0000000000477900 sp 00007fff8298a860 error 6 in bash[400000+da000]
[  731.099119] pm-hibernate[8905]: segfault at 0 ip 0000000000477900 sp 00007fff76919de0 error 6 in bash[400000+da000]
[  734.669372] pm-hibernate[8916]: segfault at 0 ip 0000000000477900 sp 00007fff4f5b3700 error 6 in bash[400000+da000]
[  738.248239] pm-hibernate[8927]: segfault at 0 ip 0000000000477900 sp 00007fff3f7c4d70 error 6 in bash[400000+da000]
[  741.816694] pm-hibernate[8950]: segfault at 0 ip 0000000000477900 sp 00007fff28b757d0 error 6 in bash[400000+da000]
[  745.311532] pm-hibernate[8961]: segfault at 0 ip 0000000000477900 sp 00007fff319232a0 error 6 in bash[400000+da000]
[  748.936928] pm-hibernate[8972]: segfault at 0 ip 0000000000477900 sp 00007fff8d0da390 error 6 in bash[400000+da000]
[  752.562089] pm-hibernate[8982]: segfault at 0 ip 0000000000477900 sp 00007fff18368f50 error 6 in bash[400000+da000]
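
The segfault lines above carry more information than they first appear to. As a rough aid for reading them, the Python sketch below parses one line and decodes the `error` field using the standard x86 page-fault error code bits (bit 0: protection violation rather than page not present, bit 1: write, bit 2: user mode). The function names `decode_error` and `parse_segfault` are made up for this illustration, and the regular expression is only written against the line format shown in this comment.

```python
import re

# Pattern written against the log lines quoted above; other kernels may differ.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_error(code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def parse_segfault(line: str):
    """Return the decoded fields of one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # Offset of the faulting instruction inside the mapped object.
        "offset_in_object": ip - base,
        "error": decode_error(int(m["err"], 16)),
    }


line = ("[  723.970911] pm-hibernate[8884]: segfault at 0 ip 0000000000477900 "
        "sp 00007fff674d1730 error 6 in bash[400000+da000]")
print(parse_segfault(line))
```

Read this way, every line in the log describes the same event: a user-mode write to address 0 (a page that is not present) from the same instruction in bash, which fits a NULL pointer write that repeats on each attempt.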
Comment 15 Eugeni Dodonov 2012-03-30 09:53:52 UTC
Could you please try Dave's patch from https://lkml.org/lkml/2012/3/29/72 (the patch itself is http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=3fa016a0b5c5237e9c387fc3249592b2cb5391c6)? I am fairly sure it could solve this.
Comment 16 Chris Wilson 2012-03-30 10:08:29 UTC
We believe we finally have the root cause of so many crashes following hibernation. Please update and test, thanks.

commit 3fa016a0b5c5237e9c387fc3249592b2cb5391c6
Author: Dave Airlie <airlied@redhat.com>
Date:   Wed Mar 28 10:48:49 2012 +0100

    drm/i915: suspend fbdev device around suspend/hibernate
    
    Looking at hibernate overwriting I though it looked like a cursor,
    so I tracked down this missing piece to stop the cursor blink
    timer. I've no idea if this is sufficient to fix the hibernate
    problems people are seeing, but please test it.
    
    Both radeon and nouveau have done this for a long time.
    
    I've run this personally all night hib/resume cycles with no fails.
    
    Reviewed-by: Keith Packard <keithp@keithp.com>
    Reported-by: Petr Tesarik <kernel@tesarici.cz>
    Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
    Reported-by: Lots of misc segfaults after hibernate across the world.
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=37142
    Tested-by: Dave Airlie <airlied@redhat.com>
    Tested-by: Bojan Smojver <bojan@rexursive.com>
    Tested-by: Andreas Hartmann <andihartmann@01019freenet.de>
    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Airlie <airlied@redhat.com>
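
The idea in the commit message above (stop the cursor blink timer from touching memory while the hibernation image is being handled) can be illustrated with a toy model. The Python sketch below is not the kernel patch and does not use any kernel API: `ToyFbdev`, `blink_tick` and `writes_after_snapshot` are invented names, and the "snapshot" is just a copy of a small byte array. It only shows why a periodic writer that is not suspended can leave memory different from the saved image.

```python
class ToyFbdev:
    """Toy stand-in for a framebuffer console with a blinking cursor."""

    def __init__(self, size: int = 16) -> None:
        self.memory = bytearray(size)
        self.suspended = False
        self.cursor_on = False

    def set_suspend(self, state: bool) -> None:
        """Mark the device suspended (True) or running (False)."""
        self.suspended = state

    def blink_tick(self, pos: int = 0) -> bool:
        """One firing of the blink timer; returns True if memory was written."""
        if self.suspended:
            return False
        self.cursor_on = not self.cursor_on
        self.memory[pos] = 0xFF if self.cursor_on else 0x00
        return True


def writes_after_snapshot(suspend_fbdev: bool, ticks: int = 3):
    """Count timer writes after the snapshot; report if memory still matches."""
    fb = ToyFbdev()
    if suspend_fbdev:
        fb.set_suspend(True)
    snapshot = bytes(fb.memory)  # stands in for the hibernation image
    writes = sum(fb.blink_tick() for _ in range(ticks))
    matches = bytes(fb.memory) == snapshot
    if suspend_fbdev:
        fb.set_suspend(False)
    return writes, matches


print("without suspend:", writes_after_snapshot(suspend_fbdev=False))
print("with suspend:   ", writes_after_snapshot(suspend_fbdev=True))
```

In this model the unsuspended timer writes after the snapshot is taken, so memory no longer matches the image, while the suspended one performs no writes at all.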
