Bug 35648 - [SNB] Kernel may crash after resume from S4 for several times
Status: VERIFIED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: high critical
Assigned To: Chris Wilson
Duplicates: 35846
Depends on:
Blocks: 42991 44622
Reported: 2011-03-25 00:44 UTC by fangxun
Modified: 2012-04-10 22:20 UTC (History)

Attachments
dmesg file (123.18 KB, text/plain)
2011-03-25 00:44 UTC, fangxun
dmesg_s4_from_vt (123.37 KB, text/plain)
2011-03-28 02:33 UTC, fangxun
setting no_console_suspend (28.00 KB, application/octet-stream)
2011-07-24 22:56 UTC, Ouping Zhang
freeze workqueue on suspend (1.15 KB, patch)
2012-02-16 10:03 UTC, Eugeni Dodonov

Description fangxun 2011-03-25 00:44:57 UTC
Created attachment 44811 [details]
dmesg file

System Environment:
--------------------------
Arch:           x86_64
Platform:       sugarbay
Libdrm:         (master)2.4.24-7-gfd3ed34a2070fca3804baf54ece40d0bc2666226
Mesa:           (7.10)b8a077cee0f3856d5c3d4468918513515bbd0dcb
Xserver:        (master)xorg-server-1.10.0
Xf86_video_intel: (master)2.14.901-4-gee740778f5d5355c04f6fc4564f598993b106d62
Kernel: (drm-intel-fixes)f0c860246472248a534656d6cdbed5a36d1feb2e


Bug detailed description:
-------------------------
The kernel crashes after doing Suspend-to-Disk (S4) several times in X mode on sugarbay. This is a regression. The last known good commit is kernel (2.6.37) 3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5.


Reproduce steps:
----------------
1. xinit&
2. echo disk > /sys/power/state
3. repeat step 2 about 2-20 times
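
The reproduce steps above could be scripted roughly as follows. This is only a sketch: the function name `hibernate_cycles`, the `SETTLE_SECS` variable, and the optional state-file argument are assumptions added for illustration (the argument lets the loop be dry-run against a scratch file), and it assumes the machine is powered back on after each hibernate.

```shell
# Hypothetical helper: request hibernation COUNT times.
# $1 = number of cycles
# $2 = file to write "disk" to (defaults to the real sysfs node)
hibernate_cycles() {
    count=$1
    state_file=${2:-/sys/power/state}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i of $count"
        # On the real sysfs node this write normally returns only
        # after the machine has resumed from hibernation.
        echo disk > "$state_file" || return 1
        # Give the system a moment to settle before the next cycle.
        sleep "${SETTLE_SECS:-10}"
        i=$((i + 1))
    done
}
```

Run as root after `xinit&`, for example `hibernate_cycles 20`, and watch for the cycle on which the crash appears.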
Comment 1 Chris Wilson 2011-03-25 01:27:35 UTC
Can you do some sanity checking with the same kernel but S4 from VT and S4 without i915.ko?
Comment 2 fangxun 2011-03-28 02:33:10 UTC
Created attachment 44931 [details]
dmesg_s4_from_vt

Tested about 50 times with the same kernel. It also crashed on S4 from VT. It didn't happen on S4 without i915.ko.
Comment 3 Chris Wilson 2011-03-29 00:22:42 UTC
Do Zhenyu's SNB resume patches help here? [It would be good to get some testing on those at any rate.]
Comment 4 fangxun 2011-03-29 00:40:55 UTC
Could you tell me where I can find Zhenyu's SNB resume patches?
Comment 5 Wang Zhenyu 2011-03-29 00:43:41 UTC
You can directly download patches from http://people.freedesktop.org/~zhen/snb_desk_suspend_0323/
Comment 6 Wang Zhenyu 2011-03-29 00:45:59 UTC
Also note that Nanhai is working on a workaround that needs to be applied to the SNB render engine after a power cycle. You should ask him so that you can test his patch too.
Comment 7 fangxun 2011-03-29 02:14:49 UTC
Kernel also crashed with Zhenyu's patches.
Comment 8 Michael Fu 2011-03-29 18:41:52 UTC
Xun, would you please help bisect? It sounds like you can easily reproduce the hang within 'several rounds' of S4.

S4 on the SugarBay SDV was pretty stable; Rui on the ACPI team once tested it 1000+ times.
Comment 9 fangxun 2011-03-30 02:46:35 UTC
It seems that this is not a regression. I retested about 60 times with kernel (2.6.37) 3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5 and found that it also crashes.
BTW, I retested S4 without i915.ko about 100 times and no crash happened.
Comment 10 Chris Wilson 2011-03-30 03:13:40 UTC
Adjusting priority fields to reflect severity and impact. We still need to fix it, it will just take a little longer if it was not due to a recent regression.
Comment 11 Gordon Jin 2011-05-24 19:28:51 UTC
Promoting to P1 for Q2 release consideration.

Xun, how many of our SNB machines have this problem?
Comment 12 fangxun 2011-05-26 02:30:21 UTC
(In reply to comment #11)
> Promoting to P1 for Q2 release consideration.
> Xun, how many of our SNB machines have this problem?

It happens on our two sugarbay machines.
x-sgb1: SugarBay Qual SDP (DH): i7-2600 D2 (id=0x0102, rev 09), H67 B1 (Intel DH67CL rev 03), and Host Bridge id=0x0100 (rev09)

x-sgb3: SugarBay desktop: i5-2500K product (id=0x0112, rev 09), H67 B1 SDP (rev 03), and Host Bridge id=0x0100 (rev09)
Comment 13 Jesse Barnes 2011-06-16 11:01:26 UTC
The panic happens in d_move, which makes me think we're clobbering filesystem state somehow.  Is the backtrace you get consistent?  Does it still happen with 3.0-rc2?
Comment 14 Chris Wilson 2011-07-18 07:48:11 UTC
Highly unlikely to be fixed before release.
Comment 15 Gordon Jin 2011-07-18 17:22:10 UTC
I tend to put it in the P1 list -- even if we can't fix it in this release, we need to maintain it as a known issue in the release notes.

Xun, can you answer Jesse's question (comment#13), by running the latest drm-intel-fixes?
Comment 16 fangxun 2011-07-19 02:31:28 UTC
It still happens with the latest drm-intel-fixes kernel (3.0.0rc7).
The backtrace seems to be different from the previous one. Below is the call trace.

Call Trace:
 kernel: [<ffffffff8110d9ce>] ? __sync_filesystem+0x75/0x75
 kernel: [<ffffffff8110d99b>] __sync_filesystem+0x42/0x75
 kernel: [<ffffffff8110d9df>] sync_one_sb+0x11/0x13
 kernel: [<ffffffff810ed4c6>] iterate_supers+0x67/0xb7
 kernel: [<ffffffff8110da21>] sys_sync+0x40/0x57
 kernel: [<ffffffff81070f2c>] hibernate+0x88/0x1b8
 kernel: [<ffffffff8106fa4c>] state_store+0x57/0xce
 kernel: [<ffffffff811df993>] kobj_attr_store+0x17/0x19
 kernel: [<ffffffff811420a2>] sysfs_write_file+0x10c/0x148
 kernel: [<ffffffff810ebd0d>] vfs_write+0xae/0x153
 kernel: [<ffffffff810ebe6b>] sys_write+0x45/0x6c
 kernel: [<ffffffff813bf6fb>] system_call_fastpath+0x16/0x1b
 kernel: Code: 48 c7 c7 40 23 69 81 45 31 ed e8 b3 f6 2a 00 49 8b 9c 24 c0 00 00 00 49 81 c4 c0 00 00 00 48 81 eb 90 00 00 00 eb 7b 4c 8d 73 20 <4c> 8b bb 48 01 00 00 4c 89 f7 e8 88 f6 2a 00 f6 43 28 38 75 07
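
Since the backtrace differs between runs, it may help to reduce each trace to its function names so that two captures can be diffed. A minimal sketch with sed, assuming the dmesg line format shown above; `trace_symbols` is a hypothetical helper name.

```shell
# Hypothetical helper: read kernel log text on stdin and print one
# function name per stack frame, dropping addresses and offsets.
# Lines that are not stack frames (e.g. "Call Trace:") are skipped.
trace_symbols() {
    # Matches "[<address>] symbol+0xOFF/0xLEN", with an optional "? "
    # marker (used for unreliable frames) in front of the symbol.
    sed -n 's/.*\[<[0-9a-f]*>\] \(? \)\{0,1\}\([A-Za-z0-9_.]*\)+0x.*/\2/p'
}
```

For example, `trace_symbols < first.log > first.sym` and the same for a second capture, followed by `diff first.sym second.sym` (the file names are placeholders).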
Comment 17 Ouping Zhang 2011-07-24 22:56:03 UTC
Jesse, after setting no_console_suspend I got more info out. Please check the attached log (setting no_console_suspend).
(In reply to comment #13)
> The panic happens in d_move, which makes me think we're clobbering filesystem
> state somehow.  Is the backtrace you get consistent?  Does it still happen with
> 3.0-rc2?
Comment 18 Ouping Zhang 2011-07-24 22:56:46 UTC
Created attachment 49485 [details]
setting no_console_suspend
Comment 19 Jesse Barnes 2011-08-01 11:00:19 UTC
Those backtraces look unrelated to gfx. Does the same panic occur even without i915 loaded? (You'll need to use netconsole to see it in that case.)
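
For reference, netconsole takes its configuration as a single module (or boot) parameter of the form `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]`. A small sketch that builds that string; the helper name `netconsole_param` and every address in the usage line are placeholders, not values from this report. Ports 6665 and 6666 are netconsole's documented defaults.

```shell
# Hypothetical helper: build a netconsole parameter string.
# $1 = local network device, $2 = local IP, $3 = receiver IP,
# $4 = receiver MAC (optional; netconsole broadcasts if omitted).
netconsole_param() {
    dev=$1
    src_ip=$2
    tgt_ip=$3
    tgt_mac=${4:-}
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' \
        "$src_ip" "$dev" "$tgt_ip" "$tgt_mac"
}
```

It could be used as `modprobe netconsole "$(netconsole_param eth0 192.168.1.2 192.168.1.3)"`, with a UDP listener on port 6666 of the receiving machine.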
Comment 20 fangxun 2011-08-25 01:05:09 UTC
The panic still occurs with kernel 3.1.0-rc1. It doesn't occur without i915 loaded.
Comment 21 Eugeni Dodonov 2011-10-10 14:09:23 UTC
Hi,

Could you please check if those issues happen if you disable modesetting (e.g., boot with 'nomodeset' kernel parameter)?
Comment 22 fangxun 2011-10-12 01:51:28 UTC
The issue goes away by using 'nomodeset' kernel parameter. It still happens when modeset is used.
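
When collecting results like these it is easy to lose track of which boot actually had modesetting disabled. A small sketch that checks the running kernel's command line; `has_nomodeset` is a hypothetical helper, and the optional file argument exists only so the check can be tried against a saved copy of /proc/cmdline.

```shell
# Hypothetical helper: succeed (exit status 0) if the kernel command
# line contains the bare "nomodeset" parameter.
# $1 = file to inspect (defaults to /proc/cmdline)
has_nomodeset() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each parameter on its own line, then require an exact
    # whole-line match so that e.g. "i915.modeset=1" does not count.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'nomodeset'
}
```

For example, `has_nomodeset && echo "KMS disabled for this boot"` before starting a run of S4 cycles.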
Comment 23 fangxun 2011-11-09 00:52:33 UTC
It still fails on SandyBridge with kernel 3.1 (c3b92c8787367a8bb53d57d9789b558f1295cc96). I don't see this on IvyBridge.
Comment 24 Eugeni Dodonov 2011-11-25 11:01:17 UTC
*** Bug 35846 has been marked as a duplicate of this bug. ***
Comment 25 Eugeni Dodonov 2012-02-16 10:03:11 UTC
Created attachment 57170 [details] [review]
freeze workqueue on suspend

Could you please try with this patch and verify if it changes anything?
Comment 26 libo 2012-02-16 23:52:32 UTC
Sorry, the bug still exists on 3.2.4 with that patch.
(In reply to comment #25)
> Created attachment 57170 [details] [review]
> freeze workqueue on suspend
> 
> Could you please try with this patch and verify if it changes anything?
Comment 28 Eugeni Dodonov 2012-03-30 09:53:55 UTC
Could you please try Dave's patch from https://lkml.org/lkml/2012/3/29/72 (the patch itself is http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=3fa016a0b5c5237e9c387fc3249592b2cb5391c6)? I am fairly sure it could solve this.
Comment 29 Chris Wilson 2012-03-30 10:08:26 UTC
We believe we finally have the root cause of so many crashes following hibernation. Please update and test, thanks.

commit 3fa016a0b5c5237e9c387fc3249592b2cb5391c6
Author: Dave Airlie <airlied@redhat.com>
Date:   Wed Mar 28 10:48:49 2012 +0100

    drm/i915: suspend fbdev device around suspend/hibernate
    
    Looking at hibernate overwriting I though it looked like a cursor,
    so I tracked down this missing piece to stop the cursor blink
    timer. I've no idea if this is sufficient to fix the hibernate
    problems people are seeing, but please test it.
    
    Both radeon and nouveau have done this for a long time.
    
    I've run this personally all night hib/resume cycles with no fails.
    
    Reviewed-by: Keith Packard <keithp@keithp.com>
    Reported-by: Petr Tesarik <kernel@tesarici.cz>
    Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
    Reported-by: Lots of misc segfaults after hibernate across the world.
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=37142
    Tested-by: Dave Airlie <airlied@redhat.com>
    Tested-by: Bojan Smojver <bojan@rexursive.com>
    Tested-by: Andreas Hartmann <andihartmann@01019freenet.de>
    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Airlie <airlied@redhat.com>
Comment 30 fangxun 2012-04-10 22:20:13 UTC
It works fine. No crash happens. Verified with drm-intel-fixes commit 14667a4bde4361b7ac420d68a2e9e9b9b2df5231.