35648 – [SNB] Kernel may crash after resume from S4 for several times

Summary: [SNB] Kernel may crash after resume from S4 for several times

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	All Linux (All)

Importance:	high critical
Assignee:	Chris Wilson
QA Contact:

URL:
Whiteboard:
Keywords:

Duplicates (1):	35846
Depends on:
Blocks:	42991 44622

Reported:	2011-03-25 00:44 UTC by fangxun
Modified:	2017-09-04 10:05 UTC
CC List:	6 users

See Also:
i915 platform:
i915 features:

Attachments
dmesg file (123.18 KB, text/plain) 2011-03-25 00:44 UTC, fangxun	no flags
dmesg_s4_from_vt (123.37 KB, text/plain) 2011-03-28 02:33 UTC, fangxun	no flags
setting no_console_suspend (28.00 KB, application/octet-stream) 2011-07-24 22:56 UTC, Ouping Zhang	no flags
freeze workqueue on suspend (1.15 KB, patch) 2012-02-16 10:03 UTC, Eugeni Dodonov	no flags

Description fangxun 2011-03-25 00:44:57 UTC

Created attachment 44811
dmesg file

System Environment:
--------------------------
Arch:           x86_64
Platform:       sugarbay
Libdrm:         (master)2.4.24-7-gfd3ed34a2070fca3804baf54ece40d0bc2666226
Mesa:           (7.10)b8a077cee0f3856d5c3d4468918513515bbd0dcb
Xserver:        (master)xorg-server-1.10.0
Xf86_video_intel: (master)2.14.901-4-gee740778f5d5355c04f6fc4564f598993b106d62
Kernel: (drm-intel-fixes)f0c860246472248a534656d6cdbed5a36d1feb2e


Bug detailed description:
-------------------------
The kernel crashed after doing Suspend-To-Disk (S4) several times in X mode on sugarbay. This is a regression. The last known good commit is kernel (2.6.37) 3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5.


Reproduce steps:
----------------
1. xinit&
2. echo disk > /sys/power/state
3. repeat step 2 about 2-20 times
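
The repetition in steps 2 and 3 can be scripted. The following is only a minimal sketch, not the reporter's actual test script: the function name hibernate_cycles and the STATE_FILE and SETTLE_SECS variables are hypothetical names introduced here for illustration. It assumes a root shell, a configured swap/resume device, and that the write to /sys/power/state returns only after the machine has resumed.

```shell
#!/bin/sh
# Hypothetical helper for repeating step 2 of the reproduce steps.
# STATE_FILE, SETTLE_SECS and hibernate_cycles are illustrative names,
# not taken from the bug report.
STATE_FILE="${STATE_FILE:-/sys/power/state}"

hibernate_cycles() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i of $count"
        # Request S4; on a real system this write blocks until resume.
        echo disk > "$STATE_FILE" || return 1
        # Let the system settle before requesting the next cycle.
        sleep "${SETTLE_SECS:-10}"
        i=$((i + 1))
    done
}
```

After sourcing the file, calling hibernate_cycles 20 from a terminal inside the X session would cover the upper end of the "2-20 times" range given in step 3. If the kernel crashes, the last "cycle N of M" line printed shows how many cycles were needed.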

Comment 1 Chris Wilson 2011-03-25 01:27:35 UTC

Can you do some sanity checking with the same kernel but S4 from VT and S4 without i915.ko?

Comment 2 fangxun 2011-03-28 02:33:10 UTC

Created attachment 44931
dmesg_s4_from_vt

Tested about 50 times with the same kernel; it also crashed on S4 from VT. It didn't happen on S4 without i915.ko.

Comment 3 Chris Wilson 2011-03-29 00:22:42 UTC

Do Zhenyu's SNB resume patches help here? [It would be good to get some testing on those at any rate.]

Comment 4 fangxun 2011-03-29 00:40:55 UTC

Could you tell me where I can find Zhenyu's SNB resume patches?

Comment 5 Wang Zhenyu 2011-03-29 00:43:41 UTC

You can directly download patches from http://people.freedesktop.org/~zhen/snb_desk_suspend_0323/

Comment 6 Wang Zhenyu 2011-03-29 00:45:59 UTC

Also note that Nanhai is working on a workaround that needs to be applied for the SNB render engine after a power cycle. You should ask him to test his patch too.

Comment 7 fangxun 2011-03-29 02:14:49 UTC

Kernel also crashed with Zhenyu's patches.

Comment 8 Michael Fu 2011-03-29 18:41:52 UTC

Xun, would you please help bisect? It sounds like you can easily reproduce the hang in 'several rounds' of S4...

S4 on the SugarBay SDV was pretty stable; Rui of the ACPI team once tested it 1000+ times.

Comment 9 fangxun 2011-03-30 02:46:35 UTC

It seems that this is not a regression. I retested about 60 times with kernel (2.6.37) 3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5 and found that it also crashes.
BTW, I retested S4 without i915.ko about 100 times and no crash happened.

Comment 10 Chris Wilson 2011-03-30 03:13:40 UTC

Adjusting priority fields to reflect severity and impact. We still need to fix it, it will just take a little longer if it was not due to a recent regression.

Comment 11 Gordon Jin 2011-05-24 19:28:51 UTC

Promoting to P1 for Q2 release consideration.

Xun, how many of our SNB machines have this problem?

Comment 12 fangxun 2011-05-26 02:30:21 UTC

(In reply to comment #11)
> Promoting to P1 for Q2 release consideration.
> Xun, how many of our SNB machines have this problem?

It happens on our two sugarbay machines.
x-sgb1: SugarBay Qual SDP (DH): i7-2600 D2 (id=0x0102, rev 09), H67 B1 (Intel DH67CL rev 03), and Host Bridge id=0x0100 (rev09)

x-sgb3: SugarBay desktop: i5-2500K product (id=0x0112, rev 09), H67 B1 SDP (rev 03), and Host Bridge id=0x0100 (rev09)

Comment 13 Jesse Barnes 2011-06-16 11:01:26 UTC

The panic happens in d_move, which makes me think we're clobbering filesystem state somehow.  Is the backtrace you get consistent?  Does it still happen with 3.0-rc2?

Comment 14 Chris Wilson 2011-07-18 07:48:11 UTC

Highly unlikely to be fixed before release.

Comment 15 Gordon Jin 2011-07-18 17:22:10 UTC

I tend to put it in the P1 list: even if we can't fix it in this release, we need to keep it as a known issue in the release notes.

Xun, can you answer Jesse's question (comment#13), by running the latest drm-intel-fixes?

Comment 16 fangxun 2011-07-19 02:31:28 UTC

It still happens with the latest drm-intel-fixes kernel (3.0.0-rc7).
The backtrace seems to be different from the previous one. Below is the call trace.

Call Trace:
 kernel: [<ffffffff8110d9ce>] ? __sync_filesystem+0x75/0x75
 kernel: [<ffffffff8110d99b>] __sync_filesystem+0x42/0x75
 kernel: [<ffffffff8110d9df>] sync_one_sb+0x11/0x13
 kernel: [<ffffffff810ed4c6>] iterate_supers+0x67/0xb7
 kernel: [<ffffffff8110da21>] sys_sync+0x40/0x57
 kernel: [<ffffffff81070f2c>] hibernate+0x88/0x1b8
 kernel: [<ffffffff8106fa4c>] state_store+0x57/0xce
 kernel: [<ffffffff811df993>] kobj_attr_store+0x17/0x19
 kernel: [<ffffffff811420a2>] sysfs_write_file+0x10c/0x148
 kernel: [<ffffffff810ebd0d>] vfs_write+0xae/0x153
 kernel: [<ffffffff810ebe6b>] sys_write+0x45/0x6c
 kernel: [<ffffffff813bf6fb>] system_call_fastpath+0x16/0x1b
 kernel: Code: 48 c7 c7 40 23 69 81 45 31 ed e8 b3 f6 2a 00 49 8b 9c 24 c0 00 00 00 49 81 c4 c0 00 00 00 48 81 eb 90 00 00 00 eb 7b 4c 8d 73 20 <4c> 8b bb 48 01 00 00 4c 89 f7 e8 88 f6 2a 00 f6 43 28 38 75 07

Comment 17 Ouping Zhang 2011-07-24 22:56:03 UTC

(In reply to comment #13)
> The panic happens in d_move, which makes me think we're clobbering filesystem
> state somehow.  Is the backtrace you get consistent?  Does it still happen with
> 3.0-rc2?

Jesse, after setting no_console_suspend, I got more information out; please check the attached setting no_console_suspend.log.

Comment 18 Ouping Zhang 2011-07-24 22:56:46 UTC

Created attachment 49485
setting no_console_suspend

Comment 19 Jesse Barnes 2011-08-01 11:00:19 UTC

Those backtraces look unrelated to gfx. Does the same panic occur even without i915 loaded? (You'll still need to use netconsole to see it.)

Comment 20 fangxun 2011-08-25 01:05:09 UTC

The panic still occurs with kernel 3.1.0-rc1. It doesn't occur without i915 loaded.

Comment 21 Eugeni Dodonov 2011-10-10 14:09:23 UTC

Hi,

Could you please check if those issues happen if you disable modesetting (e.g., boot with 'nomodeset' kernel parameter)?

Comment 22 fangxun 2011-10-12 01:51:28 UTC

The issue goes away when the 'nomodeset' kernel parameter is used. It still happens when modesetting is enabled.

Comment 23 fangxun 2011-11-09 00:52:33 UTC

It still fails on SandyBridge with kernel 3.1 (c3b92c8787367a8bb53d57d9789b558f1295cc96). I don't see this on IvyBridge.

Comment 24 Eugeni Dodonov 2011-11-25 11:01:17 UTC

*** Bug 35846 has been marked as a duplicate of this bug. ***

Comment 25 Eugeni Dodonov 2012-02-16 10:03:11 UTC

Created attachment 57170
freeze workqueue on suspend

Could you please try with this patch and verify if it changes anything?

Comment 26 libo 2012-02-16 23:52:32 UTC

(In reply to comment #25)
> Created attachment 57170
> freeze workqueue on suspend
> 
> Could you please try with this patch and verify if it changes anything?

Sorry, the bug still exists on 3.2.4 with that patch.

Comment 28 Eugeni Dodonov 2012-03-30 09:53:55 UTC

Could you please try Dave's patch from https://lkml.org/lkml/2012/3/29/72 (the patch itself is http://cgit.freedesktop.org/~airlied/linux/commit/?h=drm-fixes&id=3fa016a0b5c5237e9c387fc3249592b2cb5391c6)? I am fairly sure it could solve this.

Comment 29 Chris Wilson 2012-03-30 10:08:26 UTC

We believe we finally have the root cause of so many crashes following hibernation. Please update and test, thanks.

commit 3fa016a0b5c5237e9c387fc3249592b2cb5391c6
Author: Dave Airlie <airlied@redhat.com>
Date:   Wed Mar 28 10:48:49 2012 +0100

    drm/i915: suspend fbdev device around suspend/hibernate
    
    Looking at hibernate overwriting I though it looked like a cursor,
    so I tracked down this missing piece to stop the cursor blink
    timer. I've no idea if this is sufficient to fix the hibernate
    problems people are seeing, but please test it.
    
    Both radeon and nouveau have done this for a long time.
    
    I've run this personally all night hib/resume cycles with no fails.
    
    Reviewed-by: Keith Packard <keithp@keithp.com>
    Reported-by: Petr Tesarik <kernel@tesarici.cz>
    Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
    Reported-by: Lots of misc segfaults after hibernate across the world.
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=37142
    Tested-by: Dave Airlie <airlied@redhat.com>
    Tested-by: Bojan Smojver <bojan@rexursive.com>
    Tested-by: Andreas Hartmann <andihartmann@01019freenet.de>
    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Airlie <airlied@redhat.com>

Comment 30 fangxun 2012-04-10 22:20:13 UTC

It works fine. No crash happens. Verified with drm-intel-fixes commit 14667a4bde4361b7ac420d68a2e9e9b9b2df5231.

Comment 31 Jari Tahvanainen 2017-09-04 10:05:27 UTC

Closing old verified+fixed.
