65496 – [HSW regression] resume from s4 sporadically causes call trace and system hang, with warm boot

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	Other All

Importance:	highest critical
Assignee:	Paulo Zanoni
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Duplicates (2):	66301 78056
Depends on:
Blocks:

Reported:	2013-06-07 07:51 UTC by shui yangwei
Modified:	2016-10-19 12:39 UTC
CC List:	7 users

See Also:	82340
i915 platform:
i915 features:

Attachments
picture: probabilistic S4 call trace and hang (1.76 MB, image/jpeg) 2013-06-07 07:51 UTC, shui yangwei
picture: S4 sporadically cause call trace and system hang(-next-queued kernel) (1.33 MB, image/jpeg) 2013-06-13 07:21 UTC, shui yangwei
call trace and hang on hsw desktop (2.70 MB, image/jpeg) 2013-08-08 01:33 UTC, cancan,feng
call trace of S4 at 4th time on HSW ULT (2.02 MB, image/jpeg) 2013-08-08 03:19 UTC, cancan,feng
ult S4 dmesg by using serial port (182.40 KB, text/plain) 2013-08-08 05:44 UTC, cancan,feng
netconsole grab information (4.26 KB, text/plain) 2013-10-10 05:23 UTC, shui yangwei
patch: Ben's patch in outside bugzilla comment 22 (6.22 KB, text/plain) 2013-10-12 01:42 UTC, shui yangwei
Picture: S4 resume hang and screen out put call trace (1.65 MB, image/jpeg) 2013-10-22 10:05 UTC, shui yangwei
Fault PDEs too (1.39 KB, patch) 2013-10-26 01:20 UTC, Ben Widawsky

Description shui yangwei 2013-06-07 07:51:01 UTC

Created attachment 80458 [details]
picture: probabilistic S4 call trace and hang

Environment:
-------------------
Kernel: (drm-intel-next-queued)d41ca032afdb4e12a9782df523c8798cd42aaaa3
Some additional commit info:
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Apr 11 19:49:07 2013 +0200

    drm/i915: move debug output back to the right place

Description:
--------------------
This bug was split off from bug #63586. When running the S4 test in a loop, the system resumed with a call trace and hung on the 95th cycle. The issue is sporadic: it may show up after only a few S4 cycles or after many. A picture of the call trace is attached.

Steps to reproduce:
----------------------
1. Boot the machine with the "reboot" command (warm boot).
2. echo 0 > /sys/class/rtc/rtc0/wakealarm;
   echo +10 > /sys/class/rtc/rtc0/wakealarm
3. echo disk > /sys/power/state
4. Repeat steps 2 and 3 in a loop.
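
The steps above can be scripted roughly as follows. This is a minimal sketch, not the script QA actually used: the function names (run_step, s4_cycle, s4_loop) and the DRY_RUN switch are assumptions made for illustration; only the sysfs paths and the values written to them come from the report. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the S4 loop from the reproduce steps.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 and
# run as root to really hibernate the machine.
RTC_ALARM=/sys/class/rtc/rtc0/wakealarm
POWER_STATE=/sys/power/state
DRY_RUN=${DRY_RUN:-1}

# run_step VALUE PATH: write VALUE to PATH, or print the command in dry-run mode.
run_step() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "echo $1 > $2"
    else
        echo "$1" > "$2"
    fi
}

# s4_cycle: one hibernate/resume round (steps 2 and 3).
s4_cycle() {
    # Clear any pending alarm, then arm a wake-up 10 seconds from now.
    run_step 0 "$RTC_ALARM" || return 1
    run_step +10 "$RTC_ALARM" || return 1
    # Enter S4; the write returns once the machine has resumed.
    run_step disk "$POWER_STATE"
}

# s4_loop COUNT: repeat the cycle COUNT times (step 4), stopping on the first failure.
s4_loop() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "cycle $i"
        s4_cycle || return 1
        i=$((i + 1))
    done
}
```

Calling "s4_loop 100" at the end of the script with DRY_RUN=0 would correspond to the roughly 100-cycle reliability runs mentioned in this report. The cycle number printed last before a hang tells you on which round the failure occurred.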

Comment 1 Daniel Vetter 2013-06-07 08:01:09 UTC

The important part of the oops has scrolled off the screen already :(

Can you please boot with pause_on_oops=60 so that the kernel waits 1 minute once the first oops shows up until it continues? That way you should be able to catch it.

I'll add this to our QA bug filing BKMs.

Comment 2 Gavin Hindman 2013-06-11 16:19:53 UTC

Any system hang on resume should be considered critical - upgrading

Comment 3 shui yangwei 2013-06-13 02:04:16 UTC

(In reply to comment #1)
> The important part of the oops has scrolled off the screen already :(
> 
> Can you please boot with pause_on_oops=60 so that the kernel waits 1 minute
> once the first oops shows up until it continues? That way you should be able
> to catch it.
> 
> I'll add this to our QA bug filing BKMs.

I added pause_on_oops=60 in "/boot/grub2/grub.cfg" and tested with the 3.9.5 RC2 release kernel. The issue also shows up there, on the 13th S4 cycle: the machine prints a call trace and hangs.

Comment 4 Yi Sun 2013-06-13 06:12:10 UTC

(In reply to comment #3)
> (In reply to comment #1)
> > The important part of the oops has scrolled off the screen already :(
> > 
> > Can you please boot with pause_on_oops=60 so that the kernel waits 1 minute
> > once the first oops shows up until it continues? That way you should be able
> > to catch it.
> > 
> > I'll add this to our QA bug filing BKMs.
> 
> I added pause_on_oops=60 in "/boot/grub2/grub.cfg", test with 3.9.5 RC2
> release kernel, this issue also come up at 13th S4 tests. Machine will call
> trace and hang.

Yangwei, you may have misunderstood what Daniel said. The option just makes the system pause when the error happens, so that you can take a picture and attach it here.
BTW, I think you just test this issue with dinq branch, but not 3.9.y branch.
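
Before starting a long run it can help to confirm that the option really reached the running kernel. A minimal sketch, assuming a hypothetical helper named cmdline_value; /proc/cmdline and the pause_on_oops parameter are real, the rest is illustration only.

```shell
# cmdline_value NAME [CMDLINE]: print the value of NAME=... found on a kernel
# command line. CMDLINE defaults to the running kernel's /proc/cmdline.
# Returns non-zero if the parameter is not present.
cmdline_value() {
    cmdline=${2-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            "$1"=*)
                # Strip everything up to and including the first "=".
                echo "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

For example, "cmdline_value pause_on_oops || echo 'pause_on_oops is not set'" run after boot shows whether the edit to grub.cfg took effect.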

Comment 5 shui yangwei 2013-06-13 07:21:19 UTC

Created attachment 80757 [details]
picture: S4 sporadically cause call trace and system hang(-next-queued kernel)

Environment:
--------------------
Kernel: (drm-intel-next-queued)a43adf0747ecde0d211f29adcbebf067f92e9cbb
Some additional commit info:
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 10 11:20:22 2013 +0100

    drm/i915: Eliminate the addr/seqno from the hangcheck warning


Description:
--------------------
I find that this bug is much easier to reproduce with the latest -next-queued kernel. I ran the reliability test three times, and each time it reproduced within 4 S4 cycles.

I attached a picture, this time taken with the oops option enabled.

Comment 6 Paulo Zanoni 2013-06-14 14:30:37 UTC

Just to make sure: this never happens if i915.ko is blacklisted, right?

Comment 7 shui yangwei 2013-06-19 06:26:22 UTC

(In reply to comment #6)
> Just to make sure: this never happens if i915.ko is blacklisted, right?

Yes. With i915.ko blacklisted, I ran S4 in a loop 100 times and every cycle passed.

Comment 8 Todd Previte 2013-07-19 18:09:53 UTC

I'll see if I can reproduce this. Will provide more information as available.

Comment 9 Jesse Barnes 2013-07-22 17:35:45 UTC

Make sure you're testing the latest BIOS too; there have been fixes for suspend/resume issues for recent regressions and failures.

Comment 10 cancan,feng 2013-07-24 07:26:13 UTC

(In reply to comment #9)
> Make sure you're testing the latest BIOS too; there have been fixes for
> suspend/resume issues for recent regressions and failures.

Upgraded the BIOS to v128.
The call trace appeared on the 23rd S4 cycle.

Comment 11 Paulo Zanoni 2013-08-06 13:05:13 UTC

(In reply to comment #10)
> (In reply to comment #9)
> > Make sure you're testing the latest BIOS too; there have been fixes for
> > suspend/resume issues for recent regressions and failures.
> 
> Upgrade BIOS to v128.
> Call trace appears at the 23rd time while doing a S4 cycle.

Latest BIOS is 131.3. Does it solve the issue?

Comment 12 Todd Previte 2013-08-06 14:58:46 UTC

Thus far, I have not seen the call trace on my ULT in the testing I have done. I've tried this with kernels built from linux-stable, linux-next and Torvalds' tree. That said, the machine has failed to fully resume from sleep, although there were no indicators in dmesg or syslog as to the cause of the failure.

At this point, I'm going to update the BIOS per the recommendation above and retest the same kernels to see if I can either a) get the call trace, b) no longer fails to resume or c) fails in a way that yields something useful. I'll update when I have the information from that testing.

Comment 13 shui yangwei 2013-08-07 01:03:17 UTC

(In reply to comment #12)
> Thus far, I have not seen the call trace on my ULT in the testing I have
> done. I've tried this with kernels built from linux-stable, linux-next and
> torvald's tree. That said, the machine has failed to fully resume from
> sleep, although there were no indicators in dmesg or syslog as to the cause
> of the failure. 
> 
> At this point, I'm going to update the BIOS per the recommendation above and
> retest the same kernels to see if I can either a) get the call trace, b) no
> longer fails to resume or c) fails in a way that yields something useful.
> I'll update when I have the information from that testing.

A piece of advice: please run S4 in a loop as a reliability test, for about 100 cycles.

Comment 14 cancan,feng 2013-08-08 01:30:53 UTC

(In reply to comment #12)
> Thus far, I have not seen the call trace on my ULT in the testing I have
> done. I've tried this with kernels built from linux-stable, linux-next and
> torvald's tree. That said, the machine has failed to fully resume from
> sleep, although there were no indicators in dmesg or syslog as to the cause
> of the failure. 
> 
> At this point, I'm going to update the BIOS per the recommendation above and
> retest the same kernels to see if I can either a) get the call trace, b) no
> longer fails to resume or c) fails in a way that yields something useful.
> I'll update when I have the information from that testing.

I just used the latest -next-queued kernel on a HSW desktop to test S4, and the issue happened on the 4th cycle. I will test ULT later, and you can test on a desktop too. I will also upgrade our BIOS (currently v128) to see whether that makes any difference.
I took a picture of this call trace. :)

Comment 15 cancan,feng 2013-08-08 01:33:58 UTC

Created attachment 83802 [details]
call trace and hang on hsw desktop

Comment 16 cancan,feng 2013-08-08 03:19:52 UTC

Created attachment 83804 [details]
call trace of S4 at 4th time on HSW ULT

Comment 17 cancan,feng 2013-08-08 03:22:15 UTC

I ran S4 cycles on HSW ULT using the latest -next-queued kernel, and the call trace and hang happened on the 4th cycle. I attached the picture as hsw_ULT_S4.jpg.

HSW ULT BIOS version: 131.1

Comment 18 cancan,feng 2013-08-08 05:39:38 UTC

I ran the S4 cycle again. This time I captured the dmesg output over the serial port, which should be useful for you. :)

Comment 19 cancan,feng 2013-08-08 05:40:46 UTC

(In reply to comment #18)
> I did S4 cycle again. This time I grabbed the dmesg info by serial port that
> will be useful for you. :)

I forgot to say: this was on HSW ULT.

Comment 20 cancan,feng 2013-08-08 05:44:37 UTC

Created attachment 83806 [details]
ult S4 dmesg by using serial port

Comment 21 Daniel Vetter 2013-08-08 06:40:38 UTC

Hm, the ULT backtraces might be something else since it's just the NMI handler which takes forever to run. But maybe that's just because the machine is dead already and the useful backtraces scrolled off the screen already ...

Comment 22 cancan,feng 2013-08-08 09:10:32 UTC

(In reply to comment #21)
> Hm, the ULT backtraces might be something else since its just the NMI
> handler which takes forever to run. But maybe that's just because the
> machine is dead already and the useful backtraces scrolled off the screen
> already ...

We upgraded the HSW desktop's BIOS to the newest version, 131.3, and the result was no display output and a boot failure. The machine cannot start up even after going back to the original BIOS. This happened on two of our HSW desktops, so I'm afraid I won't upgrade a third desktop for now.

Comment 23 Todd Previte 2013-08-13 14:52:38 UTC

I updated the BIOS to 131.3 and did not experience any of the boot or system failure issues reported. After the BIOS update, I performed multiple test sequences of varying counts of suspend/resume cycles, none of which resulted in a call trace or system hang. I also did not see any failures to resume as I previously sighted with earlier BIOS revisions.

I'm going to run 150-cycle test today and see if I can get something to happen.

Comment 24 Todd Previte 2013-08-16 19:25:32 UTC

Finally, with a fresh clone of the linux-next kernel and BIOS 131.3, I'm seeing some call traces on resume in less than 10 suspend/resume cycles. The call traces are different than those posted here, so I need to do more investigation on what's going on. I'll post clean log captures when available for comparison.

Comment 25 Rodrigo Vivi 2013-08-19 16:55:30 UTC

Todd, what is the VBIOS version on this 131.3 bios you are using?

Thanks

Comment 26 Todd Previte 2013-08-19 18:04:11 UTC

VBIOS version is 2173

Using Daniel's top of tree, I'm also able to reproduce the problem in short order. Multiple call traces in the logs after 11 or 12 runs.

Comment 27 Rodrigo Vivi 2013-08-20 15:44:51 UTC

Thanks Todd.

For the record: I asked because SuSE is still facing S4 errors on a machine with VBIOS 2175. AFAIK they work around the issue in their image, but there is no real fix yet.

Comment 28 Todd Previte 2013-08-26 20:05:57 UTC

This is appearing to be a memory corruption issue, based on the fact that the call traces I'm seeing are different each time it happens. I've enabled some of the built-in memory check facilities in the kernel and built kernels from drm-intel-nightly and drm-intel-fixes. None of the kernels built from these two branches successfully boot the machine. Investigation continues.

Comment 29 Paulo Zanoni 2013-09-03 21:01:43 UTC

I've been doing some investigation and although I see backtraces mostly for file-system operations, they always seem to happen when we're doing gmbus/dp-aux. I also get "general protection fault: 0000" messages.

These errors don't crash my machine.

Comment 30 Paulo Zanoni 2013-09-04 22:08:16 UTC

(In reply to comment #29)
> I've been doing some investigation and although I see backtraces mostly for
> file-system operations, they always seem to happen when we're doing
> gmbus/dp-aux. I also get "general protection fault: 0000" messages.
> 
> These errors don't crash my machine.

I spent even more time debugging this today, and again: most of the error messages I see come when we're doing gmbus/dp-aux.

Example:

[  206.436517] [drm:gmbus_xfer], GMBUS [i915 gmbus dpb] NAK for addr: 0050 r(1)                                       
[  206.496851] [drm:drm_do_probe_ddc_edid], drm: skipping non-existent adapter i915 gmbus dpb                         
[  206.529782] ------------[ cut here ]------------                                                                   
[  206.529787] WARNING: CPU: 3 PID: 24 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0()                            
[  206.529788] list_del corruption. prev->next should be ffff880142da7010, but was 00d5000000d40000                   
[  206.529796] Modules linked in: parport_pc bnep ppdev rfcomm bluetooth lp parport ext2 dm_crypt e1000e ptp i915 i2c_algo_bit drm_kms_helper pps_core drm video
[  206.529798] CPU: 3 PID: 24 Comm: ksoftirqd/3 Not tainted 3.11.0-rc7.1309041757+ #2159
[  206.529799] Hardware name: Intel Corporation Shark Bay Client platform/WhiteTip Mountain 1, BIOS HSWLPTU1.86C.0124.R02.1305030131 05/03/2013
[  206.529801]  0000000000000009 ffff880147001bc0 ffffffff816195dd ffff880147001c08                                   
[  206.529803]  ffff880147001bf8 ffffffff81041c92 ffff880142da6fd0 0000000000000286                                   
[  206.529804]  ffffea0004db7200 ffffffff811699e6 0000000000000000 ffff880147001c58                                   
[  206.529805] Call Trace:                                                                                            
[  206.529807]  [<ffffffff816195dd>] dump_stack+0x54/0x74                                                             
[  206.529810]  [<ffffffff81041c92>] warn_slowpath_common+0x82/0xb0                                                   
[  206.529812]  [<ffffffff811699e6>] ? __d_free+0x46/0x70                                                             
[  206.529814]  [<ffffffff81041d77>] warn_slowpath_fmt+0x47/0x50                                                      
[  206.529815]  [<ffffffff811699e6>] ? __d_free+0x46/0x70                                                             
[  206.529817]  [<ffffffff812b7801>] __list_del_entry+0xa1/0xd0                                                       
[  206.529819]  [<ffffffff8114eb82>] __delete_object+0x32/0xb0                                                        
[  206.529821]  [<ffffffff8114f46c>] delete_object_full+0x1c/0x30                                                     
[  206.529824]  [<ffffffff8160c4c1>] kmemleak_free+0x21/0x50                                                          
[  206.529827]  [<ffffffff81141190>] kmem_cache_free+0x140/0x1a0                                                      
[  206.529828]  [<ffffffff811699e6>] __d_free+0x46/0x70                                                               
[  206.529831]  [<ffffffff810e1b8a>] rcu_process_callbacks+0x1ea/0x5a0                                                
[  206.529834]  [<ffffffff8104727a>] __do_softirq+0xda/0x1b0                                                          
[  206.529836]  [<ffffffff8104737d>] run_ksoftirqd+0x2d/0x60                                                          
[  206.529839]  [<ffffffff81070ff6>] smpboot_thread_fn+0x156/0x1f0                                                    
[  206.529840]  [<ffffffff81070ea0>] ? lg_global_unlock+0xb0/0xb0                                                     
[  206.529843]  [<ffffffff810687d5>] kthread+0xe5/0xf0                                                                
[  206.529845]  [<ffffffff810686f0>] ? kthread_create_on_node+0x140/0x140                                             
[  206.529847]  [<ffffffff8162936c>] ret_from_fork+0x7c/0xb0                                                          
[  206.529849]  [<ffffffff810686f0>] ? kthread_create_on_node+0x140/0x140                                             
[  206.529850] ---[ end trace 86de9cb15e206270 ]---                                                                   
[  208.249590] [drm:drm_helper_probe_single_connector_modes], [CONNECTOR:20:HDMI-A-1] disconnected                    
[  208.323782] [drm:drm_mode_getconnector], [CONNECTOR:22:?]                                                          
[  208.369864] [drm:drm_helper_probe_single_connector_modes], [CONNECTOR:22:DP-2]
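
To check the "always happens around gmbus/dp-aux" observation against a saved log, something like the following could be used. It is only a sketch: the helper name drm_before_corruption is made up, and the two signatures it searches for ("list_del corruption" and "general protection fault") are simply the ones quoted in comments 29 and 30, so other corruption reports would need their own patterns.

```shell
# drm_before_corruption FILE: for every corruption signature in a saved
# dmesg or serial log, print the last [drm:...] debug line logged before it.
drm_before_corruption() {
    awk '
        # Remember the most recent drm debug line.
        /\[drm:/ { last = $0 }
        # On a corruption report, show what the driver was doing just before.
        /list_del corruption|general protection fault/ {
            print "corruption preceded by: " last
        }
    ' "$1"
}
```

Run on the excerpt above, it would point at the drm_do_probe_ddc_edid line that precedes the list_del warning.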

Comment 31 Paulo Zanoni 2013-09-06 20:53:22 UTC

I really think our best bet is to try to bisect the bug.

Does it happen with 3.11.0?
Does it happen with 3.10.0?
Does it happen with 3.10.10?

We should probably try to find some version that works and then bisect from there.

Thanks,
Paulo

Comment 32 Todd Previte 2013-09-06 21:02:44 UTC

Agreed. This was reported 6/7, so I'm going to start with a kernel from that era to see if I can find where this began occurring.

-T

Comment 33 shui yangwei 2013-09-09 01:21:07 UTC

(In reply to comment #32)
> Agreed. This was reported 6/7, so I'm going to start with a kernel from that
> era to see if I can find where this began occurring.
> 
> -T

The issue was actually first reported on 4/16, and we found that it also exists on kernels much earlier than that. You can have a look at the original Bug #63586, where I mentioned this when reporting.

Comment 34 Paulo Zanoni 2013-10-01 19:49:14 UTC

Hi

I did some tests, and it seems that if I disable fbcon, vgacon and their friends I can't reproduce the problem. Can you please confirm that?

Also, my tests show that the problem happens even if we don't start X. Can you also confirm that?

In the meantime, I'll keep testing.

Thanks,
Paulo

Comment 35 Daniel Vetter 2013-10-01 19:52:53 UTC

*** Bug 66301 has been marked as a duplicate of this bug. ***

Comment 36 Paulo Zanoni 2013-10-07 19:14:35 UTC

Hi

I did some more investigation and I discovered the following:

- It seems that, after resuming, if you run "slabinfo -v" (from tools/vm/), there's a good chance you'll see dmesg messages saying we detected corruption on our slabs. It seems to me that it is much much easier to reproduce the bug with "hibernate, resume, run slabinfo -v, check dmesg, hibernate, resume, etc" than with just "hibernate, resume". Can you confirm that?

- It also seems that the bug goes away if the kernel that resumes the machine doesn't load i915.ko. So an experiment you can try is: boot the machine normally, with i915.ko loaded, tell it to hibernate. Then make the machine wake-up, and use the "modprobe.blacklist=i915" option when loading the kernel that will resume the machine. After it resumes, check if the bug is there (possibly with slabinfo -v). The bug should be gone. Can you please confirm that?
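
The "hibernate, resume, run slabinfo -v, check dmesg" sequence from the first point could be checked after each resume with something like the sketch below. Several things here are assumptions: the helper names, the overridable SLABINFO and DMESG variables, and especially the grep pattern, which is only a guess at what a SLUB corruption report looks like in dmesg ("BUG <cache-name> (...)") and should be adjusted to what your kernel actually prints. slabinfo itself is the tool built from tools/vm/ in the kernel tree.

```shell
# Commands can be overridden, e.g. SLABINFO=/path/to/tools/vm/slabinfo.
SLABINFO=${SLABINFO:-slabinfo}
DMESG=${DMESG:-dmesg}

# slab_reports: count dmesg lines that look like SLUB corruption reports.
# The pattern is an assumption, not something taken from this bug report.
slab_reports() {
    $DMESG | grep -c -E 'BUG [^ ]+ \(' || true
}

# check_after_resume: run once after each resume. Succeeds only if
# validating the slabs produced no new corruption report in dmesg.
check_after_resume() {
    before=$(slab_reports)
    $SLABINFO -v >/dev/null
    after=$(slab_reports)
    [ "$after" -eq "$before" ]
}
```

A loop would then alternate one hibernate/resume cycle with "check_after_resume || break", so the run stops at the first cycle that leaves corrupted slabs behind.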

Thanks,
Paulo

Comment 37 Paulo Zanoni 2013-10-08 14:05:44 UTC

Hi

Can you please try the patches from comments 22 and 23 from bug https://bugzilla.kernel.org/show_bug.cgi?id=59321 ?

Thanks,
Paulo

Comment 38 shui yangwei 2013-10-09 02:37:10 UTC

(In reply to comment #37)
> Hi
> 
> Can you please try the patches from comments 22 and 23 form bug
> https://bugzilla.kernel.org/show_bug.cgi?id=59321 ?
> 
> Thanks,
> Paulo

I applied these two patches on the latest -next-queued; the result is a call trace and hang on the first S4 cycle.

Comment 39 shui yangwei 2013-10-09 03:29:26 UTC

(In reply to comment #37)
> Hi
> 
> Can you please try the patches from comments 22 and 23 form bug
> https://bugzilla.kernel.org/show_bug.cgi?id=59321 ?
> 
> Thanks,
> Paulo

Addition:
------------------
With the patch from comment 22 alone, or with both 22 and 23, my HSW desktop fails to complete the S4 suspend: the indicator shows 0004 and the fan does not stop. I also tried the latest -next-queued without the patches: it can resume, but with a call trace and hang on the first cycle.

Comment 40 Paulo Zanoni 2013-10-09 12:50:37 UTC

CCing Ben since he wrote the patches.

Comment 41 Ben Widawsky 2013-10-09 18:10:26 UTC

(In reply to comment #39)
> (In reply to comment #37)
> > Hi
> > 
> > Can you please try the patches from comments 22 and 23 form bug
> > https://bugzilla.kernel.org/show_bug.cgi?id=59321 ?
> > 
> > Thanks,
> > Paulo
> 
> Addition: 
> ------------------
> No matter patches from comment 22 only or with 23, I find my HSW Desktop
> failed to suspend from S4, I saw indicator light output is 0004, and the fan
> isn't stop. I also tried the latest -next-queued without patches, it can
> resume but with call trace and hang at first round.

Can you please push the branch you tested somewhere so I can confirm the patches are indeed correct.

Also, can you collect the error state?

Comment 42 shui yangwei 2013-10-10 05:23:02 UTC

Created attachment 87366 [details]
netconsole  grab information

(In reply to comment #41)
> (In reply to comment #39)
> > (In reply to comment #37)
> > > Hi
> > > 
> > > Can you please try the patches from comments 22 and 23 form bug
> > > https://bugzilla.kernel.org/show_bug.cgi?id=59321 ?
> > > 
> > > Thanks,
> > > Paulo
> > 
> > Addition: 
> > ------------------
> > No matter patches from comment 22 only or with 23, I find my HSW Desktop
> > failed to suspend from S4, I saw indicator light output is 0004, and the fan
> > isn't stop. I also tried the latest -next-queued without patches, it can
> > resume but with call trace and hang at first round.
> 
> Can you please push the branch you tested somewhere so I can confirm the
> patches are indeed correct.
> 

All my tests are based on the latest -next-queued, with the patches applied on top of it.

The commit used yesterday:
--------------------
commit a94b013b91de055572183c6772865123fa955027
Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
Date:   Thu Sep 19 17:03:06 2013 -0300

    drm/i915: wait for IPS_ENABLE when enabling IPS

    At the end of haswell_crtc_enable we have an intel_wait_for_vblank
    with a big comment, and the message suggests it's a workaround for
    something we don't really understand. So I removed that wait and
    started getting HW state readout error messages saying that the IPS
    state is not what we expected.



> Also, can you collect the error state?

OK, I captured the errors through netconsole, and there is a call trace. The full messages are in the attachment.

[   65.311787] [ BUG: systemd-udevd/2856 still has locks held! ]
[   65.311810] 3.12.0-rc3_drm-intel-next-queued_a94b01_20131009+ #1 Not tainted
[   65.311836] -------------------------------------
[   65.311854] 2 locks held by systemd-udevd/2856:
[   65.311872] Freezing user space processes ...
[   65.311872]  #0:  (microcode_mutex){+.+.+.}, at: [<ffffffffa039b0a7>] microcode_init+0xa7/0x1b4 [microcode]
[   65.311945]  #1:  (subsys mutex#4){+.+.+.}, at: [<ffffffff81415ee2>] subsys_interface_register+0x51/0xd9
[   65.311992]
[   65.311992] stack backtrace:
[   65.312010] CPU: 1 PID: 2856 Comm: systemd-udevd Not tainted 3.12.0-rc3_drm-intel-next-queued_a94b01_20131009+ #1
[   65.312048] Hardware name: Intel Corporation Shark Bay Client platform/SthiPpvRsvd2, BIOS HSWLPTU1.86C.0120.R00.1303312001 03/31/2013
[   65.312092]  ffff880438dbdee0 ffff88003731da78 ffffffff817f313c 0000000000000006
[   65.312126]  ffff880438dbdee0 ffff88003731da98 ffffffff8108df33 0000000000000004
[   65.312160]  0000000000000000 ffff88003731db08 ffffffff8104df35 ffff88003731dad8
[   65.312194] Call Trace:
[   65.312208]  [<ffffffff817f313c>] dump_stack+0x46/0x58
[   65.312230]  [<ffffffff8108df33>] debug_check_no_locks_held+0x8f/0x93
[   65.312255]  [<ffffffff8104df35>] usermodehelper_read_trylock+0xa9/0xfa
[   65.312282]  [<ffffffff810581e5>] ? __init_waitqueue_head+0x50/0x50
[   65.312308]  [<ffffffff814211bc>] _request_firmware+0x285/0x880
[   65.312331]  [<ffffffff81421847>] request_firmware+0x38/0x4c

Comment 43 Ben Widawsky 2013-10-10 19:58:15 UTC

(In reply to comment #42)
> Created attachment 87366 [details]
> netconsole  grab information
> 
> (In reply to comment #41)
> > (In reply to comment #39)
> > > (In reply to comment #37)
> > > > Hi
> > > > 
> > > > Can you please try the patches from comments 22 and 23 form bug
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=59321 ?
> > > > 
> > > > Thanks,
> > > > Paulo
> > > 
> > > Addition: 
> > > ------------------
> > > No matter patches from comment 22 only or with 23, I find my HSW Desktop
> > > failed to suspend from S4, I saw indicator light output is 0004, and the fan
> > > isn't stop. I also tried the latest -next-queued without patches, it can
> > > resume but with call trace and hang at first round.
> > 
> > Can you please push the branch you tested somewhere so I can confirm the
> > patches are indeed correct.
> > 
> 
> Oh, all my tests are based on -next-queued latest and the patches also
> applied on it.
> 
> The commit be used yesterday:
> --------------------
> commit a94b013b91de055572183c6772865123fa955027
> Author: Paulo Zanoni <paulo.r.zanoni@intel.com>
> Date:   Thu Sep 19 17:03:06 2013 -0300
> 
>     drm/i915: wait for IPS_ENABLE when enabling IPS
> 
>     At the end of haswell_crtc_enable we have an intel_wait_for_vblank
>     with a big comment, and the message suggests it's a workaround for
>     something we don't really understand. So I removed that wait and
>     started getting HW state readout error messages saying that the IPS
>     state is not what we expected.
> 
> 
> 

Can you please get me a link to the code which you've tested so I can make sure it was applied correctly. This patch fixes the issue for others, so it's surprising it doesn't fix it for you.

Comment 44 shui yangwei 2013-10-11 01:09:53 UTC

[remote "origin"]
        fetch = +refs/heads/*:refs/remotes/origin/*
        url = git://people.freedesktop.org/~danvet/drm-intel
[branch "drm-intel-next-queued"]
        remote = origin
        merge = refs/heads/drm-intel-next-queued

Comment 45 Ben Widawsky 2013-10-11 17:26:44 UTC

(In reply to comment #44)
> [remote "origin"]
>         fetch = +refs/heads/*:refs/remotes/origin/*
>         url = git://people.freedesktop.org/~danvet/drm-intel
> [branch "drm-intel-next-queued"]
>         remote = origin
>         merge = refs/heads/drm-intel-next-queued

I want to see the patched code. Can you please push it somewhere so I can see it?

Comment 46 shui yangwei 2013-10-12 01:42:59 UTC

Created attachment 87496 [details]
patch: Ben's patch in outside bugzilla comment 22

(In reply to comment #45)
> (In reply to comment #44)
> > [remote "origin"]
> >         fetch = +refs/heads/*:refs/remotes/origin/*
> >         url = git://people.freedesktop.org/~danvet/drm-intel
> > [branch "drm-intel-next-queued"]
> >         remote = origin
> >         merge = refs/heads/drm-intel-next-queued
> 
> I want to see the patched code. Can you please push it somewhere so I can
> see it?

I'm a little puzzled here: I just applied the patch you gave on top of Daniel's tree (the current -next-queued commit). The patch I used is attached. I would appreciate a more detailed description of what you need. Sorry about the confusion. :)

Comment 47 Guang Yang 2013-10-13 13:58:33 UTC

(In reply to comment #45)
> (In reply to comment #44)
> > [remote "origin"]
> >         fetch = +refs/heads/*:refs/remotes/origin/*
> >         url = git://people.freedesktop.org/~danvet/drm-intel
> > [branch "drm-intel-next-queued"]
> >         remote = origin
> >         merge = refs/heads/drm-intel-next-queued
> 
> I want to see the patched code. Can you please push it somewhere so I can
> see it?

Ben, these are the two patches you attached in comments 22 and 23 of https://bugzilla.kernel.org/show_bug.cgi?id=59321:
https://bugzilla.kernel.org/attachment.cgi?id=110401 and https://bugzilla.kernel.org/attachment.cgi?id=110411

Comment 48 Guang Yang 2013-10-14 09:31:50 UTC

(In reply to comment #45)
> (In reply to comment #44)
> > [remote "origin"]
> >         fetch = +refs/heads/*:refs/remotes/origin/*
> >         url = git://people.freedesktop.org/~danvet/drm-intel
> > [branch "drm-intel-next-queued"]
> >         remote = origin
> >         merge = refs/heads/drm-intel-next-queued
> 
> I want to see the patched code. Can you please push it somewhere so I can
> see it?

You can visit http://tinderbox.sh.intel.com/drivers to access the drivers directory of the patched kernel source. We just applied your two patches on drm-intel-next-queued (540b5d02766863c561afe9f9d56ce0425022a731). With these patches the issue still occurs.

Comment 49 Ben Widawsky 2013-10-21 22:12:14 UTC

Please retest on latest nightly.

Comment 50 shui yangwei 2013-10-22 10:05:10 UTC

Created attachment 87976 [details]
Picture: S4 resume hang and screen out put  call trace

(In reply to comment #49)
> Please retest on latest nightly.

This issue still exists on one HSW desktop, and it appears to be specific to that machine: S4 resume hangs 100% of the time. Another HSW desktop resumes correctly; reliability tests on it are in progress. Other platforms such as ULT and mobile behave differently: ULT is good, and on mobile S4 ends in a reboot.


bad desktop machine:
----------------
00:02.0 VGA compatible controller [0300]: Intel Corporation Haswell Integrated Graphics Controller [8086:0412] (rev 06)
CPU i5-4570 3.2GHz, GT2 1150MHz; BIOS version: V120;


good desktop machine:
----------------
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:0412] (rev 06)
CPU i5-4670T 2.3GHz, GT2 1200MHz; BIOS version: V131.3(FC_Production_Q87_5MB_NXP_BIOS-131.3_ME-9.0.20.1447.v2)

Comment 51 Ben Widawsky 2013-10-26 01:20:39 UTC

Created attachment 88135 [details] [review]
Fault PDEs too

Please test this patch on the latest nightly.

Comment 52 shui yangwei 2013-10-28 02:10:45 UTC

(In reply to comment #51)
> Created attachment 88135 [details] [review] [review]
> Fault PDEs too
> 
> Please test this patch on the latest nightly.

I tested this patch on the latest -nightly, and the machine hangs during the suspend phase. After triggering S4, the indicator turns to "00FF" and the machine does not suspend; I can still ssh into it. About 30 seconds later the indicator turns to "0004", the screen goes black, and the machine becomes unreachable.


Without the patch, the latest -nightly kernel still shows the issue described in comment 50.

Netconsole capture:
---------------------
[  202.605258] netpoll: netconsole: local IP 10.239.47.103
[  202.605340] console [netcon0] enabled
[  202.605355] netconsole: network logging started
[  219.684843] PM: Syncing filesystems ... done.
[  245.422619] Freezing user space processes ... [  245.422763] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
(elapsed 0.001 seconds) done.
[  245.424184] PM: Marking nosave pages: [mem 0x0009a000-0x000fffff]
[  245.424212] PM: Marking nosave pages: [mem 0x95963000-0x95963fff]
[  245.424235] PM: Marking nosave pages: [mem 0x9f904000-0x9f907fff]
[  245.424258] PM: Marking nosave pages: [mem 0x9fa9d000-0x9fa9efff]
[  245.424281] PM: Marking nosave pages: [mem 0x9fcbb000-0x9fcbbfff]
[  245.424305] PM: Marking nosave pages: [mem 0x9fcbe000-0x9fcbefff]
[  245.424327] PM: Marking nosave pages: [mem 0x9fcc5000-0x9fcc5fff]
[  245.424350] PM: Marking nosave pages: [mem 0x9fd17000-0x9fd19fff]
[  245.424373] PM: Marking nosave pages: [mem 0xa2807000-0xa2fe9fff]
[  245.424416] PM: Marking nosave pages: [mem 0xa3000000-0xffffffff]
[  245.425157] PM: Basic memory bitmaps created
[  245.425544] PM: Preallocating image memory... done (allocated 162884 pages)
[  245.700877] PM: Allocated 651536 kbytes in 0.27 seconds (2413.09 MB/s)
[  245.700903] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  245.702972] Suspending console(s) (use no_console_suspend to debug)

Comment 53 Ben Widawsky 2013-10-28 05:48:20 UTC

Could you clarify? Are you saying the behavior with this patch and without it (on nightly) is identical?

Comment 54 shui yangwei 2013-10-28 06:03:54 UTC

(In reply to comment #53)
> Could you clarify. Are you saying the behavior with this patch, and without
> (on nightly) is identical?

Sorry, I didn't describe it clearly.


with patch:
----------------
After triggering S4, the indicator shows "00FF" and the machine does not suspend; I can still ssh into it. About 30 seconds later, the indicator changes to "0004", the screen goes black, and the machine becomes unreachable. The machine hangs there.


without patch:
----------------
The machine resumes, but there is a call trace and a hang, as described in comment 50.

Comment 55 Ben Widawsky 2013-10-29 03:12:33 UTC

(In reply to comment #54)
> (In reply to comment #53)
> > Could you clarify. Are you saying the behavior with this patch, and without
> > (on nightly) is identical?
> 
> Sorry, I haven't described clearly.
> 
> 
> with patch:
> ----------------
> After execute S4, I find the indicator turn to "00FF" and machine is not
> suspend,I can ssh on the machine,then about 30sec later, the indicator turn
> to "0004" and screen turn to black, machine will be unconnected. machine
> hangs there.
> 
> 
> without patch:
> ----------------
> Machine can resume, but there's call trace and hang. Just like comment 50
> described.

Does anything appear in dmesg during the 30 second period? Also, can you try to get an exact time (instead of "about 30sec"), and see if it's repeatably the same time every test? This may give some clues.

Thanks.
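
For the exact timing, here is a minimal sketch. It assumes the kernel log was saved as text with the usual `[seconds.microseconds]` printk timestamps, as in the netconsole capture in comment 52. The helper name `largest_gap` is hypothetical and is not part of any tool mentioned in this bug. It reports the longest silent interval between two consecutive timestamped lines, which can then be compared across test runs.

```python
import re

# Matches the leading printk timestamp, e.g. "[  245.422619] ..."
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def largest_gap(lines):
    """Return (gap_seconds, earlier_line, later_line) for the largest
    gap between consecutive timestamped lines, or None if fewer than
    two timestamped lines are present."""
    stamped = []
    for line in lines:
        m = TS_RE.match(line)
        if m:
            stamped.append((float(m.group(1)), line.rstrip()))
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best


if __name__ == "__main__":
    import sys
    # Usage (assumed file name): python gap.py netconsole.log
    with open(sys.argv[1]) as f:
        result = largest_gap(f)
    if result:
        gap, before, after = result
        print(f"{gap:.3f}s between:\n  {before}\n  {after}")
```

Lines without a leading timestamp (such as the continuation line "(elapsed 0.001 seconds) done.") are skipped, so interleaved output does not break the calculation.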

Comment 56 shui yangwei 2013-10-29 05:35:19 UTC

The BIOS was updated from "HSWLPTU1.86C.0120.R00.1303312001" to "HSWPTU1.86C.0134.R00.1310022130". On the latest nightly, S4 now resumes correctly on the problematic HSW desktop. Stress testing is under way. I will test the kernel without the patch first and update the status later.

Comment 57 shui yangwei 2013-10-29 08:09:30 UTC

I ran S4 in a loop 100 times, and all cycles passed. Maybe this issue really has been fixed. Or do you think we need to run reliability tests for a longer period? If S4 keeps working stably, we will ask you to close this bug.
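
A loop test of this kind could be sketched as follows. This is an illustration under stated assumptions, not the script actually used for the 100-cycle run: it assumes util-linux `rtcwake` and `dmesg` are available and that it runs as root, and the function name `run_s4_cycles` and its parameters are hypothetical. The `run` argument exists only so the loop can be exercised without hibernating a real machine.

```python
import subprocess


def run_s4_cycles(cycles, run=subprocess.run, wake_after=60):
    """Hibernate (S4) `cycles` times and return the list of cycle
    numbers after which the kernel log contained a call trace."""
    failures = []
    # Start from an empty ring buffer so old traces are not counted.
    run(["dmesg", "--clear"], check=True)
    for i in range(1, cycles + 1):
        # Suspend to disk; the RTC alarm wakes the machine after
        # `wake_after` seconds.
        run(["rtcwake", "-m", "disk", "-s", str(wake_after)], check=True)
        # Read and clear the log so each cycle is checked on its own.
        log = run(["dmesg", "--read-clear"], check=True,
                  capture_output=True, text=True).stdout
        if "Call Trace" in log:
            failures.append(i)
    return failures


if __name__ == "__main__":
    bad = run_s4_cycles(100)
    print("cycles with call trace:", bad if bad else "none")
```

If the machine hangs hard on resume, the loop simply never returns, so a hang still has to be observed externally (for example over netconsole or a serial port).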

Comment 58 Ben Widawsky 2013-10-29 15:47:15 UTC

Papered over.

Comment 59 Paulo Zanoni 2014-04-29 13:10:23 UTC

Reopening, since we applied "[PATCH] drm/i915: Undo gtt scratch pte unmapping again".

Comment 60 Paulo Zanoni 2014-04-29 13:11:03 UTC

*** Bug 78056 has been marked as a duplicate of this bug. ***

Comment 61 Guang Yang 2014-05-17 01:20:50 UTC

Paulo, any update on this issue?

Comment 62 Daniel Vetter 2014-05-19 09:01:05 UTC

This is the hsw pte sanitizing thing which regressed on earlier platforms. My proposal is to remap to the stolen range (if we can) on all platforms.

Comment 63 Guang Yang 2014-05-20 06:15:18 UTC

This needs a new round of retesting. Hua & Jinxian, please give feedback.

Comment 64 liulei 2014-06-12 05:33:47 UTC

On resume from S4, a system hang can still occur sporadically. I did not find a call trace like the one in this bug report's description.

Comment 65 liulei 2014-06-12 05:36:25 UTC

I tested on the drm-intel-testing branch.

Comment 66 Daniel Vetter 2014-06-18 10:08:23 UTC

This is a regression from a regression revert, so upgrading.

Comment 67 liulei 2014-07-22 02:39:31 UTC

This issue still exists on the latest -nightly.

Comment 68 liulei 2014-08-19 07:53:40 UTC

Update: S4 still causes a system hang, but I did not find a call trace. This matches what Bug 82340 describes.

Comment 69 liulei 2014-09-16 08:32:47 UTC

Update: this issue exists on the release kernel (62de88e8e65811010deac5375f8f0d8b14dc4d94).

Comment 70 Imre Deak 2014-09-16 08:58:11 UTC

The feedback I'm waiting for in bug 82340 could clarify things here, so setting that as a dependency (also based on comment 68).

Comment 71 Gordon Jin 2014-09-23 08:36:01 UTC

Lei, please check with Ming to see if https://bugs.freedesktop.org/show_bug.cgi?id=82340#c10 (system hang gone) applies here.

Comment 72 yaoming 2014-09-25 03:02:46 UTC

(In reply to comment #71)
> Lei, please check with Ming to see if
> https://bugs.freedesktop.org/show_bug.cgi?id=82340#c10 (system hang gone)
> applies here.
 
Actually, I used the same machine to test these two bugs. No call trace has been found since bug 82340 appeared, so I think bug 65496 can no longer be reproduced.

Comment 73 Gordon Jin 2014-09-25 07:57:41 UTC

Closing this one. Moving discussion to bug 82340.

Comment 74 Jari Tahvanainen 2016-10-19 12:39:55 UTC

Closing verified+fixed.
