Bug 63586

Summary: [HSW ULT desktop mobile bisected] Resume from s4 causes call trace and system hang, with cold boot
Product: DRI
Reporter: cancan,feng <cancan.feng>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: high
CC: cancan.feng, przanoni, tiwai, yangweix.shui
Version: unspecified
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                                  Flags
dmesg by netconsole                          none
Call Trace info                              none
S4 resume with call trace and system hang    none
information of coolboot & S4                 none
coolboot & S4 info of 88adff                 none
coolboot & S4 screenshot of 88adff           none

Description cancan,feng 2013-04-16 06:08:05 UTC
Created attachment 78055 [details]
dmesg by netconsole

Environment:
-------------------
Kernel: (drm-intel-next-queued)d41ca032afdb4e12a9782df523c8798cd42aaaa3
Some additional commit info:
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Apr 11 19:49:07 2013 +0200

    drm/i915: move debug output back to the right place

Description:
--------------------
When I do S4 on a Haswell desktop, the machine suspends successfully, but the resume produces a call trace and the system hangs. The dmesg output is attached. I couldn't capture the call trace as text, so I took a picture and attached that as well.
If I remove i915 and then do S4, the machine suspends successfully but immediately auto-resumes from S4, without a call trace or system hang. By the way, the auto-resume problem always exists.

Steps to reproduce:
----------------------
1. boot or reboot
2. echo disk > /sys/power/state
3. resume
Comment 1 cancan,feng 2013-04-16 06:16:28 UTC
Created attachment 78056 [details]
Call Trace info
Comment 2 Chris Wilson 2013-04-16 07:38:48 UTC
Doesn't look like an i915.ko bug. Can you reproduce without loading i915.ko?
Comment 3 cancan,feng 2013-04-17 07:36:30 UTC
(In reply to comment #2)
> Doesn't look like an i915.ko bug. Can you reproduce without loading i915.ko?

You are right! But the issue is not stable: I can't reproduce it every time, whether or not i915.ko is loaded...
Comment 4 Daniel Vetter 2013-04-17 15:52:20 UTC
Is the machine otherwise solid, i.e. have you run a memory tester on it? This smells a bit fishy ...
Comment 5 Daniel Vetter 2013-04-17 16:21:55 UTC
Also, can you try to bisect this? You probably need to do a few suspend/resume cycles to make sure any given kernel really works ...
Comment 6 cancan,feng 2013-04-18 00:46:28 UTC
(In reply to comment #4)
> Is the machine otherwise solid, i.e. have you run a memory tester on it?
> This smells a bit fishy ...

By memory tester, do you mean suspend to memory or something else? If so, the machine otherwise seems solid...
Comment 7 cancan,feng 2013-04-18 00:58:32 UTC
(In reply to comment #5)
> Also, can you try to bisect this? You probably need to do a few
> suspend/resume cycles to make sure any given kernel really works ...

I ran ten suspend/resume cycles with the latest kernel: 5 resumes were OK, and the other 5 caused a call trace and system hang.
I also tried to bisect this but, as you know, the issue is not stable, so I couldn't find the first bad commit.

By the way, this issue happens whether or not i915.ko is loaded, i.e. with i915.ko removed it sometimes happens and sometimes doesn't, so maybe this is not our bug?
Comment 8 Daniel Vetter 2013-04-18 09:55:06 UTC
If you blacklist i915.ko and it still happens it's certainly not our bug, but we still need to figure out what's wrong to unblock testing.

Bisecting should still be possible: As soon as you hit the backtrace you know that a given kernel version is broken. To make sure a kernel is really good just run the suspend/resume test in a loop for e.g. 100 repeats or so. Of course you first need to make sure that you have a solid baseline.

It'll take some time, but if you script the suspend loop it shouldn't take more work than a normal bisect.
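
The scripted suspend loop described above could look roughly like the following Python sketch. It assumes that rtcwake from util-linux is available to schedule the wake-up, that the script runs as root, and that a new "Call Trace" entry in dmesg is an adequate failure signal. The function names are hypothetical, and a hard hang would still need a manual reset or an external watchdog.

```python
import subprocess


def s4_cycle(wake_after_s=60):
    """Hibernate once and let the RTC alarm wake the machine (needs root)."""
    # rtcwake ships with util-linux; "-m disk" requests suspend-to-disk (S4).
    subprocess.run(["rtcwake", "-m", "disk", "-s", str(wake_after_s)], check=True)


def kernel_log():
    """Return the current kernel ring buffer as text."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def kernel_looks_good(cycles=100, cycle=s4_cycle, read_log=kernel_log,
                      marker="Call Trace"):
    """Run `cycles` S4 cycles; return False as soon as a new marker shows up."""
    # Count markers already present so that old traces are not blamed
    # on the kernel under test.
    baseline = read_log().count(marker)
    for _ in range(cycles):
        cycle()
        if read_log().count(marker) > baseline:
            return False
    return True
```

For use with git bisect run, a small wrapper could exit with status 0 when kernel_looks_good(100) returns True and with status 1 otherwise.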
Comment 9 shui yangwei 2013-04-19 05:55:21 UTC
Created attachment 78208 [details]
S4 resume with call trace and system hang

I ran the S4 test in a loop as you suggested; on the 95th iteration, the system resumed with a call trace and hung. Another finding: if the kernel's initrd is built in, the issue is easier to reproduce. I attached a picture of the call trace.
Comment 10 Daniel Vetter 2013-04-19 07:09:40 UTC
So just to check: This failure after 95 resume cycles was on a kernel previously considered good? I.e. do we need to assume that this bug has always been there and this is not a regression, just some timing changes made it more likely?
Comment 11 shui yangwei 2013-04-22 02:51:03 UTC
(In reply to comment #10)
> So just to check: This failure after 95 resume cycles was on a kernel
> previously considered good? I.e. do we need to assume that this bug has
> always been there and this is not a regression, just some timing changes
> made it more likely?

No, it's not a regression: it takes too many S4 cycles to reproduce this issue, so perhaps we hadn't tested enough before. We now use kexec instead of an rpm package to install the kernel (the kexec kernel has the initrd built in), which makes the issue easier to reproduce.
Comment 12 cancan,feng 2013-04-22 08:22:10 UTC
Hi Daniel, I have a new clue: this issue seems to be related to how the machine is started. If I restart the machine with the power button (sometimes I have to because the machine hangs) and then do S4, the machine hangs with a call trace. But if I restart the machine with the "reboot" command instead of the power button and then do S4, the machine resumes successfully! I tried this many times, with the same result every time.
In addition, I tested the 3.8 release kernel, and the issue exists there too. I also tested without loading i915.ko, and the issue still happens.
To sum up, whenever the machine is restarted with the power button, we get the call trace and hang! So... what do you think I should do next?
Comment 13 cancan,feng 2013-04-26 06:32:39 UTC
(In reply to comment #10)
> So just to check: This failure after 95 resume cycles was on a kernel
> previously considered good? I.e. do we need to assume that this bug has
> always been there and this is not a regression, just some timing changes
> made it more likely?

Hi, I have double-checked this issue and found that it is a regression, and I have bisected it. Here is the bisect information:

88adfff1ad5019f65b9d0b4e1a4ac900fb065183 is the first bad commit
commit 88adfff1ad5019f65b9d0b4e1a4ac900fb065183
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Mar 28 10:42:01 2013 +0100

    drm/i915: hw readout support for ->has_pch_encoders

    Now we can ditch the checks in the Haswell disable code.

    v2: add support for Haswell

    Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 3680d704a4f13869e0e5094699a9c96dcf7dc45b 78d850e8e904670ad377fc92b75993dfec59933a M      drivers
Comment 14 Daniel Vetter 2013-04-26 06:53:43 UTC
It's strange that the first jpeg of a calltrace is completely different from the 2nd one. Is it always different like that?

Also, the 2nd calltrace is cut off at the beginning. Can you try to capture the top, too? There's a special kernel option to slow down kernel debug output to the console, see the boot_delay option in Documentation/kernel-parameters.txt (I hope it also works in general).
Comment 15 cancan,feng 2013-04-26 08:09:58 UTC
(In reply to comment #14)
> It's strange that the first jpeg of a calltrace is completely different from
> the 2nd one. Is it always different like that?
> 
> Also, the 2nd calltrace is cut off at the beginning. Can you try to capture
> the top, too? There's a special kernel option to slow down kernel debug
> output to the console, see the boot_delay option in
> Documentation/kernel-parameters.txt (I hope it also works in general).

I'm sorry I didn't make it clear. The first and second jpegs are from two different HSW desktop machines, and the issue shows different symptoms on them. I think we should focus on the first jpeg, "Call Trace info", rather than "S4 resume with call trace and system hang", because the symptom on that machine is a little easier to reproduce. The bisect info is from that machine too.
I think the issue is a little complicated, and I'm not sure how to characterize it. At first I thought the system hung after resuming from S4, because the machine couldn't be reached by ssh and didn't respond to the keyboard. But sometimes it slowly flashes messages on the screen... So I captured the output of a cold boot and a resume from S4 over the serial port and attached it. If you need any more information, please feel free to let me know.
Comment 16 cancan,feng 2013-04-26 08:10:52 UTC
Created attachment 78503 [details]
information of coolboot & S4
Comment 17 cancan,feng 2013-04-26 08:45:27 UTC
Hi Daniel, I think this issue is mixed up with other issues. What do you think about first tracking the call trace and the messages flashing on screen? I attached the output of a cold boot followed by S4, captured over the serial port. I also bisected this issue to the first bad commit 88adfff1ad5019f65b9d0b4e1a4ac900fb065183; its parent commit is good. So, what do you think?
Comment 18 Daniel Vetter 2013-04-27 13:05:01 UTC
(In reply to comment #17)
> Hi Daniel, I think this issue is mixed up with other issues. What do you
> think about first tracking the call trace and the messages flashing on
> screen? I attached the output of a cold boot followed by S4, captured over
> the serial port. I also bisected this issue to the first bad commit
> 88adfff1ad5019f65b9d0b4e1a4ac900fb065183; its parent commit is good. So,
> what do you think?

Ok, you've lost me here. A few notes:
- The backtrace in https://bugs.freedesktop.org/attachment.cgi?id=78503 is a userspace issue. No idea how that could blow up, but it looks strange.
- The kernel oops in https://bugs.freedesktop.org/attachment.cgi?id=78056 is in the lzo decompressor. That's ridiculously well-tested code (it's used everywhere) and I'd be surprised if there's really a bug in there. Also, this runs before i915.ko is loaded.

In summary I suspect that something is wrong with this machine here.

Also: Is the bisect result the same for both calltraces? It's possible that 88adfff1ad5019f65b9d0b4e1a4ac900fb065183 broke something, but I'd never have expected issues like this ...
Comment 19 cancan,feng 2013-04-28 02:35:05 UTC
88adfff1ad5019f65b9d0b4e1a4ac900fb065183 is the first bad commit
commit 88adfff1ad5019f65b9d0b4e1a4ac900fb065183
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Mar 28 10:42:01 2013 +0100

    drm/i915: hw readout support for ->has_pch_encoders

    Now we can ditch the checks in the Haswell disable code.

    v2: add support for Haswell

    Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 3680d704a4f13869e0e5094699a9c96dcf7dc45b 78d850e8e904670ad377fc92b75993dfec59933a M      drivers

When I do S4 with a kernel built at 88adfff1ad5019f65b9d0b4e1a4ac900fb065183, the call trace appears and the machine can't be accessed. I also tested its parent, which doesn't show this call trace.
I attached a jpeg of the call trace with 88adfff1ad5019f65b9d0b4e1a4ac900fb065183 and the S4 log captured over the serial port.
Comment 20 cancan,feng 2013-04-28 02:36:21 UTC
Created attachment 78567 [details]
coolboot & S4 info of 88adff
Comment 21 cancan,feng 2013-04-28 02:40:20 UTC
Created attachment 78568 [details]
coolboot & S4 screenshot of 88adff
Comment 22 Gordon Jin 2013-05-06 08:28:51 UTC
This can be reproduced on several QA machines, immediately after a cold boot. Can any developer reproduce it?
Comment 23 Chris Wilson 2013-05-24 13:26:50 UTC
I believe Paulo is the only person with access to a ULT?
Comment 24 Takashi Iwai 2013-06-05 11:01:50 UTC
We see similar S4 breakage on Haswell laptops.
In our case, 3.10-rc4 shows the hang almost immediately after a couple of S4 cycles.  The problem was seen on 3.8 and 3.9.

Also, the problem doesn't seem to be specific to Haswell ULT.  A machine with Haswell Mobile GT2 also hangs.

I already reported this in the kernel bugzilla, and some more information can be found there:
    https://bugzilla.kernel.org/show_bug.cgi?id=59321
Comment 25 shui yangwei 2013-06-06 00:51:51 UTC
(In reply to comment #24)
> We see similar S4 breakage on Haswell laptops.
> In our case, 3.10-rc4 shows the hang almost immediately after a couple of S4
> cycles.  The problem was seen on 3.8 and 3.9.
> 
> Also, the problem doesn't seem to be specific to Haswell ULT.  A machine with
> Haswell Mobile GT2 also hangs.
> 
> I already reported this in the kernel bugzilla, and some more information can
> be found there:
>     https://bugzilla.kernel.org/show_bug.cgi?id=59321

We are finding that it can be seen on all HSW platforms, including ULT, desktop and mobile. We found it on desktop first and filed this bug; after that we added the other two platforms to the title. It may be our fault for misleading you.
Comment 26 Gordon Jin 2013-06-09 03:05:53 UTC
To clarify, there are two issues mixed in this bug report:
1) steadily reproducible with cold boot
2) unsteadily reproducible (maybe 1 out of 100 times) with warm boot

The bisect result only applies to issue 1). Only at that point did we learn that these are two different problems, which should be tracked in separate bug reports.
So this bug remains for issue 1).
bug#65496 was filed for issue 2). I guess Takashi is seeing this one.
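
As a rough illustration of why an intermittent failure like issue 2) is hard to bisect: if each S4 cycle fails independently with probability p (an assumption; real failures may well be correlated), the chance that n cycles all pass is (1 - p)^n. The Python sketch below uses hypothetical helper names.

```python
import math


def pass_probability(p_fail, cycles):
    """Chance that `cycles` independent S4 cycles all succeed."""
    return (1.0 - p_fail) ** cycles


def cycles_needed(p_fail, confidence=0.95):
    """Smallest number of cycles that hits a bug of rate p_fail at least once
    with the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

Under these assumptions, a kernel with a 1-in-100 failure rate would still survive a 100-cycle run more than a third of the time, so a single clean run of that length is weak evidence that a kernel is good.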
Comment 27 shui yangwei 2013-06-09 09:17:05 UTC
We have new findings (2013Q2 RC2, kernel: 3.9.5):
--------------------------------------------
None of our HSW platforms (including ULT, desktop and mobile) show this issue any more with BIOS versions 120, 126 and 128. I should also point out that we updated the KSC as well.
So our conclusion is that this bug might be related to the BIOS version or the KSC. There are too many S4 issues on HSW, and the bisect result was perhaps a coincidence.
This bug is now solved, so the most critical issue is bug #65496 again.
Comment 28 Elizabeth 2017-10-06 14:46:43 UTC
Closing old verified.
