| Field | Value |
|---|---|
| Summary | [HSW ULT desktop mobile bisected] Resume from s4 causes call trace and system hang, with cold boot |
| Product | DRI |
| Component | DRM/Intel |
| Status | CLOSED FIXED |
| Severity | critical |
| Priority | high |
| Version | unspecified |
| Hardware | Other |
| OS | All |
| Reporter | cancan,feng <cancan.feng> |
| Assignee | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| QA Contact | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| CC | cancan.feng, przanoni, tiwai, yangweix.shui |
| Whiteboard | |
| i915 platform | |
| i915 features | |
| Attachments | |
Description
cancan,feng
2013-04-16 06:08:05 UTC
Created attachment 78056 [details]
Call Trace info
Doesn't look like an i915.ko bug. Can you reproduce without loading i915.ko?

(In reply to comment #2)
> Doesn't look like an i915.ko bug. Can you reproduce without loading i915.ko?

You are right! But this issue is not stable: I can't reproduce it every time, whether i915.ko is loaded or not...

Is the machine otherwise solid, i.e. have you run a memory tester on it? This smells a bit fishy ...

Also, can you try to bisect this? You probably need to do a few suspend/resume cycles to make sure any given kernel really works ...

(In reply to comment #4)
> Is the machine otherwise solid, i.e. have you run a memory tester on it?
> This smells a bit fishy ...

By memory tester, do you mean suspend to memory or something else? If so, the machine otherwise seems solid...

(In reply to comment #5)
> Also, can you try to bisect this? You probably need to do a few
> suspend/resume cycles to make sure any given kernel really works ...

I ran ten suspend/resume cycles with the latest kernel: 5 resumes were OK, and the other 5 caused a call trace and system hang. I also tried to bisect this, but since the issue is not stable, I couldn't find the first bad commit. By the way, this issue happens whether or not i915.ko is loaded, i.e. with i915.ko removed, sometimes it happens and sometimes it doesn't, so maybe this is not our bug?

If you blacklist i915.ko and it still happens it's certainly not our bug, but we still need to figure out what's wrong to unblock testing. Bisecting should still be possible: As soon as you hit the backtrace you know that a given kernel version is broken. To make sure a kernel is really good just run the suspend/resume test in a loop for e.g. 100 repeats or so. Of course you first need to make sure that you have a solid baseline. It'll take some time, but if you script the suspend loop it shouldn't take more work than a normal bisect.

Created attachment 78208 [details]
S4 resume with call trace and system hang
I ran the S4 test in a loop as you suggested. After 95 cycles, the system resumed with a call trace and hung. Another finding: if the kernel's initrd is built in, the issue is easier to reproduce. I have attached a picture of the call trace.
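
For anyone who wants to script the suspend loop discussed in this thread, here is a minimal sketch of a predicate that could be driven by `git bisect run`. It is an illustration under assumptions, not the script QA actually used: `s4_cycle_ok` and `classify_kernel` are hypothetical names, the hibernate step assumes `rtcwake -m disk` works on the test machine, and a failure is assumed to show up as "Call Trace" in `dmesg`. A hard hang never returns control to the script, so a real setup would also need an external watchdog or power cycler.

```python
import subprocess
from typing import Callable

# Exit codes understood by `git bisect run`: 0 marks the commit good,
# 1 marks it bad (125 would mean "skip this commit").
BISECT_GOOD = 0
BISECT_BAD = 1


def s4_cycle_ok(wake_after_s: int = 60) -> bool:
    """Hypothetical single hibernate/resume cycle; True if the resume looks clean.

    Assumptions: `rtcwake -m disk` hibernates and wakes the machine via the
    RTC alarm, and the kernel log was free of call traces before the cycle.
    """
    subprocess.run(["rtcwake", "-m", "disk", "-s", str(wake_after_s)], check=True)
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    return "Call Trace" not in log


def classify_kernel(cycle_ok: Callable[[], bool], repeats: int = 100) -> int:
    """Return BISECT_BAD on the first failed cycle, BISECT_GOOD if all pass."""
    for _ in range(repeats):
        if not cycle_ok():
            return BISECT_BAD
    return BISECT_GOOD
```

A small wrapper script would end with `sys.exit(classify_kernel(s4_cycle_ok))` and be passed to `git bisect run` after each test kernel is booted. Stopping at the first failure keeps bad kernels cheap to classify; only good kernels pay for the full loop.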
So just to check: This failure after 95 resume cycles was on a kernel previously considered good? I.e. do we need to assume that this bug has always been there and this is not a regression, just some timing changes made it more likely?

(In reply to comment #10)
> So just to check: This failure after 95 resume cycles was on a kernel
> previously considered good? I.e. do we need to assume that this bug has
> always been there and this is not a regression, just some timing changes
> made it more likely?

No, it's not a regression. It takes too many S4 cycles to reproduce this issue, so perhaps we didn't test enough before. We now use kexec instead of an rpm package to install the kernel (the kexec kernel has a built-in initrd), which makes the issue easier to reproduce.

Hi Daniel, I have a new clue: this issue seems to be related to how the machine is started. If I reboot the machine with the power button (sometimes I have to because the machine hangs) and then do S4, the machine hangs with a call trace. But if I reboot with the "reboot" command instead of the power button and then do S4, the machine resumes successfully! I tried many times, with the same result every time. In addition, I tested the 3.8 release kernel, and the issue exists there too. I also tested without loading i915.ko, and the issue still happens. To sum up, as long as the machine is restarted with the power button, it hits the call trace and hangs! So, what do you think I should do next?

(In reply to comment #10)
> So just to check: This failure after 95 resume cycles was on a kernel
> previously considered good? I.e. do we need to assume that this bug has
> always been there and this is not a regression, just some timing changes
> made it more likely?

Hi, I have double-checked this issue and found that it is a regression, and I have bisected it.
Here is the bisect information:

88adfff1ad5019f65b9d0b4e1a4ac900fb065183 is the first bad commit
commit 88adfff1ad5019f65b9d0b4e1a4ac900fb065183
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Thu Mar 28 10:42:01 2013 +0100

    drm/i915: hw readout support for ->has_pch_encoders

    Now we can ditch the checks in the Haswell disable code.

    v2: add support for Haswell

    Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 3680d704a4f13869e0e5094699a9c96dcf7dc45b 78d850e8e904670ad377fc92b75993dfec59933a M drivers

It's strange that the first jpeg of a calltrace is completely different from the 2nd one. Is it always different like that?

Also, the 2nd calltrace is cut off at the beginning. Can you try to capture the top, too? There's a special kernel option to slow down kernel debug output to the console, see the boot_delay option in Documentation/kernel-parameters.txt (I hope it also works in general).

(In reply to comment #14)
> It's strange that the first jpeg of a calltrace is completely different from
> the 2nd one. Is it always different like that?
>
> Also, the 2nd calltrace is cut off at the beginning. Can you try to capture
> the top, too? There's a special kernel option to slow down kernel debug
> output to the console, see the boot_delay option in
> Documentation/kernel-parameters.txt (I hope it also works in general).

I'm sorry I didn't make it clear. The first and 2nd jpegs are from two different HSW desktop machines, and the issue shows different symptoms on them. I think we could focus on the first jpeg, "Call Trace info", instead of "S4 resume with call trace and system hang", because the symptom on that machine is a little easier to reproduce. The bisect info is from that machine too.

I think the issue is a little complicated, and I'm not sure how to define it. At first, I thought the system hung after resume from S4, because the machine couldn't be reached over ssh and didn't respond to the keyboard. But sometimes it slowly flashes messages on the screen... So I captured the system cold boot and the resume from S4 over the serial port and attached the log. If you need any more information, please feel free to let me know.

Created attachment 78503 [details]
information of coolboot & S4
Hi Daniel, I think this issue is mixed up with other issues. What do you think about first tracking this issue of the call trace and the messages flashing on screen? I attached the information from a system cold boot followed by S4, which I captured over the serial port. I also bisected this issue to the first bad commit 88adfff1ad5019f65b9d0b4e1a4ac900fb065183; its parent commit is good. So, what do you think?

(In reply to comment #17)
> Hi Daniel, I think this issue is mixed up with other issues. What do you think
> about first tracking this issue of the call trace and the messages flashing on
> screen? I attached the information from a system cold boot followed by S4,
> which I captured over the serial port. I also bisected this issue to the first
> bad commit 88adfff1ad5019f65b9d0b4e1a4ac900fb065183; its parent
> commit is good. So, what do you think?

Ok, you've lost me here. A few notes:

- The backtrace in https://bugs.freedesktop.org/attachment.cgi?id=78503 is a userspace issue. No idea how that could blow up, but it looks strange.
- The kernel oops in https://bugs.freedesktop.org/attachment.cgi?id=78056 is in the lzo decompressor. That's ridiculously well-tested code (it's used everywhere) and I'd be surprised if there's really a bug in there. Also, this runs before i915.ko is loaded.

In summary I suspect that something is wrong with this machine here.

Also: Is the bisect result the same for both calltraces? It's possible that 88adfff1ad5019f65b9d0b4e1a4ac900fb065183 broke something, but I'd never have expected issues like this ...

88adfff1ad5019f65b9d0b4e1a4ac900fb065183 is the first bad commit
commit 88adfff1ad5019f65b9d0b4e1a4ac900fb065183
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Thu Mar 28 10:42:01 2013 +0100

    drm/i915: hw readout support for ->has_pch_encoders

    Now we can ditch the checks in the Haswell disable code.

    v2: add support for Haswell

    Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 3680d704a4f13869e0e5094699a9c96dcf7dc45b 78d850e8e904670ad377fc92b75993dfec59933a M drivers

When I do S4 with a kernel built at 88adfff1ad5019f65b9d0b4e1a4ac900fb065183, the call trace appears and the machine can't be accessed. I also tested its parent, which doesn't show this call trace. I attached a jpeg of the call trace at 88adfff1ad5019f65b9d0b4e1a4ac900fb065183 and the S4 information captured over the serial port.

Created attachment 78567 [details]
coolboot & S4 info of 88adff
Created attachment 78568 [details]
coolboot & S4 screenshot of 88adff
This can be reproduced on several QA machines, immediately after a cold boot. Can any developer reproduce it?

I believe Paulo is the only person with access to a ULT?

We see similar S4 breakage on Haswell laptops. In our case, 3.10-rc4 shows the hang almost immediately after a couple of S4 cycles. The problem was seen on 3.8 and 3.9.

Also, the problem doesn't seem to be specific to Haswell ULT. A machine with Haswell Mobile GT2 also hangs.

I already reported it in the kernel bugzilla, and some more information can be found there: https://bugzilla.kernel.org/show_bug.cgi?id=59321

(In reply to comment #24)
> We see similar S4 breakage on Haswell laptops.
> In our case, 3.10-rc4 shows the hang almost immediately after a couple of S4
> cycles. The problem was seen on 3.8 and 3.9.
>
> Also, the problem doesn't seem to be specific to Haswell ULT. A machine with
> Haswell Mobile GT2 also hangs.
>
> I already reported it in the kernel bugzilla, and some more information can be
> found there:
> https://bugzilla.kernel.org/show_bug.cgi?id=59321

We are finding that it shows up on all HSW platforms, including ULT, desktop and mobile. We found it on desktop first and filed this bug for that; afterwards we added the other two platforms to the title. It may be our fault that this misled you.

To clarify, there are two issues mixed in this bug report:
1) steadily reproducible with cold boot
2) unsteadily reproducible (maybe 1 out of 100 times) with warm boot

The bisect result only applies to issue 1). It was only then that we realized these are 2 different problems that should be tracked in separate bug reports. So this bug remains for issue 1). bug#65496 was filed for issue 2). I guess Takashi is seeing this one.

We have new findings (2013Q2 RC2, kernel: 3.9.5):
--------------------------------------------
None of our HSW platforms (including ULT, desktop and mobile) shows this issue any more with BIOS versions 120, 126 and 128. I should also point out that we updated the KSC as well.

Our conclusion: this bug may be related to the BIOS version or the KSC. There are too many S4 issues on HSW, so the bisect result was perhaps a coincidence. This bug is now solved, and the most critical issue is bug #65496 again.

Closing old verified.
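
As a footnote on why bisecting an intermittent failure is fragile: the thread mentions a failure seen about 5 times in 10 in one setup and maybe 1 in 100 in another. The sketch below estimates how many clean cycles are needed before a kernel can reasonably be called good. It assumes every cycle fails independently with the same probability, which this bug (cold boot versus warm boot) suggests is not strictly true, and the function names are made up for illustration.

```python
import math


def miss_probability(p_fail: float, cycles: int) -> float:
    """Chance that a broken kernel survives `cycles` cycles in a row.

    Assumes each cycle fails independently with probability `p_fail`.
    """
    return (1.0 - p_fail) ** cycles


def cycles_needed(p_fail: float, confidence: float = 0.95) -> int:
    """Smallest cycle count that exposes a broken kernel with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))
```

Under these assumptions, a failure rate of 1 in 100 leaves roughly a 37% chance that a broken kernel passes a 100-cycle loop, and about 299 cycles are needed for 95% confidence. With a failure rate of 1 in 2, five cycles are enough. A single wrongly marked "good" step is enough to send a bisect to an unrelated commit.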