On a Haswell AIO, the system fails to suspend or hibernate, with an oops in i915. The system runs Ubuntu 12.04.2, with a kernel that has Haswell support backported.

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:0c12] (rev 01) (prog-if 00 [VGA controller])
	Subsystem: Dell Device [1028:05a7]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at f0000000 (64-bit, non-prefetchable) [size=4M]
	Region 2: Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at f000 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a4] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
Created attachment 75573 [details] s3 log with drm.debug=0xe
Created attachment 75574 [details] s4 log with drm.debug=0xe
Your backport looks incomplete, but there is no OOPS there, nor any indication that suspend failed - at least i915.ko shut down, albeit rather noisily. So why do you think it is not suspending?
This time we ran S3 suspend/resume tests continuously and it failed to suspend on the 10th iteration; I put the full kern.log at http://ubuntuone.com/6EwIGft3lJLAnmG0x4mHrs. I also tried drm-intel-nightly from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2013-02-23-raring/; the system boots to lightdm but with a black screen (the lightdm sound notification does play).
The tenth resume in that session completed at 815s, following a suspend-resume cycle of 40s. 5s after it finished resuming, logging ceased. What happened then is a mystery.
Created attachment 75741 [details]
s3 log with nightly minus cc464b2a17c59ad

Here's a newer log with drm-intel-nightly, with cc464b2a17c59ad reverted in the hope of fixing the blank-screen issue with eDP.
(In reply to comment #6)
> Created attachment 75741 [details]
> s3 log with nightly minus cc464b2a17c59ad
>
> Here's a newer log with drm-intel-nightly, with cc464b2a17c59ad reverted
> in the hope of fixing the blank-screen issue with eDP.

Timo, I don't see any logs from the suspend there. But there is a lot to fix up during boot first - DP link training failures, etc.
Paulo, do the warnings here match our current expectations for HSW startup? Before digging into the complicated resume failure we should clean up the initialisation.
(In reply to comment #8)
> Paulo, do the warnings here match our current expectations for HSW
> startup? Before digging into the complicated resume failure we should
> clean up the initialisation.

Hi,

I'd start by trying to fix all those "unclaimed register" messages first; the current upstream code does not really trigger them. What's the value of intel_dp->output_reg? What are the values of ch_ctl and ch_data in intel_dp_aux_ch? I think the first step is to add code to check for unclaimed registers in i915_read##x and also dump_stack() in case we find unclaimed registers.

Thanks,
Paulo
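For reference, a minimal sketch of the kind of instrumentation being suggested, assuming the Haswell unclaimed-access indicator is the FPGA_DBG register (bit 31, as in the i915 driver of this era) and simplifying the i915_read##x macro by omitting the forcewake and tracing paths - a debugging aid, not the actual upstream check:

/* Debugging sketch for i915_drv.c (simplified).  FPGA_DBG and
 * FPGA_DBG_RM_NOCLAIM are assumed to be the Haswell unclaimed-access
 * indicator of this era. */
#define FPGA_DBG		0x42300
#define FPGA_DBG_RM_NOCLAIM	(1 << 31)

static void hsw_check_unclaimed_reg(struct drm_i915_private *dev_priv,
				    u32 reg)
{
	u32 dbg = readl(dev_priv->regs + FPGA_DBG);

	if (dbg & FPGA_DBG_RM_NOCLAIM) {
		DRM_ERROR("unclaimed access to reg 0x%08x\n", reg);
		dump_stack();
		/* write the sticky bit back to clear it, so the next
		 * unclaimed access is caught as well */
		writel(FPGA_DBG_RM_NOCLAIM, dev_priv->regs + FPGA_DBG);
	}
}

#define __i915_read(x, y) \
u##x i915_read##x(struct drm_i915_private *dev_priv, u32 reg) \
{ \
	u##x val = read##y(dev_priv->regs + reg); \
	if (IS_HASWELL(dev_priv->dev)) \
		hsw_check_unclaimed_reg(dev_priv, reg); \
	return val; \
}

With something like this in place, every MMIO read that trips the flag prints a backtrace, which is exactly the information needed to see which call path is hitting the unclaimed register.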
Created attachment 76304 [details]
patch

This patch fixes the issue we have on our hardware. Paulo or Chris, could you please help review it?
Nope, the cause of that bug is improper serialisation of the hotplug workqueue across suspend/resume. Assuming that it is not fixed by

commit 15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Mar 5 09:50:58 2013 +0100

    drm/i915: enable irqs earlier when resuming
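To illustrate what that commit changes, here is a simplified resume-ordering sketch, assuming the drm_irq_install(), intel_modeset_init_hw() and intel_modeset_setup_hw_state() entry points of that era; it is a skeleton of the ordering only, not the driver's actual i915_drm_thaw():

/* Illustrative ordering only -- not the real i915 resume path. */
static int example_gfx_resume(struct drm_device *dev)
{
	/* Install the interrupt handler *before* touching modeset state:
	 * restoring encoders can fire hotplug events, and those have to be
	 * serviced by a fully wired-up IRQ/hotplug path instead of racing
	 * with a half-initialised driver. */
	drm_irq_install(dev);

	intel_modeset_init_hw(dev);
	intel_modeset_setup_hw_state(dev, false);

	return 0;
}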
Presumably fixed with

commit b8efb17b3d687695b81485f606fc4e6c35a50f9a
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Tue Feb 5 15:41:53 2013 +0800

    i915: ignore lid open event when resuming

Thanks for reporting this issue and please reopen if this isn't fixed.
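The idea behind that commit, in a very rough sketch (the example_* names and the resume_in_progress flag are hypothetical stand-ins, not the driver's actual state tracking): the ACPI lid notifier must not trigger a modeset restore while a resume is still in flight.

#include <linux/kernel.h>
#include <linux/notifier.h>

struct example_priv {
	struct notifier_block lid_notifier;
	bool resume_in_progress;	/* set while suspend/resume runs */
};

static void example_restore_modeset(struct example_priv *priv)
{
	/* placeholder for the "re-light the panel" work the lid notifier
	 * would normally trigger */
}

static int example_lid_notify(struct notifier_block *nb, unsigned long event,
			      void *unused)
{
	struct example_priv *priv =
		container_of(nb, struct example_priv, lid_notifier);

	/* While resuming, the resume path restores the mode itself;
	 * doing it from the notifier as well races with that code. */
	if (priv->resume_in_progress)
		return NOTIFY_OK;

	example_restore_modeset(priv);
	return NOTIFY_OK;
}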
I don't think the patch below will fix this bug. First, this machine has no lid. Second, the hotplug event is generated when the system enters sleep, not on resume. See the attached kern.log messages. Thanks.

(In reply to comment #12)
> Presumably fixed with
>
> commit b8efb17b3d687695b81485f606fc4e6c35a50f9a
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Tue Feb 5 15:41:53 2013 +0800
>
>     i915: ignore lid open event when resuming
>
> Thanks for reporting this issue and please reopen if this isn't fixed.
Created attachment 76509 [details]
kern.log with drm.debug=0xe

The system starts entering sleep at 64.056462 seconds; at 64.526217 seconds the hotplug function is called.
We can still reproduce this issue with the 3.9-rc8 kernel. Please let me know if there is anything I can do to help.
Can you please attach a drm.debug=0xe dmesg from the latest drm-intel-nightly branch which shows the backtrace?
Created attachment 78494 [details]
kern.log, system resumes automatically after entering S3

I got the kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2013-04-25-raring/. The system wakes up automatically while entering S3, but it does not always wake up correctly; sometimes it hangs while waking up.
I still can't see any oops in the logs you're attaching ... everything seems to work as expected. We need that oops (everything, including backtrace and dumps) to make progress here.
Can you please retest with intel_iommu=igfx_off added to the kernel cmdline?
(In reply to comment #17)
> Created attachment 78494 [details]
> kern.log, system resumes automatically after entering S3
>
> I got the kernel from
> http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2013-04-25-raring/.
> The system wakes up automatically while entering S3, but it does not
> always wake up correctly; sometimes it hangs while waking up.

This machine is a very early SDV pre-production stepping (PCI ID 0x0c12) and it also has a Radeon graphics card attached. The machine in comment #1 has the same PCI ID, so it is also a very early pre-production stepping. Can you please confirm whether the problem still happens on newer hardware? If this bug is specific to these old SDVs, I'm not sure they're worth fixing.

Haswell machines with PCI IDs 0x0CXX are really old SDVs; perhaps we could even consider dropping support for them. Newer SDVs and real production hardware have PCI IDs starting with 0x04XX (normal Haswell), 0x0AXX (ULT) and 0x0DXX (CRW).
Paulo, dropping SDV pci ids sounds like a good idea. Volunteered for that patch?
(In reply to comment #21)
> Paulo, dropping SDV pci ids sounds like a good idea. Volunteered for that
> patch?

Can we just declare these SDVs as not supported, instead of removing the code? I still want my HSW SDVs (not affected by this bug) in use for some time.
(In reply to comment #22)
> (In reply to comment #21)
> > Paulo, dropping SDV pci ids sounds like a good idea. Volunteered for
> > that patch?
>
> Can we just declare these SDVs as not supported, instead of removing the
> code? I still want my HSW SDVs (not affected by this bug) in use for some
> time.

I don't object to dropping 0CXX. I just want to preserve 04XX.
(In reply to comment #23)
> I don't object to dropping 0CXX. I just want to preserve 04XX.

0x04xx are the release PCI IDs and are also used by later SDVs. 0x0cxx is used by really early SDVs, and we've already started to remove specific hacks for them. So I think we can drop the 0x0cxx IDs without upsetting anyone.
(In reply to comment #24)
> (In reply to comment #23)
> > I don't object to dropping 0CXX. I just want to preserve 04XX.
>
> 0x04xx are the release PCI IDs and are also used by later SDVs. 0x0cxx is
> used by really early SDVs, and we've already started to remove specific
> hacks for them. So I think we can drop the 0x0cxx IDs without upsetting
> anyone.

Patch merged: "drm/i915: print a message when we detect an early Haswell SDV". So now we print a dmesg message whenever someone is using one of the 0x0CXX machines. The driver still loads, but at least we tell users to expect problems, and when they report bugs we'll be able to look at dmesg, find the message and tell them to try to reproduce the bug on real-world hardware.

http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=175d3c1b176af5ad2196064a66a45e97582239d5

Closing the bug. If you can still reproduce it on other Haswell machines with a recent BIOS, please reopen this bug or open a new bug report.

Thanks,
Paulo
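As a rough sketch of what such a check looks like (the IS_HSW_EARLY_SDV name and the message text are paraphrased, and dev->pci_device is the device-ID field of that kernel era, so treat the details as assumptions rather than the merged patch itself):

#define IS_HSW_EARLY_SDV(dev) \
	(IS_HASWELL(dev) && ((dev)->pci_device & 0xFF00) == 0x0C00)

static void example_warn_early_sdv(struct drm_device *dev)
{
	/* Early SDVs are identified purely by their 0x0CXX PCI device ID
	 * family; warn at load time but keep binding the driver. */
	if (IS_HSW_EARLY_SDV(dev))
		DRM_INFO("This is an early pre-production Haswell machine; "
			 "it may not be fully functional\n");
}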