61508 – [HSW AIO] i915 oops during suspend/hibernate

Status:	CLOSED WONTFIX

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium critical
Assignee:	Paulo Zanoni
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-02-26 13:53 UTC by Anthony Wong
Modified:	2017-07-24 22:58 UTC
CC List:	4 users

See Also:
i915 platform:
i915 features:

Attachments
s3 log with drm.debug=0xe (207.22 KB, text/plain) 2013-02-26 13:55 UTC, Anthony Wong
s4 log with drm.debug=0xe (75.85 KB, text/plain) 2013-02-26 14:00 UTC, Anthony Wong
s3 log with nightly minus cc464b2a17c59ad (302.35 KB, text/plain) 2013-03-01 13:56 UTC, Timo Aaltonen
patch (892 bytes, text/plain) 2013-03-11 03:48 UTC, Anthony Wong
kern.log with drm.debug=0xe (98.56 KB, text/plain) 2013-03-14 03:57 UTC, XiongZhang
kern.log, system resumes automatically after entering S3 (255.74 KB, text/plain) 2013-04-26 02:13 UTC, AceLan Kao

Description Anthony Wong 2013-02-26 13:53:59 UTC

On a Haswell AIO, the system fails to suspend or hibernate, with an oops in i915.
The system runs Ubuntu 12.04.2 with a kernel that has Haswell support backported.

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:0c12] (rev 01) (prog-if 00 [VGA controller])
	Subsystem: Dell Device [1028:05a7]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at f0000000 (64-bit, non-prefetchable) [size=4M]
	Region 2: Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at f000 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a4] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-

Comment 1 Anthony Wong 2013-02-26 13:55:29 UTC

Created attachment 75573
s3 log with drm.debug=0xe

Comment 2 Anthony Wong 2013-02-26 14:00:08 UTC

Created attachment 75574
s4 log with drm.debug=0xe

Comment 3 Chris Wilson 2013-02-26 14:01:35 UTC

Your backport looks incomplete, but there is no OOPS there, nor any indication that suspend failed; at least i915.ko shut down, albeit rather noisily.

So why do you think it is not suspending?

Comment 4 Anthony Wong 2013-02-27 15:23:21 UTC

This time we ran S3 suspend/resume tests continuously, and it failed to suspend in the 10th iteration. I put the full kern.log at http://ubuntuone.com/6EwIGft3lJLAnmG0x4mHrs. I also tried drm-intel-nightly from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2013-02-23-raring/, and the system could boot to lightdm, but with a black screen (lightdm's sound notification played).

Comment 5 Chris Wilson 2013-02-27 15:31:34 UTC

The tenth resume in that session completed at 815s, following a suspend-resume cycle of 40s. 5s after it finished resuming, logging ceased. What happened then is a mystery.

Comment 6 Timo Aaltonen 2013-03-01 13:56:41 UTC

Created attachment 75741
s3 log with nightly minus cc464b2a17c59ad

here's a newer log with drm-intel-nightly, with cc464b2a17c59ad reverted hoping to fix the blank-screen issue with eDP

Comment 7 Chris Wilson 2013-03-03 18:15:31 UTC

(In reply to comment #6)
> Created attachment 75741
> s3 log with nightly minus cc464b2a17c59ad
> 
> here's a newer log with drm-intel-nightly, with cc464b2a17c59ad reverted
> hoping to fix the blank-screen issue with eDP

Timo, I don't see any logs from the suspend there. But there is a lot to fix up during boot first: DP link training failures, etc.

Comment 8 Chris Wilson 2013-03-04 10:52:56 UTC

Paulo, do the warnings here match our current expectations for HSW startup? Before digging into the complicated resume failure we should clean up the initialisation.

Comment 9 Paulo Zanoni 2013-03-04 19:44:55 UTC

(In reply to comment #8)
> Paulo, do the warnings here match our current expectations for HSW startup?
> Before digging into the complicated resume failure we should clean up the
> initialisation.

Hi

I'd start by trying to fix all those "unclaimed register" messages first. The current upstream code is not really triggering these messages. What's the value of intel_dp->output_reg? What's the value of ch_ctl and ch_data on intel_dp_aux_ch? I think the first step is to add code to check for unclaimed registers on i915_read##x and also dump_stack() in case we find unclaimed registers.

Thanks,
Paulo
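
The check suggested above can be illustrated with a toy model. This is only a sketch in Python, not driver code: `FakeMmio`, `checked_read`, and the register offsets are hypothetical names invented for the illustration, and the sticky error flag stands in for whatever hardware bit reports an unclaimed access. The idea is to test the flag after every read and capture a stack trace when it is set, so the offending caller can be identified.

```python
import traceback

# Hypothetical register offsets, chosen only for this toy model.
CLAIMED_REGS = {0x1000: 0xDEADBEEF, 0x1004: 0x00000001}


class FakeMmio:
    """Toy MMIO space with a sticky 'unclaimed access' error flag."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.unclaimed_flag = False

    def raw_read(self, offset):
        # A read that no block claims returns 0 and latches the flag.
        if offset not in self.regs:
            self.unclaimed_flag = True
            return 0
        return self.regs[offset]


def checked_read(mmio, offset, report):
    """Read a register, then record a stack trace if it was unclaimed."""
    value = mmio.raw_read(offset)
    if mmio.unclaimed_flag:
        mmio.unclaimed_flag = False  # clear so the next access is judged alone
        stack = "".join(traceback.format_stack())
        report.append((offset, stack))
    return value


report = []
mmio = FakeMmio(CLAIMED_REGS)
checked_read(mmio, 0x1000, report)  # claimed: nothing is recorded
checked_read(mmio, 0x2000, report)  # unclaimed: offset and call stack recorded
for offset, stack in report:
    print(f"unclaimed register 0x{offset:x}\n{stack}")
```

Clearing the flag after each report keeps one bad access from being blamed on every later read.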

Comment 10 Anthony Wong 2013-03-11 03:48:52 UTC

Created attachment 76304
patch

This patch fixes the issue on our hardware.

Paulo or Chris, could you please help review it?

Comment 11 Chris Wilson 2013-03-11 08:42:20 UTC

Nope, the cause of that bug is improper serialisation of the hotplug workqueue across suspend/resume. Assuming that it is not fixed by

commit 15239099d7a7a9ecdc1ccb5b187ae4cda5488ff9
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Mar 5 09:50:58 2013 +0100

    drm/i915: enable irqs earlier when resuming

Comment 12 Daniel Vetter 2013-03-11 17:48:45 UTC

Presumably fixed with

commit b8efb17b3d687695b81485f606fc4e6c35a50f9a
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Tue Feb 5 15:41:53 2013 +0800

    i915: ignore lid open event when resuming
    
Thanks for reporting this issue and please reopen if this isn't fixed.

Comment 13 XiongZhang 2013-03-14 03:54:35 UTC

I don't think the patch below will fix this bug. First, this machine has no lid. Second, the hotplug event is generated when the system enters sleep, not when it resumes. See the kern.log in the attachment.

Thanks
(In reply to comment #12)
> Presumably fixed with
> 
> commit b8efb17b3d687695b81485f606fc4e6c35a50f9a
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Tue Feb 5 15:41:53 2013 +0800
> 
>     i915: ignore lid open event when resuming
> 
> Thanks for reporting this issue and please reopen if this isn't fixed.

Comment 14 XiongZhang 2013-03-14 03:57:51 UTC

Created attachment 76509
kern.log with drm.debug=0xe

The system starts entering sleep at 64.056462 seconds; at 64.526217 seconds, the hotplug function is called.
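
For anyone repeating this kind of check on a kern.log, the gap between two events can be computed from the bracketed kernel timestamps. The following is a minimal sketch: the two timestamps are the ones quoted in this comment, but the message text in `sample` is a placeholder and is not copied from the attachment, and `first_timestamp` is a hypothetical helper name.

```python
import re

# Matches the "[   64.056462]" timestamp at the start of kernel log messages.
TIMESTAMP_RE = re.compile(r"\[\s*(\d+\.\d+)\]")


def first_timestamp(lines, needle):
    """Return the timestamp (seconds) of the first line containing needle."""
    for line in lines:
        if needle in line:
            match = TIMESTAMP_RE.search(line)
            if match:
                return float(match.group(1))
    return None


# Illustrative lines only: the timestamps come from this comment, the
# message text is a placeholder, not the real log content.
sample = [
    "kernel: [   64.056462] <suspend entry marker>",
    "kernel: [   64.526217] <hotplug function marker>",
]

suspend_at = first_timestamp(sample, "<suspend entry marker>")
hotplug_at = first_timestamp(sample, "<hotplug function marker>")
delta = round(hotplug_at - suspend_at, 6)
print(f"hotplug ran {delta} s after suspend entry")
```

With the two timestamps above, the hotplug function runs roughly 0.47 seconds after suspend entry begins.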

Comment 15 AceLan Kao 2013-04-25 08:04:48 UTC

We can still reproduce this issue with the 3.9-rc8 kernel.
Please let me know if there is anything I can do to help.

Comment 16 Daniel Vetter 2013-04-25 12:37:31 UTC

Can you please attach a drm.debug=0xe dmesg from the latest drm-intel-nightly branch which shows the backtrace?
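
As a side note on the `drm.debug=0xe` value requested throughout this report: the parameter is a bitmask of debug categories. The sketch below decodes it, assuming the category bits defined by the DRM core headers of kernels from this period (the `DRM_UT_*` values); the table may not match other kernel versions, and `decode_drm_debug` is a hypothetical helper name.

```python
# Debug category bits as defined by the DRM core of that era (DRM_UT_*);
# treat this table as an assumption for other kernel versions.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0xE))
```

Under that assumption, 0xe enables every category except the core messages.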

Comment 17 AceLan Kao 2013-04-26 02:13:36 UTC

Created attachment 78494
kern.log, system resumes automatically after entering S3

I got the kernel from here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2013-04-25-raring/
The system wakes up automatically while entering S3.
It does not always wake up correctly, though; sometimes it hangs while waking up.

Comment 18 Daniel Vetter 2013-04-26 07:13:17 UTC

I still can't see any oops in the logs you're attaching... everything seems to work as expected. We need that oops (everything, including backtrace and dumps) to make progress here.

Comment 19 Daniel Vetter 2013-07-11 19:23:05 UTC

Can you please retest with intel_iommu=igfx_off added to the kernel cmdline?

Comment 20 Paulo Zanoni 2013-07-30 16:33:19 UTC

(In reply to comment #17)
> Created attachment 78494
> kern.log, system resumes automatically after entering S3
> 
> I got the kernel from here
> http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/2013-04-25-raring/
> The system wakes up automatically while entering S3.
> But not every time it will wake up correctly, sometimes it hangs while
> waking up.

This machine is a very early SDV pre-production stepping (PCI ID 0x0c12) and it also has a Radeon graphics card attached. The machine in comment #1 has the same PCI ID, so it is also a very early pre-production stepping.

Can you please confirm if the problem still happens on newer hardware? If this bug is specific to these old SDVs, I'm not sure if they're worth fixing.

Haswell machines with PCI IDs 0x0CXX are really old SDVs, perhaps we could even consider dropping support for them. Newer SDVs and real production HW have PCI IDs starting with 0x04XX (normal Haswell), 0x0AXX (ULT) and 0x0DXX (CRW).
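
The prefix rule described above can be turned into a quick triage check on `lspci -nn` output. This is a sketch under stated assumptions: the prefix table is taken only from this comment and is not a complete device ID list, the function names are hypothetical, and the code does not verify that an ID belongs to a Haswell part at all.

```python
import re

# Prefix table taken from the comment above: the high byte of the device
# ID identifies the Haswell variant. Not a complete ID list.
HSW_ID_PREFIXES = {
    0x0C: "early SDV",
    0x04: "normal Haswell",
    0x0A: "ULT",
    0x0D: "CRW",
}

# lspci -nn prints IDs as "[vendor:device]", for example "[8086:0c12]".
PCI_ID_RE = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")


def classify_hsw_device(device_id):
    """Map a 16-bit device ID to a variant name using its high byte."""
    return HSW_ID_PREFIXES.get(device_id >> 8, "unknown")


def classify_lspci_line(line):
    """Extract an Intel (8086) device ID from an lspci line and classify it."""
    for vendor, device in PCI_ID_RE.findall(line):
        if vendor.lower() == "8086":
            return classify_hsw_device(int(device, 16))
    return None


line = ("00:02.0 VGA compatible controller [0300]: "
        "Intel Corporation Device [8086:0c12] (rev 01)")
print(classify_lspci_line(line))
```

Applied to the lspci line from the description, the device is classified as an early SDV.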

Comment 21 Daniel Vetter 2013-08-04 21:26:07 UTC

Paulo, dropping SDV pci ids sounds like a good idea. Volunteered for that patch?

Comment 22 Gordon Jin 2013-08-12 02:08:30 UTC

(In reply to comment #21)
> Paulo, dropping SDV pci ids sounds like a good idea. Volunteered for that
> patch?

Can we just declare that these SDVs are not supported, instead of removing the code? I still want my HSW SDVs (not affected by this bug) in use for some time.

Comment 23 Gordon Jin 2013-08-12 02:11:04 UTC

(In reply to comment #22)
> (In reply to comment #21)
> > Paulo, dropping SDV pci ids sounds like a good idea. Volunteered for that
> > patch?
> 
> Can we just declare not to support these SDV, instead of removing the code?
> I still want my HSW SDVs (not affected by this bug) in use for some time.

I don't object to dropping 0CXX. I just want to preserve 04XX.

Comment 24 Daniel Vetter 2013-08-12 05:18:46 UTC

(In reply to comment #23)
> I don't object to dropping 0CXX. I just want to preserve 04XX.

0x04xx are the release pciids and also used by later sdvs. 0x0cxx is used by really early sdvs and we've already started to remove specific hacks for them. So I think we can drop the 0x0cxx ids without upsetting anyone.

Comment 25 Paulo Zanoni 2013-08-13 18:56:32 UTC

(In reply to comment #24)
> (In reply to comment #23)
> > I don't object to dropping 0CXX. I just want to preserve 04XX.
> 
> 0x04xx are the release pciids and also used by later sdvs. 0x0cxx is used by
> really early sdvs and we've already started to remove specific hacks for
> them. So I think we can drop the 0x0cxx ids without upsetting anyone.

Patch merged: "drm/i915: print a message when we detect an early Haswell SDV".

So now we print a dmesg message whenever someone is using the 0x0CXX machines. The driver still loads, but at least we tell the users to expect problems, and when they report bugs we'll be able to look at dmesg, find the message and tell them to try to reproduce the bug on real-world hardware.

http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued&id=175d3c1b176af5ad2196064a66a45e97582239d5

Closing bug. If you can still reproduce the bug on other Haswell machines with a recent BIOS, please reopen this bug or open a new bug report.

Thanks,
Paulo
