Bug 34056

Summary: [SNB] GPU hang after Gnome starts
Product: xorg
Reporter: Matt Turner <mattst88>
Component: Driver/intel
Assignee: Chris Wilson <chris>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: major
Priority: medium
CC: mengmeng.meng
Version: unspecified
Hardware: x86-64 (AMD64)
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                                                  Flags
dmesg showing the failure                                    none
Xorg.0.log                                                   none
kernel config                                                none
i915_error_state.gz                                          none
i915_error_state.gz                                          none
dmesg from working 2.6.37                                    none
Poll the FIFO for free entries before writing the register   none

Description Matt Turner 2011-02-08 16:59:52 UTC
Created attachment 43142
dmesg showing the failure

With 2.6.38-rc3+, I'm seeing a GPU hang and a total loss of graphics very soon after Gnome starts (within 2~3 seconds). It did not happen with 2.6.37.

lspci:
00:00.0 Host bridge: Intel Corporation Device 0100 (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Device 0112 (rev 09)
00:16.0 Communication controller: Intel Corporation Device 1c3a (rev 04)
00:1a.0 USB Controller: Intel Corporation Device 1c2d (rev 04)
00:1b.0 Audio device: Intel Corporation Device 1c20 (rev 04)
00:1c.0 PCI bridge: Intel Corporation Device 1c10 (rev b4)
00:1c.2 PCI bridge: Intel Corporation Device 1c14 (rev b4)
00:1c.3 PCI bridge: Intel Corporation 82801 PCI Bridge (rev b4)
00:1c.4 PCI bridge: Intel Corporation Device 1c18 (rev b4)
00:1d.0 USB Controller: Intel Corporation Device 1c26 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Device 1c4a (rev 04)
00:1f.2 SATA controller: Intel Corporation Device 1c02 (rev 04)
00:1f.3 SMBus: Intel Corporation Device 1c22 (rev 04)
02:00.0 USB Controller: Device 1b6f:7023 (rev 01)
03:00.0 PCI bridge: Device 1b21:1080 (rev 01)
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

# lspci -vv -s 0:2.0
00:02.0 VGA compatible controller: Intel Corporation Device 0112 (rev 09) (prog-if 00 [VGA controller])
	Subsystem: ASRock Incorporation Device 0112
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 43
	Region 0: Memory at fe000000 (64-bit, non-prefetchable) [size=4M]
	Region 2: Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at f000 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee0f00c  Data: 4171
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a4] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: i915
Comment 1 Matt Turner 2011-02-08 17:07:50 UTC
Created attachment 43143
Xorg.0.log
Comment 2 Matt Turner 2011-02-08 17:08:37 UTC
Created attachment 43144
kernel config
Comment 3 Chris Wilson 2011-02-09 01:37:23 UTC
Can you please merge drm-intel-next [git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git] and attach the /sys/kernel/debug/dri/0/i915_error_state?
Comment 4 Matt Turner 2011-02-26 13:46:59 UTC
Created attachment 43861
i915_error_state.gz

i915_error_state with the current linus-2.6 + drm-intel-next.

Same thing. Maybe a bit worse. GDM works fine, but after login when the desktop begins to appear, the GPU goes into a restart cycle.
Comment 5 Matt Turner 2011-03-03 19:49:34 UTC
Created attachment 44102
i915_error_state.gz

Still present as of b65a0e0c84cf489bfa00d6aa6c48abc5a237100f.

I updated libdrm/mesa/xserver/xf86-video-intel to the latest from git a few minutes ago, no changes with .38. It still dies.

2.6.37 on the other hand, fine. Do I need to be reporting this on the kernel bugzilla?
Comment 6 Matt Turner 2011-03-03 22:12:55 UTC
Created attachment 44106
dmesg from working 2.6.37
Comment 7 Chris Wilson 2011-03-04 01:27:16 UTC
Same IPEHR, hmm. What's the lspci for this chip?
Comment 8 Chris Wilson 2011-03-04 01:40:24 UTC
Ok, not the same. Just forgot to look for the renamed file, d'oh.

Hmm. It looks like the write to the tail went astray, judging by the IPEHR; I don't see any other reason for it not to have advanced.


Try this to see if it makes the hang disappear:


diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index bdf4ceb..5e26b5e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -264,9 +264,11 @@ void __gen6_force_wake_get(struct drm_i915_private *dev_priv)
 {
 	int count;
 
+#if 0
 	count = 0;
 	while (count++ < 50 && (I915_READ_NOTRACE(FORCEWAKE_ACK) & 1))
 		udelay(10);
+#endif
 
 	I915_WRITE_NOTRACE(FORCEWAKE, 1);
 	POSTING_READ(FORCEWAKE);
@@ -278,8 +280,10 @@ void __gen6_force_wake_get(struct drm_i915_private *dev_priv)
 
 void __gen6_force_wake_put(struct drm_i915_private *dev_priv)
 {
+#if 0
 	I915_WRITE_NOTRACE(FORCEWAKE, 0);
 	POSTING_READ(FORCEWAKE);
+#endif
 }
 
 static int i915_drm_freeze(struct drm_device *dev)
Comment 9 Matt Turner 2011-03-04 09:49:40 UTC
(In reply to comment #8)
> Try this to see if it makes the hang disappear:

This does indeed make the hang/reboot-cycle go away.

Also, dmesg from .37 and .38 differ as to stolen memory. I didn't modify any BIOS settings to make this happen.
Comment 10 Chris Wilson 2011-03-04 11:34:25 UTC
Created attachment 44137
Poll the FIFO for free entries before writing the register
Comment 11 Matt Turner 2011-03-04 11:58:53 UTC
(In reply to comment #10)
> Created an attachment (id=44137) [details]
> Poll the FIFO for free entries before writing the register

That fixes it!

Please have a
Reported-and-Tested-by: Matt Turner <mattst88@gmail.com>

Thanks a ton, Chris. :)
Comment 12 Chris Wilson 2011-03-04 12:02:38 UTC
Matt, thanks a lot for that quick testing. I'll send it onwards shortly (I'm just waiting to hear if it fixes another issue as well).
Comment 13 Chris Wilson 2011-03-04 12:06:28 UTC
*** Bug 34421 has been marked as a duplicate of this bug. ***
Comment 14 Chris Wilson 2011-03-06 01:09:37 UTC
Pushed to -fixes:

commit 91355834646328e7edc6bd25176ae44bcd7386c7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 4 19:22:40 2011 +0000

    drm/i915: Do not overflow the MMADDR write FIFO
    
    Whilst the GT is powered down (rc6), writes to MMADDR are placed in a
    FIFO by the System Agent. This is a limited resource, only 64 entries, of
    which 20 are reserved for Display and PCH writes, and so we must take
    care not to queue up too many writes. To avoid this, there is counter
    which we can poll to ensure there are sufficient free entries in the
    fifo.
    
    "Issuing a write to a full FIFO is not supported; at worst it could
    result in corruption or a system hang."
    
    Reported-and-Tested-by: Matt Turner <mattst88@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=34056
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
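
The polling idea described in the commit message can be illustrated with a small userspace model. This is only a sketch and not the pushed patch: the struct, the function names, the drain behaviour, and the exact threshold comparison are all hypothetical, and a real driver would read the free-entry count from a hardware register rather than from a struct field. The only numbers taken from the commit message are the 64 total entries and the 20 reserved ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Figures quoted in the commit message. */
#define FIFO_TOTAL_ENTRIES    64u
#define FIFO_RESERVED_ENTRIES 20u

/* Hypothetical stand-in for the hardware FIFO and its free-entry counter. */
struct mock_fifo {
	uint32_t free_entries;   /* what the counter would report */
	uint32_t drain_per_poll; /* entries retired while we wait once */
	uint32_t writes_done;    /* writes that were actually queued */
};

/* Models reading the free-entry counter. */
uint32_t mock_read_free_entries(const struct mock_fifo *f)
{
	return f->free_entries;
}

/* Models a short delay during which the hardware drains the FIFO. */
void mock_wait(struct mock_fifo *f)
{
	f->free_entries += f->drain_per_poll;
	if (f->free_entries > FIFO_TOTAL_ENTRIES)
		f->free_entries = FIFO_TOTAL_ENTRIES;
}

/*
 * Queue one write, but only after the counter shows more free entries
 * than the reserved share. Returns false if the FIFO did not drain
 * within max_polls waits (assumed policy: give up rather than overflow).
 */
bool guarded_write(struct mock_fifo *f, unsigned int max_polls)
{
	unsigned int polls = 0;

	assert(f->free_entries <= FIFO_TOTAL_ENTRIES);

	while (mock_read_free_entries(f) <= FIFO_RESERVED_ENTRIES) {
		if (polls++ >= max_polls)
			return false;
		mock_wait(f);
	}

	f->free_entries--; /* the write consumes one FIFO entry */
	f->writes_done++;
	return true;
}
```

In this model, a caller that never lets the FIFO drain gets at most 64 - 20 = 44 writes through before guarded_write() starts returning false, so the reserved entries are never consumed.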
Comment 15 meng 2011-03-06 19:31:49 UTC
(In reply to comment #13)
> *** Bug 34421 has been marked as a duplicate of this bug. ***

Tested with commit 913558; it works fine when opening and closing Terminal in the GNOME desktop.
