Bug 29252

Summary: [Arrandale] Hung WAIT_FOR_EVENT when running rss-glx-skyrocket
Product: DRI Reporter: Philippe Troin <phil>
Component: DRM/Intel Assignee: Jesse Barnes <jbarnes>
Status: CLOSED FIXED QA Contact:
Severity: normal    
Priority: medium    
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                                                  Flags
i915_error_state                                             none
trigger scanline wait at pipe off time                       none
intel_reg_dumper after screen saver triggered GPU hang       none
i915_error_state without i915-clear-scanline-wait.patch      none
i915_error_state with i915-clear-scanline-wait.patch         none
intel_reg_dumper without the i915-clear-scanline-wait.patch  none
intel_reg_dumper with the i915-clear-scanline-wait.patch     none
My variant upon Jesse's idea.                                none
i915_error_state.txt after chris's patch                     none
intel_reg_dumper.txt after chris's patch                     none

Description Philippe Troin 2010-07-25 21:24:18 UTC
Created attachment 37390 [details]
i915_error_state

I am using the following software:

 - Fedora kernel-2.6.33.6-147
 - Fedora 13
 - libdrm-2.4.21-2.fc13
 - xorg-x11-drv-intel-2.12 (from rawhide)
 - xorg-x11-server-Xorg-1.8.2-2.fc13

uname:

Linux air 2.6.33.6-147.fc13.x86_64 #1 SMP Thu Jul 8 18:16:22 PDT 2010 
x86_64 GNU/Linux

lspci:

00:00.0 Host bridge: Intel Corporation Core Processor DRAM Controller (rev 12)
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, fast devsel, latency 0
    Capabilities: [e0] Vendor Specific Information: Len=0c <?>
    Kernel driver in use: agpgart-intel

00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated
Graphics Controller (rev 12) (prog-if 00 [VGA controller])
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, fast devsel, latency 0, IRQ 47
    Memory at d0000000 (64-bit, non-prefetchable) [size=4M]
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    I/O ports at 5058 [size=8]
    Expansion ROM at <unassigned> [disabled]
    Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [d0] Power Management version 2
    Capabilities: [a4] PCI Advanced Features
    Kernel driver in use: i915
    Kernel modules: i915

00:19.0 Ethernet controller: Intel Corporation 82577LM Gigabit Network
Connection (rev 06)
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, fast devsel, latency 0, IRQ 48
    Memory at d4700000 (32-bit, non-prefetchable) [size=128K]
    Memory at d472a000 (32-bit, non-prefetchable) [size=4K]
    I/O ports at 5020 [size=32]
    Capabilities: [c8] Power Management version 2
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [e0] PCI Advanced Features
    Kernel driver in use: e1000e
    Kernel modules: e1000e

00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2
Enhanced Host Controller (rev 06) (prog-if 20 [EHCI])
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, medium devsel, latency 0, IRQ 16
    Memory at d4729000 (32-bit, non-prefetchable) [size=1K]
    Capabilities: [50] Power Management version 2
    Capabilities: [58] Debug port: BAR=1 offset=00a0
    Capabilities: [98] PCI Advanced Features
    Kernel driver in use: ehci_hcd

00:1b.0 Audio device: Intel Corporation 5 Series/3400 Series Chipset High
Definition Audio (rev 06)
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, fast devsel, latency 0, IRQ 50
    Memory at d4720000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [50] Power Management version 2
    Capabilities: [60] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [130] Root Complex Link
    Kernel driver in use: HDA Intel
    Kernel modules: snd-hda-intel

00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express
Root Port 1 (rev 06) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
    Memory behind bridge: d4600000-d46fffff
    Capabilities: [40] Express Root Port (Slot+), MSI 00
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [90] Subsystem: Hewlett-Packard Company Device 7008
    Capabilities: [a0] Power Management version 2
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:1c.1 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express
Root Port 2 (rev 06) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=02, subordinate=42, sec-latency=0
    I/O behind bridge: 00003000-00004fff
    Memory behind bridge: d0600000-d45fffff
    Prefetchable memory behind bridge: 00000000d4900000-00000000d4afffff
    Capabilities: [40] Express Root Port (Slot+), MSI 00
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [90] Subsystem: Hewlett-Packard Company Device 7008
    Capabilities: [a0] Power Management version 2
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:1c.3 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express
Root Port 4 (rev 06) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=43, subordinate=43, sec-latency=0
    Memory behind bridge: d0500000-d05fffff
    Capabilities: [40] Express Root Port (Slot+), MSI 00
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [90] Subsystem: Hewlett-Packard Company Device 7008
    Capabilities: [a0] Power Management version 2
    Kernel driver in use: pcieport
    Kernel modules: shpchp

00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2
Enhanced Host Controller (rev 06) (prog-if 20 [EHCI])
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, medium devsel, latency 0, IRQ 20
    Memory at d4728000 (32-bit, non-prefetchable) [size=1K]
    Capabilities: [50] Power Management version 2
    Capabilities: [58] Debug port: BAR=1 offset=00a0
    Capabilities: [98] PCI Advanced Features
    Kernel driver in use: ehci_hcd

00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev a6) (prog-if
01 [Subtractive decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=44, subordinate=48, sec-latency=32
    I/O behind bridge: 00002000-00002fff
    Memory behind bridge: d0400000-d04fffff
    Prefetchable memory behind bridge: 00000000d8000000-00000000dbffffff
    Capabilities: [50] Subsystem: Hewlett-Packard Company Device 7008

00:1f.0 ISA bridge: Intel Corporation Mobile 5 Series Chipset LPC Interface
Controller (rev 06)
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, medium devsel, latency 0
    Capabilities: [e0] Vendor Specific Information: Len=10 <?>
    Kernel modules: iTCO_wdt

00:1f.2 SATA controller: Intel Corporation 5 Series/3400 Series Chipset 6 port
SATA AHCI Controller (rev 06) (prog-if 01 [AHCI 1.0])
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 46
    I/O ports at 5048 [size=8]
    I/O ports at 5064 [size=4]
    I/O ports at 5040 [size=8]
    I/O ports at 5060 [size=4]
    I/O ports at 5000 [size=32]
    Memory at d4727000 (32-bit, non-prefetchable) [size=2K]
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [70] Power Management version 3
    Capabilities: [a8] SATA HBA v1.0
    Capabilities: [b0] PCI Advanced Features
    Kernel driver in use: ahci

00:1f.6 Signal processing controller: Intel Corporation 5 Series/3400 Series
Chipset Thermal Subsystem (rev 06)
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, fast devsel, latency 0, IRQ 10
    Memory at d4726000 (64-bit, non-prefetchable) [size=4K]
    Capabilities: [50] Power Management version 3
    Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-

43:00.0 Network controller: Intel Corporation Centrino Advanced-N 6200 (rev 35)
    Subsystem: Intel Corporation Centrino Advanced-N 6200 2x2 AGN
    Flags: bus master, fast devsel, latency 0, IRQ 49
    Memory at d0500000 (64-bit, non-prefetchable) [size=8K]
    Capabilities: [c8] Power Management version 3
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [e0] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Device Serial Number 00-23-14-ff-ff-77-aa-48
    Kernel driver in use: iwlagn
    Kernel modules: iwlagn

44:06.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 06)
(prog-if 10 [OHCI])
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, medium devsel, latency 64, IRQ 20
    Memory at d0401000 (32-bit, non-prefetchable) [size=2K]
    Capabilities: [dc] Power Management version 2
    Kernel driver in use: firewire_ohci
    Kernel modules: firewire-ohci

44:06.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host
Adapter (rev 25)
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, medium devsel, latency 64, IRQ 22
    Memory at d0403000 (32-bit, non-prefetchable) [size=256]
    Capabilities: [80] Power Management version 2
    Kernel driver in use: sdhci-pci
    Kernel modules: sdhci-pci

44:06.2 System peripheral: Ricoh Co Ltd R5C843 MMC Host Controller (rev 14)
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, medium devsel, latency 0, IRQ 11
    Memory at d0402000 (32-bit, non-prefetchable) [size=256]
    Capabilities: [80] Power Management version 2

44:06.3 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev bb)
    Subsystem: Hewlett-Packard Company Device 7008
    Flags: bus master, medium devsel, latency 168, IRQ 22
    Memory at d0400000 (32-bit, non-prefetchable) [size=4K]
    Bus: primary=44, secondary=45, subordinate=48, sec-latency=176
    Memory window 0: d8000000-dbfff000 (prefetchable)
    Memory window 1: dc000000-dffff000
    I/O window 0: 00002000-000020ff
    I/O window 1: 00002400-000024ff
    16-bit legacy interface ports at 0001
    Kernel driver in use: yenta_cardbus
    Kernel modules: yenta_socket

ff:00.0 Host bridge: Intel Corporation Core Processor QuickPath Architecture
Generic Non-core Registers (rev 02)
    Subsystem: Intel Corporation Device 8086
    Flags: bus master, fast devsel, latency 0

ff:00.1 Host bridge: Intel Corporation Core Processor QuickPath Architecture
System Address Decoder (rev 02)
    Subsystem: Intel Corporation Device 8086
    Flags: bus master, fast devsel, latency 0

ff:02.0 Host bridge: Intel Corporation Core Processor QPI Link 0 (rev 02)
    Subsystem: Intel Corporation Device 8086
    Flags: bus master, fast devsel, latency 0

ff:02.1 Host bridge: Intel Corporation Core Processor QPI Physical 0 (rev 02)
    Subsystem: Intel Corporation Device 8086
    Flags: bus master, fast devsel, latency 0

ff:02.2 Host bridge: Intel Corporation Core Processor Reserved (rev 02)
    Subsystem: Intel Corporation Device 8086
    Flags: bus master, fast devsel, latency 0

ff:02.3 Host bridge: Intel Corporation Core Processor Reserved (rev 02)
    Subsystem: Intel Corporation Device 8086
    Flags: bus master, fast devsel, latency 0

Other relevant info:

VGA BIOS: https://bugs.freedesktop.org/attachment.cgi?id=37080

The hang happened while rss-glx-skyrocket (a rss-glx screensaver) was running.

dmesg:

[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 953036 at 931817)
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

(plenty of these in a loop)
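
For anyone triaging similar logs, here is a minimal Python sketch that pulls the numbers out of the i915_do_wait_request line above. The return value -5 corresponds to EIO. Reading the two numbers as the awaited and the current request sequence numbers is an assumption based on the message wording, and the helper name parse_wait_error is made up for this example.

```python
import errno
import re

# Matches the message format seen in the dmesg output above.
WAIT_RE = re.compile(
    r"i915_do_wait_request returns (-?\d+) \(awaiting (\d+) at (\d+)\)"
)


def parse_wait_error(line):
    """Return the fields of a wait-request error line, or None if no match."""
    match = WAIT_RE.search(line)
    if match is None:
        return None
    ret, awaiting, current = (int(group) for group in match.groups())
    return {
        # Kernel functions return negative errno values, so negate first.
        "errno_name": errno.errorcode.get(-ret, "unknown"),
        # Assumption: "awaiting X at Y" means waiting for seqno X while
        # the last completed seqno is Y.
        "awaiting": awaiting,
        "current": current,
        "gap": awaiting - current,
    }


if __name__ == "__main__":
    sample = (
        "[drm:i915_do_wait_request] *ERROR* i915_do_wait_request "
        "returns -5 (awaiting 953036 at 931817)"
    )
    print(parse_wait_error(sample))
```

If that reading is right, a gap that stays constant across repeated messages would be consistent with the GPU making no progress.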

I have attached a dump of /sys/kernel/debug/dri/0/i915_error_state.

Phil.
Comment 1 Chris Wilson 2010-07-26 00:53:38 UTC
WAIT_FOR_EVENT hang. Did you notice anything else happening at the time, like a modeset change, dpms on/off, unplugging a monitor or two?

Meh, I need to also include a full register dump in the error state.
Comment 2 Philippe Troin 2010-07-26 07:35:24 UTC
(In reply to comment #1)
> WAIT_FOR_EVENT hang. Did you notice anything else happening at the time, like a
> modeset change, dpms on/off, unplugging a monitor or two?

Yes, during the "run", the DPMS Off kicked in.
I don't know if the hang occurred at the time the DPMS went Off, or when the DPMS went On (or in between for that matter).

> Meh, I need to also include a full register dump in the error state.

Do you want the output of intel_reg_dumper next time it happens?
Comment 3 Jesse Barnes 2010-07-26 13:00:38 UTC
Created attachment 37403 [details] [review]
trigger scanline wait at pipe off time

I wonder if this patch helps?  The intent is to trigger any outstanding scanline wait event before shutting off the pipe.  When the pipe shuts off, it should end up stopping on the first line of the next frame, so hopefully this register programming is correct.
Comment 4 Philippe Troin 2010-07-26 21:32:30 UTC
(In reply to comment #3)
> Created an attachment (id=37403) [details]
> trigger scanline wait at pipe off time
> 
> I wonder if this patch helps?  The intent is to trigger any outstanding
> scanline wait event before shutting off the pipe.  When the pipe shuts off, it
> should end up stopping on the first line of the next frame, so hopefully this
> register programming is correct.

I'm recompiling a kernel right now with this patch.
I will report on its effect later.

Anything you'd want if I notice a hang again?

Thanks!
Phil.
Comment 5 Dongxu Li 2010-08-05 19:37:57 UTC
Created attachment 37613 [details]
intel_reg_dumper after screen saver triggered GPU hang

After running xscreensaver-demo, the GPU hangs with:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 802356 at 799091)

The attached intel_reg_dumper output was taken after the hang.
Comment 6 Dongxu Li 2010-08-05 20:00:04 UTC
(In reply to comment #3)
> Created an attachment (id=37403) [details]
> trigger scanline wait at pipe off time
> 
> I wonder if this patch helps?  The intent is to trigger any outstanding
> scanline wait event before shutting off the pipe.  When the pipe shuts off, it
> should end up stopping on the first line of the next frame, so hopefully this
> register programming is correct.

No, this patch doesn't help. The GPU still hangs on xscreensaver-demo.

tested on 2.6.35 kernel, xf86-video-intel-2.12.0, xorg-server-1.8.2
Comment 7 Chris Wilson 2010-08-06 02:12:24 UTC
Can you also grab the error state of the hang with the patch applied so we can confirm the bug is identical? It'll be typical if fixing the WAIT_FOR_EVENT hang means we just hit mesa submitting an illegal op...
Comment 8 Dongxu Li 2010-08-06 05:33:51 UTC
Created attachment 37632 [details]
i915_error_state without i915-clear-scanline-wait.patch

error state after the GPU hang
Comment 9 Dongxu Li 2010-08-06 05:35:29 UTC
Created attachment 37633 [details]
i915_error_state with i915-clear-scanline-wait.patch

error state after the GPU hang, with i915-clear-scanline-wait.patch
Comment 10 Dongxu Li 2010-08-06 05:37:01 UTC
Created attachment 37634 [details]
intel_reg_dumper without the i915-clear-scanline-wait.patch

intel_reg_dumper after the GPU hang
Comment 11 Dongxu Li 2010-08-06 05:38:15 UTC
Created attachment 37635 [details]
intel_reg_dumper with the i915-clear-scanline-wait.patch

intel_reg_dumper after the GPU hang, with the i915-clear-scanline-wait.patch
Comment 12 Chris Wilson 2010-08-08 04:03:55 UTC
Created attachment 37688 [details] [review]
My variant upon Jesse's idea.

(Note this will only apply on top of my for-anholt series of pending patches.)
Comment 13 Dongxu Li 2010-08-08 13:31:19 UTC
(In reply to comment #12)
> Created an attachment (id=37688) [details]
> My variant upon Jesse's idea.
> 
> (Note this will only apply on top of my for-anholt series of pending patches.)

Would you please give me something which is applicable to the stable 2.6.35 kernel? I do not know how to dig up the so-called for-anholt series of patches.
Comment 14 Dongxu Li 2010-08-08 23:31:12 UTC
Created attachment 37717 [details]
i915_error_state.txt after chris's patch
Comment 15 Dongxu Li 2010-08-08 23:34:05 UTC
Created attachment 37718 [details]
intel_reg_dumper.txt after chris's patch
Comment 16 Chris Wilson 2010-09-06 10:05:28 UTC
Ok, this looks mighty dubious:

0x0903c15c:      0x79000002: 3DSTATE_DRAWING_RECTANGLE
0x0903c160:      0x00000000:    top left: 0,0
0x0903c164:      0x00000000:    bottom right: 0,0
0x0903c168:      0x00000000:    origin: 0,0

The hang also indicates that the batchbuffer itself is the cause. This hang is sufficiently different from the original WAIT_FOR_EVENT hang, and the 0x0 surface could be a vital clue to the original bug.
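
To make the dubious packet easier to spot in other error states, here is a rough Python sketch that re-reads the four dwords from the decoded dump and flags a rectangle whose corners coincide. The bit layout (x in bits 15:0, y in bits 31:16 of each coordinate dword) is an assumption about the packet format, and the helper names are hypothetical.

```python
import re

DUMP = """\
0x0903c15c:      0x79000002: 3DSTATE_DRAWING_RECTANGLE
0x0903c160:      0x00000000:    top left: 0,0
0x0903c164:      0x00000000:    bottom right: 0,0
0x0903c168:      0x00000000:    origin: 0,0
"""

# Each decoded line starts with "<offset>: <dword>:".
LINE_RE = re.compile(r"^0x([0-9a-f]{8}):\s+0x([0-9a-f]{8}):", re.MULTILINE)


def dwords(dump):
    """Extract the raw dword values from a decoded batchbuffer excerpt."""
    return [int(value, 16) for _offset, value in LINE_RE.findall(dump)]


def unpack_xy(dword):
    """Split a coordinate dword. Assumed layout: x = bits 15:0, y = bits 31:16."""
    return dword & 0xFFFF, (dword >> 16) & 0xFFFF


def decode_drawing_rectangle(words):
    """Decode the four dwords of one 3DSTATE_DRAWING_RECTANGLE packet."""
    header, top_left, bottom_right, origin = words
    return {
        "opcode": header >> 16,
        "top_left": unpack_xy(top_left),
        "bottom_right": unpack_xy(bottom_right),
        "origin": unpack_xy(origin),
    }


def corners_coincide(rect):
    """True when both corners are the same point, as in the dump above."""
    return rect["top_left"] == rect["bottom_right"]


if __name__ == "__main__":
    rect = decode_drawing_rectangle(dwords(DUMP))
    print(rect, "suspicious" if corners_coincide(rect) else "ok")
```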
Comment 17 Dongxu Li 2010-09-20 06:55:45 UTC
GPU hangs happen more often with the 2.6.35 kernel. Basically, the machine is unusable with 2.6.35.
Comment 18 Chris Wilson 2010-11-13 01:56:20 UTC
commit 85345517fe6d4de27b0d6ca19fef9d28ac947c4a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Nov 13 09:49:11 2010 +0000

    drm/i915: Retire any pending operations on the old scanout when switching
    
    An old and oft reported bug, is that of the GPU hanging on a
    MI_WAIT_FOR_EVENT following a mode switch. The cause is that the GPU is
    waiting on a scanline counter on an inactive pipe, and so waits for a
    very long time until eventually the user reboots his machine.
    
    We can prevent this either by moving the WAIT into the kernel and
    thereby incurring considerable cost on every swapbuffers, or by waiting
    for the GPU to retire the last batch that accesses the framebuffer
    before installing a new one. As mode switches are much rarer than swap
    buffers, this looks like an easy choice.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=28964
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29252
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
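
The reasoning in the commit message can be illustrated with a toy model. This is not the kernel code: Pipe, Batch, ToyGpu and switch_mode are hypothetical names, and the model only captures the ordering argument, namely that a batch waiting for a scanline event on a pipe can never finish once that pipe is inactive, so pending batches should be retired before the switch.

```python
class Pipe:
    """A display pipe; scanline events only arrive while it is active."""

    def __init__(self, name):
        self.name = name
        self.active = True


class Batch:
    """A queued batch that waits for a scanline event on one pipe."""

    def __init__(self, pipe):
        self.pipe = pipe
        self.retired = False


class ToyGpu:
    """Processes batches strictly in order, like a single command ring."""

    def __init__(self):
        self.queue = []

    def submit(self, batch):
        self.queue.append(batch)

    def run(self):
        """Retire batches in order.

        Returns True if the queue drained, False if the head batch is
        waiting on an inactive pipe (the modelled hang).
        """
        while self.queue:
            head = self.queue[0]
            if not head.pipe.active:
                return False  # the awaited scanline event never arrives
            head.retired = True
            self.queue.pop(0)
        return True


def switch_mode(gpu, old_pipe, retire_first):
    """Turn off old_pipe, optionally draining pending batches beforehand."""
    if retire_first:
        gpu.run()  # drain while the old pipe is still scanning out
    old_pipe.active = False
    return gpu.run()


if __name__ == "__main__":
    for retire_first in (False, True):
        gpu, pipe = ToyGpu(), Pipe("A")
        gpu.submit(Batch(pipe))
        ok = switch_mode(gpu, pipe, retire_first)
        print("retire_first=%s -> %s" % (retire_first, "ok" if ok else "hung"))
```

In this model the cost of draining is paid only on a mode switch, which matches the trade-off the commit message describes against waiting in the kernel on every swapbuffers.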
