27108 – Ironlake interrupt wedging

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	7.5 (2009.10)
Hardware:	Other All

Importance:	high normal
Assignee:	Eric Anholt
QA Contact:	Xorg Project Team

URL:	https://bugzilla.novell.com/show_bug....
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2010-03-16 07:38 UTC by Matthias Hopf
Modified:	2010-05-16 15:13 UTC
CC List:	9 users

See Also:
i915 platform:
i915 features:

Attachments
Xorg.0.log (21.80 KB, text/plain) 2010-03-16 07:38 UTC, Matthias Hopf
intel_gpu_dump.log.gz (135.77 KB, application/octet-stream) 2010-03-16 07:47 UTC, Matthias Hopf

Description Matthias Hopf 2010-03-16 07:38:57 UTC

Created attachment 34108
Xorg.0.log

This is no joke.

On some machines the X server freezes inside the intel driver in an ioctl (all output ceases) until the mouse is moved. I assume that an interrupt was lost in the drm module, and that moving the mouse triggers some other interrupt, which in turn frees up the drm resources that block the ioctl.

Triggering the bug is extremely difficult, and seems to depend on weird circumstances (I considered air pressure or the state of the moon for a while). We had one machine that exposed the bug shipped to us. When it arrived (it even woke up from suspend, so it was in exactly the same state) we weren't able to reproduce the bug even on this machine. But colleagues we trust have seen the bug with their very eyes.

By remote debugging I was able to get quite a lot of information from the machine in its frozen state. Each time the issue occurred, it froze with equivalent stack frames (shown below).


The machine is running a 32-bit Linux 2.6.32.5 kernel. The X server version is 1.6.5 and the intel driver version is 2.10.0. libdrm is 2.4.17 with the most important fixes from 2.4.18 included.


Xserver/driver backtrace during freeze (local vars and code in driver only, context and/or exact source is available if needed):

#0  0xffffe424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb7274909 in ioctl () from /lib/libc.so.6
No symbol table info available.
#2  0xb713993f in drm_intel_gem_bo_start_gtt_access (bo=0x96cb9d0, write_enable=0) at intel_bufmgr_gem.c:1145
        bufmgr_gem = (drm_intel_bufmgr_gem *) 0x8234aa0
        set_domain = {handle = 598, read_domains = 64, write_domain = 0}
        ret = <value optimized out>
    1145                    ret = ioctl(bufmgr_gem->fd,
#3  0xb71399f5 in drm_intel_gem_bo_wait_rendering (bo=0x96cb9d0) at intel_bufmgr_gem.c:1123
	No locals.
    1123            drm_intel_gem_bo_start_gtt_access(bo, 0);
#4  0xb71366f2 in drm_intel_bo_wait_rendering (bo=0x96cb9d0) at intel_bufmgr.c:133
	No locals.
    133             bo->bufmgr->bo_wait_rendering(bo);
#5  0xb708c445 in i830_uxa_block_handler (screen=0x8234d20) at i830_uxa.c:789
        intel = (intel_screen_private *) 0x82325c0
    789                     dri_bo_wait_rendering(intel->front_buffer->bo);
#6  0xb707a92c in I830BlockHandler (i=0, blockData=0x0, pTimeout=0xbff4a478, pReadmask=0x821c3a0) at i830_driver.c:1004
        screen = (ScreenPtr) 0x8234d20
        scrn = (ScrnInfoPtr) 0x8231dd8
        intel = (intel_screen_private *) 0x82325c0
#7  0x0817dc1b in AnimCurScreenBlockHandler (screenNum=0, blockData=0x0, pTimeout=0xbff4a478, pReadmask=0x821c3a0) at animcur.c:222
#8  0x081465a8 in compBlockHandler (i=0, blockData=0x0, pTimeout=0xbff4a478, pReadmask=0x821c3a0) at compinit.c:158
#9  0x08091168 in BlockHandler (pTimeout=0xbff4a478, pReadmask=0x821c3a0) at dixutils.c:384
#10 0x08132e4d in WaitForSomething (pClientsReady=0x9994640) at WaitFor.c:216
#11 0x0808d156 in Dispatch () at dispatch.c:386
#12 0x08071fbd in main (argc=9, argv=0xbff4a5c4, envp=Cannot access memory at address 0x400c6467) at main.c:397


Kernel thread backtrace:

[  804.351460] X             S addc8eee     0  2995   2994 0x00400000
[  804.351465]  f3625e04 00003082 00000004 addc8eee 000000b8 f875e044 00003202 f8789b2a
[  804.351474]  c084cb60 c084cb60 f4eb0d30 f4eb0fe4 c1207b60 00000000 c1207b60 f875e044
[  804.351481]  00000001 f4eb0fe4 f4eb0d30 f3625e30 f41d3000 00028d46 f875d290 f41d3e48
[  804.351489] Call Trace:
[  804.351504]  [<f875d290>] i915_do_wait_request+0x2d0/0x350 [i915]
[  804.351531]  [<f875e174>] i915_gem_object_set_to_gtt_domain+0x34/0xc0 [i915]
[  804.351558]  [<f875e4c4>] i915_gem_set_domain_ioctl+0xc4/0x150 [i915]
[  804.351585]  [<f823e68c>] drm_ioctl+0x15c/0x340 [drm]
[  804.351596]  [<c030e1e8>] vfs_ioctl+0x78/0x90
[  804.351602]  [<c030e663>] do_vfs_ioctl+0x373/0x3f0
[  804.351608]  [<c030e78a>] sys_ioctl+0xaa/0xb0
[  804.351613]  [<c02030a4>] sysenter_do_call+0x12/0x22
[  804.351622]  [<ffffe424>] 0xffffe424


I will attach Xorg.0.log (which contains nothing obvious) and intel_gpu_dump output.

Comment 1 Matthias Hopf 2010-03-16 07:47:11 UTC

Created attachment 34109
intel_gpu_dump.log.gz

Batchbuffer contains the following entries:

0x04123000:      0x54300804: XY_COLOR_BLT (rgb enabled, alpha enabled, dst tile 1)
0x04123004:      0x03f00580:    format 8888, pitch 1408, clipping disabled
0x04123008:      0x025b0147:    (327,603)
0x0412300c:      0x026b014c:    (332,619)
0x04123010:      0x04436000:    offset 0x04436000
0x04123014:      0x00ffffff:    color
0x04123018:      0x02000000: MI_FLUSH
0x0412301c:      0x05000000: MI_BATCH_BUFFER_END
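
The decoded coordinates and pitch above can be checked by hand against the raw dwords. The following is a minimal sketch, assuming only what the dump itself shows: that a coordinate dword carries X in its low 16 bits and Y in its high 16 bits, and that the printed pitch is the low 16 bits of the second dword. The names blt_point, decode_blt_point and decode_blt_pitch are hypothetical and are not taken from intel_gpu_dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One corner of a blit rectangle, in pixels. */
struct blt_point {
    unsigned x;
    unsigned y;
};

/* Split a packed coordinate dword: X in bits 15:0, Y in bits 31:16
 * (layout inferred from the dump lines above, not from a spec). */
static void decode_blt_point(uint32_t dword, struct blt_point *out)
{
    assert(out != NULL);
    out->x = dword & 0xffffu;
    out->y = dword >> 16;
}

/* The pitch printed by the dump matches the low 16 bits of dword 1. */
static unsigned decode_blt_pitch(uint32_t dword)
{
    return dword & 0xffffu;
}
```

With these helpers, 0x025b0147 yields (327,603), 0x026b014c yields (332,619) and 0x03f00580 yields a pitch of 1408, which matches the dump. The batch therefore looks like an ordinary small solid fill, not an obviously malformed command.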


Ringbuffer seems to have HEAD == TAIL, at least I cannot find TAIL.
It's completely filled up with patterns similar (but not 100% identical) to

0x000025e8: HEAD 0x02000000: MI_FLUSH
0x000025ec:      0x00000000: MI_NOOP
0x000025f0:      0x10800001: MI_STORE_DATA_INDEX
0x000025f4:      0x00000080:    dword 1
0x000025f8:      0x00027a18:    dword 2
0x000025fc:      0x01000000: MI_USER_INTERRUPT
0x00002600:      0x18800180: MI_BATCH_BUFFER_START
0x00002604:      0x04094000:    dword 1
[... continues with MI_FLUSH]
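
If HEAD really equals TAIL, the GPU has consumed every command in the ring, which would suggest the GPU itself is not hung and the waiter is sleeping on work that has already finished. The repeated pattern also looks like "store a sequence number, then raise MI_USER_INTERRUPT", so a missed interrupt would fit. A minimal sketch of the occupancy check, assuming a power-of-two ring size; the function names and the 128 KiB size used in the usage note are illustrative assumptions, not values read from this machine:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of commands the GPU has not consumed yet.
 * Assumes `size` is a power of two so the mask handles wraparound. */
static uint32_t ring_pending_bytes(uint32_t head, uint32_t tail, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (tail - head) & (size - 1);
}

/* A drained ring (HEAD == TAIL) means there is nothing left to execute. */
static int ring_is_drained(uint32_t head, uint32_t tail, uint32_t size)
{
    return ring_pending_bytes(head, tail, size) == 0;
}
```

For example, with HEAD at 0x25e8 as in the dump and TAIL at the same offset, ring_is_drained() reports true for a hypothetical 0x20000-byte ring.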

Comment 2 Matthias Hopf 2010-03-16 07:49:09 UTC

For internal records: this bug is associated with Novell bug
https://bugzilla.novell.com/show_bug.cgi?id=567723

Comment 3 Matthias Hopf 2010-03-17 11:44:16 UTC

One additional thought: mouse moves create SIGIO, right? In that case the ioctl() would return with EINTR, so this pretty much explains why moving the mouse has an effect.
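
To illustrate that hypothesis: if the caller reissues the ioctl after EINTR, as wrappers around ioctl() commonly do, the second attempt re-enters the kernel, which can then notice that the request has already completed. The sketch below models only the retry loop. blocking_call, call_retrying_eintr and fake_wait are hypothetical names, and fake_wait is a stand-in for the blocking ioctl, not the real DRM call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a blocking system call such as ioctl(). */
typedef int (*blocking_call)(void *ctx);

/* Call fn until it finishes for a reason other than signal interruption.
 * Returns fn's final result and stores the number of restarts. */
static int call_retrying_eintr(blocking_call fn, void *ctx, int *restarts)
{
    int ret;

    assert(fn != NULL && restarts != NULL);
    *restarts = 0;
    for (;;) {
        ret = fn(ctx);
        if (ret != -1 || errno != EINTR)
            return ret;
        ++*restarts;    /* a signal (e.g. SIGIO) interrupted the call */
    }
}

/* Test double: fails with EINTR once per pending "signal", then succeeds. */
static int fake_wait(void *ctx)
{
    int *signals_left = ctx;

    if (*signals_left > 0) {
        --*signals_left;
        errno = EINTR;
        return -1;
    }
    return 0;
}
```

In this model each mouse movement corresponds to one interrupted call, and the wait only succeeds because the retry gives the kernel another chance to check the request.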

Comment 4 Eric Anholt 2010-03-17 15:07:38 UTC

Kernel configured without MSI?

Comment 5 Stefan Dirsch 2010-03-18 01:05:58 UTC

I see a similar, possibly identical, and I believe at least related issue when starting
mutter on top of a plain X server. After some mouse movements the screen, including the mouse pointer, no longer gets repainted. How to reproduce:

X & 
xlock -update 1 & 
mutter
<move the mouse around until screen no longer gets repainted>

Stack trace for mutter process:

cat /proc/955/stack 
[<c0433924>] i915_wait_request+0x13a/0x1ba
[<c0433a30>] i915_gem_object_wait_rendering+0x28/0x2a
[<c0433a5e>] i915_gem_object_set_to_gtt_domain+0x2c/0x6f
[<c0433f8c>] i915_gem_set_domain_ioctl+0x94/0x108
[<c04203c4>] drm_ioctl+0x206/0x286
[<c02c5bfe>] vfs_ioctl+0x50/0x69
[<c02c5feb>] do_vfs_ioctl+0x326/0x34f
[<c02c6054>] sys_ioctl+0x40/0x5a
[<c0202bc9>] syscall_call+0x7/0xb
[<ffffffff>] 0xffffffff

This is on Ironlake (8086:0046) with 

- xf86-video-intel 2.8.1
- Mesa 7.7
- libdrm 2.4.17 (with commit 4f0f871)
- xorg-server 1.6.3.901
- Kernel 2.6.31.12

After attaching with gdb to mutter process and doing a 'continue' in gdb repainting works fine again. If it hangs again, pressing Ctrl-C followed by 'continue' in gdb fixes the issue reliably. 

It seems the issue only occurs when you move the mouse cursor to the top of the screen, where some mutter menu pops up.

> Kernel configured without MSI?

Not sure what you mean. One of these?

# zcat /proc/config.gz |grep -i msi
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_MSI_LAPTOP=m
CONFIG_MSI_WMI=m
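
Note that CONFIG_PCI_MSI=y only shows that MSI support is built into the kernel. Whether the i915 interrupt is actually delivered via MSI is visible in /proc/interrupts. A minimal sketch of that check on a single line of the file follows; the function name is hypothetical, the plain substring match is deliberately naive, and the sample lines in the usage note are illustrative, not output from the affected machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if this /proc/interrupts line mentions `driver` and is
 * routed via MSI (the type column contains "PCI-MSI"), 0 otherwise. */
static int line_is_msi_for(const char *line, const char *driver)
{
    assert(line != NULL && driver != NULL);
    return strstr(line, driver) != NULL && strstr(line, "PCI-MSI") != NULL;
}
```

A line such as " 29:  12345  0  PCI-MSI-edge  i915" would count as MSI, while " 16:  100  0  IO-APIC-fasteoi  i915" would not.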

Comment 6 Stefan Dirsch 2010-03-18 01:51:22 UTC

Looks like my issue is fixed in SLE11-SP1-RC1 kernel 2.6.32.9-0.5, whereas it's still broken in SLE11-SP1-Beta5 kernel 2.6.32.8-0.3. AFAICS we didn't add any additional patches between these kernel packages. Thus it appears to be fixed upstream between 2.6.32.8 and 2.6.32.9.

Comment 7 Matthias Hopf 2010-03-18 03:42:19 UTC

My issue is NOT fixed with 2.6.33.

Comment 8 Stefan Dirsch 2010-03-18 04:23:16 UTC

Just for the record. My issue *is* fixed with the same 2.6.33 kernel package (2.6.33-5-pae).

Comment 9 Eric Anholt 2010-03-22 15:16:30 UTC

Are all the machines in question Ironlake?  The interrupt handling on those is different, and there may still be bugs.

Comment 10 Stefan Dirsch 2010-03-22 20:57:59 UTC

Yes, this issue has only been seen on Ironlake so far.

Comment 11 Eric Anholt 2010-03-26 12:36:33 UTC

OK.  Zhenyu has done most of the work on the Ironlake interrupt handler, but I think he's busy with other modesetting stuff right now.  I'm running Ironlake hardware daily for my GL development now, and haven't run into this, though.

Comment 12 Matthias Hopf 2010-03-30 02:32:14 UTC

We haven't been able to reproduce it here either, but colleagues *did* see it with their own eyes at a partner's site. I used scripts for debugging, and can get remote access if there's anything to try out.

The effect seems to be really rare (machine-wise), but on machines where it is reproducible it seems that you can trigger it easily. It also seems to depend on the air pressure or whatever: after shipping a laptop (it even arrived suspended, and did resume successfully) the effect wasn't reproducible. Sigh.

Comment 13 Wang Zhenyu 2010-04-08 19:04:41 UTC

Matthias, I have a recent irq patch for ILK at http://lists.freedesktop.org/archives/intel-gfx/2010-April/006444.html, which hopefully makes first-level irq enable/disable more reliable on ILK. Our media team seems to require that patch, so please help to test it. If it is OK, we can push it to the stable kernels.

Comment 14 Matthias Hopf 2010-04-21 09:11:38 UTC

(In reply to comment #13)
> Matthias, I have a recent irq patch for ILK at
> http://lists.freedesktop.org/archives/intel-gfx/2010-April/006444.html, which

Thanks, we had that patch tested; it had no effect.

For what it's worth, 2.6.34-rc3 seems to be much more stable, but the issue still pops up from time to time.

Comment 15 Eric Anholt 2010-04-23 11:17:37 UTC

commit e552eb7038a36d9b18860f525aa02875e313fe16
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Wed Apr 21 11:39:23 2010 -0700

    drm/i915: use PIPE_CONTROL instruction on Ironlake and Sandy Bridge
    
    Since 965, the hardware has supported the PIPE_CONTROL command, which
    provides fine grained GPU cache flushing control.  On recent chipsets,
    this instruction is required for reliable interrupt and sequence number
    reporting in the driver.
    
    So add support for this instruction, including workarounds, on Ironlake
    and Sandy Bridge hardware.
    
    https://bugs.freedesktop.org/show_bug.cgi?id=27108
    
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Eric Anholt <eric@anholt.net>

commit 1918ad77f7f908ed67cf37c505c6ad4ac52f1ecf
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Fri Apr 23 09:32:23 2010 -0700

    drm/i915: fix non-Ironlake 965 class crashes
    
    My PIPE_CONTROL fix (just sent via Eric's tree) was buggy; I was
    testing a whole set of patches together and missed a conversion to the
    new HAS_PIPE_CONTROL macro, which will cause breakage on non-Ironlake
    965 class chips.  Fortunately, the fix is trivial and has been tested.
    
    Be sure to use the HAS_PIPE_CONTROL macro in i915_get_gem_seqno, or
    we'll end up reading the wrong graphics memory, likely causing hangs,
    crashes, or worse.
    
    Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
    Reported-by: Toralf Förster <toralf.foerster@gmx.de>
    Tested-by: Toralf Förster <toralf.foerster@gmx.de>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
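
The first commit message ties the fix to reliable interrupt and sequence number reporting. A waiter depends on both: the interrupt wakes it up, and the reported sequence number tells it whether its request has finished. The following is a minimal sketch of the completion check, assuming a 32-bit counter that may wrap. seqno_passed and request_done are hypothetical names, and the status slot is modelled as a plain variable instead of GPU-written memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe "reported >= wanted" for 32-bit sequence numbers:
 * the signed difference stays correct across counter overflow. */
static int seqno_passed(uint32_t reported, uint32_t wanted)
{
    return (int32_t)(reported - wanted) >= 0;
}

/* Check a request against the value the GPU last wrote to its status
 * slot (here just a pointer to an ordinary variable). */
static int request_done(const volatile uint32_t *status_slot, uint32_t wanted)
{
    assert(status_slot != NULL);
    return seqno_passed(*status_slot, wanted);
}
```

If the value read from the status slot is stale, or the interrupt never arrives, a check like this is either not run or returns false, and the waiter keeps sleeping even though the work is done. That matches the symptom reported here.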

Comment 16 Matthias Hopf 2010-04-27 03:19:41 UTC

It now appears that the issue is related to a BIOS change. In any case, the latest patches are currently being tested to see whether they fix the issue as well, and whether they have any side effects.

Thanks so far.
