Bug 55845

Summary: [ivb hw context regression] GPU hang in compositor(?)
Product: DRI Reporter: Zoltan Kovacs <giszo.k>
Component: DRM/Intel Assignee: Daniel Vetter <daniel>
Status: CLOSED DUPLICATE QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: highest CC: ben, chris, daniel, enrico.tagliavini, jbarnes
Version: XOrg git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description | Flags
dmesg output | none
i915_error_state | none
Xorg log | none
Screen corruption | none
Xorg log 1st hang | none
Xorg log 2nd hang | none
i915_error_state 1st hang | none
i915_error_state 2nd hang | none
System log | none
Error state with patch on comment #33 applied | none

Description Zoltan Kovacs 2012-10-10 17:54:03 UTC
Created attachment 68411 [details]
dmesg output

The X server freezes randomly while I am using my PC for normal desktop usage (web browsing, movie watching, etc.)

The box is a Lenovo IdeaPad U410 ultrabook with Core i7 Ivy Bridge CPU and I am using the integrated Intel HD 4000 graphics card. The operating system running on it is Gentoo Linux.

I found these lines in dmesg that could be relevant:
[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
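
For anyone who needs to collect that file, here is a minimal sketch in C of copying the error state somewhere safe so it can be attached to a report. The dmesg line gives the path relative to the debugfs mount point. The sketch assumes debugfs is mounted at /sys/kernel/debug (common, but not guaranteed) and that the DRM minor is 0, as in the message above. The function names are made up for illustration, and reading the file typically needs root.

```c
#include <stdio.h>

/* Build "<debugfs_root>/dri/<minor>/i915_error_state" into buf.
 * Returns 0 on success, -1 if buf is too small. */
int error_state_path(char *buf, size_t len, const char *debugfs_root, int minor)
{
    int n = snprintf(buf, len, "%s/dri/%d/i915_error_state", debugfs_root, minor);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Copy the error state from src_path to dst_path.
 * Returns 0 on success, -1 on any open/read/write failure. */
int save_error_state(const char *src_path, const char *dst_path)
{
    FILE *in = fopen(src_path, "r");
    if (!in)
        return -1;

    FILE *out = fopen(dst_path, "w");
    if (!out) {
        fclose(in);
        return -1;
    }

    char chunk[4096];
    size_t got;
    int rc = 0;
    while ((got = fread(chunk, 1, sizeof chunk, in)) > 0) {
        if (fwrite(chunk, 1, got, out) != got) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;

    fclose(in);
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}
```

A caller would build the path with error_state_path(buf, sizeof buf, "/sys/kernel/debug", 0) and then pass it to save_error_state() together with a destination file of its choosing.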

Here are the versions of the packages I am using:
Linux kernel 3.6.1
X.Org X Server 1.13.0 Release Date: 2012-09-05
xf86-video-intel: 2.20.9
mesa: 9.0_pre20120918
Comment 1 Zoltan Kovacs 2012-10-10 17:55:17 UTC
Created attachment 68412 [details]
i915_error_state
Comment 2 Zoltan Kovacs 2012-10-10 17:56:01 UTC
Created attachment 68413 [details]
Xorg log
Comment 3 Chris Wilson 2012-10-10 22:08:16 UTC
Death sometime before the flush following a mesa batch completes. First I would make sure you have all the latest w/a, so pull one of Daniel's famous kernels.

http://cgit.freedesktop.org/~danvet/drm-intel #drm-intel-fixes
Comment 4 Zoltan Kovacs 2012-10-11 16:44:21 UTC
Installed a 3.6.1 kernel with the patches found in the drm-intel-fixes branch you mentioned.

After a few days of testing hopefully I will have good news. Yesterday I was unable to use the PC for an hour without having to restart the X server. :(
Comment 5 Zoltan Kovacs 2012-10-11 17:40:18 UTC
The problem still exists even with the patched kernel.
Comment 6 Chris Wilson 2012-10-18 12:35:28 UTC
The next step is trying the latest mesa-8.0 and mesa-9.0 releases.
Comment 7 Zoltan Kovacs 2012-10-18 12:41:50 UTC
Already tried both of them and unfortunately I had the same problem.

I saw that a new version (2.20.10) of the xf86-video-intel package is out; I will try it too.
Comment 8 Chris Wilson 2012-10-18 13:27:07 UTC
I'd pick the ddx from git to avoid a bug on IvyBridge with ComponentAlpha glyphs (if you use subpixel antialiasing on fonts).

Also, can you attach a few more error-states in case one of those captures a more obvious cause?
Comment 9 Zoltan Kovacs 2012-10-18 13:31:22 UTC
What is ddx? Could you put the URL of the GIT repository here?

Oh, I have one more piece of information that is probably important. In the last few days I disabled all the desktop effects in KDE and now the system seems to be stable.
Comment 10 Chris Wilson 2012-10-18 13:37:34 UTC
The Device Dependent X (ddx) is xf86-video-intel (http://cgit.freedesktop.org/xorg/driver/xf86-video-intel). There is no urgent need to upgrade that if kwin without desktop effects is stable.

That last statement points towards an interaction between mesa and the kernel as being the cause, either missing a workaround or just plain broken.
Comment 11 Zoltan Kovacs 2012-10-18 13:42:07 UTC
What else could I do to help you? Collect more error-states?
Comment 12 Chris Wilson 2012-10-18 13:46:04 UTC
Yes, more error states would definitely help. Narrowing down if there is any particular effect that seems to trigger a hang would be a big help as well.
Comment 13 Zoltan Kovacs 2012-10-18 13:55:15 UTC
Okay, I will try to get some more.
Comment 14 Enrico Tagliavini 2012-10-29 20:15:04 UTC
Hi there. I have this problem too. I have an HD 4000 (gen7) from an i7 3612QM. It happens in KDE 4.8.5 with both kernel 3.6.4 and 3.7-rc3. libdrm 2.4.39, xorg-server 1.12.2, mesa 9.0, video-intel 2.20.12 on Gentoo Linux. UXA seems more problematic than SNA, so I might end up using UXA to reproduce this issue more often and add my error-states, if this is ok for Chris.

But for me mesa 8.0.4 works like a charm, I have this problem only in mesa 9.0.

I have to add that I have random corruption at the top of the screen coming and going when moving the mouse around. Moving the mouse should not trigger desktop effects. It might be Firefox, given it is almost always the foreground application and it is using the whole screen. I will do more testing and add some error states when I can find the time.
Comment 15 Zoltan Kovacs 2012-10-30 07:30:47 UTC
Mesa 8 really fixes the problem for you?

I also tried mesa 8 when Chris suggested it, but the system froze with it once as well. Unfortunately I was not able to get dmesg output that time, so I couldn't be sure it was the same problem. However, since I assumed the problem was the same, I went back to mesa 9.

Probably I will give another go to mesa 8 with the hope it fixes my problem too.
Comment 16 Enrico Tagliavini 2012-10-30 11:17:45 UTC
(In reply to comment #15)
> Mesa 8 really fixes the problem for you?

mesa 8.0.4 just works for me. No corruption, no hangs. mesa 9.0 regressed.

I discovered something important: I always get corruption at the top of the screen when a tooltip pops up. *Always*. Even if the tooltip is at the bottom of the screen (keep the mouse over an application in the taskbar, for example). This should be quite easy to reproduce, I think. The corruption sometimes goes away almost immediately; sometimes I have to trigger a refresh of the affected section.

About the error state: I've spent all morning trying to crash the driver. I failed! It is quite hard to hang the GPU. I will keep trying.

@Chris: Is there something I can do to help debug the corruption?
Comment 17 Chris Wilson 2012-10-30 11:27:05 UTC
(In reply to comment #16)
> @Chris: Is there something I can do to help debug the corruption?

Can you capture the corruption with a photo? I see some rendering corruption in the top-left corner in various games (typically a few blocks of white) and am just wondering if this is the same.
Comment 18 Enrico Tagliavini 2012-10-30 13:07:08 UTC
Created attachment 69306 [details]
Screen corruption

Here it is :). I reproduced the corruption with mesa git master as of 2 hours ago. Usually the corruption goes away in a fraction of a second, but not always. In this case it disappeared when I triggered a refresh of the top of the screen. On rare occasions the mouse pointer can also get corrupted, as can the window decoration.
Comment 19 Chris Wilson 2012-10-30 13:16:29 UTC
That matches what I've seen as well with recent mesa. Thanks. And from the photo we can clearly see that it is a tiling corruption - some render surface has the wrong attributes.
Comment 20 Stefano Avallone 2012-10-30 13:47:04 UTC
Created attachment 69307 [details]
Xorg log 1st hang
Comment 21 Stefano Avallone 2012-10-30 13:47:49 UTC
Hello, I just "managed" to get the GPU hang twice in one hour...

My system:
Ivybridge desktop (core i7-3770 with HD 4000)
kernel 3.6.3 (64 bit)
xserver 1.13.0 with SNA enabled
mesa 9.0
xf86-video-intel 2.20.12
libdrm 2.4.39

I am attaching:

- the X server logs
- the content of i915_error_state after each hang
- the system log (I use systemd) of the last week (you can see that other hangs occurred in the past, but I had not enabled the drm debug). The last two hangs, which are related to the attached logs, occurred today at 13:13 and 14:20. Only kernel messages are shown in this log.

If you need more information or you want me to do some testing, just ask.
Comment 22 Stefano Avallone 2012-10-30 13:48:26 UTC
Created attachment 69309 [details]
Xorg log 2nd hang
Comment 23 Stefano Avallone 2012-10-30 13:49:12 UTC
Created attachment 69310 [details]
i915_error_state 1st hang
Comment 24 Stefano Avallone 2012-10-30 13:49:43 UTC
Created attachment 69311 [details]
i915_error_state 2nd hang
Comment 25 Stefano Avallone 2012-10-30 13:50:07 UTC
Created attachment 69312 [details]
System log
Comment 26 Enrico Tagliavini 2012-10-30 13:58:41 UTC
(In reply to comment #19)
> That matches what I've seen as well with recent mesa. Thanks. And from the
> photo we can clearly see that it is a tiling corruption - some render
> surface has the wrong attributes.

If I can do something else just let me know. I have no problem compiling mesa master or official branches. I can also apply custom patches or whatever else is needed. I'm on Gentoo Linux so it is not so hard.

Thank you
Comment 27 Chris Wilson 2012-10-30 14:12:48 UTC
Hmm, all the error states are most peculiar, dying immediately after MI_SET_CONTEXT with garbage in the IPEHR. Can someone try:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 53ba395..f8eab98 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -252,7 +252,7 @@ void i915_gem_context_init(struct drm_device *dev)
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	uint32_t ctx_size;
 
-	if (!HAS_HW_CONTEXTS(dev))
+	if (!HAS_HW_CONTEXTS(dev) || 1)
 		dev_priv->hw_contexts_disabled = true;
 
 	if (dev_priv->hw_contexts_disabled)
Comment 28 Zoltan Kovacs 2012-10-30 18:08:01 UTC
I am currently compiling a 3.6.4 kernel with the patch Chris provided, however the source in the latest stable kernel looks a bit different:

        if (!HAS_HW_CONTEXTS(dev)) {
                dev_priv->hw_contexts_disabled = true;
                return;
        }

I have no idea whether this is good or bad, it will turn out soon. :)
Comment 29 Ben Widawsky 2012-10-30 18:14:11 UTC
(In reply to comment #28)
> I am currently compiling a 3.6.4 kernel with the patch Chris provided,
> however the source in the latest stable kernel looks a bit different:
> 
>         if (!HAS_HW_CONTEXTS(dev)) {
>                 dev_priv->hw_contexts_disabled = true;
>                 return;
>         }
> 
> I have no idea whether this is good or bad, it will turn out soon. :)

You could also do:
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 7274360..7bed78f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1148,7 +1148,7 @@ struct drm_i915_file_private {
 #define HAS_LLC(dev)            (INTEL_INFO(dev)->has_llc)
 #define I915_NEED_GFX_HWS(dev) (INTEL_INFO(dev)->need_gfx_hws)
 
-#define HAS_HW_CONTEXTS(dev)   (INTEL_INFO(dev)->gen >= 6)
+#define HAS_HW_CONTEXTS(dev)   (0)
Comment 30 Zoltan Kovacs 2012-10-30 18:16:20 UTC
I just added "|| 1" to the if to make it true in all cases.

I was worried that my version is a bit different compared to the one Chris posted in the diff.
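
To illustrate why the two source variants behave the same once "|| 1" is added, here is a small self-contained model in C. It is only a sketch: struct toy_device, TOY_HAS_HW_CONTEXTS and contexts_ready are invented stand-ins, not the real i915 structures, and the force_disable argument plays the role of the literal "1" from the patch.

```c
#include <stdbool.h>

/* Invented stand-in for the driver state; not the real i915 structures. */
struct toy_device {
    int gen;                    /* hardware generation, e.g. 7 for Ivy Bridge */
    bool hw_contexts_disabled;
    bool contexts_ready;        /* set only if the context setup path ran */
};

/* Same shape as the capability check quoted in comment #29. */
#define TOY_HAS_HW_CONTEXTS(dev) ((dev)->gen >= 6)

/* Mirrors the 3.6.4 variant from comment #28 (early return).
 * Passing force_disable = true is equivalent to appending "|| 1". */
void toy_context_init(struct toy_device *dev, bool force_disable)
{
    if (!TOY_HAS_HW_CONTEXTS(dev) || force_disable) {
        dev->hw_contexts_disabled = true;
        return; /* the context setup below is skipped entirely */
    }

    dev->contexts_ready = true; /* stands in for the real setup work */
}
```

With the condition forced true, the setup path is never reached regardless of the hardware generation, which is also what redefining the capability macro to (0) achieves.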
Comment 31 Zoltan Kovacs 2012-10-30 19:07:05 UTC
Since the new kernel was booted with the patch, the system seems to be stable.

The screen corruption that was also reported by others still exists, but the GPU hang issue has disappeared with the KDE effects enabled again.

I don't know whether it makes any difference but with the new kernel I also upgraded to xf86-video-intel-2.20.12.
Comment 32 Daniel Vetter 2012-10-30 22:02:53 UTC
Ok, I think we need to split this bug up into two parts:
- tracking the gpu hang, which seems to be due to hw contexts. I think we can leave this issue in this bug report here (title adjusted)

- the corruption, which is a regression from mesa 8 to mesa 9/master. It might be that this is simply the lack of a hw workaround, can you please first try the latest drm-intel-nightly branch from the drm-intel kernel git repo at

http://cgit.freedesktop.org/~danvet/drm-intel

If that does not help, please file a new bug report against mesa/i965, mentioning that this is a regression.
Comment 33 Chris Wilson 2012-10-31 08:43:59 UTC
Ok, I have no idea what actual w/a the ARB_ON_OFF are for; they are documented as only being used with runlists and NOOP otherwise, so let's just simplify things. Can you please test with (and remember to reenable hw contexts):


diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 53ba395..198c3d5 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -345,15 +345,10 @@ mi_set_context(struct intel_ring_buffer *ring,
 			return ret;
 	}
 
-	ret = intel_ring_begin(ring, 6);
+	ret = intel_ring_begin(ring, 4);
 	if (ret)
 		return ret;
 
-	if (IS_GEN7(ring->dev))
-		intel_ring_emit(ring, MI_ARB_ON_OFF | MI_ARB_DISABLE);
-	else
-		intel_ring_emit(ring, MI_NOOP);
-
 	intel_ring_emit(ring, MI_NOOP);
 	intel_ring_emit(ring, MI_SET_CONTEXT);
 	intel_ring_emit(ring, new_context->obj->gtt_offset |
@@ -364,11 +359,6 @@ mi_set_context(struct intel_ring_buffer *ring,
 	/* w/a: MI_SET_CONTEXT must always be followed by MI_NOOP */
 	intel_ring_emit(ring, MI_NOOP);
 
-	if (IS_GEN7(ring->dev))
-		intel_ring_emit(ring, MI_ARB_ON_OFF | MI_ARB_ENABLE);
-	else
-		intel_ring_emit(ring, MI_NOOP);
-
 	intel_ring_advance(ring);
 
 	return ret;
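
For readers wondering why the reservation drops from 6 to 4: the patch removes two emitted dwords, and the count passed to the ring "begin" call has to match the number of dwords emitted afterwards. A toy model of that bookkeeping in C is sketched below. Everything here is hypothetical: the toy_* names, the ring size and the command values are placeholders and not the driver's API or the hardware encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder command values for the toy model; NOT the hardware encodings. */
enum {
    TOY_MI_NOOP        = 0x0,
    TOY_MI_SET_CONTEXT = 0x1,
    TOY_MI_ARB_DISABLE = 0x2,
    TOY_MI_ARB_ENABLE  = 0x3
};

enum { TOY_RING_DWORDS = 16 };

struct toy_ring {
    uint32_t buf[TOY_RING_DWORDS];
    size_t tail;     /* index of the next dword to write */
    size_t reserved; /* dwords promised by begin and not yet emitted */
};

/* Reserve space for exactly n dwords. Returns 0 on success, -1 otherwise. */
int toy_ring_begin(struct toy_ring *ring, size_t n)
{
    if (ring->reserved != 0 || ring->tail + n > TOY_RING_DWORDS)
        return -1;
    ring->reserved = n;
    return 0;
}

void toy_ring_emit(struct toy_ring *ring, uint32_t dword)
{
    assert(ring->reserved > 0); /* emitting more than reserved is a bug */
    ring->buf[ring->tail++] = dword;
    ring->reserved--;
}

void toy_ring_advance(struct toy_ring *ring)
{
    assert(ring->reserved == 0); /* emitting less than reserved is a bug too */
}

/* Emit a context switch. with_arb_wa = true models the sequence before the
 * patch on gen7 (6 dwords); false models the patched sequence (4 dwords). */
int toy_set_context(struct toy_ring *ring, uint32_t ctx_addr, bool with_arb_wa)
{
    int ret = toy_ring_begin(ring, with_arb_wa ? 6 : 4);
    if (ret)
        return ret;

    if (with_arb_wa)
        toy_ring_emit(ring, TOY_MI_ARB_DISABLE);
    toy_ring_emit(ring, TOY_MI_NOOP);
    toy_ring_emit(ring, TOY_MI_SET_CONTEXT);
    toy_ring_emit(ring, ctx_addr);
    toy_ring_emit(ring, TOY_MI_NOOP); /* NOOP after SET_CONTEXT, as in the diff */
    if (with_arb_wa)
        toy_ring_emit(ring, TOY_MI_ARB_ENABLE);

    toy_ring_advance(ring);
    return 0;
}
```

Changing one of the counts without touching the emits trips one of the assertions, which is the mismatch the diff avoids by lowering the reservation together with the removed emits.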
Comment 34 Enrico Tagliavini 2012-10-31 11:10:37 UTC
(In reply to comment #32)
> - the corruption, which is a regression from mesa 8 to mesa 9/master. It
> might be that this is simply the lack of a hw workaround, can you please
> first try the latest drm-intel-nightly branch from the drm-intel kernel git
> repo at
> 
> http://cgit.freedesktop.org/~danvet/drm-intel
> 
> If that does not help, please file a new bug report against mesa/i965,
> mentioning that this is a regression.

Tested, the corruption is still there, I opened bug #56610
Comment 35 Enrico Tagliavini 2012-10-31 17:01:14 UTC
Created attachment 69357 [details]
Error state with patch on comment #33 applied

I applied the patch from comment #33 to linux 3.7-rc3. I had 2 GPU hangs in a row. This is the error state
Comment 36 Chris Wilson 2012-10-31 17:06:30 UTC
Ok, no change there then. That implies that it is the MI_SET_CONTEXT alone that causes it to execute -1, or at least die with IPEHR==-1.
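
As a side note for anyone comparing their own dumps: "-1" in a 32-bit register reads as 0xffffffff. The sketch below checks a single line of text for that value. It assumes the dump contains a line of the form "IPEHR: 0x<hex>"; that layout is an assumption for illustration, so check it against your own error state. The function names are invented.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parse a line assumed to look like "  IPEHR: 0xffffffff".
 * Returns true and stores the value in *out if the line matches. */
bool parse_ipehr(const char *line, uint32_t *out)
{
    unsigned int value;

    if (sscanf(line, " IPEHR: 0x%x", &value) != 1)
        return false;
    *out = (uint32_t)value;
    return true;
}

/* True if the register value is "-1", i.e. all 32 bits set. */
bool ipehr_is_minus_one(uint32_t ipehr)
{
    return ipehr == (uint32_t)-1;
}
```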
Comment 37 Stefano Avallone 2012-10-31 17:06:52 UTC
(In reply to comment #34)
> (In reply to comment #32)
> > - the corruption, which is a regression from mesa 8 to mesa 9/master. It
> > might be that this is simply the lack of a hw workaround, can you please
> > first try the latest drm-intel-nightly branch from the drm-intel kernel git
> > repo at
> > 
> > http://cgit.freedesktop.org/~danvet/drm-intel
> > 
> > If that does not help, please file a new bug report against mesa/i965,
> > mentioning that this is a regression.
> 
> Tested, the corruption is still there, I opened bug #56610

Just to inform all the people in cc: here: it seems this is a bug in kwin and it has been fixed (by Kenneth Graunke). Please see my comment in the report of bug #56610
Comment 38 Chris Wilson 2012-10-31 17:39:48 UTC
Stefano, could you try this idea from Ben to rule out mesa fouling up the context:

diff --git a/src/mesa/drivers/dri/i965/brw_vtbl.c b/src/mesa/drivers/dri/i965/brw_vtbl.c
index ca2e7a9..62d609b 100644
--- a/src/mesa/drivers/dri/i965/brw_vtbl.c
+++ b/src/mesa/drivers/dri/i965/brw_vtbl.c
@@ -178,7 +178,7 @@ static void brw_new_batch( struct intel_context *intel )
     * would otherwise be stored in the context (which for all intents and
     * purposes means everything).
     */
-   if (intel->hw_ctx == NULL)
+   if (intel->hw_ctx == NULL || 1)
       brw->state.dirty.brw |= BRW_NEW_CONTEXT;
 
    brw->state.dirty.brw |= BRW_NEW_BATCH;
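
The effect of that change can be pictured with a small dirty-flag model in C. With the condition forced true, every new batch is marked as needing all state re-emitted, so nothing is trusted to have survived in the hardware context. The toy_* names and flag values below are invented for illustration and are not Mesa's; always_new_context stands in for the literal "|| 1".

```c
#include <stddef.h>
#include <stdint.h>

/* Invented flag bits for the sketch; not Mesa's BRW_NEW_* values. */
enum {
    TOY_NEW_CONTEXT = 1u << 0, /* all state must be re-emitted */
    TOY_NEW_BATCH   = 1u << 1  /* per-batch state must be re-emitted */
};

struct toy_driver {
    void *hw_ctx;   /* non-NULL when a hardware context is in use */
    uint32_t dirty; /* accumulated dirty bits for the next draw */
};

/* Called at the start of each batch. always_new_context != 0 models the
 * patched condition, where the context flag is set unconditionally. */
void toy_new_batch(struct toy_driver *drv, int always_new_context)
{
    if (drv->hw_ctx == NULL || always_new_context)
        drv->dirty |= TOY_NEW_CONTEXT;

    drv->dirty |= TOY_NEW_BATCH;
}
```

If hangs persist even with the flag forced on, stale state saved in the context becomes a less likely culprit, which is the point of the experiment.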
Comment 39 Stefano Avallone 2012-10-31 18:46:43 UTC
(In reply to comment #38)
> Stefano, could you try this idea from Ben to rule out mesa fouling up the
> context:
 
Sure, I will try this patch and let you know.
Comment 40 Enrico Tagliavini 2012-11-06 17:48:59 UTC
For 5 days I have been running kwin patched with the fix found in bug #56610. The screen corruption is definitely gone, and the GPU hang is gone too! I'm not able to hang it anymore. The kernel is freshly compiled (I updated to 3.6.5) without Chris' patches. Mesa is also without patches.
Comment 41 Zoltan Kovacs 2012-11-09 20:21:30 UTC
KDE has been upgraded to 4.9.3 on my machine recently and it seems to be a lot better compared to the previous version. Animations of the effects now seem to be as smooth as they should be. With 4.9.2 it felt like playing a 3D game at low FPS.

The fix for the KDE problem pointed out in bug #56610 is included in this release, which could be what did the trick.
Comment 42 Chris Wilson 2012-11-21 18:04:15 UTC
Stefano, both Enrico and Zoltan report that their systems are stable with an updated kwin to avoid the incorrect msaa rendering, can you confirm? In which case we can close this as a side-effect of bug #56610.
Comment 43 Stefano Avallone 2012-11-22 07:41:53 UTC
Yes, no more GPU hangs since I updated kwin. I have been running both with the mesa patch suggested in comment #38 (for about 3 weeks) and without it (last week I updated to a clean mesa 9.0.1). Thanks for taking care of this issue.
Comment 44 Chris Wilson 2012-11-22 08:18:01 UTC
Thanks everyone.

*** This bug has been marked as a duplicate of bug 56610 ***
