63921 – [snb] GTT mapping fails after GPU hang

Summary: [snb] GTT mapping fails after GPU hang

Status:	RESOLVED WONTFIX

Alias:	None

Product:	libva
Classification:	Unclassified
Component:	intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	ykzhao
QA Contact:	Sean V Kelley

Duplicates (1):	63946

Reported:	2013-04-25 14:18 UTC by Nicolas Hillegeer
Modified:	2016-12-07 02:47 UTC
CC List:	3 users

See Also:	61668 63946

Attachments
X.org log for SNA crash (29.83 KB, text/plain) 2013-04-25 14:18 UTC, Nicolas Hillegeer
cat of /proc/meminfo, lots of vmalloc (1.14 KB, text/plain) 2013-04-25 16:04 UTC, Nicolas Hillegeer
Second time SNA crashed, once again took about an hour (28.10 KB, text/plain) 2013-04-25 16:05 UTC, Nicolas Hillegeer
Second crash, dmesg output (47.97 KB, text/plain) 2013-04-25 16:05 UTC, Nicolas Hillegeer
X.org crash with vaapi playback and UXA: [DRI2] Dri2SwapComplete: bad drawable (42.29 KB, text/plain) 2013-04-26 06:38 UTC, Nicolas Hillegeer
dmesg log with UXA crash, never seen this type before, appears to be something with chromium as well (52.04 KB, text/plain) 2013-04-26 06:39 UTC, Nicolas Hillegeer
MPlayer error when crashing with UXA: dri2GetRenderingBuffer: assertion 'buffers' failed (872 bytes, text/plain) 2013-04-26 06:41 UTC, Nicolas Hillegeer
Suppress spurious EIO when moving away from the gpu (1.58 KB, patch) 2013-04-26 07:52 UTC, Chris Wilson
Suppress spurious EIO when moving away from the gpu (1.88 KB, patch) 2013-04-26 09:39 UTC, Chris Wilson
Xorg.0.log after first patch Chris (27.64 KB, text/plain) 2013-04-26 10:36 UTC, Nicolas Hillegeer
dmesg after first patch Chris (SNA) (981 bytes, text/plain) 2013-04-26 10:36 UTC, Nicolas Hillegeer
Potentially make reset work (1.24 KB, patch) 2013-04-26 23:33 UTC, Ben Widawsky
dmesg after Ben's #2 patch (SNA), X.org crashed (1.65 KB, text/plain) 2013-04-27 14:31 UTC, Nicolas Hillegeer
Xorg.0.log after patch #2 by Ben (28.07 KB, text/plain) 2013-04-27 14:32 UTC, Nicolas Hillegeer
Patch to get error state out on low/fragmented memory situations (5.15 KB, patch) 2013-05-16 13:53 UTC, Mika Kuoppala
for v3.9.2 (19.30 KB, patch) 2013-05-24 10:49 UTC, Mika Kuoppala
i915_error_state2 after 32 concurrent video playback soft crash (could still access dmesg and debugfs) (1.01 MB, text/plain) 2013-05-24 12:27 UTC, Nicolas Hillegeer
i915_error_state2 after 32 concurrent video playback soft crash #2 (1.25 MB, text/plain) 2013-05-24 15:19 UTC, Nicolas Hillegeer

Description Nicolas Hillegeer 2013-04-25 14:18:36 UTC

Created attachment 78473 [details]
X.org log for SNA crash

Even though the error mentioned above is SNA-only, a very similar one occurs with UXA. Unfortunately I trashed my old Xorg log by rebooting twice but if I test again with UXA I'll paste it too. I inspected the errno (5) and that stands for input/output error, which is the same thing that was mentioned by the UXA log, I recall.

I've been trying to play videos with vaapi-mplayer because I want to use it as a media station that can run for at least a day. Usually the quality is top notch and the CPU usage is very low, which motivates me more to try to get this to work stably.

Unfortunately, due to the multitude of packages involved and the fact that the crashes happen seemingly at random (sometimes it takes an hour, sometimes 2 days), I can't test all permutations. I think I could use some help in diagnosing it better. I started upgrading packages because I thought it would alleviate the issue. I started out with all stock debian wheezy packages and ended up with the current install with SNA after lots of compiling.

In my limited testing it appears that SNA crashes much sooner than UXA. The last SNA run took about an hour to crash.

I've seen many crashes over different kernel versions and whatnot, and the general idea that I get is that the card somehow runs out of memory or some such, which is possible I guess since the movies are large and maybe something in the chain does not release the VA surfaces...

If this is true it would seem that both UXA and SNA are leaking, but SNA at a faster rate. Just some speculation here but maybe there could be (switchable) support for VA surface expiration somewhere...

Should I perhaps cross-post to the libva bug-tracker as well?

Graphics card: Intel HD3000 (Sandy Bridge)

Distribution: debian wheezy
Kernel: 3.9-rc8 (latest as of this writing)
libdrm: 2.4.43 (compiled from debian git)
mesa: 8.0.5-4 (stock)
intel-vaapi-driver: 1.0.21.pre1 (I tried stock too, of course)
libva: 1.1.2.pre1 (VA-API version 0.33.0)
intel-xorg-driver: 2.21.6

Anything else I could try?

Btw, I always check i915_error_state and I've never seen anything but this:

# cat /sys/kernel/debug/dri/0/i915_error_state 
cat: /sys/kernel/debug/dri/0/i915_error_state: Cannot allocate memory
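
For reference, the two error codes that show up in this report can be looked up with Python's standard `errno` module. This is only a quick sketch: errno 5 is the value reported in the X.org log, and "Cannot allocate memory" is the message glibc normally prints for ENOMEM. The exact strings returned by `os.strerror` depend on the platform's C library, so they are printed here rather than assumed.

```python
import errno
import os

# errno 5 (EIO) is the value seen in the X.org log; ENOMEM is the usual
# code behind a "Cannot allocate memory" message from cat.
for code in (errno.EIO, errno.ENOMEM):
    # errorcode maps the number to its symbolic name; strerror gives the
    # human-readable text used by the local C library.
    print(code, errno.errorcode[code], os.strerror(code))
```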

Comment 1 Chris Wilson 2013-04-25 15:21:12 UTC

If you can grab the i915_error_state (kill X, kill everything, and try again) file a bug against libva for causing the GPU hang. However, I blame Daniel for never believing me when I sent patches to prevent this... :-p

Comment 2 Nicolas Hillegeer 2013-04-25 15:27:01 UTC

(In reply to comment #1)
> If you can grab the i915_error_state (kill X, kill everything, and try
> again) file a bug against libva for causing the GPU hang. However, I blame
> Daniel for never believing me when I sent patches to prevent this... :-p

At the bottom of my original bug report I had already mentioned that I never got anything substantial out of i915_error_state. I'll try again on the next crash, but last time I went so far as to kill apache, mongodb, fluentd, sshd, rpcbind, ... basically anything I could get my hands on (trust me, the process list was really small after that, with memory usage below 100 MB according to htop), and it still told me it had insufficient memory. Is there anything special I could try? Btw, X kills itself after that error, so there's no need to kill it twice.

The unit only reports having about 1.8GB of memory (should be 2GB but hey...). Does the unit just have too little memory for a good crash report? Is there maybe something I can use to force linux to flush everything? I have to admit I don't really know how gem/dri/"the driver" tries to allocate that buffer.

Should I report to the libva list regardless of getting a good i915_error_state trace?

It's a bit heartening to know that you at least seem to be aware that this could be an issue! Were my suspicions correct about it not releasing GPU memory or is it something else entirely? No need for a big explanation, I guess I'm just curious :)

Comment 3 Nicolas Hillegeer 2013-04-25 16:04:53 UTC

Created attachment 78479 [details]
cat of /proc/meminfo, lots of vmalloc

Comment 4 Nicolas Hillegeer 2013-04-25 16:05:30 UTC

Created attachment 78480 [details]
Second time SNA crashed, once again took about an hour

Comment 5 Nicolas Hillegeer 2013-04-25 16:05:56 UTC

Created attachment 78481 [details]
Second crash, dmesg output

Comment 6 Nicolas Hillegeer 2013-04-25 16:07:50 UTC

(In reply to comment #1)
> If you can grab the i915_error_state (kill X, kill everything, and try
> again) file a bug against libva for causing the GPU hang. However, I blame
> Daniel for never believing me when I sent patches to prevent this... :-p

Ok, so it happened again and this time I tried literally disabling everything I could find. At the end htop was reporting 56 MB out of 1804 MB in use, with 25 tasks left.

I used the following to try and make linux flush some stuff:

sync && echo 3 > /proc/sys/vm/drop_caches

After that I piped /proc/meminfo to a file, which you can see in the attachments. I'm not sure, but that vmalloc number seems ridiculously high. Is that normal?
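
To pull just the vmalloc figures out of a saved /proc/meminfo dump, something like the following sketch should work. The sample text is invented for illustration and is not taken from the attachment. One point worth verifying against the kernel's procfs documentation: on x86-64, `VmallocTotal` is generally understood to report the size of the kernel's vmalloc address range rather than memory in use, so a very large value there is expected, and `VmallocUsed` is the more telling number.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Illustrative sample only; these are NOT the values from the attachment.
SAMPLE = """\
MemTotal:        1847296 kB
MemFree:         1700000 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      350000 kB
VmallocChunk:   34359380000 kB
"""

info = parse_meminfo(SAMPLE)
for key in ("VmallocTotal", "VmallocUsed", "VmallocChunk"):
    print(f"{key}: {info[key]} kB")
```

On the affected machine the input would come from `open("/proc/meminfo").read()` instead of the sample string.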

Should I take this to the libva bug tracker?

Comment 7 Chris Wilson 2013-04-25 19:42:29 UTC

More interesting fail:

[ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4366.361694] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state
[ 4366.871407] [drm:i915_reset] *ERROR* Failed to reset chip.
[ 4376.364660] [drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for the gpu reset to complete


Pity we can't get the i915_error_state back, but if you can isolate the cause of the hang, that will be enough for libva (or whoever) to work on.

Comment 8 Ben Widawsky 2013-04-25 20:58:07 UTC

(In reply to comment #7)
> More interesting fail:
> 
> [ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed...
> GPU hung
> [ 4366.361694] [drm] capturing error event; look for more information
> in/sys/kernel/debug/dri/0/i915_error_state
> [ 4366.871407] [drm:i915_reset] *ERROR* Failed to reset chip.
> [ 4376.364660] [drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for
> the gpu reset to complete
> 
> 
> Pity we can't get the i915_error_state back, but if you can isolate the
> cause of the hang that will be enough for the libva (or whoever to work on).

Please try this specifically for the reset hang:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8539177..df9dfa5 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -4036,6 +4036,10 @@ i915_gem_init_hw(struct drm_device *dev)
                I915_WRITE(GEN7_MSG_CTL, temp);
        }
 
+       DRM_ERROR("Forcing no wait on PCH (SNB ONLY)\n");
+       I915_WRITE(ILK_DISPLAY_CHICKEN2,
+                  I915_READ(ILK_DISPLAY_CHICKEN2) & ~(0x3 << 14));
+
        i915_gem_l3_remap(dev);
 
        i915_gem_init_swizzling(dev);
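
A small worked example of the mask arithmetic in the patch above, written in Python for easy experimentation. The register contents are made-up values, the constant names are illustrative, and the 32-bit width is an assumption; the sketch only shows that `& ~(0x3 << 14)` clears bits 15:14 and leaves every other bit untouched.

```python
# Worked example of the mask used in the patch above.
# The register contents are invented; real values depend on the hardware.

REG_WIDTH_MASK = 0xFFFFFFFF   # assumption: 32-bit MMIO register
PCH_WAIT_BITS = 0x3 << 14     # bits 15:14, i.e. 0x0000C000 (name is illustrative)


def clear_pch_wait_bits(value):
    """Python equivalent of `value & ~(0x3 << 14)` on a 32-bit register."""
    return value & ~PCH_WAIT_BITS & REG_WIDTH_MASK


before = 0x0000C123           # hypothetical contents with both bits set
after = clear_pch_wait_bits(before)
print(f"{before:#010x} -> {after:#010x}")   # 0x0000c123 -> 0x00000123
```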

Comment 9 Nicolas Hillegeer 2013-04-25 22:13:29 UTC

(In reply to comment #8)
> (In reply to comment #7)
> > More interesting fail:
> > 
> > [ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed...
> > GPU hung
> > [ 4366.361694] [drm] capturing error event; look for more information
> > in/sys/kernel/debug/dri/0/i915_error_state
> > [ 4366.871407] [drm:i915_reset] *ERROR* Failed to reset chip.
> > [ 4376.364660] [drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for
> > the gpu reset to complete
> > 
> > 
> > Pity we can't get the i915_error_state back, but if you can isolate the
> > cause of the hang that will be enough for the libva (or whoever to work on).
> 
> Please try this specifically for the reset hang:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c
> b/drivers/gpu/drm/i915/i915_gem.c
> index 8539177..df9dfa5 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -4036,6 +4036,10 @@ i915_gem_init_hw(struct drm_device *dev)
>                 I915_WRITE(GEN7_MSG_CTL, temp);
>         }
>  
> +       DRM_ERROR("Forcing no wait on PCH (SNB ONLY)\n");
> +       I915_WRITE(ILK_DISPLAY_CHICKEN2,
> +                  I915_READ(ILK_DISPLAY_CHICKEN2) & ~(0x3 << 14));
> +
>         i915_gem_l3_remap(dev);
>  
>         i915_gem_init_swizzling(dev);

@Chris:

I'm not sure I'll deduce the cause of the hang by just repeating my testing (i.e. playing a lot of movies, often multiple at a time). For SNA it seems to happen about every hour; with UXA it takes quite a bit longer. I'm trying UXA again and it's been going for 4 hours now. I hope I get some genius insight in the near future :).

@Ben:

I'll see what I can do! That's some crazy magic right there. I'll report tomorrow with the results. Is this codepath supposed to be called with UXA, SNA or both? Because I will start testing with SNA, since that seems to crash earlier and more predictably.

Thanks!
Nicolas

Comment 10 Nicolas Hillegeer 2013-04-25 22:38:05 UTC

> Please try this specifically for the reset hang:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c
> b/drivers/gpu/drm/i915/i915_gem.c
> index 8539177..df9dfa5 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -4036,6 +4036,10 @@ i915_gem_init_hw(struct drm_device *dev)
>                 I915_WRITE(GEN7_MSG_CTL, temp);
>         }
>  
> +       DRM_ERROR("Forcing no wait on PCH (SNB ONLY)\n");
> +       I915_WRITE(ILK_DISPLAY_CHICKEN2,
> +                  I915_READ(ILK_DISPLAY_CHICKEN2) & ~(0x3 << 14));
> +
>         i915_gem_l3_remap(dev);
>  
>         i915_gem_init_swizzling(dev);

Btw, I applied this patch manually, since I don't know the right magic incantation for it. It seems that we have slightly different versions of the kernel (mine is 3.9-rc8). Here is the complete, adjusted i915_gem_init_hw function I'm currently compiling:

int
i915_gem_init_hw(struct drm_device *dev)
{
	drm_i915_private_t *dev_priv = dev->dev_private;
	int ret;

	if (INTEL_INFO(dev)->gen < 6 && !intel_enable_gtt())
		return -EIO;

	if (IS_HASWELL(dev) && (I915_READ(0x120010) == 1))
		I915_WRITE(0x9008, I915_READ(0x9008) | 0xf0000);

	DRM_ERROR("Forcing no wait on PCH (SNB ONLY)\n");
	I915_WRITE(ILK_DISPLAY_CHICKEN2,
		   I915_READ(ILK_DISPLAY_CHICKEN2) & ~(0x3 << 14));

	i915_gem_l3_remap(dev);

	i915_gem_init_swizzling(dev);

	ret = i915_gem_init_rings(dev);
	if (ret)
		return ret;

	/*
	 * XXX: There was some w/a described somewhere suggesting loading
	 * contexts before PPGTT.
	 */
	i915_gem_context_init(dev);
	i915_gem_init_ppgtt(dev);

	return 0;
}

Is it still ok?

Comment 11 Ben Widawsky 2013-04-25 23:15:32 UTC

(In reply to comment #10)
[snip]
> Is it still ok?
yes

Comment 12 Nicolas Hillegeer 2013-04-26 06:38:54 UTC

Created attachment 78498 [details]
X.org crash with vaapi playback and UXA: [DRI2] Dri2SwapComplete: bad drawable

Comment 13 Nicolas Hillegeer 2013-04-26 06:39:56 UTC

Created attachment 78499 [details]
dmesg log with UXA crash, never seen this type before, appears to be something with chromium as well

Comment 14 Nicolas Hillegeer 2013-04-26 06:41:10 UTC

Created attachment 78500 [details]
MPlayer error when crashing with UXA: dri2GetRenderingBuffer: assertion 'buffers' failed

Comment 15 Nicolas Hillegeer 2013-04-26 06:44:04 UTC

(In reply to comment #11)
> (In reply to comment #10)
> [snip]
> > Is it still ok?
> yes

Ok, perfect. I compiled the kernel again with make-kpkg, the Debian kernel package builder. I did not run make-kpkg clean because I wanted the compile to finish quicker. Looking at the modification dates, I see that i915_gem.o, i915.o, modules.order and i915.ko have been regenerated, so I think that's ok.

Btw, I just tried another run with UXA. Sometimes, with UXA, it does not crash the X server, but it does disable acceleration and makes vaapi-mplayer unable to get another surface. The result can be seen in the last 3 files I posted.

I had already almost forgotten about the DRI2SwapComplete: Bad Drawable messages, but they nearly always happen "near the end" when UXA is used. Correct me if I'm wrong here but I thought DRI had something to do with Mesa? Is that also involved here? I could try upgrading to the latest (9.1.1), should I do that?

I'm going to cross-post on the libva list and see if those guys can chime in as well.

Comment 16 Daniel Vetter 2013-04-26 06:47:17 UTC

DRI is just the direct rendering protocol between X server and clients and used by both Mesa for OpenGL and libva for video decoding. Once the gpu is hung (and reset didn't work) the X server tells the client that by refusing to pass on new buffers.

Comment 17 Chris Wilson 2013-04-26 07:35:03 UTC

Except that "SwapBuffers: Bad Drawable" is indicative of a stupid client bug requesting swaps on random windows (i.e. windows it has not created DRI2 surfaces for or subsequently closed). What has libva done this time?

Comment 18 Nicolas Hillegeer 2013-04-26 07:38:38 UTC

(In reply to comment #11)
> (In reply to comment #10)
> [snip]
> > Is it still ok?
> yes

Ben, I just tried your patch with SNA acceleration and something new happened: the system froze, I can still see the videos on the screen but it does nothing anymore. I can't connect to the system with ssh anymore either, it seems totally non-responsive. I'll have to reboot.

@Daniel: ah, nice, thanks for the explanation!

@Chris: hmmm, though it is strange that this can take a very long amount of time (by the time UXA crashes, I think 1000's of vaapi-mplayer instances have been created). Maybe you can get a similar error once a certain resource is exhausted?

Comment 19 Chris Wilson 2013-04-26 07:52:32 UTC

Created attachment 78502 [details] [review]
Suppress spurious EIO when moving away from the gpu

This should keep the kernel functioning in this extreme case.

Comment 20 Nicolas Hillegeer 2013-04-26 07:57:48 UTC

(In reply to comment #19)
> Created attachment 78502 [details] [review]
> Suppress spurious EIO when moving away from the gpu
> 
> This should keep the kernel functioning in this extreme case.

Should I try this with Ben's patch or standalone?

Comment 21 Chris Wilson 2013-04-26 08:01:36 UTC

Since Ben's patch seems to have caused the machine to freeze upon GPU reset, I'd highly recommend dropping it.

Comment 22 Nicolas Hillegeer 2013-04-26 08:34:51 UTC

(In reply to comment #21)
> Since Ben's patch seems to have caused the machine to freeze upon GPU reset,
> I'd highly recommend to drop it.

Ok, I'll do that.

It appears we're working with different kernel versions here, as the patch does not apply cleanly. It seems it's mostly line number changes so I'll manually adjust for now. Which kernel version are you developing on?

Comment 23 Chris Wilson 2013-04-26 09:37:10 UTC

I work on drm-intel-next[-queued] which is now post-3.10: http://cgit.freedesktop.org/~danvet/drm-intel

Comment 24 Chris Wilson 2013-04-26 09:39:58 UTC

Created attachment 78511 [details] [review]
Suppress spurious EIO when moving away from the gpu

Against v3.9-rc8, and added one more EIO check.

Comment 25 Nicolas Hillegeer 2013-04-26 09:41:46 UTC

(In reply to comment #23)
> I work on drm-intel-next[-queued] which is now post-3.10:
> http://cgit.freedesktop.org/~danvet/drm-intel

Alright. I guess I'll keep working on 3.9 then (unless you think it's important I change) and just backport your changes, so that if (when) this is fixed I have a reasonably stable kernel on which I can run my video player :). I'm currently testing with SNA, so hopefully if it crashes it will do so in about 30 minutes.

I already manually backported the changes, compiled and am running. If it crashes, I'll report, add the extra fix, recompile and test again.

Comment 26 Mika Kuoppala 2013-04-26 10:01:58 UTC

(In reply to comment #25)
> (In reply to comment #23)
> > I work on drm-intel-next[-queued] which is now post-3.10:
> > http://cgit.freedesktop.org/~danvet/drm-intel
> 
> Alright. I guess I'll keep working on 3.9 then (unless you think it's
> important I change) and just backport  your changes. So that if (when) this
> is fixed I have a reasonably stable kernel on which I can run my video
> player :). I'm currently testing with SNA, so hopefully if it crashes it
> will do so in about 30 minutes.
> 
> I already manually backported the changes, compiled and am running. If it
> crashes, I'll report, add the extra fix, recompile and test again.

Just to make sure, there are crashes with and without:

'[ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung'

in dmesg?

Please always check the dmesg after crash and report if above line is present and if it is try to get error state (if we would be so lucky).
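
The check requested above can be scripted. The sketch below scans saved dmesg text for the three i915 messages quoted earlier in this report and records their timestamps. The function and dictionary names are illustrative, and the message fragments are copied from the log excerpt in comment 7, so other kernel versions may word them differently.

```python
import re

# Message fragments copied from the dmesg excerpt in comment 7.
PATTERNS = {
    "gpu_hung": "Hangcheck timer elapsed... GPU hung",
    "reset_failed": "Failed to reset chip",
    "reset_timed_out": "Timed out waiting for the gpu reset to complete",
}

# Matches the leading "[ 4366.361686] " kernel timestamp.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*")


def scan_dmesg(text):
    """Return {event: [timestamps]} for each known i915 hang message."""
    hits = {name: [] for name in PATTERNS}
    for line in text.splitlines():
        match = TIMESTAMP.match(line)
        stamp = float(match.group(1)) if match else None
        for name, fragment in PATTERNS.items():
            if fragment in line:
                hits[name].append(stamp)
    return hits


SAMPLE = """\
[ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4366.361694] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state
[ 4366.871407] [drm:i915_reset] *ERROR* Failed to reset chip.
[ 4376.364660] [drm:i915_gem_wait_for_error] *ERROR* Timed out waiting for the gpu reset to complete
"""

report = scan_dmesg(SAMPLE)
for event, stamps in report.items():
    print(event, stamps)
```

On a live system the text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may need root where access to the kernel log is restricted.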

Comment 27 Nicolas Hillegeer 2013-04-26 10:35:35 UTC

(In reply to comment #26)
> (In reply to comment #25)
> > (In reply to comment #23)
> > > I work on drm-intel-next[-queued] which is now post-3.10:
> > > http://cgit.freedesktop.org/~danvet/drm-intel
> > 
> > Alright. I guess I'll keep working on 3.9 then (unless you think it's
> > important I change) and just backport  your changes. So that if (when) this
> > is fixed I have a reasonably stable kernel on which I can run my video
> > player :). I'm currently testing with SNA, so hopefully if it crashes it
> > will do so in about 30 minutes.
> > 
> > I already manually backported the changes, compiled and am running. If it
> > crashes, I'll report, add the extra fix, recompile and test again.
> 
> Just to make sure, there are crashes with and without:
> 
> '[ 4366.361686] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed...
> GPU hung'
> 
> in dmesg?
> 
> Please always check the dmesg after crash and report if above line is
> present and if it is try to get error state (if we would be so lucky).

I mostly check dmesg and I can confirm that the line is always there. Right now it just crashed again, I will upload the lines with 'drm' in them from dmesg so you can see. I will also upload the newest Xorg.0.log.

@Chris: so it crashes as well with the first patch you gave me. I'll see if the fifth check helps now.

Comment 28 Nicolas Hillegeer 2013-04-26 10:36:02 UTC

Created attachment 78515 [details]
Xorg.0.log after first patch Chris

Comment 29 Nicolas Hillegeer 2013-04-26 10:36:33 UTC

Created attachment 78516 [details]
dmesg after first patch Chris (SNA)

Comment 30 Chris Wilson 2013-04-26 10:41:49 UTC

Try adding

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 87c62cc..2bd8d7a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1334,6 +1334,8 @@ int i915_gem_fault(struct vm_area_struct *vma, struct vm_f
        bool write = !!(vmf->flags & FAULT_FLAG_WRITE);
 
        ret = i915_mutex_lock_interruptible(dev);
+       if (ret == -EIO)
+               ret = mutex_lock_interruptible(&dev->struct_mutex);
        if (ret)
                goto out;
 

as well as the other EIO suppressions.

Comment 31 Nicolas Hillegeer 2013-04-26 10:58:42 UTC

(In reply to comment #30)
> Try adding
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c
> b/drivers/gpu/drm/i915/i915_gem.c
> index 87c62cc..2bd8d7a 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1334,6 +1334,8 @@ int i915_gem_fault(struct vm_area_struct *vma, struct
> vm_f
>         bool write = !!(vmf->flags & FAULT_FLAG_WRITE);
>  
>         ret = i915_mutex_lock_interruptible(dev);
> +       if (ret == -EIO)
> +               ret = mutex_lock_interruptible(dev);
>         if (ret)
>                 goto out;
>  
> 
> as well as the other EIO suppressions.

I'm testing now with the 6 EIO suppressions you gave me. 

An interesting note: I just checked on my Ivy Bridge (HD 4000, gen7) system, which I was running with the same test: it's still running! It's been 4 days now, I think. It is pushing 16 (!) 1080p videos of between 100 and 200 MB each and constantly getting its surfaces destroyed, same as the Sandy Bridge systems. This is with all stock debian wheezy packages, except for kernel 3.9-rc8. I have seen the Ivy Bridge lock up, but that was with wheezy's 3.2. Maybe the Ivy Bridge's stability in this round is just a fluke though...

Comment 32 Nicolas Hillegeer 2013-04-26 11:00:21 UTC

Sorry for the spam, but I forgot to add something: the Ivy Bridge system I was talking about has a lot of dri2SwapComplete: bad drawable notices in its Xorg.0.log, but it still keeps on playing all the videos just fine. There's nothing in dmesg for that system.

Comment 33 Nicolas Hillegeer 2013-04-26 12:02:26 UTC

(In reply to comment #30)
> Try adding
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c
> b/drivers/gpu/drm/i915/i915_gem.c
> index 87c62cc..2bd8d7a 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1334,6 +1334,8 @@ int i915_gem_fault(struct vm_area_struct *vma, struct
> vm_f
>         bool write = !!(vmf->flags & FAULT_FLAG_WRITE);
>  
>         ret = i915_mutex_lock_interruptible(dev);
> +       if (ret == -EIO)
> +               ret = mutex_lock_interruptible(dev);
>         if (ret)
>                 goto out;
>  
> 
> as well as the other EIO suppressions.

I used all the EIO suppressions you gave me (6 in total) and the result was the same as with all the other patches (your first patch and Ben's patch): the system completely locks up and I can't even ssh in.

Unfortunately I don't find anything in the logs after reboot (dmesg.0, Xorg.0.old and my app log).

Comment 34 Chris Wilson 2013-04-26 12:32:20 UTC

That patch shouldn't itself cause a lockup - so I wonder if the system would have eventually locked up anyway if X didn't die first. But that the patch eliminates the false SIGBUS and failed mmaps is a good sign! :)

Comment 35 Nicolas Hillegeer 2013-04-26 12:39:10 UTC

(In reply to comment #34)
> That patch shouldn't itself cause a lockup - so I wonder if the system would
> have eventually locked up anyway if X didn't die first. But that the patch
> eliminates the false SIGBUS and failed mmaps is a good sign! :)
 
Good question, my intuition also tells me that it would've locked up eventually. But that's because I'm still under the assumption that it's a resource exhaustion problem, which might be completely false, since I've never dealt with video card code in my life. In this way X serves as a full-crash shield, how nice ;).

The system locking up completely does make it very hard to diagnose anything at all though. Any idea on something I could try?

Comment 36 Chris Wilson 2013-04-26 12:50:01 UTC

You can try either:

i915.i915_enable_rc6=0

or

i915.reset=0

and see if they stop the hard hangs.
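
To confirm which of these options a running kernel was actually booted with, the kernel command line can be inspected. A minimal sketch, assuming the options are passed as `i915.<name>=<value>` on the command line as written above; the helper name and the sample command line are hypothetical.

```python
def i915_params(cmdline):
    """Extract i915.* module options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("i915.") and "=" in token:
            key, _, value = token[len("i915."):].partition("=")
            params[key] = value
    return params


# Hypothetical command line, for illustration only.
SAMPLE_CMDLINE = (
    "BOOT_IMAGE=/vmlinuz-3.9.0-rc8 root=/dev/sda1 ro quiet "
    "i915.i915_enable_rc6=0 i915.reset=0"
)

print(i915_params(SAMPLE_CMDLINE))
```

On the test machine the input would be `open("/proc/cmdline").read()`; an empty result means neither option is in effect via the command line.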

Comment 37 Nicolas Hillegeer 2013-04-26 15:09:50 UTC

(In reply to comment #36)
> You can try either:
> 
> i915.i915_enable_rc6=0
> 
> or
> 
> i915.reset=0
> 
> and see if they stop the hard hangs.

Unfortunately that doesn't seem to be the case, drat.

Comment 38 Ben Widawsky 2013-04-26 21:10:29 UTC

(In reply to comment #21)
> Since Ben's patch seems to have caused the machine to freeze upon GPU reset,
> I'd highly recommend to drop it.

Can we please also try setting both bits to 1, instead of clearing them? I'm pretty concerned about the reset failure.

Comment 39 Ben Widawsky 2013-04-26 23:33:31 UTC

Created attachment 78538 [details] [review]
Potentially make reset work

Based on 3.9-rc8.

Please run, reproduce hang, and post the result of dmesg | grep PCH

Comment 40 Nicolas Hillegeer 2013-04-27 10:20:08 UTC

(In reply to comment #39)
> Created attachment 78538 [details] [review]
> Potentially make reset work
> 
> Based on 3.9-rc8.
> 
> Please run, reproduce hang, and and post the result of dmesg | grep PCH

Ok, I will start compiling the kernel with your patch. With or without Chris' EIO suppressions? I'm going to leave them out for now, since you want dmesg output and both your first and Chris' later patches made the system lock up in a way that I couldn't get to dmesg anymore. I even tried some magic sysrq tricks to no avail.

I'll report back after I reproduce the crash. 

Also, I don't know if you saw me mention it, but the Ivy Bridge system that I've now been running for 1.5 days with 32 (!) simultaneous videos (and before that for 4 days with 16 simultaneous videos, without a reboot) is still going strong without even showing the slowdown that invariably happens on the Sandy Bridge systems.

I'm not sure how differently the DRM (or any other subsystem really) treats gen6 and gen7, or how different the hardware really is, but maybe the solution is in there. Just an uneducated guess though.

Comment 41 Chris Wilson 2013-04-27 10:52:58 UTC

SandyBridge and IvyBridge are really two different beasts when it comes to the GPU. There are a lot of similarities, but the silicon is more than just an evolution of the SandyBridge design. Just to put things in perspective.

So the fact that IVB is stable, unlike SNB, doesn't rule out any part of the stack. Hopefully we can find a way to prevent the lockup, find a way to reset the GPU after the hang, and fix the hangs in the first place!

Comment 42 Nicolas Hillegeer 2013-04-27 13:21:18 UTC

(In reply to comment #39)
> Created attachment 78538 [details] [review]
> Potentially make reset work
> 
> Based on 3.9-rc8.
> 
> Please run, reproduce hang, and and post the result of dmesg | grep PCH

System came to a complete halt again, could not extract dmesg. But I noticed I was still running with i915.i915_enable_rc6=0, so I'll remove that and redo it.

Comment 43 Nicolas Hillegeer 2013-04-27 14:31:03 UTC

(In reply to comment #42)
> (In reply to comment #39)
> > Created attachment 78538 [details] [review]
> > Potentially make reset work
> > 
> > Based on 3.9-rc8.
> > 
> > Please run, reproduce hang, and and post the result of dmesg | grep PCH
> 
> System came to a complete halt again, could not extract dmesg. But I noticed
> I was running with i915.i915_disable_rc6=0. So I'll turn that off and redo
> it.

I tried again with no boot parameters and this time (thankfully) the X.org server crashed before completely locking up the system, allowing me to extract a dmesg! I'll upload both dmesg and Xorg.0.log right now.

Comment 44 Nicolas Hillegeer 2013-04-27 14:31:55 UTC

Created attachment 78556 [details]
dmesg after Ben's #2 patch (SNA), X.org crashed

Comment 45 Nicolas Hillegeer 2013-04-27 14:32:26 UTC

Created attachment 78557 [details]
Xorg.0.log after patch #2 by Ben

Comment 46 Ben Widawsky 2013-04-27 15:57:51 UTC

(In reply to comment #44)
> Created attachment 78556 [details]
> dmesg after Ben's #2 patch (SNA), X.org crashed

Thank you very much for collecting the data. Unfortunately I have no other ideas why the reset could fail.

Comment 47 Nicolas Hillegeer 2013-04-27 16:01:11 UTC

(In reply to comment #46)
> (In reply to comment #44)
> > Created attachment 78556 [details]
> > dmesg after Ben's #2 patch (SNA), X.org crashed
> 
> Thank you very much for collecting the data. Unfortunately I have no other
> ideas why the reset could fail.

No problem at all :). Sometimes these things take time to mull over. If anyone has any ideas they wanna try I'm always happy to patch, recompile, run and collect data.

Thanks already for the fantastic feedback guys, I hope someday we find this nasty little bug, wherever it be hiding.

Comment 48 Mika Kuoppala 2013-05-16 13:53:48 UTC

Created attachment 79420 [details] [review]
Patch to get error state out on low/fragmented memory situations

Comment 49 Nicolas Hillegeer 2013-05-16 14:11:08 UTC

(In reply to comment #48)
> Created attachment 79420 [details] [review]
> Patch to get error state out on low/fragmented memory situations

Patch doesn't apply cleanly to 3.9.2 (error at the end of message). Do you prefer if I try to backport or start using another kernel?

Checking patch drivers/gpu/drm/i915/i915_debugfs.c...
error: while searching for:
	.release = i915_error_state_release,
};

static int
i915_next_seqno_get(void *data, u64 *val)
{

error: patch failed: drivers/gpu/drm/i915/i915_debugfs.c:866
error: drivers/gpu/drm/i915/i915_debugfs.c: patch does not apply
Checking patch drivers/gpu/drm/i915/i915_drv.h...
Hunk #1 succeeded at 803 (offset -49 lines).

Comment 50 Mika Kuoppala 2013-05-24 10:49:00 UTC

Created attachment 79755 [details] [review]
for v3.9.2

Comment 51 Mika Kuoppala 2013-05-24 10:51:35 UTC

(In reply to comment #49)
> (In reply to comment #48)
> > Created attachment 79420 [details] [review]
> > Patch to get error state out on low/fragmented memory situations
> 
> Patch doesn't apply cleanly to 3.9.2 (error at the end of message). Do you
> prefer if I try to backport or start using another kernel?

I have attached a proper patch to get error state out even if
memory is fragmented, backported to 3.9.2.

Patch is also included in:
http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next-queued

Nicolas, could you please try the patch so that we could get
an error state for this bug. Thanks.

Comment 52 Nicolas Hillegeer 2013-05-24 10:56:46 UTC

(In reply to comment #51)
> (In reply to comment #49)
> > (In reply to comment #48)
> > > Created attachment 79420 [details] [review]
> > > Patch to get error state out on low/fragmented memory situations
> > 
> > Patch doesn't apply cleanly to 3.9.2 (error at the end of message). Do you
> > prefer if I try to backport or start using another kernel?
> 
> I have attached a proper patch to get error state out even if
> memory is fragmented, backported to 3.9.2.
> 
> Patch is also included in:
> http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next-queued
> 
> Nicolas, could you please try the patch so that we could get
> an error state for this bug. Thanks.

Mika, I had already more or less backported the patch to 3.9.3 (see the other bug report: https://bugs.freedesktop.org/show_bug.cgi?id=63946). Now I'm having difficulties getting it to crash without locking hard. Often it doesn't crash but lowers its output frequency as I explain in that bug report (which is for the same bug). I'm trying it out on a lot of units today, so hopefully I strike gold! I'll keep you guys posted. Thanks for the correct backport.

Comment 53 Mika Kuoppala 2013-05-24 11:12:15 UTC

(In reply to comment #52)
> (In reply to comment #51)
> > (In reply to comment #49)
> > > (In reply to comment #48)
> > > > Created attachment 79420 [details] [review]
> > > > Patch to get error state out on low/fragmented memory situations
> > > 
> > > Patch doesn't apply cleanly to 3.9.2 (error at the end of message). Do you
> > > prefer if I try to backport or start using another kernel?
> > 
> > I have attached a proper patch to get error state out even if
> > memory is fragmented, backported to 3.9.2.
> > 
> > Patch is also included in:
> > http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next-queued
> > 
> > Nicolas, could you please try the patch so that we could get
> > an error state for this bug. Thanks.
> 
> Mika, I had already more or less backported the patch to 3.9.3 (see the
> other bug report: https://bugs.freedesktop.org/show_bug.cgi?id=63946). Now
> I'm having difficulties getting it to crash without locking hard. Often it
> doesn't crash but lowers its output frequency as I explain in that bug
> report (which is for the same bug). I'm trying out on a lot of units today
> so hopefully I strike gold! I'll keep you guys posted. Thanks for the
> correct backport.

Nicolas, it is not a backport of the previous patch in this bug (that one was a vmalloc hack). The newly attached patch avoids seq_file completely and should also work in very low memory and/or fragmented situations.
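
For readers wondering why dropping seq_file helps here: a seq_file handler using single_open() formats its entire output into one contiguous kernel buffer, which can be hard to allocate when memory is low or fragmented. One way around that is to regenerate the report on each read and keep only the window of bytes the reader asked for. The Python sketch below is a user-space illustration of that idea only. It is not the attached patch, the names `WindowBuffer` and `read_window` are hypothetical, and whether the patch works exactly this way is an assumption.

```python
class WindowBuffer:
    """Keep only the bytes of a generated report that fall inside the
    half-open window [start, start + size)."""

    def __init__(self, start, size):
        self.start = start      # first byte offset the reader wants
        self.size = size        # maximum number of bytes to keep
        self.pos = 0            # offset of the next byte to be generated
        self.chunks = []        # pieces that landed inside the window

    def write(self, data):
        begin = self.pos
        self.pos += len(data)
        # Clip this piece of output against the requested window.
        lo = max(self.start - begin, 0)
        hi = min(self.start + self.size - begin, len(data))
        if lo < hi:
            self.chunks.append(data[lo:hi])

    def getvalue(self):
        return b"".join(self.chunks)


def read_window(lines, start, size):
    """Regenerate the report from `lines`, returning only one window of it."""
    buf = WindowBuffer(start, size)
    for line in lines:
        buf.write(line)
    return buf.getvalue()
```

The trade-off is that the report is regenerated for every read, in exchange for never needing a buffer larger than the reader's own request.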

Comment 54 Nicolas Hillegeer 2013-05-24 11:14:06 UTC

(In reply to comment #53)
> (In reply to comment #52)
> > (In reply to comment #51)
> > > (In reply to comment #49)
> > > > (In reply to comment #48)
> > > > > Created attachment 79420 [details] [review]
> > > > > Patch to get error state out on low/fragmented memory situations
> > > > 
> > > > Patch doesn't apply cleanly to 3.9.2 (error at the end of message). Do you
> > > > prefer if I try to backport or start using another kernel?
> > > 
> > > I have attached a proper patch to get error state out even if
> > > memory is fragmented, backported to 3.9.2.
> > > 
> > > Patch is also included in:
> > > http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-next-queued
> > > 
> > > Nicolas, could you please try the patch so that we could get
> > > an error state for this bug. Thanks.
> > 
> > Mika, I had already more or less backported the patch to 3.9.3 (see the
> > other bug report: https://bugs.freedesktop.org/show_bug.cgi?id=63946). Now
> > I'm having difficulties getting it to crash without locking hard. Often it
> > doesn't crash but lowers its output frequency as I explain in that bug
> > report (which is for the same bug). I'm trying out on a lot of units today
> > so hopefully I strike gold! I'll keep you guys posted. Thanks for the
> > correct backport.
> 
> Nicolas, it is not a backport of the previous patch in this bug (it was
> vmalloc hack). The newly attached patch avoids seq_file completely and
> should work in very low memory and/or fragmented situations also.

Acknowledged, that sounds good. I'll compile a fresh kernel and run with your newest patch.

Comment 55 Nicolas Hillegeer 2013-05-24 12:27:09 UTC

Created attachment 79765 [details]
i915_error_state2 after 32 concurrent video playback soft crash (could still access dmesg and debugfs)

Comment 56 Nicolas Hillegeer 2013-05-24 12:35:41 UTC

(In reply to comment #55)
> Created attachment 79765 [details]
> i915_error_state2 after 32 concurrent video playback soft crash (could still
> access dmesg and debugfs)

So finally I have an i915_error_state. I was lucky to have it soft crash this time. This was obtained through

$ cat /sys/kernel/debug/dri/0/i915_error_state2 > /home/user/i915_error_state && gzip /home/user/i915_error_state

kernel: 3.9.3 with Mika's first patch
libva: 1.2.0.pre1
intel-driver (vaapi): 1.1.0.pre1
vaapi version: 0.34.0
intel-xorg-driver: 2.21.7 
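
When testing on many units, the capture command above could be scripted. This is a minimal Python sketch, assuming the error state is exposed at the debugfs path used in this comment and that the script runs with enough privileges to read it. The function name `capture_error_state` is hypothetical. It compresses while copying, so no uncompressed intermediate file is written.

```python
import gzip


def capture_error_state(src, dst, chunk_size=64 * 1024):
    """Copy `src` into a gzip file at `dst`.

    Returns the number of uncompressed bytes copied.
    """
    total = 0
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            total += len(chunk)
    return total


# Example call (path taken from the command above; needs root and debugfs):
# capture_error_state("/sys/kernel/debug/dri/0/i915_error_state2",
#                     "/home/user/i915_error_state.gz")
```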

P.S.: dmesg output:

[    6.232455] [drm] Initialized drm 1.1.0 20060810
[    6.558826] [drm] Memory usable by graphics device = 2048M
[    6.604230] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[    6.604234] [drm] Driver supports precise vblank timestamp query.
[    6.627854] [drm] Wrong MCH_SSKPD value: 0x16040307
[    6.627859] [drm] This can cause pipe underruns and display issues.
[    6.627861] [drm] Please upgrade your BIOS to fix this.
[    6.645302] fbcon: inteldrmfb (fb0) is primary device
[    6.874096] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[    6.879106] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[    7.823019] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[ 4342.301602] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4342.301617] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state
[ 4454.300480] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4454.804436] [drm:i915_reset] *ERROR* Failed to reset chip.
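
To compare runs, the hang and reset-failure timestamps can be pulled out of a saved dmesg. This is a rough Python sketch, assuming the default `[ seconds.micros] message` timestamp format shown above and matching on the message strings from this log. The helper names are hypothetical.

```python
import re

# Matches lines such as "[ 4342.301602] [drm:i915_hangcheck_hung] ..."
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")


def parse_dmesg(text):
    """Return a list of (timestamp_seconds, message) tuples."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group("ts")), match.group("msg")))
    return events


def gpu_events(events):
    """Split out the timestamps of GPU hangs and failed resets."""
    hangs = [ts for ts, msg in events if "GPU hung" in msg]
    reset_failures = [ts for ts, msg in events if "Failed to reset chip" in msg]
    return hangs, reset_failures


SAMPLE = """\
[ 4342.301602] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4342.301617] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state
[ 4454.300480] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 4454.804436] [drm:i915_reset] *ERROR* Failed to reset chip.
"""
```

Run on the excerpt above, this reports two hangs roughly 112 seconds apart, with the failed reset logged about half a second after the second hang.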

Comment 57 Chris Wilson 2013-05-24 12:39:15 UTC

(In reply to comment #55)
> Created attachment 79765 [details]
> i915_error_state2 after 32 concurrent video playback soft crash (could still
> access dmesg and debugfs)

Shrug. vaapi-intel remains garbage. Here the bsd batchbuffer is itself executing a random address and ends up overwriting other batches.

Comment 58 Nicolas Hillegeer 2013-05-24 15:19:14 UTC

Created attachment 79770 [details]
i915_error_state2 after 32 concurrent video playback soft crash #2

A second run, don't know if it will give anything useful but when I try to diagnose bugs in my stuff I always like the contrast. So I'll provide at least two i915_error_state's. Can provide more on request. The text file is gzipped (just like the last one btw).

Comment 59 haihao 2013-05-28 06:35:48 UTC

(In reply to comment #57)
> (In reply to comment #55)
> > Created attachment 79765 [details]
> > i915_error_state2 after 32 concurrent video playback soft crash (could still
> > access dmesg and debugfs)
> 
> Shrug. vaapi-intel remains garbage. Here the bsd batchbuffer is itself
> executing a random address and ends up overwriting other batches.

Do you mean the following bsd batchbuffer is invalid ?

13608000   524288 3f 00 5a5200 0 purgeable bsd snooped (LLC)

bsd ring --- gtt_offset = 0x13608000
00000000 :  13000082
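
As an aid for reading ring dumps like the one above, here is a small Python sketch that decodes the header of a command dword. It assumes the usual i915 MI command encoding (command type in bits 31:29, MI opcode in bits 28:23) and uses a partial opcode table written from memory of the kernel's i915 register header, so treat both as assumptions to check against the hardware documentation. It only names the command; it does not say whether the batch is valid.

```python
# Partial table of MI opcodes (assumed values; verify before relying on them).
MI_OPCODES = {
    0x00: "MI_NOOP",
    0x04: "MI_FLUSH",
    0x0A: "MI_BATCH_BUFFER_END",
    0x20: "MI_STORE_DWORD_IMM",
    0x22: "MI_LOAD_REGISTER_IMM",
    0x26: "MI_FLUSH_DW",
    0x31: "MI_BATCH_BUFFER_START",
}


def decode_mi_header(dword):
    """Return (command_type, opcode, name) for a 32-bit command dword.

    opcode is None when the dword is not an MI command (type != 0).
    """
    cmd_type = (dword >> 29) & 0x7
    if cmd_type != 0:
        return cmd_type, None, "not an MI command"
    opcode = (dword >> 23) & 0x3F
    return cmd_type, opcode, MI_OPCODES.get(opcode, "unknown MI opcode")
```

Under these assumptions, the first dword of the bsd ring (0x13000082) decodes as command type 0 with opcode 0x26, which the table names MI_FLUSH_DW.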

Comment 60 Chris Wilson 2013-05-29 08:12:25 UTC

It's the render ring that is corrupt; my hypothesis was that the corruption was caused by commands from the bsd ring.

Comment 61 haihao 2015-11-19 03:11:40 UTC

I am not sure whether you are still experiencing this issue. Recently we found some cases on SNB with an old kernel that work well with GTT but hang with PPGTT. Is it possible to disable PPGTT or try a newer kernel?

Comment 62 haihao 2015-11-23 13:51:07 UTC

*** Bug 63946 has been marked as a duplicate of this bug. ***

Comment 63 ykzhao 2015-11-26 04:04:14 UTC

Hi, Nicolas

    Can the issue still be reproduced with the latest intel-driver?

    If it can, please attach the error log from /sys/class/drm/card0/error.

BTW: If the issue can be reproduced, could you try disabling PPGTT and see whether that helps?
     PPGTT can be disabled by adding the kernel option "i915.enable_ppgtt=0".

Thanks
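
To confirm that the option actually reached the running kernel, the boot command line can be checked. This is a minimal Python sketch with a hypothetical helper name. The parameter name is passed in as an argument because older kernels may spell it differently, and taking the last occurrence as the effective one is an assumption.

```python
def kernel_param(cmdline, name="i915.enable_ppgtt"):
    """Return the value of `name` on a kernel command line, or None.

    If the parameter appears more than once, the last value is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


# Example use on a running system:
# with open("/proc/cmdline") as f:
#     print(kernel_param(f.read()))
```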

Comment 64 haihao 2016-12-07 02:47:32 UTC

I am closing this bug as WONTFIX because there has been no response for over a year. Please feel free to reopen it if you still have this issue on SNB.
