78547 – [gen3 3.14 regression] Bad stolen


Summary: [gen3 3.14 regression] Bad stolen

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	x86 (IA32) Linux (All)

Importance:	highest (priority), normal (severity)
Assignee:	Ville Syrjala
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-05-10 22:46 UTC by malstrond
Modified:	2016-10-19 09:40 UTC
CC List:	1 user

See Also:
i915 platform:
i915 features:

Attachments
Crash dump from /sys/class/drm/card0/error (695.99 KB, text/plain) 2014-05-10 22:46 UTC, malstrond
Try 1: Crashdump (674.63 KB, text/plain) 2014-05-11 15:24 UTC, malstrond
Try 1: Kernel log (12.98 KB, text/plain) 2014-05-11 15:25 UTC, malstrond
Try 2: Crashdump (675.04 KB, text/plain) 2014-05-11 15:25 UTC, malstrond
Try 2: Kernel log (1.08 KB, text/plain) 2014-05-11 15:26 UTC, malstrond
Try 3: Crashdump (675.04 KB, text/plain) 2014-05-11 15:26 UTC, malstrond
Try 3: Kernel log (1.08 KB, text/plain) 2014-05-11 15:26 UTC, malstrond
Try 4: Crashdump (696.07 KB, text/plain) 2014-05-11 15:27 UTC, malstrond
Try 4: Kernel log (967 bytes, text/plain) 2014-05-11 15:27 UTC, malstrond
Try 5: Crashdump (696.00 KB, text/plain) 2014-05-11 15:28 UTC, malstrond
Try 5: Kernel log (968 bytes, text/plain) 2014-05-11 15:28 UTC, malstrond
Hang when executing xrandr: Crashdump (696.09 KB, text/plain) 2014-05-11 18:41 UTC, malstrond
Hang when executing xrandr: Kernel log (annotated) (25.55 KB, text/plain) 2014-05-11 18:42 UTC, malstrond

Description malstrond 2014-05-10 22:46:06 UTC

Created attachment 98831
Crash dump from /sys/class/drm/card0/error

When trying to launch XBMC 13.0 on my ThinkPad X41 Tablet with a 915GM, the display gets filled with a solid color, the GPU hangs and does not recover.
XBMC doesn't get far enough to generate its own error log.
I also don't know if this is a regression because I never tried to use XBMC before.
OS: Arch Linux x86 with Kernel 3.14.2, Mesa 10.1.3.


These are the only related messages in the kernel log (dmesg):

[drm] GPU crash dump saved to /sys/class/drm/card0/error
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
i915: render error detected, EIR: 0x00000010
i915: page table error
i915:   PGTBL_ER: 0x00100003
[drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
i915: render error detected, EIR: 0x00000010
i915: page table error
i915:   PGTBL_ER: 0x00100003
[drm] stuck on render ring
[drm:i915_reset] *ERROR* Failed to reset chip: -19
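
The EIR value in the log is a bitmask of error sources, and the driver's "page table error" line is its decoding of bit 4 (0x10). A minimal C sketch of that decoding follows. The bit assignments are assumed from the gen3 definitions in the kernel's i915_reg.h and should be checked against the actual tree; decode_eir() is a hypothetical helper, not a driver function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed gen3 EIR bits (mirroring I915_ERROR_INSTRUCTION,
 * I915_ERROR_MEMORY_REFRESH and I915_ERROR_PAGE_TABLE in i915_reg.h). */
#define EIR_INSTRUCTION    (1u << 0)
#define EIR_MEMORY_REFRESH (1u << 1)
#define EIR_PAGE_TABLE     (1u << 4)

/* Write a comma-separated list of the recognised error sources set in
 * `eir` into `buf` (always NUL-terminated) and return how many were set. */
static int decode_eir(uint32_t eir, char *buf, size_t len)
{
    static const struct { uint32_t bit; const char *name; } table[] = {
        { EIR_INSTRUCTION,    "instruction error" },
        { EIR_MEMORY_REFRESH, "memory refresh error" },
        { EIR_PAGE_TABLE,     "page table error" },
    };
    int found = 0;

    assert(len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (!(eir & table[i].bit))
            continue;
        /* Append a separator before every name except the first. */
        if (found++)
            strncat(buf, ", ", len - strlen(buf) - 1);
        strncat(buf, table[i].name, len - strlen(buf) - 1);
    }
    return found;
}
```

For the value in this log, decode_eir(0x10, buf, sizeof buf) yields "page table error", matching the driver's own message.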

Comment 1 Chris Wilson 2014-05-11 07:17:49 UTC

That's bad. There is tons of garbage in the ringbuffer. Can you try mplayer -vo xv or mplayer -vo gl, just to see whether they suffer something similar that may be easier to debug? Do all the hangs in xbmc look like the one attached? Can you please attach a couple more?

Comment 2 malstrond 2014-05-11 15:24:18 UTC

Created attachment 98847
Try 1: Crashdump

Comment 3 malstrond 2014-05-11 15:25:01 UTC

Created attachment 98848
Try 1: Kernel log

Comment 4 malstrond 2014-05-11 15:25:43 UTC

Created attachment 98849
Try 2: Crashdump

Comment 5 malstrond 2014-05-11 15:26:08 UTC

Created attachment 98850
Try 2: Kernel log

Comment 6 malstrond 2014-05-11 15:26:35 UTC

Created attachment 98851
Try 3: Crashdump

Comment 7 malstrond 2014-05-11 15:26:49 UTC

Created attachment 98852
Try 3: Kernel log

Comment 8 malstrond 2014-05-11 15:27:32 UTC

Created attachment 98853
Try 4: Crashdump

Comment 9 malstrond 2014-05-11 15:27:46 UTC

Created attachment 98854
Try 4: Kernel log

Comment 10 malstrond 2014-05-11 15:28:23 UTC

Created attachment 98855
Try 5: Crashdump

Comment 11 malstrond 2014-05-11 15:28:39 UTC

Created attachment 98856
Try 5: Kernel log

Comment 12 malstrond 2014-05-11 15:29:21 UTC

Playing a video with mplayer -vo xv and -vo gl works fine. I normally use mpv, which works fine too with --vo=xv, --vo=x11 and --vo=opengl-old.

I also collected a few more crashdumps.
Once (on try 1), the GPU did recover and X was usable after a short hang, although XBMC still crashed before its GUI showed up. On that try, there were also some related entries in the kernel log.

Comment 13 Chris Wilson 2014-05-11 15:37:14 UTC

Apart from try 1, they all seem to have the same characteristic: something is overwriting the ring buffer. How easy would it be to test with an old mesa, I guess mesa-9.0?

Comment 14 malstrond 2014-05-11 18:40:34 UTC

I'd have to recompile a lot of stuff to use older mesa versions, since it's a rolling release distro.

However I found out something new:
The same GPU hang with "EIR stuck" also occurs when using xrandr or lxrandr.
No resolution change is needed: simply executing "xrandr" from lxterminal, or starting lxrandr to show the current setting, causes the GPU to hang.
But after X has locked up, I can still execute it with "DISPLAY=:0 xrandr" via SSH and get the correct output. If I change the mode, the xrandr output reflects that (* next to a different mode), but the display stays unresponsive. I can also turn the display off, and it really gets turned off. Turning it back on again yields the same solid color as before, though. While doing this, lots of call traces appear in the kernel log again.
If I execute "DISPLAY=:0 xrandr" from SSH after a clean boot, the same crash occurs, but xrandr runs and produces the correct output.

Since XBMC changes the display mode on startup (it normally runs fullscreen), while mplayer/mpv don't touch the mode, this might be the same issue.

Comment 15 malstrond 2014-05-11 18:41:14 UTC

Created attachment 98866
Hang when executing xrandr: Crashdump

Comment 16 malstrond 2014-05-11 18:42:14 UTC

Created attachment 98867
Hang when executing xrandr: Kernel log (annotated)

Comment 17 Chris Wilson 2014-05-12 06:36:33 UTC

It has the same symptoms that the ringbuffer has been overwritten.

I would try:

ickle@nuc-i3427:/usr/src/linux$ git diff
diff --git a/drivers/gpu/drm/i915/i915_gem_stolen.c b/drivers/gpu/drm/i915/i915_gem_stolen.c
index 20bd839..dcf90ca 100644
--- a/drivers/gpu/drm/i915/i915_gem_stolen.c
+++ b/drivers/gpu/drm/i915/i915_gem_stolen.c
@@ -337,6 +337,8 @@ int i915_gem_init_stolen(struct drm_device *dev)
        }
 #endif
 
+       dev_priv->gtt.stolen_size = 0;
+
        if (dev_priv->gtt.stolen_size == 0)
                return 0;
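
The patch above forces stolen_size to 0 just before the existing early return, so i915_gem_init_stolen() never sets up the stolen allocator and, presumably, objects that would otherwise be placed in stolen memory fall back to ordinary shmem-backed pages. The sketch below models that "prefer stolen, else fall back" decision. It assumes (as the overwritten-ring diagnosis suggests) that the ring buffer is one of those objects; struct stolen_pool, alloc_ring() and the bump allocator are hypothetical simplifications, not i915 code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: where a new object's backing pages come from. */
enum backing { BACKING_STOLEN, BACKING_SHMEM };

struct stolen_pool {
    uint64_t size; /* usable bytes; 0 means stolen is disabled */
    uint64_t used; /* bump-allocator watermark, always <= size */
};

/* Prefer stolen memory; fall back to shmem if stolen is disabled or full. */
static enum backing alloc_ring(struct stolen_pool *pool, uint64_t bytes)
{
    assert(bytes > 0 && pool->used <= pool->size);
    /* With the debug patch, size is forced to 0, so this branch is never
     * taken and nothing ever lands in stolen memory. */
    if (pool->size != 0 && pool->size - pool->used >= bytes) {
        pool->used += bytes;
        return BACKING_STOLEN;
    }
    return BACKING_SHMEM;
}
```

Disabling stolen entirely is a blunt workaround: it avoids any overlap with whatever else lives in stolen, at the cost of never using that memory.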

Comment 18 malstrond 2014-05-12 20:13:38 UTC

(In reply to comment #17)
> +       dev_priv->gtt.stolen_size = 0;
Yep, this worked. xrandr no longer hangs the GPU and XBMC works fine.

Comment 19 Daniel Vetter 2014-05-15 15:24:57 UTC

Should we just give up on stolen on gen3? After all, we have a few bugs with conflicts with MMIO BARs and other crap...

Comment 20 Ville Syrjala 2014-05-15 16:56:39 UTC

Maybe this?
http://lists.freedesktop.org/archives/intel-gfx/2013-December/036841.html

Comment 21 Ville Syrjala 2014-05-15 17:02:03 UTC

(In reply to comment #20)
> Maybe this?
> http://lists.freedesktop.org/archives/intel-gfx/2013-December/036841.html

Oh and in addition we might have to leave a guard page between the GTT and the rest of stolen. I think I saw something like that mentioned somewhere. That patch doesn't have a guard page though.
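
The guard-page idea amounts to an extra margin around the GTT when deciding whether a piece of stolen memory is safe to hand out. A minimal sketch of that check follows, treating addresses as half-open ranges. The one-page (4 KiB) guard size and the helper names are assumptions for illustration, not taken from any patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUARD_BYTES 4096u /* assumed: one 4 KiB page */

/* Half-open physical address range [start, end). */
struct range { uint64_t start, end; };

/* Grow `r` by `guard` bytes on each side, clamping at 0 and UINT64_MAX. */
static struct range widen(struct range r, uint64_t guard)
{
    struct range w;

    assert(r.start <= r.end);
    w.start = r.start > guard ? r.start - guard : 0;
    w.end = r.end > UINT64_MAX - guard ? UINT64_MAX : r.end + guard;
    return w;
}

static bool overlaps(struct range a, struct range b)
{
    return a.start < b.end && b.start < a.end;
}

/* True if `alloc` neither overlaps the GTT nor sits within `guard`
 * bytes of it. */
static bool alloc_is_safe(struct range alloc, struct range gtt, uint64_t guard)
{
    return !overlaps(alloc, widen(gtt, guard));
}
```

With guard set to 0 this reduces to a plain overlap test, which corresponds to the patch without a guard page.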

Comment 22 Daniel Vetter 2014-05-15 21:26:10 UTC

Well, fodder for testing for sure, with or without the guard page.

Comment 23 Chris Wilson 2014-05-16 10:30:20 UTC

(In reply to comment #20)
> Maybe this?
> http://lists.freedesktop.org/archives/intel-gfx/2013-December/036841.html

Wait, that's not upstream yet?

Comment 24 Jesse Barnes 2014-06-05 20:57:47 UTC

Assigning to Ville to push his fix.

Comment 25 Ville Syrjala 2014-07-31 18:00:56 UTC

commit f1e1c2129b79cfdaf07bca37c5a10569fe021abe
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Thu Jun 5 20:02:59 2014 +0300

    drm/i915: Don't clobber the GTT when it's within stolen memory

is now merged, so I'm assuming that this is now fixed.
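
The commit title describes the fix: on this hardware the GTT can sit inside the stolen range, so the stolen allocator must not hand out the bytes the GTT occupies. One way to do that, sketched below under the assumption that only a single contiguous stolen range is kept, is to split stolen around the GTT and keep the larger piece. This illustrates the idea with hypothetical names; it is not the code of commit f1e1c2129b79.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open physical address range [start, end). */
struct range { uint64_t start, end; };

/* Return the largest piece of `stolen` that does not overlap `gtt`. */
static struct range usable_stolen(struct range stolen, struct range gtt)
{
    assert(stolen.start <= stolen.end && gtt.start <= gtt.end);

    /* No overlap: all of stolen is usable. */
    if (gtt.end <= stolen.start || gtt.start >= stolen.end)
        return stolen;

    /* Piece below the GTT and piece above it; either may be empty. */
    struct range below = { stolen.start,
                           gtt.start > stolen.start ? gtt.start : stolen.start };
    struct range above = { gtt.end < stolen.end ? gtt.end : stolen.end,
                           stolen.end };

    return (below.end - below.start >= above.end - above.start) ? below : above;
}
```

For example, with 8 MiB of stolen at 0x7800000 and a 512 KiB GTT at its top (0x7F80000), usable_stolen() returns [0x7800000, 0x7F80000).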

Comment 26 malstrond 2014-08-02 13:28:30 UTC

I can confirm it's fixed on my machine with vanilla kernel 3.15.8.

Comment 27 Jari Tahvanainen 2016-10-19 09:40:17 UTC

Closing resolved+fixed. Verification done by Reporter.
