Bug 77207

Summary:	[ivb/hsw] batch overwritten with garbage
Product:	Mesa	Reporter:	Mario Golfetto <mariogolf2>
Component:	Drivers/DRI/i965	Assignee:	Kenneth Graunke <kenneth>
Status:	RESOLVED FIXED	QA Contact:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity:	normal
Priority:	medium	CC:	adriancz, anarsoul, bb, emir72h, famelis, ilanco, intel-gfx-bugs, jnocturna, mtijink.bugs, nicolas.belouin, robert2505, ryllu800proar, saintdev, sibrus, t.kijas, webstrand, zecoucou
Version:	10.1
Hardware:	Other
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Bug Depends on:
Bug Blocks:	77449
Attachments:	dump in /sys/class/drm/card0/error; dump in /sys/class/drm/card0/error (20140411); cat /sys/class/drm/card0/error > error.txt

Description Mario Golfetto 2014-04-08 23:03:14 UTC

Created attachment 97097 [details]
dump in /sys/class/drm/card0/error

System: Debian Jessie (testing) 64bit
kernel: Debian 3.13.7-1 x86_64 GNU/Linux
DE: KDE 4.11.3
CPU: Intel i5-4440
mainboard: Asus Z87-plus
monitor: 1980*1200@60Hz via DVI + CRT1024*768@85Hz via VGA/DSUB
no graphical card added

Yesterday I enabled KDE's standard desktop effects for a simple test, then shut the system down.
Today, after some hours of work, I opened Iceweasel, Icedove, VirtualBox with one VM, and joined a Google Hangout for a video chat.
During the video chat:
1) I pressed ALT+TAB to switch to another window
2) I saw the "Scambiatore circolare" graphical effect (this is the Italian name; sorry, I could not find the exact English name of this effect)
3) not all windows of the open programs were shown
4) the system froze for a moment (about one second).

This is the first time I have seen this, and it happened with the DVI monitor.
Until today I had no problems with a 19" TFT monitor plus the CRT.


This is on my dmesg:
[20044.219079] [drm] stuck on render ring
[20044.219082] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[20044.219083] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[20044.219084] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[20044.219085] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[20044.219085] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[20044.221558] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0xad9e000 ctx 1) at 0xad9e004

Attached /sys/class/drm/card0/error dump file (compressed).

Bye, Mario
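
The last dmesg line in the description reports both the start of the buffer object and the address where the render ring hung. As a rough illustration only, the following Python sketch computes how far into the buffer the hang occurred. The regular expression and the function name `hang_offset` are assumptions made for this example; other kernel versions may word the message differently.

```python
import re

# Pattern for the i915 hang message quoted in the dmesg output above.
# Assumption: the message has exactly this shape; it is not a stable kernel interface.
HANG_RE = re.compile(
    r"render ring hung inside bo \((0x[0-9a-fA-F]+) ctx (\d+)\) at (0x[0-9a-fA-F]+)"
)


def hang_offset(line):
    """Return (context id, byte offset of the hang inside the bo), or None if no match."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    bo_start = int(match.group(1), 16)
    ctx = int(match.group(2))
    hang_addr = int(match.group(3), 16)
    return ctx, hang_addr - bo_start


line = ("[20044.221558] [drm:i915_set_reset_status] *ERROR* render ring hung "
        "inside bo (0xad9e000 ctx 1) at 0xad9e004")
print(hang_offset(line))  # (1, 4)
```

For the line from this report the offset is 4 bytes, that is, the hang is reported at the second DWord of the buffer object.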

Comment 1 Chris Wilson 2014-04-09 06:02:12 UTC

Try downgrading to mesa-9 or mesa-10.0.

Comment 2 Chris Wilson 2014-04-09 07:54:44 UTC

*** Bug 76300 has been marked as a duplicate of this bug. ***

Comment 3 Kenneth Graunke 2014-04-09 23:17:20 UTC

I was able to reproduce this today - a sensible looking batch, but the first bunch of DWords are smashed to 0xFFFFFFFF.  Also on Haswell.  Not sure what's going on.

Mario, what version of Mesa are you using?
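
The symptom described in this comment (a batch whose first DWords are all 0xFFFFFFFF) can be checked mechanically once the batch contents are available as a list of 32-bit values. The sketch below is only an illustration: `count_leading_smashed` is a hypothetical helper, and the sample values are made up rather than taken from the attached error dump.

```python
def count_leading_smashed(dwords, marker=0xFFFFFFFF):
    """Count how many DWords at the start of the batch equal the marker value."""
    count = 0
    for value in dwords:
        if value != marker:
            break
        count += 1
    return count


# Made-up batch contents for illustration, not taken from the attached dump.
batch = [0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x12345678, 0x9ABCDEF0]
print(count_leading_smashed(batch))  # 3
```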

Comment 4 Mario Golfetto 2014-04-10 20:10:54 UTC

On 10/04/2014 01:17, bugzilla-daemon@freedesktop.org had the honor and the
audacity to write:
> *Comment # 3 <https://bugs.freedesktop.org/show_bug.cgi?id=77207#c3> on
> bug 77207 <https://bugs.freedesktop.org/show_bug.cgi?id=77207> from
> Kenneth Graunke <mailto:kenneth@whitecape.org> *
> 
> I was able to reproduce this today - a sensible looking batch, but the first
> bunch of DWords are smashed to 0xFFFFFFFF.  Also on Haswell.  Not sure what's
> going on.
> 
> Mario, what version of Mesa are you using?
> 

Hi,
this is mesa version on my box (updated today):

> ii  libegl1-mesa:amd64                    10.1.0-5
> ii  libegl1-mesa-drivers:amd64            10.1.0-5
> ii  libgl1-mesa-dri:amd64                 10.1.0-5
> ii  libgl1-mesa-glx:amd64                 10.1.0-5
> ii  libglapi-mesa:amd64                   10.1.0-5
> ii  libgles2-mesa:amd64                   10.1.0-5
> ii  libglu1-mesa:amd64                    9.0.0-2
> ii  libopenvg1-mesa:amd64                 10.1.0-5
> ii  libwayland-egl1-mesa:amd64            10.1.0-5 

Today I upgraded my system and it works better.

I'm still looking for new details.

Regards,
Mario

Comment 5 Mario Golfetto 2014-04-11 16:59:35 UTC

Created attachment 97229 [details]
dump in /sys/class/drm/card0/error (20140411)

Comment 6 Kenneth Graunke 2014-04-11 17:58:24 UTC

Our current theory is that Mesa is allocating insufficient memory for the MCS buffers, and the GPU is running off the end of those and trampling whatever happens to come after it in memory.  In this case, that's the batchbuffer.  In MCS speak, 0xff means "this part of the buffer is clear."
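
The theory in this comment can be pictured with a toy model. The Python sketch below is not Mesa code: it treats a flat bytearray as memory, places a pretend batchbuffer directly after an undersized MCS allocation, and clears more bytes than were allocated. All sizes, names, and the 0x11 filler bytes are assumptions chosen for illustration.

```python
MCS_ALLOCATED = 16   # bytes actually reserved for the MCS buffer (hypothetical)
MCS_NEEDED = 24      # bytes the clear really touches (hypothetical)
BATCH_SIZE = 16      # pretend batchbuffer placed right after the MCS buffer


def clear_mcs(memory, mcs_offset, clear_size):
    """Write 0xff ("clear") over clear_size bytes, with no bounds check."""
    for i in range(clear_size):
        memory[mcs_offset + i] = 0xFF


memory = bytearray(MCS_ALLOCATED + BATCH_SIZE)
memory[MCS_ALLOCATED:] = bytes([0x11]) * BATCH_SIZE  # stand-in for batch commands

clear_mcs(memory, 0, MCS_NEEDED)  # runs 8 bytes past the end of the allocation

batch = memory[MCS_ALLOCATED:]
smashed = len(batch) - len(batch.lstrip(b"\xff"))
print(batch.hex())  # ffffffffffffffff1111111111111111
print(smashed)      # 8
```

In this model the first 8 bytes of the neighbouring buffer end up as 0xff, which matches the shape of the corruption reported earlier in the thread: a sensible batch whose beginning has been overwritten.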

Comment 7 Chris Wilson 2014-04-13 06:19:28 UTC

*** Bug 77376 has been marked as a duplicate of this bug. ***

Comment 8 Kenneth Graunke 2014-04-13 16:48:28 UTC

For what it's worth, these problems should go away if you upgrade to the upcoming X server stable releases: either 1.15.1 or 1.14.6.  (Those should be coming out any day now.)  KWin is erroneously getting an 8x multisampled visual/fbconfig due to a server-side GLX bug, and it really doesn't want one.

That said, I think winsys multisampling is still broken, so this is a real bug.  Upgrading the X server is probably the best path for users right now, while we figure out what's going on.

Comment 9 Kenneth Graunke 2014-04-13 17:11:00 UTC

Another easier workaround in the meantime is to ask KWin to use EGL:

   KWIN_OPENGL_INTERFACE=egl kwin --replace &

Comment 10 Kenneth Graunke 2014-04-13 17:18:54 UTC

*** Bug 76763 has been marked as a duplicate of this bug. ***

Comment 11 Kenneth Graunke 2014-04-13 17:23:41 UTC

*** Bug 77109 has been marked as a duplicate of this bug. ***

Comment 12 Kenneth Graunke 2014-04-13 17:29:01 UTC

*** Bug 76063 has been marked as a duplicate of this bug. ***

Comment 13 Kenneth Graunke 2014-04-13 17:32:20 UTC

*** Bug 77256 has been marked as a duplicate of this bug. ***

Comment 14 Vasily Khoruzhick 2014-04-13 17:41:02 UTC

Kenneth, could you point me to exact xserver commit(s) which should fix the bug?

Comment 15 Chris Wilson 2014-04-13 19:15:02 UTC

*** Bug 77392 has been marked as a duplicate of this bug. ***

Comment 16 Kenneth Graunke 2014-04-13 20:06:13 UTC

(In reply to comment #14)
> Kenneth, could you point me to exact xserver commit(s) which should fix the
> bug?

Sure.  It's "glx: Clear new FBConfig attributes to 0 by default.":

http://cgit.freedesktop.org/xorg/xserver/commit/?id=96a28e9c914d7ae9b269f73a27b99cbd3c465ac8

Comment 17 Chris Wilson 2014-04-14 11:14:09 UTC

*** Bug 77429 has been marked as a duplicate of this bug. ***

Comment 18 Kenneth Graunke 2014-04-14 17:41:27 UTC

*** Bug 76039 has been marked as a duplicate of this bug. ***

Comment 19 Kenneth Graunke 2014-04-15 09:27:59 UTC

Eric posted a Mesa patch which fixes the corruption/hangs:
http://lists.freedesktop.org/archives/mesa-dev/2014-April/057818.html

Apparently, it was indeed a problem with our multisample control buffer handling.  With that patch, window system multisampling works reliably (at least for me.)  Thank you all for the excellent data, and sorry for the trouble!

That said, KWin really should not be using 8x multisampling - it adds a lot of unnecessary overhead.  I'd still strongly recommend upgrading to X server 1.15.1 or 1.14.6, which fix the GLX bug which caused KWin to do this.  Or, you can continue working around that bug by using KWIN_OPENGL_INTERFACE=egl.

Any of the above should fix this issue.

Comment 20 Eric Anholt 2014-04-15 22:14:20 UTC

commit 7ae870211ddc40ef6ed209a322c3a721214bb737
Author: Eric Anholt <eric@anholt.net>
Date:   Mon Apr 14 16:52:43 2014 -0700

    i965: Fix buffer overruns in MSAA MCS buffer clearing.

Comment 21 Kenneth Graunke 2014-04-16 04:31:17 UTC

*** Bug 76907 has been marked as a duplicate of this bug. ***

Comment 22 Kenneth Graunke 2014-04-16 06:53:41 UTC

*** Bug 76704 has been marked as a duplicate of this bug. ***

Comment 23 Chris Wilson 2014-04-30 10:58:18 UTC

*** Bug 76574 has been marked as a duplicate of this bug. ***

Comment 24 Chris Wilson 2014-04-30 10:59:13 UTC

*** Bug 76491 has been marked as a duplicate of this bug. ***

Comment 25 T.Kijas 2014-05-05 12:10:02 UTC

Neither KWIN_OPENGL_INTERFACE=egl nor KWIN_OPENGL_INTERFACE=egl kwin --replace & works for me.

Downgrading to Mesa 8 makes it work.

I am sorry for reopening this, but my original bug report (about graphical corruption, white stripes, #77256) has been marked as a duplicate of this bug.

Comment 26 Kenneth Graunke 2014-05-05 17:07:18 UTC

(In reply to comment #25)
> Neither KWIN_OPENGL_INTERFACE=egl nor KWIN_OPENGL_INTERFACE=egl kwin
> --replace & works for me.
> 
> Downgrading to Mesa 8 makes it work.
> 
> I am sorry for reopening this, but my original bug report (about graphical
> corruption, white stripes, #77256) has been marked as a duplicate of this
> bug.

Perhaps your KWin doesn't have EGL support, so that workaround fails.

The proper solution is to upgrade to Mesa 10.0.5, 10.1.1, or 10.2-rc1.  That will fix the actual driver bug.  If you upgrade to one of those releases, and still experience the problem, please reopen the bug.

I also highly recommend upgrading to X server 1.15.1 or 1.14.6.  It isn't strictly necessary, but without it, things will be slow.
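
The Mesa releases named above (10.0.5, 10.1.1, 10.2-rc1) can be turned into a quick check against an installed version string. The sketch below is a rough aid, not an official rule: `mesa_has_fix` is a hypothetical helper, it looks only at the upstream version number, and it cannot tell whether a distribution backported the patch or whether a development snapshot predates it. For releases older than 10.0 it returns None, because this thread does not say which of those are affected.

```python
import re


def mesa_has_fix(version):
    """Return True/False for 10.x and later, or None when the thread gives no answer."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    if (major, minor) < (10, 0):
        return None  # not covered by the release list in this thread
    if (major, minor) == (10, 0):
        return patch >= 5
    if (major, minor) == (10, 1):
        return patch >= 1
    return True  # 10.2-rc1 and later


print(mesa_has_fix("10.1.0-5"))  # False
print(mesa_has_fix("10.1.1"))    # True
print(mesa_has_fix("9.0.0-2"))   # None
```

The first call uses the package version reported in comment 4, which predates the fixed 10.1.1 release.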

Comment 27 Kenneth Graunke 2014-05-05 17:08:20 UTC

*** Bug 77256 has been marked as a duplicate of this bug. ***

Comment 28 Chris Wilson 2014-05-07 05:41:24 UTC

*** Bug 78362 has been marked as a duplicate of this bug. ***

Comment 29 Chris Wilson 2014-05-11 07:28:28 UTC

*** Bug 78531 has been marked as a duplicate of this bug. ***

Comment 30 Benjamin Baumgärtner 2014-05-15 16:54:54 UTC

Hi Kenneth, 

I upgraded my Ubuntu trusty to today's state of the xorg-edgers PPA:

xserver-xorg-core	2:1.15.1-0ubuntu2
xserver-xorg-video-intel	2:2.99.911+git20140507.18416b51-0ubuntu0ricotz~trusty
libegl1-mesa:amd64	10.3.0~git20140514.8a9f5ecd-0ubuntu0sarvatt~trusty
libgl1-mesa-dri:amd64	10.3.0~git20140514.8a9f5ecd-0ubuntu0sarvatt~trusty

$ glxinfo | grep Mesa
client glx vendor string: Mesa Project and SGI
OpenGL renderer string: Mesa DRI Intel(R) Ivybridge Desktop 
OpenGL core profile version string: 3.3 (Core Profile) Mesa 10.3.0-devel
OpenGL version string: 3.0 Mesa 10.3.0-devel

I'm still experiencing problems. Garbage screen output disappeared, but the screen is stuck for 5-20 secs every 1-3 minutes.

The kernel ring buffer says:
[  333.738942] [drm] stuck on render ring
[  333.738951] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  333.738952] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  333.738953] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  333.738954] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  333.738955] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  339.736311] [drm] stuck on render ring
[  339.736354] [drm:i915_context_is_banned] *ERROR* context hanging too fast, declaring banned!
[  348.720483] [drm] stuck on render ring
[  354.729941] [drm] stuck on render ring
[  360.727429] [drm] stuck on render ring
[  360.727558] [drm:i915_context_is_banned] *ERROR* context hanging too fast, declaring banned!
[  439.102608] init: tty1 main process ended, respawning
[  468.674648] [drm] stuck on render ring
[  647.707294] [drm] stuck on render ring
[  716.739191] [drm] stuck on render ring

So I'm still having trouble, even with xserver 1.15.1 and a recent mesa version.

Regards,
Benjamin
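
To quantify how often the hangs in a log like the one above recur, the kernel timestamps of the "stuck on render ring" lines can be extracted and differenced. The Python sketch below assumes the dmesg line format shown in this comment; `hang_times` and `gaps` are hypothetical helper names, and the sample is a three-line excerpt of the log above.

```python
import re

# Matches lines such as "[  333.738942] [drm] stuck on render ring".
STUCK_RE = re.compile(r"^\[\s*(\d+\.\d+)\] \[drm\] stuck on render ring")


def hang_times(log_text):
    """Return the timestamps (seconds since boot) of all hang reports in the log."""
    times = []
    for line in log_text.splitlines():
        match = STUCK_RE.match(line.strip())
        if match:
            times.append(float(match.group(1)))
    return times


def gaps(times):
    """Return the time in seconds between consecutive hang reports."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


sample = """\
[  333.738942] [drm] stuck on render ring
[  339.736311] [drm] stuck on render ring
[  339.736354] [drm:i915_context_is_banned] *ERROR* context hanging too fast, declaring banned!
[  348.720483] [drm] stuck on render ring
"""
times = hang_times(sample)
print(len(times), [round(g, 1) for g in gaps(times)])
```

For this excerpt the hang reports are roughly six and nine seconds apart; the "context hanging too fast" line is deliberately not counted as a hang.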

Comment 31 Benjamin Baumgärtner 2014-05-15 16:56:49 UTC

Created attachment 99107 [details]
cat /sys/class/drm/card0/error > error.txt

Comment 32 Kenneth Graunke 2014-05-15 18:04:43 UTC

Hi Benjamin,

The batch buffer in your error state does not contain garbage - instead, it looks like you're hanging on a PIPE_CONTROL after a 3DPRIMITIVE.  Plus, you don't have the graphical corruption described here.  And you definitely have the fix for this bug.

So, I think you're hitting a different GPU hang unrelated to this report.  I've gone ahead and created bug #78751 to track your issue.

Closing this one again.

--Ken
