Bug 79675

Summary: Intel driver crash in brw_update_renderbuffer_surface
Product: DRI Reporter: post+fdo
Component: DRM/Intel Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED DUPLICATE QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                Flags
GPU Crash Dump             none
Another GPU crash dump     none

Description post+fdo 2014-06-05 09:52:42 UTC
Created attachment 100451 [details]
GPU Crash Dump

For the second time now, I am experiencing a crash in the Intel driver in brw_update_renderbuffer_surface. The crashing application is KWin, so this is fairly disruptive.
Shortly before the crash, I got some weird graphics artefacts (entire window contents, in this case Firefox, getting blacked out, or rather turned a very dark grey).


Here's the stacktrace:

#6  intel_miptree_used_for_rendering (mt=0x0) at ../../../../../../../src/mesa/drivers/dri/i965/intel_mipmap_tree.h:696
#7  brw_update_renderbuffer_surface (brw=0x7f92fd2de038, rb=0x26a0700, layered=<optimized out>, unit=0) at ../../../../../../../src/mesa/drivers/dri/i965/brw_wm_surface_state.c:620
#8  0x00007f92fdee555d in brw_update_renderbuffer_surfaces (brw=0x7f92fd2de038) at ../../../../../../../src/mesa/drivers/dri/i965/brw_wm_surface_state.c:702
#9  0x00007f92fdeb395a in brw_upload_state (brw=brw@entry=0x7f92fd2de038) at ../../../../../../../src/mesa/drivers/dri/i965/brw_state_upload.c:645
#10 0x00007f92fde63fd7 in brw_try_draw_prims (indirect=<optimized out>, max_index=<optimized out>, min_index=<optimized out>, ib=<optimized out>, nr_prims=<optimized out>, prims=<optimized out>, arrays=<optimized out>, ctx=0x7f92fd2de038) at ../../../../../../../src/mesa/drivers/dri/i965/brw_draw.c:487
#11 brw_draw_prims (ctx=0x7f92fd2de038, prims=<optimized out>, nr_prims=<optimized out>, ib=<optimized out>, index_bounds_valid=<optimized out>, min_index=4294967295, max_index=4294967295, unused_tfb_object=0x0, indirect=0x0) at ../../../../../../../src/mesa/drivers/dri/i965/brw_draw.c:581
#12 0x00007f92fdcc08d3 in vbo_handle_primitive_restart (ctx=ctx@entry=0x7f92fd2de038, prim=prim@entry=0x7fff9dd32080, nr_prims=nr_prims@entry=1, ib=ib@entry=0x7fff9dd32060, index_bounds_valid=index_bounds_valid@entry=0 '\000', min_index=min_index@entry=4294967295, max_index=max_index@entry=4294967295) at ../../../../src/mesa/vbo/vbo_exec_array.c:585
#13 0x00007f92fdcc1bf0 in vbo_validated_drawrangeelements (ctx=ctx@entry=0x7f92fd2de038, mode=mode@entry=4, index_bounds_valid=index_bounds_valid@entry=0 '\000', start=start@entry=4294967295, end=end@entry=4294967295, count=count@entry=30, type=type@entry=5123, indices=indices@entry=0x0, basevertex=basevertex@entry=0, numInstances=numInstances@entry=1, baseInstance=baseInstance@entry=0) at ../../../../src/mesa/vbo/vbo_exec_array.c:1006
#14 0x00007f92fdcc2085 in vbo_exec_DrawElementsBaseVertex (mode=4, count=30, type=5123, indices=0x0, basevertex=0) at ../../../../src/mesa/vbo/vbo_exec_array.c:1179
#15 0x00007f93194902db in KWin::GLVertexBuffer::draw (this=this@entry=0x27e2230, region=..., primitiveMode=primitiveMode@entry=7, first=0, count=30, hardwareClipping=<optimized out>) at ../../../kwin/libkwineffects/kwinglutils.cpp:1936
#16 0x00007f931e77bab8 in KWin::SceneOpenGL2Window::performPaint (this=this@entry=0x2cf5b30, mask=mask@entry=9, region=..., data=...) at ../../kwin/scene_opengl.cpp:1574


KWin automatically disabled compositing. Attempts to re-enable it fail; it just crashes again immediately. Running "glxgears" segfaults with the following stacktrace:

#0  0x00007ffff2d98c18 in get_stencil_miptree (irb=0x8ece60) at ../../../../../../../src/mesa/drivers/dri/i965/brw_misc_state.c:257
#1  brw_workaround_depthstencil_alignment (brw=brw@entry=0x7ffff7f4e038, clear_mask=clear_mask@entry=0) at ../../../../../../../src/mesa/drivers/dri/i965/brw_misc_state.c:273
#2  0x00007ffff2d55af5 in brw_try_draw_prims (indirect=0x0, max_index=161, min_index=0, ib=0x0, nr_prims=2, prims=0x8ee7c0, arrays=0x667920, ctx=0x7ffff7f4e038) at ../../../../../../../src/mesa/drivers/dri/i965/brw_draw.c:427
#3  brw_draw_prims (ctx=0x7ffff7f4e038, prims=0x8ee7c0, nr_prims=2, ib=0x0, index_bounds_valid=<optimized out>, min_index=0, max_index=161, unused_tfb_object=0x0, indirect=0x0)
    at ../../../../../../../src/mesa/drivers/dri/i965/brw_draw.c:581
#4  0x00007ffff2bcccab in vbo_save_playback_vertex_list (ctx=0x7ffff7f4e038, data=0x8ee3ec) at ../../../../src/mesa/vbo/vbo_save_draw.c:309
#5  0x00007ffff2ae3a12 in ext_opcode_execute (node=0x8ee3e8, ctx=0x7ffff7f4e038) at ../../../../src/mesa/main/dlist.c:628
#6  execute_list (ctx=0x7ffff7f4e038, list=<optimized out>) at ../../../../src/mesa/main/dlist.c:7024
#7  0x00007ffff2af3f42 in _mesa_CallList (list=1) at ../../../../src/mesa/main/dlist.c:8328

dmesg says:

[ 5116.682010] [drm] stuck on render ring
[ 5116.682014] [drm] stuck on blitter ring
[ 5116.682602] [drm] GPU HANG: ecode 0:0xf4e9fffe, in Xorg [1007], reason: Ring hung, action: reset
[ 5116.682603] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 5116.682604] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 5116.682605] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 5116.682605] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 5116.682606] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 5116.682666] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
[ 5118.684686] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

I am attaching the output of "sudo cat /sys/class/drm/card0/error | gzip".

I am using Debian testing amd64, some version numbers:
KWin 4.11.9
Mesa 10.1.4
xserver-xorg 7.7+7
xserver-xorg-core 1.15.1
xserver-xorg-video-intel 2.21.15
Kernel 3.15.0-rc7 (vanilla upstream)
The GPU is the HD3000 built into my Core i5-2450M.
Comment 1 Chris Wilson 2014-06-05 10:01:02 UTC
The first GPU hang is fairly innocuous; it is bug 54226.

There is however a second GPU hang that kills the box.

The crash then occurs in Mesa when UXA refuses to give it any more buffers. Had UXA handed one out, Mesa would only have crashed again anyway once the kernel refused to do anything with the buffer.


To differentiate this bug from #54226, can you please capture a few more error states so that we can see if there is a second factor here.
Comment 2 post+fdo 2014-06-05 10:07:24 UTC
(In reply to comment #1)
> To differentiate this bug from #54226, can you please capture a few more
> error states so that we can see if there is a second factor here.

I am sorry, I do not understand. Should I just run "sudo cat /sys/class/drm/card0/error | gzip" a few more times? That does not seem useful to me, but what do I know ;-) . Also, I cannot reproduce the error on demand; it happens about once every three weeks (this is only the second time I have seen it). Of course I can capture the GPU state again whenever it happens again.

I will keep the machine in the error state for now, but I'll reboot eventually to get proper scrolling again...
Comment 3 Chris Wilson 2014-06-05 10:16:58 UTC
The problem is that we only capture the first error state (since it is usually the more interesting cause, and error states can hog a lot of kernel memory) and we only free the error state when root does "echo > /sys/class/drm/card0/error". Since the two errors here occur within a few seconds of each other, capturing the second error does not seem practical. So what I want you to do is see if you get a GPU hang reported that doesn't take the system down with it, and attach that error state to see if it indicates a reason for the other hang.

It may just be the system is not recovering from the GPU hang correctly - e.g. mesa did not invalidate its context and so subsequent batches referenced invalid state.
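
The capture-then-clear procedure described in this comment could be scripted roughly as follows. This is only a minimal sketch: the helper name save_and_clear_error_state and the output file naming are hypothetical, card0 is assumed to be the Intel GPU, and the script has to run as root when pointed at the real sysfs node.

```python
import gzip
import shutil
import time
from pathlib import Path

# Path taken from the kernel log in this report; "card0" is an assumption
# that only holds when the Intel GPU is the first DRM device.
ERROR_NODE = Path("/sys/class/drm/card0/error")


def save_and_clear_error_state(node=ERROR_NODE, out_dir=Path(".")):
    """Copy the captured error state to a gzip file, then free it.

    Hypothetical helper, not part of any i915 tooling.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    out_path = Path(out_dir) / f"i915-error-{stamp}.gz"

    # Equivalent of: cat /sys/class/drm/card0/error | gzip > out_path
    with open(node, "rb") as src, gzip.open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Equivalent of: echo > /sys/class/drm/card0/error
    # Per the comment above, this frees the stored error state so that
    # a later hang can be captured.
    with open(node, "w") as f:
        f.write("\n")

    return out_path
```

Calling save_and_clear_error_state() after each reported hang would leave one compressed dump per hang in the current directory, ready to attach.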
Comment 4 post+fdo 2014-06-05 11:07:52 UTC
(In reply to comment #3)
> So what I want
> you to do is see if you get a GPU hang reported that doesn't take the system
> down with it, and attach that error state to see if it indicates a reason
> for the other hang.

The overall system didn't go down, it's still running. Just GL is all broken.
And I don't think I'd even notice the hang if KWin didn't crash.
I don't experience two hangs, I experience some glitches and then (seconds later) a crash of everything GL-related. So to be honest I don't know how to produce or even detect the state you are asking for.
Comment 5 post+fdo 2014-08-03 11:29:20 UTC
Created attachment 103901 [details]
Another GPU crash dump

I just experienced the GPU crash again and attached the GPU crash dump. Unfortunately, I was not able to catch the backtrace of kwin.

The kernel said
Aug 03 13:12:00 r-schnelltop kernel: [drm] stuck on render ring
Aug 03 13:12:00 r-schnelltop kernel: [drm] stuck on blitter ring
Aug 03 13:12:00 r-schnelltop kernel: [drm] GPU HANG: ecode 0:0xf4e9fffe, in Xorg [955], reason: Ring hung, action: reset
Aug 03 13:12:00 r-schnelltop kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Aug 03 13:12:00 r-schnelltop kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Aug 03 13:12:00 r-schnelltop kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Aug 03 13:12:00 r-schnelltop kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Aug 03 13:12:00 r-schnelltop kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Aug 03 13:12:00 r-schnelltop kernel: [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
Aug 03 13:12:02 r-schnelltop kernel: [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
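
The "GPU HANG" line in the kernel log above can also be picked out mechanically, for example to notice a hang that did not crash anything or to compare ecodes between occurrences. The following is a minimal sketch: the regular expression is derived only from the two hang messages quoted in this report, so other kernel versions may need a different pattern, and find_gpu_hangs is a hypothetical helper name.

```python
import re

# Pattern derived only from the "GPU HANG" lines quoted in this report;
# other kernel versions may format the message differently.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,]+), in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def find_gpu_hangs(log_lines):
    """Return one dict per GPU HANG line found in the given log lines."""
    hangs = []
    for line in log_lines:
        match = HANG_RE.search(line)
        if match:
            hangs.append(match.groupdict())
    return hangs


# Sample lines copied from the dmesg and journal output in this report.
sample = [
    "[ 5116.682010] [drm] stuck on render ring",
    "[ 5116.682602] [drm] GPU HANG: ecode 0:0xf4e9fffe, in Xorg [1007], "
    "reason: Ring hung, action: reset",
    "Aug 03 13:12:00 r-schnelltop kernel: [drm] GPU HANG: ecode 0:0xf4e9fffe, "
    "in Xorg [955], reason: Ring hung, action: reset",
]
hangs = find_gpu_hangs(sample)
for hang in hangs:
    print(hang["ecode"], hang["process"], hang["pid"], hang["reason"])
```

Fed with the output of dmesg or the journal, the same function works on both timestamp formats because the pattern only anchors on the message text.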

I have to reboot now, something else seems to hang as well (my current suspect is d-bus) and the machine is mostly unusable.


The software versions changed to:
KWin 4.11.9
Mesa 10.2.4
xserver-xorg-core 1.16.0
xserver-xorg-video-intel 2.21.15
Kernel 3.15.3 (vanilla upstream)
Comment 6 Daniel Vetter 2014-11-04 14:28:28 UTC
That's still the semaphore bug afaict. Not sure why your box keels over right afterwards, but I think it's most accurate to close this as a dupe.

Retesting with latest kernels & mesa should always be interesting though.

*** This bug has been marked as a duplicate of bug 54226 ***
