Bug 33422

Summary: [SNB] etracer gets segfault on SandyBridge
Product: Mesa
Reporter: Paulo Zanoni <przanoni>
Component: Drivers/DRI/i965
Assignee: Ian Romanick <idr>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: normal
Priority: medium
CC: haihao.xiang, nanhai.zou, przanoni, zhenyu.z.wang
Version: 7.10
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:

Description Paulo Zanoni 2011-01-24 11:47:08 UTC
I have the following card (SandyBridge):
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:0116] (rev 09)

With X 1.9.3 and Mesa 7.10 I get a segfault when I play extremetuxracer. You have to play for a few minutes in fullscreen to trigger it (I don't know whether it also happens in non-fullscreen mode).

x11-driver-video-intel-2.14.0-1mdv2010.2
x11-server-xorg-1.9.3-3mdv2010.2
libdrm2-2.4.23-1mdv2010.2

Here is the backtrace:


Program received signal SIGSEGV, Segmentation fault.
prepare_wm_surfaces (brw=0x8e78270) at brw_wm_surface_state.c:602
602              brw_add_validated_bo(brw, region->buffer);
(gdb) bt
#0  prepare_wm_surfaces (brw=0x8e78270) at brw_wm_surface_state.c:602
#1  0xb6f01e38 in brw_validate_state (brw=0x8e78270) at brw_state_upload.c:397
#2  0xb6ef0f8f in brw_try_draw_prims (ctx=0x8e78270, arrays=0x9202438, 
    prim=0x9200e0c, nr_prims=1, ib=0x0, index_bounds_valid=1 '\001', 
    min_index=0, max_index=3) at brw_draw.c:362
#3  brw_draw_prims (ctx=0x8e78270, arrays=0x9202438, prim=0x9200e0c, 
    nr_prims=1, ib=0x0, index_bounds_valid=1 '\001', min_index=0, max_index=3)
    at brw_draw.c:447
#4  0xb6fe6ae9 in vbo_exec_vtx_flush (exec=0x9200c98, unmap=1 '\001')
    at vbo/vbo_exec_draw.c:381
#5  0xb6fddbc9 in vbo_exec_FlushVertices_internal (ctx=0x8ea4788, 
    unmap=0 '\000') at vbo/vbo_exec_api.c:911
#6  0xb6fddc68 in vbo_exec_FlushVertices (ctx=0x8ea4788, flags=1)
    at vbo/vbo_exec_api.c:945
#7  0xb70cb7a1 in _mesa_PopAttrib () at main/attrib.c:858
#8  0xb733f0de in __glXDisp_PopAttrib (pc=0x9bcccbc "\004")
    at indirect_dispatch.c:1443
#9  0xb7367d29 in __glXDisp_Render (cl=0x8f22f18, pc=0x9bcccb8 "\004")
    at glxcmds.c:1847
#10 0xb736c870 in __glXDispatch (client=0x8f22e40) at glxext.c:600
#11 0x08070fff in Dispatch () at dispatch.c:432
#12 0x080625ba in main (argc=8, argv=0xbfa489f4, envp=0xbfa48a18) at main.c:291
(gdb) display region
1: region = (struct intel_region *) 0x0



And here is a piece of the code that crashes:

   if (ctx->DrawBuffer->_NumColorDrawBuffers >= 1) {
      for (i = 0; i < ctx->DrawBuffer->_NumColorDrawBuffers; i++) {
         struct gl_renderbuffer *rb = ctx->DrawBuffer->_ColorDrawBuffers[i];
         struct intel_renderbuffer *irb = intel_renderbuffer(rb);
         struct intel_region *region = irb ? irb->region : NULL;

         brw_add_validated_bo(brw, region->buffer);
         nr_surfaces = SURF_INDEX_DRAW(i) + 1;
      }
   }


In our case, "irb" is null, so we assign NULL to "region". On the next line, we try to access region->buffer, which doesn't make sense. We really shouldn't assign NULL to region and then access region->buffer...


So I tried a very simple patch:

diff -Nrup Mesa-7.10//src/mesa/drivers/dri/i965/brw_wm_surface_state.c patched/src/mesa/drivers/dri/i965/brw_wm_surface_state.c
--- Mesa-7.10//src/mesa/drivers/dri/i965/brw_wm_surface_state.c 2011-01-02 20:58:35.000000000 -0200
+++ patched/src/mesa/drivers/dri/i965/brw_wm_surface_state.c    2011-01-24 16:35:47.063197793 -0200
@@ -599,6 +599,9 @@ prepare_wm_surfaces(struct brw_context *
         struct intel_renderbuffer *irb = intel_renderbuffer(rb);
         struct intel_region *region = irb ? irb->region : NULL;
 
+        if (!region)
+            continue;
+
         brw_add_validated_bo(brw, region->buffer);
         nr_surfaces = SURF_INDEX_DRAW(i) + 1;
       }


Then I compiled/rebooted and tried to run etracer again. This time I got a different segfault (but still a NULL region):


Program received signal SIGSEGV, Segmentation fault.
intel_region_buffer (intel=0xa5aaf28, region=0x0, flag=2)
    at intel_regions.c:514
514        if (region->pbo) {
(gdb) bt
#0  intel_region_buffer (intel=0xa5aaf28, region=0x0, flag=2)
    at intel_regions.c:514
#1  0xb6e78d21 in intelClearWithBlit (ctx=0xa5aaf28, mask=2)
    at intel_blit.c:262
#2  0xb6e7bccd in intelClear (ctx=0xa5aaf28, mask=2) at intel_clear.c:177
#3  0xb7083958 in _mesa_Clear (mask=0) at main/clear.c:241
#4  0xb72f2e07 in __glXDisp_Clear (pc=0xb1ea3040 "")
    at indirect_dispatch.c:1335
#5  0xb731bd29 in __glXDisp_Render (cl=0xa206eb0, pc=0xb1ea303c "\b")
    at glxcmds.c:1847
#6  0xb7320870 in __glXDispatch (client=0xa206dd8) at glxext.c:600
#7  0x08070fff in Dispatch () at dispatch.c:432
#8  0x080625ba in main (argc=8, argv=0xbfbd3464, envp=0xbfbd3488) at main.c:291
(gdb) display region
1: region = (struct intel_region *) 0x0


But this backtrace is the one PCPA reported for Salome:
https://bugs.freedesktop.org/show_bug.cgi?id=27333 (see backtrace on last comment).

So I tried PCPA's patch:

diff --git a/src/mesa/drivers/dri/intel/intel_blit.c b/src/mesa/drivers/dri/intel/intel_blit.c
index 2c85ad3..e369783 100644
--- a/src/mesa/drivers/dri/intel/intel_blit.c
+++ b/src/mesa/drivers/dri/intel/intel_blit.c
@@ -263,6 +263,9 @@ intelClearWithBlit(GLcontext *ctx, GLbitfield mask)

       /* OK, clear this renderbuffer */
       irb = intel_get_renderbuffer(fb, buf);
+      if (irb->region == NULL)
+         goto clear_bit;
+
       write_buffer = intel_region_buffer(intel, irb->region,
                                         all ? INTEL_WRITE_FULL :
                                         INTEL_WRITE_PART);
@@ -370,6 +373,7 @@ intelClearWithBlit(GLcontext *ctx, GLbitfield mask)
       if (intel->always_flush_cache)
         intel_batchbuffer_emit_mi_flush(intel->batch);

+  clear_bit:
       if (buf == BUFFER_DEPTH || buf == BUFFER_STENCIL)
         mask &= ~(BUFFER_BIT_DEPTH | BUFFER_BIT_STENCIL);
       else

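The idea behind this patch can be sketched in isolation. This is a minimal illustration, not the Mesa code: `struct region`, `NUM_BUFFERS` and `clear_buffers()` are hypothetical names, and the loop shape is an assumption. It only shows why the jump targets the mask update instead of using a plain `continue`: a buffer without a region is skipped, but its bit is still removed from the mask.

```c
#include <stddef.h>

#define NUM_BUFFERS 4   /* hypothetical; not Mesa's BUFFER_COUNT */

/* Hypothetical stand-in for a renderbuffer's backing region. */
struct region {
   int cleared;
};

/* Clears every buffer selected in mask that has a region. Returns the bits
 * left in mask, which is 0 when every requested bit was handled. */
static unsigned
clear_buffers(struct region *regions[NUM_BUFFERS], unsigned mask)
{
   for (int buf = 0; buf < NUM_BUFFERS && mask; buf++) {
      unsigned bit = 1u << buf;

      if (!(mask & bit))
         continue;

      if (regions[buf] == NULL)
         goto clear_bit;        /* nothing to clear, but still drop the bit */

      regions[buf]->cleared = 1;

   clear_bit:
      mask &= ~bit;
   }
   return mask;
}
```

Calling `clear_buffers()` with a mask that selects a buffer whose region is NULL leaves that buffer untouched and still returns 0, so the caller does not see the bit as unhandled.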
Then I tried to test... After playing for a while (without crashing, the game going fine), I stopped etracer and went to the desktop. It was in a very inconsistent state: the screen was not refreshing correctly, I could still see parts of the game on my desktop, and maximizing/minimizing applications caused weird behavior and flicker. When redrawing parts of the screen, X sometimes drew not the top window but the content of the window below it. I'm using kwin.

Then, even without being able to use konsole correctly (I was not seeing what I was typing), I launched etracer again. The screen was really flashing, many times per second. Really annoying to watch.

So I came here to report the bug =)

Any hints on what could be going on? Something I could investigate? A breakpoint to add? Do you need any "printf"s inside the code?

Thanks,
Paulo
Comment 1 Paulo Zanoni 2011-01-24 11:56:34 UTC
Oh, I forgot:
kernel 2.6.37-server-1.1mnb i686
Comment 2 Paulo Zanoni 2011-01-25 10:02:01 UTC
(In reply to comment #0)
> 
> In our case, "irb" is null, so we assign NULL to "region". On the next line, we
> try to access region->buffer, which doesn't make sense. We really shouldn't
> assign NULL to region and then access region->buffer...
> 

Sorry... What I wrote is just wrong =)

"irb" is valid. "irb->region" is null.
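
A minimal sketch of the corrected diagnosis follows. The struct definitions are hypothetical stand-ins that keep only the fields used here, and `region_of()` and `buffer_to_validate()` are illustrative names; only the ternary expression comes from the code quoted in comment #0. It shows that the expression yields NULL both when "irb" is NULL and when "irb" is valid but "irb->region" is NULL, so a guard has to test "region" itself.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the Mesa types; only the fields used here. */
struct intel_region {
   void *buffer;
};

struct intel_renderbuffer {
   struct intel_region *region;
};

/* Same expression as in the quoted code: the result is NULL if either
 * irb or irb->region is NULL. */
static struct intel_region *
region_of(struct intel_renderbuffer *irb)
{
   return irb ? irb->region : NULL;
}

/* Returns the buffer to validate, or NULL when there is nothing to
 * validate. Testing region (not irb) covers both NULL cases. */
static void *
buffer_to_validate(struct intel_renderbuffer *irb)
{
   struct intel_region *region = region_of(irb);

   if (!region)
      return NULL;
   return region->buffer;
}
```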
Comment 3 Paulo Zanoni 2011-01-26 05:32:09 UTC
(In reply to comment #0)
> 
> Then I tried to test... After playing for a while (without crashing, the game
> going fine), I stopped etracer and went to the desktop. It was in a very
> inconsistent state: the screen was not refreshing correctly, I could still see
> parts of the game in my desktop, maximizing/minimizing applications had a weird
> behavior/flickr... When redrawing parts of the screen, sometimes X draw not the
> top window, but the content of the window below it. I'm using kwin.
> 
> Then, even without being able to use konsole correctly (I was not seeing what I
> was typing), I launched etracer again. The screen was really flashing, many
> times per second. Really annoying to watch.

Disabling kwin's desktop effects (which I think turns compositing off) seems to fix the above problem for me.

Btw, I also tested mesa 7.9.1 and it was even worse... Some parts of etracer (like the game logo) were half transparent, and right after you clicked on the button to "play" a selected level, the game segfaulted. 2-3 seconds after the game segfaults, X segfaults too. This was reproducible 100% of the time.


Talking about 7.10 again, I also tested "armagetron" (with desktop effects *disabled* and also with the 2 patches that prevent segfaults). If you just launch the game, then select "exit game" on the first menu, you will get a black screen instead of your desktop. Dmesg shows: "composite sync not supported". This one is easy to reproduce =)
Comment 4 Paulo Zanoni 2011-01-27 04:35:09 UTC
(In reply to comment #3)
> Talking about 7.10 again, if I also tested "armagetron" (with desktop effects
> *disabled* and also with the 2 patches that prevent segfaults). If you just
> launch the game, then select "exit game" on the first menu, you will get a
> black screen instead of your desktop. Dmesg shows: "composite sync not
> supported". This one is easy to reproduce =)

I just tested mesa git master from today + kernel 2.6.38-rc2+.
The armagetron problem still happens. Do you want me to open a separate bug report for it?

I also get segfaults with new mesa/kernel (usually when closing etracer).
Comment 5 Gordon Jin 2011-02-15 21:51:33 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > Talking about 7.10 again, if I also tested "armagetron" (with desktop effects
> > *disabled* and also with the 2 patches that prevent segfaults). If you just
> > launch the game, then select "exit game" on the first menu, you will get a
> > black screen instead of your desktop. Dmesg shows: "composite sync not
> > supported". This one is easy to reproduce =)
> I just tested mesa git master from today + kernel 2.6.38-rc2+.
> The armagetron problem still happens. Do you want me to open a separate bug
> report for it?

Yes, please.

> I also get segfaults with new mesa/kernel (usually when closing etracer).

It's good you could test the latest code. Let's focus on this.

btw, do you happen to know the difference among extremetuxracer, tuxracer, and ppracer? I think we've run ppracer on Sandybridge, and we are going to run tuxracer.
Comment 6 Paulo Zanoni 2011-02-16 08:37:07 UTC
(In reply to comment #5)
> 
> It's good you could test the latest code. Let's focus on this.

I've just tested today's kernel downloaded from kernel.org and today's mesa/mesa git:

Linux mandriva 2.6.38-rc5 #1 SMP Wed Feb 16 10:00:36 BRST 2011 i686 i686 i386 GNU/Linux

I tested and everything looks the same =(

If you need me to do any debugging (breakpoints, backtraces, printf variables, patches), even on kernel code, please ask. I can allocate a lot of time for this task if needed. Any tips would be welcome =)

> 
> btw, do you happen to know what's the difference among extremetuxracer v.s.
> tuxracer v.s. ppracer? I think we've run ppracer on Sandybridge and am going to
> run tuxracer.

I think they're all forks of each other. I found that one of the easiest ways to reproduce this bug is by _closing_ etracer: start etracer in fullscreen mode at the native resolution, play one track, then close etracer. If it doesn't crash X, repeat. I do this under KDE with desktop effects enabled.

This is the etracer I'm using:
http://svn.mandriva.com/cgi-bin/viewvc.cgi/packages/cooker/extremetuxracer/current/

The SPECS directory contains the RPM spec files (iow: instructions to build the package), and the SOURCES directory contains the sources used.
Comment 7 Paulo Zanoni 2011-02-16 08:56:45 UTC
(In reply to comment #5)
> > The armagetron problem still happens. Do you want me to open a separate bug
> > report for it?
> 
> Yes, please.
> 

Bug #34345
Comment 8 Chris Wilson 2011-02-19 03:27:10 UTC
The root cause is the GPU hang. What's missing here is the sanity check on a potentially NULL buffer and falling back to swrast appropriately.
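
As a rough sketch of that kind of sanity check (not the actual driver code: `struct region`, `enum draw_path` and `choose_draw_path()` are hypothetical names, and how the driver really selects its fallback is not shown in this report), the hardware path would only be chosen when every draw buffer has a region with a non-NULL buffer:

```c
#include <stddef.h>

/* Hypothetical stand-in for a region that may have lost its buffer. */
struct region {
   void *buffer;
};

/* Hypothetical result type: hardware rendering or software fallback. */
enum draw_path {
   DRAW_PATH_HW,
   DRAW_PATH_SW
};

/* Choose the hardware path only if every draw buffer has a region with a
 * non-NULL buffer; otherwise report that the software path is needed. */
static enum draw_path
choose_draw_path(struct region *const *regions, size_t count)
{
   for (size_t i = 0; i < count; i++) {
      if (regions[i] == NULL || regions[i]->buffer == NULL)
         return DRAW_PATH_SW;
   }
   return DRAW_PATH_HW;
}
```

The point of returning a decision instead of skipping the buffer is that the caller can still render through the fallback path rather than silently dropping the draw.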
Comment 9 Chris Wilson 2011-02-20 04:28:20 UTC

*** This bug has been marked as a duplicate of bug 32534 ***
