Bug 50744

Summary: Xorg crashed during flash playback
Product: xorg Reporter: Zdenek Kabelac <zdenek.kabelac>
Component: Driver/intelAssignee: Chris Wilson <chris>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium    
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Full bt
none
Xorg log from crashed Xorg
none
Damage log
none
full bt - from startkde crash
none
New crash, new bt full
none
bzip2 damage log
none
Another bt
none
Next bt with new shorter damage log
none
shorter damage log none

Description Zdenek Kabelac 2012-06-05 11:32:25 UTC
git checkout ff91fd4a2749f1ea8722189aaec938e51bce4222
xorg-x11-server-Xorg-1.12.2-1.fc18.x86_64


While I'd tried to seek in some flash played video I've got this Xorg crash.
I've relatively recently renabled hw acceleration usage for it.

(gdb) bt
#0  0x00007f6ac2f8fc25 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:63
#1  0x00007f6ac2f913d8 in __GI_abort () at abort.c:90
#2  0x00007f6ac2f88a52 in __assert_fail_base (fmt=0x7f6ac30d6388 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x7f6ac1381010 "damage->extents.x1 == region->extents.x1 && damage->extents.y1 == region->extents.y1 && damage->extents.x2 == region->extents.x2 && damage->extents.y2 == region->extents.y2", file=file@entry=0x7f6ac1380f93 "sna_damage.c", line=line@entry=502, 
    function=function@entry=0x7f6ac13812a0 "__sna_damage_reduce") at assert.c:92
#3  0x00007f6ac2f88b02 in __GI___assert_fail (
    assertion=assertion@entry=0x7f6ac1381010 "damage->extents.x1 == region->extents.x1 && damage->extents.y1 == region->extents.y1 && damage->extents.x2 == region->extents.x2 && damage->extents.y2 == region->extents.y2", file=file@entry=0x7f6ac1380f93 "sna_damage.c", line=line@entry=502, 
    function=function@entry=0x7f6ac13812a0 "__sna_damage_reduce") at assert.c:101
#4  0x00007f6ac12f2fa6 in __sna_damage_reduce (damage=damage@entry=0x1e7d340) at sna_damage.c:499
#5  0x00007f6ac12f46f8 in __sna_damage_subtract (region=<optimized out>, damage=0x1e7d340) at sna_damage.c:1021
#6  _sna_damage_subtract (damage=0x1e7d340, region=region@entry=0x7fffde01a310) at sna_damage.c:1070
#7  0x00007f6ac12e60ad in sna_damage_subtract (region=0x7fffde01a310, damage=0x39d7a68) at sna_damage.h:156
#8  sna_put_zpixmap_blt (stride=<optimized out>, 
    bits=0x2531468 "[l\245\377Zk\243\377Yj\242\377[i\240\377Yh\236\377We\233\377S^\225\377NW\215\377EN\204\377=F{\377\065:p\377,1f\377#)^\377\036\"X\377\035\035U\377\034\034S\377\033\033S\377\037\035V\377\"\036Y\377$!\\\377%!]\377&!]\377'!]\377'!]\377(\"^\377)#`\377*%b\377+&c\377,(e\377/*h\377\062.k\377;9t\377CB}\377@?z\377<;u\377\063\065m\377-/g\377(*b\377(*_\377(+^\377+/`\377.2c\377\060\064d\377\063\070g\377\066<j\377\070>m\377\071@o\377:Ap\377<Bq\377;Aq\377"..., h=63, w=63, y=179, x=254, region=0x7fffde01a310, gc=0x39980e0, drawable=0x1f95730) at sna_accel.c:2935
#9  sna_put_image (drawable=0x1f95730, gc=0x39980e0, depth=32, x=254, y=179, w=63, h=63, left=0, format=2, 
    bits=0x2531468 "[l\245\377Zk\243\377Yj\242\377[i\240\377Yh\236\377We\233\377S^\225\377NW\215\377EN\204\377=F{\377\065:p\377,1f\377#)^\377\036\"X\377\035\035U\377\034\034S\377\033\033S\377\037\035V\377\"\036Y\377$!\\\377%!]\377&!]\377'!]\377'!]\377(\"^\377)#`\377*%b\377+&c\377,(e\377/*h\377\062.k\377;9t\377CB}\377@?z\377<;u\377\063\065m\377-/g\377(*b\377(*_\377(+^\377+/`\377.2c\377\060\064d\377\063\070g\377\066<j\377\070>m\377\071@o\377:Ap\377<Bq\377;Aq\377"...) at sna_accel.c:3321
#10 0x0000000000504e92 in ?? ()
#11 0x0000000000431033 in ?? ()
#12 0x000000000043442a in ?? ()
#13 0x0000000000423495 in ?? ()
#14 0x00007f6ac2f7aba5 in __libc_start_main (main=0x423120, argc=4, ubp_av=0x7fffde01a5f8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffde01a5e8) at libc-start.c:225


print damage->extents 
$1 = {x1 = 0, y1 = 0, x2 = 319, y2 = 243}

print region->extents 
$2 = {x1 = 253, y1 = 178, x2 = 319, y2 = 243}

print nboxes
$5 = 4

print n
$8 = 2


Sorry there is no Xorg function list - some wrong is with current rawhide debuginfo package
Comment 1 Chris Wilson 2012-06-05 11:44:01 UTC
Do tell how you tweaked flash for hw acceleration... Is this gnash or adobe?

Any chance you can make this crash with debug=full? Reconstructing the contents of the damage to see where the error crept in is the fun part of this.
Comment 2 Zdenek Kabelac 2012-06-05 12:01:25 UTC
(In reply to comment #1)
> Do tell how you tweaked flash for hw acceleration... Is this gnash or adobe?
> 
> Any chance you can make this crash with debug=full? Reconstructing the contents
> of the damage to see where the error crept in is the fun part of this.

I've not been doing anything 'special' you might have it probably already enabled by default - I've just disabled in past because with UXA I've been getting various crashes at one point in time - so I've disable acceleration,
and now I've just enable it again  via 'flash' context menu.

flash-plugin-11.2.202.233-release.x86_64
firefox-12.0-1.fc18.x86_64

(For rawhide nspluginwrapper must be uninstalled - otherwise I get nspluginwrapper coredumps)

It could be true that there is no XV acceleration anyway so take only as a comment, that I've just switched this flag in adobe flash player.

(Hmm how do I check, it's played accelerated ?)
Comment 3 Chris Wilson 2012-06-05 12:06:08 UTC
*g* I've never seen it do anything other than PutImage, at best ShmPutImage. Judging by bug reports, it has been known to switch to GL when fullscreen, though not for me.
Comment 4 Chris Wilson 2012-06-05 12:27:41 UTC
Also was the crash on any site in particular? Of course, if you know how to reproduce it...
Comment 5 Chris Wilson 2012-06-05 13:38:35 UTC
Hopefully these will help pinpoint the perpetrator:

commit 1dafb4777f8378c87f34feae667582498220204c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jun 5 21:25:40 2012 +0100

    sna/damage: Add some assertions to validate that each damage box is non-empty
Comment 6 Zdenek Kabelac 2012-06-06 01:08:58 UTC
I've tried to hit the problem again - but it's not so easy - though I've manage to get the same core already 2 times.

With 'full=debug'  I'm getting pretty instant problems with those signal handlers (no valgrind involved).cd /ho

#1  0x00007f460de35b57 in _L_lock_52 () from /lib64/libc.so.6
#2  0x00007f460de35a38 in __GI__IO_fwrite (buf=0x1325510, size=48, count=1, fp=0xffffffffffffffff) at iofwrite.c:41
#3  0x0000000000468d60 in LogVWrite ()
#4  0x0000000000468bac in ErrorF ()
#5  0x00007f460c119ba0 in sna_crtc_set_cursor_position (crtc=0x1345760, x=718, y=257) at sna_display.c:810
#6  0x00000000004b6b44 in ?? ()
#7  0x000000000055a3c3 in ?? ()
#8  0x000000000055b3e8 in miPointerSetPosition ()
#9  0x0000000000448487 in ?? ()
#10 0x000000000044895d in ?? ()
#11 0x0000000000449cd3 in GetPointerEvents ()
#12 0x000000000044a19d in QueuePointerEvents ()
#13 0x000000000048fae0 in xf86PostMotionEvent ()
#14 0x00007f460a691d15 in HandleState (pInfo=pInfo@entry=0x17774a0, hw=hw@entry=0x1785690, now=<optimized out>, from_timer=from_timer@entry=0)
    at synaptics.c:3370
#15 0x00007f460a693d92 in ReadInput (pInfo=0x17774a0) at synaptics.c:1721
#16 0x0000000000480317 in ?? ()
#17 0x00000000004a44f0 in ?? ()
#18 <signal handler called>
#19 0x00007f460de35a32 in __GI__IO_fwrite (buf=0x7d2a20, size=2, count=1, fp=0x1325430) at iofwrite.c:41
#20 0x0000000000468d60 in LogVWrite ()
#21 0x0000000000468bac in ErrorF ()
#22 0x00007f460c1a6a9d in vertices_float_out (ve=0x7f460c41a41c, f=0x7f4608bbc830, max=2) at kgem_debug_gen4.c:180
#23 0x00007f460c1a6b92 in ve_out (ve=0x7f460c41a41c, ptr=0x7f4608bbc830) at kgem_debug_gen4.c:192
#24 0x00007f460c1a6d23 in indirect_vertex_out (kgem=0x7f460fea9208, v=12121) at kgem_debug_gen4.c:233
#25 0x00007f460c1a6e0c in primitive_out (kgem=0x7f460fea9208, data=0x7f460fea9aa0) at kgem_debug_gen4.c:252
#26 0x00007f460c1a836f in kgem_gen4_decode_3d (kgem=0x7f460fea9208, offset=102) at kgem_debug_gen4.c:660
#27 0x00007f460c19e4bf in __kgem_batch_debug (kgem=0x7f460fea9208, nbatch=736) at kgem_debug.c:420
#28 0x00007f460c0d4e73 in _kgem_submit (kgem=0x7f460fea9208) at kgem.c:1810
#29 0x00007f460c0dd29a in kgem_submit (kgem=0x7f460fea9208) at kgem.h:254
#30 0x00007f460c108766 in sna_accel_wakeup_handler (sna=0x7f460fea9010, ready=0x7de540) at sna_accel.c:12465
#31 0x00007f460c11ecff in sna_wakeup_handler (arg=0, wakeupData=0x0, result=1, read_mask=0x7de540) at sna_driver.c:610
#32 0x000000000043839a in WakeupHandler ()
#33 0x0000000000462414 in WaitForSomething ()
#34 0x00000000004340f1 in ?? ()
#35 0x0000000000423495 in ?? ()
#36 0x00007f460dde9ba5 in __libc_start_main (main=0x423120, argc=4, ubp_av=0x7fffaaf86948, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffaaf86938) at libc-start.c:225
#37 0x000000000042376d in _start ()

So to fix those - I've disable UseSGIO
 
Server {  Option "UseSIGIO" "false" }

But I've not managed to convince my very slow desktop with all those debugs
to generate a crash again.

I'll keep an eye on this - and maybe those assert will help (though I'm not use full debug now)
Comment 7 Zdenek Kabelac 2012-06-06 01:35:27 UTC
(In reply to comment #6)
> I've tried to hit the problem again - but it's not so easy - though I've manage
> to get the same core already 2 times.
> 
> With 'full=debug'  I'm getting pretty instant problems with those signal
> handlers (no valgrind involved).cd /ho
> 
> #1  0x00007f460de35b57 in _L_lock_52 () from /lib64/libc.so.6
> #2  0x00007f460de35a38 in __GI__IO_fwrite (buf=0x1325510, size=48, count=1,
> fp=0xffffffffffffffff) at iofwrite.c:41
> #3  0x0000000000468d60 in LogVWrite ()

I should mention here it's the place of Xorg deadlock. So the binary is running - but everything stays frozen in the lock above.  Just to it clear this is not coredump (unlike with valgrind, where it shutdowns the server)
Comment 8 Chris Wilson 2012-06-06 01:43:03 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > I've tried to hit the problem again - but it's not so easy - though I've manage
> > to get the same core already 2 times.
> > 
> > With 'full=debug'  I'm getting pretty instant problems with those signal
> > handlers (no valgrind involved).cd /ho
> > 
> > #1  0x00007f460de35b57 in _L_lock_52 () from /lib64/libc.so.6
> > #2  0x00007f460de35a38 in __GI__IO_fwrite (buf=0x1325510, size=48, count=1,
> > fp=0xffffffffffffffff) at iofwrite.c:41
> > #3  0x0000000000468d60 in LogVWrite ()
> 
> I should mention here it's the place of Xorg deadlock. So the binary is running
> - but everything stays frozen in the lock above.  Just to it clear this is not
> coredump (unlike with valgrind, where it shutdowns the server)

No worries, this is just the danger of doing anything from a signal handler. As this can only happen with debug=full, it's not critical except that is preventing reporters from getting useful info...
Comment 9 Zdenek Kabelac 2012-06-06 02:28:22 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > I've tried to hit the problem again - but it's not so easy - though I've manage
> > > to get the same core already 2 times.
> > > 
> > > With 'full=debug'  I'm getting pretty instant problems with those signal
> > > handlers (no valgrind involved).cd /ho
> > > 
> > > #1  0x00007f460de35b57 in _L_lock_52 () from /lib64/libc.so.6
> > > #2  0x00007f460de35a38 in __GI__IO_fwrite (buf=0x1325510, size=48, count=1,
> > > fp=0xffffffffffffffff) at iofwrite.c:41
> > > #3  0x0000000000468d60 in LogVWrite ()
> > 
> > I should mention here it's the place of Xorg deadlock. So the binary is running
> > - but everything stays frozen in the lock above.  Just to it clear this is not
> > coredump (unlike with valgrind, where it shutdowns the server)
> 
> No worries, this is just the danger of doing anything from a signal handler. As
> this can only happen with debug=full, it's not critical except that is
> preventing reporters from getting useful info...


Maybe the debug code could be extended to skip debug print if called through signal handler then?
Comment 10 Chris Wilson 2012-06-06 02:33:57 UTC
I was thinking of just removing the likely candidates for recursion during signal handlers... But it would be possible to wrap all the DBG and block SIGIO during the write... Hmm, there's still potential for other fail though, so I think just removing this deadlock is simplest. (For instance, SIGIO recursion is the best explanation I have so far for the memcheck explosion.)
Comment 11 Zdenek Kabelac 2012-06-06 02:48:54 UTC
(In reply to comment #10)
> I was thinking of just removing the likely candidates for recursion during
> signal handlers... But it would be possible to wrap all the DBG and block SIGIO
> during the write... Hmm, there's still potential for other fail though, so I
> think just removing this deadlock is simplest. (For instance, SIGIO recursion
> is the best explanation I have so far for the memcheck explosion.)


Right - masking the SIGIO during debug printing would probably fix the issue.
Unsure how efficient is this - and how it changes the timing.
But it's already slow - so probably it's not an issue to make it tiny bit slower.
Comment 12 Chris Wilson 2012-06-07 03:24:27 UTC
I didn't like either of the options. The root cause is the use of SIGIO by X, but I'm happier with just ignoring the DBG prints for the cursor routines for the time being:

commit c58d137d3eeb0e97bfd53e68404e04d9012b5697
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 7 11:18:27 2012 +0100

    sna: Suppress DBG prints along potential SIGIO paths
    
    As ErrorF/fprintf is not re-entrant due to its mutex we can not use DBG
    from code that could be called by a signal handler. X's SIGIO handler
    attempts to move the cursor from within the handler (eek!) and so we
    need to be careful not to take any locks, such in as the aforementioned
    fprintf, along the cursor paths.
    
    Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=50744
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 13 Zdenek Kabelac 2012-06-12 11:26:55 UTC
Created attachment 62940 [details]
Full bt

Happend again today - with git commit 392e33a62d729c64c57699505220b4029e015470
(so asserts should be compiled in -  build with --enable-debug=memory)

print *region
{extents = {x1 = 133, y1 = 0, x2 = 542, y2 = 541}, data = 0x20aa410}

print *damage
{extents = {x1 = 58, y1 = 0, x2 = 542, y2 = 541}, region = {extents = {x1 = 133, y1 = 0, x2 = 542, y2 = 541}, data = 0x20aa410}, 
  mode = DAMAGE_ADD, remain = 6, dirty = 1, box = 0x216dba4, embedded_box = {list = {next = 0x216db80, prev = 0x216db80}, size = 8, box = {{
        x1 = 350, y1 = 224, x2 = 542, y2 = 242}, {x1 = 133, y1 = 242, x2 = 542, y2 = 541}, {x1 = 133, y1 = 0, x2 = 542, y2 = 224}, {x1 = 133, 
        y1 = 224, x2 = 328, y2 = 242}, {x1 = 133, y1 = 0, x2 = 542, y2 = 160}, {x1 = 133, y1 = 160, x2 = 542, y2 = 320}, {x1 = 133, y1 = 320, 
        x2 = 542, y2 = 480}, {x1 = 133, y1 = 480, x2 = 542, y2 = 541}}}}
Comment 14 Zdenek Kabelac 2012-06-12 11:28:35 UTC
Created attachment 62941 [details]
Xorg log from crashed Xorg

I do not have any reliable reproducer - bug it has again happened with some flash.
Comment 15 Zdenek Kabelac 2012-06-12 11:30:08 UTC
Since I've still full debugs and core available - I could add fill in more variables from stack.
Comment 16 Chris Wilson 2012-06-12 13:36:56 UTC
The minor annoyance is that we've lost the damage->region. However, its boxes appears to be inside damage->embedded_boxes, so can you p nboxes and p free_boxes?
If free_boxes!=NULL, it would be worth printing free_boxes[0:nboxes].
Comment 17 Zdenek Kabelac 2012-06-12 13:45:34 UTC
(In reply to comment #16)
> The minor annoyance is that we've lost the damage->region. However, its boxes
> appears to be inside damage->embedded_boxes, so can you p nboxes and p
> free_boxes?
> If free_boxes!=NULL, it would be worth printing free_boxes[0:nboxes].

(gdb) print boxes
$1 = (BoxPtr) 0x216db94
(gdb) print free_boxes
$2 = (BoxPtr) 0x0
(gdb) print nboxes
$3 = 4
(gdb) print iter->size
$4 = 8
(gdb) print damage->remain
$5 = 6
(gdb) print n
$6 = 2


so probably not big help
Comment 18 Chris Wilson 2012-06-12 14:11:09 UTC
And then he realises he has the full bt...
Comment 19 Chris Wilson 2012-06-12 14:57:44 UTC
That assertion looks like it should be self-contained to the damage code, yet I've reviewed every line of code there and can't see how we could end up with that assertion failling. :(

I've added a few more assertions, and tweaked the likely paths, so if you could update and force DBG for src/sna/sna_damage.c that would be useful.

(There is a DEBUG_DAMAGE in src/sna/sna.h, though it is probably easier to just redefine that value in src/sna/sna_damage.c)
Comment 20 Zdenek Kabelac 2012-06-13 14:38:05 UTC
Created attachment 62990 [details]
Damage log

Log from  'startkde'  session which gives me same coredump for me - instantly anytime :)
Comment 21 Zdenek Kabelac 2012-06-13 14:41:26 UTC
Created attachment 62991 [details]
full bt  - from  startkde crash

Just to show the bt from crashed KDE session.

(which is  btw actually pretty slow with windowmoment)
I think my kde is configure to not use compositing
(But I'm not really using it often - just been testing something)
Comment 22 Chris Wilson 2012-06-13 15:02:30 UTC
Lucky you :-p

commit 221534abe2dc04fae8b8fc332104bca275d4863b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jun 13 22:48:43 2012 +0100

    sna: Only reduce damage addition to a region operation if clean
    
    Some paths bypass operating upon the region as they do not have an
    YX-banded set of boxes and so prefer to defer the costly construction of
    the region till later. As a result, we have to be careful not to
    overwrite any existing information if we do operate on the region after
    setting the dirty boxes.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=50744
    Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


(In reply to comment #21)
> (which is  btw actually pretty slow with windowmoment)
> I think my kde is configure to not use compositing
> (But I'm not really using it often - just been testing something)

It shouldn't be that slow, works acceptably if you turn of GL compositing on PNV for instance. However, given that you are exercising the damage paths implies that you are mixing rendering on the CPU and the GPU and migrating between them. Not ideal. Can you grab a debug=full log of the slow wm, so that I can see if there is anything obvious I can do?
Comment 23 Zdenek Kabelac 2012-06-13 15:44:32 UTC
Great, I could confirm that KDE now started without crash, so I assume same applies to flash playback. Also the window movement does not leave 'solitaire-like' trail, so the speed is ok I guess.
Comment 24 Zdenek Kabelac 2012-06-14 08:47:51 UTC
Created attachment 63029 [details]
New crash, new bt full

Doesn't want to go away so easily :)

Happened when I've put all the accelerations defines from line 72-91 in sna_accel.c to 0  after some windows movement.
Comment 25 Zdenek Kabelac 2012-06-14 08:49:49 UTC
Created attachment 63030 [details]
bzip2  damage log

This time somewhat larger log - how big part is needed for attachment ?
Comment 26 Zdenek Kabelac 2012-06-14 08:59:39 UTC
Created attachment 63031 [details]
Another bt
Comment 27 Zdenek Kabelac 2012-06-14 09:00:25 UTC
Created attachment 63032 [details]
Next bt  with new shorter damage log
Comment 28 Zdenek Kabelac 2012-06-14 09:01:25 UTC
Created attachment 63033 [details]
shorter damage log

Just thunderbird started on openbox desktop and mouse moved inside.
Comment 29 Chris Wilson 2012-06-14 09:13:17 UTC
Onwards! Deja vu?

commit b415ca05c2e1c4f09f85d8769d39e5369661ed3a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 14 17:06:19 2012 +0100

    sna: Reset damage extents after reduction goes to zero
    
    If the reduction of the damage clears all of the boxes, we need to reset
    the -infinite extents so that we continue to accumulate further damage.
    
    Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
    References: https://bugs.freedesktop.org/show_bug.cgi?id=50744
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk
Comment 30 Chris Wilson 2012-06-14 09:17:50 UTC
The shorter log also matches the same bug. \o/
Comment 31 Chris Wilson 2012-06-16 04:30:16 UTC
If I mark this bug resolved, that will encourage the next one to bite...

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.