Bug 92751 - Segmentation Fault in sna_put_zpixmap_blt (X server crash)
Summary: Segmentation Fault in sna_put_zpixmap_blt (X server crash)
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel (show other bugs)
Version: git
Hardware: x86-64 (AMD64) Linux (All)
: medium critical
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-30 23:18 UTC by Joe Peterson
Modified: 2015-11-25 06:04 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
Xorg.0.log (28.92 KB, text/plain)
2015-10-30 23:18 UTC, Joe Peterson
no flags Details
Xorg.0.log with --enable-debug (26.88 KB, text/plain)
2015-11-05 00:38 UTC, Joe Peterson
no flags Details
Xorg.0.log with slightly different traceback (33.49 KB, text/plain)
2015-11-05 16:26 UTC, Joe Peterson
no flags Details
Xorg.0.log after getting bt in gdb (31.91 KB, text/plain)
2015-11-06 18:03 UTC, Joe Peterson
no flags Details
gdb backtrace (2.19 KB, text/plain)
2015-11-06 18:22 UTC, Joe Peterson
no flags Details
Xorg.0.log (tail) with debug=full (129.34 KB, text/plain)
2015-11-06 19:40 UTC, Joe Peterson
no flags Details
Xorg.0.log after git commit b1015d4 (11.52 KB, text/plain)
2015-11-06 23:04 UTC, Joe Peterson
no flags Details
Xorg.0.log showing original crash (32.22 KB, text/plain)
2015-11-08 18:19 UTC, Joe Peterson
no flags Details
gdb.txt backtrace (full) showing original crash (3.56 KB, text/plain)
2015-11-08 18:24 UTC, Joe Peterson
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Joe Peterson 2015-10-30 23:18:25 UTC
Created attachment 119308 [details]
Xorg.0.log

I get a somewhat random X server crash (I say somewhat, since it tends to happen when I am running qemu and actively using several urxvt terminals). The more I am doing, the more likely, but that's probably case with many X bugs.

It is happening with the latest xf86-video-intel from git (commit d78200e), which I built from source today, but it also was happening with the latest Arch Linux package (git commit df72bc5).

Here's a traceback with debug symbols:

[   271.424] (EE) 0: /usr/lib/xorg-server/Xorg (OsLookupColor+0x139) [0x596d09]
[   271.425] (EE) 1: /usr/lib/libc.so.6 (__restore_rt+0x0) [0x7f8555a7a67f]
[   271.427] (EE) 2: /usr/lib/xorg/modules/drivers/intel_drv.so (__kgem_retire_rq+0x58) [0x7f85506170e8]
[   271.428] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (__kgem_retire_requests_upto+0x47) [0x7f8550618417]
[   271.428] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_zpixmap_blt.isra.96+0x401) [0x7f855064b441]
[   271.428] (EE) 5: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_image+0x215) [0x7f855064b665]
[   271.428] (EE) 6: /usr/lib/xorg-server/Xorg (DamageRegionAppend+0x3783) [0x5216c3]
[   271.428] (EE) 7: /usr/lib/xorg-server/Xorg (SyncVerifyFence+0x2205) [0x4d8ec5]
[   271.428] (EE) 8: /usr/lib/xorg-server/Xorg (SyncVerifyFence+0x3645) [0x4dbb85]
[   271.429] (EE) 9: /usr/lib/xorg-server/Xorg (SendErrorToClient+0x2ff) [0x438fbf]
[   271.429] (EE) 10: /usr/lib/xorg-server/Xorg (remove_fs_handlers+0x41b) [0x43d09b]
[   271.429] (EE) 11: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x7f8555a67610]
[   271.429] (EE) 12: /usr/lib/xorg-server/Xorg (_start+0x29) [0x427489]
[   271.430] (EE) 13: ? (?+0x29) [0x29]

If I can turn on additional debugging to help, let me know, but this is not always quickly reproducible. I tried using --enable-debug=full, but the file got way too large.
Comment 1 Chris Wilson 2015-10-31 09:33:48 UTC
Start with --enable-debug (not =full). That will throw a lot of assertions to hopefully identify the bug earlier. Failing that attaching gdb and getting a full backtrace would be invaluable.
Comment 2 Joe Peterson 2015-11-02 15:39:39 UTC
(In reply to Chris Wilson from comment #1)
> Start with --enable-debug (not =full). That will throw a lot of assertions
> to hopefully identify the bug earlier. Failing that attaching gdb and
> getting a full backtrace would be invaluable.

I've been trying to get it to happen with gdb attached - no luck so far, but I'll keep at it when possible. BTW, --enable-debug has so far not produced any output. Are you sure that's enough? Does the backtrace I put in give any clues at all in intel_drv.so?
Comment 3 Chris Wilson 2015-11-02 20:27:28 UTC
It's the value of the locals/parameters that is interesting. There are a few special values that have been known to explode there, which were due to a bug in handling a malloc failure, so checking the values is interesting to compare against the old bug. However, to get a crash there implies that the list is corrupt and so an earlier bug.
Comment 4 Joe Peterson 2015-11-05 00:38:50 UTC
Created attachment 119418 [details]
Xorg.0.log with --enable-debug

Chris, I have been as-yet unable to make it happen with gdb attached, but I did just make it happen with --enable-debug set. One difference I see is that there is now an assertion rather than a seg fault:

__kgem_retire_requests_upto:3290 assertion 'tmp->ring == rq->ring' failed

Does that help at all? If not, I'll keep trying to crash it with gdb connected...
Comment 5 Joe Peterson 2015-11-05 16:26:44 UTC
Created attachment 119424 [details]
Xorg.0.log with slightly different traceback

Here's another log file with a slightly longer/different traceback. Could it be related to the same problem? My method of reproducing is the same (which in itself is rather random: moving windows around rapidly, etc.).

BTW, this time it did happen while I was in gdb, but I got "no stack" when doing the "bt full" - damn. I was attached to the "Xorg" process - is this the correct one?
Comment 6 Chris Wilson 2015-11-05 17:04:57 UTC
That looks to be a different bug, whilst it may cause sufficient corruption to result in the earlier assertion failure, it would have died with the second assertion beforehand.

commit af680486608686ade375f5737bff556343366230
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 5 16:54:36 2015 +0000

    sna/dri2: Update TearFree recursion prevention
    
    Some time past, we switched to a seperate boolean to mark when the
    shadow is no longer accessible due to recursion - but I missed an
    important check inside the vblank handler.
    
    Reported-by: Joe Peterson <joe@wildlava.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92751#c5
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

The first assertion is consistent with the segfault. But what is surprising is that we didn't hit an earlier assertion, so the list is being corrupted during later processing (i.e. something akin to a use-after-free).

As for gdb not helping, the process exited before gdb prompted. Seems like FatalError() no longer causes a gdb trap, so add a breakpoint on FatalError instead.
Comment 7 Joe Peterson 2015-11-05 17:21:06 UTC
Ok, I'll take another look tomorrow morning with gdb. BTW, would it help to set --enable-debug=pixmap for one or both of these?
Comment 8 Chris Wilson 2015-11-05 20:52:31 UTC
(In reply to Joe Peterson from comment #7)
> Ok, I'll take another look tomorrow morning with gdb. BTW, would it help to
> set --enable-debug=pixmap for one or both of these?

No. debug=pixmap is intended to verify that drawing operations are within bounds and that the damage tracking is self-consistent (proving correctness is quite hard since most of the bugs are in the lies we tell to avoid having to track everything).
Comment 9 Joe Peterson 2015-11-06 18:03:41 UTC
Created attachment 119450 [details]
Xorg.0.log after getting bt in gdb

Here's one more Xorg.0.log. This time I did get a backtrace in gbd - attaching it next...
Comment 10 Joe Peterson 2015-11-06 18:22:20 UTC
Created attachment 119451 [details]
gdb backtrace

OK, here's a backtrace from gdb (paired with the last log file)! I really hope the variables that got optimized out do not hinder learning something from this, as these backtraces are hard to generate, but let me know on that.
Comment 11 Chris Wilson 2015-11-06 18:43:15 UTC
It at least confirms that it is a novel bug. :|
Comment 12 Joe Peterson 2015-11-06 18:47:06 UTC
(In reply to Chris Wilson from comment #11)
> It at least confirms that it is a novel bug. :|

Well, and there is that other trace I got once - I kinda hope its related so we are not chasing two bugs here.

BTW, How best to compile without optimization?
Comment 13 Chris Wilson 2015-11-06 19:10:47 UTC
Hmm, I put in configure.ac:

if test "x$DEBUG" = "xfull"; then
        AC_DEFINE(DEBUG_MEMORY,1,[Enable memory debugging])
        AC_DEFINE(DEBUG_PIXMAP,1,[Enable pixmap debugging])
        AC_DEFINE(HAS_DEBUG_FULL,1,[Enable all debugging])
        CFLAGS="$CFLAGS -O0 -ggdb3"
        debug_msg=" full"
fi

so with any luck we have the best debugging offered. But you can edit CFLAGS to try harder.
Comment 14 Joe Peterson 2015-11-06 19:40:34 UTC
Created attachment 119454 [details]
Xorg.0.log (tail) with debug=full

OK, cool. I've recompiled with --enable-debug=full, and although I could not attach gdb this time (other machine in-use at the moment), I did eventually get it to crash, and this attachment is the last 130K or so of the Xorg.0.log file. Note the traceback is different. I create these most easily now by playing a video and dragging the window around the screen.

I can try to get a gdb trace later. But let me know if this is illuminating at all and/or related to the other trace.

Do you need more of the log file?
Comment 15 Chris Wilson 2015-11-06 20:43:04 UTC
(In reply to Joe Peterson from comment #14)
> Created attachment 119454 [details]
> Xorg.0.log (tail) with debug=full
> 
> OK, cool. I've recompiled with --enable-debug=full, and although I could not
> attach gdb this time (other machine in-use at the moment), I did eventually
> get it to crash, and this attachment is the last 130K or so of the
> Xorg.0.log file. Note the traceback is different. I create these most easily
> now by playing a video and dragging the window around the screen.

commit b1015d4de0cb040dc47ebfbef1114b65269f6f36
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Nov 6 20:41:03 2015 +0000

    sna/dri2: Protect against a new possible recursion during TearFree
    
    With the introduction of a new SwapEvent that may trigger a copy, we
    need to add another check for recursion from a TearFree flip event.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=92751#c14
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 16 Joe Peterson 2015-11-06 23:04:32 UTC
Created attachment 119455 [details]
Xorg.0.log after git commit b1015d4

Chris, I tried the new code (git b1015d4), and I get:

    _kgem_submit:4067 assertion 'rq->bo == NULL' failed

on start (see attachment for log). I simply re-built and re-installed the 3 libs - anything else I need to do to test the new changes?
Comment 17 Chris Wilson 2015-11-06 23:30:25 UTC
I made a mistake in adding that assertion. Darn it.
Comment 18 Joe Peterson 2015-11-06 23:34:45 UTC
(In reply to Chris Wilson from comment #17)
> I made a mistake in adding that assertion. Darn it.

Ok, I'm building a branch with a cherry-pick of your latest commit, skipping commit 1982462 that caused the assertion; I'll let you know if that works...
Comment 19 Joe Peterson 2015-11-06 23:52:57 UTC
(In reply to Joe Peterson from comment #18)
> (In reply to Chris Wilson from comment #17)
> > I made a mistake in adding that assertion. Darn it.
> 
> Ok, I'm building a branch with a cherry-pick of your latest commit, skipping
> commit 1982462 that caused the assertion; I'll let you know if that works...

Ok, with this (omitting the kgem commit), I get this backtrace (similar to my original reported trace):

Backtrace:
0: /usr/lib/xorg/modules/drivers/intel_drv.so (__kgem_retire_requests_upto+0x71) [0x7f3876671f91]
1: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_zpixmap_blt.isra.128+0x1b2) [0x7f38766c03e2]
2: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_image+0x2f8) [0x7f38766c0f18]

Perhaps this is expected, since that commit can fix this (?), now commenting our the assert in question but including the commit otherwise...
Comment 20 Joe Peterson 2015-11-07 00:05:06 UTC
(In reply to Joe Peterson from comment #19)
> (In reply to Joe Peterson from comment #18)
> > (In reply to Chris Wilson from comment #17)
> > > I made a mistake in adding that assertion. Darn it.
> > 
> > Ok, I'm building a branch with a cherry-pick of your latest commit, skipping
> > commit 1982462 that caused the assertion; I'll let you know if that works...
> 
> Ok, with this (omitting the kgem commit), I get this backtrace (similar to
> my original reported trace):
> 
> Backtrace:
> 0: /usr/lib/xorg/modules/drivers/intel_drv.so
> (__kgem_retire_requests_upto+0x71) [0x7f3876671f91]
> 1: /usr/lib/xorg/modules/drivers/intel_drv.so
> (sna_put_zpixmap_blt.isra.128+0x1b2) [0x7f38766c03e2]
> 2: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_image+0x2f8)
> [0x7f38766c0f18]
> 
> Perhaps this is expected, since that commit can fix this (?), now commenting
> our the assert in question but including the commit otherwise...

Ok, just comfirming that I still get above backtrace even with all your commits (through b1015d4, including kgem one (1982462)) minus the two asserts that kept X from starting (diff from your b1015d4 is:

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index 7acc69e..0e2f830 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -4064,13 +4064,13 @@ void _kgem_submit(struct kgem *kgem)
 #endif
 
        rq = kgem->next_request;
-       assert(rq->bo == NULL);
+       //assert(rq->bo == NULL);
 
        rq->bo = kgem_create_batch(kgem);
        if (rq->bo) {
                struct drm_i915_gem_execbuffer2 execbuf;
 
-               assert(rq->bo->refcnt == 1);
+               //assert(rq->bo->refcnt == 1);
                assert(!rq->bo->needs_flush);
 
                i = kgem->nexec++;



So the new code could have fixed the TearFree issue (I'll let you know if I see that again), but original issue seems to still exist. I'll try to get a gdb backtrace of that and/or a debug=full log...
Comment 21 Joe Peterson 2015-11-08 18:19:19 UTC
Created attachment 119485 [details]
Xorg.0.log showing original crash

I finally captured the original crash in gdb! (Well, it has a few new calls in the stack: __kgem_bo_is_busy, ignore_cpu_damage, try_upload__fast - but these may be due to turning off optimization...?)

This is the Xorg.0.log produced, but the next attachment is the gdb full backtrace.
Comment 22 Joe Peterson 2015-11-08 18:24:08 UTC
Created attachment 119486 [details]
gdb.txt backtrace (full) showing original crash

Here's the gdb full backtrace. Code compiled with -O0, etc. (same CFLAGS as used in full debug mode but without that mode on (note that I have not yet been able to capture this crash with --enable-debug=full).

I really hope this is useful. And BTW, I have not seen the TearFree crash again since using the newer code, so hopefully that one is in the past, and we now only have this original case (let me know id this looks like the original case even with the added calls in that stack).
Comment 23 Chris Wilson 2015-11-08 18:34:23 UTC
Still haven't found the possible list/element corruption, but could you try this simplification:

diff --git a/src/sna/kgem.c b/src/sna/kgem.c
index b80fde7..0f1c1f4 100644
--- a/src/sna/kgem.c
+++ b/src/sna/kgem.c
@@ -3275,12 +3275,11 @@ bool __kgem_ring_is_idle(struct kgem *kgem, int ring)
 
 bool __kgem_retire_requests_upto(struct kgem *kgem, struct kgem_bo *bo)
 {
-       struct kgem_request *rq = bo->rq, *tmp;
-       struct list *requests = &kgem->requests[RQ_RING(rq) == KGEM_BLT];
+       struct kgem_request *rq = RQ(bo->rq), *tmp;
+       struct list *requests = &kgem->requests[rq->ring];
 
-       DBG(("%s(handle=%d)\n", __FUNCTION__, bo->handle));
+       DBG(("%s(handle=%d, ring=%d)\n", __FUNCTION__, bo->handle, rq->ring));
 
-       rq = RQ(rq);
        assert(rq != &kgem->static_request);
        if (rq == (struct kgem_request *)kgem) {
                __kgem_bo_clear_busy(bo);
Comment 24 Joe Peterson 2015-11-10 16:25:53 UTC
(In reply to Chris Wilson from comment #23)
> Still haven't found the possible list/element corruption, but could you try
> this simplification:

Did you want me to turn on debugging messages (which config switch will turn on output from this patch?), or just see if it crashes? I'm running now with this patch, no crash yet, but I'll have to give it some use before feeling more comfortable.

If there are more debug messages we can add that will help, please let me know - I'll be glad to make the crash happen with more debug output.
Comment 25 Joe Peterson 2015-11-13 23:07:21 UTC
Chris, so far, I've not been able to make it crash with your patch in comment #23. Since crashing is a bit random, I can't say for sure if this the patch or just luck, but it is promising. Are there other cases where the code does it the old way, or did that one stick out as unique? If indeed it did fix the issue, it would be good to change it globally. I did not analyze the fix - let me know what the difference is if you have time. I am curious.
Comment 26 Chris Wilson 2015-11-13 23:14:04 UTC
It is used as a flag in other places, and not as an index. Everywhere else we want an index, we use the rq->ring so in that respect it was anomalous. Still doesn't bode entirely well as I don't have an explanation for the crash.
Comment 27 Joe Peterson 2015-11-13 23:24:19 UTC
(In reply to Chris Wilson from comment #26)
> It is used as a flag in other places, and not as an index. Everywhere else
> we want an index, we use the rq->ring so in that respect it was anomalous.
> Still doesn't bode entirely well as I don't have an explanation for the
> crash.

OK, thanks. Well, I can back out the patch and put in some more debugging lines - just let me know if you know of appropriate ones (or can I turn on the one that enables "DBG()" calls to get into the log?).
Comment 28 Joe Peterson 2015-11-24 18:50:48 UTC
Just another update: still no crashes since the comment #23 patch. Statistics say this likely fixed it. Worth integrating (i.e. any disadvantage to integrating)?
Comment 29 Chris Wilson 2015-11-24 22:20:02 UTC
Indeed, already did. I thought using the more common pattern was sensible. Nothing still occurs to me how we die there that wouldn't be picked up by an assertion before and that worries me.
Comment 30 Joe Peterson 2015-11-25 06:04:28 UTC
(In reply to Chris Wilson from comment #29)
> Indeed, already did. I thought using the more common pattern was sensible.
> Nothing still occurs to me how we die there that wouldn't be picked up by an
> assertion before and that worries me.

I see the commit - OK, cool. I am now running at 0995ad2 (latest HEAD as of this afternoon), so I'll let you know if there are any further crashes. I agree it's disconcerting that the reason the patch helps is uncertain. I can always revert back to before the patch and collect additional debug info if you have ideas of what would help. Let me know.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.