Created attachment 119308 [details] Xorg.0.log I get a somewhat random X server crash (I say somewhat, since it tends to happen when I am running qemu and actively using several urxvt terminals). The more I am doing, the more likely, but that's probably case with many X bugs. It is happening with the latest xf86-video-intel from git (commit d78200e), which I built from source today, but it also was happening with the latest Arch Linux package (git commit df72bc5). Here's a traceback with debug symbols: [ 271.424] (EE) 0: /usr/lib/xorg-server/Xorg (OsLookupColor+0x139) [0x596d09] [ 271.425] (EE) 1: /usr/lib/libc.so.6 (__restore_rt+0x0) [0x7f8555a7a67f] [ 271.427] (EE) 2: /usr/lib/xorg/modules/drivers/intel_drv.so (__kgem_retire_rq+0x58) [0x7f85506170e8] [ 271.428] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (__kgem_retire_requests_upto+0x47) [0x7f8550618417] [ 271.428] (EE) 4: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_zpixmap_blt.isra.96+0x401) [0x7f855064b441] [ 271.428] (EE) 5: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_image+0x215) [0x7f855064b665] [ 271.428] (EE) 6: /usr/lib/xorg-server/Xorg (DamageRegionAppend+0x3783) [0x5216c3] [ 271.428] (EE) 7: /usr/lib/xorg-server/Xorg (SyncVerifyFence+0x2205) [0x4d8ec5] [ 271.428] (EE) 8: /usr/lib/xorg-server/Xorg (SyncVerifyFence+0x3645) [0x4dbb85] [ 271.429] (EE) 9: /usr/lib/xorg-server/Xorg (SendErrorToClient+0x2ff) [0x438fbf] [ 271.429] (EE) 10: /usr/lib/xorg-server/Xorg (remove_fs_handlers+0x41b) [0x43d09b] [ 271.429] (EE) 11: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x7f8555a67610] [ 271.429] (EE) 12: /usr/lib/xorg-server/Xorg (_start+0x29) [0x427489] [ 271.430] (EE) 13: ? (?+0x29) [0x29] If I can turn on additional debugging to help, let me know, but this is not always quickly reproducible. I tried using --enable-debug=full, but the file got way too large.
Start with --enable-debug (not =full). That will throw a lot of assertions to hopefully identify the bug earlier. Failing that attaching gdb and getting a full backtrace would be invaluable.
(In reply to Chris Wilson from comment #1) > Start with --enable-debug (not =full). That will throw a lot of assertions > to hopefully identify the bug earlier. Failing that attaching gdb and > getting a full backtrace would be invaluable. I've been trying to get it to happen with gdb attached - no luck so far, but I'll keep at it when possible. BTW, --enable-debug has so far not produced any output. Are you sure that's enough? Does the backtrace I put in give any clues at all in intel_drv.so?
It's the value of the locals/parameters that is interesting. There are a few special values that have been known to explode there, which were due to a bug in handling a malloc failure, so checking the values is interesting to compare against the old bug. However, to get a crash there implies that the list is corrupt and so an earlier bug.
Created attachment 119418 [details] Xorg.0.log with --enable-debug Chris, I have been as-yet unable to make it happen with gdb attached, but I did just make it happen with --enable-debug set. One difference I see is that there is now an assertion rather than a seg fault: __kgem_retire_requests_upto:3290 assertion 'tmp->ring == rq->ring' failed Does that help at all? If not, I'll keep trying to crash it with gdb connected...
Created attachment 119424 [details] Xorg.0.log with slightly different traceback Here's another log file with a slightly longer/different traceback. Could it be related to the same problem? My method of reproducing is the same (which in itself is rather random: moving windows around rapidly, etc.). BTW, this time it did happen while I was in gdb, but I got "no stack" when doing the "bt full" - damn. I was attached to the "Xorg" process - is this the correct one?
That looks to be a different bug, whilst it may cause sufficient corruption to result in the earlier assertion failure, it would have died with the second assertion beforehand. commit af680486608686ade375f5737bff556343366230 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Nov 5 16:54:36 2015 +0000 sna/dri2: Update TearFree recursion prevention Some time past, we switched to a seperate boolean to mark when the shadow is no longer accessible due to recursion - but I missed an important check inside the vblank handler. Reported-by: Joe Peterson <joe@wildlava.com> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92751#c5 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> The first assertion is consistent with the segfault. But what is surprising is that we didn't hit an earlier assertion, so the list is being corrupted during later processing (i.e. something akin to a use-after-free). As for gdb not helping, the process exited before gdb prompted. Seems like FatalError() no longer causes a gdb trap, so add a breakpoint on FatalError instead.
Ok, I'll take another look tomorrow morning with gdb. BTW, would it help to set --enable-debug=pixmap for one or both of these?
(In reply to Joe Peterson from comment #7) > Ok, I'll take another look tomorrow morning with gdb. BTW, would it help to > set --enable-debug=pixmap for one or both of these? No. debug=pixmap is intended to verify that drawing operations are within bounds and that the damage tracking is self-consistent (proving correctness is quite hard since most of the bugs are in the lies we tell to avoid having to track everything).
Created attachment 119450 [details] Xorg.0.log after getting bt in gdb Here's one more Xorg.0.log. This time I did get a backtrace in gbd - attaching it next...
Created attachment 119451 [details] gdb backtrace OK, here's a backtrace from gdb (paired with the last log file)! I really hope the variables that got optimized out do not hinder learning something from this, as these backtraces are hard to generate, but let me know on that.
It at least confirms that it is a novel bug. :|
(In reply to Chris Wilson from comment #11) > It at least confirms that it is a novel bug. :| Well, and there is that other trace I got once - I kinda hope its related so we are not chasing two bugs here. BTW, How best to compile without optimization?
Hmm, I put in configure.ac: if test "x$DEBUG" = "xfull"; then AC_DEFINE(DEBUG_MEMORY,1,[Enable memory debugging]) AC_DEFINE(DEBUG_PIXMAP,1,[Enable pixmap debugging]) AC_DEFINE(HAS_DEBUG_FULL,1,[Enable all debugging]) CFLAGS="$CFLAGS -O0 -ggdb3" debug_msg=" full" fi so with any luck we have the best debugging offered. But you can edit CFLAGS to try harder.
Created attachment 119454 [details] Xorg.0.log (tail) with debug=full OK, cool. I've recompiled with --enable-debug=full, and although I could not attach gdb this time (other machine in-use at the moment), I did eventually get it to crash, and this attachment is the last 130K or so of the Xorg.0.log file. Note the traceback is different. I create these most easily now by playing a video and dragging the window around the screen. I can try to get a gdb trace later. But let me know if this is illuminating at all and/or related to the other trace. Do you need more of the log file?
(In reply to Joe Peterson from comment #14) > Created attachment 119454 [details] > Xorg.0.log (tail) with debug=full > > OK, cool. I've recompiled with --enable-debug=full, and although I could not > attach gdb this time (other machine in-use at the moment), I did eventually > get it to crash, and this attachment is the last 130K or so of the > Xorg.0.log file. Note the traceback is different. I create these most easily > now by playing a video and dragging the window around the screen. commit b1015d4de0cb040dc47ebfbef1114b65269f6f36 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Nov 6 20:41:03 2015 +0000 sna/dri2: Protect against a new possible recursion during TearFree With the introduction of a new SwapEvent that may trigger a copy, we need to add another check for recursion from a TearFree flip event. References: https://bugs.freedesktop.org/show_bug.cgi?id=92751#c14 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Created attachment 119455 [details] Xorg.0.log after git commit b1015d4 Chris, I tried the new code (git b1015d4), and I get: _kgem_submit:4067 assertion 'rq->bo == NULL' failed on start (see attachment for log). I simply re-built and re-installed the 3 libs - anything else I need to do to test the new changes?
I made a mistake in adding that assertion. Darn it.
(In reply to Chris Wilson from comment #17) > I made a mistake in adding that assertion. Darn it. Ok, I'm building a branch with a cherry-pick of your latest commit, skipping commit 1982462 that caused the assertion; I'll let you know if that works...
(In reply to Joe Peterson from comment #18) > (In reply to Chris Wilson from comment #17) > > I made a mistake in adding that assertion. Darn it. > > Ok, I'm building a branch with a cherry-pick of your latest commit, skipping > commit 1982462 that caused the assertion; I'll let you know if that works... Ok, with this (omitting the kgem commit), I get this backtrace (similar to my original reported trace): Backtrace: 0: /usr/lib/xorg/modules/drivers/intel_drv.so (__kgem_retire_requests_upto+0x71) [0x7f3876671f91] 1: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_zpixmap_blt.isra.128+0x1b2) [0x7f38766c03e2] 2: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_image+0x2f8) [0x7f38766c0f18] Perhaps this is expected, since that commit can fix this (?), now commenting our the assert in question but including the commit otherwise...
(In reply to Joe Peterson from comment #19) > (In reply to Joe Peterson from comment #18) > > (In reply to Chris Wilson from comment #17) > > > I made a mistake in adding that assertion. Darn it. > > > > Ok, I'm building a branch with a cherry-pick of your latest commit, skipping > > commit 1982462 that caused the assertion; I'll let you know if that works... > > Ok, with this (omitting the kgem commit), I get this backtrace (similar to > my original reported trace): > > Backtrace: > 0: /usr/lib/xorg/modules/drivers/intel_drv.so > (__kgem_retire_requests_upto+0x71) [0x7f3876671f91] > 1: /usr/lib/xorg/modules/drivers/intel_drv.so > (sna_put_zpixmap_blt.isra.128+0x1b2) [0x7f38766c03e2] > 2: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_put_image+0x2f8) > [0x7f38766c0f18] > > Perhaps this is expected, since that commit can fix this (?), now commenting > our the assert in question but including the commit otherwise... Ok, just comfirming that I still get above backtrace even with all your commits (through b1015d4, including kgem one (1982462)) minus the two asserts that kept X from starting (diff from your b1015d4 is: diff --git a/src/sna/kgem.c b/src/sna/kgem.c index 7acc69e..0e2f830 100644 --- a/src/sna/kgem.c +++ b/src/sna/kgem.c @@ -4064,13 +4064,13 @@ void _kgem_submit(struct kgem *kgem) #endif rq = kgem->next_request; - assert(rq->bo == NULL); + //assert(rq->bo == NULL); rq->bo = kgem_create_batch(kgem); if (rq->bo) { struct drm_i915_gem_execbuffer2 execbuf; - assert(rq->bo->refcnt == 1); + //assert(rq->bo->refcnt == 1); assert(!rq->bo->needs_flush); i = kgem->nexec++; So the new code could have fixed the TearFree issue (I'll let you know if I see that again), but original issue seems to still exist. I'll try to get a gdb backtrace of that and/or a debug=full log...
Created attachment 119485 [details] Xorg.0.log showing original crash I finally captured the original crash in gdb! (Well, it has a few new calls in the stack: __kgem_bo_is_busy, ignore_cpu_damage, try_upload__fast - but these may be due to turning off optimization...?) This is the Xorg.0.log produced, but the next attachment is the gdb full backtrace.
Created attachment 119486 [details] gdb.txt backtrace (full) showing original crash Here's the gdb full backtrace. Code compiled with -O0, etc. (same CFLAGS as used in full debug mode but without that mode on (note that I have not yet been able to capture this crash with --enable-debug=full). I really hope this is useful. And BTW, I have not seen the TearFree crash again since using the newer code, so hopefully that one is in the past, and we now only have this original case (let me know id this looks like the original case even with the added calls in that stack).
Still haven't found the possible list/element corruption, but could you try this simplification: diff --git a/src/sna/kgem.c b/src/sna/kgem.c index b80fde7..0f1c1f4 100644 --- a/src/sna/kgem.c +++ b/src/sna/kgem.c @@ -3275,12 +3275,11 @@ bool __kgem_ring_is_idle(struct kgem *kgem, int ring) bool __kgem_retire_requests_upto(struct kgem *kgem, struct kgem_bo *bo) { - struct kgem_request *rq = bo->rq, *tmp; - struct list *requests = &kgem->requests[RQ_RING(rq) == KGEM_BLT]; + struct kgem_request *rq = RQ(bo->rq), *tmp; + struct list *requests = &kgem->requests[rq->ring]; - DBG(("%s(handle=%d)\n", __FUNCTION__, bo->handle)); + DBG(("%s(handle=%d, ring=%d)\n", __FUNCTION__, bo->handle, rq->ring)); - rq = RQ(rq); assert(rq != &kgem->static_request); if (rq == (struct kgem_request *)kgem) { __kgem_bo_clear_busy(bo);
(In reply to Chris Wilson from comment #23) > Still haven't found the possible list/element corruption, but could you try > this simplification: Did you want me to turn on debugging messages (which config switch will turn on output from this patch?), or just see if it crashes? I'm running now with this patch, no crash yet, but I'll have to give it some use before feeling more comfortable. If there are more debug messages we can add that will help, please let me know - I'll be glad to make the crash happen with more debug output.
Chris, so far, I've not been able to make it crash with your patch in comment #23. Since crashing is a bit random, I can't say for sure if this the patch or just luck, but it is promising. Are there other cases where the code does it the old way, or did that one stick out as unique? If indeed it did fix the issue, it would be good to change it globally. I did not analyze the fix - let me know what the difference is if you have time. I am curious.
It is used as a flag in other places, and not as an index. Everywhere else we want an index, we use the rq->ring so in that respect it was anomalous. Still doesn't bode entirely well as I don't have an explanation for the crash.
(In reply to Chris Wilson from comment #26) > It is used as a flag in other places, and not as an index. Everywhere else > we want an index, we use the rq->ring so in that respect it was anomalous. > Still doesn't bode entirely well as I don't have an explanation for the > crash. OK, thanks. Well, I can back out the patch and put in some more debugging lines - just let me know if you know of appropriate ones (or can I turn on the one that enables "DBG()" calls to get into the log?).
Just another update: still no crashes since the comment #23 patch. Statistics say this likely fixed it. Worth integrating (i.e. any disadvantage to integrating)?
Indeed, already did. I thought using the more common pattern was sensible. Nothing still occurs to me how we die there that wouldn't be picked up by an assertion before and that worries me.
(In reply to Chris Wilson from comment #29) > Indeed, already did. I thought using the more common pattern was sensible. > Nothing still occurs to me how we die there that wouldn't be picked up by an > assertion before and that worries me. I see the commit - OK, cool. I am now running at 0995ad2 (latest HEAD as of this afternoon), so I'll let you know if there are any further crashes. I agree it's disconcerting that the reason the patch helps is uncertain. I can always revert back to before the patch and collect additional debug info if you have ideas of what would help. Let me know.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-intel/issues/73.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.