Created attachment 122718 [details] segfault backtrace with kwin xorg-server 1.18.2 xf86-video-intel latest git 90792c9 00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09) 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wimbledon XT [Radeon HD 7970M] (rev ff) When starting "DRI_PRIME=1 minetest", single player mode, right when the loading screen seems to be finished, x.org crashes, when * Using dri3 prime offloading AND * compositing is active (tested with kwin_x11 opengl compositor and compton --backend glx) It does NOT crash * with dri3 offloading when no compositor is running * with dri2 offloading (DRI_PRIME=1 LIBGL_DRI3_DISABLE=1 minetest) * starting a compositor after starting the game I'm getting different backtraces, I'll attach the three I have got so far.
Created attachment 122719 [details] another segfault backtrace with kwin, apparently region could not be accesed
Created attachment 122720 [details] SIGABRT backtrace with compton
That hints at a double-free / use-after-free. Could you run Xorg under valgrind? Be sure to build (at least) xf86-video-intel with --enable-debug to reduce the valgrind noise.
Created attachment 122723 [details] valgrind log Hm, I got this in the xorg log. I hope it's not the fault of this archlinux patch: https://projects.archlinux.org/svntogit/packages.git/tree/trunk/0001-present-Only-requeue-for-next-MSC-after-flip-failure.patch?h=packages/xorg-server [ 65226.584] client(800000): Released cmdline(minetest) and cmdargs(NULL). [ 65233.668] sna_page_flip: pageflip failed with err=12 [ 65233.685] sna_page_flip: pageflip failed with err=12 [ 65233.686] (EE) intel(0): page flipping failed, on CRTC:30 (pipe=1), disabling synchronous page flips [ 65233.745] (II) intel(0): switch to mode 1920x1080@60.0 on LVDS1 using pipe 0, position (0, 0), rotation normal, reflection none [ 65233.746] (II) intel(0): switch to mode 1920x1080@60.0 on HDMI1 using pipe 1, position (0, 0), rotation normal, reflection none [ 65233.771] (EE) [ 65233.771] (EE) Backtrace: [ 65233.798] (EE) 0: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_present_queue_vblank+0x7b) [0xc1a4898] [ 65233.805] (EE) 1: /usr/lib/xorg-server/Xorg (present_queue_vblank+0x79) [0x2a8a04] [ 65233.805] (EE) 2: /usr/lib/xorg-server/Xorg (present_execute+0x3af) [0x2a997d] [ 65233.806] (EE) 3: /usr/lib/xorg-server/Xorg (present_re_execute+0x80) [0x2a8b85] [ 65233.807] (EE) 4: /usr/lib/xorg-server/Xorg (present_flip_try_ready+0x60) [0x2a8bfc] [ 65233.808] (EE) 5: /usr/lib/xorg-server/Xorg (present_flip_notify+0x163) [0x2a91f5] [ 65233.809] (EE) 6: /usr/lib/xorg-server/Xorg (present_event_notify+0xfd) [0x2a92f5] [ 65233.818] (EE) 7: /usr/lib/xorg/modules/drivers/intel_drv.so (present_flip_handler+0xe7) [0xc1a4fc4] [ 65233.822] (EE) 8: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_mode_wakeup+0x541) [0xc0d21e4] [ 65233.831] (EE) 9: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_wakeup_handler+0xc2) [0xc0d55c4] [ 65233.832] (EE) 10: /usr/lib/xorg-server/Xorg (WakeupHandler+0xe7) [0x164e86] [ 65233.833] (EE) 11: /usr/lib/xorg-server/Xorg (WaitForSomething+0x58e) [0x363b92] [ 65233.834] (EE) 12: /usr/lib/xorg-server/Xorg (Dispatch+0xa7) [0x154376] [ 65233.835] (EE) 13: /usr/lib/xorg-server/Xorg (dix_main+0x64c) [0x163f9d] [ 65233.836] (EE) 14: /usr/lib/xorg-server/Xorg (main+0x39) [0x143d39] [ 65233.837] (EE) 15: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x69b7710] [ 65233.838] (EE) 16: /usr/lib/xorg-server/Xorg (_start+0x29) [0x143bf9] [ 65233.842] (EE) 17: ? (?+0x29) [0x29] [ 65233.844] (EE) [ 65233.844] sna_present_queue_vblank:356 assertion '(int64_t)(msc - swap->msc) >= 0' failed [ 65233.866] (EE) [ 65233.866] (EE) Backtrace: [ 65233.876] (EE) 0: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_crtc_clear_vblank+0x39) [0xc1a3b60] [ 65233.885] (EE) 1: /usr/lib/xorg/modules/drivers/intel_drv.so (present_flip_handler+0x102) [0xc1a4fdf] [ 65233.889] (EE) 2: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_mode_wakeup+0x541) [0xc0d21e4] [ 65233.892] (EE) 3: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_wakeup_handler+0xc2) [0xc0d55c4] [ 65233.893] (EE) 4: /usr/lib/xorg-server/Xorg (WakeupHandler+0xe7) [0x164e86] [ 65233.894] (EE) 5: /usr/lib/xorg-server/Xorg (WaitForSomething+0x58e) [0x363b92] [ 65233.895] (EE) 6: /usr/lib/xorg-server/Xorg (Dispatch+0xa7) [0x154376] [ 65233.896] (EE) 7: /usr/lib/xorg-server/Xorg (dix_main+0x64c) [0x163f9d] [ 65233.896] (EE) 8: /usr/lib/xorg-server/Xorg (main+0x39) [0x143d39] [ 65233.897] (EE) 9: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x69b7710] [ 65233.898] (EE) 10: /usr/lib/xorg-server/Xorg (_start+0x29) [0x143bf9] [ 65233.900] (EE) 11: ? (?+0x29) [0x29] [ 65233.901] (EE) [ 65233.901] (EE) Fatal server error: [ 65233.902] (EE) sna_crtc_clear_vblank:648 assertion '*sna_crtc_flags(crtc) & CRTC_VBLANK' failed
Hmm, we have two problems here. The crash is from error handling code getting itself into a knot after failing to queue a flip. But we shouldn't fail to queue that flip... It reports ENOMEM, slightly odd. Whilst I go and fix the crash could you grab a massive --enable-debug=full log and possible a drm.debug=7 dmesg to try and find out why the flip fails?
Created attachment 122724 [details] [review] Prevent reporting an incomplete event That should prevent the double-free / use-after-free and general corruption. Any luck on finding the cause of the flipping error?
Created attachment 122725 [details] xorg log with xf86-video-intel debug=full Meanwhile I have updated to xorg server 1.18.3 and thus removed all 3 archlinux patches - and I can still reproduce the crash. It doesn't actually crash directly after the loading screen, the game is displayed shortly before the crash.
Created attachment 122726 [details] xorg log with the patch The previos log was without the patch. The patch indeed resolves the crash. I see you added some debug output, so here is another log.
Created attachment 122727 [details] [review] Prevent reporting an incomplete event malloc, !calloc
Created attachment 122728 [details] xorg log with the second patch Ah, now the debug message actually ends up in the log.
Step 1, commit 65dc4176d84e1d1764a6fa1edc972aa7f1dcd2ba Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Apr 5 17:53:55 2016 +0100 sna/present: Prevent reporting an incomplete event It looks very suspicious that we get a sequence of happy flips (both submit, completion events) then a failure with -ENOMEM after ~800 flips without much of a hint of a leak. Hmm, ah, what appears to be happening is that the event queue is filled up with future vblanks. The first hint of trouble is: [ 67019.025] sna_present_queue: target msc=4008810, seq=4008810 (last_msc=4008683) [ 67019.025] sna_wait_vblank(pipe=0, waiting until seq=4008810) [ 67019.025] sna_present_queue: vblank enqueue failed, faking [ 67019.025] msc_to_delay: sleep 127 frames, 2113 ms [ 67019.025] sna_fake_vblank(event=954, target_msc=4008810, msc=4008683, delay=2113ms) And they continue to submit more and more vblanks into the future, just prior to the crash (when we try to flip onto a full queue) we are up to: [ 67023.142] sna_present_queue_vblank(pipe=0, event=1563, msc=4009173, last swap=4008929) [ 67023.142] sna_present_queue: target msc=4009173, seq=4009173 (last_msc=4008929) [ 67023.142] sna_wait_vblank(pipe=0, waiting until seq=4009173) [ 67023.142] sna_present_queue: vblank enqueue failed, faking [ 67023.142] msc_to_delay: sleep 244 frames, 4057 ms [ 67023.142] sna_fake_vblank(event=1563, target_msc=4009173, msc=4008929, delay=4057ms) It feels like it queuing vblanks twice as fast as vrefresh.
(In reply to Christoph Haag from comment #10) > Created attachment 122728 [details] > xorg log with the second patch > > Ah, now the debug message actually ends up in the log. And alerting me to a bug, I think. commit 84c545b1b76e9c2ca99c0bb2f395931e259c0df8 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Apr 5 18:27:51 2016 +0100 sna/present: Markup hw vbanks queued after a fake vblank
Created attachment 122729 [details] xorg log with failing assertion
That was silly in hindsight. commit f4ce7fee6cb760353ce1fef9272e927461d8a21b Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Apr 5 19:12:29 2016 +0100 sna/present: And drop bogus queued assertion
Created attachment 122730 [details] another assertion is hit
Ok, that's the assert that was removed in f4ce7fee6
(In reply to Chris Wilson from comment #16) > Ok, that's the assert that was removed in f4ce7fee6 "sna_fake_vblank:224 assertion 'info->n_event_id == 1' failed"?
I swear that wasn't there last time!
commit e091ace4d861bbcd67d69f409f0cdf365ee2e685 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Apr 5 20:01:07 2016 +0100 sna/present: Remove stale assert that fake vblanks only have one event Since commit 02f535e8f3659f1147c6f2e698bd5d8730dec19b Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Mar 24 09:39:06 2016 +0000 sna/present: Requeue early vblank completions we may hit the fake vblank timer path with an old hw struct that may have multiple associated events. The assert that we only called the fake vblank from sna_present_queue_vblank is no longer correct. Reported-by: Christoph Haag <haagch@frickel.club> References: https://bugs.freedesktop.org/show_bug.cgi?id=94829#c15 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> commit 4b4324cd059590cbeb37b6d6fd79446171cd04c4 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Apr 5 20:01:07 2016 +0100 sna/present: Update the vblank timestamp after a blocking wait After doing a blocking wait (rather than event driven) for the imminent arrival of the next frame (when it is expected less than 1ms in the future), update the timestamp and frame counter that we then report back to present. Reported-by: Christoph Haag <haagch@frickel.club> References: https://bugs.freedesktop.org/show_bug.cgi?id=94829#c15 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Created attachment 122731 [details] xorg log screen stops updating Finally no crashes. So far so good. Sometimes it works now. And sometimes the whole screen except the mouse pointer stops updating as soon as the game starts. From the sounds of it I can walk around in the game and when pressing escape the mouse pointer gets released from the window, but all windows look frozen.
There is still the problem that the queues are clogged with future vblank events, and it seems our handling of that situation is less than ideal. Not too sure how the frozen display itself manifests yet (some combination of bugs to be solved), but something we can try to mitigate the immediate issue would be: diff --git a/src/sna/sna_present.c b/src/sna/sna_present.c index b880aac..f0a17be 100644 --- a/src/sna/sna_present.c +++ b/src/sna/sna_present.c @@ -248,8 +248,7 @@ static bool sna_fake_vblank(struct sna_present_event *info) return TimerSet(NULL, 0, delay, sna_fake_vblank_handler, info); } -static bool sna_present_queue(struct sna_present_event *info, - uint64_t last_msc) +static bool sna_present_queue(struct sna_present_event *info, uint64_t last_msc) { union drm_wait_vblank vbl; @@ -264,7 +263,8 @@ static bool sna_present_queue(struct sna_present_event *info, vbl.request.type = DRM_VBLANK_ABSOLUTE | DRM_VBLANK_EVENT; vbl.request.sequence = info->target_msc; vbl.request.signal = (uintptr_t)MARK_PRESENT(info); - if (sna_wait_vblank(info->sna, &vbl, sna_crtc_pipe(info->crtc))) { + if (info->target_msc - last_msc > 60 || + sna_wait_vblank(info->sna, &vbl, sna_crtc_pipe(info->crtc))) { DBG(("%s: vblank enqueue failed, faking\n", __FUNCTION__)); if (!sna_fake_vblank(info)) return false; My main concern with such a patch is the 60 frame / 1s heuristic.
Does not fix the freeze. There's also a weird flicker when it happens: https://www.youtube.com/watch?v=2lvhILUShTU
(In reply to Christoph Haag from comment #22) > Does not fix the freeze. > > There's also a weird flicker when it happens: > https://www.youtube.com/watch?v=2lvhILUShTU That looks like what happens if the screen pixmap is flipped but we lose track. The flash is probably we did a CRTC modeset (one of the attempts at error recovery) rather than a flip. Could I look at a debug Xorg.log when it freezes with the patch? First I want to verify that the patch prevents the vblank queues from clogging and then see if we are still hitting a pageflip error. Knowing that should direct the barking to a different tree.
Created attachment 122739 [details] xorg log with latest patch
Well the patch didn't prevent the queue exhaustion. The first issue on flip failure is that we try to restore the previous mode + framebuffer. However, Present is lying and we the frontbuffer it is using no longer matches the Screen, so the framebuffer we restore doesn't match. Then when we fall back to copies rather than flips, Present is copying to its frontbuffer which isn't on the screen any longer. That should recover when the Present mode changes (i.e. when it changes from fullscreen flips).
Created attachment 122742 [details] [review] Restore the root window's pixmap on flip failure Hopefully this matches the Pixmap present is rendering to!
Created attachment 122749 [details] with latest patch still hangs
Hmm, it was effective in its small part. Did it remove the glitch before the freeze?
Created attachment 122752 [details] hang immediately after starting with weird crash after tty switch I think it does remove the flash/flicker. I'm running with the last patch here and the latest commit 9ce7d47a867880b7b43eded7ef8e2d09c5dc21ea and now sometimes X hangs directly after starting with kwin. After a couple of tty switches, my other running X server crashed (?) and when switching back to the hanging one, it crashed too...
Ok, it prevents a glitch after a failure but introduces more strangeness. I can live without that for the moment. Respun the only use HW vblanks for the last frame and ran it through my accuracy tests, no misses so pushed: commit a76560107f7d97b80d47e90b46535a8aba273c32 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Apr 6 09:23:05 2016 +0100 sna/present: Only use the HW vblank for the last frame If that doesn't stop the starvation issue, I have no idea. Looking through the logs, I haven't found a reason for the freeze. After the failed unflip, it looks to be rendering back into the Screen Pixmap and that appears to be on the scanout (i.e. all the updates should be visible).
As far as I can tell from a couple of minutes it does finally fix the freeze. Thanks.
Created attachment 122780 [details] gdb backtrace Is this a possibly related problem? When trying to start the talos principle with Vulkan on my ivy bridge GPU here the game fails and logs this error: ERR: Cannot set requested display mode 1920x1080: GfxAPI error: ERR: Vulkan error: Cannot obtain 'vkCmdCopyQueryPoolResults()' device function and everything was usually fine. But now (?) sometimes X crashes. The annoying thing is that it seems to crash less and less the more debug info is in xf86-video-intel and yet even less when X is run under gdb. After trying to start the talos principle many times I did manage to reproduce it with full debug, so the xorg log is unfortunately a bit larger: http://haagch.frickel.club/files/Xorg.1.log.talosvk.xz Xorg 1.18.3, latest xf86-video-intel git 8d3c8a6c0d48883ec399f8dd36df0eca200722f9
(In reply to Christoph Haag from comment #32) > Created attachment 122780 [details] > gdb backtrace > > But now (?) sometimes X crashes. The annoying thing is that it seems to > crash less and less the more debug info is in xf86-video-intel and yet even > less when X is run under gdb. After trying to start the talos principle many > times I did manage to reproduce it with full debug, so the xorg log is > unfortunately a bit larger: > http://haagch.frickel.club/files/Xorg.1.log.talosvk.xz Gotcha. It is a kind-off related problem, as it was part of the fallout from the 1.18.2 fixes. commit 74b755fe0a786d330d121c435d274e2418003581 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Apr 6 22:50:28 2016 +0100 sna/present: Clear flags on the vblank event's CRTC early We store a flag on the vblank's CRTC to indicate whether we have marked the target CRTC as having an immediately pending vblank. We should clear this set of flags early so that we don't have to worry about the flag whilst processing the vblank, and so that we don't get confused if we have to requeue the vblank. Reported-by: Christoph Haag <haagch@frickel.club> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94829#c32 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > Is this a possibly related problem? > > When trying to start the talos principle with Vulkan on my ivy bridge GPU > here the game fails and logs this error: > ERR: Cannot set requested display mode 1920x1080: GfxAPI error: > ERR: Vulkan error: Cannot obtain 'vkCmdCopyQueryPoolResults()' device > function > and everything was usually fine. > Not sure.
So far I have seen no issues anymore.
Thanks for following up. You probably want to keep an eye out for where those future vblanks are coming from - something is unhappy, or maybe just oblivious? I expect there is some animation somewhere (probably the systray ;) that is running very slowly.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.