Bug 91467

Summary: SIGSEGV "get_fb: failed to add fb" / sna_block_handler
Product: xorg
Reporter: Andreas Reis <andreas.reis>
Component: Driver/intel
Assignee: Chris Wilson <chris>
Status: RESOLVED WORKSFORME
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium    
Version: git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                                                    Flags
Xorg.0.log including compton+mpv config                        none
dmesg                                                          none
Xorg.0.log with enable-debug=full                              none
Xorg.0.log with sna: Add a small pixmap sanity check           none
Xorg.0.log of crash at "sna: Only check non-NULL Pixmaps"      none
Xorg.0.log: "failed to set mode: No space left on device"      none

Description Andreas Reis 2015-07-26 13:58:50 UTC
Created attachment 117383 [details]
Xorg.0.log including compton+mpv config

Seemingly random crash when resizing a h.264 (8-bit) mpv video. Can't seem to reproduce it:

[   595.155] (EE) intel(0): get_fb: failed to add fb: 1920x1080 depth=24, bpp=32, pitch=7680: 22
[   595.155] (II) intel(0): switch to mode 1920x1080@60.0 on HDMI2 using pipe 0, position (0, 0), rotation normal, reflection none
[   595.155] (EE) intel(0): get_fb: failed to add fb: 1920x1080 depth=24, bpp=32, pitch=7680: 22
[   596.930] (EE) 
[   596.930] (EE) Backtrace:
[   596.934] (EE) 0: /usr/bin/xorg-server/Xorg (OsSigHandler+0x29) [0x5f1619]
[   596.934] (EE) 1: /usr/lib/libc.so.6 (killpg+0x40) [0x7fe4711bc5ef]
[   596.935] (EE) 2: /usr/lib/xorg/modules/drivers/intel_drv.so (sna_block_handler+0x72) [0x7fe46b3c6e22]
[   596.935] (EE) 3: /usr/bin/xorg-server/Xorg (BlockHandler+0x4a) [0x4401fa]
[   596.935] (EE) 4: /usr/bin/xorg-server/Xorg (WaitForSomething+0x265) [0x5e7bd5]
[   596.935] (EE) 5: /usr/bin/xorg-server/Xorg (Dispatch+0x8e) [0x43984e]
[   596.935] (EE) 6: /usr/bin/xorg-server/Xorg (dix_main+0x3d4) [0x43f784]
[   596.935] (EE) 7: /usr/lib/libc.so.6 (__libc_start_main+0xf0) [0x7fe4711a9790]
[   596.935] (EE) 8: /usr/bin/xorg-server/Xorg (_start+0x29) [0x4246e9]

xserver / video-intel / mesa / compton / drm-intel-nightly (ie, 4.2-rc3) / mpv
(latest git)

Haswell 4770, one 1080p monitor on HDMI2
Comment 1 Chris Wilson 2015-07-26 14:19:46 UTC
If a compositor is using DRI3, it is responsible for TearFree (i.e. if something does PresentPixmap and present flips to it, we can no longer guarantee that all rendering is tear free). So does the tearing go away if you disable the non-default DRI3?
Comment 2 Chris Wilson 2015-07-26 14:28:55 UTC
Can you addr2line -e /usr/lib/xorg/modules/drivers/intel_drv.so -i 0x7fe46b3c6e22 ? I think it implies that DamageRegion(sna->mode.shadow_damage) is itself NULL. That address locally is RegionNotEmpty(), which we rely on in many, many places, so it is unlikely to be the culprit and more likely a victim. Still worth ./configure --enable-debug to catch such errors earlier.
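
For anyone following along, here is a rough Python sketch of how the backtrace frame above can be taken apart. It assumes the frame layout seen in this Xorg.0.log; `parse_frame` and `module_relative` are hypothetical helper names, and the load base passed to `module_relative` is something you would have to read yourself (for example from /proc/<pid>/maps of the running server). The bracketed address is a runtime address in the server's address space, so addr2line on a shared object will most likely need the module-relative value, and it can still print ??:0 if the driver was built without debug info.

```python
import re

# One frame of an Xorg backtrace, e.g.
# "(EE) 2: /path/intel_drv.so (sna_block_handler+0x72) [0x7fe46b3c6e22]"
FRAME_RE = re.compile(
    r"\(EE\) (?P<index>\d+): (?P<module>\S+) "
    r"\((?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)\) "
    r"\[0x(?P<address>[0-9a-f]+)\]"
)


def parse_frame(line):
    """Return the frame fields as a dict, or None if the line is not a frame."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "index": int(m.group("index")),
        "module": m.group("module"),
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "address": int(m.group("address"), 16),
    }


def module_relative(address, load_base):
    """Translate a runtime address into an offset inside the loaded module.

    load_base is an assumption: it is not in the log and has to be looked up.
    """
    return address - load_base


line = ("[   596.935] (EE) 2: /usr/lib/xorg/modules/drivers/intel_drv.so "
        "(sna_block_handler+0x72) [0x7fe46b3c6e22]")
frame = parse_frame(line)

# Runtime start address of the symbol: frame address minus the "+0x72" offset.
symbol_start = frame["address"] - frame["offset"]
print(frame["symbol"], hex(symbol_start))
```

The symbol name plus offset (sna_block_handler+0x72) stays meaningful across restarts, whereas the absolute address changes with every load.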
Comment 3 Chris Wilson 2015-07-26 14:51:14 UTC
errno=22 is EINVAL, so the kernel thought the handle was unsuitable for a framebuffer.

get_fb: failed to add fb: 1920x1080 depth=24, bpp=32, pitch=7680: 22

Looks superficially fine. Y-tiling should have been filtered out before this point, and you would have to use drm.debug=7 to find out why the kernel rejected it.
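
As a quick sanity check of the numbers in that message, the following Python sketch decodes the trailing errno and verifies that the pitch is at least width * bpp / 8. `check_fb_line` is a hypothetical helper, and the regular expression only assumes the message layout quoted above. It does not say anything about why the kernel rejected the handle.

```python
import errno
import os
import re

# Layout assumed from the log line quoted in this report.
FB_RE = re.compile(
    r"failed to add fb: (?P<width>\d+)x(?P<height>\d+) "
    r"depth=(?P<depth>\d+), bpp=(?P<bpp>\d+), pitch=(?P<pitch>\d+): (?P<err>\d+)"
)


def check_fb_line(line):
    """Decode the errno and check the pitch of a 'failed to add fb' message."""
    m = FB_RE.search(line)
    if m is None:
        return None
    f = {name: int(value) for name, value in m.groupdict().items()}
    # Smallest pitch (bytes per scanline) that can hold one row of pixels.
    min_pitch = f["width"] * f["bpp"] // 8
    return {
        "errno_name": errno.errorcode.get(f["err"], "unknown"),
        "errno_text": os.strerror(f["err"]),
        "min_pitch": min_pitch,
        "pitch_ok": f["pitch"] >= min_pitch,
    }


line = ("(EE) intel(0): get_fb: failed to add fb: "
        "1920x1080 depth=24, bpp=32, pitch=7680: 22")
result = check_fb_line(line)
print(result)
```

For the line above, the minimum pitch is 1920 * 32 / 8 = 7680 bytes, which matches the reported pitch exactly, so the geometry is consistent with "looks superficially fine".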

The really odd part is that the error messages seem to be from the sna_present_unflip() path - that is the only path that I think can trigger that pattern of error messages - and it should be nigh impossible to fail there given the ScreenPixmap was previously attached.
Comment 4 Andreas Reis 2015-07-26 14:54:30 UTC
Ah, wasn't aware of that DRI3 responsibility.

Forcing DRI2 (ie. Option "DRI" "2"), disabling compton and rebooting (since a few weeks I lose all input devices when I merely restart X) indeed also removes the tearing (with TearFree set true, ofc). At least it does so on vsynctester.com.

Alas, that addr2line only yields ??:0. I'll recompile with debug and see if I can provoke anything.

(Also, I forgot: The tearing with compton only appears when I enable its glx-use-copysubbuffermesa, which is explicitly marked as potential vsync breaker in its manpage.)

I probably should further mention that there are three issues I currently have with the kernel driver:
https://bugs.freedesktop.org/show_bug.cgi?id=91452
https://bugs.freedesktop.org/show_bug.cgi?id=91429
https://bugs.freedesktop.org/show_bug.cgi?id=91428
Comment 5 Chris Wilson 2015-07-26 15:04:52 UTC
(In reply to Andreas Reis from comment #4) 
> (Also, I forgot: The tearing with compton only appears when I enable its
> glx-use-copysubbuffermesa, which is explicitly marked as potential vsync
> breaker in its manpage.)

Indeed, it is not. DRI3 doesn't even try. DRI2 managed to be broken (no one specified whether it should or should not be synchronized to the vertical refresh and then X implemented it using the same routine for fast tearing flips), and even then on your hardware it requires an X server running with root privileges to be allowed to reconfigure the registers on the fly to enable vsync.
Comment 6 Andreas Reis 2015-07-26 15:08:51 UTC
Well that was fast. Restarted, opened the video, moved it to another xmonad workspace, used xmonad to resize it with the mouse, crash.

Doesn't seem that enable-debug added anything, though. The backtrace is exactly the same (apart from the addresses), but this time the preceding three lines concerning the fb are not present.
Comment 7 Chris Wilson 2015-07-26 15:11:36 UTC
Step up to --enable-debug=full and send me the Xorg.0.log.xz then!
Comment 8 Andreas Reis 2015-07-26 15:15:04 UTC
Created attachment 117385 [details]
dmesg

Oh sorry, I was only looking at Xorg.0.log. addr2line with the new address again returns only ??:0, but attached is the relevant part from dmesg/journalctl -b -1.
Comment 9 Andreas Reis 2015-07-26 15:29:08 UTC
Created attachment 117386 [details]
Xorg.0.log with enable-debug=full

Alright, I can reproduce it with the steps mentioned. Doesn't seem like it always happens, but setting mpv to fullscreen and wiggling the mouse around is pretty reliable.

Attached is the log from enable-debug=full (which results in a 4MB smaller package than with enable-debug…).
Comment 10 Chris Wilson 2015-07-26 15:48:45 UTC
Hmm, that implies a use-after-free -- which goes some way to explain the other failures.

Do you mind trying to capture a new debug=full log with 

commit c7d0acf78521d90cfbf087bff108d7c3807a79d2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jul 26 16:46:03 2015 +0100

    sna: Add a DBG trace to reusing pixmap headers

I need to start finding who grabbed a reference to the pixmap after it was freed.
Comment 11 Chris Wilson 2015-07-26 15:54:21 UTC
Also what patches does your xserver carry?
Comment 12 Andreas Reis 2015-07-26 16:16:14 UTC
I've been trying for a while now with that new DBG trace, but for some reason it doesn't crash. Worst I got was an apparent xserver semi-freeze, meaning the display turned black (with two 1px lines from my wallpaper, one horizontally at the bottom, the other at the right) and stayed so (with the mouse still moving) even after killing mpv via the tty.

As for patches, only the recent "os: make sure the clientsWritable fd_set is initialized" without which I get crashes. The server is at git minus the most recent commit "prime: add rotation support for offloaded outputs (v2)", which doesn't compile for me.
Comment 13 Andreas Reis 2015-07-26 16:38:28 UTC
Well, here's at least the Xorg.0.log from the most recent semi-freeze. It's on mediafire, as it's 309M, compressed to 11M:
https://www.mediafire.com/?z4am389fwz667y5

The added DBG is always "__pop_freed_pixmap: reusing freed pixmap=<inc number> header".
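
With a log of that size, a small filter helps to pull out just these DBG lines. The following Python sketch collects the pixmap numbers from lines of the form quoted above. It assumes the number is printed in decimal; `reused_pixmap_serials` is a hypothetical helper, and the sample lines are made up for illustration rather than taken from the attached log.

```python
import re

# Message format as quoted in this comment; a decimal number is an assumption.
REUSE_RE = re.compile(r"__pop_freed_pixmap: reusing freed pixmap=(\d+) header")


def reused_pixmap_serials(lines):
    """Return the pixmap numbers from every 'reusing freed pixmap' DBG line."""
    serials = []
    for line in lines:
        m = REUSE_RE.search(line)
        if m:
            serials.append(int(m.group(1)))
    return serials


# Hypothetical sample lines, not copied from the real Xorg.0.log.
sample = [
    "[   100.001] __pop_freed_pixmap: reusing freed pixmap=41 header",
    "[   100.002] (II) intel(0): some unrelated message",
    "[   100.003] __pop_freed_pixmap: reusing freed pixmap=42 header",
]
print(reused_pixmap_serials(sample))
```

Because the function takes any iterable of lines, it can be fed an open file object directly, so the whole log never has to be held in memory.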
Comment 14 Andreas Reis 2015-07-26 17:10:21 UTC
These semi-freezes really appear to have replaced the crash. The xserver's display just freezes at whatever was shown last, whereas programs like mpv continue to run.

Here's a log with your newer DBG commit:
https://www.mediafire.com/?z4am389fwz667y5

I've also noticed that when compton runs, after resizing I'll frequently get something like this:
http://www.mediafire.com/view/6enxpkwnwlv16mx

The right and bottom bars are copies from the video image before resizing and continue to blink rapidly either for a few seconds or until I move the window again.
Comment 15 Andreas Reis 2015-07-26 17:15:59 UTC
(The hash of new Xorg.0.log's link is the same as before's since I didn't notice that mediafire was set to replace files of the same name.)
Comment 16 Andreas Reis 2015-07-26 17:42:07 UTC
Created attachment 117387 [details]
Xorg.0.log with sna: Add a small pixmap sanity check

With sna: Add a small pixmap sanity check the server won't even start as it instantly segfaults.
Comment 17 Chris Wilson 2015-07-26 18:11:01 UTC
(In reply to Andreas Reis from comment #16)
> Created attachment 117387 [details]
> Xorg.0.log with sna: Add a small pixmap sanity check
> 
> With sna: Add a small pixmap sanity check the server won't even start as it
> instantly segfaults.

Oops,

commit d11dc75fb5a95ba410fedd86d9e1dd50260af979
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jul 26 19:07:45 2015 +0100

    sna: Only check non-NULL Pixmaps
    
    check_pixmap() can be called very early in the Window setup proceeding,
    before a pixmap is even assigned to a Window. There we expect the Window
    to be NULL, so be more careful in our check_pixmap.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=91467#c16
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


I think I understand the freeze,
commit e5f8f90f686879950766babbe805cd9d2412aca3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jul 26 19:03:46 2015 +0100

    sna: Stall for outstanding TearFree flips when taking over with Present
    
    When juggling Present and TearFree, we have to hide the extra flips from
    Present as it cannot account for them. Preferrably we want to schedule
    the Present flip following completion of the TearFree flip, but for the
    moment simply block and wait until TearFree completes before starting
    Present.
    
    Reported-by: Andreas Reis <andreas.reis@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=91467
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

should prevent the freeze at the cost of a small stall every time Present stops and restarts flipping.

However, I'm not convinced that is related to the earlier crash, for which, as you can tell, I've added some more debugging to hopefully catch it in action.
Comment 18 Andreas Reis 2015-07-26 18:26:05 UTC
Created attachment 117388 [details]
Xorg.0.log of crash at "sna: Only check non-NULL Pixmaps"

Yeah, I'm back to crashing again. Hooray…

Btw, can one set this bugzilla to auto-detect an attachment's mime-type? Unlike the kernel.org one this one defaults to "select from list: plain text" for me, and it's royally annoying.
Comment 19 Andreas Reis 2015-07-26 23:09:59 UTC
Created attachment 117389 [details]
Xorg.0.log: "failed to set mode: No space left on device"

Another crash, again caused by resizing an mpv video, which I just got at "Double check for Present takeover before TearFree flips". The driver was compiled without debug, though.

Might also be interesting because this time it's due to "failed to set mode: No space left on device", and the log also contains the "EQ overflow continuing" reports from a few hours ago.

I won't be able to reply further until Wednesday.
Comment 20 Chris Wilson 2015-07-27 08:06:29 UTC
Such a crash looks fairly impossible (it has to be crtc->slave_damage which can only be NULL in your setup, yet apparently has a non-NULL value here). The preceding errors are from a catastrophic GPU hang.
Comment 21 Andreas Reis 2015-07-28 10:52:47 UTC
Regarding "impossible", what can I say – I got what I got. The catastrophic hangs are no surprise, I frequently get them (it's one of my three other bug reports linked above) when forcing Chromium to use hardware acceleration via ignore-gpu-blacklist in chrome://flags, as one can compare with chrome://gpu. (Videos and opening bookmark menus seem their most common cause.) I did not notice them reported in the Xorg log before, however.

I also managed to get another freeze yesterday. For that matter, compton ran always except briefly for the DRI2 vsync test above.
Comment 22 Andreas Reis 2015-10-22 08:57:29 UTC
Haven't gotten a crash for months now, so I'm closing as WFM.

---

I know compton is unrelated, but if it's of interest: Its option that causes the corruption is glx-swap-method, eg.

backend = "glx";
glx-swap-method = "1";

man: "GLX buffer swap method we assume. Could be undefined (0), copy (1), exchange (2), 3-6, or buffer-age (-1).  undefined is the slowest and the safest, and the default value. […] buffer-age means auto-detect using GLX_EXT_buffer_age"

1-6 causes full screen corruption on content changes, with 1 being worst by far. buffer-age mostly works, but sometimes causes individual window corruption on resize.
