93848 – [sna] Random X-Server crashes and freezes after update

Summary: [sna] Random X-Server crashes and freezes after update

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium critical
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

Reported:	2016-01-25 13:39 UTC by M. G.
Modified:	2016-04-10 13:41 UTC
CC List:	0 users

Attachments
full Xorg log (21.62 KB, text/plain) 2016-01-25 13:39 UTC, M. G.	no flags	Details
backtrace using gdb no. 1 (10.47 KB, text/plain) 2016-01-25 13:40 UTC, M. G.	no flags	Details
backtrace using gdb no. 2 (23.09 KB, text/plain) 2016-01-25 13:40 UTC, M. G.	no flags	Details
xorg.conf (2.19 KB, text/plain) 2016-01-25 13:41 UTC, M. G.	no flags	Details
backtrace with above patch applied (21.41 KB, text/plain) 2016-01-25 14:49 UTC, M. G.	no flags	Details
backtrace with above patch applied and handle SIGPIPE (9.70 KB, text/plain) 2016-02-03 16:31 UTC, M. G.	no flags	Details
valgrind log (2.68 MB, text/plain) 2016-02-04 14:14 UTC, M. G.	no flags	Details
Xorg log from freeze (33.03 KB, text/plain) 2016-02-14 14:06 UTC, M. G.	no flags	Details
Valgrind log from previous run no. 1 (7.66 KB, text/plain) 2016-02-14 14:08 UTC, M. G.	no flags	Details
Valgrind log from previous run no. 2 (8.61 KB, text/plain) 2016-02-14 14:09 UTC, M. G.	no flags	Details

Description M. G. 2016-01-25 13:39:24 UTC

Created attachment 121260 [details]
full Xorg log

Since my last system update (including an update of xorg-server and libdrm) I have random X-Server crashes and freezes. With xf86-video-intel-2.99.917-r2, which worked fine before the update, I got "Fatal server error: has_coherent_ptr:1532 assertion '!priv->cpu_bo->needs_flush' failed" (see bug #90053), so I switched to xf86-video-intel-9999 (Git).
Now I get errors like this:

[  5166.629] (EE) Backtrace:
[  5166.629] (EE) 0: /usr/bin/X (xorg_backtrace+0x56) [0x58c3c6]
[  5166.629] (EE) 1: /usr/bin/X (0x400000+0x190609) [0x590609]
[  5166.629] (EE) 2: /lib64/libc.so.6 (0x7f45ddf8b000+0x33400) [0x7f45ddfbe400]
[  5166.629] (EE) 3: /lib64/libc.so.6 (0x7f45ddf8b000+0x78e68) [0x7f45de003e68]
[  5166.629] (EE) 4: /lib64/libc.so.6 (__libc_calloc+0xd5) [0x7f45de006675]
[  5166.630] (EE) 5: /usr/bin/X (0x400000+0x33265) [0x433265]
[  5166.630] (EE) 6: /usr/bin/X (0x400000+0x36367) [0x436367]
[  5166.630] (EE) 7: /usr/bin/X (0x400000+0x3a486) [0x43a486]
[  5166.630] (EE) 8: /lib64/libc.so.6 (__libc_start_main+0xf0) [0x7f45ddfab7b0]
[  5166.630] (EE) 9: /usr/bin/X (_start+0x29) [0x424999]
[  5166.630] (EE) 
[  5166.630] (EE) Segmentation fault at address 0x0
[  5166.630] (EE) 
Fatal server error:
[  5166.630] (EE) Caught signal 11 (Segmentation fault). Server aborting

I created two backtraces using gdb, but I don't know if they are helpful. The second one indicates that the Intel driver is involved. If this is an xorg-server issue, please reassign.

I have a dual head setup (using ZaphodHeads) - don't know if it is important.

My configuration:

System: Intel Celeron G1840
OS: Gentoo Linux
Kernel: 4.4.0
xorg-server: 1.17.4
libdrm: 2.4.65

Comment 1 M. G. 2016-01-25 13:40:20 UTC

Created attachment 121261 [details]
backtrace using gdb no. 1

Comment 2 M. G. 2016-01-25 13:40:47 UTC

Created attachment 121262 [details]
backtrace using gdb no. 2

Comment 3 M. G. 2016-01-25 13:41:09 UTC

Created attachment 121263 [details]
xorg.conf

Comment 4 Chris Wilson 2016-01-25 13:49:46 UTC

Well, they confirm the malloc corruption. Let's suppose this *is* ZaphodHeads specific, in which case trying

diff --git a/src/sna/sna_glyphs.c b/src/sna/sna_glyphs.c
index 6ee4033..bd128d6 100644
--- a/src/sna/sna_glyphs.c
+++ b/src/sna/sna_glyphs.c
@@ -2321,6 +2321,8 @@ sna_glyphs__shared(CARD8 op,
        if (RegionNil(dst->pCompositeClip))
                return;
 
+       goto fallback;
+
        if (FALLBACK)
                goto fallback;
 

should help identify the culprit.

Comment 5 M. G. 2016-01-25 14:49:40 UTC

Created attachment 121267 [details]
backtrace with above patch applied

Thanks for your quick reply. I have applied your patch. Unfortunately I was only able to produce one freeze till now while I had gdb running. Does the attached backtrace help? I am waiting for a crash, but it looks like the X-Server is only crashing if you don't want it to happen...

Comment 6 Chris Wilson 2016-01-25 16:12:21 UTC

(In reply to M. G. from comment #5)
> Created attachment 121267 [details]
> backtrace with above patch applied
> 
> Thanks for your quick reply. I have applied your patch. Unfortunately I was
> only able to produce one freeze till now while I had gdb running. Does the
> attached backtrace help? I am waiting for a crash, but it looks like the
> X-Server is only crashing if you don't want it to happen...

SIGPIPE is normal (it means that a client disconnected whilst we still have data in the write buffers of the socket). Use "handle SIGPIPE nostop noprint pass" for gdb to ignore it.
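
To illustrate the SIGPIPE behaviour described above, here is a minimal C sketch (not taken from the X server) of a write to a socket whose peer has already gone away. The function name write_after_peer_close is made up for this example. It assumes a POSIX system and a process that ignores SIGPIPE, in which case the write simply fails with EPIPE. A debugger attached to the process will typically still stop on the signal by default, which is why the gdb "handle" command is useful.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `msg` to a socket after its peer has closed.
 * Returns the errno of the failed write (EPIPE is expected), 0 if the
 * write unexpectedly succeeded, or -1 if the setup itself failed.
 */
int write_after_peer_close(const char *msg)
{
    int fds[2];
    struct sigaction ignore, previous;
    ssize_t written;
    int err;

    assert(msg != NULL && msg[0] != '\0');

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    /* Ignore SIGPIPE so the write fails with EPIPE instead of killing us. */
    memset(&ignore, 0, sizeof(ignore));
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);
    if (sigaction(SIGPIPE, &ignore, &previous) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The "client" end disconnects while the "server" still wants to write. */
    close(fds[1]);
    written = write(fds[0], msg, strlen(msg));
    err = (written < 0) ? errno : 0;

    /* Restore the previous SIGPIPE disposition and clean up. */
    sigaction(SIGPIPE, &previous, NULL);
    close(fds[0]);
    return err;
}
```

Calling write_after_peer_close("reply") should return EPIPE on Linux: the disconnect is reported as an ordinary error rather than a crash.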

Comment 7 M. G. 2016-02-03 16:31:22 UTC

Created attachment 121496 [details]
backtrace with above patch applied and handle SIGPIPE

Thank you and sorry for the delay. It's really strange, I only had two crashes in the past week after applying your patch. On the first crash I had not attached gdb, but I was able to get a backtrace for the second crash. Does the attached backtrace help? If not, what else can I do in order to track down the problem?

Comment 8 Chris Wilson 2016-02-03 17:12:36 UTC

(In reply to M. G. from comment #7)
> but I was able to get a backtrace for the second crash.
> Does the attached backtrace help? If not, what else can I do in order to
> track down the problem?

Thanks! Alas, it does not, but having applied the patch does rule out one possibility. If you are able to, compiling with --enable-debug with valgrind installed and then running X under valgrind would be very, very useful. You will notice a performance hit due to valgrind, possibly a large one, but it will help find the cause. Running X under valgrind often requires bypassing the suid wrappers, and you will need to remember to pass --log-file=/var/log/Xorg.valgrind to valgrind.

Comment 9 M. G. 2016-02-04 14:14:44 UTC

Created attachment 121521 [details]
valgrind log

OK. I haven't had another crash yet (currently the X server crashes very rarely), but I see a lot of invalid reads/writes with valgrind (see attachment). Are those critical errors that might lead to a segfault?

Comment 10 Chris Wilson 2016-02-04 14:33:24 UTC

(In reply to M. G. from comment #9)
> Created attachment 121521 [details]
> valgrind log
> 
> OK. I haven't had another crash yet (currently the X server crashes very
> rarely), but I see a lot of invalid reads/writes with valgrind (see
> attachment). Are those critical errors that might lead to a segfault?

Not yet. They appear to be noise from not compiling in valgrind support to the ddx. You should see "SNA compiled for use with valgrind" in both the Xorg.0.log and the valgrind output. The valgrind support adds markup to the driver allocations and ioctls that valgrind by itself cannot see (and so tries to avoid the false positives).
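
As a rough illustration of what such markup can look like, the sketch below uses the VALGRIND_MAKE_MEM_DEFINED client request from valgrind's memcheck.h. This is not the SNA code: the struct, the function name query_mode, and the values are invented for the example, and a plain assignment stands in for a driver ioctl that memcheck cannot follow. If the valgrind headers are not installed, the macro falls back to a no-op so the file still builds.

```c
#include <assert.h>
#include <stddef.h>

/* Use the real client-request macro when the valgrind headers are present. */
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define HAVE_MEMCHECK 1
#  endif
#endif
#ifndef HAVE_MEMCHECK
/* Fallback: evaluates its arguments and does nothing. */
#  define VALGRIND_MAKE_MEM_DEFINED(addr, len) ((void)(addr), (void)(len), 0)
#endif

/* Hypothetical result structure that a driver ioctl would fill in. */
struct mode_info {
    unsigned int width;
    unsigned int height;
};

/*
 * Hypothetical query: in a real driver the kernel would write *out via an
 * ioctl that memcheck has no wrapper for. The assignments below stand in
 * for that write; the client request then tells memcheck that the bytes
 * are initialised, which avoids false "uninitialised value" reports.
 */
int query_mode(struct mode_info *out)
{
    assert(out != NULL);

    out->width = 1920;
    out->height = 1080;

    (void)VALGRIND_MAKE_MEM_DEFINED(out, sizeof(*out));
    return 0;
}
```

Outside valgrind the client request costs almost nothing, so this kind of annotation can stay compiled in for debug builds.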

Comment 11 M. G. 2016-02-04 16:58:22 UTC

Again, many thanks! Sorry, I wasn't aware of the --enable-valgrind option because the Gentoo ebuild didn't have a valgrind USE flag. I have added it and recompiled the driver. The error messages are now gone and my system is much more responsive. I will report back as soon as a crash happens again.

Comment 12 M. G. 2016-02-14 14:06:58 UTC

Created attachment 121747 [details]
Xorg log from freeze

Today I had another freeze (I think valgrind prevented the X server from crashing; memcheck-amd64-linux has been using 100% CPU for a few hours). I have attached the Xorg log. The valgrind log contains nothing special, just the following messages that I see every time:

==2547== Memcheck, a memory error detector
==2547== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==2547== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==2547== Command: /usr/bin/X.valgrind-testing :0 vt07 -nolisten tcp
==2547== Parent PID: 2543
==2547== 
==2547== Warning: noted but unhandled ioctl 0x4b51 with no size/direction hints.
==2547==    This could cause spurious value errors to appear.
==2547==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
**2547** SNA compiled for use with valgrind
**2547** SNA compiled for use with valgrind
==2547== Warning: noted but unhandled ioctl 0x6458 with no size/direction hints.
==2547==    This could cause spurious value errors to appear.
==2547==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==2547== Warning: noted but unhandled ioctl 0x641e with no size/direction hints.
==2547==    This could cause spurious value errors to appear.
==2547==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==2547== Syscall param writev(vector[...]) points to uninitialised byte(s)
==2547==    at 0x676C6F0: __writev_nocancel (syscall-template.S:81)
==2547==    by 0x59011B: _XSERVTransSocketWritev (Xtranssock.c:2367)
==2547==    by 0x58B24C: FlushClient (io.c:941)
==2547==    by 0x58B97D: WriteToClient (io.c:856)
==2547==    by 0x4F8734: rrGetScreenResources (rrscreen.c:627)
==2547==    by 0x437136: Dispatch (dispatch.c:429)
==2547==    by 0x43B23A: dix_main (main.c:298)
==2547==    by 0x66AC7AF: (below main) (libc-start.c:289)
==2547==  Address 0xdba6df7 is 615 bytes inside a block of size 616 alloc'd
==2547==    at 0x4C29F60: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==2547==    by 0x4F8C8B: rrGetScreenResources (rrscreen.c:551)
==2547==    by 0x437136: Dispatch (dispatch.c:429)
==2547==    by 0x43B23A: dix_main (main.c:298)
==2547==    by 0x66AC7AF: (below main) (libc-start.c:289)
==2547==

I will also attach two valgrind logs from previous runs that contain some "Invalid read" and "Conditional jump or move depends on uninitialised value(s)" messages. Not sure if this helps.

PS: I have built the Intel driver based on commit 8b8c9a36828e90e46ad0755c6861df85f5307fb5.

Comment 13 M. G. 2016-02-14 14:08:42 UTC

Created attachment 121748 [details]
Valgrind log from previous run no. 1

Valgrind log from previous run that might be helpful.

Comment 14 M. G. 2016-02-14 14:09:44 UTC

Created attachment 121749 [details]
Valgrind log from previous run no. 2

Another valgrind log with some different errors.

Comment 15 Chris Wilson 2016-02-18 09:19:17 UTC

The actual valgrind errors are not worrying: one I've fixed already, and the read/write on the trailing byte, which I should fix one day, is just a nuisance. The really interesting freeze, though, occurs not in the display driver but in the xf86-input-mouse driver. I don't suppose you could substitute the -evdev driver for testing?
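
For readers wondering why a report about a single trailing byte is usually harmless, the following C sketch shows one common way such a warning arises and how it can be silenced. It assumes, without having checked the server source, that the uninitialised byte is padding added to round a reply up to a multiple of 4 bytes. The helper names pad_to_4 and build_padded_reply are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Round a length up to the next multiple of 4 bytes. */
size_t pad_to_4(size_t len)
{
    return (len + 3) & ~(size_t)3;
}

/*
 * Hypothetical helper: copy `len` payload bytes into a freshly allocated
 * buffer padded to a 4-byte boundary. Zeroing the pad bytes means every
 * byte later handed to write()/writev() is initialised, so memcheck has
 * nothing to complain about. The caller frees the returned buffer.
 */
unsigned char *build_padded_reply(const void *payload, size_t len,
                                  size_t *padded_len)
{
    size_t total = pad_to_4(len);
    unsigned char *buf;

    assert(padded_len != NULL);
    assert(payload != NULL || len == 0);

    buf = malloc(total ? total : 1);
    if (buf == NULL)
        return NULL;

    if (len > 0)
        memcpy(buf, payload, len);
    /* Without this memset the pad bytes would be left uninitialised. */
    memset(buf + len, 0, total - len);

    *padded_len = total;
    return buf;
}
```

With a 615-byte payload, pad_to_4 yields 616, which matches the block size in the valgrind report above. Leaving the last byte unset leaks at most one byte of stale heap data to the client and cannot by itself corrupt memory.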

Comment 16 M. G. 2016-02-18 20:39:14 UTC

Thanks for your response. I have switched from xf86-input-mouse to xf86-input-evdev (I had to recompile xorg-server with udev enabled). I have also updated xf86-video-intel to latest commit 05320318fb940247d8749da8330215d19f41d84e. Let's see if it still crashes.

Comment 17 M. G. 2016-02-20 21:00:02 UTC

I have also had a freeze with xf86-input-evdev (with error messages similar to those in comment #12). However, both freezes happened when I started Wireshark. I have checked that the freeze is reproducible. If I start X without valgrind, Wireshark works fine, so this seems to be a valgrind issue unrelated to the originally reported problem.

Comment 18 M. G. 2016-04-10 13:32:42 UTC

I haven't had any further crashes since my last comment. I stopped running X under valgrind a few weeks ago and everything still works fine, so I am closing this bug. I assume the problem was fixed by a commit made between the initial bug report and the last driver rebuild.

Comment 19 Chris Wilson 2016-04-10 13:41:56 UTC

I hope so too... Please do reopen this bug if it explodes again.
