Bug 99887

Summary: Random segfault in Xorg 1.19.x
Product: xorg Reporter: Olivier Fourdan <fourdan>
Component: Input/wacomAssignee: Xorg Project Team <xorg-team>
Status: RESOLVED FIXED QA Contact: Xorg Project Team <xorg-team>
Severity: normal    
Priority: medium CC: hewanci, peter.hutterer, sarnex
Version: git   
Hardware: Other   
OS: All   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1401627
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Reproducer helper
none
valgrind log
none
(gdb) t a a bt f
none
valgrind log
none
anothe rvalgrind logs none

Description Olivier Fourdan 2017-02-21 14:40:04 UTC
Created attachment 129794 [details]
Reproducer helper

Description:

Xorg 1.19.x crashes in intel driver with an abort in some random cases.

Steps to reproduce:

1. Build and run the attached test program 
2. Hover the stylus of a wacom tablet to cause the map/unmaps of the window
   (the test app hides and show a window after n X events)

Actual result:

After some time, Xorg crashes with a segfault:

(EE) Segmentation fault at address 0x80

#0  0x00000000071b491f in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00000000071b651a in __GI_abort () at abort.c:89
#2  0x00000000005a15ce in OsAbort () at utils.c:1355
#3  0x00000000005a7203 in AbortServer () at log.c:877
#4  0x00000000005a7fed in FatalError (f=f@entry=0x5d56f0 "Caught signal %d (%s). Server aborting\n") at log.c:1015
#5  0x000000000059e86e in OsSigHandler (signo=11, sip=<optimized out>, unused=<optimized out>) at osinit.c:154
#6  0x0000000006f725c0 in <signal handler called> () at /lib64/libpthread.so.0
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x80: 
#7  0x000000000ab8bade in __kgem_busy (handle=#8  0x000000000ab8bade in kgem_retire__requests_ring (ring=<optimized out>, kgem=<optimized out>) at kgem.c:3226
#9  0x000000000ab8bade in kgem_retire__requests (kgem=0x9841000) at kgem.c:3260
#10 0x000000000ab8bade in kgem_retire (kgem=0x9841000) at kgem.c:3276
#11 0x000000000abc2b45 in sna_accel_block (sna=0x9841000, tv=tv@entry=0xfff0007a8) at sna_accel.c:18312
#12 0x000000000abe10ad in sna_block_handler (data=<optimized out>, _timeout=0xfff000814) at sna_driver.c:761
#13 0x000000000043b9de in BlockHandler (pTimeout=pTimeout@entry=0xfff000814) at dixutils.c:388
#14 0x0000000000598361 in WaitForSomething (are_ready=<optimized out>) at WaitFor.c:219
#15 0x0000000000436dca in Dispatch () at dispatch.c:422
#16 0x000000000043b018 in dix_main (argc=15, argv=0xfff0009f8, envp=<optimized out>) at main.c:287
#17 0x000000000719f401 in __libc_start_main (main=
    0x424cc0 <main>, argc=15, argv=0xfff0009f8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0xfff0009e8) at ../csu/libc-start.c:289
#18 0x0000000000424cfa in _start ()
Comment 1 Olivier Fourdan 2017-02-21 14:47:33 UTC
Created attachment 129795 [details]
valgrind log
Comment 2 Olivier Fourdan 2017-02-21 14:49:32 UTC
Created attachment 129796 [details]
(gdb) t a a bt f
Comment 3 Olivier Fourdan 2017-02-21 14:52:57 UTC
I wonder if this could be another (missing?) input_lock() issue similar to bug 99358 but here I don't see much pointing toward the cursor or input code neither in the valgrind logs nor the backtrace.
Comment 4 Chris Wilson 2017-02-21 15:10:02 UTC
Hmm, the valgrind log should be less noisy if the ddx is compiled with --enable-debug. It should mention successfully detecting vg in the compile and when using the new intel_drv.so under valgrind.

As I understand the test, it creates a new client, changes the pointer, waits for the cursor to move, then exits and repeats. To clarify, the XDefinteCursor() step is critical for reproducing? The forking should be irrelevant, the client id should be recycled so that too should not be an issue. Suggests that perhaps the test is just to rapidly change the cursor whilst processing pointer events.

Or it is nothing to do with the pointer, and is the window map/unmapping. Neither are particularly rare paths :| Didn't spot anything in the vg log to suggest otherwise, my first suspicion is the noise is masking a real out-of-bounds write.
Comment 5 Olivier Fourdan 2017-02-21 15:21:27 UTC
It's not a reproducer I forged specifically for this issue, just an "old" reproducer I had for another bug of mine, but the issue is happens in real life and occurs even outside of valgrind (as found in the downstream bug I linked).

I'll rebuild the intel driver with debug enabled, see what it gives.
Comment 6 Olivier Fourdan 2017-02-21 15:51:39 UTC
Created attachment 129797 [details]
valgrind log

Right, I suspect the intel driver is not the culprit but probably a victim of another memory corruption that occurs earlier.

I just had a crash with a complete different backtrace that has nothing to do with the intel driver:

#0  0x00000000071b491f in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00000000071b651a in __GI_abort () at abort.c:89
#2  0x00000000005a15ce in OsAbort () at utils.c:1355
#3  0x00000000005a7203 in AbortServer () at log.c:877
#4  0x00000000005a7fed in FatalError (f=f@entry=0x5d56f0 "Caught signal %d (%s). Server aborting\n") at log.c:1015
#5  0x000000000059e86e in OsSigHandler (signo=11, sip=<optimized out>, unused=<optimized out>) at osinit.c:154
#6  0x0000000006f725c0 in <signal handler called> () at /lib64/libpthread.so.0
#7  0x000000000059e1ec in WriteToClient (who=who@entry=0x972d060, count=count@entry=32, __buf=__buf@entry=0xfff000810) at io.c:705
#8  0x0000000000445749 in ProcGetInputFocus (client=0x972d060) at events.c:4861
#9  0x0000000000437095 in Dispatch () at dispatch.c:479
#10 0x000000000043b018 in dix_main (argc=15, argv=0xfff0009f8, envp=<optimized out>) at main.c:287
#11 0x000000000719f401 in __libc_start_main (main=
    0x424cc0 <main>, argc=15, argv=0xfff0009f8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0xfff0009e8)
    at ../csu/libc-start.c:289
Comment 7 Olivier Fourdan 2017-02-21 15:56:25 UTC
Moving to xorg server
Comment 8 Michel Dänzer 2017-02-22 01:39:37 UTC
Looks like bug 99164.
Comment 9 Olivier Fourdan 2017-02-22 09:44:05 UTC
Created attachment 129820 [details]
anothe rvalgrind logs

Interesting point here is the crash occurs in the InputThread.

The corresponding backtrace from core file is even more instructing:

#0  0x0000000006f8b91f in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x0000000006f8d51a in __GI_abort () at abort.c:89
#2  0x00000000005a156e in OsAbort () at utils.c:1355
#3  0x00000000005a7163 in AbortServer () at log.c:877
#4  0x00000000005a7f4d in FatalError (f=f@entry=0xaaf8ecf "%s:%d assertion '%s' failed\n") at log.c:1015
#5  0x000000000a974e36 in _kgem_submit (kgem=kgem@entry=0x9617000) at kgem.c:4134
#6  0x000000000a9b9f64 in kgem_submit (kgem=0x9617000) at kgem.h:382
#7  0x000000000a9b9f64 in sna_accel_flush (sna=sna@entry=0x850800 <FlushCallback>) at sna_accel.c:17375
#8  0x000000000a9b9fec in sna_flush_callback (list=<optimized out>, user_data=0x850800 <FlushCallback>, call_data=<optimized out>) at sna_accel.c:17399
#9  0x000000000043c3e4 in _CallCallbacks (pcbl=0x36dc6b30, pcbl@entry=0x850800 <FlushCallback>, call_data=0x20, call_data@entry=0xb0bc150) at dixutils.c:737
#10 0x000000000059dbd3 in CallCallbacks (call_data=0xb0bc150, pcbl=0x850800 <FlushCallback>) at ../include/callback.h:83
#11 0x000000000059dbd3 in FlushClient (who=who@entry=0xb0bc150, oc=oc@entry=0x9552800, __extraBuf=__extraBuf@entry=0x11027a60, extraCount=extraCount@entry=32) at io.c:809
#12 0x000000000059e13f in WriteToClient (who=who@entry=0xb0bc150, count=count@entry=32, __buf=__buf@entry=0x11027a60) at io.c:763
#13 0x0000000000442572 in WriteEventsToClient (pClient=pClient@entry=0xb0bc150, count=<optimized out>, count@entry=1, events=events@entry=0x11027a60) at events.c:6000
#14 0x0000000000442710 in TryClientEvents (client=0xb0bc150, dev=<optimized out>, pEvents=0x11027a60, count=1, mask=<optimized out>, filter=<optimized out>, grab=0x0)
    at events.c:2021
#15 0x0000000000445e7a in DeliverEventToInputClients (dev=dev@entry=0xfc85870, inputclients=0xd5726f0, win=win@entry=0xb2fed40, events=events@entry=0x11027a60, count=count@entry=1, filter=filter@entry=16, grab=0x0, client_return=0x11027998, mask_return=0x11027994) at events.c:2170
#16 0x0000000000446167 in DeliverEventToWindowMask (mask_return=0x11027994, client_return=0x11027998, grab=0x0, filter=16, count=1, events=0x11027a60, win=0xb2fed40, dev=0xfc85870) at events.c:2213
#17 0x0000000000446167 in DeliverEventsToWindow (pDev=pDev@entry=0xfc85870, pWin=pWin@entry=0xb2fed40, pEvents=pEvents@entry=0x11027a60, count=count@entry=1, filter=filter@entry=16, grab=grab@entry=0x0) at events.c:2277
#18 0x00000000005283b9 in SendEventToAllWindows (dev=dev@entry=0xfc85870, mask=16, ev=ev@entry=0x11027a60, count=count@entry=1) at exevents.c:3045
#19 0x0000000000533d1f in send_property_event (dev=dev@entry=0xfc85870, property=<optimized out>, what=<optimized out>) at xiproperty.c:208
#20 0x00000000005346d8 in XIChangeDeviceProperty (dev=0xfc85870, property=<optimized out>, type=<optimized out>, format=<optimized out>, mode=<optimized out>, len=<optimized out>, value=0x11027b50, sendevent=1) at xiproperty.c:802
#21 0x000000000fe1d039 in wcmUpdateSerial () at /usr/lib64/xorg/modules/input/wacom_drv.so
#22 0x000000000fe131c2 in wcmSendEvents () at /usr/lib64/xorg/modules/input/wacom_drv.so
#23 0x000000000fe14764 in wcmEvent () at /usr/lib64/xorg/modules/input/wacom_drv.so
#24 0x000000000fe1a476 in usbParse () at /usr/lib64/xorg/modules/input/wacom_drv.so
#25 0x000000000fe11feb in wcmReadPacket () at /usr/lib64/xorg/modules/input/wacom_drv.so
#26 0x000000000fe12226 in wcmDevReadInput () at /usr/lib64/xorg/modules/input/wacom_drv.so
#27 0x000000000059cc8c in InputReady (fd=32, xevents=1, data=0xfc94e90) at inputthread.c:173
#28 0x000000000059f271 in ospoll_wait (ospoll=0xd1172f0, timeout=timeout@entry=-1) at ospoll.c:412
#29 0x000000000059cae6 in InputThreadDoWork (arg=<optimized out>) at inputthread.c:360
#30 0x0000000006d3f6ca in start_thread (arg=0x11029700) at pthread_create.c:333
#31 0x000000000705df7f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
Comment 10 Peter Hutterer 2017-02-22 23:03:39 UTC
ok, that's a bug in the wacom driver. The driver should not write events from within the input thread. I'll get that fixed
Comment 11 Peter Hutterer 2017-02-23 00:52:53 UTC
fix for the wacom driver: https://sourceforge.net/p/linuxwacom/mailman/message/35683033/

patch to help identify where we post stuff from within the input thread: https://patchwork.freedesktop.org/patch/140413/
Comment 12 Peter Hutterer 2017-03-01 04:37:56 UTC
commit f65d97362813ea89d18e5049ad87bc468ec30f85
Author: Peter Hutterer <peter.hutterer@who-t.net>
Date:   Thu Feb 23 09:47:27 2017 +1000

    Revert to Update properties with a timer func


and in the server:
commit 1b12249fd6d4d3b566043d556503f3f625b9b4ba
Author: Peter Hutterer <peter.hutterer@who-t.net>
Date:   Thu Feb 23 10:21:49 2017 +1000

    os: log a bug whenever WriteToClient is called from the input thread
Comment 13 Nick Sarnie 2017-03-02 02:39:43 UTC
*** Bug 99164 has been marked as a duplicate of this bug. ***
Comment 14 Olivier Fourdan 2017-03-02 12:54:18 UTC
*** Bug 100033 has been marked as a duplicate of this bug. ***
Comment 15 Storm Engineer 2017-03-06 13:06:58 UTC
Since this patch using a Wacom tablet will causes constant freezes. Everything freezes for about a second every 1-5 seconds while the stylus is moving over the tablet.

The tablet is now completely unusable!
Comment 16 Storm Engineer 2017-03-06 13:39:19 UTC
Addition: Removing xf86-input-wacom stops the freezing, but it also removes the ability to have custom tablet and stylus configuration, which can be essential.

Is this something the patch submitted to linuxwacom is supposed to fix? As if not, then this ticket needs to be reopened.
Comment 17 Olivier Fourdan 2017-03-06 14:29:47 UTC
(In reply to Storm Engineer from comment #15)
> Since this patch using a Wacom tablet will causes constant freezes.
> Everything freezes for about a second every 1-5 seconds while the stylus is
> moving over the tablet.
> 
> The tablet is now completely unusable!

Ther eare two patches, one for the Xorg server to log the issue and avoid the crash, and one to actualyl fix the bug in xf86-input-wacom

Are you using the version of xf86-input-wacom which actually has the fix?
Comment 18 Olivier Fourdan 2017-03-06 14:35:38 UTC
(In reply to Olivier Fourdan from comment #17)
> Are you using the version of xf86-input-wacom which actually has the fix?

Note, you can easily check this by looking at the Xorg logs, if you see tons of messages like:

   (EE) BUG: triggered 'if (in_input_thread())'
   (EE) BUG: io.c:658 in WriteToClient()
   (EE) ******** WriteToClient called from input thread *********

Followed by a backtrace, then you are using an affected version of xf86-input-wacom which does not contain the actual fix for this bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.