Summary: | [PATCH] (priv->num_active_touches > priv->num_slots) leading to SIGSEGV | ||
---|---|---|---|
Product: | xorg | Reporter: | Thilo Schulz <thilo> |
Component: | Input/synaptics | Assignee: | Peter Hutterer <peter.hutterer> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | major | ||
Priority: | medium | CC: | benjamin.tissoires, fireandfuel, jbsimpson, jwrdegoede, lohmaier, michael, peter.hutterer, soren121, thilo |
Version: | git | ||
Hardware: | All | ||
OS: | Linux (All) | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
Description
Thilo Schulz
2015-02-01 23:59:54 UTC
Oh, btw... root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/name ELAN0501:00 04F3:300B UNKNOWN root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/abs 260800000000003 root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/ev b root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/ff 0 root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/key 6420 10000 0 0 0 0 root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/led 0 root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/msc 0 root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/rel 0 root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/snd 0 root@Helmuth:/usr/src/libevdev# cat /sys/class/input/event10/device/capabilities/sw 0 actually, can you attach the evemu-record output for this? That sounds like a kernel bug the way you describe it. Created attachment 113029 [details]
Event log triggering bug
Here you go.
Interesting part starts at
0.318317
I thought it might be a kernel bug, but only saw
Documentation/input/event-codes.txt
I didn't see
Documentation/input/multi-touch-protocol.txt
yet for some reason.
I can reproduce this bug by touching the touchpad with finger 1, 2, and while keeping finger 2 on the pad, rapidly switching between finger 1 and 3.
The missing SYN_REPORT does not happen on every occasion, but still fairly often.
Indeed, that's a new one, didn't think that was legal and this sequence will likely cause issues in evdev and libinput as well. Benjamin, Hans: should we force the kernel to send a SYN_REPORT before re-opening a slot in the same sync frame or just deal with it in userspace? E: 0.318317 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0 E: 0.318317 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1 E: 0.318317 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1 E: 0.318317 0003 0035 0757 # EV_ABS / ABS_MT_POSITION_X 757 E: 0.318317 0003 0036 0911 # EV_ABS / ABS_MT_POSITION_Y 911 E: 0.318317 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0 E: 0.318317 0003 0039 6834 # EV_ABS / ABS_MT_TRACKING_ID 6834 E: 0.318317 0003 0035 1398 # EV_ABS / ABS_MT_POSITION_X 1398 E: 0.318317 0003 0036 0832 # EV_ABS / ABS_MT_POSITION_Y 832 E: 0.318317 0003 0000 0757 # EV_ABS / ABS_X 757 E: 0.318317 0003 0001 0911 # EV_ABS / ABS_Y 911 E: 0.318317 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- Benjamin asked for hid-replay recordings, they should help with figuring out if that's a kernel bug. Please attach them here (uncompressed), thanks. https://bentiss.github.io/hid-replay-docs/ (In reply to Peter Hutterer from comment #4) > Indeed, that's a new one, didn't think that was legal and this sequence will > likely cause issues in evdev and libinput as well. Benjamin, Hans: should we > force the kernel to send a SYN_REPORT before re-opening a slot in the same > sync frame or just deal with it in userspace? AFAIK the kernel is supposed to send a SYN before re-using the slot, so this seems to be a kernel bug to me, we should probably detect this and complain about it in libinput though (not necessarily deal with it just detect, complain and not crash). > > E: 0.318317 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0 > E: 0.318317 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1 > E: 0.318317 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1 > E: 0.318317 0003 0035 0757 # EV_ABS / ABS_MT_POSITION_X 757 > E: 0.318317 0003 0036 0911 # EV_ABS / ABS_MT_POSITION_Y 911 > E: 0.318317 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0 > E: 0.318317 0003 0039 6834 # EV_ABS / ABS_MT_TRACKING_ID 6834 > E: 0.318317 0003 0035 1398 # EV_ABS / ABS_MT_POSITION_X 1398 > E: 0.318317 0003 0036 0832 # EV_ABS / ABS_MT_POSITION_Y 832 > E: 0.318317 0003 0000 0757 # EV_ABS / ABS_X 757 > E: 0.318317 0003 0001 0911 # EV_ABS / ABS_Y 911 > E: 0.318317 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- Created attachment 113064 [details]
hid-record log of multiple slot open/close within one SYN_REPORT packet
Created attachment 113065 [details]
Evemu log corresponding to the hid-record log
E: 0.288415 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- E: 0.296207 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0 E: 0.296207 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1 E: 0.296207 0003 002f 0001 # EV_ABS / ABS_MT_SLOT 1 E: 0.296207 0003 0035 0908 # EV_ABS / ABS_MT_POSITION_X 908 E: 0.296207 0003 0036 1062 # EV_ABS / ABS_MT_POSITION_Y 1062 E: 0.296207 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0 E: 0.296207 0003 0039 8787 # EV_ABS / ABS_MT_TRACKING_ID 8787 E: 0.296207 0003 0035 1566 # EV_ABS / ABS_MT_POSITION_X 1566 E: 0.296207 0003 0036 0861 # EV_ABS / ABS_MT_POSITION_Y 861 E: 0.296207 0003 0000 0908 # EV_ABS / ABS_X 908 E: 0.296207 0003 0001 1062 # EV_ABS / ABS_Y 1062 E: 0.296207 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- have fun. :) Thanks for the recordings. I was able to reproduce it and understood what happened. We receive (in one logical frame): 0.293377 Tip Switch: 0 | Contact Id: 0 | X: 539 | Y: 1960 | Contact Count: 3 0.294783 Tip Switch: 1 | Contact Id: 1 | X: 908 | Y: 1062 | Contact Count: 0 0.296187 Tip Switch: 1 | Contact Id: 2 | X: 1566 | Y: 861 | Contact Count: 0 So we look for the slot corresponding to contactID 0 -> 0, and we release it. Then, contactID 2 correspond to an already allocated slot 1, so we send its data. And finally, we have a new ContactID 2, we look for a slot, and we find that slot 0 is unused, so we reuse it... Which should not happen. I am working on a fix. The first quick solution dropped all of the touchpoints, so I will need a little bit more time to figure a proper solution. Hi Benjamin, good to know. Just curious, what kind of event actually triggers the sending of a SYN_REPORT in the kernel, and where in the code does that happen? I was looking at input.c and input-mt.c and it was not immediately obvious to me. Also, you can take your time. The patch I submitted deals with this problem in userland for now and I've had no more crashes. (In reply to Thilo Schulz from comment #11) > Hi Benjamin, good to know. > Just curious, what kind of event actually triggers the sending of a > SYN_REPORT in the kernel, and where in the code does that happen? I was > looking at input.c and input-mt.c and it was not immediately obvious to me. So, the chain of events is the following: - once a whole touch has been received, hid-multitouch.c calls mt_complete_slot(). - this function assigns the current touch to a slot by calling mt_compute_slot() which calls in your case input_mt_get_slot_by_key() (<- this one was in fault). - once the slot is assigned, mt_complete_slot() sends the slot to the input system (input_mt_slot(), input_mt_report_slot_state(), input_event(), etc..) - once we have received all the touch of the current logical report, we call mt_sync_frame() which in turn calls input_mt_sync_frame() to finish sending the slots (and releasing those who need), and finally, it calls input_sync() which sends the SYN_REPORT to the user space. The thing is that you were looking at the input system files, not the driver of your touchscreen :) FWIW, the kernel patch is on the list: https://patchwork.kernel.org/patch/5770071/ (In reply to Benjamin Tissoires from comment #12) > The thing is that you were looking at the input system files, not the driver > of your touchscreen :) Please note this is no exotic hacked touchscreen hardware. This is actually the touchpad built into a fairly recent off-the-shelf laptop (Acer Aspire E5-571-31KM), Acer being one of those few vendors that actually ship laptops without preinstalled Windows. And alas, I fear this is the kind of bug that makes users go "F*** it, I'll just install Windows". > So, the chain of events is the following: > - once a whole touch has been received, hid-multitouch.c calls > mt_complete_slot(). > - this function assigns the current touch to a slot by calling > mt_compute_slot() which calls in your case input_mt_get_slot_by_key() (<- > this one was in fault). > - once the slot is assigned, mt_complete_slot() sends the slot to the input > system (input_mt_slot(), input_mt_report_slot_state(), input_event(), etc..) > - once we have received all the touch of the current logical report, we call > mt_sync_frame() which in turn calls input_mt_sync_frame() to finish sending > the slots (and releasing those who need), and finally, it calls input_sync() > which sends the SYN_REPORT to the user space. Thanks for laying this out in so great detail to me! This quirk looks interesting in our context: if ((td->mtclass.quirks & MT_QUIRK_IGNORE_DUPLICATES) && mt) { struct input_mt_slot *slot = &mt->slots[slotnum]; if (input_mt_is_active(slot) && input_mt_is_used(mt, slot)) return; } So while this probably would not prevent the bug I reported from happening, this seems like a similar error case. All the best, Thilo (In reply to Thilo Schulz from comment #14) > Please note this is no exotic hacked touchscreen hardware. This is actually > the touchpad built into a fairly recent off-the-shelf laptop (Acer Aspire > E5-571-31KM). Yeah, but hid-multitouch can deal with touchscreen/touchpad. So for me hid-multitouch == touchscreen :) > This quirk looks interesting in our context: > if ((td->mtclass.quirks & MT_QUIRK_IGNORE_DUPLICATES) && mt) > { > struct input_mt_slot *slot = &mt->slots[slotnum]; > if (input_mt_is_active(slot) && > input_mt_is_used(mt, slot)) > return; > } > So while this probably would not prevent the bug I reported from happening, > this seems like a similar error case. > I think this is a different context. Here, we are checking if the slot is active and used. The one I fixed (some more work is required actually) check for not active and not used. This is not the exact negation (we would have had a 'or' in one or the other test), so we should be fine here (and I tested this code among a good deal of touchscreens/pads. Created attachment 113804 [details]
evemu-record output from DLL0665 device
I am able to trigger this same behavior on a Dell XPS 13 (2015) 9343 with just one finger. I'm not sure what program generated the log posted by Thilo in the first post in this thread. Looking at the output of evemu-record, I don't see see any reference to slot for single-finger touches, but I do continuously generate the BUG error in Xorg logs. I also see a lot of this line:
(EE) [dix] DLL0665:01 06CB:76AD UNKNOWN: unable to find touch point 0
though occasionally it says "touch point 1".
Capabilities seem to be the same, so I suspect this is actually the same Synaptics pad internally.
I've also had a couple of crashes, one of which did drop a core from a SEGV. No backtrace as yet, need to get debuginfo for X on this machine.
Not sure I can get a good backtrace from it - gdb is giving me some addresses that seem off to me and none seem to match the symbols I got from the unstripped binary. Reading symbols from /usr/bin/Xorg...Reading symbols from /usr/lib/debug//usr/bin/Xorg...done. done. [New LWP 1336] [New LWP 1378] Core was generated by `/usr/bin/X -core :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00007ff4f64278ec in ?? () (gdb) bt #0 0x00007ff4f64278ec in ?? () #1 0x00007ff4f700cf70 in ?? () #2 0x00007ff4f74f1620 in ?? () #3 0x000000000d55e3c8 in ?? () #4 0x582d6b77028e5f00 in ?? () #5 0x00007ff4f700cf70 in ?? () #6 0x00007ff4f7022050 in ?? () #7 0x00007ff4f700cf70 in ?? () #8 0x00007ff4f6427dca in ?? () #9 0x00007ff4f7022050 in ?? () #10 0x00007ff4f798bd40 in ?? () #11 0x00007ff4f798bd50 in ?? () #12 0x00007ff4f6428c4c in ?? () #13 0x00007fff2099cb70 in ?? () #14 0x00007ff4f74c7430 in ?? () #15 0x00007ff4f700ef30 in ?? () #16 0x00007fff2099c770 in ?? () #17 0x0000000000000020 in ?? () #18 0x0000000000000000 in ?? () (In reply to Michael Leuchtenburg from comment #17) > Not sure I can get a good backtrace from it - gdb is giving me some > addresses that seem off to me and none seem to match the symbols I got from > the unstripped binary. Hey Michael, I've had the same problem when I was debugging this. Corefiles did not seem to yield a valid core stack to me. I knew the crash was probably happening in the synaptics input driver of X, so I compiled that part from git with debugging symbols, then started the X server remotely inside cgdb (which is an ncurses frontend for gdb that is pretty neat) and just stepped through code in the synaptics driver. Don't need any debugging symbols for X itself or any of its core files. I've had a quick glance at your evemu log and there's nothing in there that would trigger the bug I described. The report from my first post was no specific software, it was the event sequence I gleamed off when debugging with GDB and was written down by hand. It would be interesting to know whether your crash still happens if you compile the xf86-input-synaptics module with my proposed patch. The Dell XPS 13 needs a kernel fix, see https://bugzilla.redhat.com/show_bug.cgi?id=1188439 Hey Thilo, Ah, thanks for the X debugging tip. I'm traveling at the moment so I won't be able to step through X running live until I get back. Oddly it looks like no library was loaded at the addresses given in the backtrace - not sure what's up with that. I suppose I might be able to get a crash dump even without another machine to interface with gdb on. It is rather odd that my evemu log looks so different. Perhaps it's actually a different bug. I'm currently compiling with the kernel patch Benjamin Tissoires submitted, which should also fix the problem. Looks like that's currently stalled in review due to not being sure how to handle the change in the interface it causes for other drivers that use hid-multitouch. I'll give your synaptics_drv patch a try in a bit. Peter, Yup, I've got a patch in my kernel for that. Sorry, I meant "that use input-mt", not "that use hid-multitouch". the evemu log looks quite normal, nothing out of the ordinary in there. your backtrace is buggered because you probably had some stack smash happen. anyway, for the evemu log I recommend leaving it running until you see the bug happen in the log. Then try to replay the recording through your device (after restarting X) and see if it triggers the bug again. that should narrow down the event sequence that's the culprit. and if you restart evemu every couple of min while you're reproducing the final event log wont be that long... there's also the odd chance that you're getting a SYN_DROPPED event though synaptics should handle this more or less gracefully now. maybe put a printf into the driver to narrow that down. Created attachment 114404 [details]
evemu capture that reproduces bug
This log reproduces the bug when replayed with evemu-device/evemu-play. I think it was about 2/3 of the way through but I didn't time it.
The culprit is: E: 119.483341 0003 002f 0000 # EV_ABS / ABS_MT_SLOT 0 E: 119.483341 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1 E: 119.483341 0003 0039 1272 # EV_ABS / ABS_MT_TRACKING_ID 1272 E: 119.483341 0003 0035 0585 # EV_ABS / ABS_MT_POSITION_X 585 E: 119.483341 0003 0036 0414 # EV_ABS / ABS_MT_POSITION_Y 414 E: 119.483341 0003 0000 0790 # EV_ABS / ABS_X 790 E: 119.483341 0003 0001 0120 # EV_ABS / ABS_Y 120 E: 119.483341 0000 0000 0000 # ------------ SYN_REPORT (0) ---------- Which is the same as pointed out in Comment #4. Benjamin, what's the status on the kernel patch? Are you sure it's the same? In my case slot 0 gets reassigned. I don't see that happening here. the kernel filters out events that represent the current state. In your case you had a second touchpoint so the switch from slot 0 to 1 to 0 was visible. Here there's no slot 1, so the second ABS_MT_SLOT 0 event is never visible. From the synaptics driver's POV both cause the same issue, there needs to be a SYN_REPORT after the tracking ID -1. (In reply to Thilo Schulz from comment #25) > Are you sure it's the same? > In my case slot 0 gets reassigned. I don't see that happening here. E: 119.483341 0003 0039 -001 # EV_ABS / ABS_MT_TRACKING_ID -1 E: 119.483341 0003 0039 1272 # EV_ABS / ABS_MT_TRACKING_ID 1272 ignore my previous comment. it is the same, however, the codepath leading to this must be a tad different. The codepath may be different, but it is also fixed by the kernel patch which Benjamin proposed. It would be nice to see that patch make it into 4.0. Looks like it's still blocked on fixing up the other drivers that depend on that function. I don't understand this subsystem well enough to just run with that, but I'm willing and able to test further patches. (In reply to Michael Leuchtenburg from comment #28) > It would be nice to see that patch make it into 4.0. Looks like it's still > blocked on fixing up the other drivers that depend on that function. I don't > understand this subsystem well enough to just run with that, but I'm willing > and able to test further patches. Actually, the wacom will be fixed in 4.1, so I just resent the patch few minutes ago: https://patchwork.kernel.org/patch/6123751/ I don't think we can expect this to land in 4.0, but 4.1 is reasonable. If you want it sooner, ask your distribution to backport https://git.kernel.org/cgit/linux/kernel/git/jikos/hid.git/commit/?h=for-4.1/wacom&id=9a1c001298fd567c0f0776ab54ab9965eeb9019f and this patch, and you should be solved. Just upgraded to debian jessie and had this problem come up again. Judging from the discussion in https://patchwork.kernel.org/patch/6123751/ Mr. Rydberg seems to want to have this fixed in userland. If this is the way to go, I propose you have a look at my initial patch. It should deal with those issues by injecting a SYN_REPORT in userland, thus making your assumption that there are no separate slot uses within one SYN_REPORT packet valid again. FYI: the proposed patch attached to this bug in #c1 fixes the problem for me on a XPS 13 9343 (2015) running Mageia 5 (see also https://bugs.mageia.org/show_bug.cgi?id=16093 ) There are two bugs here, one is that the slot closes and re-opens in the same event frame, i.e. the sequence in comment 24. That's a kernel bug and needs to be fixed there. The second issue is the direct transition from one touch to the other, i.e. an ABS_MT_TRACKING_ID <x> without a preceding ABS_MT_TRACKING_ID -1. That's the one Henrik was arguing for on the kernel list. libevdev currently filters those out though so they'll never show up for the driver and thus don't trigger this bug. Benjamin, what's the status of the kernel patch here? (In reply to Peter Hutterer from comment #32) > There are two bugs here, one is that the slot closes and re-opens in the > same event frame, i.e. the sequence in comment 24. That's a kernel bug and > needs to be fixed there. It's fixed in kernel v4.1. > > The second issue is the direct transition from one touch to the other, i.e. > an ABS_MT_TRACKING_ID <x> without a preceding ABS_MT_TRACKING_ID -1. That's > the one Henrik was arguing for on the kernel list. libevdev currently > filters those out though so they'll never show up for the driver and thus > don't trigger this bug. I am a little bit lost with this bug. I don't recall having this (ever) exposed in a real hardware. We talked about it publicly, but we never had the case. Can you point which evemu-record (and hid-record ideally) exposes that? can't remember where this came up either but I don't think it was a real device. in that case I'll close the bug, please update to v4.1 or get your distribution to backport the fix. *** Bug 84656 has been marked as a duplicate of this bug. *** |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.