Bug 103298

Summary: infinite loop when lid events are repeated - list item points to itself
Product: Wayland Reporter: Alban Browaeys <prahal>
Component: libinputAssignee: Wayland bug list <wayland-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: major    
Priority: medium CC: peter.hutterer, prahal
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: lid close-open evemu record
debounce lid events
debug: trigger a core dump on list corruption
debug: hang cpu - session without patch : action sequence as directory name - all events at once
debug: hang or assert - patched
the version of the patch I applied for session1 - changed to apply against 1.8
patch adapted to branch 1.8 - works fine

Description Alban Browaeys 2017-10-16 16:46:15 UTC
Created attachment 134867 [details]
lid close-open evemu record

steps: when closing and opening lid do so more than once before suspend and resume complete

then gnome-shell consume 100% cpu and no in


- libinput 1.8.3-1 from debian sid
- wayland gnome-shell 3.26.1-2 x86_64
- /dev/input/event5:	Lid Switch

udevadm info /sys/class/input/event5
P: /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0D:00/input/input8/event5
N: input/event5
E: DEVNAME=/dev/input/event5
E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0D:00/input/input8/event5
E: ID_INPUT=1
E: ID_INPUT_SWITCH=1
E: ID_PATH=acpi-PNP0C0D:00
E: ID_PATH_TAG=acpi-PNP0C0D_00
E: LIBINPUT_ATTR_LID_SWITCH_RELIABILITY=reliable
E: LIBINPUT_DEVICE_GROUP=19/0/5/0:PNP0C0D/button
E: MAJOR=13
E: MINOR=69
E: SUBSYSTEM=input
E: TAGS=:power-switch:
E: USEC_INITIALIZED=39835474




gdb backtrace:
(gdb) bt
#0  0x00007fb99722f06e in lid_switch_keyboard_event (time=<optimized out>, event=<optimized out>, data=0x560b543f10f0) at evdev-lid.c:98
#1  0x00007fb997225bfd in post_device_event (device=device@entry=0x560b5e87a800, time=time@entry=698204515849, type=type@entry=LIBINPUT_EVENT_KEYBOARD_KEY, event=0x560b5cf64720) at libinput.c:2113
#2  0x00007fb997226f3f in keyboard_notify_key (device=device@entry=0x560b5e87a800, time=time@entry=698204515849, key=<optimized out>, state=<optimized out>) at libinput.c:2212
#3  0x00007fb997229987 in fallback_keyboard_notify_key (device=device@entry=0x560b5e87a800, time=time@entry=698204515849, key=<optimized out>, state=<optimized out>, dispatch=<optimized out>) at evdev.c:169
#4  0x00007fb99722c4de in fallback_process_key (e=0x7ffde9a12f70, e=0x7ffde9a12f70, time=698204515849, device=0x560b5e87a800, dispatch=<optimized out>) at evdev.c:857
#5  0x00007fb99722c4de in fallback_process (evdev_dispatch=<optimized out>, device=0x560b5e87a800, event=0x7ffde9a12f70, time=698204515849) at evdev.c:1168
#6  0x00007fb997229f19 in evdev_process_event (e=0x7ffde9a12f70, device=0x560b5e87a800) at evdev.c:1904
#7  0x00007fb997229f19 in evdev_device_dispatch_one (ev=0x7ffde9a12f70, device=0x560b5e87a800) at evdev.c:1912
#8  0x00007fb997229f19 in evdev_device_dispatch (data=0x560b5e87a800) at evdev.c:1971
#9  0x00007fb997225cef in libinput_dispatch (libinput=0x560b54202aa0) at libinput.c:1996
#10 0x00007fb9a15f2e9c in  () at /usr/lib/x86_64-linux-gnu/mutter/libmutter-clutter-1.so
#11 0x00007fb9a2976f67 in g_main_dispatch (context=0x560b541f70d0) at ../../../../glib/gmain.c:3148
#12 0x00007fb9a2976f67 in g_main_context_dispatch (context=context@entry=0x560b541f70d0) at ../../../../glib/gmain.c:3813
#13 0x00007fb9a29771a0 in g_main_context_iterate (context=0x560b541f70d0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../../../../glib/gmain.c:3886
#14 0x00007fb9a29774b2 in g_main_loop_run (loop=0x560b544af720) at ../../../../glib/gmain.c:4082
#15 0x00007fb9a0eac93c in meta_run () at /usr/lib/x86_64-linux-gnu/libmutter-1.so.0
#16 0x0000560b52aed287 in main (argc=<optimized out>, argv=<optimized out>) at ../src/main.c:462
(gdb) up
#1  0x00007fb997225bfd in post_device_event (device=device@entry=0x560b5e87a800, time=time@entry=698204515849, type=type@entry=LIBINPUT_EVENT_KEYBOARD_KEY, event=0x560b5cf64720) at libinput.c:2113
2113			listener->notify_func(time, event, listener->notify_func_data);
(gdb) p &device->event_listeners
$12 = (struct list *) 0x560b5e87a820
(gdb) p device->event_listeners
$13 = {prev = 0x560b54459218, next = 0x560b543f1140}
(gdb) p device->event_listeners->next
$14 = (struct list *) 0x560b543f1140
(gdb) p *device->event_listeners->next
$15 = {prev = 0x560b543f1140, next = 0x560b543f1140}
(gdb) p device->event_listeners->prev
$16 = (struct list *) 0x560b54459218
(gdb) p *device->event_listeners->prev
$17 = {prev = 0x560b543f1140, next = 0x560b5e87a820}



Workaround :
(gdb) p device->event_listeners->next->next = 0x560b54459218
$11 = (struct list *) 0x560b54459218
(gdb) c
Comment 1 Alban Browaeys 2017-10-16 17:07:46 UTC
Created attachment 134868 [details] [review]
debounce lid events

The patch remove the event before adding it anew. It applies against 1.8.3 which is the version I tested against (debian sid).

This fixes the issue that the same event is inserted twice into the list. That we then ends up with a list item that points to itself.

I also safeguards the removal of this event listener by this keyboard_is_monitored flag.

The flag tells we already attached this keyboard listener so that even if the state is_closed is not switched yet we avoid duplicates.
Comment 2 Peter Hutterer 2017-10-18 05:54:04 UTC
I've tried to reproduce this with a test case, but I can't quite figure out what the trigger is. It indicates libinput getting out of sync somewhere with the state of the device, but that's a bug that needs to be fixed where it happens. A simple boolean guard isn't going to be enough here.

Can you try to figure out the exact sequence of events that needs to happen for this to occur please?
Comment 3 Michael Thayer 2017-10-26 09:00:00 UTC
I think I am seeing the same thing with Ubuntu 17.10, libinput 1.8.2-2.  The patch applies cleanly, I will see if it helps.
Comment 4 Michael Thayer 2017-10-26 09:43:32 UTC
I was able to trigger the crash pretty easily with the old library still in use by opening and closing the lid a few times in quick succession and then trying to type an update to this bug.  After restarting I was not immediately able to trigger it, but instead brought gnome-shell down with an assertion (g_malloc failed to allocate 1.8x10^19 bytes).  Working down the stack a few frames it looked like it was trying to use an invalid structure, presumably in memory which had already been freed.  Which at least raises the possibility that gnome-shell has some memory use bug which triggers on suspend and resume and that libinput was collateral damage.
Comment 5 Michael Thayer 2017-10-26 12:42:15 UTC
If it is of interest, I am continuing investigation on Launchpad bug 1724259[1].

[1] https://bugs.launchpad.net/ubuntu/+source/gnome-shell/+bug/1724259
Comment 6 Michael Thayer 2017-10-26 13:33:31 UTC
Note that the more applications (windows) running, the easier it is to trigger this.
Comment 7 Alban Browaeys 2017-10-26 20:12:39 UTC
Sorry taking me a while to sort out how to setup a test/dev environment in a VM.
About the crash I think it is not directly related to this issue. The current issue might prevent an user to reach the crash in gnome-shell.

In the crash I get gnome-shell hangs (100%cpu , libinput is in a busy processing LID event list). But no crash. With the fix I get a crash at random times but pretty much nailed down : shell_gtk_embed_window_created_cb is called with its embed argument freed and dereference it. Sadly I cannot seem to sort out how to reproduce (even though it kills my session every 2 days). I see that with coredumpctl and gdb.

But I believe the current issue is higher priority. When libinput loop I cannot even switch to a VT to kill it. Only way out is to ssh to the box and mess with gdb.
Comment 8 Alban Browaeys 2017-10-26 20:57:55 UTC
Created attachment 135092 [details] [review]
debug: trigger a core dump on list corruption

I send this hack to get a coredump / backtrace instead of busy loop to help other debug this issue. I will investigate anew as time passed and I somehow lost track of the in and outs of this fix.
The core of the debug 

with this debug patch I get these journald entries and the adhoc coredumpctl backtrace.

Note that 0x5647705b6c20 is added 0x564770541940 twice in a raw.


sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device init event listener - 0x564770541940 which has prev (nil) and next (nil)
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  to 0x564770545110
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  0x5647705a72e0 which has prev (nil) and next (nil)
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device POST add event listener -  0x5647705a72e0 which has prev 0x564770545110 and next 0x564770545110
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  to 0x5647705b6c20
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  0x5647705a71a8 which has prev (nil) and next (nil)
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device POST add event listener -  0x5647705a71a8 which has prev 0x5647705b6c20 and next 0x5647705b6c20
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  to 0x56477059c930
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  0x5647705a70f8 which has prev (nil) and next (nil)
sept. 27 02:32:55 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device POST add event listener -  0x5647705a70f8 which has prev 0x56477059c930 and next 0x56477059c930
sept. 27 02:33:14 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  to 0x5647705b6c20
sept. 27 02:33:14 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  0x564770541940 which has prev 0x564770541940 and next 0x564770541940
sept. 27 02:33:14 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device POST add event listener -  0x564770541940 which has prev 0x5647705b6c20 and next 0x5647705a71a8
sept. 27 02:33:18 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  to 0x5647705b6c20
sept. 27 02:33:18 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device PRE add event listener -  0x564770541940 which has prev 0x5647705b6c20 and next 0x5647705a71a8
sept. 27 02:33:18 cyclope org.gnome.Shell.desktop[14480]: LIBINPUT : device POST add event listener -  0x564770541940 which has prev 0x564770541940 and next 0x564770541940






           PID: 14799 (gnome-shell)
           UID: 1000 (prahal)
           GID: 1000 (prahal)
        Signal: 6 (ABRT)
     Timestamp: Wed 2017-09-27 02:33:18 CEST (1min 1s ago)
  Command Line: /usr/bin/gnome-shell
    Executable: /usr/bin/gnome-shell
 Control Group: /user.slice/user-1000.slice/session-396.scope
          Unit: session-396.scope
         Slice: user-1000.slice
       Session: 396
     Owner UID: 1000 (prahal)
       Boot ID: c29df59057834e2bb77ae569ae135cfb
    Machine ID: 2b483dcbcecb6729df407c5b5382b0d1
      Hostname: cyclope
       Storage: /var/lib/systemd/coredump/core.gnome-shell.1000.c29df59057834e2bb77ae569ae135cfb.14799.1506472398000000.lz4
       Message: Process 14799 (gnome-shell) of user 1000 dumped core.
                
                Stack trace of thread 14799:
                #0  0x00007fd6b1220fff __GI_raise (libc.so.6)
                #1  0x00007fd6b122242a __GI_abort (libc.so.6)
                #2  0x00007fd6a859cea4 create_dump (libinput.so.10)
                #3  0x00007fd6a85a626b lid_switch_process_switch (libinput.so.10)
                #4  0x00007fd6a85a10a9 evdev_process_event (libinput.so.10)
                #5  0x00007fd6a859cd2f libinput_dispatch (libinput.so.10)
                #6  0x00007fd6b22feeac dispatch_libinput (libmutter-clutter-1.0.so)
                #7  0x00007fd6b17f2f67 g_main_dispatch (libglib-2.0.so.0)
                #8  0x00007fd6b17f31a0 g_main_context_iterate (libglib-2.0.so.0)
                #9  0x00007fd6b17f34b2 g_main_loop_run (libglib-2.0.so.0)
                #10 0x00007fd6b2fa6a8c meta_run (libmutter.so.0)
                #11 0x000056476f0224b7 main (gnome-shell)
                #12 0x00007fd6b120e2e1 __libc_start_main (libc.so.6)
                #13 0x000056476f0225fa _start (gnome-shell)

Thread 1 (Thread 0x7fd6b47a9a80 (LWP 14799)):
#0  0x00007fd6b1220fff in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fd6b122242a in __GI_abort () at abort.c:89
#2  0x00007fd6a859cea4 in create_dump () at libinput.c:2013
#3  0x00007fd6a859cea4 in libinput_device_add_event_listener (device=<optimized out>, listener=0x564770541940, notify_func=<optimized out>, notify_func_data=<optimized out>) at libinput.c:2047
#4  0x00007fd6a85a626b in lid_switch_process_switch (e=<optimized out>, e=<optimized out>, time=105062899623, device=0x5647705450f0, dispatch=0x5647705418f0) at evdev-lid.c:135
#5  0x00007fd6a85a626b in lid_switch_process (evdev_dispatch=0x5647705418f0, device=0x5647705450f0, event=<optimized out>, time=105062899623) at evdev-lid.c:155
#6  0x00007fd6a85a10a9 in evdev_process_event (e=0x7ffd18541870, device=0x5647705450f0) at evdev.c:1904
#7  0x00007fd6a85a10a9 in evdev_device_dispatch_one (ev=0x7ffd18541870, device=0x5647705450f0) at evdev.c:1912
#8  0x00007fd6a85a10a9 in evdev_device_dispatch (data=0x5647705450f0) at evdev.c:1971
#9  0x00007fd6a859cd2f in libinput_dispatch (libinput=0x56477052bdf0) at libinput.c:1998
#10 0x00007fd6b22feeac in dispatch_libinput (manager_evdev=0x56477052a8c0 [ClutterDeviceManagerEvdev]) at evdev/clutter-device-manager-evdev.c:788
#11 0x00007fd6b22feeac in clutter_event_dispatch (g_source=g_source@entry=0x5647705d7de0, callback=<optimized out>, user_data=<optimized out>) at evdev/clutter-device-manager-evdev.c:810
#12 0x00007fd6b17f2f67 in g_main_dispatch (context=0x56477035bc00) at ../../../../glib/gmain.c:3148
#13 0x00007fd6b17f2f67 in g_main_context_dispatch (context=context@entry=0x56477035bc00) at ../../../../glib/gmain.c:3813
#14 0x00007fd6b17f31a0 in g_main_context_iterate (context=0x56477035bc00, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../../../../glib/gmain.c:3886
#15 0x00007fd6b17f34b2 in g_main_loop_run (loop=0x5647706054b0) at ../../../../glib/gmain.c:4082
#16 0x00007fd6b2fa6a8c in meta_run () at core/main.c:572
#17 0x000056476f0224b7 in main (argc=<optimized out>, argv=<optimized out>) at main.c:471
Comment 9 Peter Hutterer 2017-10-27 05:35:19 UTC
Ok, got it. Sequence is: close lid, press key, open lid, close lid. i.e. the open-lid event has to come in after the keyboard event. Patch is here:

https://lists.freedesktop.org/archives/wayland-devel/2017-October/035562.html
Comment 10 Alban Browaeys 2017-10-27 19:07:01 UTC
Created attachment 135130 [details]
debug: hang cpu - session without patch : action sequence as directory name - all events at once

03-lidcloseopenclose-suspend-lidopen_NOK included has the hanging backtrace.
The events are grabbed with this command in current working directory:
parallel -j $(ls -1 /dev/input/event* | wc -l ) sudo evemu-record {}  {/}.emu ::: /dev/input/event*

This is of interest as session1 with the patch was not successful. I send the session 1 in a followup comment.
Comment 11 Alban Browaeys 2017-10-27 19:11:42 UTC
Created attachment 135131 [details]
debug: hang or assert - patched

only the third test has exact actions in directory name. The two others I do not know when suspend completed so is not encoded in their names.
02 is the assert in libinput list_remove
03 is the hang for the issue at stake
Comment 12 Alban Browaeys 2017-10-27 19:13:52 UTC
Created attachment 135132 [details] [review]
the version of the patch I applied for session1 - changed to apply against 1.8
Comment 13 Alban Browaeys 2017-10-27 19:15:45 UTC
Comment on attachment 135132 [details] [review]
the version of the patch I applied for session1 - changed to apply against 1.8

>Once the lid is closed, the keyboard event listener is set up to open the lid
>for us on keyboard events. With the right sequence, we can trigger the
>listener to be added to the list multiple times, triggering an assert in the
>list test code (or an infinite loop in the 1.8 branch).
>
>Conditions:
>* SW_LID value 1 - sets up the keyboard listener
>* keyboard event - sets lid_is_closed to false
>* SW_LID value 0 - is ignored because we're already open
>* SW_LID value 1 - sets up the keyboard listener again
>
>https://bugs.freedesktop.org/show_bug.cgi?id=103298
>
>Signed-off-by: Peter Hutterer <peter.hutterer at who-t.net>
>---
> src/evdev-fallback.c | 13 ++++++-------
> test/test-switch.c   | 40 +++++++++++++++++++++-------------------
> 2 files changed, 27 insertions(+), 26 deletions(-)
>
>--- a/test/test-lid.c
>+++ b/test/test-lid.c
>@@ -426,26 +426,27 @@
> 
> 	keyboard = litest_add_device(li, LITEST_KEYBOARD);
> 
>-	litest_lid_action(sw, LIBINPUT_SWITCH_STATE_ON);
>-	litest_drain_events(li);
>-
>-	litest_event(keyboard, EV_KEY, KEY_A, 1);
>-	litest_event(keyboard, EV_SYN, SYN_REPORT, 0);
>-	litest_event(keyboard, EV_KEY, KEY_A, 0);
>-	litest_event(keyboard, EV_SYN, SYN_REPORT, 0);
>-	libinput_dispatch(li);
>-
>-	event = libinput_get_event(li);
>-	litest_is_switch_event(event,
>-			       LIBINPUT_SWITCH_LID,
>-			       LIBINPUT_SWITCH_STATE_OFF);
>-
>-	litest_assert_only_typed_events(li, LIBINPUT_EVENT_KEYBOARD_KEY);
>-
>-	litest_lid_action(sw, LIBINPUT_SWITCH_STATE_OFF);
>-	litest_assert_empty_queue(li);
>-
>-	libinput_event_destroy(event);
>+	for (int i = 0; i < 3; i++) {
>+		litest_lid_action(sw, LIBINPUT_SWITCH_STATE_ON);
>+		litest_drain_events(li);
>+
>+		litest_event(keyboard, EV_KEY, KEY_A, 1);
>+		litest_event(keyboard, EV_SYN, SYN_REPORT, 0);
>+		litest_event(keyboard, EV_KEY, KEY_A, 0);
>+		litest_event(keyboard, EV_SYN, SYN_REPORT, 0);
>+		libinput_dispatch(li);
>+
>+		event = libinput_get_event(li);
>+		litest_is_switch_event(event,
>+				       LIBINPUT_SWITCH_LID,
>+				       LIBINPUT_SWITCH_STATE_OFF);
>+		libinput_event_destroy(event);
>+
>+		litest_assert_only_typed_events(li, LIBINPUT_EVENT_KEYBOARD_KEY);
>+
>+		litest_lid_action(sw, LIBINPUT_SWITCH_STATE_OFF);
>+		litest_assert_empty_queue(li);
>+	}
> 	litest_delete_device(keyboard);
> }
> END_TEST
>--- a/src/evdev-lid.c
>+++ b/src/evdev-lid.c
>@@ -104,17 +104,16 @@
> 	if (!dispatch->keyboard.keyboard)
> 		return;
> 
>+	libinput_device_remove_event_listener(&dispatch->keyboard.listener);
>+
> 	if (is_closed) {
> 		libinput_device_add_event_listener(
> 					&dispatch->keyboard.keyboard->base,
> 					&dispatch->keyboard.listener,
> 					lid_switch_keyboard_event,
> 					dispatch);
>-	} else {
>-		libinput_device_remove_event_listener(
>-					&dispatch->keyboard.listener);
>-		libinput_device_init_event_listener(
>-					&dispatch->keyboard.listener);
>+
>+		libinput_device_init_event_listener(&dispatch->keyboard.listener);
> 	}
> }
> 
>@@ -130,11 +129,12 @@
> 	case SW_LID:
> 		is_closed = !!e->value;
> 
>-		if (dispatch->lid_is_closed == is_closed)
>-			return;
> 		lid_switch_toggle_keyboard_listener(dispatch,
> 						    is_closed);
> 
>+		if (dispatch->lid_is_closed == is_closed)
>+			return;
>+
> 		dispatch->lid_is_closed = is_closed;
> 
> 		lid_switch_notify_toggle(dispatch, device, time);
Comment 14 Alban Browaeys 2017-10-27 20:15:08 UTC
sorry I did not saw the else in hunk 2 of the patch against master when porting it to 1.8 branch. I retry (session1 data is thus moot, only session2 unpatched is of interest)
This explains the assert.
I still test against 1.8.
Comment 15 Alban Browaeys 2017-10-29 04:26:04 UTC
Created attachment 135149 [details] [review]
patch adapted to branch 1.8 - works fine

I confirm with this "correct" version of the patch the 100% cpu hang is gone.
Good job !
Comment 16 Peter Hutterer 2017-10-30 01:31:28 UTC
commit 41a70bbe924c4867283bfdec67cd76012744afcb
Author: Peter Hutterer <peter.hutterer@who-t.net>
Date:   Fri Oct 27 15:03:52 2017 +1000

    fallback: fix lid switch event listener being initialized twice


I'll get a new 1.8 release out asap, thanks for testing!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.