Bug 90051 - udevd have no measure to avoid exiting before all jobs are done
Summary: udevd have no measure to avoid exiting before all jobs are done
Status: NEW
Alias: None
Product: systemd
Classification: Unclassified
Component: general (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: Kay Sievers
QA Contact: systemd-bugs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-04-16 08:36 UTC by Tom Yan
Modified: 2015-05-26 12:12 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Shell script to reproduce the problem (1.25 KB, text/plain)
2015-04-16 08:36 UTC, Tom Yan
Details
Shell script to reproduce the problem (1.24 KB, text/plain)
2015-04-16 08:39 UTC, Tom Yan
Details
modified "hook" script for reproducing issue (621 bytes, text/plain)
2015-04-24 16:18 UTC, Tom Yan
Details
Case 1 -- without `sleep 1` (555.41 KB, image/jpeg)
2015-04-24 16:19 UTC, Tom Yan
Details
Case 1 -- with `sleep 1` (595.55 KB, image/jpeg)
2015-04-24 16:19 UTC, Tom Yan
Details
Case 2 -- without `sleep 1` (605.49 KB, image/jpeg)
2015-04-24 16:20 UTC, Tom Yan
Details
Case 2 -- with `sleep 1` (618.65 KB, image/jpeg)
2015-04-24 16:21 UTC, Tom Yan
Details
control hook script (569 bytes, text/plain)
2015-04-24 16:38 UTC, Tom Yan
Details
5 trials of reproducing timeout issue (6.87 KB, text/plain)
2015-04-24 20:22 UTC, Tom Yan
Details
journalctl -b -u systemd-udevd (after patch) (256.44 KB, text/plain)
2015-04-25 20:51 UTC, Tom Yan
Details
journalctl -b -u systemd-udevd (before patch) (262.13 KB, text/plain)
2015-04-25 20:53 UTC, Tom Yan
Details

Description Tom Yan 2015-04-16 08:36:55 UTC
Created attachment 115114 [details]
Shell script to reproduce the problem

For example I want to get the following udev rule applied:
ATTR{devpath}=="13", ATTR{power/wakeup}="enabled"

If you run udevd and udevadm trigger in a script, like this one:

https://projects.archlinux.org/svntogit/packages.git/tree/trunk/initcpio-hook-udev?h=packages/systemd

you could notice that udevd is not guaranteed to exit after it finishes its job despite all possible timeout parameters in different commands (udevd/control/settle)

In the case of services, if you stop the udevd service soon after the trigger service, it will wait for its timeout (30s by default) by doing nothing, leaving the job undone.

Attached is a script to reproduce the problem. By toggling `sleep 1` you shall see the difference.
Comment 1 Tom Yan 2015-04-16 08:39:45 UTC
Created attachment 115115 [details]
Shell script to reproduce the problem
Comment 2 Tom Gundersen 2015-04-24 10:45:24 UTC
Hi,

I'm having a hard time following this report.

Let summarize how things are meant to work:

Events received by udev (due to trigger) are either 'queued' or 'running' at any point in time. When you call 'exit', all the 'queued' events are dropped and the 'running' events are finished before we exit (up to a timeout of 30s). That means that even if one event is 'running' until the timeout, no new events are moved from 'queued' to running state.

What you should see is that once you call exit, udev exits "almost immediately". If it does not, that should mean that an event is stuck in "running" state for a long time (this should not happen, but could happen due to kernel issues or other external factors). You can verify that by seeing if udevd has any child processes after you call exit.

If udevd has no child processes, but still does not exit after you call exit, that is an udev bug. I was not able to reproduce that with your script though, so would be great if you could check that out.

Lastly, if if you want to make sure that udevd only exits after all the events caused by trigger have been processed, that is precisely what settle is for. It waits for udevd to receive the triggered events and then it waits for udevd to finish procssing them all (new events may appear at any time, but we make no guarantees about those).
Comment 3 Tom Yan 2015-04-24 12:07:50 UTC
(In reply to Tom Gundersen from comment #2)
> What you should see is that once you call exit, udev exits "almost
> immediately".
This is true for `udevadm control --exit`, but if udevd is started and stopped as a service, this is generally false. Here is an example, which you can see that it waits for exactly 30s:

[tom@localhost ~]$ systemctl status -l systemd-udevd
● systemd-udevd.service - udev Kernel Device Manager
   Loaded: loaded (/usr/lib/systemd/system/systemd-udevd.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2015-04-24 19:36:02 HKT; 46s ago
     Docs: man:systemd-udevd.service(8)
           man:udev(7)
 Main PID: 194 (systemd-udevd)
   CGroup: /system.slice/systemd-udevd.service
           └─194 /usr/lib/systemd/systemd-udevd

Apr 24 19:36:02 localhost systemd-udevd[194]: starting version 219
[tom@localhost ~]$ systemctl status -l systemd-udevd
● systemd-udevd.service - udev Kernel Device Manager
   Loaded: loaded (/usr/lib/systemd/system/systemd-udevd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Fri 2015-04-24 19:37:26 HKT; 2s ago
     Docs: man:systemd-udevd.service(8)
           man:udev(7)
  Process: 659 ExecStart=/usr/lib/systemd/systemd-udevd (code=exited, status=0/SUCCESS)
 Main PID: 659 (code=exited, status=0/SUCCESS)

Apr 24 19:36:55 localhost systemd-udevd[659]: starting version 219
Apr 24 19:36:55 localhost systemd[1]: Started udev Kernel Device Manager.
Apr 24 19:36:55 localhost systemd[1]: Stopping udev Kernel Device Manager...
Apr 24 19:36:55 localhost systemd-udevd[659]: worker [668] did not accept message -1 (Connection refused), kill it
Apr 24 19:36:55 localhost systemd-udevd[659]: worker [672] did not accept message -1 (Connection refused), kill it
Apr 24 19:36:55 localhost systemd-udevd[659]: worker [677] did not accept message -1 (Connection refused), kill it
Apr 24 19:36:55 localhost systemd-udevd[659]: worker [682] did not accept message -1 (Connection refused), kill it
Apr 24 19:36:55 localhost systemd-udevd[659]: worker [682] terminated by signal 9 (Killed)
Apr 24 19:37:25 localhost systemd-udevd[659]: timeout, giving up waiting for workers to finish
Apr 24 19:37:26 localhost systemd[1]: Stopped udev Kernel Device Manager.

But during the wait, I only see the main pid with `watch -n 0.1 pgrep udev`. (Well most of the wait, like 29.x seconds out of 30)

> Lastly, if if you want to make sure that udevd only exits after all the
> events caused by trigger have been processed, that is precisely what settle
> is for. It waits for udevd to receive the triggered events and then it waits
> for udevd to finish procssing them all (new events may appear at any time,
> but we make no guarantees about those).

It doesn't seem to work in the way you said. For example in this script:
https://projects.archlinux.org/svntogit/packages.git/tree/trunk/initcpio-hook-udev?h=packages/systemd
I assume that `udevadm settle` should make sure that events triggered by the two commands before it should be finished? But from what I am experiencing it does not.

For example, a number of kernel modules is included in the initramfs. Suppose given enough time (with `sleep` command), 26 of them would be loaded in early user space, but if only `udevadm settle` is used, the number could be random.
Comment 4 Lennart Poettering 2015-04-24 13:06:46 UTC
Hmm, does adding KillMode=mixed to the systemd-udevd.service unit file fix your issue?
Comment 5 Tom Yan 2015-04-24 14:10:15 UTC
(In reply to Lennart Poettering from comment #4)
> Hmm, does adding KillMode=mixed to the systemd-udevd.service unit file fix
> your issue?

Yes it does avoid the seemingly useless timeout. So now remains the settle issue.
Comment 6 Lennart Poettering 2015-04-24 14:19:03 UTC
I have set KillMode=mixed in the unit file now in git.
Comment 7 Kay Sievers 2015-04-24 14:38:00 UTC
(In reply to Tom Yan from comment #3)
> It doesn't seem to work in the way you said. For example in this script:
> https://projects.archlinux.org/svntogit/packages.git/tree/trunk/initcpio-
> hook-udev?h=packages/systemd
> I assume that `udevadm settle` should make sure that events triggered by the
> two commands before it should be finished? But from what I am experiencing
> it does not.

A quick test seems to work fine here.

# cat /etc/udev/rules.d/10-local.rules
KERNEL=="null", RUN+="/usr/bin/sleep 3"

# echo $(date +%s) trigger; udevadm trigger; \
  echo $(date +%s) settling; udevadm settle; \
  echo $(date +%s) exiting; udevadm control --exit; \
  echo $(date +%s) done
1429886202 trigger
1429886202 settling
1429886205 exiting
1429886205 done
Comment 8 Tom Yan 2015-04-24 15:49:16 UTC
(In reply to Kay Sievers from comment #7)
> (In reply to Tom Yan from comment #3)
> > It doesn't seem to work in the way you said. For example in this script:
> > https://projects.archlinux.org/svntogit/packages.git/tree/trunk/initcpio-
> > hook-udev?h=packages/systemd
> > I assume that `udevadm settle` should make sure that events triggered by the
> > two commands before it should be finished? But from what I am experiencing
> > it does not.
> 
> A quick test seems to work fine here.
> 
> # cat /etc/udev/rules.d/10-local.rules
> KERNEL=="null", RUN+="/usr/bin/sleep 3"
> 
> # echo $(date +%s) trigger; udevadm trigger; \
>   echo $(date +%s) settling; udevadm settle; \
>   echo $(date +%s) exiting; udevadm control --exit; \
>   echo $(date +%s) done
> 1429886202 trigger
> 1429886202 settling
> 1429886205 exiting
> 1429886205 done

I am aware of that. This problem can't be easily reproduced in a script because the time needed to make a difference could varies depending on the job and could be really short.

Not sure how to give you very nice proof. But maybe I can upload some pics to let you see the difference.
Comment 9 Tom Yan 2015-04-24 16:18:22 UTC
Created attachment 115307 [details]
modified "hook" script for reproducing issue

So i add `lsmod` and `lsmod | wc -l` after the "early" udevd is exited to see the (number of) modules get loaded, and by toggling `sleep 1` you can see they change despite the `udevadm settle` command.

Four screenshots from two test cases will follow. The two cases aim to show that it's not a module-specific problem. The difference of the two cases is the port being used by my two usb hid receivers: In case 1 Receiver A use Port 13 and Receiver B use Port 14; In case 2 Receiver A use port 14 and Receiver B use Port 13.
Comment 10 Tom Yan 2015-04-24 16:19:30 UTC
Created attachment 115308 [details]
Case 1 -- without `sleep 1`
Comment 11 Tom Yan 2015-04-24 16:19:55 UTC
Created attachment 115309 [details]
Case 1 -- with `sleep 1`
Comment 12 Tom Yan 2015-04-24 16:20:20 UTC
Created attachment 115310 [details]
Case 2 -- without `sleep 1`
Comment 13 Tom Yan 2015-04-24 16:21:14 UTC
Created attachment 115311 [details]
Case 2 -- with `sleep 1`

same as "Case 1 -- `with sleep 1`", except the order
Comment 14 Tom Yan 2015-04-24 16:38:27 UTC
Created attachment 115312 [details]
control hook script

I tried to put `sleep 1` after `udevadm control --exit` as well. The results are the same as in the "case*_without_sleep" screenshots. Kind of confirm the time is needed for udevd but not something else.

Also tried to have only one receiver plugged into Port 14. The issue would disappear, probably because the time needed is reduced. Hence not a hardware problem (except for being a bit slow?).
Comment 15 Tom Yan 2015-04-24 16:46:08 UTC
(In reply to Tom Yan from comment #14)
> Also tried to have only one receiver plugged into Port 14. The issue would
> disappear, probably because the time needed is reduced. Hence not a hardware
> problem (except for being a bit slow?).

Sorry I should have explained it a bit.

The reason I test Port 14 alone is because in the previous two test cases, the driver module(s) of receiver plugged into Port 14 are not loaded, which are:

Case 1 -- without `sleep 1`: hid_generic
Case 2 -- without `sleep 1`: hid_logitech_hidpp, hid_logitech_dj

so I did extra tests by plugging in receiver into Port 14 only each time to confirm the port itself is fine
Comment 16 Tom Gundersen 2015-04-24 18:29:51 UTC
Thanks for all the info.

I'm still puzzled by the 30 second timeout. It appears we are losing track of worker processes, but I can't see how, and I can't reproduce it. Could you try to reproduce again (without Lennart's fix) using current git HEAD (you can just do "./autogen.sh c && make systemd-udevd"). I added quite a bit of extra logging now.

As to the settle issue, I'm not sure what's going on. There was a fix in this area not too long ago, so could you also confirm that this is still a problem with current git?
Comment 17 Tom Yan 2015-04-24 20:22:18 UTC
Created attachment 115314 [details]
5 trials of reproducing timeout issue
Comment 18 Tom Yan 2015-04-24 20:26:31 UTC
(In reply to Tom Gundersen from comment #16)
> As to the settle issue, I'm not sure what's going on. There was a fix in
> this area not too long ago, so could you also confirm that this is still a
> problem with current git?

Yes it is still a problem as of commit b66f29a.
Comment 19 Tom Gundersen 2015-04-25 16:43:34 UTC
Thanks for checking. The output of systemctl is truncated, so could you possibly instead attach the relevant output of "journalctl -u systemd-udevd"?

Also, could you check if this patch fixes the 30 second timeout: http://paste.fedoraproject.org/215475/99796611/ ?
Comment 20 Tom Yan 2015-04-25 20:51:18 UTC
Created attachment 115329 [details]
journalctl -b -u systemd-udevd (after patch)

journalctl output looks the same to me, so I assume you mean udev_log="debug"?
Comment 21 Tom Yan 2015-04-25 20:53:05 UTC
Created attachment 115330 [details]
journalctl -b -u systemd-udevd (before patch)

The patch doesn't help. Another 5 trials before the patch.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.