Bug 79576

Summary:	udevd flock() failure on disk partitions due to fsck holding WRITE FLOCK
Product:	systemd	Reporter:	jpsinthemix
Component:	general	Assignee:	systemd-bugs
Status:	RESOLVED FIXED	QA Contact:	systemd-bugs
Severity:	critical
Priority:	medium
Version:	unspecified
Hardware:	x86 (IA32)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:
Attachments:	uded.c:worker_new() flock() attempts debug handle preexisting flocks in udevd.c

Description jpsinthemix 2014-06-03 07:59:09 UTC

Created attachment 100341 [details]
uded.c:worker_new() flock() attempts debug

Hi,

On upgrading from systemd-212 to systemd-213, my development system once again (see BR#75512) no longer is able to reliably mount non-root disk partitions listed in /etc/fstab at boot time. I say 'reliably' here as it is able to do sometimes, but mosty not.

On this system, I'm using MBR (not GPT) disk partitions, and not using lvm.

My grub2 boot kernel line is:
  linux   /boot/vmlinuz-3.14.5 root=/dev/sda7 ro raid=noautodetect quiet

At boot, I get the messages like
systemd-fsck[62]: /sbin/fsck.xfs: XFS file system.
[ * ] (1 of 4) A start job is running for dev-sda10.device. (1min 26s / 1min 30s)

The root partition is fine, but no other sdaX partitions appear.

In appearence, this issue results in the same behavior as described in BR#75512, however, note that the kernel has been built with CONFIG_FHANDLE=y here.

I have traced this to systemd-213/src/udev/udevd.c:worker_new(), where among other things, the following code is new w.r.t sytemd-212,

 285                         /*
 286                          * Take a "read lock" on the device node; this establishes
 287                          * a concept of device "ownership" to serialize device
 288                          * access. External processes holding a "write lock" will
 289                          * cause udev to skip the event handling; in the case udev
 290                          * acquired the lock, the external process will block until
 291                          * udev has finished its event handling.
 292                          */
 293                         if (streq_ptr("block", udev_device_get_subsystem(dev))) {
 294                                 struct udev_device *d = dev;
 295
 296                                 if (streq_ptr("partition", udev_device_get_devtype(d)))
 297                                         d = udev_device_get_parent(d);
 298
 299                                 if (d) {
 300                                         fd_lock = open(udev_device_get_devnode(d), O_RDONLY|O_CLOEXEC|O_NOFOLLOW|O_NONBLOCK);
 301                                         if (fd_lock >= 0 && flock(fd_lock, LOCK_SH|LOCK_NB) < 0) {
 302                                                 log_debug("Unable to flock(%s), skipping event handling: %m", udev_device_get_de     vnode(d));
 303                                                 err = -EWOULDBLOCK;
 304                                                 goto skip;
 305                                         }
 306                                 }
 307                         }
 308

On reaching line 293 in worker_new(), for all incoming /dev/sda{,X} devices an attempt to flock()  device /dev/sda is made (lines 296-7 => the lock is always attempted on /dev/sda in these cases).

What is happenning is that an fsck process (see attachment) has already placed a WRITE FLOCK on  /dev/sda, and so udev processing is skipped. When the problem occurs, the result is that no udev processing is done for /dev/sda or any /dev/sdaX partitions.

I've attached a log of events with occur when the flock()'s fail (generated by simply adding some debug code between lines 302 and 303 (I simply print /proc/locks and from that get and print the pid/command(s) which hold FLOCKs). From this log it appears that fsck is holding a WRITE FLOCK before the first entry into worker_new(). Apparently, this WRITE FLOCK is held long enough so that
udevd skips all /dev/sda{.X} processing (and I speculate that fsck is then unable to complete its work as well). Looks like a file lock deadlock between fsck and udevd ?

As noted above, the problem doesn't always occur; it occurs much more frequently on a simple sparsely populated system (9/10 times), than on a full desktop system (1/10 times).

Also, if I add 
     systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M
to the grub boot commandline, it rarely occurs, and even the addition of debugging code dramatically alters the frequency of occurrence.

In passing also note that it appears that the file close of fd_lock is not done if the 'goto skip' path is taken in the udevd.c code above.

As a temporary work-around, I'm now using sytemd-213 without issue, with lines 285-308 (plus several other associated lines) removed (thes lines are, of course, absent in sytemd-212).

I do not know enough about systemd and the kernel to move much further on this, so any help would be greatly appreciated.

thanks much,
John

Comment 1 Kay Sievers 2014-06-03 09:15:06 UTC

(In reply to comment #0)
> udevd skips all /dev/sda{.X} processing (and I speculate that fsck is then
> unable to complete its work as well). Looks like a file lock deadlock
> between fsck and udevd ?

Udev's locking scheme is maybe incompatible with the fsck -l feature
that systemd-fsck requests. (systemd-)sck should stay away from
write-locking the entire disk device it has no business with; we need to
sort that out.

> In passing also note that it appears that the file close of fd_lock is not
> done if the 'goto skip' path is taken in the udevd.c code above.

Fixed.

Thanks,
Kay

Comment 2 jpsinthemix 2014-06-04 08:07:56 UTC

(In reply to comment #1)
> (In reply to comment #0)
> > udevd skips all /dev/sda{.X} processing (and I speculate that fsck is then
> > unable to complete its work as well). Looks like a file lock deadlock
> > between fsck and udevd ?
> 
> Udev's locking scheme is maybe incompatible with the fsck -l feature
> that systemd-fsck requests. (systemd-)sck should stay away from
> write-locking the entire disk device it has no business with; we need to
> sort that out.
> 
> > In passing also note that it appears that the file close of fd_lock is not
> > done if the 'goto skip' path is taken in the udevd.c code above.
> 
> Fixed.
> 
> Thanks,
> Kay


Unfortunately, the patch 

From e918a1b5a94f270186dca59156354acd2a596494 Mon Sep 17 00:00:00 2001
From: Kay Sievers <kay@vrfy.org>
Date: Tue, 03 Jun 2014 14:49:38 +0000
Subject: udev: exclude device-mapper from block device ownership event locking
 
will have no effect for systems which do not use lvm.. which includes mine.


Some thoughts/more info:

I know that simple/default units YYY, with After=XXX, start after XXX has started. Does this hold if XXX is a oneshot as well? I thought not, ie, I thought that for XXX oneshots, YYY would start only after XXX completes/exits, but now I'm not sure.

From additional debugging, for boots where the fscks fail it looks like the systemd-fsck-root.service starts, invoking fsck -l, which locks the whole disk, and then systemd-fsck@.service is activated for the various non-root partitions before systemd-fsck-root.service has completed, whereas, for boots where all fscks succeed, systemd-fsck-root.service appears to have completed before the various systemd-fsck@.services are started.

This is nasty, but to further test this idea, I inserted a crude wait loop in worker_run(), like so:

int got_lock = 0;
if (fd_lock >= 0) {
   int cc = 4;
   while (cc-- > 0 && (got_lock=flock(fd_lock, LOCK_SH|LOCK_NB)) < 0) {
      sleep(1);
      log_error("trying flock(%s; count: %d)", udev_device_get_devnode((struct udev_device *)dev), cc);
   }
}

if (fd_lock >= 0 && got_lock < 0) {
   log_debug("Unable to flock(%s), skipping event handling: %m",
      udev_device_get_devnode((struct udev_device *)dev));
   err = -EWOULDBLOCK;
   fd_lock = safe_close(fd_lock);
   goto skip;
}

and sure enough boots are fine everytime, with every 1-2/3 entering the wait loop and counting down to 3 or 2 before the lock is gotten. Note that the wait loop is entered only for udev_device_get_devtype((struct udev_device *)dev))  ="sda".

By the way, if in worker_run() you place the lock on the udev_device itself (not alyways on the parent if its a partition), then I have yet to see the collision, that is, fsck has a lock on /dev/sda only, not on any of the partitions, so the worker_run() flock always succeeds on partitions. 

Unfortunately, debugging like this is frustrating because this is one of those annoying Heisenbugs ..


In udevd.c, when walking the event queue, are events ever re-tried, and if not does it make sense to do so, eg, for flock() -EWOULDBLOCK cases ?

Comment 3 jpsinthemix 2014-06-04 08:26:10 UTC

Correction:

The "counting down to 3 or 2 before the lock is gotten" is incorrect; the counts were: one 3 and 19 4's in 20 re-boots.

Comment 4 Kay Sievers 2014-06-04 08:37:40 UTC

(In reply to comment #2)
 Date: Tue, 03 Jun 2014 14:49:38 +0000
> Subject: udev: exclude device-mapper from block device ownership event
> locking
>  
> will have no effect for systems which do not use lvm.. which includes mine.

It was not intended to fix this.

> From additional debugging, for boots where the fscks fail it looks like the
> systemd-fsck-root.service starts, invoking fsck -l, which locks the whole
> disk, and then systemd-fsck@.service is activated for the various non-root
> partitions before systemd-fsck-root.service has completed, whereas, for
> boots where all fscks succeed, systemd-fsck-root.service appears to have
> completed before the various systemd-fsck@.services are started.

fsck from util-linux will be changed to not take write locks on the disk
device.

We need to disable udev's locking logic, or stop using '-l' until fsck is
changed.

Comment 5 Karel Zak 2014-06-04 08:55:02 UTC

I'm going to improve fsck -l to use a private file (/run/fsck/<disk>.lock) rather than the disk device. The change will be available in util-linux v2.25 later this month (and I guess the patch will be easy to backport).

Comment 6 Kay Sievers 2014-06-04 09:17:15 UTC

http://cgit.freedesktop.org/systemd/systemd/commit/?id=c343be283b7152554bac0c02493a4e1759c163f7

Comment 7 jpsinthemix 2014-06-04 09:35:12 UTC

Just a thought: If util-linux's fsck doesn't need to lock the whole disk (or a partition for that matter) then, of  course, it shouldn't; likewise, for udev. It definitely makes sense to me that udev should lock devices when processing (at least some) rules related to them.  However, simply modifying fsck really won't fix the fundamental problem here, which is that udev should be able work around an existing lock (within reason of course) to carry out (rather than skip) rule processing, perhaps by waiting, or postponing and trying later.. I'm looking at udevd.c to see if there's a reasonable way of doing something along these lines, but it kinda rough going..

Comment 8 Kay Sievers 2014-06-04 18:59:53 UTC

The fsck issue is hopefully fixed for the next release.

Comment 9 jpsinthemix 2014-06-06 06:08:34 UTC

Created attachment 100504 [details]
handle preexisting flocks in udevd.c

I've been using the attached patch to accommodate preexisting flocks, and all looks well on several laptop/desktops I've installed it on; perhaps it may be of some use..

thanks again for your time and work

Comment 10 Karel Zak 2014-06-09 10:06:25 UTC

fsck fixed by util-linux commit 
https://github.com/karelzak/util-linux/commit/3bbdae633f4a1dda5f95ee6c61f18a1c8ef12250

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.