Bug 82731

Summary: systemd-networkd does not renew DHCP lease when it expires
Product: systemd Reporter: gt6
Component: generalAssignee: Tom Gundersen <teg>
Status: RESOLVED WORKSFORME QA Contact: systemd-bugs
Severity: major    
Priority: medium CC: dirocco.nico, gt6
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: journal
tcpdump
tcpdump-br0

Description gt6 2014-08-17 14:11:10 UTC
I've sent an email with this problem to the systemd-devel mailing list weeks ago (http://lists.freedesktop.org/archives/systemd-devel/2014-July/021602.html) and never received any feedback, so I'm just going to assume that this is indeed a bug. I'll reproduce what I've said in that email.

I run an Archlinux home server connected to a netgear home router. I use 
a bridge because among other things, the server does KVM virtualization. 
My problem is that once the DHCP lease on br0 expires, it is not renewed.

Here's `ip a` before and after br0 loses the ip. You can see the 
"valid_lft 42sec" just before the IP is lost. The lease appears to be 
valid for 24 hours.

-------------before---------------
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
group default
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast 
master br0 state UP group default qlen 1000
     link/ether bc:5f:f4:90:23:85 brd ff:ff:ff:ff:ff:ff
     inet6 fe80::be5f:f4ff:fe90:2385/64 scope link
        valid_lft forever preferred_lft forever
3: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state 
UP group default
     link/ether b6:29:d8:20:20:11 brd ff:ff:ff:ff:ff:ff
     inet 192.168.1.33/24 brd 192.168.1.255 scope global dynamic br0
        valid_lft 42sec preferred_lft 42sec
     inet6 fe80::b429:d8ff:fe20:2011/64 scope link
        valid_lft forever preferred_lft forever
7: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast 
master br0 state UNKNOWN group default qlen 500
     link/ether fe:54:00:55:09:68 brd ff:ff:ff:ff:ff:ff
     inet6 fe80::fc54:ff:fe55:968/64 scope link
        valid_lft forever preferred_lft forever
-------------after---------------
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
group default
     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
     inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
     inet6 ::1/128 scope host
        valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast 
master br0 state UP group default qlen 1000
     link/ether bc:5f:f4:90:23:85 brd ff:ff:ff:ff:ff:ff
     inet6 fe80::be5f:f4ff:fe90:2385/64 scope link
        valid_lft forever preferred_lft forever
3: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state 
UP group default
     link/ether b6:29:d8:20:20:11 brd ff:ff:ff:ff:ff:ff
     inet6 fe80::b429:d8ff:fe20:2011/64 scope link
        valid_lft forever preferred_lft forever
7: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast 
master br0 state UNKNOWN group default qlen 500
     link/ether fe:54:00:55:09:68 brd ff:ff:ff:ff:ff:ff
     inet6 fe80::fc54:ff:fe55:968/64 scope link
        valid_lft forever preferred_lft forever
----------------------------------

Below are the relevant configs.

/etc/systemd/network/bridge-data.netdev:
----------------------------------
[NetDev]
Name=br0
Kind=bridge
----------------------------------

/etc/systemd/network/bridge-data.network:
----------------------------------
[Match]
Name=br0

[Network]
DHCP=yes
----------------------------------

/etc/systemd/network/enterprise.network:
----------------------------------
[Match]
Name=en*

[Network]
Bridge=br0
----------------------------------

I've enabled DEBUG logging for systemd-networkd, but it's not giving me 
anything useful. Here's what I see at the moment the IP is lost (at 
17:44:49) with some context:

----------------------------------
Jul 23 17:39:02 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Next 
retransmission in 1min 48.764460s
Jul 23 17:40:51 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Sent 
SOLICIT
Jul 23 17:40:51 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Next 
retransmission in 1min 55.518924s
Jul 23 17:42:47 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Sent 
SOLICIT
Jul 23 17:42:47 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Next 
retransmission in 2min 4.769016s
Jul 23 17:44:49 enterprise systemd-networkd[25474]: br0             : 
removed address: 192.168.1.33/24
Jul 23 17:44:52 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Sent 
SOLICIT
Jul 23 17:44:52 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Next 
retransmission in 1min 51.959760s
Jul 23 17:46:44 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Sent 
SOLICIT
Jul 23 17:46:44 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Next 
retransmission in 2min 5.687352s
Jul 23 17:48:49 enterprise systemd-networkd[25474]: DHCPv6 CLIENT: Sent 
SOLICIT
----------------------------------

As you can see, it just says "br0: removed address:" and that's it. 
Every now and then (less than once a day) there will be something like this:

----------------------------------
Jul 25 10:11:25 enterprise systemd-networkd[19106]: DHCPv6 CLIENT: Sent 
SOLICIT
Jul 25 10:11:25 enterprise systemd-networkd[19106]: DHCPv6 CLIENT: Next 
retransmission in 1min 57.978432s
Jul 25 10:12:31 enterprise systemd-networkd[19106]: DHCP CLIENT 
(0x461b140f): REQUEST (rebinding)
Jul 25 10:12:31 enterprise systemd-networkd[19106]: DHCP CLIENT 
(0x461b140f): NAK
Jul 25 10:12:31 enterprise systemd-networkd[19106]: DHCP CLIENT 
(0x461b140f): STOPPED: No lease
Jul 25 10:12:31 enterprise systemd-networkd[19106]: br0             : IP 
address in use.
Jul 25 10:13:23 enterprise systemd-networkd[19106]: DHCPv6 CLIENT: Sent 
SOLICIT
Jul 25 10:13:23 enterprise systemd-networkd[19106]: DHCPv6 CLIENT: Next 
retransmission in 2min 1.184604s
----------------------------------

but it doesn't seem to have any negative effect and it's probably 
unrelated to the problem.

I also don't see anything in dmesg. Other machines on the network are 
unaffected so I doubt it has anything to do with the router.

It's a Realtek onboard RTL8111/8168/8411 adaptor and Linux uses the 
r8169 driver automatically. I'm aware that there's a dedicated r8168 
driver by realtek, but I'm using the same mainboard (same revision) with 
Arch in another machine (without a bridge) and I never had any problems. 
Hence, I doubt that the r8169 driver is the problem but I will try using 
it on occasion. (You can imagine that debugging this problem is a pain 
because I always have to wait 24h to see if a change had any effect...)

This is systemd 215 built on Jul. 8.

For now, I've enabled a cron job that checks if the lease is about to expire within the next 1h and restarts systemd-networkd if necessary, but that's just silly.
Comment 1 Tom Gundersen 2014-08-18 17:56:40 UTC
Sorry for net getting back to you sooner. The fact that you are getting a NAK from your server is what is causing the problem. networkd is asking to renew the lease, but the DHCP server is refusing, so the lease simply expires and the kernel drops the address.

A dirty work-around is to mark this as a CriticalConnection.

To fix the problem: May I ask for a package dump aronud the time you get the NAK? That way we can verify whether or not the problem is with the request networkd sends to the server, with the server itself or with the way networkd parses the reply.

Thanks!
Comment 2 gt6 2014-08-20 15:30:30 UTC
Created attachment 104988 [details]
journal
Comment 3 gt6 2014-08-20 15:30:48 UTC
Created attachment 104989 [details]
tcpdump
Comment 4 gt6 2014-08-20 15:31:19 UTC
Thanks for your feedback. I've set up some timers inside a tmux session so I could run tcpdump a few seconds before the lease runs out and then restart systemd-networkd a few seconds later. I've attached both the journal and the tcpdump log from around that time.

As can be seen in the journal at...
 ... 17:02:01 the connection went down 
 ... 17:02:18 the timed script restarted systemd-networkd to re-establish connection

The tcpdump doesn't show anything happening at 17:02:01, but one second later there's a dhcp6 solicit. The other stuff only happens once systemd-networkd was restarted. I stripped some ssh/nfs related stuff from the tcpdump but I'm 100% sure I didn't trim anything relevant.

I'm not sure if this helps. I can post any additional info you need. Thanks again.
Comment 5 gt6 2014-08-20 15:32:54 UTC
Wait, I should have done this for br0 and not enp3s0... Let me do that again and come back later. Sorry about that.
Comment 6 Tom Gundersen 2014-08-20 16:06:00 UTC
Ok, I'll check back later with the updated version. Just a note: what we are looking for is DHCP traffic, which sholud happen around the time you get the NAK (which is probably 50% or 75% of the way to the lease expiration).
Comment 7 gt6 2014-08-20 16:15:42 UTC
I see. In that case, I'll just let tcpdump run until tomorrow to make sure I catch it. Is it sufficient if I monitor port 67 and 68, i.e.

tcpdump -i br0 -n port 67 and port 68

?
Comment 8 Tom Gundersen 2014-08-20 16:19:07 UTC
That should do it.
Comment 9 gt6 2014-09-11 13:49:15 UTC
Created attachment 106133 [details]
tcpdump-br0

Sorry for the delay. I attached a lengthy dump, hoping that it contains the information you require. I wasn't able to find anything about a NAK in it, though. The log contains close to 24h of packets of the br0 interface. At the beginning of the log (at 16:12:59) I restarted systemd-networkd.
Comment 10 Lennart Poettering 2017-10-27 17:17:14 UTC
I am pretty sure this works correctly on current systemd. CLosing hence.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.