Bug 1168388

Summary: veth device goes down when ipv4 dhcp lease expires
Product: Red Hat Enterprise Linux 7 Reporter: Vladimir Benes <vbenes>
Component: NetworkManager Assignee: Beniamino Galvani <bgalvani>
Status: CLOSED ERRATA QA Contact: Desktop QE <desktop-qa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.1 CC: bgalvani, danw, dcbw, jklimes, lrintel, thaller, tpelka, vbenes
Target Milestone: rc   
Target Release: 7.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Activation of connections with static addresses no longer fails when the DHCP server does not respond.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-11-19 10:58:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
[PATCH] device: don't disconnect after DHCP failure when there are static IPs none

Description Vladimir Benes 2014-11-26 18:58:45 UTC
Description of problem:
I tried to verify bug 1139326, so I created two veth pairs with one bridge connecting them together. I started a dnsmasq DHCP server on one pair and asked for an address on the other pair.

This was successful, so I set the lease lifetime to 120 seconds and added a manual address as well.

My device then looked like this:
21: test2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1800 qdisc pfifo_fast state UP qlen 1000
    link/ether 02:d2:77:ae:1d:51 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.20/24 brd 192.168.100.255 scope global dynamic test2
       valid_lft 69sec preferred_lft 69sec
    inet 192.168.100.9/32 brd 192.168.100.9 scope global test2
       valid_lft forever preferred_lft forever
    inet6 fe80::d2:77ff:feae:1d51/64 scope link 
       valid_lft forever preferred_lft forever

But when the IPv4 lease lifetime expired, the whole veth device went down even though a static IP was set. I see this message:

[ 4598.943591] vethbr: port 2(test2p) entered disabled state
[ 4598.949054] IPv6: ADDRCONF(NETDEV_UP): test2: link is not ready
[root@qe-dell-ovs5-vm-57 NetworkManager]# ip a s test2
21: test2: <BROADCAST,MULTICAST> mtu 1800 qdisc pfifo_fast state DOWN qlen 1000
    link/ether 02:d2:77:ae:1d:51 brd ff:ff:ff:ff:ff:ff


Version-Release number of selected component (if applicable):
NetworkManager-0.9.11.0-11059.fd6da20f0a.el7.x86_64

How reproducible:
always

Steps to Reproduce:

ip link add test1 type veth peer name test1p
ip link add test2 type veth peer name test2p
brctl addbr vethbr
brctl addif vethbr test1p test2p
ip link set dev test1 up
ip link set dev test1p up
ip link set dev test2 up
ip link set dev test2p up
nmcli connection add type ethernet con-name tc1 ifname test1 ip4 192.168.100.1/24
nmcli connection add type ethernet con-name tc2 ifname test2
nmcli con up id tc2
/usr/sbin/dnsmasq --conf-file --no-hosts --keep-in-foreground --bind-interfaces --except-interface=lo --clear-on-reload --strict-order --listen-address=192.168.100.1 --dhcp-range=192.168.100.10,192.168.100.254,60m --dhcp-option=option:router,192.168.100.1 --dhcp-lease-max=50 --dhcp-option-force=26,1800 &
nmcli con up id tc2


Actual results:
After the lease expires, the veth device is brought down.

Expected results:
The veth device stays up and keeps its static IP.

Additional info:

Comment 1 Dan Williams 2014-12-03 18:59:04 UTC
At this point, I think that's expected.  If DHCP fails on a configured interface, NetworkManager will fail that interface even if there was a static IP assigned in addition to the DHCP.

I think we want to change that behavior (to not down the interface, and to also periodically retry DHCP), but that would be an enhancement.

One thought: if you set "ipv4.may-fail=no", does that work around the problem?

Comment 2 Vladimir Benes 2014-12-05 15:12:51 UTC
Actually, this worked in the older 0.9.9.1-29: instead of going down, the device was taken over by "Wired Connection X", so after these 5 minutes it came up again.

Now I can see:
Dec  5 10:05:25 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): Activation: Stage 4 of 5 (IPv6 Configure Timeout) scheduled...
Dec  5 10:05:25 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): Activation: Stage 4 of 5 (IPv6 Configure Timeout) started...
Dec  5 10:05:25 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): Activation: Stage 4 of 5 (IPv6 Configure Timeout) complete.
Dec  5 10:05:25 qe-dell-ovs5-vm-45 dhclient[15667]: DHCPREQUEST on test2 to 192.168.100.1 port 67 (xid=0x17ae6a2a)
Dec  5 10:05:33 qe-dell-ovs5-vm-45 dhclient[15667]: DHCPREQUEST on test2 to 192.168.100.1 port 67 (xid=0x17ae6a2a)
Dec  5 10:05:47 qe-dell-ovs5-vm-45 dhclient[15667]: DHCPREQUEST on test2 to 255.255.255.255 port 67 (xid=0x17ae6a2a)
Dec  5 10:05:54 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): DHCPv4 state changed bound -> fail
Dec  5 10:05:54 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): canceled DHCP transaction, DHCP client pid 15667
Dec  5 10:05:54 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): DHCPv4 state changed fail -> done
Dec  5 10:05:54 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): device state change: activated -> failed (reason 'ip-config-expired') [100 120 6]
Dec  5 10:05:54 qe-dell-ovs5-vm-45 NetworkManager[13884]: <warn>  (test2): Activation: failed for connection 'tc2'
Dec  5 10:05:54 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): device state change: failed -> disconnected (reason 'none') [120 30 0]
Dec  5 10:05:54 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): deactivating device (reason 'none') [0]
Dec  5 10:05:54 qe-dell-ovs5-vm-45 NetworkManager[13884]: <info>  (test2): device state change: disconnected -> unmanaged (reason 'none') [30 10 0]

This is in the logs, and the device goes to unmanaged.

This seems to be a regression from RHEL 7.0 behavior. Setting may-fail helped in older versions but does not help here.
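
The bracketed numbers at the end of each "device state change" line above can be read against the state names printed in the same line. The sketch below is only an illustration built from the values visible in the logs of this bug (comment 2 and comment 4); the function names are hypothetical, and the reading of the brackets as [old state, new state, reason] is inferred from those log lines rather than taken from NetworkManager's source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Names for the numeric device states that appear in the logs of this bug. */
const char *device_state_name(int state)
{
    switch (state) {
    case 10:  return "unmanaged";
    case 30:  return "disconnected";
    case 70:  return "ip-config";
    case 100: return "activated";
    case 120: return "failed";
    default:  return "(not shown in these logs)";
    }
}

/* Returns 1 when the state names in a log line agree with its bracketed
 * numbers, read as "[old new reason]" (an inference from the logs). */
int transition_matches(const char *old_name, const char *new_name,
                       const char *triplet)
{
    int old_state, new_state, reason;

    assert(old_name && new_name && triplet);

    if (sscanf(triplet, "[%d %d %d]", &old_state, &new_state, &reason) != 3)
        return 0;

    return strcmp(device_state_name(old_state), old_name) == 0
        && strcmp(device_state_name(new_state), new_name) == 0;
}
```

For example, "activated -> failed ... [100 120 6]" is consistent under this reading, with 6 being the number printed next to reason 'ip-config-expired'.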

Comment 3 Vladimir Benes 2014-12-05 15:19:34 UTC
In addition to the steps from comment #0:
ip link add test1 type veth peer name test1p
ip link add test2 type veth peer name test2p
brctl addbr vethbr
brctl addif vethbr test1p test2p
ip link set dev test1 up
ip link set dev test1p up
ip link set dev test2 up
ip link set dev test2p up
nmcli connection add type ethernet con-name tc1 ifname test1 ip4 192.168.100.1/24
nmcli connection add type ethernet con-name tc2 ifname test2
service dhcpd start (config from https://bugzilla.redhat.com/show_bug.cgi?id=1139326#c0)
nmcli con up id tc2
service dhcpd stop

When the lease is over, wait some more time (120 s) to let NM finish its two tries.

service dhcpd start

After ~5 minutes, tc2 should be brought up again with the gateway and IP all set.

This works in 0.9.9.1-29 but doesn't in 0.9.11.0-6.

Comment 4 Vladimir Benes 2014-12-05 15:28:40 UTC
And I can see this after the lease is over in the older version:
Dec  5 10:26:24 qe-dell-ovs5-vm-45 dhclient[16547]: DHCPDISCOVER on test2 to 255.255.255.255 port 67 interval 7 (xid=0x654d4b08)
Dec  5 10:26:24 qe-dell-ovs5-vm-45 NetworkManager: DHCPDISCOVER on test2 to 255.255.255.255 port 67 interval 7 (xid=0x654d4b08)
Dec  5 10:26:31 qe-dell-ovs5-vm-45 dhclient[16547]: DHCPDISCOVER on test2 to 255.255.255.255 port 67 interval 12 (xid=0x654d4b08)
Dec  5 10:26:31 qe-dell-ovs5-vm-45 NetworkManager: DHCPDISCOVER on test2 to 255.255.255.255 port 67 interval 12 (xid=0x654d4b08)
Dec  5 10:26:43 qe-dell-ovs5-vm-45 dhclient[16547]: DHCPDISCOVER on test2 to 255.255.255.255 port 67 interval 15 (xid=0x654d4b08)
Dec  5 10:26:43 qe-dell-ovs5-vm-45 NetworkManager: DHCPDISCOVER on test2 to 255.255.255.255 port 67 interval 15 (xid=0x654d4b08)
Dec  5 10:26:58 qe-dell-ovs5-vm-45 dhclient[16547]: DHCPDISCOVER on test2 to 255.255.255.255 port 67 interval 16 (xid=0x654d4b08)
Dec  5 10:26:58 qe-dell-ovs5-vm-45 NetworkManager: DHCPDISCOVER on test2 to 255.255.255.255 port 67 interval 16 (xid=0x654d4b08)
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <warn> (test2): DHCPv4 request timed out.
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <info> (test2): canceled DHCP transaction, DHCP client pid 16547
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <info> Activation (test2) Stage 4 of 5 (IPv4 Configure Timeout) scheduled...
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <info> Activation (test2) Stage 4 of 5 (IPv4 Configure Timeout) started...
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <info> (test2): device state change: ip-config -> failed (reason 'ip-config-unavailable') [70 120 5]
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <info> Disabling autoconnect for connection 'tc2'.
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <warn> Activation (test2) failed for connection 'tc2'
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <info> Activation (test2) Stage 4 of 5 (IPv4 Configure Timeout) complete.
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <info> (test2): device state change: failed -> disconnected (reason 'none') [120 30 0]
Dec  5 10:27:04 qe-dell-ovs5-vm-45 NetworkManager[15898]: <info> (test2): deactivating device (reason 'none') [0]
Dec  5 10:27:04 qe-dell-ovs5-vm-45 avahi-daemon[542]: Withdrawing address record for fe80::28d0:5eff:fee1:f752 on test2.

Comment 7 Beniamino Galvani 2015-07-18 08:53:13 UTC
Created attachment 1053329
[PATCH] device: don't disconnect after DHCP failure when there are static IPs

Don't disconnect the device when the DHCP renewal fails and there are
already configured static IP addresses on the device. Instead, keep
the device up and try DHCP again after some time.

This should solve the issue reported in the bug description. Tested for IPv4 only.

Comment 8 Dan Williams 2015-07-23 20:07:23 UTC
LGTM

Comment 9 Thomas Haller 2015-08-07 10:07:01 UTC
How about:

         && nm_ip4_config_get_num_addresses (priv->con_ip4_config) > 0) {
-         _LOGI (LOGD_DHCP4, "Scheduling DHCPv4 restart because device has IP addresses");
-         priv->dhcp4_restart_id = g_timeout_add_seconds (120, dhcp4_restart_cb, self);
+         if (!priv->dhcp4_restart_id) {
+              _LOGI (LOGD_DHCP4, "Scheduling DHCPv4 restart because device has IP addresses");
+              priv->dhcp4_restart_id = g_timeout_add_seconds (120, dhcp4_restart_cb, self);
+         }
          return;


and same for IPv6.



Also, what happens if the connection has ipvx.may-fail=yes? I think in that case we also should not tear down the connection -- but I don't see that that is happening...

Comment 10 Beniamino Galvani 2015-08-07 13:43:19 UTC
(In reply to Thomas Haller from comment #9)
> How about:
> 
>          && nm_ip4_config_get_num_addresses (priv->con_ip4_config) > 0) {
> -         _LOGI (LOGD_DHCP4, "Scheduling DHCPv4 restart because device has IP addresses");
> -         priv->dhcp4_restart_id = g_timeout_add_seconds (120, dhcp4_restart_cb, self);
> +         if (!priv->dhcp4_restart_id) {
> +              _LOGI (LOGD_DHCP4, "Scheduling DHCPv4 restart because device has IP addresses");
> +              priv->dhcp4_restart_id = g_timeout_add_seconds (120, dhcp4_restart_cb, self);
> +         }
>           return;
> 
> 
> and same for IPv6.

This isn't required as priv->dhcp4_restart_id is always cleared some lines
above in dhcpx_cleanup().

> Also, what happens if the connection has ipvx.may-fail=yes? I think in that
> case we also should not tear down the connection -- but I don't see that
> that is happening...

If there are no static addresses, dhcpx_fail() schedules
nm_device_activate_ipx_config_timeout() which in turn calls
act_stage4_ipx_config_timeout() to set the new device state according
to the 'may-fail' setting.

If there are static addresses configured and DHCP fails, the value of
ipvx.may-fail is not considered because at least the "static" method
succeeded.

Comment 11 Beniamino Galvani 2015-08-25 12:58:10 UTC
Upstream bug https://bugzilla.gnome.org/show_bug.cgi?id=741347
contains a rework of IP configuration failures and includes a more
general fix for this issue. Please review the branch posted there.

Comment 12 Beniamino Galvani 2015-08-26 22:15:02 UTC
Since the issue was blocking automated tests, I merged the attached
patch. The other improvements mentioned in comment 11 are not so
urgent and can be discussed separately in the upstream bug.

master:
abc96ec device: don't disconnect after DHCP failure when there are static IPs
905220b device: fix clearing of dhcp6_restart_id in dhcp6_cleanup()

nm-1-0:
80b3081 device: don't disconnect after DHCP failure when there are static IPs
eb1ccf9 device: fix clearing of dhcp6_restart_id in dhcp6_cleanup()

Comment 15 Vladimir Benes 2015-09-04 11:50:05 UTC
The veth device doesn't go down and a DHCP request is sent out every 5 minutes. Tested on all supported architectures.

Comment 16 errata-xmlrpc 2015-11-19 10:58:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2315.html