rhosp-director: restarting the network service on the undercloud causes the VIPs to disappear.

Environment:
openstack-tripleo-heat-templates-7.0.1-0.20170925173114.el7ost.1.noarch
puppet-keepalived-0.0.2-0.20170823225813.bbca37a.el7ost.noarch
keepalived-1.3.5-1.el7.x86_64
instack-undercloud-7.4.1-0.20170925172804.el7ost.noarch

Steps to reproduce:

(undercloud) [stack@undercloud-0 ~]$ sudo ip a show dev br-ctlplane
7: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:09:41:3d brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 192.168.24.3/32 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 192.168.24.2/32 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe09:413d/64 scope link
       valid_lft forever preferred_lft forever

(undercloud) [stack@undercloud-0 ~]$ sudo service network restart
Restarting network (via systemctl):                        [  OK  ]

(undercloud) [stack@undercloud-0 ~]$ sudo ip a show dev br-ctlplane
10: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:09:41:3d brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe09:413d/64 scope link
       valid_lft forever preferred_lft forever

Note: restarting keepalived restores the VIPs.

(undercloud) [stack@undercloud-0 ~]$ sudo systemctl restart keepalived
(undercloud) [stack@undercloud-0 ~]$ sudo ip a show dev br-ctlplane
10: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:09:41:3d brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 192.168.24.3/32 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet 192.168.24.2/32 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe09:413d/64 scope link
       valid_lft forever preferred_lft forever
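The workaround from the report (restart keepalived when the /32 addresses are gone) could be scripted roughly as below. This is only a sketch: the function names `missing_vips` and `restore_vips` are hypothetical, and the interface and addresses are the ones shown in the output above, which will differ on other underclouds.

```shell
#!/usr/bin/env bash
# Sketch of the workaround from the report: restart keepalived when VIPs are missing.
# Function names are hypothetical; addresses and interface come from the report.

# Read `ip -o -4 addr show` output on stdin and print every expected
# address (given as arguments, in CIDR form) that does not appear in it.
missing_vips() {
    local output vip
    output=$(cat)
    for vip in "$@"; do
        # ip prints each address as "inet <addr>/<prefix> ...".
        if ! grep -qF -- "inet ${vip} " <<<"$output"; then
            printf '%s\n' "$vip"
        fi
    done
}

# Restart keepalived only if at least one expected VIP is missing on the device.
restore_vips() {
    local dev=$1
    shift
    local missing
    missing=$(ip -o -4 addr show dev "$dev" | missing_vips "$@")
    if [ -n "$missing" ]; then
        echo "missing on ${dev}: ${missing//$'\n'/ }" >&2
        sudo systemctl restart keepalived
    fi
}

# Example call (not executed here):
#   restore_vips br-ctlplane 192.168.24.2/32 192.168.24.3/32
```

This only papers over the symptom after a network restart; it does not prevent the addresses from being dropped in the first place.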
Recent changes to os-net-config now add this to the OVS_EXTRA setting in the ifcfg-br-ctlplane file:

set bridge br-ctlplane fail_mode=standalone -- del-controller br-ctlplane

These commands were added to address bugs caused by changes in Neutron ML2/OVS. Unfortunately, they also run whenever the network service is restarted. This causes the bridge to be deleted and recreated, and keepalived loses its IP addresses. In theory, the commands only need to run on reboot after an upgrade.

For a long-term fix, perhaps instead of writing these commands to every OVS bridge ifcfg file we should create a startup script that runs them before keepalived starts, or even runs them only once on reboot after an upgrade.
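A startup script along the lines proposed above might look like the following sketch. It is not an existing os-net-config or TripleO script: the function names, the `OVS_VSCTL` override (used here only to allow a dry run), and the stamp-file approach to "run once" are all assumptions. Ordering it before keepalived would additionally need something like a systemd unit with `Before=keepalived.service`, which is not shown.

```shell
#!/usr/bin/env bash
# Sketch of the proposed startup script. Hypothetical names throughout;
# only the ovs-vsctl commands themselves are taken from the comment above.

# Apply the two OVS settings to a single bridge.
# OVS_VSCTL can be overridden (e.g. OVS_VSCTL=echo) for a dry run.
apply_bridge_fixup() {
    local br=$1
    "${OVS_VSCTL:-ovs-vsctl}" set bridge "$br" fail_mode=standalone -- del-controller "$br"
}

# Apply the settings to every bridge known to Open vSwitch.
apply_all_bridge_fixups() {
    local br
    while IFS= read -r br; do
        if [ -n "$br" ]; then
            apply_bridge_fixup "$br"
        fi
    done < <("${OVS_VSCTL:-ovs-vsctl}" list-br)
}

# Run a command only if the stamp file does not exist yet,
# and create the stamp file when the command succeeds.
run_once() {
    local stamp=$1
    shift
    if [ -e "$stamp" ]; then
        return 0
    fi
    "$@" && : > "$stamp"
}

# Example call (not executed here; the stamp path is an assumption):
#   run_once /var/lib/ovs-bridge-fixup.done apply_all_bridge_fixups
```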
(In reply to Dan Sneddon from comment #1)
> Recent changes to os-net-config now add this to the OVS_EXTRA setting in
> ifcfg-br-ctlplane file:
>
> set bridge br-ctlplane fail_mode=standalone -- del-controller br-ctlplane

I believe it is only the "del-controller br-ctlplane" that is causing this issue. The bridge should already be set up for fail_mode standalone, so that command will have no effect.
(In reply to Dan Sneddon from comment #2)
> (In reply to Dan Sneddon from comment #1)
> > Recent changes to os-net-config now add this to the OVS_EXTRA setting in
> > ifcfg-br-ctlplane file:
> >
> > set bridge br-ctlplane fail_mode=standalone -- del-controller br-ctlplane
>
> I believe it is only the "del-controller br-ctlplane" that is causing this
> issue. The bridge should already be set up for fail_mode standalone so that
> command will have no effect.

Jakub Libosvar responded to this:

"""
I don't understand how deleting the controller affects the VIP in the undercloud case. AFAIK calling ifdown on ovs bridge deletes the bridge from ovsdb and then VIP is not added back. Does the VIP disappearing happen even without the del-controller command?
"""

Can you please try removing the "-- del-controller br-ctlplane" from the ifcfg-br-ctlplane file on the Undercloud and try restarting networking? If the VIP still disappears, then we know that this is not a regression and this issue has always been present.

There isn't usually a reason to restart networking or ifdown/ifup the br-ctlplane interface once the undercloud is up and running.
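For anyone repeating this test, the clause can be removed with a small filter like the sketch below. The function name `strip_del_controller` is hypothetical. Since the ifcfg file is autogenerated by os-net-config, a manual edit like this should be assumed to be temporary.

```shell
#!/usr/bin/env bash
# Sketch: drop the "-- del-controller <bridge>" clause from the OVS_EXTRA line.
# Reads an ifcfg file on stdin and writes the modified file to stdout.
# The function name is hypothetical; all other lines pass through unchanged.
strip_del_controller() {
    local br=$1
    sed -E "/^OVS_EXTRA=/ s/ -- del-controller ${br}//"
}

# Example call (not executed here); review the output before replacing the file:
#   strip_del_controller br-ctlplane \
#       < /etc/sysconfig/network-scripts/ifcfg-br-ctlplane
```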
Removed. Here's what the ifcfg file looks like now:

[root@undercloud-0 ~]# cat /etc/sysconfig/network-scripts/ifcfg-br-ctlplane
# This file is autogenerated by os-net-config
DEVICE=br-ctlplane
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSBridge
MTU=1500
BOOTPROTO=static
IPADDR=192.168.24.1
NETMASK=255.255.255.0
OVS_EXTRA="set bridge br-ctlplane other-config:hwaddr=52:54:00:2d:a9:2b -- br-set-external-id br-ctlplane bridge-id br-ctlplane -- set bridge br-ctlplane fail_mode=standalone"

Restarted the network with "service network restart". The problem is still there:

[root@undercloud-0 ~]# ip a show dev br-ctlplane
12: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:2d:a9:2b brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe2d:a92b/64 scope link
       valid_lft forever preferred_lft forever
Since this isn't a regression, as shown in comment 4, we're closing this, as it's not expected that the network service would be restarted on the undercloud.
(In reply to Bob Fournier from comment #5)
> Since this isn't a regression, as shown in comment 4, we're closing this as
> its not expected that the network service would be restarted on the
> undercloud.

Hi Bob,

Can you please explain why it is not expected that the network will be restarted? The undercloud machine is still under the customer's control. Adding a different network configuration is a perfectly valid use case, and expecting the network never to be restarted is not a proper solution to the problem.

I have encountered a similar issue with ironic services and I believe this should be re-opened.
Sure, it's reasonable that this should be fixed. We've come across a more general case of the same problem, though: https://bugzilla.redhat.com/show_bug.cgi?id=1626357. As that has a proposed solution, I'm going to mark this as a duplicate of 1626357, unless there are objections.
(In reply to Bob Fournier from comment #8)
> Sure its reasonable that this should be fixed. We've come across a more
> general case though of the same problem -
> https://bugzilla.redhat.com/show_bug.cgi?id=1626357. As that has a proposed
> solution I'm going to mark this as a duplicate of 1626357, unless there are
> objections.

Fully agreed, it is the same issue and Harald also confirms that newer keepalived makes things better. Thanks for the feedback, Bob.

*** This bug has been marked as a duplicate of bug 1626357 ***
I am removing the duplicate flag and setting this to CLOSED - NEXTRELEASE. We should have keepalived v2.0.6 or later in the next release, and the new version of keepalived should fix the issue described in this bug.