A few versions ago we fixed an issue with netns-cleanup in neutron: it did not clean up the keepalived processes spawned by neutron, and we started doing that correctly. However, the fix does not cover upgrading a node like this:

  pcs cluster stop
  update packages
  <update other stuff..>
  pcs cluster start

During "pcs cluster stop", neutron-netns-cleanup stop is triggered and leaves some orphaned keepalived processes behind. When the cluster starts again, neutron-netns-cleanup start does nothing but clean empty namespaces (it is not run with --force), so the orphaned keepalived processes from the old namespaces remain. neutron-l3-agent recognizes them and therefore does not restart them in the new namespaces. By the time two nodes have been upgraded and you move on to the last one, there are no extra (functional) keepalived instances left to take over the floating IPs, and that translates into the IP outage.

As a workaround, we can run the following after "pcs cluster stop" to make sure the cleanups introduced in the upgraded version are effective:

  kill $(ps ax | grep -e "keepalived.*\.pid-vrrp" | awk '{print $1}') 2>/dev/null || :
  kill $(ps ax | grep -e "radvd.*\.pid\.radvd" | awk '{print $1}') 2>/dev/null || :
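For reference, a minimal sketch of the per-node update sequence with the workaround folded in; the "yum update" line is only a placeholder for however packages are actually updated in your environment:

  #!/bin/bash
  # Per-node update flow with the orphan cleanup inserted after the cluster stop.
  set -e

  pcs cluster stop

  # Workaround: remove keepalived/radvd processes orphaned by the stop so that
  # the l3 agent respawns them in the new namespaces once the cluster is back.
  kill $(ps ax | grep -e "keepalived.*\.pid-vrrp" | awk '{print $1}') 2>/dev/null || :
  kill $(ps ax | grep -e "radvd.*\.pid\.radvd" | awk '{print $1}') 2>/dev/null || :

  yum update -y   # placeholder: update packages
  # <update other stuff..>

  pcs cluster start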
Related to the fix for https://review.gerrithub.io/#/c/248931 and https://bugzilla.redhat.com/show_bug.cgi?id=1175251, which was implemented in a way that did not take effect on updates: it left the orphaned keepalived processes around, so the l3 agent never failed over.
The impact is that during an update you lose floating IP connectivity while the currently active l3 agent node is being updated.
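One simple way to observe the outage window is to leave a timestamped ping running against one of the floating IPs while the active l3 agent node is being updated (192.0.2.10 below is just a placeholder address):

  # 192.0.2.10 is a placeholder; use one of your own floating IPs.
  # -D prefixes each reply with a timestamp, so the length of any gap
  # during the update is easy to read off.
  ping -D 192.0.2.10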
Tested with openstack-tripleo-heat-templates-0.8.6-85.el7ost.noarch:

[root@overcloud-controller-0 ~]# ps aux | grep keep
neutron 4675 0.0 0.5 336080 42408 ? S Dec02 0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=936289f5-87b7-40e3-a8a4-caf23c241c40 --namespace=qrouter-936289f5-87b7-40e3-a8a4-caf23c241c40 --conf_dir=/var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40 --monitor_interface=ha-75e3e2ef-82 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/936289f5-87b7-40e3-a8a4-caf23c241c40.monitor.pid --state_path=/var/lib/neutron --user=995 --group=992
root 4689 0.0 0.0 111640 1324 ? Ss Dec02 0:02 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root 4690 0.0 0.0 113760 2264 ? S Dec02 0:25 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root 29515 0.0 0.0 112648 920 pts/0 S+ 06:52 0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]#
[root@overcloud-controller-0 ~]# pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
[root@overcloud-controller-0 ~]# ps aux | grep keep
root 32348 0.0 0.0 112644 920 pts/0 S+ 06:53 0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]#
[root@overcloud-controller-0 ~]# ps status
error: TTY could not be found

Usage:
 ps [options]

 Try 'ps --help <simple|list|output|threads|misc|all>'
  or 'ps --help <s|l|o|t|m|a>'
 for additional help text.

For more details see ps(1).
[root@overcloud-controller-0 ~]# pcs status
Error: cluster is not currently running on this node
[root@overcloud-controller-0 ~]#
[root@overcloud-controller-0 ~]# pcs cluster start
Starting Cluster...
[root@overcloud-controller-0 ~]# ip netns
[root@overcloud-controller-0 ~]# ip netns
[root@overcloud-controller-0 ~]#

Broadcast message from systemd-journald (Thu 2015-12-03 06:55:01 EST):

haproxy[1214]: proxy manila has no server available!

[root@overcloud-controller-0 ~]# ps aux | grep keep
neutron 4960 0.0 0.5 335736 42060 ? S 06:55 0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=936289f5-87b7-40e3-a8a4-caf23c241c40 --namespace=qrouter-936289f5-87b7-40e3-a8a4-caf23c241c40 --conf_dir=/var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40 --monitor_interface=ha-75e3e2ef-82 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/936289f5-87b7-40e3-a8a4-caf23c241c40.monitor.pid --state_path=/var/lib/neutron --user=995 --group=992
root 5302 0.0 0.0 111640 1332 ? Ss 06:55 0:00 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root 5304 0.0 0.0 113760 2128 ? S 06:55 0:00 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root 15704 0.0 0.0 112648 924 pts/0 S+ 07:18 0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]#
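If you want a quicker check than eyeballing the ps output above, a count along these lines (reusing the same patterns as the workaround) should confirm right after "pcs cluster stop" whether anything was left behind:

  # Run right after "pcs cluster stop"; a non-zero count means orphaned
  # keepalived/radvd processes survived and still need to be killed.
  orphans=$(ps ax | grep -e "keepalived.*\.pid-vrrp" -e "radvd.*\.pid\.radvd" | grep -v grep | wc -l)
  echo "orphaned keepalived/radvd processes: $orphans"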
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2650