Bug 1285079

Summary: orphaned keepalived processes remain in old neutron netns
Product: Red Hat OpenStack
Reporter: James Slagle <jslagle>
Component: openstack-tripleo-heat-templates
Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA
QA Contact: Ofer Blaut <oblaut>
Severity: unspecified
Priority: unspecified
Version: 7.0 (Kilo)
CC: bperkins, dnavale, jcoufal, mburns, rhel-osp-director-maint
Target Milestone: y2
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-84.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, orphaned OpenStack Networking L3 agent keepalived processes were left running by OpenStack Networking's netns-cleanup script. As a result, the OpenStack Networking tenant router failover did not work during the controller node update in the overcloud. With this update, keepalived processes are cleaned up properly during the controller node update. As a result, OpenStack Networking tenant router failover works normally and the high availability of the tenant network is preserved.
Last Closed: 2015-12-21 16:52:55 UTC
Type: Bug

Description James Slagle 2015-11-24 20:38:22 UTC
We fixed an issue with netns-cleanup in neutron a few versions ago: it did
not clean up keepalived processes spawned by neutron, and we started doing
it right.

This did not work when updating a node with a sequence like:

pcs cluster stop

update packages

<update other stuff..>

pcs cluster start

This is because during "pcs cluster stop" a neutron-netns-cleanup stop is
triggered that leaves some orphaned keepalived processes behind. When the
cluster starts again, neutron-netns-cleanup start does nothing but clean
up empty namespaces (it is not a forced cleanup), so the orphaned
keepalived processes in the old namespaces keep running. neutron-l3-agent
recognizes them and therefore does not restart them in the new namespaces.
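
To make the failure mode easy to spot, the orphans can be listed right
after "pcs cluster stop" (plain ip/ps usage, shown only for illustration,
and assuming the l3 agent's default ha_confs path under /var/lib/neutron):

 # namespaces left behind by the non-forced cleanup
 ip netns | grep qrouter
 # keepalived processes whose config lives under the neutron HA confs dir;
 # anything still listed here after the cluster stop is an orphan
 ps ax | grep -e "keepalived.*ha_confs" | grep -v grep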

When you have upgraded two nodes and go for the last one, there are no
spare (functional) keepalived instances left to take over the floating
IPs, and that translates into a floating IP outage.


As a workaround, we can run:

 kill $(ps ax | grep -e "keepalived.*\.pid-vrrp" | awk '{print $1}') 2>/dev/null || :
 kill $(ps ax | grep -e "radvd.*\.pid\.radvd" | awk '{print $1}') 2>/dev/null || :

after pcs cluster stop, to make sure the cleanups introduced in the upgraded
version are effective.
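
For reference, the same cleanup can be expressed with pkill (assuming
procps pkill is available on the controllers; -f matches against the full
command line just like the ps/grep pipelines above), again run between
"pcs cluster stop" and "pcs cluster start":

 pkill -f "keepalived.*\.pid-vrrp" 2>/dev/null || :
 pkill -f "radvd.*\.pid\.radvd" 2>/dev/null || :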

Comment 1 James Slagle 2015-11-24 20:41:01 UTC
Related to the fix for https://review.gerrithub.io/#/c/248931 and https://bugzilla.redhat.com/show_bug.cgi?id=1175251 which was done in a way so that the fix was not applied on updates, and left the orphaned keepalived processes around, causing l3 agent to never failover.

Comment 2 James Slagle 2015-11-24 20:41:45 UTC
The impact here is that during an update, you lose floating IP connectivity while the currently active l3 agent node is updated.
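
If you want to see the outage window directly, ping a floating IP attached
to a tenant instance while the currently active l3 agent's controller is
being updated (the address below is just a placeholder):

 # 192.0.2.10 is a placeholder; substitute a real floating IP
 ping -D 192.0.2.10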

Comment 4 Ofer Blaut 2015-12-03 12:19:42 UTC
Tested

openstack-tripleo-heat-templates-0.8.6-85.el7ost.noarch

[root@overcloud-controller-0 ~]# ps aux | grep keep
neutron   4675  0.0  0.5 336080 42408 ?        S    Dec02   0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=936289f5-87b7-40e3-a8a4-caf23c241c40 --namespace=qrouter-936289f5-87b7-40e3-a8a4-caf23c241c40 --conf_dir=/var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40 --monitor_interface=ha-75e3e2ef-82 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/936289f5-87b7-40e3-a8a4-caf23c241c40.monitor.pid --state_path=/var/lib/neutron --user=995 --group=992
root      4689  0.0  0.0 111640  1324 ?        Ss   Dec02   0:02 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root      4690  0.0  0.0 113760  2264 ?        S    Dec02   0:25 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root     29515  0.0  0.0 112648   920 pts/0    S+   06:52   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# pcs cluster stop
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
[root@overcloud-controller-0 ~]# ps aux | grep keep
root     32348  0.0  0.0 112644   920 pts/0    S+   06:53   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# ps status
error: TTY could not be found

Usage:
 ps [options]

 Try 'ps --help <simple|list|output|threads|misc|all>'
  or 'ps --help <s|l|o|t|m|a>'
 for additional help text.

For more details see ps(1).
[root@overcloud-controller-0 ~]# pcs status
Error: cluster is not currently running on this node
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# pcs cluster start
Starting Cluster...
[root@overcloud-controller-0 ~]# ip netns
[root@overcloud-controller-0 ~]# ip netns
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
Broadcast message from systemd-journald (Thu 2015-12-03 06:55:01 EST):

haproxy[1214]: proxy manila has no server available!

[root@overcloud-controller-0 ~]# ps aux | grep keep
neutron   4960  0.0  0.5 335736 42060 ?        S    06:55   0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=936289f5-87b7-40e3-a8a4-caf23c241c40 --namespace=qrouter-936289f5-87b7-40e3-a8a4-caf23c241c40 --conf_dir=/var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40 --monitor_interface=ha-75e3e2ef-82 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/936289f5-87b7-40e3-a8a4-caf23c241c40.monitor.pid --state_path=/var/lib/neutron --user=995 --group=992
root      5302  0.0  0.0 111640  1332 ?        Ss   06:55   0:00 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root      5304  0.0  0.0 113760  2128 ?        S    06:55   0:00 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root     15704  0.0  0.0 112648   924 pts/0    S+   07:18   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]#

Comment 6 errata-xmlrpc 2015-12-21 16:52:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650