Bug 1285079 - orphaned keepalived processes remain in old neutron netns
Summary: orphaned keepalived processes remain in old neutron netns
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: y2
Target Release: 7.0 (Kilo)
Assignee: Jiri Stransky
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-11-24 20:38 UTC by James Slagle
Modified: 2015-12-21 16:52 UTC
CC List: 5 users

Fixed In Version: openstack-tripleo-heat-templates-0.8.6-84.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, orphaned OpenStack Networking L3 agent keepalived processes were left running by OpenStack Networking's netns-cleanup script. As a result, the OpenStack Networking tenant router failover did not work during the controller node update in the overcloud. With this update, keepalived processes are cleaned up properly during the controller node update. As a result, OpenStack Networking tenant router failover works normally and the high availability of the tenant network is preserved.
Clone Of:
Environment:
Last Closed: 2015-12-21 16:52:55 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 249171 0 None MERGED Update: clean keepalived and radvd instances after pcs cluster stop 2021-01-04 15:47:47 UTC
Red Hat Product Errata RHSA-2015:2650 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise Linux OpenStack Platform 7 director update 2015-12-21 21:44:54 UTC

Description James Slagle 2015-11-24 20:38:22 UTC
A few versions ago we fixed an issue with netns-cleanup in neutron: it didn't
clean up keepalived processes spawned by neutron, and we started doing it
right.

This didn't work when updating a node like this:

pcs cluster stop

update packages

<update other stuff..>

pcs cluster start

Because "pcs cluster stop" triggers neutron-netns-cleanup stop, some orphaned
keepalived services are left behind. When the cluster starts again,
neutron-netns-cleanup start does nothing but clean empty namespaces (it is not
run with --force), which means the orphaned keepalived processes in the old
netns remain. neutron-l3-agent recognizes them, so those are not restarted in
the new namespace...
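
For reference, the step that would remove those namespaces is
neutron-netns-cleanup run with --force, which (with the fix mentioned above)
also kills processes such as keepalived left inside them. A minimal manual
invocation could look like this (the config-file paths are assumptions about a
typical controller, not taken from this deployment):

  neutron-netns-cleanup --force \
      --config-file /etc/neutron/neutron.conf \
      --config-file /etc/neutron/l3_agent.ini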

When you have upgraded 2 nodes and go for the last one, there are no extra
(functional) keepalived instances running to take over the floating IPs, and
that translates into the IP outage.


As a workaround, we can run:

  kill $(ps ax | grep -e "keepalived.*\.pid-vrrp" | awk '{print $1}') 2>/dev/null || :
  kill $(ps ax | grep -e "radvd.*\.pid\.radvd" | awk '{print $1}') 2>/dev/null || :

after pcs cluster stop, to make sure the cleanups introduced in the upgraded
version are effective.
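
For context, the merged fix (gerrit 249171, "Update: clean keepalived and radvd
instances after pcs cluster stop") takes the same approach during the overcloud
package update. A rough sketch of the intended ordering (the yum command is an
assumption standing in for "update packages"; see the review for the exact
code):

  pcs cluster stop
  # run the two kill commands from the workaround above, so that no stale
  # keepalived/radvd from the pre-update l3 agent survive the stop
  yum -y update
  pcs cluster start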

Comment 1 James Slagle 2015-11-24 20:41:01 UTC
Related to the fix for https://review.gerrithub.io/#/c/248931 and https://bugzilla.redhat.com/show_bug.cgi?id=1175251, which was done in a way that the fix was not applied on updates. This left the orphaned keepalived processes around, causing the l3 agent to never fail over.

Comment 2 James Slagle 2015-11-24 20:41:45 UTC
The impact here is that during an update, you lose floating IP connectivity while the currently active l3 agent node is updated.

Comment 4 Ofer Blaut 2015-12-03 12:19:42 UTC
Tested

openstack-tripleo-heat-templates-0.8.6-85.el7ost.noarch

[root@overcloud-controller-0 ~]# ps aux | grep keep
neutron   4675  0.0  0.5 336080 42408 ?        S    Dec02   0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=936289f5-87b7-40e3-a8a4-caf23c241c40 --namespace=qrouter-936289f5-87b7-40e3-a8a4-caf23c241c40 --conf_dir=/var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40 --monitor_interface=ha-75e3e2ef-82 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/936289f5-87b7-40e3-a8a4-caf23c241c40.monitor.pid --state_path=/var/lib/neutron --user=995 --group=992
root      4689  0.0  0.0 111640  1324 ?        Ss   Dec02   0:02 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root      4690  0.0  0.0 113760  2264 ?        S    Dec02   0:25 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root     29515  0.0  0.0 112648   920 pts/0    S+   06:52   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# pcs cluster stop
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
[root@overcloud-controller-0 ~]# ps aux | grep keep
root     32348  0.0  0.0 112644   920 pts/0    S+   06:53   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# ps status
error: TTY could not be found

Usage:
 ps [options]

 Try 'ps --help <simple|list|output|threads|misc|all>'
  or 'ps --help <s|l|o|t|m|a>'
 for additional help text.

For more details see ps(1).
[root@overcloud-controller-0 ~]# pcs status
Error: cluster is not currently running on this node
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# pcs cluster start
Starting Cluster...
[root@overcloud-controller-0 ~]# ip netns
[root@overcloud-controller-0 ~]# ip netns
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
Broadcast message from systemd-journald (Thu 2015-12-03 06:55:01 EST):

haproxy[1214]: proxy manila has no server available!

[root@overcloud-controller-0 ~]# ps aux | grep keep
neutron   4960  0.0  0.5 335736 42060 ?        S    06:55   0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=936289f5-87b7-40e3-a8a4-caf23c241c40 --namespace=qrouter-936289f5-87b7-40e3-a8a4-caf23c241c40 --conf_dir=/var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40 --monitor_interface=ha-75e3e2ef-82 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/936289f5-87b7-40e3-a8a4-caf23c241c40.monitor.pid --state_path=/var/lib/neutron --user=995 --group=992
root      5302  0.0  0.0 111640  1332 ?        Ss   06:55   0:00 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root      5304  0.0  0.0 113760  2128 ?        S    06:55   0:00 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root     15704  0.0  0.0 112648   924 pts/0    S+   07:18   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]#

Comment 6 errata-xmlrpc 2015-12-21 16:52:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650

