Bug 1285079 - orphaned keepalived processes remain in old neutron netns
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: y2
Target Release: 7.0 (Kilo)
Assigned To: Jiri Stransky
QA Contact: Ofer Blaut
Depends On:
Blocks:
Reported: 2015-11-24 15:38 EST by James Slagle
Modified: 2015-12-21 11:52 EST (History)

See Also:
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-84.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, orphaned OpenStack Networking L3 agent keepalived processes were left running by OpenStack Networking's netns-cleanup script. As a result, the OpenStack Networking tenant router failover did not work during the controller node update in the overcloud. With this update, keepalived processes are cleaned up properly during the controller node update. As a result, OpenStack Networking tenant router failover works normally and the high availability of the tenant network is preserved.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-21 11:52:55 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
OpenStack gerrit | 249171 | None | None | None | Never
Red Hat Product Errata | RHSA-2015:2650 | normal | SHIPPED_LIVE | Moderate: Red Hat Enterprise Linux OpenStack Platform 7 director update | 2015-12-21 16:44:54 EST

Description James Slagle 2015-11-24 15:38:22 EST
A few versions ago we fixed an issue in neutron's netns-cleanup: it did not
clean up keepalived processes spawned by neutron. Since that fix, it cleans
them up correctly.

The fix does not take effect when a node is upgraded like this:

pcs cluster stop

update packages

<update other stuff..>

pcs cluster start

During "pcs cluster stop", neutron-netns-cleanup stop is triggered and leaves
some orphaned keepalived processes behind. When the cluster starts again,
neutron-netns-cleanup start only cleans up empty namespaces (not --forced),
so the orphaned keepalived processes from the old netns remain.
neutron-l3-agent recognizes them as running, so they are not restarted in the
new namespace.

Once 2 nodes have been upgraded and you move on to the last one, there are no
functional keepalived instances left running to take over the floating IPs,
which results in an IP outage.


As a workaround, we can run:

  kill $(ps ax | grep -e "keepalived.*\.pid-vrrp" | awk '{print $1}') 2>/dev/null || :
  kill $(ps ax | grep -e "radvd.*\.pid\.radvd" | awk '{print $1}') 2>/dev/null || :

after pcs cluster stop, to make sure the cleanups introduced in the upgraded
version are effective.
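
The workaround can also be written as a small helper, which may be easier to reuse in an update script. This is only a sketch, not the fix that shipped: the function names orphan_pids and kill_orphans are made up for illustration, and it assumes a procps-style ps that accepts "-o pid=,args=". The two patterns are the ones from the workaround, with the first letter bracketed so that a pattern does not match the command line of the awk process that carries it.

```shell
# Sketch only: orphan_pids and kill_orphans are hypothetical helper names.

# Read "PID ARGS" lines on stdin and print the PID of every line whose
# command line matches one of the two patterns from the workaround.
# The bracketed first letter keeps a pattern from matching the command
# line of the awk process that carries it.
orphan_pids() {
    awk '/[k]eepalived.*\.pid-vrrp/ || /[r]advd.*\.pid\.radvd/ { print $1 }'
}

# Send SIGTERM to every matching process; do nothing if none match.
kill_orphans() {
    pids=$(ps ax -o pid=,args= | orphan_pids)
    [ -z "$pids" ] || kill $pids 2>/dev/null || :
}

# Intended use, following the description above: run on the node
# right after "pcs cluster stop" and before "pcs cluster start".
#   kill_orphans
```

Keeping the matching in a function that only reads stdin makes it possible to check the patterns against saved ps output before anything is killed.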
Comment 1 James Slagle 2015-11-24 15:41:01 EST
Related to the fix for https://review.gerrithub.io/#/c/248931 and https://bugzilla.redhat.com/show_bug.cgi?id=1175251. That fix was implemented in a way that it was not applied on updates, which left the orphaned keepalived processes around and caused the l3 agent to never fail over.
Comment 2 James Slagle 2015-11-24 15:41:45 EST
The impact here is that during an update, you lose floating IP connectivity while the node running the currently active l3 agent is updated.
Comment 4 Ofer Blaut 2015-12-03 07:19:42 EST
Tested

openstack-tripleo-heat-templates-0.8.6-85.el7ost.noarch

[root@overcloud-controller-0 ~]# ps aux | grep keep
neutron   4675  0.0  0.5 336080 42408 ?        S    Dec02   0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=936289f5-87b7-40e3-a8a4-caf23c241c40 --namespace=qrouter-936289f5-87b7-40e3-a8a4-caf23c241c40 --conf_dir=/var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40 --monitor_interface=ha-75e3e2ef-82 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/936289f5-87b7-40e3-a8a4-caf23c241c40.monitor.pid --state_path=/var/lib/neutron --user=995 --group=992
root      4689  0.0  0.0 111640  1324 ?        Ss   Dec02   0:02 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root      4690  0.0  0.0 113760  2264 ?        S    Dec02   0:25 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root     29515  0.0  0.0 112648   920 pts/0    S+   06:52   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# pcs cluster stop
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
[root@overcloud-controller-0 ~]# ps aux | grep keep
root     32348  0.0  0.0 112644   920 pts/0    S+   06:53   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# ps status
error: TTY could not be found

Usage:
 ps [options]

 Try 'ps --help <simple|list|output|threads|misc|all>'
  or 'ps --help <s|l|o|t|m|a>'
 for additional help text.

For more details see ps(1).
[root@overcloud-controller-0 ~]# pcs status
Error: cluster is not currently running on this node
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# pcs cluster start
Starting Cluster...
[root@overcloud-controller-0 ~]# ip netns
[root@overcloud-controller-0 ~]# ip netns
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
Broadcast message from systemd-journald@overcloud-controller-0.localdomain (Thu 2015-12-03 06:55:01 EST):

haproxy[1214]: proxy manila has no server available!

[root@overcloud-controller-0 ~]# ps aux | grep keep
neutron   4960  0.0  0.5 335736 42060 ?        S    06:55   0:00 /usr/bin/python2 /bin/neutron-keepalived-state-change --router_id=936289f5-87b7-40e3-a8a4-caf23c241c40 --namespace=qrouter-936289f5-87b7-40e3-a8a4-caf23c241c40 --conf_dir=/var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40 --monitor_interface=ha-75e3e2ef-82 --monitor_cidr=169.254.0.1/24 --pid_file=/var/lib/neutron/external/pids/936289f5-87b7-40e3-a8a4-caf23c241c40.monitor.pid --state_path=/var/lib/neutron --user=995 --group=992
root      5302  0.0  0.0 111640  1332 ?        Ss   06:55   0:00 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root      5304  0.0  0.0 113760  2128 ?        S    06:55   0:00 keepalived -P -f /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40/keepalived.conf -p /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid -r /var/lib/neutron/ha_confs/936289f5-87b7-40e3-a8a4-caf23c241c40.pid-vrrp
root     15704  0.0  0.0 112648   924 pts/0    S+   07:18   0:00 grep --color=auto keep
[root@overcloud-controller-0 ~]#
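
The check in the transcript above (keepalived PIDs 4689 and 4690 disappear after the stop, and new PIDs 5302 and 5304 appear after the start) could be scripted roughly as follows. This is a minimal sketch: no_common_pids is a hypothetical helper name, and comparing PID lists is only a heuristic, since the kernel can in principle reuse a PID.

```shell
# Hypothetical helper: succeed (exit 0) only when no PID from the first
# whitespace-separated list also appears in the second list.
no_common_pids() {
    for old in $1; do
        for new in $2; do
            [ "$old" != "$new" ] || return 1
        done
    done
    return 0
}

# Possible use on a controller node (assumes pgrep from procps):
#   before=$(pgrep -x keepalived)    # before "pcs cluster stop"
#   after=$(pgrep -x keepalived)     # after "pcs cluster start"
#   no_common_pids "$before" "$after" && echo "keepalived was restarted"
```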
Comment 6 errata-xmlrpc 2015-12-21 11:52:55 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650
