Bug 1383448 - Neutron L3 HA - stop of neutron-keepalived-state-change leaves stale "ip -o monitor address" processes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: async
Target Release: 9.0 (Mitaka)
Assignee: Daniel Alvarez Sanchez
QA Contact: GenadiC
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-10-10 16:02 UTC by Matt Flusche
Modified: 2019-12-16 07:03 UTC (History)

Fixed In Version: openstack-neutron-8.3.0-7.el7ost
Doc Type: Bug Fix
Doc Text:
When HA routers were deleted, the L3 agent used SIGKILL to kill the neutron-keepalived-state-change process, which orphaned its child process, 'ip -o monitor'. The leaked processes made memory consumption grow and could eventually trigger the OOM killer. This fix makes the L3 agent use SIGTERM so that the keepalived-state-change process terminates gracefully and kills its ip monitor during cleanup. This ensures that no ip monitors are leaked when HA routers are deleted.
Clone Of:
Environment:
Last Closed: 2017-06-19 14:47:53 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1632099 0 None None None 2016-10-10 20:39:09 UTC
OpenStack gerrit 383936 0 None ABANDONED Clean up router namespaces with L3 agent 2020-10-08 18:08:03 UTC
Red Hat Product Errata RHBA-2017:1503 0 normal SHIPPED_LIVE openstack-neutron bug fix advisory 2017-06-19 18:46:06 UTC

Description Matt Flusche 2016-10-10 16:02:01 UTC
Description of problem:
Upon exit, neutron-keepalived-state-change does not terminate its "ip -o monitor address" child process.

Version-Release number of selected component (if applicable):
openstack-neutron-8.1.2-4.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
There are several ways to reproduce this; here is one:
1. Deploy Overcloud with OSP Director
2. Deploy HA router
3. On the Controller/l3 node, observe the "ip -o monitor address" processes and their parent, neutron-keepalived-state-change:
  $  ps -ef |egrep 'ip -o monitor address|keepalived-state-change'
4. Place controller into standby to stop all l3 services.
  $ sudo pcs cluster standby <controller-node-name>
5. After the l3 services have been stopped and the netns cleanup has run, observe that the "ip -o monitor address" processes remain.
  $  ps -ef |egrep 'ip -o monitor address|keepalived-state-change'
6. Unstandby the controller node and observe new keepalived processes along with additional "ip -o monitor address" processes.
  $ sudo pcs cluster unstandby <controller-node-name>
7. Repeat steps 4-6 and observe the build-up of "ip -o monitor" processes.

Actual results:
"ip -o monitor" processes build up each time neutron-keepalived-state-change is stopped.

Expected results:
Cleanup of "ip -o monitor" when neutron-keepalived-state-change is terminated.

Additional info:
When neutron-keepalived-state-change is terminated, the orphaned "ip -o monitor address" processes are reparented to init, so their ppid changes to 1. The following can be used to clean out the old processes:

  # ps -ef |grep 'ip -o monitor' |awk '$3 == "1"{print $2}' |xargs kill
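For anyone who wants to inspect the leaked processes before killing them, the same filter can be sketched in Python. This is only an illustration: the function name orphaned_ip_monitors is made up here, and it assumes the input is the text produced by "ps -eo pid,ppid,args".

```python
def orphaned_ip_monitors(ps_output):
    """Return PIDs of 'ip -o monitor' processes whose parent is PID 1.

    ps_output is assumed to be the text printed by `ps -eo pid,ppid,args`.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, ppid, full command line
        if len(fields) < 3:
            continue
        pid, ppid, args = fields
        if not pid.isdigit():
            continue  # skip the header row
        if ppid == "1" and "ip -o monitor" in args:
            pids.append(int(pid))
    return pids
```

The result could be fed to os.kill() one PID at a time, which is what the xargs kill in the one-liner above does.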

Comment 1 Assaf Muller 2016-10-10 20:28:28 UTC
While trying to find the root cause of the issue I found something else: it seems that when we delete an HA router, we disable neutron-keepalived-state-change with signal 9, so the ip monitor that the keepalived-state-change process spawns is never cleaned up.
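To illustrate why the signal matters: SIGKILL cannot be caught, so the parent never gets a chance to clean up its child, while SIGTERM can be handled. The sketch below is not the neutron code; MonitorDaemon and its methods are hypothetical names used only to show the pattern of reaping a child from a SIGTERM handler.

```python
import signal
import subprocess
import sys


class MonitorDaemon:
    """Hypothetical stand-in for a daemon that owns one child process."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.child = None

    def start(self):
        # Spawn the long-running child (for example an "ip -o monitor").
        self.child = subprocess.Popen(self.cmd)

    def install_sigterm_handler(self):
        # Must be called from the main thread. SIGKILL cannot be handled,
        # which is why a kill -9 of the parent leaves the child behind.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.stop()
        sys.exit(0)

    def stop(self):
        # Terminate the child and reap it; escalate if it ignores SIGTERM.
        if self.child is not None and self.child.poll() is None:
            self.child.terminate()
            try:
                self.child.wait(timeout=5)
            except subprocess.TimeoutExpired:
                self.child.kill()
                self.child.wait()
```

With a handler like this installed, a plain kill (signal 15) of the parent also takes the child down, whereas kill -9 skips the handler entirely.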

Comment 2 Assaf Muller 2016-10-10 22:31:29 UTC
As for this RHBZ, if we look at:

https://review.rdoproject.org/r/gitweb?p=openstack/neutron-distgit.git;a=blob;f=neutron-netns-cleanup.init;h=aef6a07094952af4d4ab31f1fb21ebc9635526b9;hb=refs/heads/rpm-master#l43

It looks like we use killall, which by default sends signal 15 (SIGTERM), so I'm not sure why the 'ip' child of neutron-keepalived-state-change is not being cleaned up.

Comment 3 Assaf Muller 2016-10-11 10:52:41 UTC
Looking at [1] again, I ran netns-cleanup --force (line 40) directly, outside the RDO service file and without the killall neutron-keepalived-state-change that follows it. Executing that line alone kills neutron-keepalived-state-change and orphans the ip monitors.

[1] https://review.rdoproject.org/r/gitweb?p=openstack/neutron-distgit.git;a=blob;f=neutron-netns-cleanup.init;h=aef6a07094952af4d4ab31f1fb21ebc9635526b9;hb=refs/heads/rpm-master#l43

Comment 4 Assaf Muller 2016-10-11 12:56:51 UTC
@Daniel, can you please take a look at this?

Comment 5 Daniel Alvarez Sanchez 2016-10-14 17:56:45 UTC
I've submitted a patch to kill all subprocesses from the process tree to have an immediate solution.
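As a rough sketch of the idea (not the submitted patch), killing a process tree comes down to collecting all descendants of a PID and signalling them before the parent. The helper names and the pid-to-ppid dictionary are assumptions made for illustration; in practice the table would have to be built from ps or /proc.

```python
import os
import signal


def descendants(root_pid, parent_of):
    """Return every PID below root_pid.

    parent_of maps pid -> ppid for all processes of interest.
    """
    children = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    found = []
    stack = [root_pid]
    while stack:
        current = stack.pop()
        for child in sorted(children.get(current, [])):
            found.append(child)
            stack.append(child)
    return found


def kill_tree(root_pid, parent_of, sig=signal.SIGTERM):
    """Signal the descendants of root_pid first, then root_pid itself."""
    for pid in reversed(descendants(root_pid, parent_of)) + [root_pid]:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # already gone
```

Note that the table is only a snapshot, so a process spawned after it was taken would be missed.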

However, as the ip monitor process is launched through neutron/agent/linux/async_process.py, I was considering extending this class so that new processes can ask the kernel, through the prctl() syscall, to deliver SIGKILL (or some other signal) to them when the parent dies. What do you guys think?
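A minimal sketch of the prctl() idea, assuming Linux and using ctypes rather than anything in async_process.py. The function names are hypothetical; PR_SET_PDEATHSIG and its value come from <linux/prctl.h>.

```python
import ctypes
import os
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>

# Load the C library symbols before forking (Linux only).
_libc = ctypes.CDLL(None, use_errno=True)


def set_parent_death_signal(sig=signal.SIGKILL):
    """Ask the kernel to send sig to the calling process when its parent dies."""
    if _libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def spawn_with_pdeathsig(cmd, sig=signal.SIGKILL):
    """Start cmd so that it receives sig if this process goes away."""
    # preexec_fn runs in the child after fork() and before exec().
    return subprocess.Popen(cmd, preexec_fn=lambda: set_parent_death_signal(sig))
```

One caveat to keep in mind: per prctl(2), the setting is cleared when the child executes a set-user-ID or set-group-ID binary, so it may not survive if the command is launched through sudo or a similar wrapper.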

Also, there was a suspicion that this bug was related to another one [1], but I can't reproduce the issue with the steps described there.

[1] https://bugs.launchpad.net/neutron/+bug/1632099

Comment 6 Daniel Alvarez Sanchez 2016-12-16 19:39:18 UTC
Sent a patch to upstream gerrit:

https://review.openstack.org/#/c/411968/

Comment 9 errata-xmlrpc 2017-06-19 14:47:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1503

