Bug 1383448

Summary: Neutron L3 HA - stop of neutron-keepalived-state-change leaves stale "ip -o monitor address" processes
Product: Red Hat OpenStack
Reporter: Matt Flusche <mflusche>
Component: openstack-neutron
Assignee: Daniel Alvarez Sanchez <dalvarez>
Status: CLOSED ERRATA
QA Contact: GenadiC <gcheresh>
Severity: medium
Priority: medium
Version: 9.0 (Mitaka)
CC: amuller, bperkins, chrisw, gkeegan, nyechiel, oblaut, samccann, srevivo
Target Milestone: async
Keywords: Triaged, ZStream
Target Release: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Fixed In Version: openstack-neutron-8.3.0-7.el7ost
Doc Type: Bug Fix
Doc Text:
When HA routers were deleted, the L3 agent used SIGKILL to kill the neutron-keepalived-state-change process, which orphaned its child process, 'ip -o monitor'. The orphaned processes accumulated, so memory consumption grew and the OOM killer could eventually be triggered. This fix uses SIGTERM to stop the keepalived-state-change process gracefully, so that it kills its ip monitor during cleanup. This ensures that no ip monitor processes are leaked when HA routers are deleted.
Last Closed: 2017-06-19 14:47:53 UTC
Type: Bug

Description Matt Flusche 2016-10-10 16:02:01 UTC
Description of problem:
Upon exit, neutron-keepalived-state-change does not terminate its "ip -o monitor address" child process.

Version-Release number of selected component (if applicable):
openstack-neutron-8.1.2-4.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
There are several ways to reproduce this; here is one:
1. Deploy the Overcloud with OSP Director.
2. Deploy an HA router.
3. On the Controller/L3 node, observe the "ip -o monitor address" processes and their parent, neutron-keepalived-state-change:
  $ ps -ef | egrep 'ip -o monitor address|keepalived-state-change'
4. Place the controller into standby to stop all L3 services.
  $ sudo pcs cluster standby <controller-node-name>
5. After the L3 services have been stopped and netns cleanup has run, observe that the "ip -o monitor address" processes remain.
  $ ps -ef | egrep 'ip -o monitor address|keepalived-state-change'
6. Unstandby the controller node and observe new keepalived processes along with additional "ip -o monitor address" processes.
  $ sudo pcs cluster unstandby <controller-node-name>
7. Repeat steps 4 - 6 and observe the build-up of "ip -o monitor" processes.

Actual results:
Build-up of "ip -o monitor" processes each time neutron-keepalived-state-change is stopped.

Expected results:
Cleanup of "ip -o monitor" when neutron-keepalived-state-change is terminated.

Additional info:
When neutron-keepalived-state-change is terminated, the "ip -o monitor address" process is reparented and its PPID changes to 1. The following can be used to clean out the old processes:

  # ps -ef |grep 'ip -o monitor' |awk '$3 == "1"{print $2}' |xargs kill
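
For anyone who wants to script this check rather than run the one-liner, here is a rough Python sketch of the same filter. It is only an illustration: the function names `orphaned_ip_monitors` and `list_orphans` are made up for this example and are not part of neutron. It keeps processes whose command line contains "ip -o monitor" and whose PPID is 1, like the awk filter above.

```python
import subprocess


def orphaned_ip_monitors(ps_output):
    """Return PIDs of 'ip -o monitor' processes whose parent is PID 1.

    ps_output is the text printed by `ps -eo pid,ppid,args`
    (first line is the header). Hypothetical helper, not neutron code.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:
        fields = line.split(None, 2)  # pid, ppid, full command line
        if len(fields) < 3:
            continue
        pid, ppid, args = fields
        if ppid == "1" and "ip -o monitor" in args:
            pids.append(int(pid))
    return pids


def list_orphans():
    """Run ps and return the orphaned monitor PIDs on this host."""
    out = subprocess.check_output(["ps", "-eo", "pid,ppid,args"], text=True)
    return orphaned_ip_monitors(out)
```

The PIDs returned by `list_orphans()` could then be passed to `os.kill(pid, signal.SIGTERM)`. Listing them first and reviewing the result before killing anything is probably the safer order.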

Comment 1 Assaf Muller 2016-10-10 20:28:28 UTC
While trying to find the root cause of this issue I found something else: it seems that when we delete an HA router, we disable neutron-keepalived-state-change with signal 9, and the ip monitor that the keepalived-state-change process spawns is never cleaned up.
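
To illustrate why the choice of signal matters here: signal 9 (SIGKILL) cannot be caught, so the parent never gets a chance to stop its child, while SIGTERM can be handled. Below is a minimal sketch of a parent that stops its child from a SIGTERM handler. It is not the neutron code; `stop_child` and `install_sigterm_cleanup` are hypothetical names, and the child is assumed to be a plain `subprocess.Popen` object.

```python
import signal
import subprocess
import sys


def stop_child(child, timeout=5):
    """Terminate the child, escalating to SIGKILL if it does not exit."""
    if child.poll() is None:
        child.terminate()  # sends SIGTERM to the child
        try:
            child.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            child.kill()  # child ignored SIGTERM, force it
            child.wait()


def install_sigterm_cleanup(child):
    """Make SIGTERM stop the child before this process exits.

    Must be called from the main thread. A SIGKILL sent to this
    process bypasses the handler entirely, which leaves the child
    running as an orphan.
    """
    def _on_sigterm(signum, frame):
        stop_child(child)
        sys.exit(0)

    signal.signal(signal.SIGTERM, _on_sigterm)
```

With this in place, `kill -15 <parent>` takes the child down with the parent, whereas `kill -9 <parent>` still leaves the child behind.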

Comment 2 Assaf Muller 2016-10-10 22:31:29 UTC
As for this RHBZ, if we look at:

https://review.rdoproject.org/r/gitweb?p=openstack/neutron-distgit.git;a=blob;f=neutron-netns-cleanup.init;h=aef6a07094952af4d4ab31f1fb21ebc9635526b9;hb=refs/heads/rpm-master#l43

It looks like we use killall, which by default uses signal 15, so I'm not sure why the 'ip' child of neutron-keepalived-state-change is not being cleaned up.

Comment 3 Assaf Muller 2016-10-11 10:52:41 UTC
Looking at [1] again, I ran netns-cleanup --force (Line 40) directly, outside the RDO service file and without the killall neutron-keepalived-state-change that follows it. Executing that line alone kills neutron-keepalived-state-change and orphans the ip monitors.

[1] https://review.rdoproject.org/r/gitweb?p=openstack/neutron-distgit.git;a=blob;f=neutron-netns-cleanup.init;h=aef6a07094952af4d4ab31f1fb21ebc9635526b9;hb=refs/heads/rpm-master#l43

Comment 4 Assaf Muller 2016-10-11 12:56:51 UTC
@Daniel, can you please take a look at this?

Comment 5 Daniel Alvarez Sanchez 2016-10-14 17:56:45 UTC
I've submitted a patch to kill all subprocesses from the process tree to have an immediate solution.
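
As a rough illustration of the "kill the whole process tree" idea (not the submitted patch itself), the sketch below collects all descendants of a PID before signalling anything. The helper names `read_parent_map`, `descendants` and `kill_tree` are hypothetical, and reading /proc assumes Linux.

```python
import os
import signal


def read_parent_map():
    """Build a {pid: ppid} map from /proc (Linux only)."""
    parent_of = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # The command name may contain spaces or ')', so split after the
        # last ')': the remaining fields start with state, then ppid.
        parent_of[int(entry)] = int(stat.rsplit(")", 1)[1].split()[1])
    return parent_of


def descendants(root_pid, parent_of):
    """Return every pid whose ancestry leads back to root_pid."""
    children = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found


def kill_tree(root_pid, sig=signal.SIGKILL):
    """Signal root_pid and all of its descendants."""
    # Snapshot the tree first: once the parent dies, its children are
    # reparented and can no longer be found through it.
    targets = [root_pid] + descendants(root_pid, read_parent_map())
    for pid in targets:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # already gone
```

Taking the snapshot before sending any signal is the important part. After the parent is gone, the ip monitor shows up with PPID 1 and is no longer linked to the process that started it.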

However, as the ip monitor process is launched through neutron/agent/linux/async_process.py, I was considering extending this class so that new processes can ask the kernel, through the prctl() syscall, to deliver SIGKILL (or some other signal) to them when the parent dies. What do you guys think?
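
A minimal sketch of the prctl() idea, assuming Linux and plain `subprocess` rather than the actual async_process.py code: the child calls prctl(PR_SET_PDEATHSIG, ...) between fork() and exec(), so the kernel signals it when the parent dies. `spawn_bound_to_parent` is a hypothetical helper name. The constant value 1 for PR_SET_PDEATHSIG comes from <linux/prctl.h>.

```python
import ctypes
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>


def _die_with_parent(sig):
    """Runs in the child after fork() and before exec()."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def spawn_bound_to_parent(cmd, sig=signal.SIGKILL, **popen_kwargs):
    """Start cmd so that the kernel sends it `sig` when the parent dies.

    Hypothetical helper for illustration. preexec_fn is not safe to use
    in a multi-threaded parent.
    """
    return subprocess.Popen(
        cmd, preexec_fn=lambda: _die_with_parent(sig), **popen_kwargs
    )
```

Two caveats from the prctl(2) man page would need checking before relying on this. The setting is cleared when the child executes a set-user-ID or set-group-ID binary, which may matter if the command is launched through sudo or a similar wrapper. Also, the "parent" is the thread that created the child, not the whole process.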

Also, there was a suspicion that this bug was related to another one [1], but I can't reproduce the issue with the steps described there.

[1] https://bugs.launchpad.net/neutron/+bug/1632099

Comment 6 Daniel Alvarez Sanchez 2016-12-16 19:39:18 UTC
Sent a patch to upstream gerrit:

https://review.openstack.org/#/c/411968/

Comment 9 errata-xmlrpc 2017-06-19 14:47:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1503