| Summary: | Neutron L3 HA - stop of neutron-keepalived-state-change leaves stale "ip -o monitor address" processes | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Matt Flusche <mflusche> |
| Component: | openstack-neutron | Assignee: | Daniel Alvarez Sanchez <dalvarez> |
| Status: | CLOSED ERRATA | QA Contact: | GenadiC <gcheresh> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 9.0 (Mitaka) | CC: | amuller, bperkins, chrisw, gkeegan, nyechiel, oblaut, samccann, srevivo |
| Target Milestone: | async | Keywords: | Triaged, ZStream |
| Target Release: | 9.0 (Mitaka) | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-neutron-8.3.0-7.el7ost | Doc Type: | Bug Fix |
| Doc Text: |
Previously, when HA routers were deleted, the L3 agent used SIGKILL to kill the neutron-keepalived-state-change process, which orphaned its child process, 'ip -o monitor'. The orphaned processes accumulated, so memory consumption grew until the OOM killer was eventually triggered.
This fix uses SIGTERM to stop the neutron-keepalived-state-change process gracefully, so that it kills its ip monitor during cleanup. This ensures that no ip monitor processes are leaked when HA routers are deleted.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-06-19 14:47:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
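The Doc Text above depends on one property of the two signals: SIGTERM can be caught by the receiving process, so a handler can clean up children before exiting, while SIGKILL cannot be caught. The following Python sketch illustrates that pattern only. It is not the Neutron patch, and `MonitorSupervisor`, `install_handler` and `cleanup` are hypothetical names chosen for this example.

```python
import signal
import subprocess
import sys


class MonitorSupervisor:
    """Hypothetical stand-in for a daemon that owns one long-running child."""

    def __init__(self, cmd):
        # Start the child (for example an "ip -o monitor address" command).
        self.child = subprocess.Popen(cmd)

    def install_handler(self):
        # SIGTERM can be caught, so cleanup() gets a chance to run.
        # SIGKILL cannot be caught, which is why the child was orphaned.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.cleanup()
        sys.exit(0)

    def cleanup(self, timeout=5.0):
        """Stop the child: send SIGTERM first, escalate to SIGKILL on timeout."""
        if self.child.poll() is not None:
            return  # child has already exited
        self.child.terminate()
        try:
            self.child.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self.child.kill()
            self.child.wait()
```

A daemon built this way would call `install_handler()` once at startup. Stopping it with signal 15 then removes the child as well, whereas signal 9 still leaves the child behind.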
While trying to find the root cause of the issue, I found something else: when we delete an HA router, we disable neutron-keepalived-state-change with signal 9 (SIGKILL), and the ip monitor that the process spawns is never cleaned up.

As for this RHBZ, if we look at [1], it looks like we use killall, which by default sends signal 15 (SIGTERM), so I'm not sure why the 'ip' child of neutron-keepalived-state-change is not being cleaned up.

Looking at [1] again, I ran netns-cleanup --force directly, without the RDO service file (line 40) and without the killall neutron-keepalived-state-change that follows it. Executing that line alone kills neutron-keepalived-state-change and orphans the ip monitors.

[1] https://review.rdoproject.org/r/gitweb?p=openstack/neutron-distgit.git;a=blob;f=neutron-netns-cleanup.init;h=aef6a07094952af4d4ab31f1fb21ebc9635526b9;hb=refs/heads/rpm-master#l43

@Daniel, can you please take a look at this?

I've submitted a patch to kill all subprocesses in the process tree as an immediate solution. However, as the ip monitor process is launched through neutron/agent/linux/async_process.py, I was considering extending this class so that new processes can ask the kernel, through the prctl() syscall, to deliver SIGKILL (or some other signal) when the parent dies. What do you guys think?

Also, there was a suspicion that this bug was related to another one [2], but I can't reproduce the issue with the steps described there.

[2] https://bugs.launchpad.net/neutron/+bug/1632099

Sent a patch to upstream gerrit: https://review.openstack.org/#/c/411968/

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1503
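The prctl() idea raised in the comments above can be sketched in Python on Linux. This is a minimal illustration under stated assumptions, not code from async_process.py: `spawn_with_pdeathsig` and `_make_preexec` are hypothetical helper names, and the call goes through `ctypes` because the Python standard library has no prctl wrapper. The constant `PR_SET_PDEATHSIG` (value 1) comes from `<linux/prctl.h>`.

```python
import ctypes
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # option number from <linux/prctl.h>

# Load the C library in the parent, so the child only has to make the
# prctl() call between fork() and exec().
_libc = ctypes.CDLL(None, use_errno=True)


def _make_preexec(sig):
    def _set_pdeathsig():
        # Runs in the child: ask the kernel to send `sig` to this process
        # when its parent dies.
        rc = _libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)), 0, 0, 0)
        if rc != 0:
            raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")
    return _set_pdeathsig


def spawn_with_pdeathsig(cmd, sig=signal.SIGKILL):
    """Start `cmd` with a parent-death signal set (Linux only)."""
    return subprocess.Popen(cmd, preexec_fn=_make_preexec(sig))
```

Two caveats from the prctl(2) manual page apply to any design along these lines. The setting is cleared when the child executes a set-user-ID or set-group-ID binary, so it would not survive a command launched through a setuid wrapper such as sudo. The "parent" is the thread that created the child, not the whole parent process.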
Description of problem:
Upon exit, neutron-keepalived-state-change does not terminate its "ip -o monitor address" child process.

Version-Release number of selected component (if applicable):
openstack-neutron-8.1.2-4.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
There are several ways to reproduce this; here is one:
1. Deploy Overcloud with OSP Director.
2. Deploy an HA router.
3. On the Controller/l3 node, observe the "ip -o monitor address" processes and their parent, neutron-keepalived-state-change:
   $ ps -ef |egrep 'ip -o monitor address|keepalived-state-change'
4. Place the controller into standby to stop all l3 services:
   $ sudo pcs cluster standby <controller-node-name>
5. After the l3 services have been stopped and netns cleanup has run, observe that the "ip -o monitor address" processes remain:
   $ ps -ef |egrep 'ip -o monitor address|keepalived-state-change'
6. Unstandby the controller node and observe new keepalived processes along with additional "ip -o monitor address" processes:
   $ sudo pcs cluster unstandby <controller-node-name>
7. Repeat steps 4 - 6 and observe the build-up of "ip -o monitor" processes.

Actual results:
Build-up of "ip -o monitor" processes as neutron-keepalived-state-change is stopped.

Expected results:
Cleanup of "ip -o monitor" when neutron-keepalived-state-change is terminated.

Additional info:
When neutron-keepalived-state-change is terminated, the ppid of "ip -o monitor address" changes to 1. The following can be used to clean out old processes:
# ps -ef |grep 'ip -o monitor' |awk '$3 == "1"{print $2}' |xargs kill
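The selection logic of the cleanup one-liner above (command matches 'ip -o monitor' and the third `ps -ef` column, PPID, equals 1) can also be written as a small Python function, which makes it easy to inspect the candidate PIDs before killing anything. This is a sketch, not part of the bug report: `find_orphaned_monitors` is a hypothetical name, and it assumes the usual eight-column `ps -ef` layout (UID PID PPID C STIME TTY TIME CMD).

```python
def find_orphaned_monitors(ps_output, pattern="ip -o monitor"):
    """Return PIDs whose command contains `pattern` and whose PPID is 1."""
    pids = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command keeps its spaces.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip the header row and malformed lines
        pid, ppid, cmd = fields[1], fields[2], fields[7]
        if ppid == "1" and pattern in cmd:
            pids.append(int(pid))
    return pids
```

With Python 3.7 or later, the live process table could be passed in as `find_orphaned_monitors(subprocess.check_output(["ps", "-ef"], text=True))`, and the returned PIDs reviewed before being handed to `os.kill`.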