Bug 1397418 - neutron-keepalived-state-change lives behind big processes
Summary: neutron-keepalived-state-change lives behind big processes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: z2
Target Release: 10.0 (Newton)
Assignee: Daniel Alvarez Sanchez
QA Contact: GenadiC
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-22 13:40 UTC by Attila Fazekas
Modified: 2017-02-23 16:34 UTC
CC List: 8 users

Fixed In Version: openstack-neutron-9.1.1-6.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, every time neutron-keepalived-state-change was killed, the IP monitor process it spawned remained in an orphaned state. This resulted in leaked memory over time and required manual actions from administrators. With this update, the process is killed gracefully and its child IP monitor process will be killed as well, avoiding this memory leak.
Clone Of:
Environment:
Last Closed: 2017-02-23 16:34:02 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1632099 0 None None None 2016-11-23 14:41:33 UTC
OpenStack gerrit 411968 0 None MERGED Kill neutron-keepalived-state-change gracefully 2020-06-18 10:05:49 UTC
Red Hat Product Errata RHBA-2017:0314 0 normal SHIPPED_LIVE openstack-neutron bug fix advisory 2017-02-23 21:33:22 UTC

Description Attila Fazekas 2016-11-22 13:40:21 UTC
Description of problem:
neutron l3ha is wasting memory by leaving ~12 MB behind after every scenario test.

If we used this version of neutron with L3HA to manage our CI machines, we would have to fight the OOM killer on a weekly basis. (Our CI jobs always create new routers.)

This also prevents running any longer neutron tests or bigger load tests.

Version-Release number of selected component (if applicable):
python-neutronclient-6.0.0-1.1.el7ost.noarch
python-neutron-lib-0.4.0-1.el7ost.noarch
openstack-neutron-common-9.1.0-4.el7ost.noarch
openstack-neutron-bigswitch-lldp-9.40.0-1.1.el7ost.noarch
python-neutron-lbaas-9.1.0-1.el7ost.noarch
openstack-neutron-9.1.0-4.el7ost.noarch
openstack-neutron-openvswitch-9.1.0-4.el7ost.noarch
openstack-neutron-ml2-9.1.0-4.el7ost.noarch
openstack-neutron-metering-agent-9.1.0-4.el7ost.noarch
puppet-neutron-9.4.0-3.el7ost.noarch
python-neutron-9.1.0-4.el7ost.noarch
openstack-neutron-sriov-nic-agent-9.1.0-4.el7ost.noarch
python-neutron-tests-9.1.0-4.el7ost.noarch
openstack-neutron-bigswitch-agent-9.40.0-1.1.el7ost.noarch
openstack-neutron-lbaas-9.1.0-1.el7ost.noarch


How reproducible:
always

Steps to Reproduce:
1. 3 controller default setup
2. install tempest
3. run any tempest scenario test which creates, uses, and deletes a router,
for example: ostestr -r 'minimum'

Actual results:
After every such test there is a new sudo process (rss 2808 kB, shr 2112 kB; at least 684 kB wasted, plus the kernel-side memory usage) and a new neutron-rootwrap-daemon process (rss 14888 kB, shr 4288 kB; at least 10600 kB wasted).

Expected results:
The extra neutron-rootwrap-daemon and sudo processes die when the router is deleted.


Additional info:
Similar issue: https://bugzilla.redhat.com/show_bug.cgi?id=1383448

Comment 8 Daniel Alvarez Sanchez 2016-12-02 12:11:16 UTC
Hi,

I have done some tests and indeed there are leaked processes. In my particular case, I observed these new processes after running the following two tests:

neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase.test_ha_router_lifecycle
neutron.tests.functional.agent.l3.test_ha_router.LinuxBridgeL3HATestCase.test_ha_router_lifecycle  

246a246,250
> root     21276     1  0 11:51 ?        00:00:00 ip -o monitor address
> root     21680     1  0 11:52 ?        00:00:00 ip -o monitor address
> root     21825     1  0 11:52 ?        00:00:00 sudo /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
> root     21826 21825  1 11:52 ?        00:00:00 /usr/bin/python /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf

[centos@devstack bug1397418]$ ps -o rss,sz,vsz 21276
  RSS    SZ    VSZ
  788  1666   6664
[centos@devstack bug1397418]$ ps -o rss,sz,vsz 21680
  RSS    SZ    VSZ
  788  1666   6664
[centos@devstack bug1397418]$ ps -o rss,sz,vsz 21825
  RSS    SZ    VSZ
 2796 48593 194372
[centos@devstack bug1397418]$ ps -o rss,sz,vsz 21826
  RSS    SZ    VSZ
20936 76311 305244


Regarding the 'ip -o monitor' processes: the keepalived-state-change process spawns them, and when an HA router is deleted, keepalived-state-change is stopped with SIGKILL, leaving 'ip -o monitor' orphaned.

@Attila: does this match what you have observed?

@Assaf
In my opinion, the fix could be to catch the termination signal within keepalived_state_change and clean up its child processes (see the sketch below).

Also, I need to investigate further why the rootwrap-daemon processes are being leaked.
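
A minimal sketch of that approach (illustrative only, not the actual neutron code; it assumes Python 3 and that the daemon keeps a handle to the 'ip -o monitor address' child it spawned):

import signal
import subprocess
import sys

# Hypothetical handle to the spawned monitor; in neutron the
# keepalived-state-change daemon starts this child itself.
monitor = subprocess.Popen(['ip', '-o', 'monitor', 'address'])

def _handle_sigterm(signum, frame):
    # Propagate the shutdown to the child so it is not left orphaned.
    monitor.terminate()
    try:
        monitor.wait(timeout=5)
    except subprocess.TimeoutExpired:
        monitor.kill()
    sys.exit(0)

signal.signal(signal.SIGTERM, _handle_sigterm)

Note this only helps if the parent stops using SIGKILL (which cannot be caught) and sends SIGTERM instead.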

Comment 11 Daniel Alvarez Sanchez 2016-12-16 19:40:29 UTC
Sent a patch to upstream gerrit:

https://review.openstack.org/#/c/411968/

In my setup, no 'ip -o monitor' processes are orphaned anymore (a sketch of the graceful-kill pattern follows below).

Daniel
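
A hedged sketch of the parent-side "kill gracefully" pattern the patch title refers to: send SIGTERM first so the daemon can clean up its children, and fall back to SIGKILL only if it does not exit in time. The function name and timeout are illustrative, not neutron's actual values.

import os
import signal
import time

def kill_gracefully(pid, timeout=5):
    os.kill(pid, signal.SIGTERM)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 just probes; raises OSError once gone
        except OSError:
            return  # exited cleanly after SIGTERM
        time.sleep(0.1)
    os.kill(pid, signal.SIGKILL)  # last resort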

Comment 14 Jon Schlueter 2017-01-19 21:30:09 UTC
Looks like the patches have landed on stable/newton.

Comment 16 Daniel Alvarez Sanchez 2017-01-20 09:50:21 UTC
Build with the fix has been released in openstack-neutron-9.1.1-6.el7ost

Steps to test:
1. 3 controller default setup
2. install tempest
3. run any tempest scenario test which creates, uses, and deletes a router,
for example: ostestr -r 'minimum'

Comment 19 GenadiC 2017-02-15 15:33:16 UTC
To verify, I ran the test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario test and made sure that ps aux | grep "monitor address" returned nothing, so the process no longer existed.
Verified in openstack-neutron-ml2-9.2.0-2.el7ost.noarch
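
For reference, an equivalent automated check (a sketch only; the helper name is made up and it simply scans ps output for leftovers):

import subprocess

def leaked_monitors():
    # List all processes and keep any leftover 'ip -o monitor address'.
    out = subprocess.check_output(['ps', '-eo', 'pid,cmd'],
                                  universal_newlines=True)
    return [line for line in out.splitlines()
            if 'ip -o monitor address' in line]

assert not leaked_monitors(), 'orphaned ip monitor processes found'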

Comment 21 errata-xmlrpc 2017-02-23 16:34:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0314.html

