Bug 1339014

Summary: Neutron metadata agent workers are not properly shut down when 'systemctl stop neutron-metadata-agent' is issued
Product: Red Hat OpenStack
Reporter: kahou <kalei>
Component: openstack-neutron
Assignee: Bernard Cafarelli <bcafarel>
Status: CLOSED ERRATA
QA Contact: Alexander Stafeyev <astafeye>
Severity: high
Priority: high
Version: 8.0 (Liberty)
CC: aathomas, amuller, bcafarel, charcrou, chrisw, jdonohue, nyechiel, skulkarn, srevivo, tfreger
Target Milestone: async
Keywords: ZStream
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Linux
Whiteboard: hot
Fixed In Version: openstack-neutron-7.1.1-4.el7ost
Doc Type: Bug Fix
Last Closed: 2016-08-24 13:30:03 UTC
Type: Bug
Bug Blocks: 1194008, 1295530

Description kahou 2016-05-23 23:57:09 UTC
Description of problem:

I have 15 neutron metadata agent processes running in my cluster. When I issue systemctl stop neutron-metadata-agent, systemctl hangs for a while and some of the neutron metadata child processes are not cleaned up properly.

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:

1. Start the neutron metadata agent with 15 metadata_workers. The configuration value is specified in metadata_agent.ini.
2. Make sure all 15 metadata processes are running.
3. Run systemctl stop neutron-metadata-agent.
4. systemctl hangs for a while. Once it finishes, run ps aux | grep metadata. Some of the metadata processes have not been cleaned up.
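The check in step 4 can be sketched as a small helper that reads `ps aux` output and lists the agent processes still alive. This is only an illustration: the function name `leftover_agent_pids` and the default match string `neutron-metadata-agent` are assumptions, not something taken from this report.

```python
from typing import List


def leftover_agent_pids(ps_output: str,
                        needle: str = "neutron-metadata-agent") -> List[int]:
    """Return PIDs from `ps aux` text whose COMMAND column contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        # `ps aux` prints 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or malformed line
        command = fields[10]
        # Skip the grep process that shows up when ps is piped through grep.
        if needle in command and not command.startswith("grep "):
            pids.append(int(fields[1]))
    return pids
```

An empty list after `systemctl stop` returns would mean every worker was cleaned up; any remaining PID is a leftover child.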


Actual results:

systemctl hangs and some of the child metadata processes are not cleaned up

Expected results:

Running systemctl stop/restart neutron-metadata-agent should not hang, and all the metadata processes should be cleaned up.


Additional info:

Comment 1 kahou 2016-05-24 07:30:01 UTC
If I run strace -p <main process id>, I see the main process looping on wait4(0, 0x7fff51d0d6b4, WNOHANG, NULL) = 0
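For readers unfamiliar with that trace: a wait4 call with WNOHANG returns immediately, and a return value of 0 means child processes still exist but none of them has exited yet. The sketch below reproduces that return value with a throwaway child process. It is not the neutron code; the helper names are hypothetical, and the demo polls the specific child PID (rather than 0, as in the trace) so that it stays deterministic.

```python
import os
import signal
import time


def poll_child(pid: int = 0) -> int:
    """Non-blocking wait: returns the reaped PID, or 0 if no child has exited."""
    reaped, _status, _rusage = os.wait4(pid, os.WNOHANG)
    return reaped


def demo():
    child = os.fork()
    if child == 0:
        # Child: block until a signal arrives, then exit without cleanup.
        try:
            signal.pause()
        finally:
            os._exit(0)
    # The child is still alive, so the poll returns 0 (as in the strace output).
    first = poll_child(child)
    os.kill(child, signal.SIGTERM)
    # Keep polling until the child has actually been reaped.
    reaped = 0
    deadline = time.monotonic() + 5
    while reaped == 0 and time.monotonic() < deadline:
        reaped = poll_child(child)
        time.sleep(0.01)
    return first, reaped == child
```

A parent that keeps seeing 0 from this call is waiting on children that never terminate, which matches the hang described above.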

Comment 2 kahou 2016-05-24 17:11:10 UTC
Please note that the service was originally managed by pacemaker, which uses systemd to start/stop/restart the service. To simplify debugging, I used systemctl directly to reproduce the issue.

Comment 24 Joe Donohue 2016-07-19 16:54:28 UTC
Any ETA on this one?

Comment 25 Bernard Cafarelli 2016-07-20 09:55:59 UTC
Waiting for customer feedback on whether the upstream fix https://review.openstack.org/#/c/331672/ (provided in a test package) solves this bug in their environment.

In the meantime, this workaround proved to work:
* change the KillMode value to "control-group" in /usr/lib/systemd/system/neutron-metadata-agent.service
* run: "systemctl daemon-reload"
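As an illustration of the first step of the workaround, here is a minimal sketch that rewrites the KillMode line in the text of a unit file. The function name `set_kill_mode` is hypothetical, and the sketch assumes the setting lives in the `[Service]` section as a plain `KillMode=` line; it works on a string and leaves reading and writing the file to the caller.

```python
def set_kill_mode(unit_text: str, mode: str = "control-group") -> str:
    """Return unit_text with KillMode in the [Service] section set to `mode`."""
    out = []
    in_service = False
    done = False
    for line in unit_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Leaving [Service] without having seen KillMode: add it first.
            if in_service and not done:
                out.append("KillMode=" + mode)
                done = True
            in_service = stripped == "[Service]"
        elif in_service and stripped.startswith("KillMode="):
            line = "KillMode=" + mode
            done = True
        out.append(line)
    # [Service] was the last section and had no KillMode line.
    if in_service and not done:
        out.append("KillMode=" + mode)
    return "\n".join(out) + "\n"
```

After writing the result back to the unit file, `systemctl daemon-reload` is still needed for systemd to pick up the change.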

Comment 26 Charles Crouch 2016-07-20 14:27:30 UTC
(In reply to Bernard Cafarelli from comment #25)
> Waiting for customer feedback on whether the upstream fix
> https://review.openstack.org/#/c/331672/ (provided in a test package) solves
> this bug in their environment.

Hi Bernard
This is the corresponding support case, right: https://access.redhat.com/support/cases/#/case/01640942 ?
I'm not seeing a test package attached there. Could you add it to the case?

Thanks
Charles

Comment 29 Charles Crouch 2016-07-26 16:54:14 UTC
From support case from Kahou: "We have verified that the patch works. By applying the change, we don't see any more zombie child process anymore even we are using "process" as kill-mode."

So (thumbsup) from us :-)

Comment 30 Bernard Cafarelli 2016-07-27 12:02:37 UTC
Thanks for the test! That confirms https://review.openstack.org/#/c/331672/ fixes this bug. We will review and integrate the change.

Comment 34 errata-xmlrpc 2016-08-24 13:30:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1770.html