Description of problem:
I have 15 neutron metadata agent processes running in my cluster. When I issue systemctl stop neutron-metadata-agent, systemctl hangs for a while and some of the neutron metadata child processes are not cleaned up properly.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Start the neutron metadata agent with 15 metadata_workers. The configuration value is specified in metadata_agent.ini.
2. Make sure all 15 metadata processes are running.
3. Run systemctl stop neutron-metadata-agent.
4. systemctl hangs for a while. Once systemctl finishes, run ps aux | grep metadata. Some of the metadata processes are not cleaned up.

Actual results:
systemctl hangs and some of the child metadata processes are not cleaned up.

Expected results:
Running systemctl stop/restart neutron-metadata-agent should not hang, and all the metadata processes should be cleaned up.

Additional info:
If I run strace -p <main process id>, I see the main process looping on: wait4(0, 0x7fff51d0d6b4, WNOHANG, NULL) = 0
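For context: a return value of 0 from wait4 with WNOHANG means that at least one child still exists but none has changed state, so the parent is polling for workers that have not exited yet. The following is a minimal Python sketch of that behavior, not the neutron code itself; poll_children and demo are illustrative names, and the sleeping child is only a stand-in for a metadata worker.

```python
import os
import signal
import time


def poll_children():
    """Non-blocking reap: return the pid of an exited child, or 0 if none has exited."""
    # pid 0 = any child in the caller's process group, as in the strace output.
    pid, _status = os.waitpid(0, os.WNOHANG)
    return pid


def demo():
    child = os.fork()
    if child == 0:
        # Child: stand-in for a worker that does not exit on its own.
        time.sleep(60)
        os._exit(0)

    # The child is still alive, so the non-blocking poll reports 0.
    first = poll_children()

    # SIGKILL is used here only so the demo terminates reliably.
    os.kill(child, signal.SIGKILL)
    reaped, _status = os.waitpid(child, 0)  # blocking wait reaps the child
    return first, reaped == child


if __name__ == "__main__":
    print(demo())
```

A parent that loops on the non-blocking call without ever getting a child pid back would look like the strace output above for as long as the children stay alive.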
Please note that the service was originally managed by pacemaker, which uses systemd to start/stop/restart the service. To make debugging simpler, I used systemctl directly to reproduce the issue.
Any ETA on this one?
Waiting for customer feedback on whether the upstream fix https://review.openstack.org/#/c/331672/ (provided in a test package) solves this bug in their environment.

In the meantime, this workaround proved to work:
* change the KillMode value to "control-group" in /usr/lib/systemd/system/neutron-metadata-agent.service
* run: "systemctl daemon-reload"
(In reply to Bernard Cafarelli from comment #25)
> Waiting for customer feedback on whether the upstream fix
> https://review.openstack.org/#/c/331672/ (provided in a test package) solves
> this bug in their environment.

Hi Bernard,

This is the corresponding support case, right: https://access.redhat.com/support/cases/#/case/01640942 ?

I'm not seeing a test package attached there. Could you add it to the case?

Thanks,
Charles
From the support case, from Kahou: "We have verified that the patch works. By applying the change, we don't see any more zombie child process anymore even we are using "process" as kill-mode." So (thumbsup) from us :-)
Thanks for the test! That confirms https://review.openstack.org/#/c/331672/ fixes this bug. We will review and integrate the change.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1770.html