Bug 1229135

Summary: [BUG] Neutron systemd unit file doesn't kill all processes
Product: Red Hat OpenStack
Reporter: Pablo Iranzo Gómez <pablo.iranzo>
Component: openstack-neutron
Assignee: Jakub Libosvar <jlibosva>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Ofer Blaut <oblaut>
Severity: medium
Priority: medium
Version: 6.0 (Juno)
CC: amuller, chrisw, jlibosva, jschwarz, mschuppe, nyechiel, pablo.iranzo, tfreger, yeylon, zshujuan
Keywords: ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Clones: 1249181, 1249197
Last Closed: 2015-12-15 12:51:10 UTC
Type: Bug
Bug Blocks: 1249181, 1249192, 1249197

Description Pablo Iranzo Gómez 2015-06-08 07:23:42 UTC
Description of problem:

The customer has found that when neutron is restarted with 'service neutron restart', the kill signal is sent only to the main process.

According to http://www.freedesktop.org/software/systemd/man/systemd.kill.html, systemd offers other KillMode choices.

Because neutron spawns subprocesses, the unit should use either control-group or mixed to ensure that all the PIDs are killed as well, freeing every resource in use so that the process can be started again.
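
As a rough illustration of the setting under discussion, here is a minimal Python sketch that reads KillMode out of a unit file and renders a drop-in that changes it. The unit text is a made-up sample, not the neutron-server.service shipped in the package, and kill_mode() and with_kill_mode() are hypothetical helper names.

```python
import configparser

# Hypothetical unit file text for illustration only; the real
# neutron-server.service shipped in the package may differ.
SAMPLE_UNIT = """\
[Unit]
Description=OpenStack Neutron Server

[Service]
ExecStart=/usr/bin/neutron-server --config-file /etc/neutron/neutron.conf
KillMode=process
"""


def kill_mode(unit_text):
    """Return the KillMode set in a unit file, or systemd's default."""
    # strict=False tolerates repeated keys; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep the original key case
    parser.read_string(unit_text)
    # systemd falls back to control-group when KillMode= is not set.
    return parser.get("Service", "KillMode", fallback="control-group")


def with_kill_mode(mode):
    """Render a drop-in override that changes only KillMode."""
    return "[Service]\nKillMode={}\n".format(mode)


if __name__ == "__main__":
    print(kill_mode(SAMPLE_UNIT))
    print(with_kill_mode("mixed"), end="")
```

On a real host, 'systemctl show -p KillMode neutron-server' reports the effective value. A drop-in like the one rendered here would normally live under /etc/systemd/system/neutron-server.service.d/ and take effect after 'systemctl daemon-reload'.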

I couldn't find a BZ for this, but it looks reasonable. Did I miss anything here?


How reproducible:

Kill one of the neutron processes; the remaining ones keep holding ports or other resources, which prevents neutron from starting again.


For example:
2015-06-03 07:17:48.950 21149 ERROR neutron.service [-] Unrecoverable error: please check log for details.
2015-06-03 07:17:48.950 21149 TRACE neutron.service Traceback (most recent call last):
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 102, in serve_wsgi
2015-06-03 07:17:48.950 21149 TRACE neutron.service     service.start()
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 73, in start
2015-06-03 07:17:48.950 21149 TRACE neutron.service     self.wsgi_app = _run_wsgi(self.app_name)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 174, in _run_wsgi
2015-06-03 07:17:48.950 21149 TRACE neutron.service     workers=cfg.CONF.api_workers)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/wsgi.py", line 207, in start
2015-06-03 07:17:48.950 21149 TRACE neutron.service     backlog=backlog)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/wsgi.py", line 186, in _get_socket
2015-06-03 07:17:48.950 21149 TRACE neutron.service     'time': CONF.retry_until_window})
2015-06-03 07:17:48.950 21149 TRACE neutron.service RuntimeError: Could not bind to 0.0.0.0:9696 after trying for 30 seconds
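
The bind failure in this trace can be reproduced outside Neutron. The following is a minimal sketch, assuming a POSIX system and Python 3; it is not Neutron's code, and listen_on() and leftover_worker_demo() are hypothetical names. A forked worker inherits the listening socket, so the port stays busy after the parent closes its own copy.

```python
import errno
import os
import signal
import socket


def listen_on(port):
    """Bind and listen on 127.0.0.1:port (port 0 lets the OS choose)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(5)
    except OSError:
        sock.close()
        raise
    return sock


def leftover_worker_demo():
    """Return (blocked_while_worker_alive, free_after_worker_killed)."""
    parent_sock = listen_on(0)
    port = parent_sock.getsockname()[1]

    worker = os.fork()
    if worker == 0:
        # Worker: inherited the listening socket through fork(); just idle.
        while True:
            signal.pause()

    # Simulate the parent going away while the worker survives.
    parent_sock.close()

    try:
        listen_on(port).close()
        blocked = False
    except OSError as exc:
        blocked = exc.errno == errno.EADDRINUSE

    # Once the leftover worker is gone, the port can be bound again.
    os.kill(worker, signal.SIGKILL)
    os.waitpid(worker, 0)
    try:
        listen_on(port).close()
        free_again = True
    except OSError:
        free_again = False
    return blocked, free_again


if __name__ == "__main__":
    print(leftover_worker_demo())
```

The first element of the result reports whether the rebind failed with EADDRINUSE while the worker was alive; the second reports whether the port became usable once the worker was killed.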





Actual results:

Neutron is unable to start after this because some resources are still in use.


Expected results:
Neutron should be able to start because all the relevant PIDs have been terminated correctly.

Comment 3 Jakub Libosvar 2015-06-08 09:16:58 UTC
(In reply to Pablo Iranzo Gómez from comment #0)
> How reproducible:
> 
> Kill one of the neutron processes; the remaining ones keep holding ports or
> other resources, which prevents neutron from starting again.

You shouldn't send signals to any of the Neutron worker processes; there is one parent process that forks into multiple processes.

systemd should send the signal only to this parent process, which should take care of its child processes; we avoid using control-group on purpose.
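
To make the parent/child arrangement concrete, here is a minimal sketch of a parent that forwards SIGTERM to its forked workers and reaps them before exiting. It assumes a POSIX system and Python 3 and is not Neutron's actual implementation; supervise() and demo_graceful_stop() are hypothetical names, and the pipe handshake exists only so the demo knows the worker PIDs.

```python
import os
import signal
import time


def supervise(num_workers, ready_fd):
    """Fork workers; on SIGTERM forward the signal and reap every child."""
    workers = []
    # Hold SIGTERM back during setup so no signal is lost while forking.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    for _ in range(num_workers):
        pid = os.fork()
        if pid == 0:
            # Worker: default SIGTERM handling, then idle forever.
            os.close(ready_fd)
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
            signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGTERM})
            while True:
                time.sleep(1)
        workers.append(pid)

    def forward_sigterm(signum, frame):
        for pid in workers:
            try:
                os.kill(pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # worker already gone

    signal.signal(signal.SIGTERM, forward_sigterm)
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGTERM})

    # Demo-only handshake: report the worker PIDs to the caller.
    os.write(ready_fd, (" ".join(map(str, workers)) + "\n").encode())
    os.close(ready_fd)

    for pid in workers:
        os.waitpid(pid, 0)  # returns once that worker has exited


def demo_graceful_stop(num_workers=2):
    """Return the worker PIDs still alive after the parent got SIGTERM."""
    read_fd, write_fd = os.pipe()
    supervisor = os.fork()
    if supervisor == 0:
        try:
            os.close(read_fd)
            supervise(num_workers, write_fd)
        finally:
            os._exit(0)
    os.close(write_fd)
    worker_pids = [int(p) for p in os.read(read_fd, 1024).split()]
    os.close(read_fd)

    os.kill(supervisor, signal.SIGTERM)  # signal only the parent
    os.waitpid(supervisor, 0)

    survivors = []
    for pid in worker_pids:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the PID exists
            survivors.append(pid)
        except ProcessLookupError:
            pass
    return survivors


if __name__ == "__main__":
    print(demo_graceful_stop())
```

An empty list means that signalling the parent alone was enough to stop every worker. If a worker cannot react to the forwarded SIGTERM, the parent in this sketch blocks in waitpid, which is roughly the hang being asked about.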

Pablo, can you please confirm that what happens is that the systemd restart sends SIGTERM to the parent process but the child processes hang?
Also, to be sure, this happens in the latest RHOS 6, right?

Comment 4 Pablo Iranzo Gómez 2015-06-08 11:49:39 UTC
Hi Jakub,
The versions are:

openstack-neutron-ml2-2014.2.1-6.el7ost.noarch
openstack-neutron-2014.2.1-6.el7ost.noarch
openstack-neutron-openvswitch-2014.2.1-6.el7ost.noarch
python-neutronclient-2.3.9-1.el7ost.noarch
openstack-neutron-metering-agent-2014.2.1-6.el7ost.noarch
python-neutron-2014.2.1-6.el7ost.noarch

The customer tested the restart as follows:

1. Simulate the systemd kill with 'kill -STOP <any neutron-server child PID>'
2. systemctl restart neutron-server

They repeat these steps until the service complains about in-use resources.
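
For context on the 'kill -STOP' step: on Linux, a stopped process does not act on SIGTERM until it is continued, so a worker left in that state keeps whatever it holds. The following is a minimal sketch assuming Linux and Python 3; sigterm_while_stopped() is a hypothetical name and the 0.2 second pause is arbitrary.

```python
import os
import signal
import time


def sigterm_while_stopped():
    """Return (alive_after_sigterm, signal_that_finally_killed_it)."""
    pid = os.fork()
    if pid == 0:
        # Child: default SIGTERM handling, then idle forever.
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        while True:
            time.sleep(1)

    os.kill(pid, signal.SIGSTOP)
    os.waitpid(pid, os.WUNTRACED)  # wait until the child is really stopped

    # SIGTERM stays pending while the child is stopped.
    os.kill(pid, signal.SIGTERM)
    time.sleep(0.2)
    alive = os.waitpid(pid, os.WNOHANG) == (0, 0)

    # SIGKILL is acted on even in the stopped state.
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    return alive, os.WTERMSIG(status)


if __name__ == "__main__":
    print(sigterm_while_stopped())
```

The child survives SIGTERM while stopped and only goes away on SIGKILL, or on SIGCONT, which would let the pending SIGTERM be delivered.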

Regards,
Pablo

Comment 6 Jakub Libosvar 2015-06-18 17:04:19 UTC
(In reply to Pablo Iranzo Gómez from comment #4)
> Hi Jakub,
> The versions are:
> 
> openstack-neutron-ml2-2014.2.1-6.el7ost.noarch
> openstack-neutron-2014.2.1-6.el7ost.noarch
> openstack-neutron-openvswitch-2014.2.1-6.el7ost.noarch
> python-neutronclient-2.3.9-1.el7ost.noarch
> openstack-neutron-metering-agent-2014.2.1-6.el7ost.noarch
> python-neutron-2014.2.1-6.el7ost.noarch
> 
This is the GA version; there have been 3 other minor releases since. Can you please ask the customer to upgrade to the latest and try to reproduce?

I suspect the issue they hit is https://bugs.launchpad.net/neutron/+bug/1387053, which basically means you can't stop RPC workers.

I reproduced locally with the GA version that restart fails if you manually send SIGTERM to one of the workers. I can't reproduce this issue with the A3 release. What I discovered there is that 'systemctl stop' hangs and is eventually killed by SIGKILL. But that also kills all child processes, so the next start of the service is successful.
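
The stop-then-SIGKILL sequence can be approximated in a few lines. This is only a sketch of the escalation pattern, assuming a POSIX system and Python 3; it is not a description of systemd's internals, and stop_with_escalation() and demo_escalation() are hypothetical names. The child deliberately ignores SIGTERM to stand in for a service that hangs on stop.

```python
import os
import signal
import subprocess
import sys

# Stand-in for a service that hangs on stop: it ignores SIGTERM.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_escalation(proc, timeout):
    """Send SIGTERM, wait up to timeout seconds, then SIGKILL the group."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        # The child leads its own session, so its PID is also its
        # process group ID and killpg reaches any children it forked.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return "killed"


def demo_escalation():
    """Return (outcome, returncode) for a child that ignores SIGTERM."""
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD],
        stdout=subprocess.PIPE,
        start_new_session=True,
    )
    proc.stdout.readline()  # wait until the child ignores SIGTERM
    outcome = stop_with_escalation(proc, timeout=0.5)
    proc.stdout.close()
    return outcome, proc.returncode


if __name__ == "__main__":
    print(demo_escalation())
```

Because SIGKILL goes to the whole process group, nothing is left holding the port afterwards, which matches the observation that the next start succeeds.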