Description of problem:

Customer is seeing that when the neutron service is restarted with 'service neutron restart', systemd sends the kill signal to the process. According to http://www.freedesktop.org/software/systemd/man/systemd.kill.html we have other choices: since neutron spawns subprocesses, it should use either KillMode=control-group or KillMode=mixed to ensure that all the PIDs are killed as well, freeing all used resources so the process can be started again.

I couldn't find a BZ for this, but it looks reasonable. Did I miss anything here?

How reproducible:

Kill one of the neutron processes; the remaining ones still hold ports or other resources, preventing neutron from starting again. For example:

2015-06-03 07:17:48.950 21149 ERROR neutron.service [-] Unrecoverable error: please check log for details.
2015-06-03 07:17:48.950 21149 TRACE neutron.service Traceback (most recent call last):
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 102, in serve_wsgi
2015-06-03 07:17:48.950 21149 TRACE neutron.service     service.start()
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 73, in start
2015-06-03 07:17:48.950 21149 TRACE neutron.service     self.wsgi_app = _run_wsgi(self.app_name)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 174, in _run_wsgi
2015-06-03 07:17:48.950 21149 TRACE neutron.service     workers=cfg.CONF.api_workers)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/wsgi.py", line 207, in start
2015-06-03 07:17:48.950 21149 TRACE neutron.service     backlog=backlog)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/wsgi.py", line 186, in _get_socket
2015-06-03 07:17:48.950 21149 TRACE neutron.service     'time': CONF.retry_until_window})
2015-06-03 07:17:48.950 21149 TRACE neutron.service RuntimeError: Could not bind to 0.0.0.0:9696 after trying for 30 seconds

Actual results:

Neutron is unable to start after this because some resources are still in use.

Expected results:

Neutron should be able to start again because all the relevant PIDs were correctly terminated.
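For illustration, a minimal sketch of what the suggested KillMode change could look like as a systemd drop-in. The unit name neutron-server.service and the drop-in file name are assumptions, not taken from the shipped package:

# Hypothetical drop-in override; verify the unit name the package ships.
mkdir -p /etc/systemd/system/neutron-server.service.d
cat > /etc/systemd/system/neutron-server.service.d/killmode.conf <<'EOF'
[Service]
# "mixed" sends SIGTERM to the main process only, then SIGKILL to every
# process left in the unit's control group on timeout, so orphaned
# workers cannot keep the API port bound across restarts.
KillMode=mixed
EOF
systemctl daemon-reload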
(In reply to Pablo Iranzo Gómez from comment #0)
> How reproducible:
>
> Kill one of the neutron processes; the remaining ones still hold ports or
> other resources, preventing neutron from starting again

You shouldn't send signals to any of the Neutron worker processes; there is one parent process that forks into multiple workers. systemd should send the signal only to this parent process, which should then take care of its child processes; we avoid using control-group on purpose.

Pablo, can you please confirm that what happens is that the systemd restart sends SIGTERM to the parent process but the child processes hang? Also, to be sure, this happens on the latest RHOS 6, right?
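To illustrate the intended behaviour, one way to check which PID systemd actually signals and how the workers hang off it (standard systemd/procps tools; the unit name is an assumption):

# MainPID is the only process signalled with KillMode=process.
systemctl show -p MainPID -p KillMode neutron-server
# The lowest matching PID is normally the parent; show its worker tree.
pstree -ap $(pgrep -f neutron-server | head -n1)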
Hi Jakub,

The versions are:

openstack-neutron-ml2-2014.2.1-6.el7ost.noarch
openstack-neutron-2014.2.1-6.el7ost.noarch
openstack-neutron-openvswitch-2014.2.1-6.el7ost.noarch
python-neutronclient-2.3.9-1.el7ost.noarch
openstack-neutron-metering-agent-2014.2.1-6.el7ost.noarch
python-neutron-2014.2.1-6.el7ost.noarch

The customer tested the restart process (sketched below) via:

1. Simulate the systemd kill with 'kill -STOP <any neutron-server child PID>'
2. systemctl restart neutron-server

until it complains about in-use resources.

Regards,
Pablo
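A shell sketch of those two steps; the pgrep pattern is an assumption and may need adjusting to match the processes on the host:

# Pick one child/worker PID (the lowest PID is usually the parent).
CHILD=$(pgrep -f neutron-server | tail -n1)
kill -STOP "$CHILD"                # step 1: simulate a stuck child
systemctl restart neutron-server   # step 2: restart the service
# Repeat until the restart fails with "Could not bind to 0.0.0.0:9696".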
(In reply to Pablo Iranzo Gómez from comment #4)
> Hi Jakub,
>
> The versions are:
>
> openstack-neutron-ml2-2014.2.1-6.el7ost.noarch
> openstack-neutron-2014.2.1-6.el7ost.noarch
> openstack-neutron-openvswitch-2014.2.1-6.el7ost.noarch
> python-neutronclient-2.3.9-1.el7ost.noarch
> openstack-neutron-metering-agent-2014.2.1-6.el7ost.noarch
> python-neutron-2014.2.1-6.el7ost.noarch

This is the GA version; there are 3 later minor releases. Can you please ask the customer to upgrade to the latest one and try to reproduce? I suspect the issue they hit is https://bugs.launchpad.net/neutron/+bug/1387053, which basically means you can't stop the RPC workers. I reproduced it locally with the GA version: the restart fails if you manually send SIGTERM to one of the workers.

I can't reproduce the issue with the A3 release. What I discovered is that 'systemctl stop' hangs and is killed by SIGKILL in the end, but that also kills all child processes, so the next start of the service succeeds.
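For anyone hitting this before upgrading, a sketch of how to see which leftover process still holds the API port (9696 comes from the traceback in comment #0) and how to clean it up by hand:

# Show which PID still has 0.0.0.0:9696 bound after a failed restart.
ss -tlnp | grep ':9696'
# Blunt manual cleanup: kill every remaining neutron-server process so
# the socket is freed and the next systemctl start succeeds.
pkill -9 -f neutron-server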