Description of problem:

Customer is seeing that when the neutron service is restarted with 'service neutron restart', systemd sends the kill signal to the process. According to http://www.freedesktop.org/software/systemd/man/systemd.kill.html we have other choices: since neutron spawns subprocesses, it should use either KillMode=control-group or KillMode=mixed to ensure that all the PIDs are killed as well, freeing all used resources so the process can be started again.

I couldn't find a BZ for this, but it looks reasonable. Did I miss anything here?

How reproducible:

Kill one of the neutron processes; the remaining ones still hold ports or other resources, preventing neutron from starting again. For example:

2015-06-03 07:17:48.950 21149 ERROR neutron.service [-] Unrecoverable error: please check log for details.
2015-06-03 07:17:48.950 21149 TRACE neutron.service Traceback (most recent call last):
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 102, in serve_wsgi
2015-06-03 07:17:48.950 21149 TRACE neutron.service     service.start()
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 73, in start
2015-06-03 07:17:48.950 21149 TRACE neutron.service     self.wsgi_app = _run_wsgi(self.app_name)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/service.py", line 174, in _run_wsgi
2015-06-03 07:17:48.950 21149 TRACE neutron.service     workers=cfg.CONF.api_workers)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/wsgi.py", line 207, in start
2015-06-03 07:17:48.950 21149 TRACE neutron.service     backlog=backlog)
2015-06-03 07:17:48.950 21149 TRACE neutron.service   File "/usr/lib/python2.7/site-packages/neutron/wsgi.py", line 186, in _get_socket
2015-06-03 07:17:48.950 21149 TRACE neutron.service     'time': CONF.retry_until_window})
2015-06-03 07:17:48.950 21149 TRACE neutron.service RuntimeError: Could not bind to 0.0.0.0:9696 after trying for 30 seconds

Actual results:

Neutron is unable to start after this because some resources are still in use.

Expected results:

Neutron should be able to start again because all the relevant PIDs were correctly terminated.
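For illustration, a minimal sketch of what the suggested KillMode change could look like as a systemd drop-in. The unit name neutron-server.service and the drop-in file name are assumptions, not taken from the shipped package:

# Hypothetical drop-in override; verify the unit name the package ships.
mkdir -p /etc/systemd/system/neutron-server.service.d
cat > /etc/systemd/system/neutron-server.service.d/killmode.conf <<'EOF'
[Service]
# "mixed" sends SIGTERM to the main process only, then SIGKILL to every
# process left in the unit's control group on timeout, so orphaned
# workers cannot keep the API port bound across restarts.
KillMode=mixed
EOF
systemctl daemon-reload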
(In reply to Pablo Iranzo Gómez from comment #0)
> How reproducible:
>
> Kill one of the neutron processes; the remaining ones still hold ports or
> other resources, preventing neutron from starting again

You shouldn't send signals to any of the Neutron worker processes; there is one parent process that forks into multiple workers. systemd should send the signal only to this parent process, which should then take care of its child processes; we avoid using control-group on purpose.

Pablo, can you please confirm that what happens is that the systemd restart sends SIGTERM to the parent process but the child processes hang? Also, to be sure, this happens on the latest RHOS 6, right?
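To illustrate the intended behaviour, one way to check which PID systemd actually signals and how the workers hang off it (standard systemd/procps tools; the unit name is an assumption):

# MainPID is the only process signalled with KillMode=process.
systemctl show -p MainPID -p KillMode neutron-server
# The lowest matching PID is normally the parent; show its worker tree.
pstree -ap $(pgrep -f neutron-server | head -n1)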
Hi Jakub,

The versions are:

openstack-neutron-ml2-2014.2.1-6.el7ost.noarch
openstack-neutron-2014.2.1-6.el7ost.noarch
openstack-neutron-openvswitch-2014.2.1-6.el7ost.noarch
python-neutronclient-2.3.9-1.el7ost.noarch
openstack-neutron-metering-agent-2014.2.1-6.el7ost.noarch
python-neutron-2014.2.1-6.el7ost.noarch

The customer tested the restart process (sketched below) via:

1. Simulate the systemd kill with 'kill -STOP <any neutron-server child PID>'
2. systemctl restart neutron-server

until it complains about in-use resources.

Regards,
Pablo
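A shell sketch of those two steps; the pgrep pattern is an assumption and may need adjusting to match the processes on the host:

# Pick one child/worker PID (the lowest PID is usually the parent).
CHILD=$(pgrep -f neutron-server | tail -n1)
kill -STOP "$CHILD"                # step 1: simulate a stuck child
systemctl restart neutron-server   # step 2: restart the service
# Repeat until the restart fails with "Could not bind to 0.0.0.0:9696".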
(In reply to Pablo Iranzo Gómez from comment #4)
> Hi Jakub,
>
> The versions are:
>
> openstack-neutron-ml2-2014.2.1-6.el7ost.noarch
> openstack-neutron-2014.2.1-6.el7ost.noarch
> openstack-neutron-openvswitch-2014.2.1-6.el7ost.noarch
> python-neutronclient-2.3.9-1.el7ost.noarch
> openstack-neutron-metering-agent-2014.2.1-6.el7ost.noarch
> python-neutron-2014.2.1-6.el7ost.noarch

This is the GA version; there are 3 later minor releases. Can you please ask the customer to upgrade to the latest one and try to reproduce? I suspect the issue they hit is https://bugs.launchpad.net/neutron/+bug/1387053, which basically means you can't stop the RPC workers. I reproduced it locally with the GA version: the restart fails if you manually send SIGTERM to one of the workers.

I can't reproduce the issue with the A3 release. What I discovered is that 'systemctl stop' hangs and is killed by SIGKILL in the end, but that also kills all child processes, so the next start of the service succeeds.
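For anyone hitting this before upgrading, a sketch of how to see which leftover process still holds the API port (9696 comes from the traceback in comment #0) and how to clean it up by hand:

# Show which PID still has 0.0.0.0:9696 bound after a failed restart.
ss -tlnp | grep ':9696'
# Blunt manual cleanup: kill every remaining neutron-server process so
# the socket is freed and the next systemctl start succeeds.
pkill -9 -f neutron-server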