Created attachment 1111769 [details]
neutron logs and configs

Description of problem:
Running 'pcs resource restart haproxy' in a 3-controller setup leaves the openvswitch-agent unable to start on one of the controllers.

Version-Release number of selected component (if applicable):
openstack-neutron-common-7.0.0-5.el7ost.noarch
python-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
openstack-neutron-openvswitch-7.0.0-5.el7ost.noarch
openstack-neutron-metering-agent-7.0.0-5.el7ost.noarch
python-neutron-7.0.0-5.el7ost.noarch
openstack-neutron-7.0.0-5.el7ost.noarch
openstack-neutron-ml2-7.0.0-5.el7ost.noarch

How reproducible:
Hit this once.

Steps to Reproduce:
1. Deploy an overcloud with 3 controllers, 1 compute and 3 ceph nodes
2. SSH to one of the controllers
3. Run 'pcs resource restart haproxy'

Actual results:
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Expected results:
neutron-openvswitch-agent restarts.

Additional info:
Attaching /var/log/neutron and /etc/neutron.
ovs-agent is innocent in this case. The cause is neutron-server taking too long to shut down:

Jan 5 12:05:09 localhost lrmd[21512]: notice: Giving up on neutron-server stop (rc=196): timeout (elapsed=19988ms, remaining=12ms)
Jan 5 12:05:09 localhost crmd[21515]: error: Operation neutron-server_stop_0: Timed Out (node=overcloud-controller-1, call=401, timeout=20000ms)
As per the logs, it seems neutron-server stopped 3 seconds after pcs gave up waiting for the resource to stop:

Jan 5 12:05:12 localhost systemd: Stopped Cluster Controlled neutron-server

Maybe this is just a duplicate of bug 1295830.
Looks like Marius will increase the timeout, but is there something we can do on our side? How long would you expect it to take for neutron-server to shut down?
(In reply to Assaf Muller from comment #3)
> Looks like Marius will increase the timeout, but is there something we can
> do on our side? How long would you expect it to take for the neutron-server
> to shutdown?

Based on what I saw in the logs, neutron-server with 8 API workers and 8 RPC workers took ~22 seconds to shut down, while pacemaker had a 20-second timeout. On our side we can profile the code during shutdown to see why it takes so long. I saw some weird messages from oslo.messaging when stopping workers, about calling stop() from the wrong thread.
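For the profiling part, one low-overhead way to see where a Python service hangs during shutdown (an assumed local debugging patch, not anything neutron ships) is to register a signal handler that dumps every thread's stack; the choice of SIGUSR1 and the dump_stacks name are mine:

import signal
import sys
import traceback

def dump_stacks(signum, frame):
    # sys._current_frames() maps thread id -> that thread's current frame
    for thread_id, stack in sys._current_frames().items():
        sys.stderr.write('--- thread %s ---\n' % thread_id)
        traceback.print_stack(stack, file=sys.stderr)

signal.signal(signal.SIGUSR1, dump_stacks)

With that applied to the service under test, sending 'kill -USR1 <pid>' while neutron-server is stuck in its stop path would print where each worker thread is blocked.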
Flavio, we see in the logs a warning coming from the oslo.messaging layer when stopping multiple RPC workers that seems to slow down shutdown of the server process:

oslo_messaging.server [-] start/stop/wait must be called in the same thread

I found a patch upstream, https://review.openstack.org/#/c/238985/3, that seems to deal with such calls from various threads. Can you please confirm it is supposed to solve our issue? If so, I'll test it and send a patch to backport it.
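For context, here is a simplified model of the check behind that warning (my own sketch; ServerLifecycle is a made-up stand-in, not oslo.messaging's real MessageHandlingServer code): the server remembers which thread called start() and warns when stop() or wait() arrives from a different one.

import logging
import threading

logging.basicConfig()
LOG = logging.getLogger(__name__)

class ServerLifecycle(object):
    def __init__(self):
        self._owner = None

    def start(self):
        # Remember the thread that owns this server's lifecycle.
        self._owner = threading.current_thread()

    def _check_thread(self, op):
        if self._owner is not None and threading.current_thread() is not self._owner:
            LOG.warning('start/stop/wait must be called in the same thread (%s)', op)

    def stop(self):
        self._check_thread('stop')

    def wait(self):
        self._check_thread('wait')

srv = ServerLifecycle()
srv.start()
# Stopping from a different thread trips the warning, which appears to be
# what the multi-worker shutdown path in neutron-server is doing:
t = threading.Thread(target=srv.stop)
t.start()
t.join()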
Hey,

The patch does seem to fix an issue along the lines of what's in the bug description. That said, I think pcs's timeout should be higher when working with systemd, shouldn't it? Last time I heard something about PCS's timeout and systemd, it was said that it should be set to ~200s.
Yes, ~200s is required, but ideally services wouldn't come close to hitting it: node shutdown would end up taking ~20 minutes if a few services did.
Since bug 1295830 was fixed, is this still high severity?
I went through the code and tested where we hang. We basically stop every RpcWorker from the oslo_messaging library, each of which has multiple MessageHandlingServers that, per the oslo_messaging code, wait for message processing to complete. I don't think there is any work that can be done on the Neutron side to speed it up. I did some measurements, and sometimes it takes about 25 seconds for a poll_worker to finish. In oslo.messaging, poll_workers were removed from RpcWorker with https://github.com/openstack/oslo.messaging/commit/1482687ff7f06b8db49b6f604e7be0db8eafe2df, so I'll try to test with this patch and update.
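To reproduce the measurement on a toy server outside neutron, something like the following sketch can be used. The broker URL, topic, endpoint and sleep are placeholder assumptions; it presumes a reachable RabbitMQ and an oslo.messaging version from this era. Only the start()/stop()/wait() API is oslo.messaging's.

import eventlet
eventlet.monkey_patch()

import time

from oslo_config import cfg
import oslo_messaging

class Endpoint(object):
    def ping(self, ctxt):
        return 'pong'

transport = oslo_messaging.get_transport(
    cfg.CONF, url='rabbit://guest:guest@localhost:5672/')
target = oslo_messaging.Target(topic='shutdown-test', server='server-1')
server = oslo_messaging.get_rpc_server(
    transport, target, [Endpoint()], executor='eventlet')

server.start()
time.sleep(5)  # let the listener poll for a while

started = time.time()
server.stop()   # tell the listener to stop polling
server.wait()   # block until in-flight messages finish; this is where we hang
print('stop()+wait() took %.1fs' % (time.time() - started))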
With the latest trunk the latency moved to the stop() method of RpcServer, so it still takes time: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/server.py#L426

I moved the bug to oslo.messaging so they can have a look at whether stop() sometimes taking ~20 seconds is fine. I believe other projects must hit this too.
For example, dhcp-agent.log contains:

WARNING oslo_messaging.server [...] start/stop/wait must be called in the same thread

It seems the bug was fixed upstream by https://review.openstack.org/#/c/224054/ ("Enhance start/stop concurrency race condition fix").
Note: RHOS 9 is not affected (already has the fix).
where can I get python-oslo-messaging-2.5.0-10.el7ost for verification?
(In reply to Asaf Hirshberg from comment #14)
> where can I get python-oslo-messaging-2.5.0-10.el7ost for verification?

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=514847
Verified. OSPD-8 using python-oslo-messaging-2.5.0-10.el7ost.noarch. I restarted haproxy at least 5 times and the reported problem didn't reproduce.

  4  yum update -y python-oslo-messaging
  5  pcs cluster start --all
  6  pcs status
  7  pcs resource restart haproxy
  8  systemctl status neutron-openvswitch-agent.service
  9  history

[root@controller-0 ~]# rpm -qa | grep python-oslo-messaging
python-oslo-messaging-2.5.0-10.el7ost.noarch

[root@controller-0 ~]# pcs resource restart haproxy
...
haproxy[27704]: proxy nova_osapi has no server available!
haproxy-clone successfully restarted

[root@controller-0 ~]# systemctl status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Cluster Controlled neutron-openvswitch-agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/neutron-openvswitch-agent.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2016-10-06 15:51:43 UTC; 32s ago
...
Oct 06 15:52:16 controller-0.localdomain neutron-openvswitch-agent[28711]: execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:156
Oct 06 15:52:16 controller-0.localdomain neutron-openvswitch-agent[28711]: 2016-10-06 15:52:16.257 28711 DEBUG neutron.agent.linux.utils [req-1e9d3d3f-cb30-4a12-ba4d-f7f933837e33 - - - - -] Ru...tils.py:85
Hint: Some lines were ellipsized, use -l to show in full.
[root@controller-0 ~]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2712.html