Bug 1295692

Summary: neutron-server service sometimes fails to stop in HA env due to timeout
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: python-oslo-messaging
Assignee: Victor Stinner <vstinner>
Status: CLOSED ERRATA
QA Contact: Asaf Hirshberg <ahirshbe>
Severity: high
Priority: unspecified
Version: 8.0 (Liberty)
CC: abeekhof, amuller, apevec, bperkins, chrisw, dmacpher, fpercoco, jeckersb, jlibosva, lhh, mcornea, nyechiel, srevivo, ushkalim, vstinner
Target Milestone: ---
Keywords: ZStream
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-oslo-messaging-2.5.0-10.el7ost
Doc Type: Bug Fix
Doc Text:
A race condition prevented 'neutron-server' from stopping on the Overcloud. This caused a slow restart of the 'haproxy' resource. This bug fix corrects the race condition and provides resources with quicker and safer restarts.
Last Closed: 2016-11-14 19:57:08 UTC
Type: Bug
Attachments: neutron logs and configs

Description Marius Cornea 2016-01-05 09:27:59 UTC
Created attachment 1111769 [details]
neutron logs and configs

Description of problem:
Running 'pcs resource restart haproxy' on a 3-controller setup leaves the openvswitch-agent unable to start on one of the controllers.

Version-Release number of selected component (if applicable):
openstack-neutron-common-7.0.0-5.el7ost.noarch
python-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
openstack-neutron-openvswitch-7.0.0-5.el7ost.noarch
openstack-neutron-metering-agent-7.0.0-5.el7ost.noarch
python-neutron-7.0.0-5.el7ost.noarch
openstack-neutron-7.0.0-5.el7ost.noarch
openstack-neutron-ml2-7.0.0-5.el7ost.noarch

How reproducible:
Hit this once.

Steps to Reproduce:
1. Deploy overcloud with 3 controllers, 1 compute and 3 ceph nodes
2. SSH to one of the controllers 
3. Run pcs resource restart haproxy

Actual results:
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Expected results:
neutron-openvswitch-agent restarts.

Additional info:
Attaching /var/log/neutron and /etc/neutron.

Comment 1 Jakub Libosvar 2016-01-05 17:55:41 UTC
ovs-agent is innocent in this case. The cause is neutron-server taking too long to shut down:
Jan  5 12:05:09 localhost lrmd[21512]:  notice: Giving up on neutron-server stop (rc=196): timeout (elapsed=19988ms, remaining=12ms)
Jan  5 12:05:09 localhost crmd[21515]:   error: Operation neutron-server_stop_0: Timed Out (node=overcloud-controller-1, call=401, timeout=20000ms)

Comment 2 Jakub Libosvar 2016-01-05 17:58:08 UTC
As per the logs, it seems like neutron-server stopped 3 seconds after pcs gave up waiting for the resource to stop:
Jan  5 12:05:12 localhost systemd: Stopped Cluster Controlled neutron-server

Maybe this is just a duplicate of bug 1295830.

Comment 3 Assaf Muller 2016-01-05 20:31:51 UTC
Looks like Marius will increase the timeout, but is there something we can do on our side? How long would you expect it to take for the neutron-server to shutdown?

Comment 4 Jakub Libosvar 2016-01-06 09:10:08 UTC
(In reply to Assaf Muller from comment #3)
> Looks like Marius will increase the timeout, but is there something we can
> do on our side? How long would you expect it to take for the neutron-server
> to shutdown?

Based on what I saw in the logs, neutron-server with 8 API workers and 8 RPC workers took ~22 seconds to shut down while pacemaker had a 20-second timeout.

On our side we can profile the shutdown code to see why it takes so long. I saw some weird messages from oslo.messaging when stopping workers about calling stop() from the wrong thread.
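
One way to get such a profile (a sketch only; 'worker' stands for whatever object gets stopped during shutdown, not an actual neutron attribute):

# Sketch: profile a worker's stop path to see where shutdown time goes.
# 'worker' is a placeholder, not neutron's actual attribute name.
import cProfile
import pstats


def profile_stop(worker):
    profiler = cProfile.Profile()
    profiler.enable()
    worker.stop()
    worker.wait()
    profiler.disable()
    # Print the 20 most expensive calls by cumulative time.
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)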

Comment 5 Jakub Libosvar 2016-02-02 10:39:50 UTC
Flavio, we see in the logs a warning coming from the oslo.messaging layer when stopping multiple RPC workers that seems to slow down the shutdown of the server process:
oslo_messaging.server [-] start/stop/wait must be called in the same thread

I found an upstream patch, https://review.openstack.org/#/c/238985/3, that seems to deal with such calls from various threads. Can you please confirm it is supposed to solve our issue? If so, I'll test it and send a patch to backport it.
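
For reference, a minimal standalone way to trigger that warning (a sketch only, using the in-tree 'fake' transport; exact behavior varies by version, but on releases without the upstream fix, calling stop()/wait() from a different thread than start() logs the message above):

# Sketch: run start() in one thread and stop()/wait() in another, the
# pattern neutron-server hits on shutdown.
import logging
import threading
import time

from oslo_config import cfg
import oslo_messaging

logging.basicConfig(level=logging.WARNING)


class Endpoint(object):
    def ping(self, ctxt):
        return 'pong'


transport = oslo_messaging.get_transport(cfg.CONF, 'fake:')
target = oslo_messaging.Target(topic='demo', server='node-1')
server = oslo_messaging.get_rpc_server(transport, target, [Endpoint()])

starter = threading.Thread(target=server.start)   # start() in its own thread
starter.start()
time.sleep(1)

server.stop()    # stop()/wait() run in the main thread, i.e. a different
server.wait()    # thread than the one that called start()
starter.join()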

Comment 6 Flavio Percoco 2016-02-02 12:28:22 UTC
Hey,

The patch does seem to fix an issue along the lines of what's in the bug description. That said, I think pcs's timeout should be higher when working with systemd, shouldn't it?

Last time I heard something related to PCS's timeout and systemd, it was said that it should be set to ~200s.

Comment 7 Andrew Beekhof 2016-02-02 23:50:05 UTC
Yes, ~200s is required - but ideally services wouldn't come close to hitting it as node shutdown would end up taking ~20 minutes by the time a few services hit this.

Comment 8 Jakub Libosvar 2016-05-05 12:39:40 UTC
Since bug 1295830 was fixed, is this still high severity?

Comment 9 Jakub Libosvar 2016-05-05 13:58:55 UTC
I went through the code and tested where we hang. We basically stop every RpcWorker from the oslo_messaging library, which has multiple MessageHandlingServers that, as per the oslo_messaging code, wait for message processing to complete.

I don't think there is any work that can be done on the Neutron side to speed it up. I did some measurements and sometimes it takes about 25 seconds for a poll_worker to finish.

In oslo.messaging, poll_workers were removed from RpcWorker with https://github.com/openstack/oslo.messaging/commit/1482687ff7f06b8db49b6f604e7be0db8eafe2df, so I'll test with this patch and update.
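
For illustration, the measurement above can be done roughly like this (a sketch; 'rpc_servers' is a placeholder for the MessageHandlingServer objects the worker stops, not an actual attribute name):

# Sketch: time how long stopping and waiting on each RPC server takes
# during shutdown.
import time


def timed_stop(rpc_servers):
    for server in rpc_servers:
        started = time.time()
        server.stop()
        server.wait()   # blocks until in-flight message processing completes
        print('stop() + wait() took %.1f seconds' % (time.time() - started))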

Comment 10 Jakub Libosvar 2016-05-05 14:13:07 UTC
With the latest trunk the latency moved to the stop() method of RpcServer, so it still takes time: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/server.py#L426

I moved the bug to oslo.messaging so they can have a look at whether stop() sometimes taking ~20 seconds is fine. I believe other projects must hit this too.

Comment 11 Victor Stinner 2016-09-20 08:25:56 UTC
For example, dhcp-agent.log contains: "WARNING oslo_messaging.server [...] start/stop/wait must be called in the same thread"

It seems like the bug was fixed upstream by: https://review.openstack.org/#/c/224054/ "Enhance start/stop concurrency race condition fix"

Comment 12 Victor Stinner 2016-09-26 12:15:59 UTC
Note: RHOS 9 is not affected (already has the fix).

Comment 14 Asaf Hirshberg 2016-10-06 11:47:18 UTC
where can I get python-oslo-messaging-2.5.0-10.el7ost for verification?

Comment 15 John Eckersberg 2016-10-06 13:14:05 UTC
(In reply to Asaf Hirshberg from comment #14)
> where can I get python-oslo-messaging-2.5.0-10.el7ost for verification?

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=514847

Comment 16 Asaf Hirshberg 2016-10-06 16:01:55 UTC
Verified. OSPD-8 using python-oslo-messaging-2.5.0-10.el7ost.noarch.
I restarted haproxy at least 5 times and the reported problem didn't reproduce.

    4  yum update -y python-oslo-messaging
    5  pcs cluster start --all
    6  pcs status
    7  pcs resource restart haproxy
    8  systemctl status neutron-openvswitch-agent.service
    9  history 
[root@controller-0 ~]# rpm -qa |grep python-oslo-messaging
python-oslo-messaging-2.5.0-10.el7ost.noarch

[root@controller-0 ~]# pcs resource restart haproxy
...
haproxy[27704]: proxy nova_osapi has no server available!

haproxy-clone successfully restarted
[root@controller-0 ~]# systemctl status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Cluster Controlled neutron-openvswitch-agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/neutron-openvswitch-agent.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2016-10-06 15:51:43 UTC; 32s ago
...
...
Oct 06 15:52:16 controller-0.localdomain neutron-openvswitch-agent[28711]: execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:156
Oct 06 15:52:16 controller-0.localdomain neutron-openvswitch-agent[28711]: 2016-10-06 15:52:16.257 28711 DEBUG neutron.agent.linux.utils [req-1e9d3d3f-cb30-4a12-ba4d-f7f933837e33 - - - - -] Ru...tils.py:85
Hint: Some lines were ellipsized, use -l to show in full.
[root@controller-0 ~]#

Comment 18 errata-xmlrpc 2016-11-14 19:57:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2712.html