Created attachment 1111769 [details]
neutron logs and configs

Description of problem:
Running 'pcs resource restart haproxy' in a 3-controller setup leaves the openvswitch-agent unable to start on one of the controllers.

Version-Release number of selected component (if applicable):
openstack-neutron-common-7.0.0-5.el7ost.noarch
python-neutron-lbaas-7.0.0-2.el7ost.noarch
python-neutronclient-3.1.0-1.el7ost.noarch
openstack-neutron-lbaas-7.0.0-2.el7ost.noarch
openstack-neutron-openvswitch-7.0.0-5.el7ost.noarch
openstack-neutron-metering-agent-7.0.0-5.el7ost.noarch
python-neutron-7.0.0-5.el7ost.noarch
openstack-neutron-7.0.0-5.el7ost.noarch
openstack-neutron-ml2-7.0.0-5.el7ost.noarch

How reproducible:
Hit this once.

Steps to Reproduce:
1. Deploy an overcloud with 3 controllers, 1 compute and 3 ceph nodes
2. SSH to one of the controllers
3. Run 'pcs resource restart haproxy'

Actual results:
● neutron-openvswitch-agent.service - OpenStack Neutron Open vSwitch Agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

Expected results:
neutron-openvswitch-agent restarts.

Additional info:
Attaching /var/log/neutron and /etc/neutron.
ovs-agent is innocent in this case. The cause is neutron-server taking too long to shut down:

Jan 5 12:05:09 localhost lrmd[21512]: notice: Giving up on neutron-server stop (rc=196): timeout (elapsed=19988ms, remaining=12ms)
Jan 5 12:05:09 localhost crmd[21515]: error: Operation neutron-server_stop_0: Timed Out (node=overcloud-controller-1, call=401, timeout=20000ms)
As per the logs, it seems neutron-server stopped 3 seconds after pcs gave up waiting for the resource to stop:

Jan 5 12:05:12 localhost systemd: Stopped Cluster Controlled neutron-server

Maybe this is just a duplicate of bug 1295830.
Looks like Marius will increase the timeout, but is there something we can do on our side? How long would you expect it to take for neutron-server to shut down?
(In reply to Assaf Muller from comment #3)
> Looks like Marius will increase the timeout, but is there something we can
> do on our side? How long would you expect it to take for the neutron-server
> to shutdown?

Based on what I saw in the logs, neutron-server with 8 API workers and 8 RPC workers took ~22 seconds to shut down, while pacemaker had a 20-second timeout. On our side we can profile the code during shutdown to see why it takes so long. I saw some weird messages from oslo.messaging when stopping workers, about calling stop() from the wrong thread.
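For the profiling part, one low-overhead way to see where a Python service hangs during shutdown (an assumed local debugging patch, not anything neutron ships) is to register a signal handler that dumps every thread's stack; the choice of SIGUSR1 and the dump_stacks name are mine:

import signal
import sys
import traceback

def dump_stacks(signum, frame):
    # sys._current_frames() maps thread id -> that thread's current frame
    for thread_id, stack in sys._current_frames().items():
        sys.stderr.write('--- thread %s ---\n' % thread_id)
        traceback.print_stack(stack, file=sys.stderr)

signal.signal(signal.SIGUSR1, dump_stacks)

With that applied to the service under test, sending 'kill -USR1 <pid>' while neutron-server is stuck in its stop path would print where each worker thread is blocked.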
Flavio, we see in the logs a warning coming from the oslo.messaging layer when stopping multiple RPC workers that seems to slow down shutdown of the server process:

oslo_messaging.server [-] start/stop/wait must be called in the same thread

I found a patch upstream, https://review.openstack.org/#/c/238985/3, that seems to deal with such calls from various threads. Can you please confirm it is supposed to solve our issue? If so, I'll test it and send a patch to backport it.
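For context, here is a simplified model of the check behind that warning (my own sketch; ServerLifecycle is a made-up stand-in, not oslo.messaging's real MessageHandlingServer code): the server remembers which thread called start() and warns when stop() or wait() arrives from a different one.

import logging
import threading

logging.basicConfig()
LOG = logging.getLogger(__name__)

class ServerLifecycle(object):
    def __init__(self):
        self._owner = None

    def start(self):
        # Remember the thread that owns this server's lifecycle.
        self._owner = threading.current_thread()

    def _check_thread(self, op):
        if self._owner is not None and threading.current_thread() is not self._owner:
            LOG.warning('start/stop/wait must be called in the same thread (%s)', op)

    def stop(self):
        self._check_thread('stop')

    def wait(self):
        self._check_thread('wait')

srv = ServerLifecycle()
srv.start()
# Stopping from a different thread trips the warning, which appears to be
# what the multi-worker shutdown path in neutron-server is doing:
t = threading.Thread(target=srv.stop)
t.start()
t.join()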
Hey,

The patch does seem to fix an issue along the lines of what's in the bug description. That said, I think pcs's timeout should be higher when working with systemd, shouldn't it? Last time I heard something about PCS's timeout and systemd, it was said that it should be set to ~200s.
Yes, ~200s is required, but ideally services wouldn't come close to hitting it: node shutdown would end up taking ~20 minutes if a few services did.
Since bug 1295830 was fixed, is this still high severity?
I went through the code and tested where we hang. We basically stop every RpcWorker from the oslo_messaging library, each of which has multiple MessageHandlingServers that, per the oslo_messaging code, wait for message processing to complete. I don't think there is any work that can be done on the Neutron side to speed it up. I did some measurements, and sometimes it takes about 25 seconds for a poll_worker to finish. In oslo.messaging, poll_workers were removed from RpcWorker with https://github.com/openstack/oslo.messaging/commit/1482687ff7f06b8db49b6f604e7be0db8eafe2df, so I'll try to test with this patch and update.
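To reproduce the measurement on a toy server outside neutron, something like the following sketch can be used. The broker URL, topic, endpoint and sleep are placeholder assumptions; it presumes a reachable RabbitMQ and an oslo.messaging version from this era. Only the start()/stop()/wait() API is oslo.messaging's.

import eventlet
eventlet.monkey_patch()

import time

from oslo_config import cfg
import oslo_messaging

class Endpoint(object):
    def ping(self, ctxt):
        return 'pong'

transport = oslo_messaging.get_transport(
    cfg.CONF, url='rabbit://guest:guest@localhost:5672/')
target = oslo_messaging.Target(topic='shutdown-test', server='server-1')
server = oslo_messaging.get_rpc_server(
    transport, target, [Endpoint()], executor='eventlet')

server.start()
time.sleep(5)  # let the listener poll for a while

started = time.time()
server.stop()   # tell the listener to stop polling
server.wait()   # block until in-flight messages finish; this is where we hang
print('stop()+wait() took %.1fs' % (time.time() - started))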
With the latest trunk the latency moved to the stop() method of RpcServer, so it still takes time: https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/server.py#L426

I moved the bug to oslo.messaging so they can have a look at whether stop() sometimes taking ~20 seconds is fine. I believe other projects must hit this too.
For example, dhcp-agent.log contains:

WARNING oslo_messaging.server [...] start/stop/wait must be called in the same thread

It seems the bug was fixed upstream by https://review.openstack.org/#/c/224054/ ("Enhance start/stop concurrency race condition fix").
Note: RHOS 9 is not affected (already has the fix).
where can I get python-oslo-messaging-2.5.0-10.el7ost for verification?
(In reply to Asaf Hirshberg from comment #14)
> where can I get python-oslo-messaging-2.5.0-10.el7ost for verification?

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=514847
Verified. OSPD-8 using python-oslo-messaging-2.5.0-10.el7ost.noarch. I restarted haproxy at least 5 times and the reported problem didn't reproduce.

  4  yum update -y python-oslo-messaging
  5  pcs cluster start --all
  6  pcs status
  7  pcs resource restart haproxy
  8  systemctl status neutron-openvswitch-agent.service
  9  history

[root@controller-0 ~]# rpm -qa | grep python-oslo-messaging
python-oslo-messaging-2.5.0-10.el7ost.noarch

[root@controller-0 ~]# pcs resource restart haproxy
...
haproxy[27704]: proxy nova_osapi has no server available!
haproxy-clone successfully restarted

[root@controller-0 ~]# systemctl status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Cluster Controlled neutron-openvswitch-agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-openvswitch-agent.service; disabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/neutron-openvswitch-agent.service.d
           └─50-pacemaker.conf
   Active: active (running) since Thu 2016-10-06 15:51:43 UTC; 32s ago
...
Oct 06 15:52:16 controller-0.localdomain neutron-openvswitch-agent[28711]: execute /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:156
Oct 06 15:52:16 controller-0.localdomain neutron-openvswitch-agent[28711]: 2016-10-06 15:52:16.257 28711 DEBUG neutron.agent.linux.utils [req-1e9d3d3f-cb30-4a12-ba4d-f7f933837e33 - - - - -] Ru...tils.py:85
Hint: Some lines were ellipsized, use -l to show in full.
[root@controller-0 ~]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2712.html