Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1275324

Summary:	neutron-openvswitch-agent fails to restart under pacemaker
Product:	Red Hat OpenStack	Reporter:	Jan Provaznik <jprovazn>
Component:	openstack-tripleo-heat-templates	Assignee:	Giulio Fidente <gfidente>
Status:	CLOSED ERRATA	QA Contact:	Leonid Natapov <lnatapov>
Severity:	unspecified	Docs Contact:
Priority:	high
Version:	7.0 (Kilo)	CC:	abeekhof, calfonso, dnavale, gfidente, jslagle, kbasil, mburns, mcornea, michele, ohochman, rhel-osp-director-maint, rscarazz
Target Milestone:	y2	Keywords:	Triaged
Target Release:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-0.8.6-77.el7ost	Doc Type:	Bug Fix
Doc Text:	Previously, depending on the status, the OpenvSwitch neutron agent did not stop on an initial SIGTERM signal and only terminated when it received a SIGKILL signal, which was sent by systemd only on the occurrence of its stop timeout. As a result, the Pacemaker stop timeout for the OpenStack Networking OVS agent occurred too early causing the resource to go into a failed state. With this update, the Pacemaker stop timeout for OpenStack Networking OVS agent is increased to wait longer than the systemd stop timeout. As a result, the Pacemaker will successfully identify the OpenStack Networking OVS agent status after the stop attempt.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-12-21 16:52:23 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Jan Provaznik 2015-10-26 14:12:23 UTC

Description of problem:
During update from 7.0 to 7.1, after update process completes some services managed by pacemaker are stopped:
[root@overcloud-controller-0 ~]# pcs status|grep -B 1 Stopped
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-controller-0 ]
--
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-controller-0 ]
--
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 ]

Failed Actions:
* openstack-nova-scheduler_monitor_0 on overcloud-controller-0 'OCF_PENDING' (196): call=769, status=complete, exitreason='none',
    last-rc-change='Mon Oct 26 08:54:37 2015', queued=0ms, exec=2ms
* openstack-nova-consoleauth_monitor_0 on overcloud-controller-0 'OCF_PENDING' (196): call=823, status=complete, exitreason='none',
    last-rc-change='Mon Oct 26 08:55:21 2015', queued=0ms, exec=4ms
* neutron-openvswitch-agent_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=867, status=Timed Out, exitreason='none',
    last-rc-change='Mon Oct 26 08:56:26 2015', queued=13ms, exec=2ms
* neutron-server_monitor_60000 on overcloud-controller-0 'OCF_PENDING' (196): call=926, status=complete, exitreason='none',
    last-rc-change='Mon Oct 26 09:21:19 2015', queued=0ms, exec=0ms


The reason might be that pacemaker fails to restart neutron-openvswitch-agent service (other stop services depend on this one). From logs it seems that restart of openvswitch-agent fails because pacemaker timeout is 20 seconds, but stopping this service might take ~60 seconds due to waiting for "message timeout" cause by rabbitmq-server restart. Possible solution would be probably increasing pacemaker's service operation timeout from 20s to >60s.

Some highlights from logs:
Oct 26 08:56:06 overcloud-controller-0.localdomain crmd[11953]: notice: Initiating action 212: stop neutron-openvswitch-agent_stop_0 on overcloud-controller-0 (local)
Oct 26 08:56:06 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled neutron-openvswitch-agent...
...

Oct 26 08:56:26 overcloud-controller-0.localdomain lrmd[18346]: notice: action_complete: Giving up on neutron-openvswitch-agent monitor (rc=196): timeout (elapsed=19987ms, remaining=13ms)
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: error: Operation neutron-openvswitch-agent_stop_0: Timed Out (node=overcloud-controller-0, call=867, timeout=27ms)
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: warning: Action 212 (neutron-openvswitch-agent_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: notice: Transition aborted by neutron-openvswitch-agent_stop_0 'modify' on overcloud-controller-0: Event failed (magic=2:198;212:11:0:78fad70c-5001-4bad-9d19-0f4e43c16e44, cib=0.101.89, source=match_graph_event:381, 0)
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: warning: Action 212 (neutron-openvswitch-agent_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: notice: Transition 11 (Complete=17, Pending=0, Fired=0, Skipped=6, Incomplete=90, Source=/var/lib/pacemaker/pengine/pe-input-96.bz2): Stopped

and in /var/log/neutron/openvswitch-agent.log:
2015-10-26 08:56:02.051 31091 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] Agent tunnel out of sync with plugin!
2015-10-26 08:56:02.052 31091 DEBUG oslo_messaging._drivers.amqpdriver [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] MSG_ID is 35fabb5d2485476fadb76b5bb66017ea _send /usr/lib/python2.7/site-p
ackages/oslo_messaging/_drivers/amqpdriver.py:311
2015-10-26 08:56:02.052 31091 DEBUG oslo_messaging._drivers.amqp [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] UNIQUE_ID is ad1c5791e6ae4a3aa580392e658ce1c0. _add_unique_id /usr/lib/python2.7
/site-packages/oslo_messaging/_drivers/amqp.py:258
2015-10-26 08:56:06.978 31091 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent caught SIGTERM, quitting daemon loop. _handle_sigterm /usr/lib/python2.7/site-packages/neut
ron/plugins/openvswitch/agent/ovs_neutron_agent.py:1669
2015-10-26 08:57:01.514 31091 ERROR neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", lin
e 281, in _report_state
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     self.use_call)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 80, in report_state
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     retry=retry)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID 6039288709e54e5f8ef49d8a67336223
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent 
2015-10-26 08:57:01.515 31091 WARNING neutron.openstack.common.loopingcall [-] task <bound method OVSNeutronAgent._report_state of <neutron.plugins.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent object at 0x39cffd0>> run outlasted interval by 30.01 sec
...
2015-10-26 08:57:02.187 31091 DEBUG neutron.agent.linux.utils [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] Running command (rootwrap daemon): ['kill', '-9', '16157'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:100
2015-10-26 08:57:02.203 31091 DEBUG neutron.agent.linux.utils [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] 
Command: ['kill', '-9', '16157']


Steps to Reproduce:
1. deploy 7.0
2. try update it to 7.1 with command:
openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml

Actual results:
update completes, but "pcs status" on OC controller node shows stopped services

Expected results:
pacemaker services are all running

Comment 2 Jan Provaznik 2015-10-27 20:41:28 UTC

I can confirm that setting stop timeout to 90s solves the issue. For 7.1 I wonder what is the best way to set this timeout *before* package update starts, one option (though not very clean) would be setting/extending the timeout from the yum_update.sh script itself.

Comment 3 Jan Provaznik 2015-10-30 17:02:54 UTC

for updates to 7.1, setting default timeout in yum script solved the issue - extraconfig/tasks/yum_update.sh:

  if [[ "$pacemaker_status" == "active" ]] ; then
+     pcs resource op defaults timeout=120s
      echo "Pacemaker running, stopping cluster node and doing full package update"

It's not a nice place, but assures that pacemaker stop timeout is set before puppet stops/starts services.

Another option would be setting pacemaker into maintenance mode probably as discussed before.

Comment 6 Andrew Beekhof 2015-11-10 03:35:03 UTC

Short version, I wanted to investigate how systemd is stopping this agent. 

The service has Type=simple, so no forking is involved on start, it just calls the ExecStart line:

ExecStart=/usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini --log-file /var/log/neutron/openvswitch-agent.log

The man pages are quiet on what happens for stop, but having looked at the source, it does what you'd expect:
- sends SIGTERM
- starts a timer
- waits for either the process to exit or the timer to go off
- if the timer goes off, it sends SIGKILL

There are no timeouts specified in the service file, so systemd will be using the default, which according to /etc/systemd/system.conf is:

#DefaultTimeoutStopSec=90s

I suspect its not a co-incidence that pacemaker needed the same value.
My guess is that the neutron agent is ignoring SIGTERM and only stopping on SIGKILL.

So I would suggest two things:

1. have someone from the neutron team check that the agent is correctly handling SIGTERM
2. until then, use a timeout of 95s or 100s in pacemaker (to ensure the systemd one can reliably kick in first)

Comment 7 Giulio Fidente 2015-11-10 12:04:30 UTC

hi Andrew, thanks a lot. I will update the upstream change so that it waits more than 90s.

I wonder if it makes sense to extend the pcmk start timeout for the systemd servics to something  >90s as well?

Comment 8 Andrew Beekhof 2015-11-11 23:36:30 UTC

Longer term, I have changed the way systemd agents are described to pcs.
This will cause pcs to automatically use 100s as the default timeout when creating systemd resources.

   https://github.com/beekhof/pacemaker/commit/17d65e9

Comment 12 Leonid Natapov 2015-12-06 10:28:32 UTC

openstack-tripleo-heat-templates-0.8.6-85.el7ost.noarch

default timeout set to 100 sec

Pacemaker::Resource::Service {
    op_params => 'start timeout=100s stop timeout=100s',

Comment 13 Deepti Navale 2015-12-09 01:19:29 UTC

NEEDINFO: Giulio Fidente

This bug is marked for inclusion in RHSA-2015:21886 [1] but does not currently contain draft documentation text. To ensure the timely release of this advisory, please provide draft documentation text for this bug as soon as possible.

If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:

* Select the documentation type from the "Doc Type" drop down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

For further information on this process, refer to "Describing Errata Release and Technical Notes for Engineers" [2]. Thank you for your assistance!

[1] https://errata.devel.redhat.com/advisory/21886
[2] https://engineering.redhat.com/docs/en-US/Policy/70.ecs/html/Describing_Errata_Release_and_Technical_Notes_for_Engineers/index.html

Comment 14 Raoul Scarazzini 2015-12-09 16:22:21 UTC

Further tests showed the need of increasing once more the timeouts, to 120s.

This should be addressed either for start and stop operations. Ask Andrew if you need more specifications about which services needs this setting, but it is in any case safe to globally apply it.

Comment 15 Andrew Beekhof 2015-12-09 22:24:33 UTC

(In reply to Raoul Scarazzini from comment #14)
> Further tests showed the need of increasing once more the timeouts, to 120s.

I've been searching all morning for this bug to say the same thing :-)

We based the 100s recommendation on these values from /etc/systemd/system.conf

  #DefaultTimeoutStartSec=90s
  #DefaultTimeoutStopSec=90s

However when we saw nova misbehave recently (due to something in oslo), it was pretty clear that systemd is using a different value internally (possibly allowing for a grace period):

Dec 04 08:47:59 overcloud-controller-1.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-conductor...
Dec 04 08:49:53 overcloud-controller-1.localdomain systemd[1]: openstack-nova-conductor.service stop-sigterm timed out. Killing.
Dec 04 08:49:53 overcloud-controller-1.localdomain systemd[1]: openstack-nova-conductor.service: main process exited, code=killed, status=9/KILL
Dec 04 08:49:53 overcloud-controller-1.localdomain systemd[1]: Unit openstack-nova-conductor.service entered failed state.
Dec 04 08:49:53 overcloud-controller-1.localdomain systemd[1]: openstack-nova-conductor.service failed.

So the value in practice is more like 114s.

tl;dr - I agree that we need to set to 120s

> 
> This should be addressed either for start and stop operations. Ask Andrew if
> you need more specifications about which services needs this setting, but it
> is in any case safe to globally apply it.

It should be set for any systemd service that doesn't already have a longer one.

Comment 17 errata-xmlrpc 2015-12-21 16:52:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2650