Description of problem: During an update from 7.0 to 7.1, after the update process completes, some services managed by pacemaker are stopped:

[root@overcloud-controller-0 ~]# pcs status|grep -B 1 Stopped
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-controller-0 ]
--
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-controller-0 ]
--
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 ]

Failed Actions:
* openstack-nova-scheduler_monitor_0 on overcloud-controller-0 'OCF_PENDING' (196): call=769, status=complete, exitreason='none', last-rc-change='Mon Oct 26 08:54:37 2015', queued=0ms, exec=2ms
* openstack-nova-consoleauth_monitor_0 on overcloud-controller-0 'OCF_PENDING' (196): call=823, status=complete, exitreason='none', last-rc-change='Mon Oct 26 08:55:21 2015', queued=0ms, exec=4ms
* neutron-openvswitch-agent_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=867, status=Timed Out, exitreason='none', last-rc-change='Mon Oct 26 08:56:26 2015', queued=13ms, exec=2ms
* neutron-server_monitor_60000 on overcloud-controller-0 'OCF_PENDING' (196): call=926, status=complete, exitreason='none', last-rc-change='Mon Oct 26 09:21:19 2015', queued=0ms, exec=0ms

The likely reason is that pacemaker fails to restart the neutron-openvswitch-agent service (the other stopped services depend on it). From the logs, the restart of openvswitch-agent fails because pacemaker's operation timeout is 20 seconds, while stopping this service can take ~60 seconds due to waiting for a "message timeout" caused by the rabbitmq-server restart. A possible solution would be to increase pacemaker's service operation timeout from 20s to >60s.
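For anyone triaging a similar failure, the stopped clone sets can be pulled out of the status output mechanically. This is a minimal sketch; the heredoc below is a sample mirroring the report, and on a real controller you would pipe `pcs status` into the same filter instead:

```shell
# Sample of the `pcs status | grep -B 1 Stopped` output from this report;
# stands in for a live cluster so the filter can be demonstrated anywhere.
pcs_status_sample() {
cat <<'EOF'
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 ]
EOF
}

# Keep only the clone-set names whose next line reports "Stopped:".
pcs_status_sample | grep -B1 'Stopped:' | awk '/Clone Set:/ {print $3}'
```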
Some highlights from the logs:

Oct 26 08:56:06 overcloud-controller-0.localdomain crmd[11953]: notice: Initiating action 212: stop neutron-openvswitch-agent_stop_0 on overcloud-controller-0 (local)
Oct 26 08:56:06 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled neutron-openvswitch-agent...
...
Oct 26 08:56:26 overcloud-controller-0.localdomain lrmd[18346]: notice: action_complete: Giving up on neutron-openvswitch-agent monitor (rc=196): timeout (elapsed=19987ms, remaining=13ms)
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: error: Operation neutron-openvswitch-agent_stop_0: Timed Out (node=overcloud-controller-0, call=867, timeout=27ms)
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: warning: Action 212 (neutron-openvswitch-agent_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: notice: Transition aborted by neutron-openvswitch-agent_stop_0 'modify' on overcloud-controller-0: Event failed (magic=2:198;212:11:0:78fad70c-5001-4bad-9d19-0f4e43c16e44, cib=0.101.89, source=match_graph_event:381, 0)
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: warning: Action 212 (neutron-openvswitch-agent_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Oct 26 08:56:26 overcloud-controller-0.localdomain crmd[11953]: notice: Transition 11 (Complete=17, Pending=0, Fired=0, Skipped=6, Incomplete=90, Source=/var/lib/pacemaker/pengine/pe-input-96.bz2): Stopped

and in /var/log/neutron/openvswitch-agent.log:

2015-10-26 08:56:02.051 31091 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] Agent tunnel out of sync with plugin!
2015-10-26 08:56:02.052 31091 DEBUG oslo_messaging._drivers.amqpdriver [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] MSG_ID is 35fabb5d2485476fadb76b5bb66017ea _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:311
2015-10-26 08:56:02.052 31091 DEBUG oslo_messaging._drivers.amqp [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] UNIQUE_ID is ad1c5791e6ae4a3aa580392e658ce1c0. _add_unique_id /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py:258
2015-10-26 08:56:06.978 31091 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent caught SIGTERM, quitting daemon loop. _handle_sigterm /usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1669
2015-10-26 08:57:01.514 31091 ERROR neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 281, in _report_state
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     self.use_call)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 80, in report_state
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     retry=retry)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID 6039288709e54e5f8ef49d8a67336223
2015-10-26 08:57:01.514 31091 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent
2015-10-26 08:57:01.515 31091 WARNING neutron.openstack.common.loopingcall [-] task <bound method OVSNeutronAgent._report_state of <neutron.plugins.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent object at 0x39cffd0>> run outlasted interval by 30.01 sec
...
2015-10-26 08:57:02.187 31091 DEBUG neutron.agent.linux.utils [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] Running command (rootwrap daemon): ['kill', '-9', '16157'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:100
2015-10-26 08:57:02.203 31091 DEBUG neutron.agent.linux.utils [req-46defde6-495d-49f6-b2ee-93b0730bdbbc ] Command: ['kill', '-9', '16157']

Steps to Reproduce:
1. Deploy 7.0.
2. Try updating it to 7.1 with the command:
   openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml

Actual results: the update completes, but "pcs status" on the OC controller node shows stopped services.

Expected results: all pacemaker-managed services are running.
I can confirm that setting the stop timeout to 90s solves the issue. For 7.1 I wonder what the best way is to set this timeout *before* the package update starts; one option (though not very clean) would be setting/extending the timeout from the yum_update.sh script itself.
For updates to 7.1, setting the default timeout in the yum script solved the issue, in extraconfig/tasks/yum_update.sh:

 if [[ "$pacemaker_status" == "active" ]] ; then
+    pcs resource op defaults timeout=120s
     echo "Pacemaker running, stopping cluster node and doing full package update"

It's not a nice place, but it ensures that the pacemaker stop timeout is set before puppet stops/starts the services. Another option would probably be putting pacemaker into maintenance mode, as discussed before.
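The guard above can be sketched as a self-contained snippet. Note the stubs: the real script derives pacemaker_status from the cluster and calls the real pcs binary, while here both are mocked so the sketch runs anywhere:

```shell
# Sketch of the yum_update.sh workaround: raise the pacemaker resource
# operation default timeout before the package update stops services.
pacemaker_status="active"          # stub; the real script queries the cluster
pcs() { echo "pcs $*"; }           # stub; the real pcs talks to the cluster

update_prep() {
    if [ "$pacemaker_status" = "active" ]; then
        # Raise the default op timeout so slow service stops (~60s+ here)
        # no longer exceed pacemaker's 20s default and fail the stop action.
        pcs resource op defaults timeout=120s
        echo "Pacemaker running, stopping cluster node and doing full package update"
    fi
}

update_prep
```

Since `pcs resource op defaults` only changes the fallback, any resource with an explicit per-operation timeout keeps its own value.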
Short version: I wanted to investigate how systemd stops this agent. The service has Type=simple, so no forking is involved on start; it just calls the ExecStart line:

ExecStart=/usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini --log-file /var/log/neutron/openvswitch-agent.log

The man pages are quiet on what happens on stop, but having looked at the source, it does what you'd expect:
- sends SIGTERM
- starts a timer
- waits for either the process to exit or the timer to go off
- if the timer goes off, it sends SIGKILL

There are no timeouts specified in the service file, so systemd will be using the default, which according to /etc/systemd/system.conf is:

#DefaultTimeoutStopSec=90s

I suspect it's no coincidence that pacemaker needed the same value. My guess is that the neutron agent is ignoring SIGTERM and only stopping on SIGKILL. So I would suggest two things:
1. have someone from the neutron team check that the agent correctly handles SIGTERM
2. until then, use a timeout of 95s or 100s in pacemaker (to ensure the systemd one can reliably kick in first)
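The four-step stop sequence above can be demonstrated in a few lines of shell. This is an illustration, not systemd's code: the child deliberately ignores SIGTERM (mimicking an agent stuck waiting on an RPC reply), and the timeout is shortened to ~2 seconds instead of 90:

```shell
# Child that ignores SIGTERM, like the misbehaving agent.
sh -c 'trap "" TERM; while :; do sleep 1; done' &
pid=$!
sleep 1                            # give the child time to install the trap

kill -TERM "$pid"                  # step 1: send SIGTERM
for i in 1 2; do                   # steps 2-3: timer + wait (~2s here, 90s in systemd)
    kill -0 "$pid" 2>/dev/null || break
    sleep 1
done

msg=""
if kill -0 "$pid" 2>/dev/null; then
    msg="stop-sigterm timed out, killing"
    echo "$msg"
    kill -KILL "$pid"              # step 4: timer expired, send SIGKILL
fi
wait "$pid" 2>/dev/null || true
```

Because the child never honors SIGTERM, the SIGKILL branch is always taken, which is exactly the failure mode suspected for the neutron agent.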
Hi Andrew, thanks a lot. I will update the upstream change so that it waits more than 90s. I wonder if it makes sense to extend the pcmk start timeout for the systemd services to something >90s as well?
Longer term, I have changed the way systemd agents are described to pcs. This will cause pcs to automatically use 100s as the default timeout when creating systemd resources. https://github.com/beekhof/pacemaker/commit/17d65e9
openstack-tripleo-heat-templates-0.8.6-85.el7ost.noarch

Default timeout set to 100 sec:

Pacemaker::Resource::Service {
  op_params => 'start timeout=100s stop timeout=100s',
}
NEEDINFO: Giulio Fidente

This bug is marked for inclusion in RHSA-2015:21886 [1] but does not currently contain draft documentation text. To ensure the timely release of this advisory, please provide draft documentation text for this bug as soon as possible. If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".

To add draft documentation text:
* Select the documentation type from the "Doc Type" drop-down field.
* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

For further information on this process, refer to "Describing Errata Release and Technical Notes for Engineers" [2]. Thank you for your assistance!

[1] https://errata.devel.redhat.com/advisory/21886
[2] https://engineering.redhat.com/docs/en-US/Policy/70.ecs/html/Describing_Errata_Release_and_Technical_Notes_for_Engineers/index.html
Further tests showed the need to increase the timeouts once more, to 120s. This should be addressed for both start and stop operations. Ask Andrew if you need more detail about which services need this setting, but it is in any case safe to apply it globally.
(In reply to Raoul Scarazzini from comment #14)
> Further tests showed the need of increasing once more the timeouts, to 120s.

I've been searching all morning for this bug to say the same thing :-)

We based the 100s recommendation on these values from /etc/systemd/system.conf:

#DefaultTimeoutStartSec=90s
#DefaultTimeoutStopSec=90s

However, when we saw nova misbehave recently (due to something in oslo), it was pretty clear that systemd is using a different value internally (possibly allowing for a grace period):

Dec 04 08:47:59 overcloud-controller-1.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-conductor...
Dec 04 08:49:53 overcloud-controller-1.localdomain systemd[1]: openstack-nova-conductor.service stop-sigterm timed out. Killing.
Dec 04 08:49:53 overcloud-controller-1.localdomain systemd[1]: openstack-nova-conductor.service: main process exited, code=killed, status=9/KILL
Dec 04 08:49:53 overcloud-controller-1.localdomain systemd[1]: Unit openstack-nova-conductor.service entered failed state.
Dec 04 08:49:53 overcloud-controller-1.localdomain systemd[1]: openstack-nova-conductor.service failed.

So the value in practice is more like 114s.

tl;dr - I agree that we need to set it to 120s.

> This should be addressed either for start and stop operations. Ask Andrew if
> you need more specifications about which services needs this setting, but it
> is in any case safe to globally apply it.

It should be set for any systemd service that doesn't already have a longer one.
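The 114s figure comes straight from the two journal timestamps (08:47:59 "Stopping..." to 08:49:53 "stop-sigterm timed out"). A quick check, using GNU date for the timestamp arithmetic:

```shell
# Interval between the "Stopping" and "stop-sigterm timed out" journal
# entries quoted above; -u pins the calculation to UTC so the local
# timezone cannot skew the difference. Requires GNU date (-d).
start=$(date -u -d "2015-12-04 08:47:59" +%s)
end=$(date -u -d "2015-12-04 08:49:53" +%s)
echo "effective stop timeout: $((end - start))s"   # 114s, not the documented 90s
```

That 24s gap over DefaultTimeoutStopSec=90s is consistent with systemd allowing some internal grace period, which is why 100s proved too tight and 120s is the safer pacemaker value.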
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2650