Bug 1290599

Summary: Unable to shutdown controller nodes
Product: Red Hat Enterprise Linux 7
Reporter: Chris Dearborn <christopher_dearborn>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED DUPLICATE
QA Contact: cluster-qe <cluster-qe>
Severity: high
Priority: unspecified
Version: 7.2
CC: abeekhof, arkady_kanevsky, cdevine, christopher_dearborn, cluster-maint, fdinitto, hbrock, jjarvis, John_walsh, kbader, kgaillot, kurt_hey, mburns, morazi, rajini.karthik, randy_perryman, rhel-osp-director-maint, rsussman, sclewis, sreichar, wayne_allen
Target Milestone: pre-dev-freeze
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-05-16 18:14:21 UTC
Bug Blocks: 1261979
Attachments:
  Controller node console hang
  controller0 sos report
  controller1 sos report
  controller2 sos report

Description Chris Dearborn 2015-12-10 23:07:09 UTC
Created attachment 1104514 [details]
Controller node console hang

Description of problem:
I tried to reboot the controller nodes, and all three appeared to hang indefinitely while shutting down.  After 10 minutes I powered them off and back on again.

Version-Release number of selected component (if applicable):
8.0 Beta 2

How reproducible:
Every attempt so far. The node hangs during shutdown when I execute a "reboot" command.

Steps to Reproduce:
1. Deploy the overcloud
2. Try to shutdown the controller nodes

Actual results:
Controller node shutdown hangs at the same place on each node.

Expected results:
Controller nodes should shut down.

Additional info:
All 3 controllers hung at:
A stop job is running for Pacemaker...r Manager (time)

See attached image.

Comment 2 Mike Burns 2015-12-16 14:35:24 UTC
I suspect that this is not a valid use case, though it is a particularly common one.

Since the controllers are configured in a pacemaker cluster, you should first take the node out of the cluster before reboot, then re-add it after it comes back.
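
(For reference, a minimal sketch of that sequence with pcs; the node name is just an example:

pcs cluster stop overcloud-controller-0     # stop pacemaker/corosync on that node before rebooting it
reboot
pcs cluster start overcloud-controller-0    # rejoin the cluster once the node is back up

Alternatively, "pcs cluster standby" / "pcs cluster unstandby" keeps the node in the cluster but moves its resources elsewhere while you work on it.)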

If you're attempting to simulate fencing or HA recovery, there are other ways to do that.  Adding a member of the HA team to comment on that.

Comment 3 Fabio Massimo Di Nitto 2015-12-17 08:18:25 UTC
Andrew is currently on vacation.

Can you please attach sosreports from the 3 controller nodes?

From experience, pacemaker cannot shut down because one of the services is unable to stop, but we need to verify which service and why.

Comment 4 Chris Dearborn 2015-12-17 14:52:25 UTC
Created attachment 1106753 [details]
controller0 sos report

Comment 5 Chris Dearborn 2015-12-17 14:53:21 UTC
Created attachment 1106754 [details]
controller1 sos report

Comment 6 Chris Dearborn 2015-12-17 14:55:03 UTC
Created attachment 1106755 [details]
controller2 sos report

I ended up redeploying the overcloud, but the new deployment exhibits the same behavior.  I attached the requested sosreports from the 3 controllers.

Comment 7 Andrew Beekhof 2016-01-05 22:42:28 UTC
This is a Pacemaker bug.

If you run crm_simulate on:

   sosreport-CDearborn.1290599-20151217094252/sos_commands/cluster/crm_report/overcloud-controller-2.localdomain/pengine/pe-input-5.bz2

you can see that an invalid transition is produced (even with the current upstream master):

Executing cluster transition:
 * Pseudo action:   redis-master_pre_notify_demote_0
 * Resource action: redis           notify on overcloud-controller-2
 * Pseudo action:   redis-master_confirmed-pre_notify_demote_0
Transition failed: terminated
An invalid transition was produced
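
(For reference, the sort of invocation that reproduces this from a saved pe-input file; a sketch, flags as in current crm_simulate:

crm_simulate -S -x pe-input-5.bz2    # -S simulates the transition, -x points at the policy engine input
)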

Comment 9 Andrew Beekhof 2016-01-06 22:53:37 UTC
Actually, not so much.

I went back to check and sure enough, fencing is disabled.
Almost certainly a stop operation failed (probably timed out, possibly due to the oslo bug) at which point the cluster can't continue because fencing, the one thing it could use to make progress, is disabled.

So this looks like an invalid test to me, based on the configuration plus an unreliable resource.

Comment 10 Chris Dearborn 2016-01-06 23:00:25 UTC
But fencing is disabled by default.  It needs to be explicitly turned on.  Shouldn't you be able to shut down the controller nodes when you use the default installation options?

Comment 11 Ken Gaillot 2016-01-08 21:15:43 UTC
In Pacemaker itself, and clusters created by pcs cluster setup, fencing is enabled by default. Scripts that automate cluster deployment may explicitly disable it, but we don't recommend that, partly because of the stop issue discussed here. Also, Red Hat does not support clusters with fencing disabled.

If someone configures a constraint "A after B", that means that A must start after B, *and* A must stop before B. If A fails to stop, what can Pacemaker do? If fencing is available, it can kill the machine, and be certain that A is stopped. But with fencing disabled, it's stuck. There's no safe way to continue the shutdown.

Here's a (facetious) example that illustrates why it's better to hang than to continue shutting down: pcs constraint order reactor_cooling then nuclear_reactor (i.e., nuclear_reactor after reactor_cooling)

In this case, there are a number of non-pacemaker errors in /var/log/messages (numerous python ImportError backtraces, "missing timeouts", etc.) that could be meaningful. There are also repeated messages like:

Dec 17 09:21:06 overcloud-controller-1 heat-engine: 2015-12-17 09:21:06.858 14086 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.22:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.

which is likely the failing stop action that's causing the hang. I'm not sure which "oslo bug" Beekhof is referring to, but that's worth looking into. Also make sure you have the constraints mentioned in BZ#1257414 (easiest way is to check that "pcs resource defaults" includes resource-stickiness=INFINITY).
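
(A quick way to check both settings from any controller node; a sketch using the names discussed above:

pcs property show stonith-enabled    # should be "true" on a supported cluster
pcs resource defaults                # should include resource-stickiness=INFINITY
)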

Comment 12 Chris Dearborn 2016-01-12 20:32:10 UTC
Ok, I went back and looked at the beta docs, and it does clearly state that turning on fencing is the final step in standing up an overcloud.  Mea Culpa.

I went ahead and enabled fencing per the documentation, then tried shutting down a single controller node.  The node did shut down, but it took around 12-15 minutes for it to do so.  I then tried shutting down a second controller node, and it also took 12-15 minutes.

Is this expected?  This seems very long to me.

Comment 13 Ken Gaillot 2016-01-12 20:50:51 UTC
Usually, shutdown is quicker than that, but it depends on the specific resources being used. It should take about the same amount of time as stopping all the services manually would.

After a node is shut down, does "pcs status" on one of the other nodes show any failed actions?

Comment 14 Kurt Hey 2016-01-28 19:37:13 UTC
I did a shutdown -h now on overcloud-controller-1.  Doing a pcs status from controller-2 shows the following failed actions:

Failed Actions:
* neutron-ovs-cleanup_stop_0 on overcloud-controller-1 'unknown error' (1): call=327, status=complete, exitreason='none',
    last-rc-change='Thu Jan 28 14:33:06 2016', queued=0ms, exec=727ms

* rabbitmq_monitor_10000 on overcloud-controller-1 'not running' (7): call=227, status=complete, exitreason='none',
    last-rc-change='Thu Jan 28 14:32:57 2016', queued=0ms, exec=0ms

Comment 16 Fabio Massimo Di Nitto 2016-01-28 20:24:38 UTC
Kurt,

Has the pacemaker configuration been changed since the sosreports were generated?

There has been a series of fixes around stop action timeouts that could affect this problem, but I am not entirely sure which OSPd 8 builds they have landed in.

pcs config should show, for systemd resources: "op stop timeout=200s"

If not, then we need to apply the fix manually. If yes, then the system is just taking a long time to stop all the services, and that might be unrelated to pacemaker. We will need to re-check all the logs for openstack service errors.

Comment 17 Kurt Hey 2016-01-28 20:34:03 UTC
Fabio,

Here is the output of pcs config for systemd:

[root@overcloud-controller-0 ~]# pcs config | grep systemd
  Resource: haproxy (class=systemd type=haproxy)
  Resource: mongod (class=systemd type=mongod)
  Resource: memcached (class=systemd type=memcached)
  Resource: openstack-nova-scheduler (class=systemd type=openstack-nova-scheduler)
  Resource: neutron-l3-agent (class=systemd type=neutron-l3-agent)
  Resource: openstack-ceilometer-alarm-notifier (class=systemd type=openstack-ceilometer-alarm-notifier)
  Resource: openstack-heat-engine (class=systemd type=openstack-heat-engine)
  Resource: openstack-ceilometer-api (class=systemd type=openstack-ceilometer-api)
  Resource: neutron-metadata-agent (class=systemd type=neutron-metadata-agent)
  Resource: openstack-heat-api (class=systemd type=openstack-heat-api)
  Resource: openstack-cinder-scheduler (class=systemd type=openstack-cinder-scheduler)
  Resource: openstack-nova-api (class=systemd type=openstack-nova-api)
  Resource: openstack-heat-api-cloudwatch (class=systemd type=openstack-heat-api-cloudwatch)
  Resource: openstack-ceilometer-collector (class=systemd type=openstack-ceilometer-collector)
  Resource: openstack-keystone (class=systemd type=openstack-keystone)
  Resource: openstack-nova-consoleauth (class=systemd type=openstack-nova-consoleauth)
  Resource: openstack-glance-registry (class=systemd type=openstack-glance-registry)
  Resource: openstack-ceilometer-notification (class=systemd type=openstack-ceilometer-notification)
  Resource: openstack-cinder-api (class=systemd type=openstack-cinder-api)
  Resource: neutron-dhcp-agent (class=systemd type=neutron-dhcp-agent)
  Resource: openstack-glance-api (class=systemd type=openstack-glance-api)
  Resource: neutron-openvswitch-agent (class=systemd type=neutron-openvswitch-agent)
  Resource: openstack-nova-novncproxy (class=systemd type=openstack-nova-novncproxy)
  Resource: neutron-server (class=systemd type=neutron-server)
  Resource: httpd (class=systemd type=httpd)
  Resource: openstack-ceilometer-central (class=systemd type=openstack-ceilometer-central)
  Resource: openstack-ceilometer-alarm-evaluator (class=systemd type=openstack-ceilometer-alarm-evaluator)
  Resource: openstack-heat-api-cfn (class=systemd type=openstack-heat-api-cfn)
 Resource: openstack-cinder-volume (class=systemd type=openstack-cinder-volume)
  Resource: openstack-nova-conductor (class=systemd type=openstack-nova-conductor)


Am I looking at the right thing?

Comment 18 Fabio Massimo Di Nitto 2016-01-28 20:37:07 UTC
(In reply to Kurt Hey from comment #17)
> [root@overcloud-controller-0 ~]# pcs config | grep systemd
> [full systemd resource listing quoted verbatim from comment #17]
> 
> Am I looking at the right thing?

Close enough :) but when you grep for systemd you also need to see the timeouts that are configured. Run pcs config | less, then search for systemd; three or four lines below each resource you will find its timeout configuration.
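
(Roughly equivalent with grep, a sketch; the -A window is a guess at how far below each resource the op lines appear:

pcs config | grep -A4 'class=systemd' | grep -i timeout
)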

Comment 19 Kurt Hey 2016-01-28 20:48:51 UTC
Looks like all the timeouts are 90s or at least most of them.  Nothing at 200s.

Comment 20 Fabio Massimo Di Nitto 2016-01-28 20:57:56 UTC
(In reply to Kurt Hey from comment #19)
> Looks like all the timeouts are 90s or at least most of them.  Nothing at
> 200s.

Ok then the version of OSPd you guys used doesn't have the fix yet.

You can bump it manually for each service with:

pcs resource update <name> op stop timeout=200s
pcs resource update <name> op monitor timeout=200s
pcs resource update <name> op start timeout=200s

where name is the systemd resource. Example:

  Resource: neutron-openvswitch-agent (class=systemd type=neutron-openvswitch-agent)

name is neutron-openvswitch-agent (or awk '{print $2}').
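
A rough sketch of applying that to every systemd resource in one go, building on the awk hint above (untested here, so eyeball the resource list it produces first):

for name in $(pcs config | awk '/class=systemd/ {print $2}'); do
    pcs resource update "$name" op stop timeout=200s
    pcs resource update "$name" op monitor timeout=200s
    pcs resource update "$name" op start timeout=200s
done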

Fabio

Comment 21 Chris Dearborn 2016-01-28 21:57:55 UTC
I just tried this too with beta 4.  It took about 12-15 minutes to shut down a controller node.  pcs status was clean before the shutdown.  pcs status after the shutdown hung for a minute or two when trying to get the status of the controller that was down, but the status came up clean:

...
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     Started: [ overcloud-controller-0 overcloud-controller-1 ]
     Stopped: [ overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 ]
     Stopped: [ overcloud-controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-1
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 ]
     Stopped: [ overcloud-controller-2 ]

PCSD Status:
  overcloud-controller-0: Online
  overcloud-controller-1: Online
<hang was here>
  overcloud-controller-2: Offline

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled

Comment 22 Fabio Massimo Di Nitto 2016-01-28 22:24:29 UTC
(In reply to Chris Dearborn from comment #21)
> I just tried this too with beta 4.  It did take like 12-15 minutes to
> shutdown a controller node.  pcs status was clean before the shutdown.

Ok, I don't know why it's taking that long. Can you get me sosreports from the Beta 4 please?

>  pcs
> status after the shutdown hung for a minute or 2 when trying to get the
> status of the controller that was down,

This is expected. The pcs-to-pcsd connection towards that controller has to time out (TCP). We do have a bug open to improve that.

> but the status came up clean:
> [pcs status output quoted verbatim from comment #21]

Ok perfect!

Comment 23 Andrew Beekhof 2016-01-29 01:42:43 UTC
(In reply to Fabio Massimo Di Nitto from comment #22)
> (In reply to Chris Dearborn from comment #21)
> > I just tried this too with beta 4.  It did take like 12-15 minutes to
> > shutdown a controller node.  pcs status was clean before the shutdown.
> 
> Ok, I don´t know why it´s taking that long. Can you get me sosreports from
> the Beta 4 please?

We can look at the sosreports, but if you're being affected by the oslo bug (almost certain if bumping timeouts to 200s helped), then 15 minutes will disappear really fast if 5 or 6 systemd services are using their maximum of 190s.

Increasing the timeout just avoids failed actions by waiting long enough for systemd to finish handling the stop. So the lack of failed actions doesn't mean the service shut down cleanly or in a timely manner.
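
(As a rough sanity check on that arithmetic: with ordering constraints serializing the stops, 4 x 190s is roughly 13 minutes and 5 x 190s is roughly 16 minutes, so a 12-15 minute shutdown is consistent with a handful of services each running out their stop timeout.)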

Comment 24 Ken Gaillot 2016-01-29 20:46:11 UTC
Most likely this is a duplicate of BZ#1288528, an oslo.service bug whose fix has recently passed QA but not been incorporated into 8.0 yet. The signature is log messages containing "Cannot switch to MAINLOOP from MAINLOOP", which we do see in the original logs here.

New reports from the beta 4 deployment would still be helpful to verify that those messages are still involved, and to check whether any additional issues may be relevant.

Besides the logs, probably the best course is to keep the higher timeouts (and unfortunately long shutdown time) until the fix makes it into a new beta. If that is undesirable, from the description of the bug, it sounds like systemd unit file overrides for the affected daemons with KillMode=process might work around the issue, but that has not been tested here.
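
As an untested sketch of that KillMode=process override for one affected daemon (the service name here is only an example):

mkdir -p /etc/systemd/system/openstack-nova-api.service.d
cat > /etc/systemd/system/openstack-nova-api.service.d/killmode.conf <<'EOF'
[Service]
KillMode=process
EOF
systemctl daemon-reload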

Comment 25 Ken Gaillot 2016-05-16 17:28:30 UTC
The fix for BZ#1288528 was released 2016-04-07 as part of 8.0. Have you had a chance to test this behavior with that fix deployed?

Comment 26 Chris Dearborn 2016-05-16 18:08:27 UTC
Just tested this with OSP 8 GA, and I was able to shut down a controller node in just over 2 minutes, which seems very reasonable.  Feel free to close this one out.  Thanks for all the help!

Comment 27 Ken Gaillot 2016-05-16 18:14:21 UTC

*** This bug has been marked as a duplicate of bug 1288528 ***