Bug 1430112 - [Intel 7.4 Bug] Pacemaker occasionally takes minutes to respond
Summary: [Intel 7.4 Bug] Pacemaker occasionally takes minutes to respond
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 7.4
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1296649 1384256 1446211
 
Reported: 2017-03-07 21:51 UTC by Kelsey Prantis
Modified: 2017-08-01 17:54 UTC (History)
CC List: 12 users

Fixed In Version: pacemaker-1.1.16-7.el7
Doc Type: Bug Fix
Doc Text:
Cause: A node that is joining the cluster did not know whether the cluster's current DC was in the process of shutting down. Consequence: If the joining node became the new DC, it would fence the old DC unnecessarily. Fix: The DC now notifies joining nodes whether it is shutting down, and the joining nodes update the DC's expected state accordingly. Result: No fencing occurs in this situation.

Cause: The cluster would not calculate new resource placement if any node had too many fence failures previously. Consequence: If a node had too many fence failures, the cluster might not react immediately to configuration changes later, such as enabling and disabling resources. Fix: The cluster skips recalculating resource placement due to too many fence failures only if the triggering event was another fence failure. Result: Resource placement occurs after a configuration change, even if a node previously had too many fence failures.
Clone Of:
Environment:
Last Closed: 2017-08-01 17:54:39 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Tar of the corosync.log and messages logs from the HA pair. (1.29 MB, application/x-gzip)
2017-03-07 21:51 UTC, Kelsey Prantis
no flags Details
testfs-OST0000_749fba_failed_to_stop_messages_log (58.52 KB, application/rtf)
2017-03-27 22:36 UTC, Kelsey Prantis
no flags Details
testfs-OST0000_2c0452_failed_to_stop_logs.tar.gz (43.79 KB, application/x-gzip)
2017-03-28 17:12 UTC, Kelsey Prantis
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1862 0 normal SHIPPED_LIVE pacemaker bug fix and enhancement update 2017-08-01 18:04:15 UTC

Description Kelsey Prantis 2017-03-07 21:51:01 UTC
Created attachment 1260990 [details]
Tar of the corosync.log and messages logs from the HA pair.

Description of problem:
We have an internal automated test case that stops a pair of resources in an active-active configuration on a 2-node HA cluster, like so:

crm_resource -r <resource_to_stop> -p target-role -m -v Stopped

Usually this completes in a matter of seconds. But every so often, the time between when we issue this command and when the stop succeeds for one of the resources takes a few minutes instead. We were looking into this today (as it causes us spurious test failure timeouts).
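For illustration, a minimal sketch of the kind of parallel stop the test performs (the resource-name suffixes vary per run, so the names below are just examples taken from this bug's logs):

    # Stop both resources at (nearly) the same time, then watch corosync.log
    # for how long it takes the stop actions to be initiated.
    for rsc in testfs-OST0000_775020 testfs-OST0001_c9a5cc; do
        crm_resource -r "$rsc" -p target-role -m -v Stopped &
    done
    wait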

You can see that we issue the crm_resource command at 11:07:26:

[07/Mar/2017:11:07:26] console DEBUG Shell.run: ['crm_resource', '-r', u'testfs-OST0001_c9a5cc', '-p', 'target-role', '-m', '-v', 'Stopped']  # from our logging

But then there is nothing in /var/log/cluster/corosync.log until 11:11:58:

Mar 07 11:11:58 [29918] lotus-33vm7       crmd:   notice: te_rsc_command:       Initiating stop operation testfs-OST0001_c9a5cc_stop_0 on lotus-33vm8.lotus.hpdd.lab.intel.com | action 5

and the command then finishes stopping at 11:12:00:

Mar 07 11:12:00 [25058] lotus-33vm8       crmd:   notice: process_lrm_event:    Result of stop operation for testfs-OST0001_c9a5cc on lotus-33vm8.lotus.hpdd.lab.intel.com: 0 (ok) | call=17 key=testfs-OST0001_c9a5cc_stop_0 confirmed=true cib-update=18

When we look in /var/log/cluster/corosync.log, there is a weird multi-minute gap on both nodes in the 2 node cluster with absolutely no messages in the logs:

Mar 07 11:07:44 [29913] lotus-33vm7        cib:     info: cib_process_ping:     Reporting our current digest to lotus-33vm7.lotus.hpdd.lab.intel.com: 2a856930e7a42960911490716c2883d4 for 0.34.1 (0x7f3c3f33f280 0)
Mar 07 11:11:58 [29913] lotus-33vm7        cib:     info: cib_perform_op:       Diff: --- 0.34.1 2

This log is usually very chatty, but you can see it went silent for 4 minutes and 14 seconds after we sent the crm_resource command but before the action ends up being initiated. /var/log/messages similarly shows no pacemaker/corosync activity in this time window.

Version-Release number of selected component (if applicable):

pacemaker-1.1.15-11.el7_3.2.x86_64
corosync-2.4.0-4.el7.x86_64

How reproducible:

Today it occurred twice in 13 test runs. Each test runs the stop command on the 2 resources on the 2 node cluster.

Steps to Reproduce:
1. Initiate 'crm_resource -r <resource_to_stop> -p target-role -m -v Stopped' on 2 resources in parallel on their respective nodes.
2. Measure how long it takes for the stop action to be initiated in corosync.log.

Actual results:

pacemaker shows no sign of activity for multiple minutes, and then resumes processing the command given.

Expected results:

pacemaker begins initiating the action <30 seconds after the command is run.

Additional info:
corosync.log and messages from both nodes attached.

If this is expected behavior, please let us know.

Comment 2 Ken Gaillot 2017-03-08 20:16:18 UTC
The sequence of events is:

1. crm_resource disables testfs-OST0000_775020, which aborts the transition
2. A new transition begins, which schedules and initiates the stop for that resource
3. crm_resource disables testfs-OST0001_c9a5cc, which aborts the transition
4. The result of the initial stop is recorded
5. The initial new transition completes

That last step seems wrong. I'd expect a new transition to begin instead. I do see some unusual messages in the logs, including:

  crmd:  warning: stonith_api_device_metadata:  Could not obtain metadata for fence_chroma

Without metadata, the cluster will not know what actions the device supports, and it will not be able to dynamically query what nodes the device can fence. If you have control over the agent code, I'd recommend adding metadata. Existing agents can provide some examples.
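For reference, a quick way to compare what agents report, sketched on the assumption that a shipped agent such as fence_ipmilan is installed and that fence_chroma follows the usual fence-agent stdin convention (the same convention as the list invocation suggested later in this bug):

    # A packaged agent prints XML metadata describing its parameters and
    # supported actions:
    fence_ipmilan -o metadata | head -n 20
    # The same query against the custom agent, via stdin:
    echo -e "action=metadata\n" | fence_chroma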

At 10:35:00, pacemaker on node vm8 shuts down quickly and restarts (I'm not sure why). That leads node vm7 to want to fence it:

Mar 07 10:35:06 [23484] lotus-33vm7    pengine:  warning: pe_fence_node:        Node lotus-33vm8.lotus.hpdd.lab.intel.com will be fenced because our peer process is no longer available


However, it is not able to find a suitable fence device:

Mar 07 10:35:08 [23481] lotus-33vm7 stonith-ng:     info: call_remote_stonith:  None of the 1 peers have devices capable of fencing (reboot) lotus-33vm8.lotus.hpdd.lab.intel.com for crmd.23485 (0)


This may be related to the metadata issue. If neither pcmk_host_list nor pcmk_host_map is specified for the device, then without metadata the cluster will not know whether the device can fence a node. Fencing fails, and the failure recurs across multiple transitions until the cluster finally gives up:

Mar 07 10:41:31 [23485] lotus-33vm7       crmd:   notice: too_many_st_failures: No devices found in cluster to fence lotus-33vm8.lotus.hpdd.lab.intel.com, giving up

The "giving up" is to prevent the cluster from infinitely retrying a failing fence device. My current suspicion is that this is preventing the cluster from scheduling a new transition when it should for other reasons.

If you can, try adding pcmk_host_list to the fence device, and see if the problem still happens.
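A minimal sketch of doing that with the same tooling the test already uses; the device name st-fencing is an example (it appears in the CIB snippets elsewhere in this bug), so adjust it if your configuration differs:

    # Set pcmk_host_list as an instance attribute on the fence device.
    crm_resource -r st-fencing -p pcmk_host_list \
        -v "lotus-33vm7.lotus.hpdd.lab.intel.com lotus-33vm8.lotus.hpdd.lab.intel.com"
    # Equivalent if the cluster is managed with pcs:
    # pcs stonith update st-fencing pcmk_host_list="lotus-33vm7.lotus.hpdd.lab.intel.com lotus-33vm8.lotus.hpdd.lab.intel.com"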

Comment 3 Kelsey Prantis 2017-03-09 18:36:43 UTC
Hi Ken, sorry for the slow response, each of our tests takes ~20 minutes, so it can take a while to reproduce. It's still looping to reproduce it as I type.

We definitely are not setting pcmk_host_list currently - the Pacemaker documentation says that it is optional, and our fence agent supports the list action to get a list of fenceable nodes. Could you perhaps expound a bit on why you think it's necessary in this instance?

As far as interpreting messages in the log, I do want to be clear that the test actually sets up the pacemaker configuration from scratch each run, so there also can be messages fairly nearby that may be occurring as the cluster is being set up (fencing agent being added, etc).

And those "No metadata for fence_chroma" messages you noted are definitely in the window the cluster is being configured initially, in fact, you can see they are actually output the same second that we are configuring the fencing device, and listed just prior (from corosync_vm7.log):

25947 Mar 07 11:02:48 [29918] lotus-33vm7       crmd:  warning: stonith_api_device_metadata:  Could not obtain metadata for fence_chroma
25948 Mar 07 11:02:48 [29918] lotus-33vm7       crmd:    error: build_operation_update:       No metadata for stonith::(null):fence_chroma

...

25980 Mar 07 11:02:48 [29913] lotus-33vm7        cib:     info: cib_perform_op:       +  /cib/status/node_state[@id='2']:  @crm-debug-origin=do_lrm_query_internal
25981 Mar 07 11:02:48 [29913] lotus-33vm7        cib:     info: cib_perform_op:       ++ /cib/status/node_state[@id='2']:  <lrm id="2"/>
25982 Mar 07 11:02:48 [29913] lotus-33vm7        cib:     info: cib_perform_op:       ++                                     <lrm_resources>
25983 Mar 07 11:02:48 [29913] lotus-33vm7        cib:     info: cib_perform_op:       ++                                       <lrm_resource id="st-fencing" type="fence_chroma" class="stonith">
25984 Mar 07 11:02:48 [29913] lotus-33vm7        cib:     info: cib_perform_op:       ++                                         <lrm_rsc_op id="st-fencing_last_0" operation_key="st-fencing_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="      3.0.10" transition-key="4:6:0:ca2ea403-e936-4d4d-94fd-4d7c2d2be6b3" transition-magic="0:0;4:6:0:ca2ea403-e936-4d4d-94fd-4d7c2d2be6b3" on_node="lotus-33vm7.lotus.hpdd.lab.intel.com" call-id="6" rc-code="0" op-status="0" interval="0" last-run="1488913346"
25985 Mar 07 11:02:48 [29913] lotus-33vm7        cib:     info: cib_perform_op:       ++                                       </lrm_resource>
25986 Mar 07 11:02:48 [29913] lotus-33vm7        cib:     info: cib_perform_op:       ++                                     </lrm_resources>
25987 Mar 07 11:02:48 [29913] lotus-33vm7        cib:     info: cib_perform_op:       ++                                   </lrm>

I had assumed this might be some sort of transient state as our configuration is being added internal to pacemaker. You never see that message again after that fencing agent has completed being added.

Comment 4 Ken Gaillot 2017-03-09 20:36:30 UTC
(In reply to Kelsey Prantis from comment #3)
> We definitely are not setting pcmk_host_list currently - the Pacemaker
> documentation says that it is optional, and our fence agent supports the
> list action to get a list of fenceable nodes. Could you perhaps expound a
> bit why we think its necessary in this instance?

rgmanager-based clusters assumed that all fence agents supported certain actions, so they didn't need or use agent metadata.

Pacemaker, however, uses the agent metadata to detect what actions are supported by each agent. This allows some agents to support certain actions while others don't. But it means that agents that don't provide metadata will only be assumed to support "off".

So, in this case, Pacemaker can't know that the agent supports "list", so it assumes the agent doesn't. Without pcmk_host_list, Pacemaker can't know which nodes are fenceable.

> And those "No metadata for fence_chroma" messages you noted are definitely
> in the window the cluster is being configured initially, in fact, you can
> see they are actually output the same second that we are configuring the
> fencing device, and listed just prior (from corosync_vm7.log):

Right -- Pacemaker checks for metadata when the device is configured. If it can't load the metadata at that time, it assumes minimal supported actions (basically just "off").

I am still investigating possible solutions for the stop action not taking effect immediately. However, fixing that will still leave the underlying issue that Pacemaker can't use the fence device without either metadata or pcmk_host_list.

Comment 5 Brian J. Murrell 2017-03-09 20:45:24 UTC
Ken,

The fence agent in question most certainly does (or is supposed to, modulo any bugs) support the metadata action as well as the list action.  I think what Kelsey was trying to point out in comment #3 is that pacemaker was complaining about the agent not providing metadata even before or while the CIB was being populated with the fence agent resource.

Perhaps this is just some bad timing/ordering in the log, but why would pacemaker be reporting anything about the fence agent before it was even added to the CIB?

It would be interesting to see why pacemaker thinks it can't get metadata when it says:

crmd:  warning: stonith_api_device_metadata:  Could not obtain metadata for fence_chroma

such as the fence_agent's return code and/or any stdout/stderr that it might have reported trying to get the metadata.  Is any of that possible?

Comment 6 Kelsey Prantis 2017-03-09 21:45:35 UTC
And to be clear, we are adding it; we're just trying to understand the why as well, given that our fence agent does have a metadata action.

Comment 7 Ken Gaillot 2017-03-09 22:20:23 UTC
(In reply to Brian J. Murrell from comment #5)
> Ken,
> 
> The fence agent in question most certainly does (or is supposed to, modulo
> any bugs) support the metadata action as well as the list action.  I think
> what Kelsey was trying to point out in comment #3 is that pacemaker was
> complaining about the agent not providing metadata even before or while the
> CIB was being populated with the fence agent resource.
> 
> Perhaps this is just some bad timing/ordering in the log, but why would
> pacemaker be reporting anything about the fence agent before it was even
> added to the CIB?
> 
> It would be interesting to see why pacemaker thinks it can't get metadata
> when it says:
> 
> crmd:  warning: stonith_api_device_metadata:  Could not obtain metadata for
> fence_chroma
> 
> such as the fence_agent's return code and/or any stdout/stderr that it might
> have reported trying to get the metadata.  Is any of that possible?

Ah, I missed that. Looking at it again, I see I missed the significance that it's the crmd that can't read the metadata, not stonithd. That shouldn't cause any serious problems, because crmd only uses metadata to detect whether parameter changes result in a reload or restart.

That could be a permissions or SELinux issue -- stonithd runs as root, while crmd runs as hacluster.


So, that's unrelated to why stonith failed, and looking at that again, I can see that it is indeed doing a dynamic list, but not seeing what it's looking for in the result:

Mar 07 10:35:08 [23481] lotus-33vm7 stonith-ng:     info: call_remote_stonith:  None of the 1 peers have devices capable of fencing (reboot) lotus-33vm8.lotus.hpdd.lab.intel.com for crmd.23485 (0)

Can you try running list manually (e.g. echo -e "action=list\n" | fence_chroma, adding any other necessary parameters to the echo)? If it doesn't show the desired node at all, try to figure out whether it's a configuration issue, network issue, or what. If it shows the node but by a different name, then either pcmk_host_list or pcmk_host_map should fix the issue.

Comment 8 Kelsey Prantis 2017-03-09 22:27:43 UTC
[root@lotus-33vm7 ~]#  fence_chroma -o list
lotus-33vm7.lotus.hpdd.lab.intel.com,
lotus-33vm8.lotus.hpdd.lab.intel.com,

That is what it outputs currently. I will also check specifically when it next reproduces, to check for variance.

It is also worth pointing out that only messages starting at 10:51:19 and onward should be relevant. This test is cycling repeatedly until first failure, so the messages prior are from previous test runs that passed just fine. The pacemaker/corosync configuration is completely wiped out between test runs, and created anew as part of the test.

Comment 9 Ken Gaillot 2017-03-09 22:53:26 UTC
(In reply to Kelsey Prantis from comment #8)
> [root@lotus-33vm7 ~]#  fence_chroma -o list
> lotus-33vm7.lotus.hpdd.lab.intel.com,
> lotus-33vm8.lotus.hpdd.lab.intel.com,
> 
> Is on it currently. I will check specifically when it reproduces next as
> well to check for variance.

Odd, that should work fine. To check for variance, you'd have to run it when fencing is needed, not when the stop issue happens, because by then it's given up on fencing.

> It is also worth pointing out that only messages starting at 10:51:19 and
> onward should be relevant. This test is cycling repeatedly until first
> failure, so the messages prior are from previous test runs that passed just
> fine. The pacemaker/corosync configuration is completely wiped out between
> test runs, and created anew as part of the test.

Unless pacemaker is restarted, crmd will remember how many times it has failed to fence a target, and give up trying after 10 failures. The failure count is tracked independently by target name, so it is remembered regardless of the configuration.

Comment 10 Ken Gaillot 2017-03-09 23:08:17 UTC
(In reply to Ken Gaillot from comment #9)
> Unless pacemaker is restarted, crmd will remember how many times it has
> failed to fence a target, and give up trying after 10 failures. The failure
> count is tracked independently by target name, so it is remembered
> regardless of the configuration.

Though that's not entirely relevant here :) Obviously, if you're wiping the corosync config, you're restarting pacemaker/corosync as well, so the 10 limit won't come into play.

However, crmd does also remember whether no devices were found to fence the target, using the same mechanism. I believe it currently remembers that even if the device configuration or status changes, though it probably should forget it in such a case.

The "None of the 1 peers" message recurred at 11:02:41, and that was what led (indirectly) to the stop issue at 11:07.

Comment 11 Kelsey Prantis 2017-03-09 23:30:52 UTC
To be clear, by wipe everything, I mean:

service pacemaker stop
service corosync stop
rm -f /etc/corosync/corosync.conf
rm -f /var/lib/pacemaker/cib/*
rm -f /var/lib/corosync/*

Then we configure brand new resources. (You'll see the characters after the underscore in the target names change from run to run, e.g. testfs-OST0000_775020.)

So, if I understand correctly, anything from prior to that wipe should be obliterated and not remembered.

Looking at the time you mentioned, we see a failed attempt to stonith lotus-33vm8 at 11:02:41, which I suppose is not all that surprising, since we don't see the fencing device get configured until 11:02:48 (and before any resources are added).

Does crmd "remember" that no devices were found to fence a target even if new fencing devices have been added? I could see an argument that a new fencing device being added ought to "reset" any memory of that...

I wonder what is causing it to decide it needs to fence the other node when they are just forming up into a cluster like that.

Comment 12 Ken Gaillot 2017-03-10 00:07:52 UTC
(In reply to Kelsey Prantis from comment #11)
> Looking at the time you mentioned, we see a failed attempt to stonith
> lotus-33vm8 at 11:02:41, which is I suppose not all that surprising it
> couldn't stonith since we don't see the fencing device get configured until
> 11:02:48. (And before any resources are added.)
> 
> Does crmd "remember" that no devices were found to fence a target even if
> new fencing devices have been added? I could see an argument that a new
> fencing device being added ought to "reset" any memory of that...

Exactly ... I was just talking with Andrew Beekhof about that situation. We are converging on a plan:

1. Currently "no devices" is considered a permanent failure after 1 failure rather than 10. We'll get rid of that special handling so it's permanent after 10 like any other stonith failure.

2. Currently the remembered failures are only cleared when fencing succeeds. That means if it hits the 10 limit, it's permanently remembered. We'll try to figure out some additional points to clear it, possibly when any fence device configuration changes, any fence device is cleaned up, and/or a target peer is purged from the cluster configuration.

3. Currently the 10 failure limit prevents any aborted transition from triggering a new one. We will change that so it only affects transitions aborted due to stonith failures. If the transition is aborted for some other reason (like the resource being disabled in this case), a new transition will still be triggered.
 
> I wonder what is causing it to decide it needs to fence the other node when
> they are just forming up into a cluster like that.

First, pacemaker gets asked to stop on vm8 (could be by systemctl stop, system shutdown, kill command, etc.):

Mar 07 11:02:11 [19861] lotus-33vm8 pacemakerd:   notice: crm_signal_dispatch:  Caught 'Terminated' signal | 15 (invoking handler)
Mar 07 11:02:11 [19861] lotus-33vm8 pacemakerd:   notice: pcmk_shutdown_worker: Shutting down Pacemaker

This occurs while the two peers are just forming a cluster, and vm8 is the DC. Normally, when the DC shuts down, it notifies all peers, so they don't fence it. But since vm7 had not finished joining the cluster by this point, it looks like that notification wasn't sent (since vm7 wasn't a full peer yet). So, vm8 shuts down at 11:02:32, and vm7 sees it go away, takes over DC, and fences it.

It's a timing fluke. I'll look into whether there's some way we could pass that information even before the joining node is a full peer, but it may just be a "better safe than sorry" corner case.

Comment 13 Kelsey Prantis 2017-03-10 19:29:18 UTC
These sound like they'd be some excellent improvements.

I also am wondering, is there a particular value to stonithing a node when there are no resources configured yet? Not sure if that was a valid use case for some users, or perhaps a way to avoid general raciness in setup to begin with.

Lastly, is there something we could do to work around this in the meantime? Perhaps a command that would "clear" any failed attempts, which we could call after we configure the fencing device?

Comment 14 Ken Gaillot 2017-03-13 15:32:15 UTC
(In reply to Kelsey Prantis from comment #13)
> These sound like they'd be some excellent improvements.
> 
> I also am wondering, is there a particular value to stonithing a node when
> there are no resources configured yet? Not sure if that was a valid use case
> for some users, or perhaps a way to avoid general raciness in setup to begin
> with.

Yes, it is the safest route -- even if no resources are configured, the other node could be running some service by mistake, and if we can't communicate with it, we can only fence it to be sure.
 
> Lastly, is there something we could do to workaround in the meantime?
> Perhaps a command that would "clear" any failed attempts we could call after
> we configure the fencing device?

Unfortunately, not without disruption.

There are currently only two ways to clear a "permanent" stonith fail count: you can either put the current DC in standby and restart pacemaker on it, or fence the target node manually with stonith_admin.
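Sketched as commands, using the node names from this bug's cluster as examples and assuming pcs is available:

    # Option 1: put the current DC in standby, restart pacemaker on it
    # (dropping the DC's remembered stonith failures), then unstandby it.
    pcs cluster standby lotus-33vm7.lotus.hpdd.lab.intel.com
    systemctl restart pacemaker
    pcs cluster unstandby lotus-33vm7.lotus.hpdd.lab.intel.com

    # Option 2: fence the target node manually.
    stonith_admin --reboot lotus-33vm8.lotus.hpdd.lab.intel.com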

Comment 15 Kelsey Prantis 2017-03-14 18:52:50 UTC
Ok, thanks. Until the changes you mentioned above are done, we will extend the timeouts in our application to allow for Pacemaker occasionally taking longer to stop/start targets when this happens. We can then back that out of our application code once it is no longer necessary (as it does delay our users being informed of a real failure). It'll still be a race (unless there is some maximum time you can tell us about), but we can at least make it less likely to be hit.

Comment 16 Ken Gaillot 2017-03-14 18:59:31 UTC
There is a workaround for your use case. If you run "crm_resource --wait --timeout 60s" between the two crm_resource commands, it will wait until the first command's effects are done (or at most one minute) before starting the second. That way, the second won't abort an in-progress transition -- it'll be an entirely new transition.
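In terms of the commands used earlier in this bug, the sequence would look roughly like this:

    crm_resource -r testfs-OST0000_775020 -p target-role -m -v Stopped
    # Wait (at most 60s) for the transition started by the first disable to
    # finish, so the second disable begins a fresh transition instead of
    # aborting one in progress.
    crm_resource --wait --timeout 60s
    crm_resource -r testfs-OST0001_c9a5cc -p target-role -m -v Stopped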

Comment 17 Kelsey Prantis 2017-03-14 19:19:48 UTC
Excellent. We will give that a go.

Does this affect any crm_resource command after the failed stonith attempt? Or just specifically stopping?

Comment 18 Ken Gaillot 2017-03-14 19:40:59 UTC
crm_resource --wait simply looks for any cluster activity, so it applies to any change. If other changes are happening at the same time (failures, nodes joining, etc.), it will continue waiting; that is why I included the 60-second timeout, just in case.

It serves as a workaround here because the failed stonith only causes a problem when a transition in progress is aborted. The wait ensures that the transition started by the first crm_resource command completes before the second command runs, so the second command doesn't abort it.

Comment 19 Kelsey Prantis 2017-03-15 15:33:36 UTC
So I notice this raises an error if the time expires: "Error performing operation: Timer expired". Can I confirm that what we're supposed to do in that case then is just ignore the error and move on?

Comment 20 Ken Gaillot 2017-03-15 16:47:14 UTC
(In reply to Kelsey Prantis from comment #19)
> So I notice this raises an error if the time expires: "Error performing
> operation: Timer expired". Can I confirm that what we're supposed to do in
> that case then is just ignore the error and move on?

Correct. It leaves a small chance that you'll end up with the stop issue, but I expect that 60s should be long enough for the initial transition to complete, so I don't think it will be a problem in practice. To fine-tune it, instead of 60s, you could use whatever your stop timeout is for your resource plus a margin of maybe 10-30s.
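For example, with a hypothetical 60-second stop timeout on the resource plus a 30-second margin:

    crm_resource --wait --timeout 90s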

Comment 21 Kelsey Prantis 2017-03-27 15:55:12 UTC
We seem to be able to experience these delayed stops even with the recommended workaround. Specifically, we call:

crm_resource --wait --timeout 300s

This executes and exits fine with a 0 return code; it does not reach the timeout but exits within a couple of seconds. We then immediately run:

crm_resource -r testfs-OST0000_749fba -p target-role -m -v Stopped

And then begin waiting for the target to stop. Just over 4 minutes later our test times out and the target still hasn't stopped, then a few more seconds past that it stops.

Comment 22 Ken Gaillot 2017-03-27 22:00:07 UTC
(In reply to Kelsey Prantis from comment #21)
> We seem to be able to experience these delayed stops even with the
> recommended work around. Specifically, we call:
> 
> crm_resource --wait --timeout 300s
> 
> Which executes and exits fine with a 0 return code, it does not reach the
> time but exits within a couple of seconds. We then immediately run:
> 
> crm_resource -r testfs-OST0000_749fba -p target-role -m -v Stopped
> 
> And then begin waiting for the target to stop. Just over 4 minutes later our
> test times out and the target still hasn't stopped, then a few more seconds
> past that it stops.

Hmm, that's unexpected. I'd be interested in looking at the logs around that incident, but we may just need the fixes described in Comment 12.

Comment 23 Kelsey Prantis 2017-03-27 22:35:49 UTC
Unfortunately our CI isn't capturing the full corosync.log from test runs (which we should amend), but I will attach the section from /var/log/messages that correlates to the test run. I will try to catch it locally to get a full corosync.log if that is helpful, but it may take a couple of days to reproduce. It does seem to be happening less frequently locally at least, though unclear if that is just because the timing is slightly changed with the extra command, as it was often finishing moments after the timeout previously.

A little bit of a timeline:

1) at 08:25:15 we run crm_resource --wait --timeout 300s, which exits with a 0:

[25/Mar/2017:08:25:16] console DEBUG Shell.run: ['crm_resource', '--wait', '--timeout', '300s']

2) at 08:25:17 we run the command to stop the resource:

[25/Mar/2017:08:25:17] console DEBUG Shell.run: ['crm_resource', '-r', u'testfs-OST0000_749fba', '-p', 'target-role', '-m', '-v', 'Stopped']

3) Our test waits for the resource to stop, timing out at 08:29:24, 4 minutes and 7 seconds after the command to stop.

4) The resource finally stops at 08:29:45, 21 seconds past our timeout.

Attaching messages as testfs-OST0000_749fba_failed_to_stop_messages_log.

Comment 24 Kelsey Prantis 2017-03-27 22:36:50 UTC
Created attachment 1266774 [details]
testfs-OST0000_749fba_failed_to_stop_messages_log

Comment 25 Kelsey Prantis 2017-03-28 17:12:00 UTC
Ok, I've reproduced it again locally, so I can capture the full corosync.log. Here is the timeline for the new reproduction:

1) at 16:32:14 we call crm_resource --wait --timeout 300s, which exits with a 0:

[27/Mar/2017:16:32:14] console DEBUG Shell.run: ['crm_resource', '--wait', '--timeout', '300s']

2) at 16:32:15 we run the command to stop the resource:

[27/Mar/2017:16:32:15] console DEBUG Shell.run: ['crm_resource', '-r', u'testfs-OST0000_2c0452', '-p', 'target-role', '-m', '-v', 'Stopped']

3) Our test waits for the resource to stop, timing out at 16:36:18, 4 minutes and 3 seconds after the command to stop.

4) The resource finally stops at 16:36:45, 27 seconds past our timeout.

Attaching messages and corosync.log from that test run as testfs-OST0000_2c0452_failed_to_stop_logs.tar.gz

Comment 26 Kelsey Prantis 2017-03-28 17:12:47 UTC
Created attachment 1267051 [details]
testfs-OST0000_2c0452_failed_to_stop_logs.tar.gz

Comment 27 Kelsey Prantis 2017-03-28 21:36:22 UTC
Attempted to configure our timeout to 5 minutes; still managed to time out with one nearer to 6 minutes. Is there any sort of internal upper bound on how long this might take, which we could rely on it always finishing within?

Comment 28 Kelsey Prantis 2017-03-31 02:57:58 UTC
And hit another that took 14 minutes. In case this information is useful to you guys...

Comment 29 Kelsey Prantis 2017-03-31 16:41:52 UTC
Another that was 15 min, 14 seconds. I am almost wondering if our checking is somehow prolonging it, so the longer we wait, the longer it takes. But all we are doing is calling 'crm_mon -1 -r' every 2 seconds to check whether the resource has stopped or not. And it can be 20+ seconds after we stop checking that it does finish, so maybe not. But it just seems like the more I increase the timeout, the longer the durations we see, from the tiny sample size we get to see.
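For reference, a minimal sketch of that kind of polling; the resource name and the pattern matched are placeholders for whatever the test harness actually checks:

    # Poll every 2 seconds until crm_mon no longer shows the resource as Started.
    while crm_mon -1 -r | grep -q 'testfs-OST0000_.*Started'; do
        sleep 2
    done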

Comment 30 Ken Gaillot 2017-04-04 19:02:36 UTC
Looking at the recent incidents, unfortunately it seems the suggested workaround isn't sufficient; it appears disabling is always treated as an aborted transition. The least disruptive workaround would be to put the cluster into maintenance mode, and restart the full cluster. (Maintenance mode allows pacemaker to be restarted without stopping any managed services.) We'll need the changes in Comment 12 (still expected in 7.4 time frame) for a satisfactory solution.
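A sketch of that workaround, assuming the cluster is managed with pcs:

    # Maintenance mode leaves resources running unmanaged while the cluster
    # stack is restarted; management resumes when the property is unset.
    pcs property set maintenance-mode=true
    pcs cluster stop --all
    pcs cluster start --all
    pcs property set maintenance-mode=false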

Comment 34 Ken Gaillot 2017-04-18 16:25:08 UTC
Better stonith failure handling has been merged upstream as of:

    https://github.com/ClusterLabs/pacemaker/pull/1263

A clarification that I was a bit confused about earlier: if stonith fails on a target too many times, the cluster does not give up on future fencing attempts, rather it gives up on *immediately* retrying fencing.

These changes were made:

* If the DC is shutting down when another node is joining the cluster, it will pass along that information, so the joining node does not fence the old DC if it becomes the new DC. This alone should be enough to prevent the situation in this bz, since fencing will not be required in the situation that triggered it.

* Too many stonith failures will now prevent a new transition only if a stonith failure is the cause of the new transition. This more directly addresses the cause of the "resource disable" not taking immediate effect, which occurred when (1) a transition was aborted for a reason other than stonith failure (such as a configuration change), (2) a previous transition was still in progress, and (3) a target had previously had too many stonith failures. So, even if fencing was required for some other reason and led to this situation again, the cluster would stop the resource as intended.

* Too many fencing failures for a target will now prevent a new transition only if the failed stonith operation was for that particular target. (Previously, failures for any target were considered.)

* The stonith failure count for a target is now reset if that target is removed from the configuration ("pcs cluster node remove"). This ensures the failures are not considered if a node with the same name is re-added to the cluster later.

* All nodes now count stonith failures, rather than just the current DC. This ensures that failures *are* considered if a different node is elected DC.

This should be sufficient to address this BZ. There is one more change I'd like to make, clearing the failure count if a fence device configuration changes, but it will not be ready in the 7.4 timeframe, and it is not essential, so I won't consider it part of this BZ fix.

Comment 35 Ken Gaillot 2017-04-18 17:35:25 UTC
QA: Here is extensive information on reproducing, since multiple issues are involved that need to be tested separately.

1. Test a DC shutting down as another node is joining:

1a. Configure and start a test cluster with two nodes.
1b. Create a dummy resource that takes a long time to stop:

    pcs resource create snail ocf:pacemaker:Dummy op_sleep=20 \
        op monitor interval=30s timeout=30s op start timeout=30s op stop timeout=30s

1c. If the resource doesn't start on the DC, create a location constraint to move it there. (The point is to make the DC shutdown take a long time.)
1d. Do "pcs cluster stop" for the node that is *not* DC.
1e. Do "pcs cluster stop --force" for the DC, wait a few seconds, then start the other node again. Before this fix, when the DC leaves, the newly joining node will show the old DC as UNCLEAN and fence it. After this fix, the old DC will cleanly leave the cluster and be shown as offline.

2. Test aborting a transition in progress for a configuration change, when there have been previous stonith failures:

2a. Configure and start a test cluster with at least three nodes (so we still have quorum with one failed node).
2b. On all nodes, install the pacemaker-cts package, and deploy its dummy fence device (so we can simulate fence device failure):

    cp /usr/share/pacemaker/tests/cts/fence_dummy /usr/sbin/fence_dummy

2c. Leave one node without any fence devices capable of fencing it (I'll call it $NODE1 in this example), and configure the dummy fence device to fail for the other two nodes ($NODE2 and $NODE3):

    pcs stonith create WorstFencingEver fence_dummy mode=fail pcmk_host_list="$NODE2 $NODE3"

2d. Create two dummy resources, including one that takes a long time to perform any operation (so we can ensure one transition is in progress when we abort it):


    pcs resource create rabbit ocf:pacemaker:Dummy op monitor interval=10s
    pcs resource create snail ocf:pacemaker:Dummy op_sleep=20 \
        op monitor interval=30s timeout=30s op start timeout=30s op stop timeout=30s

2e. Add location constraints so the dummy resources prefer one of the fence-failure nodes (this is not really necessary but ensures the resources don't move around for any other reason than our tests):

    pcs constraint location rabbit prefers $NODE2
    pcs constraint location snail prefers $NODE2

2f. If $NODE1 is the DC, do "pcs cluster stop; pcs cluster start" on it, so another node becomes DC. (This is not really necessary, but it conveniently lets us watch the logs on one host.)
2g. On the DC node, run this command to see relevant log messages:

    tail -f /var/log/messages | grep -E "was not terminated|Too many|No devices found in cluster"

2h. Block corosync on $NODE1, to simulate its failure. If the node has firewalld enabled, this will do it (otherwise do something equivalent with iptables):

    firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP

2i. On the DC, wait for a log message that the cluster is "giving up" on fencing $NODE1.
2j. Run "date; pcs resource disable snail; sleep 5; pcs resource disable rabbit". The first "disable" will start a new transition that will still be in progress when the second "disable" happens.

2k. On the DC, look at the pacemaker detail log (/var/log/pacemaker.log or /var/log/cluster/corosync.log). Around the time of the first "disable", there will be a "Transition aborted" message referencing "snail" and ending with "complete=true". After the second "disable", there will be a similar message for "rabbit" with "complete=false". This indicates that the second "disable" interrupted a transition in progress.

2l. Before the fix, the "Transition aborted" messages will be followed by a "No devices found" message, and the cluster will *not* immediately stop "rabbit", but will wait until the next transition (up to cluster-recheck-interval). After the fix, the cluster will stop "rabbit" more quickly.

2m. Unblock corosync on $NODE1. If using firewalld:

    firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP

2n. To clean up, remove the dummy resources, and restart the entire cluster (restarting ensures that all fencing fail counts are reset):

    pcs resource delete snail
    pcs resource delete rabbit
    pcs cluster stop --all
    pcs cluster start --all

3. Test whether one node's fencing failures affect another's:

3a. If $NODE2 is DC, restart the cluster on it, so another node becomes DC (for convenience of watching logs).
3b. On the DC, follow the log as described in 2g.
3c. On $NODE2, block corosync as described in 2h.
3d. Watch the logs on the DC. Each time that fencing fails, there will be a "was not terminated" message. After about 10 such failures, there will be a "Too many failures" message, and the cluster will stop retrying.
3e. On $NODE2, unblock corosync as described in 2m.
3f. On $NODE3, block corosync.
3g. Watch the logs on the DC. Before this fix, there will be a "Too many failures" message with the first fence failure (the $NODE2 failures are being counted). After this fix, only the $NODE3 failures will count, so the "Too many failures" message will appear only after about 10 failures.
3h. To clean up, unblock corosync on $NODE3, and restart the entire cluster.

4. Test whether a removed node's fencing failures are remembered:

4a. Ensure $NODE2 is not DC, follow the logs on the DC, and block corosync on $NODE2 (i.e. repeat 3a-3c).
4b. Follow the logs on the DC, and wait for the "Too many failures" message.
4c. Unblock corosync on $NODE2.
4d. pcs cluster node remove $NODE2
4e. pcs cluster node add $NODE2
4f. Block corosync on $NODE2.
4g. Follow the logs on the DC. Before this fix, the "Too many failures" message will occur with the first failure (the original node's failures are being remembered). After this fix, the "Too many failures" message will occur after about 10 failures (the node is considered a new node).
4h. To clean up, unblock $NODE2, and restart the entire cluster.

5. Test whether failures are remembered when the DC changes:
5a. Restart the cluster on $NODE2. This ensures that it is last in line to be DC.
5b. Follow the logs on the DC, block corosync on $NODE2, and wait for the "Too many failures" message.
5c. Unblock corosync on $NODE2.
5d. Restart the cluster on the DC, which will make another node become DC.
5e. Follow the logs on the new DC, and block corosync on $NODE2. Before the fix, the "Too many failures" message will occur after about 10 failures (the new DC doesn't know about the previous failures). After the fix, the "Too many failures" message will occur with the first failure.

Comment 36 Ken Gaillot 2017-04-18 19:55:16 UTC
QA: An additional note about testing: the ocf:pacemaker:Dummy resource agent in the older packages has an unrelated bug that can be triggered by this testing. It's fixed along with this bz, so no workaround is necessary with the newer packages.

With the older packages, before step 2j (disabling the dummy resources), on the node that is running "snail", do "ls /var/run/Dummy-*.serialized". If a file shows up, there is a monitor in progress; wait until the file disappears before disabling the resources. The symptom otherwise is that the stop happens immediately rather than respecting op_sleep, which defeats the purpose of this test.
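A minimal sketch of that wait, run on the node hosting "snail":

    # Loop until no in-progress Dummy operation state file remains.
    while ls /var/run/Dummy-*.serialized >/dev/null 2>&1; do
        sleep 1
    done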

Comment 37 Ken Gaillot 2017-04-18 22:41:53 UTC
Documentation: This BZ involves multiple issues. I've added Doc Text for the two most significant ones, which may or may not be worth a release note.

Comment 39 michal novacek 2017-06-09 13:09:42 UTC
I have tested that the following points from comment #35 work as described, on pacemaker-1.1.16-10.el7:

> 1) While the DC (nodeA) is shutting down, another node (nodeB) joining
    will not cause nodeA to be fenced

> 2) Test aborting a transition in progress for a configuration change when there have been previous stonith failures

> 3) Test whether one node's fencing failures affect another's

> 4) Removed node's fencing failures are remembered

> 5) Failures are remembered when DC changes

-----

Marking bug as SanityOnly, as these really are corner cases and the tests done are very limited.

Comment 40 errata-xmlrpc 2017-08-01 17:54:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1862

