Bug 1665343

Summary: Regression: Cleanup when operation in progress delays cluster response
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.6
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Ken Gaillot <kgaillot>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, aherr, cfeist, cluster-maint, mnovacek, phagara, pkomarov
Keywords: Regression, ZStream
Target Milestone: rc
Target Release: 7.7
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.20-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, when a resource was cleaned up, refreshed, or removed from the configuration while an operation was in progress on that resource, the originator of the operation was not informed of its completion. As a consequence, the cluster was blocked from responding to new events for a couple of minutes, until the operation was considered timed out, which sometimes led to fencing (for example, if the operation in progress was a stop). With this update, the originator is notified of the result directly, and the cluster responds to new events immediately.
Clones: 1665816 (view as bug list)
Last Closed: 2019-08-06 12:54:19 UTC
Type: Bug
Bug Blocks: 1665816    
Attachments:
  * pcs cluster report for failed reproducer
  * corosync.log files for failed reproducer

Description Ken Gaillot 2019-01-11 05:36:09 UTC
Description of problem: As of RHEL 7.6, if someone issues a pcs resource cleanup/refresh/remove on a specific resource while that resource has an operation in progress, an assertion will be logged and the transition will time out, preventing the cluster from reacting to any new events for a minute or so.

Steps to Reproduce:
1. Configure and start a cluster.
2. Configure an ocf:pacemaker:Dummy resource with op_sleep=10, and wait for it to be active.
3. Force a failure of the dummy resource on the node it's running on (e.g. rm /run/Dummy*state*).
4. While the resource is being recovered (pcs status shows it as FAILED or Starting), run pcs resource cleanup with the resource name (see the command sketch below).

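A minimal command-level sketch of these steps (the resource name "dummy", the op_sleep value, and the state-file path follow the verification later in this bug; adjust for your cluster):

  # step 2: create a Dummy resource whose operations take ~10 seconds, then wait for it to start
  pcs resource create dummy ocf:pacemaker:Dummy op_sleep=10
  pcs status

  # step 3: on the node running dummy, remove its state file to force a failure
  rm -f /run/Dummy-dummy.state*

  # step 4: while pcs status shows the resource FAILED or Starting, clean it up
  pcs resource cleanup dummy
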
Actual results: The log on the node that was running the resource will show "Triggered assert" for "rsc != NULL", and a minute or so later, the DC's log will show a "Timer popped" message. If the cleanup happened while the resource was stopping, the node will likely then get fenced due to the stop failure. The cluster will not react to new events (e.g. fencing a different node that is killed) until the timer pops.

Expected results: No assertion is logged, the cluster reacts immediately to new events, and the node is not fenced if cleanup happens during the stop operation.
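
To check whether a node hit the problem, grep its cluster log for the two messages named above (the same check is used during verification below; the path assumes the default /var/log/cluster/corosync.log location):

  # an affected DC logs both messages; a fixed build should produce no matches
  grep -iE "(Timer popped)|(Triggered assert)" /var/log/cluster/corosync.log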

Comment 8 Patrik Hagara 2019-06-27 12:11:32 UTC
Unable to reproduce the issue on 7.7-20190619.n.0 compose with pacemaker packages downgraded to 1.1.19-8.el7-c3c624ea3d.

Reproduction steps taken:
  * 3-node virt cluster
  * one ocf:pacemaker:Dummy resource named "dummy" with op_sleep=10
  * /run/Dummy-dummy.state resource state file removed on the node where it's running
  * while the dummy resource is in "FAILED" or "Started (Monitoring)" state, run "pcs resource cleanup dummy"
  * grepping /var/log/cluster/corosync.log for "Timer popped" and "Triggered assert" returns no matches, dummy resource is recovered successfully and no fencing occurs

Ken, could you please confirm this version is supposed to be affected and/or provide a working reproducer?

Comment 9 Ken Gaillot 2019-06-27 16:20:59 UTC
(In reply to Patrik Hagara from comment #8)
> Unable to reproduce the issue on 7.7-20190619.n.0 compose with pacemaker
> packages downgraded to 1.1.19-8.el7-c3c624ea3d.
> 
> Reproduction steps taken:
>   * 3-node virt cluster
>   * one ocf:pacemaker:Dummy resource named "dummy" with op_sleep=10
>   * /run/Dummy-dummy.state resource state file removed on the node where
> it's running
>   * while the dummy resource is in "FAILED" or "Started (Monitoring)" state,
> run "pcs resource cleanup dummy"

The issue wouldn't show up in the "Started (Monitoring)" state, only "FAILED" or "Starting". It's possible that cleaning up early enough in the "FAILED" state doesn't trigger the issue, so cleaning up in "Starting" would be ideal.

>   * grepping /var/log/cluster/corosync.log for "Timer popped" and "Triggered
> assert" returns no matches, dummy resource is recovered successfully and no
> fencing occurs

Those messages should show up in the DC's logs, which might be different from the node running the resource.
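
A quick way to identify the current DC (a sketch; the grep just isolates the "Current DC:" line that pcs status prints near the top of its output):

  pcs status | grep "Current DC"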

> Ken, could you please confirm this version is supposed to be affected and/or
> provide a working reproducer?

This BZ was z-streamed back to 7.6 (Bug 1665816) in pacemaker-1.1.19-8.el7_6.4, and QA verified there that the issue was present in 1.1.19-8.el7_6.2. If the above suggestions don't clear it up, something odd is going on.

Comment 10 Patrik Hagara 2019-07-02 09:59:55 UTC
(In reply to Ken Gaillot from comment #9)
> (In reply to Patrik Hagara from comment #8)
> > Unable to reproduce the issue on 7.7-20190619.n.0 compose with pacemaker
> > packages downgraded to 1.1.19-8.el7-c3c624ea3d.
> > 
> > Reproduction steps taken:
> >   * 3-node virt cluster
> >   * one ocf:pacemaker:Dummy resource named "dummy" with op_sleep=10
> >   * /run/Dummy-dummy.state resource state file removed on the node where
> > it's running
> >   * while the dummy resource is in "FAILED" or "Started (Monitoring)" state,
> > run "pcs resource cleanup dummy"
> 
> The issue wouldn't show up in the "Started (Monitoring)" state, only
> "FAILED" or "Starting". It might be possible that cleaning up early enough
> in the "FAILED" state might not trigger the issue, so cleaning up in
> "Starting" would be ideal.

I've tried running cleanup at different times in the recovery process with no luck.

All I saw was "FAILED" and "Started (Monitoring)" states, no "Starting" (except when the dummy resource was starting for the first time after creation).

> >   * grepping /var/log/cluster/corosync.log for "Timer popped" and "Triggered
> > assert" returns no matches, dummy resource is recovered successfully and no
> > fencing occurs
> 
> Those messages should show up in the DC's logs, which might be different
> from the node running the resource.

No node had any of these lines in the corosync.log.

> > Ken, could you please confirm this version is supposed to be affected and/or
> > provide a working reproducer?
> 
> This BZ was z-streamed back to 7.6 (Bug 1665816) in
> pacemaker-1.1.19-8.el7_6.4, and QA verified there that the issue was present
> in 1.1.19-8.el7_6.2. If the above suggestions don't clear it up, something
> odd is going on.

Attaching a pcs cluster report and corosync.log files.

I can try to repro using the pre-fix 7.6.z packages, unless you have a better suggestion based on the report.

Comment 11 Patrik Hagara 2019-07-02 10:00:46 UTC
Created attachment 1586619 [details]
pcs cluster report for failed reproducer

Comment 12 Patrik Hagara 2019-07-02 10:01:24 UTC
Created attachment 1586620 [details]
corosync.log files for failed reproducer

Comment 13 Ken Gaillot 2019-07-02 21:09:13 UTC
I think you may just be missing the 10-second window ... try bumping op_sleep to 20

"FAILED" is the right state for cleanup

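One way to widen that window, assuming the resource is still named dummy as in the earlier reproduction attempts (a sketch, not from the original comment):

  # raise the simulated operation delay from 10 to 20 seconds
  pcs resource update dummy op_sleep=20
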
Comment 14 Patrik Hagara 2019-07-10 13:07:41 UTC
Still unable to reproduce the issue using the unfixed packages. Martin Juricek <mjuricek>, who verified the 7.6 Z-stream BZ, also tried without success.

Could it be that this issue was triggerable only in 7.6 Z-stream packages (starting with 1.1.19-8.el7_6.2 and then fixed in pacemaker-1.1.19-8.el7_6.4 as per the BZ clone)?

I'll check that it's still not reproducible in this "fixed" package and mark this bug as verified, as we're already way past the 7.7 deadline.


after fix (1.1.20-5.el7):
=========================

> [root@virt-038 ~]# rpm -q pacemaker
> pacemaker-1.1.20-5.el7.x86_64
> [root@virt-038 ~]# pcs resource create dummy ocf:pacemaker:Dummy op_sleep=10


wait for dummy to be started:

> [root@virt-038 ~]# pcs status
> Cluster name: STSRHTS540
> Stack: corosync
> Current DC: virt-038 (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
> Last updated: Tue Jul  9 16:31:09 2019
> Last change: Tue Jul  9 16:20:01 2019 by root via cibadmin on virt-038
> 
> 3 nodes configured
> 4 resources configured
> 
> Online: [ virt-038 virt-040 virt-061 ]
> 
> Full list of resources:
> 
>  fence-virt-038	(stonith:fence_xvm):	Started virt-038
>  fence-virt-040	(stonith:fence_xvm):	Started virt-040
>  fence-virt-061	(stonith:fence_xvm):	Started virt-061
>  dummy	(ocf::pacemaker:Dummy):	Started virt-038
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled


on the node where dummy is started, delete the resource state files:

> [root@virt-038 ~]# rm -f /run/Dummy-dummy.state*


keep querying `pcs status` until dummy transitions to FAILED state. note the timestamp of the failure (last_rc_change of the failed monitor action).

~5 seconds after the failure timestamp, run cleanup of the dummy resource:

> [root@virt-038 ~]# pcs resource cleanup dummy
> Cleaned up dummy on virt-061
> Cleaned up dummy on virt-040
> Cleaned up dummy on virt-038
> Waiting for 1 reply from the CRMd. OK


`pcs status` will now show the dummy resource as "stopped" for up to 10 seconds, then "starting" for 10 seconds, and "started (monitoring)" for another 10 seconds. Afterwards the resource will be successfully recovered.
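
one way to watch these state transitions live (a suggestion, not part of the original verification) is to keep an interactive cluster monitor open on any node:

> [root@virt-038 ~]# crm_mon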

verify that the bug was not hit (should return no matches):

> [root@virt-038 ~]# for node in virt-{038,040,061}; do ssh $node 'grep -iE "(Timer popped)|(Triggered assert)" /var/log/cluster/corosync.log'; done


marking as verified in 1.1.20-5.el7

Comment 16 errata-xmlrpc 2019-08-06 12:54:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129