Bug 1665343 - Regression: Cleanup when operation in progress delays cluster response
Summary: Regression: Cleanup when operation in progress delays cluster response
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.7
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1665816
Reported: 2019-01-11 05:36 UTC by Ken Gaillot
Modified: 2019-08-06 12:54 UTC (History)
CC List: 7 users

Fixed In Version: pacemaker-1.1.20-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, the originator of the operation was not informed of its completion when a resource was cleaned up, refreshed, or removed from the configuration while an operation was in progress on that resource. As a consequence, the cluster was blocked from responding to new events for a couple of minutes, and the operation was then considered timed out. This sometimes led to fencing: for example, if the operation was a stop. With this update, the originator is now directly notified of the result, and the cluster is able to respond to new events.
Clone Of:
: 1665816 (view as bug list)
Environment:
Last Closed: 2019-08-06 12:54:19 UTC
Target Upstream Version:
Embargoed:


Attachments
pcs cluster report for failed reproducer (382.84 KB, application/x-bzip), added 2019-07-02 10:00 UTC by Patrik Hagara
corosync.log files for failed reproducer (102.82 KB, application/x-bzip), added 2019-07-02 10:01 UTC by Patrik Hagara


Links
Red Hat Product Errata RHBA-2019:2129, last updated 2019-08-06 12:54:33 UTC

Description Ken Gaillot 2019-01-11 05:36:09 UTC
Description of problem: As of RHEL 7.6, if someone issues a pcs resource cleanup/refresh/remove on a specific resource while that resource has an operation in progress, an assertion will be logged and the transition will time out, preventing the cluster from reacting to any new events for a minute or so.

Steps to Reproduce:
1. Configure and start a cluster.
2. Configure an ocf:pacemaker:Dummy resource with op_sleep=10, and wait for it to be active.
3. Force a failure of the dummy resource on the node it's running on (e.g. rm /run/Dummy*state*).
4. While the resource is being recovered (pcs status shows it FAILED or Starting), run pcs resource cleanup with the resource name (see the command sketch below).
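
A minimal command sketch of steps 2-4, assuming a running cluster and a resource named "dummy" (the name is illustrative; the commands match those used in later comments):

  # create the Dummy resource with a 10-second operation delay
  pcs resource create dummy ocf:pacemaker:Dummy op_sleep=10
  # on the node running dummy, force a failure by removing its state file
  rm -f /run/Dummy-dummy.state*
  # while pcs status shows dummy as FAILED or Starting, clean it up
  pcs resource cleanup dummy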

Actual results: The log on the node that was running the resource will show "Triggered assert" for "rsc != NULL", and a minute or so later, the DC's log will show a "Timer popped" message. If the cleanup happened while the resource was stopping, the node will likely then get fenced due to the stop failure. The cluster will not react to new events (e.g. fencing a different node that is killed) until the timer pops.
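
One way to check each node's log for these messages (a sketch based on the check used later in comment 14):

  grep -E "Triggered assert|Timer popped" /var/log/cluster/corosync.log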

Expected results: No assertion is logged, the cluster reacts immediately to new events, and the node is not fenced if cleanup happens during the stop operation.

Comment 8 Patrik Hagara 2019-06-27 12:11:32 UTC
Unable to reproduce the issue on 7.7-20190619.n.0 compose with pacemaker packages downgraded to 1.1.19-8.el7-c3c624ea3d.
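
For reference, the downgrade might be performed with something like the following (a sketch only; the exact set of pacemaker subpackages and their availability in an enabled repository are assumptions):

  yum downgrade pacemaker-1.1.19-8.el7 pacemaker-libs-1.1.19-8.el7 \
      pacemaker-cli-1.1.19-8.el7 pacemaker-cluster-libs-1.1.19-8.el7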

Reproduction steps taken:
  * 3-node virt cluster
  * one ocf:pacemaker:Dummy resource named "dummy" with op_sleep=10
  * /run/Dummy-dummy.state resource state file removed on the node where it's running
  * while the dummy resource is in "FAILED" or "Started (Monitoring)" state, run "pcs resource cleanup dummy"
  * grepping /var/log/cluster/corosync.log for "Timer popped" and "Triggered assert" returns no matches, dummy resource is recovered successfully and no fencing occurs

Ken, could you please confirm this version is supposed to be affected and/or provide a working reproducer?

Comment 9 Ken Gaillot 2019-06-27 16:20:59 UTC
(In reply to Patrik Hagara from comment #8)
> Unable to reproduce the issue on 7.7-20190619.n.0 compose with pacemaker
> packages downgraded to 1.1.19-8.el7-c3c624ea3d.
> 
> Reproduction steps taken:
>   * 3-node virt cluster
>   * one ocf:pacemaker:Dummy resource named "dummy" with op_sleep=10
>   * /run/Dummy-dummy.state resource state file removed on the node where
> it's running
>   * while the dummy resource is in "FAILED" or "Started (Monitoring)" state,
> run "pcs resource cleanup dummy"

The issue wouldn't show up in the "Started (Monitoring)" state, only "FAILED" or "Starting". It might be possible that cleaning up early enough in the "FAILED" state might not trigger the issue, so cleaning up in "Starting" would be ideal.

>   * grepping /var/log/cluster/corosync.log for "Timer popped" and "Triggered
> assert" returns no matches, dummy resource is recovered successfully and no
> fencing occurs

Those messages should show up in the DC's logs, which might be different from the node running the resource.

> Ken, could you please confirm this version is supposed to be affected and/or
> provide a working reproducer?

This BZ was z-streamed back to 7.6 (Bug 1665816) in pacemaker-1.1.19-8.el7_6.4, and QA verified there that the issue was present in 1.1.19-8.el7_6.2. If the above suggestions don't clear it up, something odd is going on.

Comment 10 Patrik Hagara 2019-07-02 09:59:55 UTC
(In reply to Ken Gaillot from comment #9)
> (In reply to Patrik Hagara from comment #8)
> > Unable to reproduce the issue on 7.7-20190619.n.0 compose with pacemaker
> > packages downgraded to 1.1.19-8.el7-c3c624ea3d.
> > 
> > Reproduction steps taken:
> >   * 3-node virt cluster
> >   * one ocf:pacemaker:Dummy resource named "dummy" with op_sleep=10
> >   * /run/Dummy-dummy.state resource state file removed on the node where
> > it's running
> >   * while the dummy resource is in "FAILED" or "Started (Monitoring)" state,
> > run "pcs resource cleanup dummy"
> 
> The issue wouldn't show up in the "Started (Monitoring)" state, only
> "FAILED" or "Starting". It might be possible that cleaning up early enough
> in the "FAILED" state might not trigger the issue, so cleaning up in
> "Starting" would be ideal.

I've tried running cleanup at different times in the recovery process with no luck.

All I saw was "FAILED" and "Started (Monitoring)" states, no "Starting" (except when the dummy resource was starting for the first time after creation).

> >   * grepping /var/log/cluster/corosync.log for "Timer popped" and "Triggered
> > assert" returns no matches, dummy resource is recovered successfully and no
> > fencing occurs
> 
> Those messages should show up in the DC's logs, which might be different
> from the node running the resource.

No node had any of these lines in the corosync.log.

> > Ken, could you please confirm this version is supposed to be affected and/or
> > provide a working reproducer?
> 
> This BZ was z-streamed back to 7.6 (Bug 1665816) in
> pacemaker-1.1.19-8.el7_6.4, and QA verified there that the issue was present
> in 1.1.19-8.el7_6.2. If the above suggestions don't clear it up, something
> odd is going on.

Attaching a pcs cluster report and corosync.log files.

I can try to repro using the pre-fix 7.6.z packages, unless you have a better suggestion based on the report.

Comment 11 Patrik Hagara 2019-07-02 10:00:46 UTC
Created attachment 1586619 [details]
pcs cluster report for failed reproducer

Comment 12 Patrik Hagara 2019-07-02 10:01:24 UTC
Created attachment 1586620 [details]
corosync.log files for failed reproducer

Comment 13 Ken Gaillot 2019-07-02 21:09:13 UTC
I think you may just be missing the 10-second window ... try bumping op_sleep to 20.

"FAILED" is the right state for cleanup

Comment 14 Patrik Hagara 2019-07-10 13:07:41 UTC
Still unable to reproduce the issue using the unfixed packages. Martin Juricek <mjuricek>, who verified the 7.6 Z-stream BZ, also tried without success.

Could it be that this issue was triggerable only in 7.6 Z-stream packages (starting with 1.1.19-8.el7_6.2 and then fixed in pacemaker-1.1.19-8.el7_6.4 as per the BZ clone)?

I'll check that it's still not reproducible in this "fixed" package and mark this bug as verified, as we're already way past the 7.7 deadline.


after fix (1.1.20-5.el7):
=========================

> [root@virt-038 ~]# rpm -q pacemaker
> pacemaker-1.1.20-5.el7.x86_64
> [root@virt-038 ~]# pcs resource create dummy ocf:pacemaker:Dummy op_sleep=10


wait for dummy to be started:

> [root@virt-038 ~]# pcs status
> Cluster name: STSRHTS540
> Stack: corosync
> Current DC: virt-038 (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
> Last updated: Tue Jul  9 16:31:09 2019
> Last change: Tue Jul  9 16:20:01 2019 by root via cibadmin on virt-038
> 
> 3 nodes configured
> 4 resources configured
> 
> Online: [ virt-038 virt-040 virt-061 ]
> 
> Full list of resources:
> 
>  fence-virt-038	(stonith:fence_xvm):	Started virt-038
>  fence-virt-040	(stonith:fence_xvm):	Started virt-040
>  fence-virt-061	(stonith:fence_xvm):	Started virt-061
>  dummy	(ocf::pacemaker:Dummy):	Started virt-038
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled


on the node where dummy is started, delete the resource state files:

> [root@virt-038 ~]# rm -f /run/Dummy-dummy.state*


Keep querying `pcs status` until dummy transitions to FAILED state. Note the timestamp of the failure (last_rc_change of the failed monitor action).
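
One way to read that timestamp (a sketch; it assumes the failed monitor shows up under "Failed Actions" in the status output):

> [root@virt-038 ~]# pcs status | grep -A 3 "Failed Actions"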

~5 seconds after the failure timestamp, run cleanup of the dummy resource:

> [root@virt-038 ~]# pcs resource cleanup dummy
> Cleaned up dummy on virt-061
> Cleaned up dummy on virt-040
> Cleaned up dummy on virt-038
> Waiting for 1 reply from the CRMd. OK


`pcs status` will now show the dummy resource as "stopped" for up to 10 seconds, then "starting" for 10 seconds, and "started (monitoring)" for another 10 seconds. Afterwards, the resource will be successfully recovered.

verify that the bug was not hit (should return no matches):

> [root@virt-038 ~]# for node in virt-{038,040,061}; do ssh $node 'grep -iE "(Timer popped)|(Triggered assert)" /var/log/cluster/corosync.log'; done


marking as verified in 1.1.20-5.el7

Comment 16 errata-xmlrpc 2019-08-06 12:54:19 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129

