1627948 – Bundle component actions can be scheduled on wrong node

Summary: Bundle component actions can be scheduled on wrong node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	low
Target Milestone:	rc
Target Release:	7.7
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-09-12 00:00 UTC by Marian Krcmarik
Modified:	2019-08-06 12:54 UTC
CC List:	6 users
Fixed In Version:	pacemaker-1.1.20-1.el7
Doc Type:	No Doc Update
Doc Text:	The issue is mostly invisible to end users.
Clone Of:
Environment:
Last Closed:	2019-08-06 12:53:44 UTC
Target Upstream Version:
Embargoed:

Attachments
part of sosreports (sos_commands/pacemaker) (17.16 MB, application/x-xz) 2018-09-12 00:00 UTC, Marian Krcmarik
sosreports from pcmk remote node (11.33 MB, application/x-xz) 2018-09-12 00:01 UTC, Marian Krcmarik
The rest of the sosreports for DC (9.13 MB, application/x-xz) 2018-09-12 00:04 UTC, Marian Krcmarik

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:2129	0	None	None	None	2019-08-06 12:54:11 UTC

Description Marian Krcmarik 2018-09-12 00:00:13 UTC

Created attachment 1482497 [details]
part of sosreports (sos_commands/pacemaker)

Description of problem:
A Pacemaker remote node does not recover from failover correctly. The observed flow is as follows:
- A hard reset is triggered on a Pacemaker remote node that is part of the cluster.
- The boot of the remote node is delayed.
- Pacemaker detects that the remote node is down and fences it.
- The remote node is still booting because of the delay.
- For some reason Pacemaker tries to start the pacemaker:remote resource even though the remote node is down (it even displays the remote node as online).
- It fails to start the pacemaker:remote resource on all full Pacemaker nodes and fences the node again.
- It then marks the pacemaker:remote resource as stopped and the remote node as offline, even though the node eventually comes back online and the pacemaker_remote daemon is fully working.

This was observed in an OpenStack-based cluster with 3 full Pacemaker nodes called controller-* and 6 Pacemaker remote nodes.

I am attaching some sosreports: a full sosreport from the database-0 Pacemaker remote node, which was reset at 21:23:04 on Sept 11 (log time), and part of a sosreport from the DC (controller-0):
pacemaker.tar.xz contains the part of the sosreport from sos_commands/pacemaker
sosreport-controller-1-20180911214317.tar.xz contains the rest of the sosreports
(I am not able to upload the logs to the place I used to; it does not work.)

Version-Release number of selected component (if applicable):
pacemaker-cluster-libs-1.1.18-11.el7_5.3.x86_64
pacemaker-libs-1.1.18-11.el7_5.3.x86_64
puppet-pacemaker-0.7.2-0.20180423212251.el7ost.noarch
pacemaker-remote-1.1.18-11.el7_5.3.x86_64
pacemaker-cli-1.1.18-11.el7_5.3.x86_64
ansible-pacemaker-1.0.4-0.20180220234310.0e4d7c0.el7ost.noarch
pacemaker-1.1.18-11.el7_5.3.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OpenStack-based cluster with 3 full Pacemaker nodes and 6 Pacemaker remote nodes (see the attached reports for the specific cluster configuration).
2. Reset one of the remote nodes and delay its boot.

Actual results:
The pacemaker:remote resource for that remote node ends up in the stopped state, and no resources are started on that node.

Expected results:


Additional info:

Comment 2 Marian Krcmarik 2018-09-12 00:01:19 UTC

Created attachment 1482498 [details]
sosreports from pcmk remote node

Comment 3 Marian Krcmarik 2018-09-12 00:04:50 UTC

Created attachment 1482512 [details]
The rest of the sosreports for DC

Comment 4 Andrew Beekhof 2018-09-12 13:14:01 UTC

(In reply to Marian Krcmarik from comment #0)
> Created attachment 1482497 [details]
> part of sosreports (sos_commands/pacemaker)
> 
> Description of problem:
> pacemaker remote node does not recover from failover correctly. The observed
> flow is as following:
> - hard reset is triggered on pacemaker remote node which is part of cluster
> - the boot of the remote node is delayed
> - pacemaker detects that the remote node is down and performs fencing
> - the remote node has delayed booting
> - for some reason pacemaker tries to start pacemaker:remote resource even
> though the remote node is down (and even It displays the remote node as
> online)

We have no idea when the remote will come back, so we try anyway, knowing that the failures will time out and we'll retry later.

> - it fails to start  pacemaker:remote resource on all full pacemaker nodes
> and fence the node again

It should be blocking, not fencing.
I suspect something might have changed :-(

> - now it marks the pacemaker:remote resource as stopped and remote node as
> offline even though it eventually finally comes back online and
> pacemaker_remote daemon starts to work fully.

Comment 5 Ken Gaillot 2018-10-12 22:28:54 UTC

There are multiple issues.

All of the issues arise from the bundle's remote connection being hosted on a different node than the bundle's container. That's required here because the container is hosted on a remote node, but it triggered some bugs in failure handling.

One issue is that an inappropriate clearing of the container's fail count is scheduled, and on the wrong node at that. Because it's on the wrong node, it doesn't really hurt anything; it just causes a new transition when it times out. I believe I have a fix for this one.

Another issue is that a stop of the container is scheduled on the cluster node hosting the container's remote connection (which is not hosting the container). Since a stop of an already stopped resource is a success, this doesn't cause any serious harm either. I'm still investigating a fix for this.

The most significant issue is that the remote node is fenced a second time unnecessarily. This appears to have been introduced by a fix in upstream version 1.1.17 (RHEL 7.5) that ensured that unrecoverable remote nodes are fenced even if no resources can run on them. I am still investigating this one as well.

Comment 6 Ken Gaillot 2018-10-23 19:19:56 UTC

After further investigation, most of the concerns here are expected behavior, and can be modified with appropriate configuration.

If a remote node is fenced, by default Pacemaker immediately and repeatedly tries reconnecting to it. If the remote node does not come back in time, the reconnect attempts will fail on all nodes, causing a second fencing and preventing the connection from being attempted again until the failure is cleaned. The second fencing is necessary because Pacemaker does not know whether a failed start leaves the resource cleanly stopped or partially started. (This is questionable in the case of ocf:pacemaker:remote, which the cluster does have more direct knowledge of, but it makes sense as a general behavior.)

Two configuration settings affect this behavior:

* If the ocf:pacemaker:remote agent's reconnect_interval parameter is set, Pacemaker will try to connect to the node at this interval rather than immediately. This gives the node more time to come back up, potentially avoiding the failures and second fencing.

* A failure-timeout (as usual) can expire the start failures, allowing the cluster to retry the connection.
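
As a rough illustration of the two settings above (a sketch, not taken from the reporter's configuration), a remote connection resource in the CIB might look like the fragment below. Only the names reconnect_interval and failure-timeout come from this comment; the resource id database-0, the nvpair ids, and the 60s and 120s values are assumptions chosen for the example and would need tuning to how long the node actually takes to boot.

```xml
<!-- Hypothetical ocf:pacemaker:remote connection resource; ids and values are illustrative only -->
<primitive id="database-0" class="ocf" provider="pacemaker" type="remote">
  <instance_attributes id="database-0-instance_attributes">
    <!-- Agent parameter: wait this long between reconnect attempts instead of retrying immediately -->
    <nvpair id="database-0-instance_attributes-reconnect_interval"
            name="reconnect_interval" value="60s"/>
  </instance_attributes>
  <meta_attributes id="database-0-meta_attributes">
    <!-- Meta-attribute: expire recorded start failures so the cluster can try the connection again -->
    <nvpair id="database-0-meta_attributes-failure-timeout"
            name="failure-timeout" value="120s"/>
  </meta_attributes>
</primitive>
```

With pcs, the same two values could likely be applied with something along the lines of `pcs resource update database-0 reconnect_interval=60s meta failure-timeout=120s`; the exact syntax should be checked against the installed pcs version.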

The only bugs here are the wrongly scheduled clear_failcount and stop actions, which are rather low impact. I have fixes for these; however, they would require an ABI compatibility break, so I am cautious about putting them into RHEL 7. If you think it's important enough, we can do it, or if you think it's important to avoid the second fencing in this case, I can investigate that more. Otherwise, I'll merge those fixes upstream only for now and close this.

Comment 9 Ken Gaillot 2018-11-16 00:14:09 UTC

After further consideration, I do want to backport the fixes for the wrongly scheduled actions to RHEL 7.7. We can use this BZ to track those. I do not think we will need z-streams.

The libpe_status API has up to this point been undocumented, so technically it is not a public API yet. Along with this change, we can add API documentation, so we are technically introducing the API rather than breaking it. The only practical effect should be that sbd will need to be rebuilt.

Comment 10 Ken Gaillot 2018-11-17 03:16:51 UTC

Fixed in upstream master branch (for RHEL 8) by commits 64852e3a through 556796e0, and in upstream 1.1 branch (for RHEL 7) by commits 163742c4 through 58e4eb80

Comment 13 errata-xmlrpc 2019-08-06 12:53:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2129
