Bug 1794062 - Starting cluster using --wait on all nodes in parallel often ends up with an error
Summary: Starting cluster using --wait on all nodes in parallel often ends up with an error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcs
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: rc
Target Release: 8.4
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-22 15:24 UTC by Ken Gaillot
Modified: 2021-05-18 15:12 UTC
CC List: 13 users

Fixed In Version: pcs-0.10.8-1.el8
Doc Type: Bug Fix
Doc Text:
Cause: The user runs the 'pcs cluster start --wait' command. Consequence: Pcs checks pacemaker daemons to see whether the cluster has already started. A race condition may occur when only some of the pacemaker daemons on the local node have started, which causes pcs to report an error. Fix: Properly check the status of all pacemaker daemons and wait for all of them to start. Result: 'pcs cluster start --wait' succeeds.
Clone Of: 1793653
Environment:
Last Closed: 2021-05-18 15:12:05 UTC
Type: Enhancement
Target Upstream Version:
Embargoed:


Attachments
proposed fix (9.24 KB, patch)
2021-01-11 12:09 UTC, Tomas Jelinek
no flags

Description Ken Gaillot 2020-01-22 15:24:51 UTC
+++ This bug was initially created as a clone of Bug #1793653 +++

Description of problem:
When starting the cluster with --wait on all nodes at the same time, nodes often report 'Unable to get node status'. The frequency and the number of nodes returning the error vary.

Version-Release number of selected component (if applicable):
pcs-0.10.4-3.el8.x86_64

How reproducible:
often (not always)

Steps to Reproduce:
1. stop the cluster
[root@virt-038 ~]# pcs cluster stop --all
virt-036: Stopping Cluster (pacemaker)...
virt-037: Stopping Cluster (pacemaker)...
virt-038: Stopping Cluster (pacemaker)...
virt-037: Stopping Cluster (corosync)...
virt-038: Stopping Cluster (corosync)...
virt-036: Stopping Cluster (corosync)...

2. run 'pcs cluster start --wait' on all nodes in parallel (an example of one way to do this is shown after these steps)

> There are two types of errors occurring (sometimes combined)

A)
[root@virt-038 ~]# pcs cluster start --wait
Starting Cluster...
Waiting for node(s) to start...
Error: Unable to get node status: unable to get local node name from pacemaker: error: Could not connect to cluster (is it running?)

B)
[root@virt-037 ~]# pcs cluster start --wait
Starting Cluster...
Waiting for node(s) to start...
Error: Unable to get node status: cannot load cluster status, xml does not conform to the schema
 
3. check the nodes status afterwards
[root@virt-038 ~]# pcs status nodes
Pacemaker Nodes:
 Online: virt-036 virt-037 virt-038
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Standby with resource(s) running:
 Maintenance:
 Offline:
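
As an illustration of step 2, one way to start the cluster on all nodes at once is to fork the command over ssh from a single host (node names taken from the outputs above; any parallel-ssh tool would do equally well):

for node in virt-036 virt-037 virt-038; do
    ssh "$node" 'pcs cluster start --wait' &
done
wait    # wait for all background ssh commands to finish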

Actual results:
Some (sometimes all) of the nodes return one of the errors mentioned above. The cluster nodes start anyway.

Expected results:
Starting the cluster passes without errors on all nodes.

Additional info:
This might suggest a problem in pacemaker:

A)
[root@virt-036 ~]# pcs cluster start --wait --debug
...
Finished running: /usr/sbin/crm_node --name
Return value: 102
--Debug Stdout Start--

--Debug Stdout End--
--Debug Stderr Start--
error: Could not connect to cluster (is it running?)

--Debug Stderr End--

Error: Unable to get node status: unable to get local node name from pacemaker: error: Could not connect to cluster (is it running?)

B)
[root@virt-037 ~]# pcs cluster start --wait --debug
...
--Debug Stdout Start--
Connection to the cluster-daemons terminated
Reading stonith-history failed<crm_mon version="2.0.3"/>

--Debug Stdout End--
--Debug Stderr Start--
Critical: Unable to get stonith-history

--Debug Stderr End--

Error: Unable to get node status: cannot load cluster status, xml does not conform to the schema

This seems to be also related to bz1793574

--- Additional comment from Tomas Jelinek on 2020-01-22 09:22:47 UTC ---

This seems like a pacemaker issue to me. Both error messages come from pacemaker and there is not much we can do about them in pcs.


(In reply to Nina Hostakova from comment #0)
> A)
> [root@virt-036 ~]# pcs cluster start --wait --debug
> ...
> Finished running: /usr/sbin/crm_node --name
> Return value: 102
> --Debug Stdout Start--
> 
> --Debug Stdout End--
> --Debug Stderr Start--
> error: Could not connect to cluster (is it running?)
> 
> --Debug Stderr End--
> 
> Error: Unable to get node status: unable to get local node name from
> pacemaker: error: Could not connect to cluster (is it running?)

This is coming from pacemaker; for some reason it is unable to figure out the local node's name. Not sure if this issue is new or whether it has been waiting there for some time to be discovered. Nina can surely provide the pacemaker versions where this is happening.


> B)
> [root@virt-037 ~]# pcs cluster start --wait --debug
> ...
> --Debug Stdout Start--
> Connection to the cluster-daemons terminated
> Reading stonith-history failed<crm_mon version="2.0.3"/>
> 
> --Debug Stdout End--
> --Debug Stderr Start--
> Critical: Unable to get stonith-history
> 
> --Debug Stderr End--
> 
> Error: Unable to get node status: cannot load cluster status, xml does not
> conform to the schema

Unfortunately, QE snipped the output, so we cannot see the command that was run or what its return code was. Nina, can you provide the missing info?

Anyway, it looks like crm_mon --as-xml mixes error / warning messages with xml output on stdout. And it may even exit with return code 0 in this case, which causes pcs to try to parse the stdout as xml instead of exiting with an error. (This is a guess based on pcs code, since QE have not provided the command or its return code.)


> This seems to be also related to bz1793574
How?

--- Additional comment from Ken Gaillot on 2020-01-22 15:06:12 UTC ---

(In reply to Tomas Jelinek from comment #1)
> This seems like a pacemaker issue to me. Both error messages come from
> pacemaker and there is not much we can do about them in pcs.
> 
> 
> (In reply to Nina Hostakova from comment #0)
> > A)
> > [root@virt-036 ~]# pcs cluster start --wait --debug
> > ...
> > Finished running: /usr/sbin/crm_node --name
> > Return value: 102
> > --Debug Stdout Start--
> > 
> > --Debug Stdout End--
> > --Debug Stderr Start--
> > error: Could not connect to cluster (is it running?)
> > 
> > --Debug Stderr End--
> > 
> > Error: Unable to get node status: unable to get local node name from
> > pacemaker: error: Could not connect to cluster (is it running?)
> 
> This is coming from pacemaker, for some reason it is unable to figure out
> the local node's name. Not sure if this issue is new or it was waiting there
> for some time to be discovered. Nina can for sure provide pacemaker versions
> where this is happening.

crm_node contacts the local cluster to get the local node name (which may be different from the hostname), so if the cluster isn't running, it will give that error.

There was a change a while back, where it now contacts the pacemaker controller rather than corosync directly. That allows it to work reliably on remote nodes. But that does mean pacemaker as well as corosync must be running.

There's no reliable way to get the node name if the cluster isn't running, because node names don't have to match host names. If it's a full cluster node, you could probably figure it out from corosync.conf, but remote nodes don't have any way of knowing their own name (if it's different from their hostname) without the cluster connecting to them first.
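
As a rough illustration of the corosync.conf idea (full cluster nodes only; this is a sketch, not how pacemaker resolves names, and real resolution would also have to handle addresses, multiple rings, and FQDN vs. short-name mismatches):

HOST=$(uname -n)
# scan each node {} entry in the nodelist and print the name whose
# name: or ring0_addr: matches this host
awk -v host="$HOST" '
    /node[[:space:]]*\{/     { in_node = 1; name = ""; addr = "" }
    in_node && /name:/       { name = $2 }
    in_node && /ring0_addr:/ { addr = $2 }
    in_node && /\}/          { if (name == host || addr == host) print (name != "" ? name : addr); in_node = 0 }
' /etc/corosync/corosync.conf
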

> > B)
> > [root@virt-037 ~]# pcs cluster start --wait --debug
> > ...
> > --Debug Stdout Start--
> > Connection to the cluster-daemons terminated
> > Reading stonith-history failed<crm_mon version="2.0.3"/>
> > 
> > --Debug Stdout End--
> > --Debug Stderr Start--
> > Critical: Unable to get stonith-history
> > 
> > --Debug Stderr End--
> > 
> > Error: Unable to get node status: cannot load cluster status, xml does not
> > conform to the schema
> 
> Unfortunately, QE snipped the output, so we cannot see the command that has
> been run and what was its return code. Nina, can you provide the missing
> info?
> 
> Anyway, it looks like crm_mon --as-xml mixes error / warning messages with
> xml output on stdout. And it may even exit with return code 0 in this case,
> which causes pcs to try parse the stdout as xml instead of exiting with an
> error. (This is a guess based on pcs code, since QE have not provided the
> command nor its return code.)

The goal of crm_mon XML is to put everything, including error messages, into the XML. However, that's not possible for certain early errors (e.g. argument processing) that occur before the output format has been selected, and there are still a handful of later messages (including this one) that go straight to stderr but really shouldn't.

I'm not sure this should be labeled a "Critical" error, either. It just means crm_mon can't show the fencing section, so I'm not sure whether that should result in an error exit status or not. It does highlight that the current XML output has no way of showing partial errors; maybe the <fence_history> tag should have an attribute like available="true"/"false" to distinguish "there's no fence history" from "we couldn't get fence history".

> > This seems to be also related to bz1793574
> How?

I believe this is unrelated.

--- Additional comment from Ken Gaillot on 2020-01-22 15:12:10 UTC ---

To summarize action items:

1. I think the first error is something pcs needs to address. crm_node is correctly giving an error when the cluster is not running.

2. We can use this BZ to redirect as many error messages as possible to the XML output, and add something like the "available" suggestion to indicate partial failures. pcs would likely need to be modified to look for the new information.

I'll clone this BZ for the pcs end.

--- Additional comment from Ken Gaillot on 2020-01-22 15:15:38 UTC ---

(In reply to Tomas Jelinek from comment #1)
> (In reply to Nina Hostakova from comment #0)
> > Error: Unable to get node status: cannot load cluster status, xml does not
> > conform to the schema
> 
> Unfortunately, QE snipped the output, so we cannot see the command that has
> been run and what was its return code. Nina, can you provide the missing
> info?

We do still need this info to check whether there's anything missing in the schema. The package version is important, too.

Comment 1 Tomas Jelinek 2020-01-23 10:51:34 UTC
>--- Additional comment from Ken Gaillot on 2020-01-22 15:12:10 UTC ---
> 1. I think the first error is something pcs needs to address. crm_node is correctly giving an error when the cluster is not running.

Pcs first tries to run 'crm_mon --one-shot --as-xml --inactive' to get the cluster status from the local node. If that returns non-zero, pcs considers the local node offline (not fully started) and tries again later. If crm_mon returns 0, pcs proceeds and runs 'crm_node --name'. Is the idea that "crm_mon exiting with 0 => the cluster is started and ready" flawed? Or perhaps there has been a related change in pacemaker recently? Again, we are missing the full debug output from pcs here...
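
Roughly, the logic described above boils down to the following shell sketch (simplified; the actual pcs implementation may differ in details):

# current logic (simplified): retry while crm_mon fails, then ask for the node name once
while ! crm_mon --one-shot --as-xml --inactive >/dev/null 2>&1; do
    sleep 2    # local node not considered started yet, try again later
done
crm_node --name
# in the reported race, crm_node can still fail here (exit code 102)
# even though crm_mon already succeeded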

Comment 2 Tomas Jelinek 2020-01-23 10:53:17 UTC
Nina, can you provide the full debug output from pcs, perhaps as an attachment if it's too long? See comment 1. Thanks.

Comment 8 Ken Gaillot 2020-01-24 21:01:56 UTC
(In reply to Tomas Jelinek from comment #1)
> >--- Additional comment from Ken Gaillot on 2020-01-22 15:12:10 UTC ---
> > 1. I think the first error is something pcs needs to address. crm_node is correctly giving an error when the cluster is not running.
> 
> Pcs first tries to run 'crm_mon --one-shot --as-xml --inactive' to get
> cluster status from the local node. If that returns non-zero, pcs considers
> the local node offline (not fully started) and tries again later. If crm_mon
> returns 0, pcs proceeds and runs 'crm_node --name'. Is the idea "that
> crm_mon exiting with 0 => the cluster is started and ready" flawed? Or
> perhaps there has been a related change in pacemaker recently? Again, we are
> missing the full debug output from pcs here...

crm_mon will have exit status 102 when the cluster is down (which is "Not connected" in pacemaker exit codes). That's the same exit status crm_node will give if the cluster is down.

However, I just realized there is a race condition that is the likely culprit here. The crm_mon command only needs to be able to query the CIB, while the crm_node command needs to be able to contact the controller. The CIB is literally the first sub-daemon started, and the controller the last. So there's a small window when the CIB is responding but the controller isn't.

Probably the easiest solution would be to check the crm_node exit status, and if it's 102, do the same "try again later" you do for nonzero crm_mon. Alternatively, you could run the crm_node command first, retrying until it's not 102, and do the crm_mon second.

I can imagine it would be useful to have a pacemaker tool specifically for checking the status of the local daemons, returning codes for "everything is down", "only corosync is up", "some pacemaker daemons are up", and "pacemaker is fully up". But checking for exit status 102 will be simpler.
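
A minimal sketch of the first suggestion above (treat crm_node exit code 102, pacemaker's "Not connected", the same way as a failing crm_mon and retry):

while :; do
    # wait until the CIB answers ...
    crm_mon --one-shot --as-xml --inactive >/dev/null 2>&1 || { sleep 2; continue; }
    # ... and until the controller answers as well
    name=$(crm_node --name); rc=$?
    if [ "$rc" -eq 102 ]; then
        sleep 2    # controller not up yet, try again later
        continue
    fi
    break    # 0: fully started and name obtained; anything else: a real error
done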

Comment 9 Tomas Jelinek 2020-01-28 16:42:52 UTC
(In reply to Ken Gaillot from comment #8)
> Probably the easiest solution would be to check the crm_node exit status,
> and if it's 102, do the same "try again later" you do for nonzero crm_mon.
> Alternatively, you could run the crm_node command first, retrying until it's
> not 102, and do the crm_mon second.

We'll do that. Thanks for the analysis!

Comment 12 Tomas Jelinek 2021-01-11 12:09:35 UTC
Created attachment 1746235 [details]
proposed fix

Test: Run 'pcs cluster start --wait' on all cluster nodes simultaneously. For more details see comment 0.

Comment 13 Miroslav Lisik 2021-02-01 16:42:03 UTC
Test:

[root@r8-node-01 ~]# rpm -q pcs
pcs-0.10.8-1.el8.x86_64

[root@r8-node-01 ~]# pcs cluster start --wait
Starting Cluster...
Waiting for node(s) to start...
Started

[root@r8-node-02 ~]# pcs cluster start --wait
Starting Cluster...
Waiting for node(s) to start...
Started

Comment 20 errata-xmlrpc 2021-05-18 15:12:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pcs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1737

