This bug has been migrated to another issue tracking site. It has been closed here and may no longer be monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at the Red Hat Issue Tracker.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The e-mail creates a ServiceNow ticket with Red Hat.

Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and have "MigratedToJIRA" set in "Keywords". The link to the successor Jira issue can be found under "Links", has a little "two-footprint" icon next to it, and directs you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link is also shown in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 2227234 - Improve error message when adding a new node to a cluster with sbd is not possible
Summary: Improve error message when adding a new node to a cluster with sbd is not possible
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pcs
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: rc
Target Release: 8.10
Assignee: Tomas Jelinek
QA Contact: cluster-qe
URL:
Whiteboard:
Depends On: 2175797
Blocks:
 
Reported: 2023-07-28 13:01 UTC by Tomas Jelinek
Modified: 2023-09-22 20:34 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Provide guidance in error messages when adding or removing a node in a cluster with an odd number of nodes, SBD enabled without disks, and auto_tie_breaker disabled.
Reason: Originally, pcs in this situation just announced that it was going to enable auto_tie_breaker and then exited with an error saying corosync was running. This was not self-explanatory and did not give users enough information to resolve the issue.
Result: The error messages have been updated. They now explain that auto_tie_breaker must be enabled because of SBD and give instructions to stop the cluster, enable auto_tie_breaker, start the cluster, and run the node add or remove command again.
Clone Of: 2175797
Environment:
Last Closed: 2023-09-22 20:34:21 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   RHEL-7732 0 None Migrated None 2023-09-22 20:34:15 UTC
Red Hat Issue Tracker RHELPLAN-163767 0 None None None 2023-07-28 13:02:35 UTC

Description Tomas Jelinek 2023-07-28 13:01:34 UTC
+++ This bug was initially created as a clone of Bug #2175797 +++

Description of problem:
pcs can get into a situation where adding a new node to a cluster with SBD is impossible, because two checks conflict: one tells the user that the cluster has to be offline in order to enable auto_tie_breaker, and the other fails because the CIB cannot be loaded (since the cluster is offline). pcs should recognize this situation and provide more intuitive output.


Version-Release number of selected component (if applicable):
found in pcs-0.11.4-6.el9


How reproducible:
Every time the number of nodes would become even after adding a node (so auto_tie_breaker would have to be enabled) and SBD without disks is in use.
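
As an illustrative aside (not part of the original report), the triggering condition can be checked from any cluster node once SBD has been enabled as in the steps below; this assumes pcs 0.10/0.11 command syntax and the hosts used in the reproducer.

## confirm SBD is configured without a shared device (no SBD_DEVICE in the output)
[root@virt-553 ~]# pcs stonith sbd config

## confirm auto_tie_breaker is not already enabled among the quorum options
[root@virt-553 ~]# pcs quorum config

## check the current node count; the problem appears when adding one more node would make it even
[root@virt-553 ~]# pcs status nodes corosync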


Steps to Reproduce:

## enable sbd on 3 node cluster

[root@virt-553 ~]# pcs stonith sbd enable 
Running SBD pre-enabling checks...
virt-484: SBD pre-enabling checks done
virt-493: SBD pre-enabling checks done
virt-553: SBD pre-enabling checks done
Distributing SBD config...
virt-553: SBD config saved
virt-493: SBD config saved
virt-484: SBD config saved
Enabling sbd...
virt-493: sbd enabled
virt-484: sbd enabled
virt-553: sbd enabled
Warning: Cluster restart is required in order to apply these changes.

[root@virt-553 ~]# pcs cluster stop --all && pcs cluster start --all
virt-553: Stopping Cluster (pacemaker)...
virt-484: Stopping Cluster (pacemaker)...
virt-493: Stopping Cluster (pacemaker)...
virt-493: Stopping Cluster (corosync)...
virt-553: Stopping Cluster (corosync)...
virt-484: Stopping Cluster (corosync)...
virt-553: Starting Cluster...
virt-493: Starting Cluster...
virt-484: Starting Cluster...


## Try to add new node to the cluster

1. in a running cluster

[root@virt-553 ~]# pcs cluster node add virt-551
No addresses specified for host 'virt-551', using 'virt-551'
No watchdog has been specified for node 'virt-551'. Using default watchdog '/dev/watchdog'
Warning: auto_tie_breaker quorum option will be enabled to make SBD fencing effective. Cluster has to be offline to be able to make this change.
Checking corosync is not running on nodes...
Error: virt-493: corosync is running
Error: virt-484: corosync is running
Error: virt-553: corosync is running
Running SBD pre-enabling checks...
virt-551: SBD pre-enabling checks done
Error: Errors have occurred, therefore pcs is unable to continue
[root@virt-553 ~]# echo $?
1

> The node can't be added in a running cluster due to the auto_tie_breaker sbd check.


2. in a stopped cluster

[root@virt-553 ~]# pcs cluster stop --all
virt-553: Stopping Cluster (pacemaker)...
virt-493: Stopping Cluster (pacemaker)...
virt-484: Stopping Cluster (pacemaker)...
virt-484: Stopping Cluster (corosync)...
virt-553: Stopping Cluster (corosync)...
virt-493: Stopping Cluster (corosync)...

[root@virt-553 ~]# pcs cluster node add virt-551
No addresses specified for host 'virt-551', using 'virt-551'
No watchdog has been specified for node 'virt-551'. Using default watchdog '/dev/watchdog'
Error: Unable to load CIB to get guest and remote nodes from it, those nodes cannot be considered in configuration validation, use --force to override
Warning: auto_tie_breaker quorum option will be enabled to make SBD fencing effective. Cluster has to be offline to be able to make this change.
Checking corosync is not running on nodes...
virt-484: corosync is not running
virt-553: corosync is not running
virt-493: corosync is not running
Running SBD pre-enabling checks...
virt-551: SBD pre-enabling checks done
Error: Errors have occurred, therefore pcs is unable to continue
[root@virt-553 ~]# echo $?
1

> The node can't be added in a stopped cluster because the CIB is unavailable.


Actual results:
In this state, it is not possible to add a node (without using --force), because the two checks are mutually exclusive: the cluster needs to be stopped to enable auto_tie_breaker, and it needs to be running to load the CIB.


Expected results:
A more intuitive error message that explains why a node can never be added in this state and what to do to resolve it (for example, disable SBD first). Alternative solutions can be discussed as well.

--- Additional comment from Tomas Jelinek on 2023-03-06 16:20:14 CET ---

possible solutions (a command sketch for each follows the list):
* enable auto_tie_breaker
* disable sbd temporarily
* use --force in pcs cluster node add
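
A rough sketch of what each option could look like on the reproducer cluster follows. This is illustrative only and not taken from the report; the node name virt-551 is reused from the reproducer, and note that changing quorum options such as auto_tie_breaker itself requires corosync to be stopped on all nodes.

## option 1: enable auto_tie_breaker while the cluster is offline, then add the node
## (this is the sequence the updated error messages now point to)
[root@virt-553 ~]# pcs cluster stop --all
[root@virt-553 ~]# pcs quorum update auto_tie_breaker=1
[root@virt-553 ~]# pcs cluster start --all
[root@virt-553 ~]# pcs cluster node add virt-551

## option 2: disable sbd temporarily, add the node, then re-enable sbd
## (re-enabling sbd on the now even-sized diskless cluster may again require auto_tie_breaker; see option 1)
[root@virt-553 ~]# pcs stonith sbd disable
[root@virt-553 ~]# pcs cluster stop --all && pcs cluster start --all
[root@virt-553 ~]# pcs cluster node add virt-551
[root@virt-553 ~]# pcs stonith sbd enable
[root@virt-553 ~]# pcs cluster stop --all && pcs cluster start --all

## option 3: keep the cluster stopped and override the CIB check with --force
[root@virt-553 ~]# pcs cluster stop --all
[root@virt-553 ~]# pcs cluster node add virt-551 --force
[root@virt-553 ~]# pcs cluster start --all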

Comment 3 RHEL Program Management 2023-09-22 20:32:51 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 4 RHEL Program Management 2023-09-22 20:34:21 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues@redhat.com. You can also visit https://access.redhat.com/articles/7032570 for general account information.

