Bug 1872490
Summary: | [RFE] Show in cluster status when Pacemaker is waiting on sbd at start-up | |
---|---|---|---
Product: | Red Hat Enterprise Linux 8 | Reporter: | Ken Gaillot <kgaillot>
Component: | pacemaker | Assignee: | Klaus Wenninger <kwenning>
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe>
Severity: | low | Docs Contact: |
Priority: | high | |
Version: | 8.3 | CC: | cluster-maint, jpokorny, kgaillot, msmazova
Target Milestone: | rc | Keywords: | FutureFeature, Triaged
Target Release: | 8.4 | |
Hardware: | All | |
OS: | All | |
Whiteboard: | | |
Fixed In Version: | pacemaker-2.0.5-6.el8 | Doc Type: | Enhancement
Doc Text: | Feature: Cluster status (via crm_mon or "pcs status") will now display more detailed information when the cluster is in the process of starting up. Reason: Previously, cluster status would have deficient or misleading information when the cluster was starting up on the local node. Result: Users get more accurate information when they check cluster status during cluster start-up. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-05-18 15:26:40 UTC | Type: | Feature Request
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1195703, 1229826, 1251196 | |
Description
Ken Gaillot 2020-08-25 21:45:06 UTC
Probably a good idea to make crm_resource --why show it too.

Rolling back to selinux-policy-3.14.3-11.el8.noarch should be another way to observe the issue, as this will prevent IPC between sbd and pacemaker and thus prevent pacemaker from getting kicked by sbd. But of course the debugger/signal makes it easier to resume normal operation.

'crm_mon' or 'pcs status' called on the node that is waiting may even make it look as if pacemaker wasn't running at all, as the sub-daemon(s) contacted by crm_mon are in fact not running. On the node that is waiting, the state of pacemakerd can be queried using 'crmadmin -P'. If we want status/analysis tools to be able to do that query from other nodes, we might have to make this usable from other nodes as well, while pcs might make use of pcsd instead.

(In reply to Klaus Wenninger from comment #3)
> 'crm_mon' or 'pcs status' called on the node that is waiting may even make
> it look as if pacemaker wasn't running at all as the sub-daemon(s) contacted
> by crm_mon are in fact not running.
> On the node that is waiting the state of pacemakerd can be queried using
> 'crmadmin -P'.
> If we want status/analysis tools to be able to do that query from other
> nodes we might have to make this usable from other nodes as well while pcs
> might make use of pcsd instead.

I was thinking only of crm_mon on the host that's waiting -- if we make crm_mon query pacemakerd first, it could show a useful message before the other daemons start. If run from other nodes, I believe crm_mon will already show the node as "pending" (i.e. in the corosync ring but not joined to the pacemaker cluster), which is probably fine. I don't think we could reasonably do anything different on the pacemaker side, though pcs status could potentially check it via pcsd as you suggested.

Your comment made me realize Bug 1194761 ('[RFE] make crm_mon indicate "pacemaker being started here" as a per-node state') overlaps with this one -- I'll close that one as a duplicate since the comments here have more detail. Basically the idea is to print that the local node is starting instead of showing all other nodes as unclean. Maybe we could show something along the lines of:

* "Pacemaker does not appear to be running on this node" = unable to contact pacemakerd
* "Pacemaker is waiting to be contacted by sbd before starting" = pacemakerd in sbd wait
* "Pacemaker is starting" = pacemakerd starting subdaemons, or no node_state entry for the local node

Beyond that, we could check whether any other node has a node_state entry, but without one we still can't know whether we've finished starting in a single-node situation, started and are now waiting for other nodes, or started and are now cut off from other nodes. So I'm not sure we can (or should) avoid the UNCLEAN messages at that point. Maybe we could show something like "never seen yet" instead of "UNCLEAN" if a node has no node_state.

*** Bug 1194761 has been marked as a duplicate of this bug. ***

Fix merged upstream as of commit 586e69ec.

crm_mon will now display more informative messages at cluster start-up when in interactive console mode, including "Pacemaker daemons starting ...", "Waiting for startup-trigger from SBD ...", and "Waiting for CIB ..." as appropriate (in normal operation they should flash by very quickly, but if any step is slow it will show up). Similarly, there are more informative messages at shutdown. For "pcs status" (or equivalently, running crm_mon in one-shot mode), these states will be shown as error messages.
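As a small sketch of the local query discussed in the comments above: the following commands can be run on the node that is still starting up. The flags are the ones this bug refers to ('crmadmin -P' and crm_mon's one-shot mode); the exact output wording may differ between pacemaker builds, so treat the comments as a rough description rather than literal output.

    # Ask pacemakerd directly for its state; this works even while the other
    # Pacemaker sub-daemons (and therefore the usual status output) are not
    # available yet, e.g. while pacemakerd is still waiting on sbd.
    crmadmin -P

    # One-shot status, the same mode "pcs status" uses; with the fix it reports
    # the transitional start-up/shutdown states as error messages instead of
    # making it look as if Pacemaker were not running at all.
    crm_mon -1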
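To hold a test cluster in the "waiting on sbd" state long enough to see these messages, one possible approach (a sketch only, for disposable test nodes with a watchdog-based sbd setup like the one in the verification below) is to pause sbd with SIGSTOP, watch the status output, and resume it before the watchdog expires:

    # Start cluster services and immediately pause sbd so it cannot deliver
    # the start-up trigger; pacemakerd then sits in its "waiting on sbd" state.
    pcs cluster start --all && killall -STOP sbd

    # In another terminal, interactive crm_mon should show the transitional
    # messages such as "Waiting for startup-trigger from SBD ...".
    crm_mon

    # Resume sbd before the watchdog timeout expires, otherwise the node will
    # eventually be fenced (as happens in the verification steps below).
    killall -CONT sbd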
> [root@virt-175 ~]# rpm -q pacemaker
> pacemaker-2.0.5-6.el8.x86_64
> [root@virt-175 ~]# rpm -q sbd
> sbd-1.4.2-2.el8.x86_64

Configure a cluster using sbd:

> [root@virt-175 ~]# pcs host auth virt-1{75,76} -u hacluster -p password
> virt-175: Authorized
> virt-176: Authorized
> [root@virt-175 ~]# pcs cluster setup test_cluster virt-175 virt-176
> [...]
> Cluster has been successfully set up.

Configure a long sbd timeout:

> [root@virt-175 ~]# pcs stonith sbd enable watchdog=/dev/watchdog SBD_WATCHDOG_TIMEOUT=40
> Running SBD pre-enabling checks...
> virt-175: SBD pre-enabling checks done
> virt-176: SBD pre-enabling checks done
> Warning: auto_tie_breaker quorum option will be enabled to make SBD fencing effective. Cluster has to be offline to be able to make this change.
> Checking corosync is not running on nodes...
> virt-175: corosync is not running
> virt-176: corosync is not running
> Sending updated corosync.conf to nodes...
> virt-175: Succeeded
> virt-176: Succeeded
> Distributing SBD config...
> virt-175: SBD config saved
> virt-176: SBD config saved
> Enabling sbd...
> virt-175: sbd enabled
> virt-176: sbd enabled
> Warning: Cluster restart is required in order to apply these changes.

Start the cluster and immediately pause sbd (SIGSTOP); as a result, the nodes are eventually fenced:

> [root@virt-175 ~]# pcs cluster start --all && killall -STOP sbd
> virt-176: Starting Cluster...
> virt-175: Starting Cluster...

Watch the output of crm_mon in another window during cluster start-up:

> [root@virt-175 ~]# crm_mon
> Waiting until cluster is available on this node ...
> Waiting for startup-trigger from SBD ...

Test whether additional messages are displayed at cluster shutdown. With the cluster running normally:

> [root@virt-175 ~]# pcs status
> Cluster name: test_cluster
> Cluster Summary:
> * Stack: corosync
> * Current DC: virt-175 (version 2.0.5-6.el8-ba59be7122) - partition with quorum
> * Last updated: Mon Feb 15 20:04:14 2021
> * Last change: Mon Feb 15 20:02:19 2021 by hacluster via crmd on virt-175
> * 2 nodes configured
> * 2 resource instances configured
> Node List:
> * Online: [ virt-175 virt-176 ]
> Full List of Resources:
> * dummy1 (ocf::pacemaker:Dummy): Started virt-175
> * dummy2 (ocf::pacemaker:Dummy): Started virt-176
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> sbd: active/enabled

Stop the cluster and run `pcs status` in another window:

> [root@virt-175 ~]# pcs cluster destroy --all
> virt-176: Stopping Cluster (pacemaker)...
> virt-175: Stopping Cluster (pacemaker)...
> virt-176: Successfully destroyed cluster
> virt-175: Successfully destroyed cluster

> [root@virt-176 ~]# pcs status
> Error: error running crm_mon, is pacemaker running?
> crm_mon: Error: cluster is not available on this node
> Pacemaker daemons shut down - reporting to SBD ...
> Error: error running crm_mon, is pacemaker running?

Marking verified in pacemaker-2.0.5-6.el8.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782