Bug 1374175
Summary: | "crm_node -n" needs to return the right name on remote nodes | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Tomas Jelinek <tojeline> |
Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
Status: | CLOSED ERRATA | QA Contact: | Patrik Hagara <phagara> |
Severity: | low | Docs Contact: | |
Priority: | medium | ||
Version: | 7.3 | CC: | abeekhof, cfeist, cluster-maint, cluster-qe, idevat, jpokorny, kgaillot, kwenning, michele, mnovacek, phagara, rmarigny, rsteiger, sbradley, tojeline |
Target Milestone: | rc | ||
Target Release: | 7.6 | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | pacemaker-1.1.19-2.el7 | Doc Type: | Bug Fix |
Doc Text: |
Cause: Pacemaker had no way to report the node name of a Pacemaker Remote node to a tool executed on that node's command line.
Consequence: If a Pacemaker Remote's node name were different from its local hostname, tools like crm_node would incorrectly report the hostname as the node name, when run from that node's command line.
Fix: A new cluster daemon request provides the local node name to any requesting tool.
Result: crm_node, and tools that use it such as crm_standby and crm_failcount, now correctly report the local node name, even when run from the command line of a Pacemaker Remote node whose node name is different from its local hostname.
|
Story Points: | --- |
Clone Of: | 1290512 | Environment: | |
Last Closed: | 2018-10-30 07:57:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1290512, 1477664 |
Description
Tomas Jelinek
2016-09-08 07:46:48 UTC

Pcs currently uses "crm_node --cluster-id" and "crm_node --name-for-id" to get the local node name instead of "crm_node -n", because "crm_node -n" returns the node's hostname when the cluster is not running on the node. Here is the function: https://github.com/ClusterLabs/pcs/blob/2439c263cad6952c12f3a4fe73db6656a7094a1b/pcs/lib/pacemaker.py#L210

The function is intended to get the local node's name as Pacemaker knows it, so it raises an exception if Pacemaker is not running. That is completely fine for our use case: we either get the name, or we know Pacemaker is not running. So we would like to have "crm_node --cluster-id" and "crm_node --name-for-id" working on remote nodes as well.
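For illustration, a minimal shell sketch of the lookup described above, assuming only the two crm_node calls named in this comment; the real logic is the linked Python function (which does more output validation), and the error message below is made up, not pcs's actual wording:

```sh
#!/bin/sh
# Sketch only: resolve the local node name the way the comment describes,
# by asking for the corosync id first and then translating it to a name.

get_local_node_name() {
    # Fails (non-zero exit) when the cluster is not running on this node,
    # which is the condition the pcs function turns into an exception.
    node_id=$(crm_node --cluster-id) || {
        echo "pacemaker does not appear to be running on this node" >&2
        return 1
    }
    # Translate the id back into the node name Pacemaker uses.
    crm_node --name-for-id "$node_id"
}

get_local_node_name
```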
This will have to be addressed in the 7.4 timeframe.

This will not be ready in the 7.4 timeframe.

*** Bug 1417936 has been marked as a duplicate of this bug. ***

When thinking of solutions for this issue, I wanted to raise attention that sbd is another instance that might benefit from a way to ask for the node name on remote nodes. With the shared block device(s) disabled (as up to 7.3), it is simpler: just the pacemaker watcher needs this info to check the CIB for the state of the remote node, so it is fine if the name is available only after the remote node is connected by a cluster node. With the shared block device(s) enabled, the node name is needed to occupy the correct slot on the device(s) as well, so it would of course be nice to have the info right at the start of sbd. Dreaming is allowed ;-)

Of course there are ways to work around this issue, such as giving the node name in the sbd config file (already available and the way to do it at the moment), occupying a 2nd slot once pacemaker-remote is connected, using pcmk_host_map to fence remote nodes, ... The latter two would use a mechanism to query the host name via pacemaker-remote for the pacemaker watcher - something that could as well be used by crm_node - and would thus avoid the need for a node name to be configured in the sbd config.

Just wanted to state that there are issues with sbd in general on remote nodes, which is why we don't officially support it there. The thought above might serve as a piece of the puzzle making it smooth enough to be supported.
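To make the two workarounds named above concrete, a hedged sketch follows. The sbd `-n` option spelling, the fence agent (fence_xvm), and the mapped value are illustrative assumptions, not tested configuration from this bug; only pcmk_host_map itself is taken from the comment.

```sh
# Hypothetical excerpt of /etc/sysconfig/sbd: hand sbd the cluster node name
# explicitly instead of letting it default to the local hostname.
# (Assumed option; check the sbd man page for your version.)
SBD_OPTS="-n my-remote-node"

# Fencing the remote node through a host map, so the node's cluster name does
# not have to match anything the fence device can discover on its own.
# The agent and the mapped value here are purely illustrative.
pcs stonith create fence-my-remote fence_xvm \
    pcmk_host_map="my-remote-node:virt-149.cluster-qe.lab.eng.brq.redhat.com"
```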
I'd like to raise another point: when pcs finally does not depend on the "pacemaker" package (which unnecessarily pulls in also corosync on a remote node where you want pcs installed, see also [bug 1388398]), there's no crm_node utility whatsoever. Then it would be wise to move crm_node over to the -cli package. That being said, pcs already expects crm_node to be present, see reopened [bug 1327302], and it's questionable if pcs can make do without it on remote nodes. Tomáš can comment more on this topic.

Pcs indeed relies on crm_node to be present on remote nodes. Pcs uses it to figure out the local node name. That is needed, for example, in the commands "pcs node standby" and "pcs node maintenance" when no node is specified. This was discussed in bz1290512, from which this bz was cloned, and is summarized here in comment 0.

*** Bug 1327302 has been marked as a duplicate of this bug. ***

Just for completeness: do all occurrences of crm_node within pcs require pacemaker to be running (e.g. an lrmd instance, pacemaker-remote or anything)!?

(In reply to Jan Pokorný from comment #6)
> I'd like to raise another point: when pcs finally does not depend on the
> "pacemaker" package (which unnecessarily pulls in also corosync on a
> remote node where you want pcs installed, see also [bug 1388398]),
> there's no crm_node utility whatsoever. Then it would be wise to
> move crm_node over to the -cli package.

To clarify, this is already a goal that depends on this bz. crm_node is not in the -cli package precisely because it requires the -cluster-libs package, and we do not want -cli to depend on that. The same is true of crm_attribute. If that dependency can be removed, those tools will be moved to -cli.

This will not make it in time for 7.5.

Fixed upstream as of pull request https://github.com/ClusterLabs/pacemaker/pull/1515

To summarize the final implementation: crm_node -n/--name, -N/--name-for-id, and -i/--cluster-id now work on full cluster nodes and Pacemaker Remote nodes, whether or not their name in the cluster matches their local hostname, and whether or not they are called from a resource agent or manually. (Note that --name-for-id is intended to be useful only for full cluster nodes, as remote nodes do not have a corosync id.) The crm_node commands will now return an error if the cluster is not running.

Not relevant to RHEL, but for completeness: the upstream fix for the 1.1 series fixes -i/--cluster-id for the corosync 2+ stack only (-n and -N are fixed for all stacks). The upstream fix for the 2.0 series additionally fixes -q/--quorum and -R/--remove.

Also for completeness' sake: the crm_standby and crm_failcount tools both default to "crm_node -n" if no node is explicitly specified, so they are also fixed by this.

The latest build fixes one regression in the original: with the original fix, if a resource agent called "crm_node -n" (or indirectly via the ocf_local_nodename function) for its meta-data action, the meta-data action would time out when called by the cluster, because the node name was not passed for meta-data actions, causing a deadlock between the agent and the cluster. With the latest build, the node name is passed to meta-data actions, so they succeed as usual when called by the cluster.
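To make the resource-agent angle concrete, here is a schematic OCF agent fragment, not code from this bug: the meta-data action emits static XML without talking to the cluster, while the monitor action asks for the local node name via ocf_local_nodename (which wraps "crm_node -n"). The agent name and its contents are illustrative assumptions.

```sh
#!/bin/sh
# Schematic agent fragment illustrating the pattern described above.
# Calling "crm_node -n" (directly or via ocf_local_nodename) from meta-data
# is what dead-locked with the original fix; the follow-up build passes the
# node name to meta-data actions so this pattern is safe again.

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

meta_data() {
    # Static XML only: nothing here needs to query the cluster.
    cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="demo">
  <version>1.0</version>
  <longdesc lang="en">Example agent</longdesc>
  <shortdesc lang="en">Example agent</shortdesc>
  <parameters/>
  <actions>
    <action name="start"     timeout="20s"/>
    <action name="stop"      timeout="20s"/>
    <action name="monitor"   timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

monitor() {
    # Here it is reasonable (and, with the fix, correct on remote nodes)
    # to ask for the local node's cluster name.
    node=$(ocf_local_nodename)
    ocf_log info "monitoring on cluster node ${node}"
    return $OCF_SUCCESS
}

case "$1" in
    meta-data)  meta_data; exit $OCF_SUCCESS ;;
    start|stop) exit $OCF_SUCCESS ;;   # no-op in this fragment
    monitor)    monitor; exit $? ;;
    *)          exit $OCF_ERR_UNIMPLEMENTED ;;
esac
```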
before:
=======

> [root@virt-148 ~]# rpm -q pacemaker
> pacemaker-1.1.18-12.el7.x86_64
> [root@virt-148 ~]# ssh virt-149 rpm -q pacemaker-remote
> pacemaker-remote-1.1.18-12.el7.x86_64
> [root@virt-148 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-148.cluster-qe.lab.eng.brq.redhat.com (version 1.1.18-12.el7-2b07d5c5a9) - partition with quorum
> Last updated: Tue Aug 21 13:57:37 2018
> Last change: Tue Aug 21 13:52:10 2018 by root via cibadmin on virt-148.cluster-qe.lab.eng.brq.redhat.com
>
> 1 node configured
> 0 resources configured
>
> Online: [ virt-148.cluster-qe.lab.eng.brq.redhat.com ]
>
> No resources
>
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
> [root@virt-148 ~]# pcs cluster node add-remote virt-149.cluster-qe.lab.eng.brq.redhat.com my-remote-node
> Sending remote node configuration files to 'virt-149.cluster-qe.lab.eng.brq.redhat.com'
> virt-149.cluster-qe.lab.eng.brq.redhat.com: successful distribution of the file 'pacemaker_remote authkey'
> Requesting start of service pacemaker_remote on 'virt-149.cluster-qe.lab.eng.brq.redhat.com'
> virt-149.cluster-qe.lab.eng.brq.redhat.com: successful run of 'pacemaker_remote enable'
> virt-149.cluster-qe.lab.eng.brq.redhat.com: successful run of 'pacemaker_remote start'
> [root@virt-148 ~]# pcs status
> ...
> 2 nodes configured
> 1 resource configured
>
> Online: [ virt-148.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ my-remote-node ]
>
> Full list of resources:
>
> my-remote-node (ocf::pacemaker:remote): Started virt-148.cluster-qe.lab.eng.brq.redhat.com
> ...
> [root@virt-148 ~]# ssh virt-149 crm_node -n
> virt-149.cluster-qe.lab.eng.brq.redhat.com
> [root@virt-148 ~]# echo $?
> 0
> [root@virt-148 ~]# ssh virt-149 crm_node -i
> [root@virt-148 ~]# echo $?
> 1
> [root@virt-148 ~]# ssh virt-149 pcs cluster standby
> Error: unable to get local node name from pacemaker: node id not found
> [root@virt-148 ~]# echo $?
> 1
> [root@virt-148 ~]# pcs status
> ...
> Online: [ virt-148.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ my-remote-node ]
> ...
> [root@virt-148 ~]# ssh virt-149 pcs node maintenance
> Error: unable to get local node name from pacemaker: node id not found
> [root@virt-148 ~]# echo $?
> 1
> [root@virt-148 ~]# pcs status
> ...
> Online: [ virt-148.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ my-remote-node ]
> ...
> [root@virt-148 ~]# ssh virt-149 pcs cluster standby my-remote-node
> [root@virt-148 ~]# echo $?
> 0
> [root@virt-148 ~]# pcs status
> ...
> RemoteNode my-remote-node: standby
> Online: [ virt-148.cluster-qe.lab.eng.brq.redhat.com ]
> ...

Before the fix, a remote node configured with a node name different from its hostname was unable to determine its correct cluster node name and ID. Consequently, it was not possible to put such a remote node into standby or maintenance mode, or to take it back out of either mode, by running the appropriate command on the remote node itself without passing the correct node name as an argument. Passing the correct node name as an argument to the (un)standby/(un)maintenance commands successfully worked around the issue.
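Written out as commands, the workaround amounts to the following (host and node names taken from the transcript above; running the same commands from a full cluster node works as well):

```sh
# On the unfixed packages, name the remote node explicitly so pcs never has
# to ask crm_node for the local node name:
ssh virt-149 pcs cluster standby my-remote-node
ssh virt-149 pcs node maintenance my-remote-node
```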
after:
======

> [root@virt-136 ~]# rpm -q pacemaker
> pacemaker-1.1.19-7.el7.x86_64
> [root@virt-136 ~]# ssh virt-138 rpm -q pacemaker-remote
> pacemaker-remote-1.1.19-7.el7.x86_64
> [root@virt-136 ~]# pcs status
> ...
> 1 node configured
> 0 resources configured
>
> Online: [ virt-136.cluster-qe.lab.eng.brq.redhat.com ]
>
> No resources
> ...
> [root@virt-136 ~]# pcs cluster node add-remote virt-138.cluster-qe.lab.eng.brq.redhat.com my-remote-node
> Sending remote node configuration files to 'virt-138.cluster-qe.lab.eng.brq.redhat.com'
> virt-138.cluster-qe.lab.eng.brq.redhat.com: successful distribution of the file 'pacemaker_remote authkey'
> Requesting start of service pacemaker_remote on 'virt-138.cluster-qe.lab.eng.brq.redhat.com'
> virt-138.cluster-qe.lab.eng.brq.redhat.com: successful run of 'pacemaker_remote enable'
> virt-138.cluster-qe.lab.eng.brq.redhat.com: successful run of 'pacemaker_remote start'
> [root@virt-136 ~]# pcs status
> ...
> Online: [ virt-136.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ my-remote-node ]
>
> Full list of resources:
>
> my-remote-node (ocf::pacemaker:remote): Started virt-136.cluster-qe.lab.eng.brq.redhat.com
> ...
> [root@virt-136 ~]# ssh virt-138 crm_node -n
> my-remote-node
> [root@virt-136 ~]# ssh virt-138 crm_node -i
> my-remote-node
> [root@virt-136 ~]# ssh virt-138 pcs cluster standby
> [root@virt-136 ~]# pcs status
> ...
> RemoteNode my-remote-node: standby
> Online: [ virt-136.cluster-qe.lab.eng.brq.redhat.com ]
> ...
> [root@virt-136 ~]# ssh virt-138 pcs cluster unstandby
> [root@virt-136 ~]# pcs status
> ...
> Online: [ virt-136.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ my-remote-node ]
> ...
> [root@virt-136 ~]# ssh virt-138 pcs node maintenance
> [root@virt-136 ~]# pcs status
> ...
> RemoteNode my-remote-node: maintenance
> Online: [ virt-136.cluster-qe.lab.eng.brq.redhat.com ]
> ...
> [root@virt-136 ~]# ssh virt-138 pcs node unmaintenance
> [root@virt-136 ~]# pcs status
> ...
> Online: [ virt-136.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ my-remote-node ]
> ...

After the fix, remote nodes are able to correctly determine their name. Marking verified in pacemaker-1.1.19-7.el7.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055