Bug 1176018
Summary: pcs/pcsd should be able to configure pacemaker remote

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 7 |
| Component | pcs |
| Version | 7.1 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Fabio Massimo Di Nitto <fdinitto> |
| Assignee | Ivan Devat <idevat> |
| QA Contact | cluster-qe <cluster-qe> |
| Docs Contact | Steven J. Levine <slevine> |
| CC | cluster-maint, idevat, jpokorny, kgaillot, michele, mlisik, rsteiger, tlavigne, tojeline |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | pcs-0.9.158-3.el7 |
| Doc Type | Release Note |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2017-08-01 18:22:57 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1450880 |
| Attachments | |

Doc Text:

New commands for creating and removing remote and guest nodes

Red Hat Enterprise Linux 7.4 provides the following new commands for creating and removing remote and guest nodes:

* pcs cluster node add-guest
* pcs cluster node remove-guest
* pcs cluster node add-remote
* pcs cluster node remove-remote

These commands replace the `pcs cluster remote-node add` and `pcs cluster remote-node remove` commands, which have been deprecated.
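To make the release note above concrete, here is a minimal usage sketch. It reuses the host and resource names from the verification transcript later in this report (rh73-node3 as the remote/guest host, dummy-guest as the guest's resource); treat it as an illustration rather than a complete reference for the command syntax.

    # Turn rh73-node3 into a remote node (this distributes the pacemaker_remote
    # authkey and enables/starts the pacemaker_remote service there), then undo it:
    pcs cluster node add-remote rh73-node3
    pcs cluster node remove-remote rh73-node3

    # Turn rh73-node3 into a guest node backed by the existing resource
    # dummy-guest, then undo it:
    pcs cluster node add-guest rh73-node3 dummy-guest
    pcs cluster node remove-guest rh73-node3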
Description
Fabio Massimo Di Nitto
2014-12-19 07:37:19 UTC
Considerations for the future:

* I believe the documentation currently does not recommend that users install pcs (and therefore pcsd) on remote nodes, because not all pcs commands work when run from a remote node's command line (even "pcs status" fails because it uses "crm_node -l"). So, if we want pcs to take action on the remote node itself, we will have to update the documentation accordingly.

* pcs currently depends on pacemaker, but remote nodes should not be required to install pacemaker. I am not sure of the best way around that if we want pcsd on remote nodes. Perhaps pacemaker and pacemaker-remote could both provide a virtual package (e.g. pacemaker-daemon) that pcs could depend on, but that might break existing workflows that assume pcs will drag in pacemaker.

* There is some confusion in that remote nodes come in two flavors: those created by an ocf:pacemaker:remote resource, and those created by the remote-node meta-attribute of another resource (such as VirtualDomain). This is why Fabio found the pcs cluster remote-node command misleading. Upstream documentation now tries to consistently refer to the first kind as "remote nodes" and the second kind as "guest nodes". I wonder if, instead of a separate remote-node command, we could use options to pcs cluster node, e.g. "pcs cluster node add --remote mynode" or "pcs cluster node add --guest=myvm mynode".

* Expanding on the previous point, if we ran pcsd on remote nodes, --start/--enable could be meaningful with the previous suggestion. The command could additionally copy the authentication key, etc.

* Per the upstream documentation at http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html#idm140473070951888 , there are a variety of options that can be configured for remote nodes. One complication is that some options (e.g. a non-default port number) must be configured both in the CIB and in /etc/sysconfig/pacemaker on the remote node (and of course the two must match). Another complication is that the CIB options are different for remote nodes and guest nodes.

* Regarding a remote node staying in status after its resource is removed from the configuration, that is expected behavior. The history of the node itself must be removed with "crm_node --force --remove $NODE_NAME". I believe that is true even of cluster nodes.

> * pcs currently depends on pacemaker, but remote nodes should not be
> required to install pacemaker. Not sure of the best way around that
> if we want pcsd on remote nodes. Perhaps pacemaker and
> pacemaker-remote could both provide a virtual package (e.g.
> pacemaker-daemon) that pcs could depend on, but that might break
> existing workflows that assume pcs will drag in pacemaker.

Making such an assumption is IMHO broken from the beginning, just as combining both the client (pcs) and the server part (pcsd) into a single package. At least I still hope pcs will be able to operate purely remotely one day, the same way ccs can operate in the old stack. There is [bug 1210833] for that. That would require that pcs (the CLI part) be a self-contained client, without any cluster-stack-related Requires. That being said, there is [bug 1300413] asking for splitting the monolithic package, which supports the previously sketched "ideal state".

Additionally, "pcs cluster status" on a RHEL 6 remote node will show:

> Error: Unable to read /etc/cluster/cluster.conf:
> No such file or directory

which is irrelevant in this scenario.
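To make the remote/guest distinction and the port-matching caveat above concrete, a rough sketch follows. The resource names, the address, and the non-default port 3125 are hypothetical, and the exact parameter names should be checked against the installed agents; this only illustrates the shape of the configuration being discussed.

    # "Remote node": a dedicated ocf:pacemaker:remote resource. A non-default
    # port configured here must also be set (as PCMK_remote_port in
    # /etc/sysconfig/pacemaker) on the remote host itself, and the two must match.
    pcs resource create rnode1 ocf:pacemaker:remote server=192.168.122.10 port=3125

    # "Guest node": an ordinary resource (e.g. VirtualDomain) that becomes a node
    # through its remote-node meta attribute.
    pcs resource create vm1 VirtualDomain hypervisor="qemu:///system" \
        config="/etc/libvirt/qemu/vm1.xml" meta remote-node=guest1

    # After the resource is removed from the configuration, the node's leftover
    # status/history entry has to be cleared explicitly:
    crm_node --force --remove rnode1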
re [comment 7]: There is another issue I have just observed in RHEL 7.3:

- install pacemaker/pcs
- remove pacemaker; it will remove pcs along with it
- install pacemaker-remote, and wonder why pcs was removed in the previous step, when at least the pcs CLI part would come in handy beyond the lifetime of the pacemaker installation
- if you subsequently want to install pcs, it will bring pacemaker back even though it is not needed at all

This sub-issue should be rectified as of [bug 1388398].

Created attachment 1282152 [details]
proposed fix (part1)
Created attachment 1282153 [details]
proposed fix (part2)
Created attachment 1282154 [details]
proposed fix (part3)
Created attachment 1282163 [details]
proposed fix - backup and restore keys
Pcs is unable to destroy a stopped cluster. It is trying to load the CIB to destroy remote and guest nodes as well. When the cluster is stopped, pcs crashes with an exception:

ERROR CIB_LOAD_ERROR Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations

Pcs crashes on cluster setup when --force is used:

# pcs cluster setup --name test rh73-node1 rh73-node2 --force
Destroying cluster on nodes: rh73-node1, rh73-node2...
rh73-node2: Stopping Cluster (pacemaker)...
rh73-node1: Stopping Cluster (pacemaker)...
rh73-node2: Successfully destroyed cluster
rh73-node1: Successfully destroyed cluster
Traceback (most recent call last):
  File "/usr/sbin/pcs", line 9, in <module>
    load_entry_point('pcs==0.9.158', 'console_scripts', 'pcs')()
  File "/usr/lib/python2.7/site-packages/pcs/app.py", line 191, in main
    cmd_map[command](argv)
  File "/usr/lib/python2.7/site-packages/pcs/cluster.py", line 85, in cluster_cmd
    cluster_setup([utils.pcs_options["--name"]] + argv)
  File "/usr/lib/python2.7/site-packages/pcs/cluster.py", line 462, in cluster_setup
    lib_env.node_communicator(),
UnboundLocalError: local variable 'lib_env' referenced before assignment

It is not possible to add a node to a stopped cluster:

# pcs cluster node add rh73-node3
Disabling SBD service...
rh73-node3: sbd disabled
Sending booth configuration to cluster nodes...
rh73-node3: Booth config(s) (booth.conf, booth.key) saved.
Error: unable to get cib

Created attachment 1282323 [details]
additional fixes
After fix:

[root@rh73-node1:~]# rpm -q pcs
pcs-0.9.158-2.el7.x86_64

> the authkey is distributed to a remote node:

[root@rh73-node3:~]# ls -l /etc/pacemaker/authkey
ls: cannot access /etc/pacemaker/authkey: No such file or directory

[root@rh73-node1:~]# pcs cluster node add-remote rh73-node3
Sending remote node configuration files to 'rh73-node3'
rh73-node3: successful distribution of the file 'pacemaker_remote authkey'
Requesting start of service pacemaker_remote on 'rh73-node3'
rh73-node3: successful run of 'pacemaker_remote enable'
rh73-node3: successful run of 'pacemaker_remote start'

[root@rh73-node3:~]# ls -l /etc/pacemaker/authkey
-r--------. 1 hacluster haclient 64 May 26 14:18 /etc/pacemaker/authkey

> a remote node removal - the authkey is deleted, node vanishes from status:

[root@rh73-node1:~]# pcs status
Cluster name: rhel73
Stack: corosync
Current DC: rh73-node2 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Fri May 26 14:21:55 2017
Last change: Fri May 26 14:18:43 2017 by root via cibadmin on rh73-node1

3 nodes configured
5 resources configured

Online: [ rh73-node1 rh73-node2 ]
RemoteOnline: [ rh73-node3 ]

Full list of resources:

 xvmNode1 (stonith:fence_xvm): Started rh73-node2
 xvmNode2 (stonith:fence_xvm): Started rh73-node1
 xvmNode3 (stonith:fence_xvm): Started rh73-node2
 dummy (ocf::pacemaker:Dummy): Started rh73-node3
 rh73-node3 (ocf::pacemaker:remote): Started rh73-node1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@rh73-node1:~]# pcs cluster node remove-remote rh73-node3
Attempting to stop: rh73-node3...Stopped
Requesting stop of service pacemaker_remote on 'rh73-node3'
rh73-node3: successful run of 'pacemaker_remote disable'
rh73-node3: successful run of 'pacemaker_remote stop'
Requesting remove remote node files from 'rh73-node3'
rh73-node3: successful removal of the file 'pacemaker_remote authkey'

[root@rh73-node1:~]# pcs status
Cluster name: rhel73
Stack: corosync
Current DC: rh73-node2 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Fri May 26 14:22:11 2017
Last change: Fri May 26 14:22:07 2017 by root via cibadmin on rh73-node1

2 nodes configured
4 resources configured

Online: [ rh73-node1 rh73-node2 ]

Full list of resources:

 xvmNode1 (stonith:fence_xvm): Started rh73-node2
 xvmNode2 (stonith:fence_xvm): Started rh73-node1
 xvmNode3 (stonith:fence_xvm): Started rh73-node2
 dummy (ocf::pacemaker:Dummy): Started rh73-node1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@rh73-node3:~]# ls -l /etc/pacemaker/authkey
ls: cannot access /etc/pacemaker/authkey: No such file or directory

> the authkey is distributed to a guest node:

[root@rh73-node3:~]# ls -l /etc/pacemaker/authkey
ls: cannot access /etc/pacemaker/authkey: No such file or directory

[root@rh73-node1:~]# pcs cluster node add-guest rh73-node3 dummy-guest
Sending remote node configuration files to 'rh73-node3'
rh73-node3: successful distribution of the file 'pacemaker_remote authkey'
Requesting start of service pacemaker_remote on 'rh73-node3'
rh73-node3: successful run of 'pacemaker_remote enable'
rh73-node3: successful run of 'pacemaker_remote start'

[root@rh73-node3:~]# ls -l /etc/pacemaker/authkey
-r--------. 1 hacluster haclient 64 May 26 14:23 /etc/pacemaker/authkey

> a guest node removal - the authkey is deleted, node vanishes from status:

[root@rh73-node1:~]# pcs status
Cluster name: rhel73
Stack: corosync
Current DC: rh73-node2 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Fri May 26 14:24:04 2017
Last change: Fri May 26 14:23:37 2017 by root via cibadmin on rh73-node1

3 nodes configured
6 resources configured

Online: [ rh73-node1 rh73-node2 ]
GuestOnline: [ rh73-node3@rh73-node1 ]

Full list of resources:

 xvmNode1 (stonith:fence_xvm): Started rh73-node2
 xvmNode2 (stonith:fence_xvm): Started rh73-node2
 xvmNode3 (stonith:fence_xvm): Started rh73-node2
 dummy (ocf::pacemaker:Dummy): Started rh73-node3
 dummy-guest (ocf::pacemaker:Dummy): Started rh73-node1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@rh73-node1:~]# pcs cluster node remove-guest rh73-node3
Requesting stop of service pacemaker_remote on 'rh73-node3'
rh73-node3: successful run of 'pacemaker_remote disable'
rh73-node3: successful run of 'pacemaker_remote stop'
Requesting remove remote node files from 'rh73-node3'
rh73-node3: successful removal of the file 'pacemaker_remote authkey'

[root@rh73-node1:~]# pcs status
Cluster name: rhel73
Stack: corosync
Current DC: rh73-node2 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Fri May 26 14:24:38 2017
Last change: Fri May 26 14:24:15 2017 by root via cibadmin on rh73-node1

2 nodes configured
5 resources configured

Online: [ rh73-node1 rh73-node2 ]

Full list of resources:

 xvmNode1 (stonith:fence_xvm): Started rh73-node2
 xvmNode2 (stonith:fence_xvm): Started rh73-node1
 xvmNode3 (stonith:fence_xvm): Started rh73-node1
 dummy (ocf::pacemaker:Dummy): Started rh73-node2
 dummy-guest (ocf::pacemaker:Dummy): Started rh73-node1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@rh73-node3:~]# ls -l /etc/pacemaker/authkey
ls: cannot access /etc/pacemaker/authkey: No such file or directory

> cluster destroy deletes the authkey:

[root@rh73-node1:~]# ll /etc/pacemaker/authkey
-r--------. 1 hacluster haclient 64 May 25 18:08 /etc/pacemaker/authkey

[root@rh73-node1:~]# pcs cluster destroy --all
rh73-node1: Stopping Cluster (pacemaker)...
rh73-node2: Stopping Cluster (pacemaker)...
rh73-node1: Successfully destroyed cluster
rh73-node2: Successfully destroyed cluster

[root@rh73-node1:~]# ll /etc/pacemaker/authkey
ls: cannot access /etc/pacemaker/authkey: No such file or directory

> it is possible to destroy a stopped cluster:

[root@rh73-node1:~]# pcs cluster start --all --wait
rh73-node2: Starting Cluster...
rh73-node1: Starting Cluster...
Waiting for node(s) to start...
rh73-node1: Started
rh73-node2: Started

[root@rh73-node1:~]# pcs cluster stop --all
rh73-node1: Stopping Cluster (pacemaker)...
rh73-node2: Stopping Cluster (pacemaker)...
rh73-node2: Stopping Cluster (corosync)...
rh73-node1: Stopping Cluster (corosync)...

[root@rh73-node1:~]# pcs cluster destroy --all
Warning: Unable to load CIB to get guest and remote nodes from it, those nodes will not be deconfigured.
rh73-node2: Stopping Cluster (pacemaker)...
rh73-node1: Stopping Cluster (pacemaker)...
rh73-node2: Successfully destroyed cluster
rh73-node1: Successfully destroyed cluster

[root@rh73-node1:~]# echo $?
0

> pcs cluster setup with --force works:

[root@rh73-node1:~]# pcs cluster setup --name rhel73 rh73-node1 rh73-node2 --force
Destroying cluster on nodes: rh73-node1, rh73-node2...
rh73-node2: Stopping Cluster (pacemaker)...
rh73-node1: Stopping Cluster (pacemaker)...
rh73-node2: Successfully destroyed cluster
rh73-node1: Successfully destroyed cluster
Sending 'corosync authkey', 'pacemaker_remote authkey' to 'rh73-node1', 'rh73-node2'
rh73-node1: successful distribution of the file 'corosync authkey'
rh73-node1: successful distribution of the file 'pacemaker_remote authkey'
rh73-node2: successful distribution of the file 'corosync authkey'
rh73-node2: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
rh73-node1: Succeeded
rh73-node2: Succeeded
Synchronizing pcsd certificates on nodes rh73-node1, rh73-node2...
rh73-node1: Success
rh73-node2: Success
Restarting pcsd on the nodes in order to reload the certificates...
rh73-node1: Success
rh73-node2: Success

[root@rh73-node1:~]# echo $?
0

> adding a node to a stopped cluster works:

[root@rh73-node1:~]# pcs status
Error: cluster is not currently running on this node

[root@rh73-node1:~]# pcs cluster node add rh73-node3
Disabling SBD service...
rh73-node3: sbd disabled
Sending booth configuration to cluster nodes...
rh73-node3: Booth config(s) (booth.conf, booth.key) saved.
Sending 'corosync authkey' to 'rh73-node3'
rh73-node3: successful distribution of the file 'corosync authkey'
Sending remote node configuration files to 'rh73-node3'
rh73-node3: successful distribution of the file 'pacemaker_remote authkey'
rh73-node1: Corosync updated
rh73-node2: Corosync updated
Setting up corosync...
rh73-node3: Succeeded
Synchronizing pcsd certificates on nodes rh73-node3...
rh73-node3: Success
Restarting pcsd on the nodes in order to reload the certificates...
rh73-node3: Success

[root@rh73-node1:~]# echo $?
0

> the authkey is part of backup / restore procedure

[root@rh73-node1:~]# cat /etc/corosync/authkey
c92c34c422808663f15bfc811faf12f84984bc270fc846f31601431d1c47ae3f9658b4a9ac55194503a039f145f6293e90f414d7ff78e54956f3a140c12fe00c8e5441d9ab73d0aeaf88f5d822098f93c8f591f94e27c16aa8626efa5017461d39e4f9cfddf16648636a465110813760044e4de3ee33f33dfdbfa464ce02cc22[root@rh73-node1:~]#
[root@rh73-node1:~]# cat /etc/pacemaker/authkey
eca4d1834c7619b535a19432e90c3107fa3e9a82d06a513ff33ae32d701d00d4[root@rh73-node1:~]#

[root@rh73-node1:~]# pcs config backup cluster.tar.bz2

[root@rh73-node1:~]# tar -tf cluster.tar.bz2 | grep authkey
corosync_authkey
pacemaker_authkey

[root@rh73-node1:~]# pcs cluster destroy --all
rh73-node1: Stopping Cluster (pacemaker)...
rh73-node2: Stopping Cluster (pacemaker)...
rh73-node2: Successfully destroyed cluster
rh73-node1: Successfully destroyed cluster

[root@rh73-node1:~]# cat /etc/corosync/authkey
cat: /etc/corosync/authkey: No such file or directory
[root@rh73-node1:~]# cat /etc/pacemaker/authkey
cat: /etc/pacemaker/authkey: No such file or directory

[root@rh73-node1:~]# pcs config restore cluster.tar.bz2
rh73-node1: Succeeded
rh73-node2: Succeeded

[root@rh73-node1:~]# cat /etc/corosync/authkey
c92c34c422808663f15bfc811faf12f84984bc270fc846f31601431d1c47ae3f9658b4a9ac55194503a039f145f6293e90f414d7ff78e54956f3a140c12fe00c8e5441d9ab73d0aeaf88f5d822098f93c8f591f94e27c16aa8626efa5017461d39e4f9cfddf16648636a465110813760044e4de3ee33f33dfdbfa464ce02cc22[root@rh73-node1:~]#
[root@rh73-node1:~]# cat /etc/pacemaker/authkey
eca4d1834c7619b535a19432e90c3107fa3e9a82d06a513ff33ae32d701d00d4[root@rh73-node1:~]#

There are additional problems:

> flag --skip-offline is ignored

[vm-rhel72-1 ~] $ pcs cluster node add-guest no-host D
Error: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known), use --skip-offline to override

[vm-rhel72-1 ~] $ pcs cluster node add-guest no-host D --skip-offline
Error: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known), use --skip-offline to override

> guest is removed from cib even if the command does not succeed

[vm-rhel72-1 ~] $ pcs cluster node remove-guest no-host
Requesting stop of service pacemaker_remote on 'no-host'
Error: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known), use --skip-offline to override

[vm-rhel72-1 ~] $ pcs cluster node remove-guest no-host
Error: guest node 'no-host' does not appear to exist in configuration

Created attachment 1283759 [details]
proposed fix (part6)
After Fix:

[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.158-3.el7.x86_64

> flag --skip-offline

[vm-rhel72-1 ~] $ pcs cluster node add-guest no-host D
Error: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known), use --skip-offline to override

[vm-rhel72-1 ~] $ pcs cluster node add-guest no-host D --skip-offline
Warning: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known)
Sending remote node configuration files to 'no-host'
Warning: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known)
Requesting start of service pacemaker_remote on 'no-host'
Warning: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known)

> guest is removed from cib even if the command does not succeed

[vm-rhel72-1 ~] $ pcs cluster node remove-guest no-host
Requesting stop of service pacemaker_remote on 'no-host'
Error: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known), use --skip-offline to override

[vm-rhel72-1 ~] $ pcs cluster node remove-guest no-host --skip-offline
Requesting stop of service pacemaker_remote on 'no-host'
Warning: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known)
Requesting remove remote node files from 'no-host'
Warning: Unable to connect to no-host (Could not resolve host: no-host; Name or service not known)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1958