Bug 1323544
| Summary: | Better handling of remote nodes when generating crm_reports | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Andrew Beekhof <abeekhof> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.2 | CC: | abeekhof, cfeist, cluster-maint, jkortus, lmiksik, phagara |
| Target Milestone: | rc | | |
| Target Release: | 7.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pacemaker-1.1.15-11.el7 | Doc Type: | No Doc Update |
| Doc Text: | | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-03 18:59:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Andrew Beekhof
2016-04-04 02:03:18 UTC
It is only by accident that crm_report works with remote nodes at all.

Some of the issues that need to be addressed:

1. Unless --single-node is used, crm_report must be initiated from a cluster node, because it must determine the cluster stack type and nodes via the cluster stack runtime, configuration and/or logs.

2. Without --single-node or --nodes, crm_report will detect remote nodes only if they have ever had a permanent node attribute, because that is what gets them an entry in the nodes section of the CIB. Ideally, it would also check the CIB for remote node resources in the configuration section.

3. crm_report will only work with remote nodes whose name in the cluster is resolvable and usable via ssh. Ideally, it could use any name specified with server= for remote nodes or remote-addr= for guest nodes.

4. crm_report will only work with remote nodes that have the full cluster stack installed, because it requires that pengine and its input directory exist. Ideally, it would not require these to exist on remote nodes.

5. crm_report will print the "config not found" warning and dirname error on remote nodes because no CIB exists there. It will also choose wrong directories to search for blackboxes and cores as a side effect of this. Ideally, it would not require the CIB to exist, but would still find the blackbox and cores directories correctly, on remote nodes.

6. Without --cluster, crm_report will assume a cluster stack of heartbeat when collecting files from remote nodes, even if it correctly determines the stack type on the initiating node. Even with --cluster, crm_report will print the "cluster configuration" warning (and might take a very long time to do so) if there is no cluster stack configuration on the remote node. Ideally, it would not look for stack configuration on remote nodes.

7. crm_report will not detect the pacemaker detail log used on remote nodes, because it searches the stack configuration for log settings. Ideally, it would search /etc/sysconfig/pacemaker for log settings on remote nodes.

8. When crm_report checks for installed packages, it doesn't look for pacemaker_remote. It should.

9. When crm_report creates permissions.txt, it may report that certain directories do not exist, which is not a problem on remote nodes. Ideally, it would not print any messages for directories that aren't required on remote nodes.

10. crm_report will always add remote nodes to the STOPPED file, because they do not run crmd. It should add them to RUNNING if pacemaker_remote is running.

Not all of these will be addressable anytime soon, but we can knock out a few low-hanging fruit and document the rest as limitations, to close this bug.

Fixed upstream as of commit 90b675e2.

In particular, the upstream changes completely address points 4, 5, 6, 8, and 10 from Comment 2, partially address points 1 and 7, and improve the help text to indicate limitations that still exist. They also make some general improvements that apply to all nodes.

QA: We only need to test "crm_report --single-node --from <DATE-TIME>", as that is similar to what sosreport will use. When run on a Pacemaker Remote node, it should not print any of the errors mentioned in the Description, it should correctly grab any cores from /var/lib/pacemaker/cores, and the sysinfo.txt file should include "Verifying installation of: pacemaker-remote".

> [root@virt-265 ~]# pcs status
> Cluster name: foo
> Stack: corosync
> Current DC: virt-243 (version 1.1.15-1.2c148ac.git.el7-2c148ac) - partition with quorum
> Last updated: Mon Sep 12 21:40:38 2016 Last change: Mon Sep 12 21:22:37 2016 by root via cibadmin on virt-242
>
> 4 nodes and 1 resource configured
>
> Online: [ virt-242 virt-243 virt-265 ]
> RemoteOnline: [ virt-266 ]
>
> Full list of resources:
>
> virt-266 (ocf::pacemaker:remote): Started virt-242
>
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root@virt-266 ~]# rpm -q pacemaker-remote
> pacemaker-remote-1.1.15-1.2c148ac.git.el7.x86_64
> [root@virt-266 ~]# date && crm_report --single-node --from "2016-09-11 00:00:00" -V
> Mon Sep 12 21:29:14 CEST 2016
> virt-266: Debug: Detected the 'any' cluster stack
> virt-266: Debug: We are a cluster node
> virt-266: Collecting data from virt-266.cluster-qe.lab.eng.brq.redhat.com (09/11/16 00:00:00 to 09/12/16 21:29:14)
> virt-266: Debug: Using full path to working directory: /root/pcmk-Mon-12-Sep-2016
> virt-266: Debug: Machine runtime directory: /var
> virt-266: Debug: Pacemaker runtime data located in: /var/run/crm
> virt-266: Searching for where Pacemaker daemons live... this may take a while
> ^C
> [root@virt-266 ~]# date
> Mon Sep 12 21:44:21 CEST 2016
Searching for pacemaker daemons takes too long / never completes.
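
When reproducing a hang like the one above, it may help to bound the run instead of interrupting it by hand. The following is a minimal sketch, assuming the GNU coreutils `timeout` command is available; `run_bounded` is a hypothetical helper name and is not part of Pacemaker or crm_report.

```shell
# run_bounded SECONDS COMMAND [ARG...]
# Hypothetical helper: runs COMMAND under a time limit so that a hang
# shows up as a distinct exit status instead of requiring ^C.
run_bounded() {
    limit=$1
    shift

    # GNU timeout exits with status 124 when the limit is reached.
    timeout "$limit" "$@"
    status=$?

    if [ "$status" -eq 124 ]; then
        echo "command did not finish within ${limit}s" >&2
    fi
    return "$status"
}
```

For example, `run_bounded 300 crm_report --single-node --from "2016-09-11 00:00:00" -V` would give up after five minutes; the 300-second limit is an arbitrary choice for illustration.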
Currently, crm_report does:
- look for pengine in the standard locations
- if not found, do a global search for pengine
- if still not found, look for pacemaker_remoted
- if still nothing found, error

We can move the pacemaker_remoted check before the global search.

Fixed upstream (using the approach in Comment 10) as of commit 700c800.

> In particular, the upstream changes completely address points 4, 5, 6, 8, and 10 from Comment 2,

> 4. crm_report will only work with remote nodes that have the full cluster stack installed, because it requires that pengine and its input directory exist. Ideally, it would not require these to exist on remote nodes.

crm_report on a pacemaker_remote node no longer requires installing the whole cluster stack, only those packages on which the pacemaker-remote package depends.

> 5. crm_report will print the "config not found" warning and dirname error on remote nodes because no CIB exists there. It will also choose wrong directories to search for blackboxes and cores as a side effect of this. Ideally, it would not require the CIB to exist, but would still find the blackbox and cores directories correctly, on remote nodes.

The "config not found" warning is not printed when the CIB does not exist; both blackboxes and cores are successfully collected.

> 6. Without --cluster, crm_report will assume a cluster stack of heartbeat when collecting files from remote nodes, even if it correctly determines the stack type on the initiating node. Even with --cluster, crm_report will print the "cluster configuration" warning (and might take a very long time to do so) if there is no cluster stack configuration on the remote node. Ideally, it would not look for stack configuration on remote nodes.

No "cluster configuration" warning is printed; pacemaker_remote detection now completes in a timely fashion.

> 8. When crm_report checks for installed packages, it doesn't look for pacemaker_remote. It should.

The line `Verifying installation of: resource-agents` is present in sysinfo.txt.

> 10. crm_report will always add remote nodes to the STOPPED file, because they do not run crmd. It should add them to RUNNING if pacemaker_remote is running.

The remote node is now listed in the RUNNING file.

Marking as verified in pacemaker{,-remote}-1.1.15-11.el7.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html
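
The reordering proposed in Comment 10 (check for pacemaker_remoted before falling back to a global search) can be illustrated roughly as below. This is a sketch under stated assumptions, not the code from the upstream commit: `find_daemon_dir` is a hypothetical function name, the list of "standard locations" is passed in by the caller, and it is assumed that pacemaker_remoted is looked for in those same locations.

```shell
# find_daemon_dir ROOT DIR...
# Hypothetical sketch of the reordered lookup, not the actual crm_report code.
#   ROOT    tree to scan in the last-resort global search
#   DIR...  "standard" locations to probe first (caller-supplied assumption)
find_daemon_dir() {
    root=$1
    shift

    # Cheap checks first: pengine (full cluster node), then
    # pacemaker_remoted (Pacemaker Remote node), in the standard locations.
    for name in pengine pacemaker_remoted; do
        for dir in "$@"; do
            if [ -x "$dir/$name" ]; then
                echo "$dir"
                return 0
            fi
        done
    done

    # Last resort: slow scan of the whole tree for pengine.
    found=$(find "$root" -type f -name pengine 2>/dev/null | head -n 1)
    if [ -n "$found" ]; then
        dirname "$found"
        return 0
    fi

    echo "cannot find pacemaker daemons" >&2
    return 1
}
```

A call might look like `find_daemon_dir / /usr/libexec/pacemaker /usr/lib/pacemaker`, where the two directories are assumed example locations. On a remote node that has only pacemaker_remoted installed, the function returns from the first loop and never reaches the slow `find` pass, which is the point of the reordering.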