Bug 1323544

Summary: Better handling of remote nodes when generating crm_reports
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.2
Reporter: Andrew Beekhof <abeekhof>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: abeekhof, cfeist, cluster-maint, jkortus, lmiksik, phagara
Status: CLOSED ERRATA
Severity: low
Priority: medium
Target Milestone: rc
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.15-11.el7
Doc Type: No Doc Update
Last Closed: 2016-11-03 18:59:08 UTC
Type: Bug

Description Andrew Beekhof 2016-04-04 02:03:18 UTC
[root@airfrance-2 ~]# crm_report --from 11:00
airfrance-2:  Calculated node list: airfrance-1 airfrance-2 airfrance-3 

^^^^-- sweet

vvvv-- not so sweet

airfrance-3:  WARN: Non-standard Pacemaker installation: config not found
dirname: missing operand
Try 'dirname --help' for more information.
airfrance-3:  Detecting where Pacemaker keeps Policy Engine inputs... this may take a while
airfrance-3:  Found: /var/lib/pacemaker/pengine
airfrance-3:  WARN: Could not determine the location of your cluster configuration

Comment 2 Ken Gaillot 2016-04-04 23:57:16 UTC
It is only by accident that crm_report works with remote nodes at all. Some of the issues that need to be addressed:

1. Unless --single-node is used, crm_report must be initiated from a cluster node, because it must determine the cluster stack type and nodes via the cluster stack runtime, configuration and/or logs.

2. Without --single-node or --nodes, crm_report will detect remote nodes only if they have ever had a permanent node attribute, because that is what gets them an entry in the nodes section of the CIB. Ideally, it would also check the CIB for remote node resources in the configuration section.

3. crm_report will only work with remote nodes whose name in the cluster is resolvable and usable via ssh. Ideally, it could use any name specified with server= for remote nodes or remote-addr= for guest nodes.

4. crm_report will only work with remote nodes that have the full cluster stack installed, because it requires that pengine and its input directory exist. Ideally, it would not require these to exist on remote nodes.

5. crm_report will print the "config not found" warning and dirname error on remote nodes because no CIB exists there. It will also choose wrong directories to search for blackboxes and cores as a side effect of this. Ideally, it would not require the CIB to exist, but would still find the blackbox and cores directories correctly, on remote nodes.

6. Without --cluster, crm_report will assume a cluster stack of heartbeat when collecting files from remote nodes, even if it correctly determines the stack type on the initiating node. Even with --cluster, crm_report will print the "cluster configuration" warning (and might take a very long time to do so) if there is no cluster stack configuration on the remote node. Ideally, it would not look for stack configuration on remote nodes.

7. crm_report will not detect the pacemaker detail log used on remote nodes, because it searches the stack configuration for log settings. Ideally, it would search /etc/sysconfig/pacemaker for log settings on remote nodes.

8. When crm_report checks for installed packages, it doesn't look for pacemaker_remote. It should.

9. When crm_report creates permissions.txt, it may report that certain directories do not exist, which is not a problem on remote nodes. Ideally, it would not print any messages for directories that aren't required on remote nodes.

10. crm_report will always add remote nodes to the STOPPED file, because they do not run crmd. It should add them to RUNNING if pacemaker_remote is running.

Not all of these will be addressable anytime soon, but we can knock out a few low-hanging fruit and document the rest as limitations, to close this bug.
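
Point 10 can be illustrated with a small shell sketch, shell being the language of the session output quoted in this bug. This is not the upstream change: the function name node_state_file is hypothetical, and it assumes a node counts as running when either crmd or pacemaker_remoted appears in a list of process names that the caller has already gathered.

```shell
#!/bin/sh
# Hypothetical helper (not part of crm_report): decide which status file a
# node belongs in, given a whitespace-separated list of process names.
node_state_file() {
    for proc in $1; do
        case "$proc" in
            crmd|pacemaker_remoted)
                # Either daemon means the node is participating in the cluster.
                echo "RUNNING"
                return 0
                ;;
        esac
    done
    # Neither daemon was seen in the list.
    echo "STOPPED"
}
```

For example, node_state_file "sshd pacemaker_remoted" prints RUNNING, while node_state_file "sshd crond" prints STOPPED.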

Comment 3 Ken Gaillot 2016-04-20 20:39:07 UTC
Fixed upstream as of commit 90b675e2

In particular, the upstream changes completely address points 4, 5, 6, 8, and 10 from Comment 2, partially address points 1 and 7, and improve the help text to indicate limitations that still exist. They also make some general improvements that apply to all nodes.

QA: We only need to test "crm_report --single-node --from <DATE-TIME>", as that is similar to what sosreport will use. When run on a Pacemaker Remote node, it should not print any of the errors mentioned in the Description, it should correctly grab any cores from /var/lib/pacemaker/cores, and the sysinfo.txt file should include "Verifying installation of: pacemaker-remote".
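
The sysinfo.txt part of that check could be scripted roughly as follows. This is only a sketch: the function name sysinfo_checks_remote is hypothetical, the caller is assumed to have already extracted sysinfo.txt from the report, and the string is assumed to appear verbatim as quoted above.

```shell
#!/bin/sh
# Hypothetical QA helper: succeed only if the given sysinfo.txt contains the
# fixed string showing that the pacemaker-remote package was checked.
sysinfo_checks_remote() {
    # -q: no output, exit status only; -F: match the text literally.
    grep -q -F "Verifying installation of: pacemaker-remote" "$1"
}
```

Called as sysinfo_checks_remote path/to/sysinfo.txt, it returns 0 when the line is present and non-zero otherwise.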

Comment 6 Patrik Hagara 2016-09-12 19:47:47 UTC
> [root@virt-265 ~]# pcs status
> Cluster name: foo
> Stack: corosync
> Current DC: virt-243 (version 1.1.15-1.2c148ac.git.el7-2c148ac) - partition with quorum
> Last updated: Mon Sep 12 21:40:38 2016          Last change: Mon Sep 12 21:22:37 2016 by root via cibadmin on virt-242
> 
> 4 nodes and 1 resource configured
> 
> Online: [ virt-242 virt-243 virt-265 ]
> RemoteOnline: [ virt-266 ]
> 
> Full list of resources:
> 
>  virt-266     (ocf::pacemaker:remote):        Started virt-242
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> [root@virt-266 ~]# rpm -q pacemaker-remote
> pacemaker-remote-1.1.15-1.2c148ac.git.el7.x86_64
> [root@virt-266 ~]# date && crm_report --single-node --from "2016-09-11 00:00:00" -V
> Mon Sep 12 21:29:14 CEST 2016
> virt-266:   Debug: Detected the 'any' cluster stack
> virt-266:   Debug: We are a cluster node
> virt-266:   Collecting data from virt-266.cluster-qe.lab.eng.brq.redhat.com (09/11/16 00:00:00 to 09/12/16 21:29:14)
> virt-266:   Debug: Using full path to working directory: /root/pcmk-Mon-12-Sep-2016
> virt-266:   Debug: Machine runtime directory: /var
> virt-266:   Debug: Pacemaker runtime data located in: /var/run/crm
> virt-266:   Searching for where Pacemaker daemons live... this may take a while
> ^C
> [root@virt-266 ~]# date
> Mon Sep 12 21:44:21 CEST 2016


Searching for pacemaker daemons takes too long / never completes.

Comment 10 Ken Gaillot 2016-09-14 22:39:44 UTC
Currently, crm_report does:
- look for pengine in the standard locations
- if not found, do a global search for pengine 
- if still not found, look for pacemaker_remoted
- if still nothing found, error

We can move the pacemaker_remoted check before the global search.
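
The reordered search could look roughly like the sketch below. It is not the upstream code: the function names find_in_dirs and find_daemon_dir are hypothetical, the list of standard locations is passed in by the caller, and it assumes pacemaker_remoted is looked for in those same locations.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory among the remaining
# arguments that holds an executable file named $1.
find_in_dirs() {
    name="$1"
    shift
    for dir in "$@"; do
        if [ -x "$dir/$name" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Hypothetical sketch of the proposed order.
# $1 is the root for the fallback global search; the rest are standard locations.
find_daemon_dir() {
    search_root="$1"
    shift
    # 1. Look for pengine in the standard locations.
    find_in_dirs pengine "$@" && return 0
    # 2. Look for pacemaker_remoted before the slow global search.
    find_in_dirs pacemaker_remoted "$@" && return 0
    # 3. Fall back to a global search for pengine.
    hit=$(find "$search_root" -type f -name pengine 2>/dev/null | head -n 1)
    if [ -n "$hit" ]; then
        # Only call dirname with a non-empty operand.
        dirname "$hit"
        return 0
    fi
    # 4. Nothing found: report an error to the caller.
    return 1
}
```

With this order, a remote node that has pacemaker_remoted in one of the listed directories returns immediately and never reaches the global search that hung in Comment 6.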

Comment 11 Ken Gaillot 2016-09-15 20:57:59 UTC
Fixed upstream (using the approach in Comment 10) as of commit 700c800

Comment 13 Patrik Hagara 2016-09-26 15:25:57 UTC
> In particular, the upstream changes completely address points 4, 5, 6, 8, and 10 from Comment 2,

> 4. crm_report will only work with remote nodes that have the full cluster stack installed, because it requires that pengine and its input directory exist. Ideally, it would not require these to exist on remote nodes.

crm_report on a pacemaker_remote node no longer requires the whole cluster stack to be installed, only the packages on which the pacemaker-remote package depends

> 5. crm_report will print the "config not found" warning and dirname error on remote nodes because no CIB exists there. It will also choose wrong directories to search for blackboxes and cores as a side effect of this. Ideally, it would not require the CIB to exist, but would still find the blackbox and cores directories correctly, on remote nodes.

"config not found" warning is not printed when CIB does not exist, both blackboxes and cores are successfully collected

> 6. Without --cluster, crm_report will assume a cluster stack of heartbeat when collecting files from remote nodes, even if it correctly determines the stack type on the initiating node. Even with --cluster, crm_report will print the "cluster configuration" warning (and might take a very long time to do so) if there is no cluster stack configuration on the remote node. Ideally, it would not look for stack configuration on remote nodes.

no "cluster configuration" warning printed, pacemaker_remote detection now completes in a timely fashion

> 8. When crm_report checks for installed packages, it doesn't look for pacemaker_remote. It should.

the line `Verifying installation of: pacemaker-remote` is present in sysinfo.txt

> 10. crm_report will always add remote nodes to the STOPPED file, because they do not run crmd. It should add them to RUNNING if pacemaker_remote is running.

remote node is now listed in the RUNNING file


Marking as verified in pacemaker{,-remote}-1.1.15-11.el7

Comment 15 errata-xmlrpc 2016-11-03 18:59:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2578.html