Bug 2186460

Summary: [RHEL8] sos collector does not collect a sosreport from localhost in a Pacemaker cluster
Product: Red Hat Enterprise Linux 8
Reporter: Jesús Serrano Sánchez-Toscano <jserrano>
Component: sos
Assignee: Pavel Moravec <pmoravec>
Status: CLOSED ERRATA
QA Contact: Miroslav Hradílek <mhradile>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 8.7
CC: agk, jcastillo, jjansky, mhradile, nwahl, plambri, sbradley, theute
Target Milestone: rc
Keywords: Triaged
Target Release: ---
Flags: pm-rhel: mirror+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: sos-4.5.4-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-26 13:55:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Deadline: 2023-07-17

Description Jesús Serrano Sánchez-Toscano 2023-04-13 10:57:42 UTC
Description of problem:
When executing "sos collect" from a Pacemaker cluster node, the list of nodes to collect a sosreport from is printed correctly, but no sosreport is actually collected from the local node where the command is executed; i.e., sosreports are collected from all the other cluster nodes but not from localhost.

Version-Release number of selected component (if applicable):
sos-4.5.1-3.el8.noarch

How reproducible:
Always

Steps to Reproduce:
1. Configure a Pacemaker cluster and install the latest version of sos (sos-4.5.1-3.el8.noarch).
2. Execute "sos collect" from one of the nodes.

Actual results:
The list of Pacemaker cluster nodes is printed correctly, but the tarball generated by "sos" does not contain a sosreport from the local node (the node where "sos collect" was executed).

Expected results:
A single tarball containing the sosreports from _all_ the nodes in the cluster is generated.

Additional info:
Here is an example from my lab, running a Pacemaker cluster on freshly installed RHEL 8.7:

[root@fastvm-rhel-8-7-201 ~]# rpm -qa | grep sos
sos-4.5.1-3.el8.noarch

[root@fastvm-rhel-8-7-201 ~]# crm_node -l
1 fastvm-rhel-8-7-201 member
2 fastvm-rhel-8-7-202 member

[root@fastvm-rhel-8-7-201 ~]# sos collect --password
...
The following is a list of nodes to collect from:
	fastvm-rhel-8-7-201
	fastvm-rhel-8-7-202


Press ENTER to continue with these nodes, or press CTRL-C to quit



Connecting to nodes...

Beginning collection of sosreports from 1 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-8-7-202  : Generating sos report...
fastvm-rhel-8-7-202  : Retrieving sos report...
fastvm-rhel-8-7-202  : Successfully collected sos report

The following archive has been created. Please provide it to your support team.
	/var/tmp/sos-collector-testcase-2023-04-13-cdlmq.tar.xz

[root@fastvm-rhel-8-7-201 ~]# tar --list -f /var/tmp/sos-collector-testcase-2023-04-13-cdlmq.tar.xz
sos-collector-testcase-2023-04-13-cdlmq/
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/sos.log
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/ui.log
sos-collector-testcase-2023-04-13-cdlmq/sosreport-fastvm-rhel-8-7-202-2023-04-13-futyafa.tar.xz
sos-collector-testcase-2023-04-13-cdlmq/sos_reports/
sos-collector-testcase-2023-04-13-cdlmq/sos_reports/manifest.json

Comment 2 Pavel Moravec 2023-04-21 06:24:14 UTC
Thanks a lot for the reproducer. It allowed me to quickly prove that the root cause is a regression introduced by https://github.com/sosreport/sos/pull/3096 (the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2065821).

So I will be walking on eggshells between the technical requirements of these two BZs. :)

Comment 3 Pavel Moravec 2023-04-21 07:42:11 UTC
Here is a patch for this use case which, I hope, *also* does not break the https://bugzilla.redhat.com/show_bug.cgi?id=2065821 use case:

Collect a sosreport from the primary node if we are connected to it and either we don't forcibly remove localhost from the collection (self.cluster.strict_node_list=False) or we have already evaluated it to be in node_list:

--- a/sos/collector/__init__.py
+++ b/sos/collector/__init__.py
@@ -1179,11 +1179,15 @@ this utility or remote systems that it c
     def collect(self):
         """ For each node, start a collection thread and then tar all
         collected sosreports """
-        if self.primary.connected and not self.cluster.strict_node_list:
+        filters = set([self.primary.address, self.primary.hostname])  # or self.opts.primary, like in reduce_node_list "remove the primary node" section?
+        # add primary if:
+        # - we are connected to it and
+        #   - its hostname is in node_list, or
+        #   - we dont forcibly remove local host from collection (i.e. strict_node_list=False)
+        if self.primary.connected and (filters.intersection(set(self.node_list)) or not self.cluster.strict_node_list):
             self.client_list.append(self.primary)
 
         self.ui_log.info("\nConnecting to nodes...")
-        filters = [self.primary.address, self.primary.hostname]
         nodes = [(n, None) for n in self.node_list if n not in filters]
 
         if self.opts.password_per_node:
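
For clarity, here is a minimal standalone sketch of the selection logic the patch introduces. This is an illustration only, not the actual sos code: the function select_nodes and its parameters are invented for this example, and it assumes the Pacemaker cluster profile sets strict_node_list=True, as implied by the root cause analysis above.

def select_nodes(node_list, primary_address, primary_hostname,
                 primary_connected, strict_node_list):
    """Return (collect_from_primary, remote_nodes) following the patched logic."""
    # Nodes matching either identifier of the primary are handled locally,
    # not through a remote connection.
    filters = {primary_address, primary_hostname}
    # Add the primary if we are connected to it and either the cluster profile
    # does not enforce a strict node list, or the primary itself is enumerated
    # in node_list (the case this BZ fixes).
    collect_from_primary = primary_connected and (
        bool(filters.intersection(node_list)) or not strict_node_list
    )
    remote_nodes = [n for n in node_list if n not in filters]
    return collect_from_primary, remote_nodes

# Reproducer from this BZ (node names equal hostnames, strict node list):
print(select_nodes(["fastvm-rhel-8-7-201", "fastvm-rhel-8-7-202"],
                   "fastvm-rhel-8-7-201", "fastvm-rhel-8-7-201",
                   primary_connected=True, strict_node_list=True))
# -> (True, ['fastvm-rhel-8-7-202'])  i.e. localhost is collected again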


Reid: would you be so kind as to test this patch against your reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=2065821 ? Just create /tmp/bz2186460.patch with the patch above, and run:

cd /usr/lib/python3.6/site-packages
cat /tmp/bz2186460.patch | patch -p1

on the system from which you invoke sos collect.

Comment 4 Pavel Moravec 2023-04-21 07:47:01 UTC
Jake,
when preparing the patch for this BZ (see https://bugzilla.redhat.com/show_bug.cgi?id=2186460#c3), I spotted one possible misalignment:

in https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1045 , we remove self.primary.hostname and self.opts.primary from node_list under some circumstances. BUT we filter out slightly different nodes at https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1193 - doesn't that difference matter?

Comment 5 Jake Hunsaker 2023-05-05 16:54:46 UTC
> in https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1045 , we remove self.primary.hostname and self.opts.primary from node_list under some circumstances. BUT we filter out slightly different nodes at https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1193 - doesn't that difference matter?


I don't think there's a functional difference there...

primary.address is the value given to connect to the node, i.e. --primary.
primary.hostname is what we get from running the `hostname` command on the node.

It is entirely possible that these are the same value, but they can be different. In the first case we're directly checking --primary, and in the second we're checking a value we set early on based on --primary. So, I don't think there's a functional difference here, despite referencing two different vars.
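
Purely to illustrate that distinction, here is a simplified sketch; the Node class below is hypothetical and not the real sos collector implementation.

import subprocess

class Node:
    def __init__(self, address):
        # address: the identifier the user supplied, e.g. the --primary value
        self.address = address
        # hostname: whatever the `hostname` command reports on the node itself;
        # in sos collect this runs over the node's transport (e.g. SSH)
        self.hostname = subprocess.check_output(["hostname"]).decode().strip()

primary = Node("node2.example.com")   # hypothetical --primary value
# primary.address  -> "node2.example.com"
# primary.hostname -> the node's reported hostname, which may or may not
#                     equal primary.address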

Comment 6 Reid Wahl 2023-05-09 23:07:42 UTC
I missed this bug because it happens only when the node name matches the hostname. BZ 2065821 was for a case where the node name does not match the hostname.

-----

BEFORE:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	fastvm-rhel-9-0-42
	fastvm-rhel-9-0-43


Connecting to nodes...

Beginning collection of sosreports from 1 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-43  : Generating sos report...

-----

AFTER:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	fastvm-rhel-9-0-42
	fastvm-rhel-9-0-43


Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-42  : Generating sos report...
fastvm-rhel-9-0-43  : Generating sos report...

-----

In the AFTER case, it still works correctly for the BZ 2065821 case where the node names don't match the hostnames:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	node2             
	node3             


Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-43  : Generating sos report...
fastvm-rhel-9-0-42  : Generating sos report...

Comment 7 Pavel Moravec 2023-05-16 13:59:29 UTC
(In reply to Reid Wahl from comment #6)
> [...]

Hello,
do I understand you correctly that the patch from #c3:
- does not break https://bugzilla.redhat.com/show_bug.cgi?id=2065821
- was not tested by you against the reproducer in this BZ (which I myself successfully tested with jserrano's reproducer)?

If I got you right, I will raise a PR with the patch from #c3.

Thanks in advance for info / double-check.

Comment 8 Reid Wahl 2023-05-16 17:59:59 UTC
(In reply to Pavel Moravec from comment #7)
> Hello,
> do I understand you correctly that the patch from #c3:
> - does not break https://bugzilla.redhat.com/show_bug.cgi?id=2065821

Correct


> - was not tested by you against the reproducer in this BZ (which I myself
> successfully tested with jserrano's reproducer)?

I tested the patch from comment 3 against a similar reproducer. Before the patch, I reproduced the bad behavior. With the patch, sos collector behaved correctly in both:
* jserrano's test case (my similar reproducer)
* the original BZ 2065821 test case
 

> If I got you right, I will raise a PR with the patch from #c3.
> 
> Thanks in advance for info / double-check.

Sounds good to me :)

Comment 17 errata-xmlrpc 2023-06-26 13:55:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sos bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3801