Bug 2186460 - [RHEL8] sos collector does not collect a sosreport from localhost in a Pacemaker cluster
Summary: [RHEL8] sos collector does not collect a sosreport from localhost in a Pacemaker cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Deadline: 2023-07-17
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: sos
Version: 8.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Pavel Moravec
QA Contact: Miroslav Hradílek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-13 10:57 UTC by Jesús Serrano Sánchez-Toscano
Modified: 2023-06-29 11:26 UTC
CC List: 8 users

Fixed In Version: sos-4.5.4-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-26 13:55:53 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments:


Links
GitHub sosreport/sos pull 3240 (open): [collector] collect report from primary node if in node_list (last updated 2023-05-16 20:06:53 UTC)
Red Hat Issue Tracker RHELPLAN-154613 (last updated 2023-04-13 16:37:56 UTC)

Description Jesús Serrano Sánchez-Toscano 2023-04-13 10:57:42 UTC
Description of problem:
When executing "sos collect" from a Pacemaker cluster node, the list of the nodes to collect a sosreport from is correct but, in practice, it doesn't collect a sosreport from the local node where the command is executed, i.e., it collects a sosreport from all the other cluster nodes except from localhost.

Version-Release number of selected component (if applicable):
sos-4.5.1-3.el8.noarch

How reproducible:
Always

Steps to Reproduce:
1. Configure a Pacemaker cluster and install the latest version of sos (sos-4.5.1-3.el8.noarch).
2. Execute "sos collect" from one of the nodes.

Actual results:
The list of nodes from the Pacemaker cluster is printed correctly, but the tarball generated by "sos" does not contain a sosreport from the local node (the one "sos collect" was executed on).

Expected results:
A single tarball containing the sosreports from _all_ the nodes in the cluster is generated.

Additional info:
Here is an example from my lab, running a Pacemaker cluster on freshly installed RHEL 8.7:

[root@fastvm-rhel-8-7-201 ~]# rpm -qa | grep sos
sos-4.5.1-3.el8.noarch

[root@fastvm-rhel-8-7-201 ~]# crm_node -l
1 fastvm-rhel-8-7-201 member
2 fastvm-rhel-8-7-202 member

[root@fastvm-rhel-8-7-201 ~]# sos collect --password
...
The following is a list of nodes to collect from:
	fastvm-rhel-8-7-201
	fastvm-rhel-8-7-202


Press ENTER to continue with these nodes, or press CTRL-C to quit



Connecting to nodes...

Beginning collection of sosreports from 1 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-8-7-202  : Generating sos report...
fastvm-rhel-8-7-202  : Retrieving sos report...
fastvm-rhel-8-7-202  : Successfully collected sos report

The following archive has been created. Please provide it to your support team.
	/var/tmp/sos-collector-testcase-2023-04-13-cdlmq.tar.xz

[root@fastvm-rhel-8-7-201 ~]# tar --list -f /var/tmp/sos-collector-testcase-2023-04-13-cdlmq.tar.xz
sos-collector-testcase-2023-04-13-cdlmq/
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/sos.log
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/ui.log
sos-collector-testcase-2023-04-13-cdlmq/sosreport-fastvm-rhel-8-7-202-2023-04-13-futyafa.tar.xz
sos-collector-testcase-2023-04-13-cdlmq/sos_reports/
sos-collector-testcase-2023-04-13-cdlmq/sos_reports/manifest.json

Comment 2 Pavel Moravec 2023-04-21 06:24:14 UTC
Thanks a lot for the reproducer. It allowed me to quickly prove that the root cause is a regression introduced by https://github.com/sosreport/sos/pull/3096 , the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2065821 .

So I will have to tread carefully between the technical requirements of these two BZs.. :)

Comment 3 Pavel Moravec 2023-04-21 07:42:11 UTC
This should be a patch for this use case, and I hope it *also* does not break the https://bugzilla.redhat.com/show_bug.cgi?id=2065821 use case:

Collect a sosreport from the primary node if we are connected to it and either we don't forcibly remove localhost from the collection (self.cluster.strict_node_list=False) or we have already evaluated it to be in node_list:

--- a/sos/collector/__init__.py
+++ b/sos/collector/__init__.py
@@ -1179,11 +1179,15 @@ this utility or remote systems that it c
     def collect(self):
         """ For each node, start a collection thread and then tar all
         collected sosreports """
-        if self.primary.connected and not self.cluster.strict_node_list:
+        filters = set([self.primary.address, self.primary.hostname])  # or self.opts.primary, like in reduce_node_list "remove the primary node" section?
+        # add primary if:
+        # - we are connected to it and
+        #   - its hostname is in node_list, or
+        #   - we dont forcibly remove local host from collection (i.e. strict_node_list=False)
+        if self.primary.connected and (filters.intersection(set(self.node_list)) or not self.cluster.strict_node_list):
             self.client_list.append(self.primary)
 
         self.ui_log.info("\nConnecting to nodes...")
-        filters = [self.primary.address, self.primary.hostname]
         nodes = [(n, None) for n in self.node_list if n not in filters]
 
         if self.opts.password_per_node:
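
For readability, the condition the patch introduces boils down to roughly the following standalone sketch (illustrative only, not the actual sos code; should_collect_from_primary is a made-up helper, and the values mirror the reproducers in this BZ and in BZ 2065821):

def should_collect_from_primary(connected, address, hostname, node_list, strict_node_list):
    # Append the primary node to the collection when we are connected to it and
    # either its address/hostname already appears in node_list, or the cluster
    # profile does not enforce a strict node list.
    filters = {address, hostname}
    return connected and (bool(filters & set(node_list)) or not strict_node_list)

# This BZ: Pacemaker node names equal the hostnames, strict node list enforced
print(should_collect_from_primary(
    True, "fastvm-rhel-8-7-201", "fastvm-rhel-8-7-201",
    ["fastvm-rhel-8-7-201", "fastvm-rhel-8-7-202"], True))   # True -> localhost is now collected

# BZ 2065821: node names ("node2", "node3") differ from the hostnames; the primary
# is not appended here, but it is still reached through its node name in node_list.
print(should_collect_from_primary(
    True, "fastvm-rhel-9-0-42", "fastvm-rhel-9-0-42",
    ["node2", "node3"], True))                               # False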


Reid: would you be so kind as to test this patch against your reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=2065821 ? Just create /tmp/bz2186460.patch with the above patch and run:

cd /usr/lib/python3.6/site-packages
cat /tmp/bz2186460.patch | patch -p1

on the system from which you invoke "sos collect".

Comment 4 Pavel Moravec 2023-04-21 07:47:01 UTC
Jake,
when preparing the patch for this BZ (see https://bugzilla.redhat.com/show_bug.cgi?id=2186460#c3), I spotted one possible misalignment:

in https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1045 we reduce self.primary.hostname and self.opts.primary from node_list under some circumstances, BUT we filter out slightly different values at https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1193 - doesn't the difference matter?

Comment 5 Jake Hunsaker 2023-05-05 16:54:46 UTC
> in https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1045 we reduce self.primary.hostname and self.opts.primary from node_list under some circumstances, BUT we filter out slightly different values at https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1193 - doesn't the difference matter?


I don't think there's a functional difference there...

primary.address is the value given to connect to the node, i.e. --primary.
primary.hostname is what we get from running the `hostname` command on the node.

It is entirely possible that these are the same value, but they can be different. In the first case we're directly checking --primary, and in the second we're checking a value we set early on based on --primary. So, I don't think there's a functional difference here, despite referencing two different vars.
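
To put it concretely, here is a minimal sketch (values taken from jserrano's reproducer above, otherwise hypothetical) of why a filter built from both values works whichever form ends up in node_list:

primary_address = "fastvm-rhel-8-7-201"   # the value given to connect to the node, i.e. --primary
primary_hostname = "fastvm-rhel-8-7-201"  # what `hostname` returned on that node

node_list = ["fastvm-rhel-8-7-201", "fastvm-rhel-8-7-202"]

# Whichever of the two values happens to match the entry in node_list,
# the primary is excluded from the remote-connection list exactly once.
filters = {primary_address, primary_hostname}
remote_nodes = [(n, None) for n in node_list if n not in filters]
print(remote_nodes)  # [('fastvm-rhel-8-7-202', None)]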

Comment 6 Reid Wahl 2023-05-09 23:07:42 UTC
I missed this bug because it happens only when the node name matches the hostname. BZ 2065821 was for a case where the node name does not match the hostname.
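
A quick sketch (simplified, not the sos code itself) of why the pre-patch behaviour only bites when the node name equals the hostname:

# Pre-patch logic, simplified: with strict_node_list=True the primary is never
# appended to client_list, and it is also filtered out of the remote node list
# whenever its hostname matches a node_list entry, so it gets skipped entirely.
primary_hostname = "fastvm-rhel-9-0-42"
strict_node_list = True

for node_list in (["fastvm-rhel-9-0-42", "fastvm-rhel-9-0-43"],   # names == hostnames (this BZ)
                  ["node2", "node3"]):                            # names != hostnames (BZ 2065821)
    collected_locally = not strict_node_list                      # old condition: always False here
    remote_nodes = [n for n in node_list if n != primary_hostname]
    print(collected_locally, remote_nodes)
# Output:
#   False ['fastvm-rhel-9-0-43']   -> the local node is lost (this bug)
#   False ['node2', 'node3']       -> the local node is still reached via "node2"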

-----

BEFORE:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	fastvm-rhel-9-0-42
	fastvm-rhel-9-0-43


Connecting to nodes...

Beginning collection of sosreports from 1 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-43  : Generating sos report...

-----

AFTER:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	fastvm-rhel-9-0-42
	fastvm-rhel-9-0-43


Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-42  : Generating sos report...
fastvm-rhel-9-0-43  : Generating sos report...

-----

In the AFTER case, it still works correctly for the BZ 2065821 case where the node names don't match the hostnames:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	node2             
	node3             


Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-43  : Generating sos report...
fastvm-rhel-9-0-42  : Generating sos report...

Comment 7 Pavel Moravec 2023-05-16 13:59:29 UTC
(In reply to Reid Wahl from comment #6)
> I missed this bug because it happens only when the node name matches the
> hostname. BZ 2065821 was for a case where the node name does not match the
> hostname.
> 
> -----
> 
> BEFORE:
> 
> [root@fastvm-rhel-9-0-42 ~]# sos collect --batch
> ...
> The following is a list of nodes to collect from:
> 	fastvm-rhel-9-0-42
> 	fastvm-rhel-9-0-43
> 
> 
> Connecting to nodes...
> 
> Beginning collection of sosreports from 1 nodes, collecting a maximum of 4
> concurrently
> 
> fastvm-rhel-9-0-43  : Generating sos report...
> 
> -----
> 
> AFTER:
> 
> [root@fastvm-rhel-9-0-42 ~]# sos collect --batch
> ...
> The following is a list of nodes to collect from:
> 	fastvm-rhel-9-0-42
> 	fastvm-rhel-9-0-43
> 
> 
> Connecting to nodes...
> 
> Beginning collection of sosreports from 2 nodes, collecting a maximum of 4
> concurrently
> 
> fastvm-rhel-9-0-42  : Generating sos report...
> fastvm-rhel-9-0-43  : Generating sos report...
> 
> -----
> 
> In the AFTER case, it still works correctly for the BZ 2065821 case where
> the node names don't match the hostnames:
> 
> [root@fastvm-rhel-9-0-42 ~]# sos collect --batch
> ...
> The following is a list of nodes to collect from:
> 	node2             
> 	node3             
> 
> 
> Connecting to nodes...
> 
> Beginning collection of sosreports from 2 nodes, collecting a maximum of 4
> concurrently
> 
> fastvm-rhel-9-0-43  : Generating sos report...
> fastvm-rhel-9-0-42  : Generating sos report...

Hello,
do I understand you correctly that the patch from #c3:
- does not break https://bugzilla.redhat.com/show_bug.cgi?id=2065821
- was not tested by you against the reproducer in this BZ (which I successfully tested against jserrano's reproducer)?

If I got you right, I will raise a PR with the patch from #c3.

Thanks in advance for the info / double-check.

Comment 8 Reid Wahl 2023-05-16 17:59:59 UTC
(In reply to Pavel Moravec from comment #7)
> Hello,
> do I understand you correctly that the patch from #c3 :
> - does not break https://bugzilla.redhat.com/show_bug.cgi?id=2065821

Correct


> - was not tested by you against the reproducer in this BZ (which I
> successfully tested against jserrano's reproducer)?

I tested the patch from comment 3 against a similar reproducer. Before the patch, I reproduced the bad behavior. With the patch, sos collector behaved correctly in both:
* jserrano's test case (my similar reproducer)
* the original BZ 2065821 test case
 

> If I got you right, I will raise a PR with the patch from #c3.
> 
> Thanks in advance for the info / double-check.

Sounds good to me :)

Comment 17 errata-xmlrpc 2023-06-26 13:55:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (sos bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3801

