Description of problem:

When executing "sos collect" from a Pacemaker cluster node, the list of the nodes to collect a sosreport from is correct but, in practice, it doesn't collect a sosreport from the local node where the command is executed, i.e., it collects a sosreport from all the other cluster nodes except localhost.

Version-Release number of selected component (if applicable):

sos-4.5.1-3.el8.noarch

How reproducible:

Always

Steps to Reproduce:
1. Configure a Pacemaker cluster and install the latest version of sos (sos-4.5.1-3.el8.noarch).
2. Execute "sos collect" from one of the nodes.

Actual results:

The list of nodes from the Pacemaker cluster is correctly printed, but the tarball generated by "sos" does not contain a sosreport from the local node (where "sos collect" was executed from).

Expected results:

A single tarball containing the sosreports from _all_ the nodes in the cluster is generated.

Additional info:

Here is an example from my lab, running a Pacemaker cluster on freshly installed RHEL 8.7:

[root@fastvm-rhel-8-7-201 ~]# rpm -qa | grep sos
sos-4.5.1-3.el8.noarch

[root@fastvm-rhel-8-7-201 ~]# crm_node -l
1 fastvm-rhel-8-7-201 member
2 fastvm-rhel-8-7-202 member

[root@fastvm-rhel-8-7-201 ~]# sos collect --password
...
The following is a list of nodes to collect from:
	fastvm-rhel-8-7-201
	fastvm-rhel-8-7-202

Press ENTER to continue with these nodes, or press CTRL-C to quit

Connecting to nodes...

Beginning collection of sosreports from 1 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-8-7-202 : Generating sos report...
fastvm-rhel-8-7-202 : Retrieving sos report...
fastvm-rhel-8-7-202 : Successfully collected sos report

The following archive has been created. Please provide it to your support team.

	/var/tmp/sos-collector-testcase-2023-04-13-cdlmq.tar.xz

[root@fastvm-rhel-8-7-201 ~]# tar --list -f /var/tmp/sos-collector-testcase-2023-04-13-cdlmq.tar.xz
sos-collector-testcase-2023-04-13-cdlmq/
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/sos.log
sos-collector-testcase-2023-04-13-cdlmq/sos_logs/ui.log
sos-collector-testcase-2023-04-13-cdlmq/sosreport-fastvm-rhel-8-7-202-2023-04-13-futyafa.tar.xz
sos-collector-testcase-2023-04-13-cdlmq/sos_reports/
sos-collector-testcase-2023-04-13-cdlmq/sos_reports/manifest.json
Thanks a lot for the reproducer. It allowed me to quickly prove that the root cause is a regression introduced by https://github.com/sosreport/sos/pull/3096 (the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2065821). So I will have to tread carefully between the technical requirements of these two BZs.. :)
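To make the mechanism concrete, here is a minimal standalone sketch of how I read the 4.5.1 collect() behaviour (simplified stand-in names, not the verbatim sos code; as I understand it, the pacemaker profile sets strict_node_list=True since that PR): the primary is never appended to client_list, and because the cluster node name equals the local hostname it is also filtered out of the remaining node list, so no local sosreport is taken.

# Simplified illustration with assumed names; not actual sos code.
client_list = []
node_list = ["fastvm-rhel-8-7-201", "fastvm-rhel-8-7-202"]  # pacemaker node names
primary_hostname = "fastvm-rhel-8-7-201"                    # local node's hostname
strict_node_list = True                                     # pacemaker profile since PR #3096

# 1) the primary is only appended when strict_node_list is False -> skipped here
if not strict_node_list:
    client_list.append(primary_hostname)

# 2) the remaining nodes are filtered against the primary's address/hostname,
#    so a node name identical to the local hostname is dropped here as well
nodes = [n for n in node_list if n != primary_hostname]

print(client_list, nodes)  # [] ['fastvm-rhel-8-7-202']  -> no local sosreport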
This should be a patch for this use case that I hope *also* does not break the https://bugzilla.redhat.com/show_bug.cgi?id=2065821 use case: collect a sosreport from the primary node if we are connected to it and either we don't forcibly remove localhost from collection (self.cluster.strict_node_list=False), or we have already evaluated it to be in node_list:

--- a/sos/collector/__init__.py
+++ b/sos/collector/__init__.py
@@ -1179,11 +1179,15 @@ this utility or remote systems that it c
     def collect(self):
         """ For each node, start a collection thread and then tar all
         collected sosreports """
-        if self.primary.connected and not self.cluster.strict_node_list:
+        filters = set([self.primary.address, self.primary.hostname])  # or self.opts.primary, like in the reduce_node_list "remove the primary node" section?
+        # add the primary if:
+        # - we are connected to it, and
+        # - its hostname is in node_list, or
+        # - we don't forcibly remove localhost from collection (i.e. strict_node_list=False)
+        if self.primary.connected and (filters.intersection(set(self.node_list)) or not self.cluster.strict_node_list):
             self.client_list.append(self.primary)
 
         self.ui_log.info("\nConnecting to nodes...")
-        filters = [self.primary.address, self.primary.hostname]
         nodes = [(n, None) for n in self.node_list if n not in filters]
 
         if self.opts.password_per_node:

Reid: would you be so kind as to test this patch against your reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=2065821 ? Just create /tmp/bz2186460.patch with the patch above, then run:

cd /usr/lib/python3.6/site-packages
cat /tmp/bz2186460.patch | patch -p1

on the system where you invoke "sos collect" from.
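For clarity, a quick standalone illustration of how the patched condition evaluates in the two scenarios; the helper function and sample hostnames below are made up for this example and only reproduce the boolean logic from the patch:

def keep_primary(connected, filters, node_list, strict_node_list):
    # the patched check: keep the primary if we are connected to it and either
    # its address/hostname is in node_list or strict_node_list is False
    return bool(connected and (set(filters).intersection(node_list)
                               or not strict_node_list))

# this BZ: cluster node names equal the hostnames, strict_node_list=True
print(keep_primary(True, ["fastvm-rhel-8-7-201"],
                   ["fastvm-rhel-8-7-201", "fastvm-rhel-8-7-202"], True))  # True

# BZ 2065821: node names differ from the hostnames; the primary stays out of
# client_list and is still collected via its cluster node name instead
print(keep_primary(True, ["fastvm-rhel-9-0-42"], ["node2", "node3"], True))  # False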
Jake, when preparing the patch for this BZ (see https://bugzilla.redhat.com/show_bug.cgi?id=2186460#c3), I spotted one possible misalignment: in https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1045 we remove self.primary.hostname and self.opts.primary from node_list under some circumstances, BUT we filter out slightly different values at https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1193 - doesn't that difference matter?
> in https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1045
> we remove self.primary.hostname and self.opts.primary from node_list under
> some circumstances, BUT we filter out slightly different values at
> https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1193 -
> doesn't that difference matter?

I don't think there's a functional difference there... primary.address is the value given to connect to the node, i.e. --primary. primary.hostname is what we get from running the `hostname` command on the node. It is entirely possible that these are the same value, but they can be different. In the first case we're directly checking --primary, and in the second we're checking a value we set early on based on --primary. So, I don't think there's a functional difference here, despite referencing two different vars.
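To illustrate that with made-up values (purely hypothetical, not taken from sos):

# Hypothetical values for the two variables described above; not sos code.
opts_primary = "192.168.122.201"           # what the user passed via --primary
primary_address = opts_primary             # the value used to connect to the node
primary_hostname = "fastvm-rhel-9-0-42"    # output of `hostname` on that node

# Filtering on (address, hostname) vs (hostname, --primary) can only diverge
# when the two spellings differ; with pacemaker node names like the ones below
# neither matches anyway, which is the BZ 2065821 situation.
node_list = ["node2", "node3"]
print(primary_hostname in node_list, opts_primary in node_list)  # False False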
I missed this bug because it happens only when the node name matches the hostname. BZ 2065821 was for a case where the node name does not match the hostname.

-----

BEFORE:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	fastvm-rhel-9-0-42
	fastvm-rhel-9-0-43

Connecting to nodes...

Beginning collection of sosreports from 1 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-43 : Generating sos report...

-----

AFTER:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	fastvm-rhel-9-0-42
	fastvm-rhel-9-0-43

Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-42 : Generating sos report...
fastvm-rhel-9-0-43 : Generating sos report...

-----

In the AFTER case, it still works correctly for the BZ 2065821 case where the node names don't match the hostnames:

[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
	node2
	node3

Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-43 : Generating sos report...
fastvm-rhel-9-0-42 : Generating sos report...
(In reply to Reid Wahl from comment #6)
> I missed this bug because it happens only when the node name matches the
> hostname. BZ 2065821 was for a case where the node name does not match the
> hostname.

Hello,
do I understand you correctly that the patch from #c3:
- does not break https://bugzilla.redhat.com/show_bug.cgi?id=2065821
- was not tested by you against the reproducer in this BZ (which is where I successfully tested it, on jserrano's reproducer)?

If I got you right, I will raise a PR with the patch from #c3.

Thanks in advance for the info / double-check.
(In reply to Pavel Moravec from comment #7)
> Hello,
> do I understand you correctly that the patch from #c3:
> - does not break https://bugzilla.redhat.com/show_bug.cgi?id=2065821

Correct.

> - was not tested by you against the reproducer in this BZ (which is where I
> successfully tested it, on jserrano's reproducer)?

I tested the patch from comment 3 against a similar reproducer. Before the patch, I reproduced the bad behavior. With the patch, sos collector behaved correctly in both:
* jserrano's test case (my similar reproducer)
* the original BZ 2065821 test case

> If I got you right, I will raise a PR with the patch from #c3.
>
> Thanks in advance for the info / double-check.

Sounds good to me :)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sos bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3801