Bug 2186460
| Summary: | [RHEL8] sos collector does not collect a sosreport from localhost in a Pacemaker cluster | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jesús Serrano Sánchez-Toscano <jserrano> |
| Component: | sos | Assignee: | Pavel Moravec <pmoravec> |
| Status: | CLOSED ERRATA | QA Contact: | Miroslav Hradílek <mhradile> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.7 | CC: | agk, jcastillo, jjansky, mhradile, nwahl, plambri, sbradley, theute |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | sos-4.5.4-1.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-06-26 13:55:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Deadline: | 2023-07-17 | | |
Description
Jesús Serrano Sánchez-Toscano
2023-04-13 10:57:42 UTC
Pavel Moravec (comment 3)

Thanks a lot for the reproducer. It allowed me to quickly prove that the root cause is a regression introduced by https://github.com/sosreport/sos/pull/3096, the fix for https://bugzilla.redhat.com/show_bug.cgi?id=2065821. So I will have to tread carefully between the technical requirements of these two BZs. :)

The patch below should cover this use case and, I hope, *also* not break the https://bugzilla.redhat.com/show_bug.cgi?id=2065821 use case: collect a sosreport from the primary node if we are connected to it, and either we do not forcibly remove localhost from the collection (self.cluster.strict_node_list=False) or we have already evaluated it to be in node_list:

```diff
--- a/sos/collector/__init__.py
+++ b/sos/collector/__init__.py
@@ -1179,11 +1179,15 @@ this utility or remote systems that it c
     def collect(self):
         """ For each node, start a collection thread and then tar all
         collected sosreports """
-        if self.primary.connected and not self.cluster.strict_node_list:
+        filters = set([self.primary.address, self.primary.hostname])  # or self.opts.primary, like in reduce_node_list "remove the primary node" section?
+        # add primary if:
+        # - we are connected to it and
+        # - its hostname is in node_list, or
+        # - we dont forcibly remove local host from collection (i.e. strict_node_list=False)
+        if self.primary.connected and (filters.intersection(set(self.node_list)) or not self.cluster.strict_node_list):
             self.client_list.append(self.primary)
 
         self.ui_log.info("\nConnecting to nodes...")
-        filters = [self.primary.address, self.primary.hostname]
         nodes = [(n, None) for n in self.node_list if n not in filters]
 
         if self.opts.password_per_node:
```

Reid: would you be so kind as to test this patch against your reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=2065821? Just create /tmp/bz2186460.patch with the patch above and run the following on the system where you invoke sos collect:

```
cd /usr/lib/python3.6/site-packages
cat /tmp/bz2186460.patch | patch -p1
```
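To make the new condition concrete, here is a minimal, runnable sketch of the decision the patched collect() makes. The standalone function is hypothetical (the real logic lives inline in collect()); the attribute names mirror the patch above, and the node names come from the reproducer outputs later in this BZ.

```python
# Hypothetical standalone version of the patched condition, for illustration.
def should_collect_primary(connected, node_list, address, hostname,
                           strict_node_list):
    """Mirror of the patched check in collect()."""
    filters = {address, hostname}
    return connected and (bool(filters & set(node_list))
                          or not strict_node_list)

# This BZ: the Pacemaker node names equal the hostnames, and the cluster
# profile enforces a strict node list. Before the patch the primary was
# skipped; the intersection check now includes it again.
assert should_collect_primary(
    connected=True,
    node_list=['fastvm-rhel-9-0-42', 'fastvm-rhel-9-0-43'],
    address='fastvm-rhel-9-0-42',
    hostname='fastvm-rhel-9-0-42',
    strict_node_list=True)

# BZ 2065821: node names ('node2', 'node3') differ from the hostnames, so
# the primary object is still not appended separately -- the local node is
# reached through its cluster node name instead, avoiding a duplicate report.
assert not should_collect_primary(
    connected=True,
    node_list=['node2', 'node3'],
    address='fastvm-rhel-9-0-42',
    hostname='fastvm-rhel-9-0-42',
    strict_node_list=True)
```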
Pavel Moravec

Jake, when preparing the patch for this BZ (see https://bugzilla.redhat.com/show_bug.cgi?id=2186460#c3), I spotted one possible misalignment: in https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1045 we remove self.primary.hostname and self.opts.primary from node_list under some circumstances, BUT we filter out slightly different nodes at https://github.com/sosreport/sos/blob/4.5.2/sos/collector/__init__.py#L1193 - does the difference matter?

Jake

I don't think there's a functional difference there.

primary.address is the value given to connect to the node, i.e. --primary.
primary.hostname is what we get from running the `hostname` command on the node.

It is entirely possible that these are the same value, but they can differ. In the first case we are directly checking --primary, and in the second we are checking a value we set early on based on --primary. So I don't think there's a functional difference here, despite the two call sites referencing different variables.
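In other words (a minimal sketch with hypothetical values, just to spell out Jake's point):

```python
# Hypothetical values, for illustration only.
opts_primary = 'fastvm-rhel-9-0-42'      # what the user passed via --primary
primary_address = opts_primary           # set early on from --primary
primary_hostname = 'node2'               # `hostname` as reported by the node

# reduce_node_list() (around L1045) filters on {hostname, --primary} ...
filters_a = {primary_hostname, opts_primary}
# ... while collect() (around L1193) filters on {address, hostname}:
filters_b = {primary_address, primary_hostname}

# Because address is initialized from --primary, the two sets coincide,
# whether or not the hostname matches the --primary value.
assert filters_a == filters_b
```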
Reid Wahl (comment 6)

I missed this bug because it happens only when the node name matches the hostname. BZ 2065821 was for a case where the node name does not match the hostname.

BEFORE:

```
[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
    fastvm-rhel-9-0-42
    fastvm-rhel-9-0-43

Connecting to nodes...

Beginning collection of sosreports from 1 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-43 : Generating sos report...
```

AFTER:

```
[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
    fastvm-rhel-9-0-42
    fastvm-rhel-9-0-43

Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-42 : Generating sos report...
fastvm-rhel-9-0-43 : Generating sos report...
```

In the AFTER case, it still works correctly for the BZ 2065821 case where the node names don't match the hostnames:

```
[root@fastvm-rhel-9-0-42 ~]# sos collect --batch
...
The following is a list of nodes to collect from:
    node2
    node3

Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-9-0-43 : Generating sos report...
fastvm-rhel-9-0-42 : Generating sos report...
```
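A quick, hedged sketch of how one might check which case a given cluster falls into (the helper is hypothetical; it assumes a Pacemaker stack where `crm_node -l` lists the cluster node names):

```python
# Hypothetical check: compare the Pacemaker node names with the local
# hostname to see whether this BZ's trigger condition applies.
import socket
import subprocess

local = socket.gethostname()
# `crm_node -l` prints one node per line, e.g. "1 fastvm-rhel-9-0-42 member";
# the second field is the node name.
out = subprocess.run(['crm_node', '-l'], capture_output=True, text=True).stdout
names = [line.split()[1] for line in out.splitlines() if line.split()]
# Name matches hostname -> this BZ's case; names differ -> BZ 2065821's case.
print('node name matches hostname' if local in names else 'names differ')
```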
Pavel Moravec (comment 7)

(In reply to Reid Wahl from comment #6)

Hello, do I understand you correctly that the patch from #c3:
- does not break https://bugzilla.redhat.com/show_bug.cgi?id=2065821
- was not tested by you against the reproducer in this BZ (whereas I successfully tested it on jserrano's reproducer)?

If I got you right, I will raise a PR with the patch from #c3. Thanks in advance for the info / double-check.

Reid Wahl

(In reply to Pavel Moravec from comment #7)
> - does not break https://bugzilla.redhat.com/show_bug.cgi?id=2065821

Correct.

> - was not tested by you against the reproducer in this BZ

I tested the patch from comment 3 against a similar reproducer. Before the patch, I reproduced the bad behavior. With the patch, sos collector behaved correctly in both:
* jserrano's test case (my similar reproducer)
* the original BZ 2065821 test case

> If I got you right, I will raise a PR with the patch from #c3.

Sounds good to me :)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (sos bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:3801