Bug 2065821 - sos collector pulls duplicate sosreport if local node's hostname does not match its pacemaker node name
Summary: sos collector pulls duplicate sosreport if local node's hostname does not match its pacemaker node name
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: sos
Version: 8.5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: Pavel Moravec
QA Contact: Adriana Jurkechova
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-18 19:57 UTC by Reid Wahl
Modified: 2023-03-16 21:37 UTC
CC List: 7 users

Fixed In Version: sos-4.5.0-1.el8.noarch
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-16 21:37:06 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none
Links
System                | ID                             | Private | Priority | Status | Summary                                                            | Last Updated
Github                | sosreport/sos commit 6701a7d7  | 0       | None     | None   | None                                                               | 2022-05-17 09:46:52 UTC
Github                | sosreport/sos pull 2891        | 0       | None     | Merged | [pacemaker] Update collect cluster profile for pacemaker           | 2022-08-29 06:37:02 UTC
Github                | sosreport/sos pull 3096        | 0       | None     | open   | [collector] Prevent appending local host in strict_node_list mode  | 2022-12-28 13:45:18 UTC
Red Hat Issue Tracker | RHELPLAN-116160                | 0       | None     | None   | None                                                               | 2022-11-21 17:35:54 UTC

Description Reid Wahl 2022-03-18 19:57:05 UTC
Description of problem:

In my test cluster, node 2's hostname is "fastvm-rhel-8-0-24". Its node name (from pacemaker's point of view) is "node2". The result is that we try to get duplicate sosreports.

[root@fastvm-rhel-8-0-24 ~]# sos collect --batch
...
Cluster type set to Pacemaker High Availability Cluster Manager

The following is a list of nodes to collect from:
	fastvm-rhel-8-0-24
	node2


Connecting to nodes...

Beginning collection of sosreports from 2 nodes, collecting a maximum of 4 concurrently

fastvm-rhel-8-0-24  : Generating sosreport...
fastvm-rhel-8-0-24 : Generating sosreport...
client_loop: send disconnect: Broken pipe

(The "Broken pipe" is because the system hung and got fenced.)

-----

Version-Release number of selected component (if applicable):

sos-4.1-9.el8_5

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Configure a pacemaker cluster where a node's hostname differs from its node name.
2. Run `sos collect` from that node.

-----

Actual results:

sos collector tries to get two sosreports from the local host: one from its hostname, one from its node name.

-----

Expected results:

sos collector tries to get only one sosreport from the local host.

-----

Additional info:

I suggest that we abandon adding the local host to our node list altogether. Instead, we could get the node list from pacemaker (falling back to corosync) in every case. If there's no problem with that approach, it would actually be a lot easier than fixing the "filter out the local hostname" process. Here's why.

If pacemaker is running on the local host, then you can easily get the local pacemaker node name by running `crm_node -n`. For example:
~~~
[root@fastvm-rhel-8-0-24 ~]# crm_node -n
node2
~~~

However, if pacemaker is not running, then Bug 2065811 proposes that we fall back to corosync.conf. In that case, mapping the local host's hostname to its pacemaker node name is not nearly as straightforward. `crm_node -n` doesn't work if pacemaker isn't running. And the corosync config display commands don't work if corosync isn't running.

This is the nodelist section of corosync.conf:
~~~
nodelist {
    node {
        ring0_addr: node1
        name: node1
        nodeid: 1
    }

    node {
        ring0_addr: node2
        name: node2
        nodeid: 2
    }
}
~~~
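
As a rough sketch of that fallback (illustrative Python only, not sos's actual code): try `crm_node -n` while pacemaker is running, and otherwise collect the `name:` entries from a nodelist section like the one above.
~~~
# Illustrative sketch, not sos code: prefer `crm_node -n` (works only while
# pacemaker is running), otherwise parse node names out of corosync.conf.
import re
import subprocess

def local_pacemaker_node_name():
    try:
        return subprocess.run(["crm_node", "-n"], capture_output=True,
                              text=True, check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None  # pacemaker not running (or crm_node unavailable)

def corosync_node_names(conf_path="/etc/corosync/corosync.conf"):
    """Return the 'name:' values from each node {} block in corosync.conf."""
    with open(conf_path) as f:
        text = f.read()
    names = []
    for block in re.findall(r"node\s*\{([^}]*)\}", text):
        m = re.search(r"^\s*name:\s*(\S+)", block, re.MULTILINE)
        if m:
            names.append(m.group(1))
    return names
~~~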


We need to discover that fastvm-rhel-8-0-24 is the same node as the one whose "name" is node2. "name" is what we need, but we can't count on using it during the search: "name" is not necessarily a hostname that resolves. ringX_addr (there can be multiple addrs, with ring0 being the primary) is either a hostname that should resolve or an IP address. In the example below I've configured the cluster with names "bad_node1" and "bad_node2" (which don't resolve) and with ring0_addrs "node1" and "node2" (which do resolve).
~~~
[root@fastvm-rhel-8-0-24 ~]# pcs status | grep Online
  * Online: [ bad_node1 bad_node2 ]

[root@fastvm-rhel-8-0-24 ~]# cat /etc/corosync/corosync.conf
...
nodelist {
    node {
        ring0_addr: node1
        name: bad_node1
        nodeid: 1
    }

    node {
        ring0_addr: node2
        name: bad_node2
        nodeid: 2
    }
}
~~~


A naive idea is connecting to each ringX_addr in the list to determine which one is the local node, and then grabbing the corresponding "name"... which is slow, sounds terrible, and will fail if the address is down.

The corosync.conf man page says:
~~~
       name   This option is used mainly with knet transport to identify local node.  It's also used by client software (pacemaker).  Algorithm for identifying local node is following:

              1.     Looks up $HOSTNAME in the nodelist

              2.     If this fails strip the domain name from $HOSTNAME and looks up that in the nodelist

              3.     If this fails look in the nodelist for a fully-qualified name whose short version matches the short version of $HOSTNAME

              4.     If all this fails then search the interfaces list for an address that matches a name in the nodelist
~~~
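
To make that lookup order concrete, here is a minimal Python sketch of the four steps (illustrative only; step 4 is simplified to compare resolved addresses instead of walking the interface list, and this is neither corosync's nor sos's actual code):
~~~
# Sketch of the corosync.conf(5) local-node lookup order, given the node
# names parsed from the nodelist. Illustrative only.
import socket

def identify_local_node(node_names):
    hostname = socket.gethostname()
    short = hostname.split(".")[0]

    # 1. Look up $HOSTNAME in the nodelist.
    if hostname in node_names:
        return hostname
    # 2. Strip the domain from $HOSTNAME and look that up.
    if short in node_names:
        return short
    # 3. Find a fully-qualified nodelist entry whose short form matches ours.
    for name in node_names:
        if name.split(".")[0] == short:
            return name
    # 4. Fall back to address matching (simplified: compare the addresses the
    #    local hostname resolves to against those of each nodelist entry).
    try:
        local_addrs = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return None
    for name in node_names:
        try:
            addrs = {info[4][0] for info in socket.getaddrinfo(name, None)}
        except socket.gaierror:
            continue
        if addrs & local_addrs:
            return name
    return None
~~~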

Comment 5 Reid Wahl 2022-09-30 20:48:35 UTC
Still broken :(

fastvm-rhel-8-0-23 == node1 and fastvm-rhel-8-0-24 == node2.

BEFORE:

[root@fastvm-rhel-8-0-23 ~]# rpm -q sos
sos-4.2-19.el8_6.noarch

[root@fastvm-rhel-8-0-23 ~]# sos collect --batch -o pacemaker
...
The following is a list of nodes to collect from:
	fastvm-rhel-8-0-23
	node1             
	node2             


AFTER:

[root@fastvm-rhel-8-0-23 ~]# rpm -q sos
sos-4.4-1.el8.noarch

[root@fastvm-rhel-8-0-23 ~]# sos collect --batch -o pacemaker
...
The following is a list of nodes to collect from:
	fastvm-rhel-8-0-23
	node1             
	node2

Comment 6 Reid Wahl 2022-09-30 20:58:15 UTC
I tried adding debug statements to collector.get_nodes() and none of them are printing. Also tried writing them to a file in case of any output redirection. So I'm not sure what's appending these nodes to the node list.

Comment 7 Reid Wahl 2022-09-30 20:59:47 UTC
Disregard the previous comment. It'll use get_nodes_from_cluster.

Comment 8 Pavel Moravec 2022-12-01 07:53:25 UTC
Sorry for the late response, this fell off my table.

Could you please prepare a reproducer that will last for a week (to ensure it is still available when I look at it a few days after you ping me)? I will debug it there myself.

From a resolution-timeframe perspective: this will probably miss 8.8/9.2 (unless we find a fix soon and do a respin, which is not planned now).

Comment 12 Reid Wahl 2022-12-06 08:58:06 UTC
(In reply to Pavel Moravec from comment #8)
> Sorry for the late response, this fell off my table.
> 
> Could you please prepare a reproducer that will last for a week (to ensure
> it is still available when I look at it a few days after you ping me)? I
> will debug it there myself.
> 
> From a resolution-timeframe perspective: this will probably miss 8.8/9.2
> (unless we find a fix soon and do a respin, which is not planned now).

Hey, no problem, these BZs tend to fall off my table too. I moved from the support team back in July :)

I always have a bad time with Beaker, so I decided to take another look myself.

SoSCollector.collect() and SoSCollector.display_nodes() both need to consider self.cluster.strict_node_list. Currently they only consider self.opts.no_local. There may be other places that need self.cluster.strict_node_list, but adding it in those two places seems to suffice from my perspective as a user.

It may be better to set no_local if strict_node_list is set, and rely on that. I haven't looked into whether that would have any undesirable side effects.
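
To illustrate the shape of that check (a hypothetical stand-in, not the actual SoSCollector code; only the attribute names no_local and strict_node_list come from this comment):
~~~
# Hypothetical stand-in, not sos's implementation: decide whether the local
# host should be added to the collection list, honouring both flags.
def should_collect_local(opts, cluster):
    if opts.no_local:
        return False
    # A cluster profile that sets strict_node_list enumerates its own nodes,
    # so the local hostname must not be appended on top of that list.
    if getattr(cluster, "strict_node_list", False):
        return False
    return True
~~~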

Comment 13 Pavel Moravec 2022-12-15 10:37:10 UTC
That idea seems reasonable - I will check in detail (in the next weeks) whether there are any gotchas.

Anyway, a cluster where I could test it would be great.

Re-scheduling to RHEL 8.9, as we could hardly squeeze it into 8.8.

Comment 14 Pavel Moravec 2022-12-28 13:45:18 UTC
It seems sos collector does not properly respect the strict_node_list specified in https://github.com/sosreport/sos/blob/main/sos/collector/clusters/pacemaker.py#L23 - the collector "arbitrarily" adds the primary's hostname when it fails to spot it in the node list.

https://github.com/sosreport/sos/pull/3096 is an attempt to fix it, though I feel the PR might break some use case - review pending.
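
As a rough sketch of that direction (illustrative only, not the actual diff in PR 3096), the step that appends the primary's hostname would be guarded by the cluster profile's strict_node_list flag:
~~~
# Illustrative sketch, not the PR itself: only append the primary's hostname
# when the cluster profile does not insist on its own strict node list.
def finalize_node_list(nodes, primary_hostname, strict_node_list):
    if not strict_node_list and primary_hostname not in nodes:
        nodes.append(primary_hostname)
    return nodes
~~~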

Comment 16 Adriana Jurkechova 2023-03-02 15:38:59 UTC
I can confirm that changes from https://github.com/sosreport/sos/pull/3096 are included in the build.

Since we are under time pressure, I am switching this bugzilla to Tested,SanityOnly, but Reid, please feel free to do OtherQA later on.

Comment 17 Pavel Moravec 2023-03-16 21:37:06 UTC
Closing the bugzilla as the fix has been delivered in sos-4.5.0-1.el8, released via the https://access.redhat.com/errata/RHBA-2023:1300 erratum.

