Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2011378

Summary: Nodes lose their hostname after a reboot.
Product: OpenShift Container Platform
Component: Networking
Sub component: ovn-kubernetes
Reporter: Christian LaPolt <christian.lapolt>
Assignee: Patryk Diak <pdiak>
QA Contact: Anurag Saxena <anusaxen>
Status: CLOSED INSUFFICIENT_DATA
Severity: low
Priority: unspecified
CC: bpickard, christian.lapolt, danili, Holger.Wolf, jwi, krmoser, tdale, vpickard
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: s390x
OS: Linux
Last Closed: 2022-04-20 12:50:17 UTC
Type: Bug
Bug Blocks: 2009709    
Attachments:
NetworkManager journalctl log from a failed node
NetworkManager journalctl log from a failed node - debug mode

Description Christian LaPolt 2021-10-06 14:10:42 UTC
Description of problem:


Version-Release number of selected component (if applicable):

4.7.13 - 4.9.0-rc.5
How reproducible:

Fairly
Steps to Reproduce:
1. Install any version of OCP 4.7.13 or higher on z/VM.
2. Selectively reboot either control plane or compute nodes (this may take several reboots, as the failure is somewhat random).
3. Observe that the node never returns to a Ready state. Log in to the node and observe that the hostname has been reset to localhost.

Actual results:
hostname is reset to localhost

Expected results:
Hostname remains the actual node name

Additional info:
Setting the hostname via hostnamectl and then rebooting sometimes works. Why does hostnamectl set the hostname as transient, and why does the node not have an /etc/hostname file to make it persistent?

Comment 1 Christian LaPolt 2021-10-06 14:45:26 UTC
This is the output of a login after an upgrade attempt.

[systemd]
Failed Units: 1
  node-valid-hostname.service
[core@localhost ~]$

The hostname should be master-2.

Comment 2 Holger Wolf 2021-10-06 17:05:17 UTC
Christian, 

Since this is only a low-severity bug, what is the mitigation option to get the node back into the cluster?

Comment 3 Christian LaPolt 2021-10-06 19:56:00 UTC
Run hostnamectl set-hostname --transient <FQDN>, then reboot. This works most of the time; the node rejoins the cluster and reaches a Ready state.
If that doesn't hold for whatever reason, I have been setting the /etc/hostname file and rebooting, which works very reliably.
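
For reference, a minimal sketch of this mitigation as shell commands, assuming root access on the affected node and a placeholder FQDN:

# Set the transient hostname and reboot (works most of the time):
hostnamectl set-hostname --transient worker-1.example.com
reboot

# More reliable: persist the hostname in /etc/hostname before rebooting:
echo "worker-1.example.com" > /etc/hostname
reboot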

Comment 4 Patryk Diak 2021-10-08 09:35:03 UTC
This is most likely an infrastructure issue.

node-valid-hostname.service waits up to 5 minutes for a correct hostname; it times out here.
AFAIK the hostname on z/VM should be handled by NetworkManager. Please provide the journal that includes the issue occurrence: journalctl -u NetworkManager.
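
For reference, a minimal sketch of collecting that journal, assuming shell access to the affected node (or access via oc debug; the node name is a placeholder):

# Directly on the node, or first run: oc debug node/worker-1.example.com ; chroot /host
journalctl -u NetworkManager --no-pager > /tmp/NetworkManager-journal.log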

Comment 5 Christian LaPolt 2021-10-12 13:14:34 UTC
Created attachment 1832209 [details]
NetworkManager journalctl log from a failed node

Comment 6 Patryk Diak 2021-10-18 08:20:51 UTC
As suspected, it seems that the issue is related to NetworkManager.

The logs after the last reboot are missing the following:
policy: set-hostname: set hostname to 'worker-0.pok-72.ocptest.pok.stglabs.ibm.com' (from address lookup)
The hostname was obtained through a reverse DNS entry and it seems that this process was not successful after the last reboot.


Please increase the NetworkManager log level to TRACE and provide logs covering the issue occurrence.
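
For reference, a minimal sketch of raising the NetworkManager log level to TRACE, assuming root access on the node; the drop-in file name is arbitrary, and on RHCOS a persistent change would normally be rolled out via a MachineConfig rather than edited by hand:

# Runtime only (lost when NetworkManager restarts):
nmcli general logging level TRACE domains ALL

# Persistent across reboots: add a logging drop-in and restart NetworkManager:
cat > /etc/NetworkManager/conf.d/99-trace-logging.conf <<'EOF'
[logging]
level=TRACE
EOF
systemctl restart NetworkManager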

Comment 7 Christian LaPolt 2021-11-17 22:50:02 UTC
Created attachment 1842444 [details]
NetworkManager journalctl log from a failed node - debug mode

Comment 8 Christian LaPolt 2021-11-17 23:22:17 UTC
Looking at this log, it does appear that the node is attempting to get the hostname from DNS. A DNS lookup on the bastion does return the correct name.
[root@ospamgr4 ~]# nslookup 10.20.116.95
95.116.20.10.in-addr.arpa	name = worker-1.pok-72.ocptest.pok.stglabs.ibm.com.
So I am not sure if this is a timing issue.
I will ask once again: why is /etc/hostname not set on the nodes? That would make sense to me.

Comment 9 Christian LaPolt 2021-12-07 16:01:34 UTC
Are there any updates on this bug?

Comment 10 Patryk Diak 2021-12-08 12:23:08 UTC
If you increase the log level to TRACE, we could determine whether this is the same issue as observed here:
https://bugzilla.redhat.com/show_bug.cgi?id=2014077#c39

It looks like NetworkManager sent an rDNS request and didn't get a reply.
I am not sure if this is related to ovn-kubernetes or cluster networking.

Please provide a sosreport gathered after the issue occurrence.

Comment 11 Christian LaPolt 2021-12-30 16:19:11 UTC
I have already attached logs at the two log levels. If something else is really needed, please send me the exact commands or configuration changes required to get the needed info. This was not on a cluster using OVN-kubernetes.

Comment 12 Patryk Diak 2022-01-17 12:19:24 UTC
The attached logs are at the DEBUG/default level; it would be great if you could provide logs at TRACE level to confirm that we are seeing the same issue.
From the logs provided, the cluster is using ovn-kubernetes. What do you mean by "This was not on a cluster using OVN-kubernetes."?

You can follow https://docs.openshift.com/container-platform/4.7/support/gathering-cluster-data.html#support-generating-a-sosreport-archive_gathering-cluster-data to get the sosreport after the issue becomes apparent.
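
For reference, a condensed sketch of that documented procedure, assuming cluster access with oc and a placeholder node name (the exact sosreport plugin options may differ by version; the linked documentation is authoritative):

oc debug node/worker-1.example.com
chroot /host
toolbox
sosreport -k crio.all=on -k crio.logs=on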

I think it would be beneficial if I could access a cluster that experiences the issue so I can investigate further. Do you think that would be possible?

Comment 13 Dan Li 2022-01-26 15:41:13 UTC
Making Comment 12 un-private as Christian cannot see Private comments.

Comment 14 Patryk Diak 2022-04-20 12:50:17 UTC
Closing this issue as there has been no update for a long time.
Please create a new issue with the requested data if it is still reproducible.

Comment 15 Red Hat Bugzilla 2023-09-15 01:15:57 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days