Bug 2011378
| Field | Value |
|---|---|
| Summary | Nodes lose their hostname after a reboot. |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Version | 4.9 |
| Hardware | s390x |
| OS | Linux |
| Severity | low |
| Priority | unspecified |
| Status | CLOSED INSUFFICIENT_DATA |
| Reporter | Christian LaPolt <christian.lapolt> |
| Assignee | Patryk Diak <pdiak> |
| QA Contact | Anurag saxena <anusaxen> |
| CC | bpickard, christian.lapolt, danili, Holger.Wolf, jwi, krmoser, tdale, vpickard |
| Target Milestone | --- |
| Target Release | --- |
| Doc Type | If docs needed, set a value |
| Type | Bug |
| Regression | --- |
| Bug Blocks | 2009709 |
| Last Closed | 2022-04-20 12:50:17 UTC |
| Attachments | NetworkManager journalctl log from a failed node (attachment 1832209); NetworkManager journalctl log from a failed node, debug mode (attachment 1842444) |
Description (Christian LaPolt, 2021-10-06 14:10:42 UTC)
This is the output of a login after an upgrade attempt:

    [systemd]
    Failed Units: 1
      node-valid-hostname.service
    [core@localhost ~]$

The hostname should be master-2.

Christian, since this is only a low-severity bug, what is the mitigation option to get the node back into the cluster?

hostnamectl set-hostname --transient <FQDN>, then reboot. This works most of the time; the node rejoins the cluster and gets to a Ready state. If that doesn't hold for whatever reason, I have been setting the /etc/hostname file and rebooting, which works very reliably.

This is most likely an infrastructure issue. node-valid-hostname waits up to 5 minutes for a correct hostname, and it times out here. AFAIK the hostname on z/VM should be handled by NetworkManager; please provide the journal that includes the issue occurrence: journalctl -u NetworkManager.

Created attachment 1832209 [details]
NetworkManager journalctl log from a failed node
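
A minimal shell sketch of the mitigation described a few comments above, assuming it is run as root (or via sudo) on the affected node; the FQDN used here is a placeholder and must match the name returned by the node's reverse DNS entry.

```sh
# Transient workaround from this report: set the hostname by hand and reboot
# so the node can rejoin the cluster. The FQDN below is a placeholder.
hostnamectl set-hostname --transient master-2.example.com
systemctl reboot

# If the transient hostname does not stick, writing /etc/hostname before the
# reboot has been the more reliable variant in this report.
echo "master-2.example.com" > /etc/hostname
systemctl reboot
```

On OpenShift nodes a persistent change to /etc/hostname would typically be rolled out through a MachineConfig rather than edited in place; the direct edit above is only the ad-hoc recovery step discussed in this bug.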
As suspected, it seems that the issue is related to NetworkManager. The logs after the last reboot are missing the following entry:

    policy: set-hostname: set hostname to 'worker-0.pok-72.ocptest.pok.stglabs.ibm.com' (from address lookup)

The hostname was obtained through a reverse DNS entry, and it seems that this lookup was not successful after the last reboot. Please increase NetworkManager's log level to TRACE and provide the logs of the issue occurrence.

Created attachment 1842444 [details]
NetworkManager journalctl log from a failed node - debug mode
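
A sketch of one way to raise NetworkManager logging to TRACE, as requested above; the drop-in file name (99-trace-logging.conf) and the output file name are arbitrary choices for illustration, and the runtime nmcli variant avoids a service restart.

```sh
# Persistent: drop a [logging] section into NetworkManager's conf.d and
# restart the service so the new level applies from early boot onward.
cat > /etc/NetworkManager/conf.d/99-trace-logging.conf <<'EOF'
[logging]
level=TRACE
domains=ALL
EOF
systemctl restart NetworkManager

# Runtime alternative (does not survive a reboot):
nmcli general logging level TRACE domains ALL

# After reproducing the failed reboot, capture the journal for that boot:
journalctl -u NetworkManager -b 0 > nm-trace.log
```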
Looking at this log, it does appear that the node is attempting to get the hostname from DNS. A DNS lookup on the bastion does return the correct name:

    [root@ospamgr4 ~]# nslookup 10.20.116.95
    95.116.20.10.in-addr.arpa    name = worker-1.pok-72.ocptest.pok.stglabs.ibm.com.

So I am not sure if this is a timing issue. I will ask once again: why is /etc/hostname not set on the nodes? That would make sense to me.

Are there any updates on this bug?

If you increase the log level to TRACE, we could determine whether this is the same issue as observed here: https://bugzilla.redhat.com/show_bug.cgi?id=2014077#c39. It looks like NetworkManager sent a reverse DNS request and did not get a reply. I am not sure whether this is related to ovn-kubernetes or to cluster networking. Please provide a sosreport gathered after the issue occurrence.

I have the two log levels already attached. If something else is really needed, please send me the exact commands or configuration changes to make to get the needed information. This was not on a cluster using OVN-Kubernetes.

The logs attached are at DEBUG/default level; it would be great if you could provide the logs at TRACE level to confirm that we are seeing the same issue. From the logs provided, the cluster is using ovn-kubernetes; what do you mean by "This was not on a cluster using OVN-kubernetes."? You can follow https://docs.openshift.com/container-platform/4.7/support/gathering-cluster-data.html#support-generating-a-sosreport-archive_gathering-cluster-data to get the sosreport after the issue becomes apparent. I think it would be beneficial if I could access a cluster that experiences the issue to investigate further; do you think that would be possible?

Making Comment 12 un-private as Christian cannot see private comments.

Closing this issue as there has been no update for a long time. Please create a new issue with the requested data if it is still reproducible.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
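
For the reverse DNS check and the sosreport requested in the thread, a sketch that follows the linked OpenShift documentation; the node name is a placeholder, and the use of getent for the on-node reverse lookup is an assumption (it relies only on glibc, unlike nslookup).

```sh
# Check that reverse DNS resolves from the node itself, not only from the
# bastion (the IP is the one quoted in this report; node name is a placeholder):
oc debug node/worker-1 -- chroot /host getent hosts 10.20.116.95

# Collect a sosreport from the affected node, per the linked documentation:
oc debug node/worker-1
# ...then inside the debug shell:
chroot /host
toolbox
sosreport -k crio.all=on -k crio.logs=on
```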