Bug 1809345
| Summary: | OpenShift Cluster fails to initialize on 4.3.z install due to a node with a hostname of localhost | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michael McNeill <mmcneill> |
| Component: | RHCOS | Assignee: | Ben Howard <behoward> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.3.z | CC: | bbreard, behoward, dornelas, dustymabe, imcleod, jerzhang, jligon, miabbott, mnguyen, mwoodson, nstielau, razevedo, rludva, sdodson, smilner, vjaypurk, wking, yanyang |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.5.0 | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1823882 1823883 (view as bug list) | Environment: | |
| Last Closed: | 2020-07-13 17:17:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1186913, 1823882, 1837122, 1837642 | | |
| Attachments: | | | |
Description Michael McNeill 2020-03-02 21:56:49 UTC
It appears that this is also impacting 4.2. I had to downgrade my installer to 4.2.9 to get it to properly assign a hostname to these nodes.

Correction: it appears that the hostname is initially set correctly upon boot, and is then changed to localhost when the etcd initialization process begins.

After first boot of the node (10.0.0.2):

[core@cluste-m464q-b ~]$ ssh core@10.0.0.2
The authenticity of host '10.0.0.2 (10.0.0.2)' can't be established.
ECDSA key fingerprint is SHA256:VO5vg00iV09pfQY2jcjqa4CstW+BzHeIl7EYdGb/iAk.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.0.2' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191002.0
WARNING: Direct SSH access to machines is not recommended.
---
[core@cluste-m464q-m-0 ~]$ exit
logout
Connection to 10.0.0.2 closed.

After we see bootkube say "Waiting for etcd cluster...":

[core@cluste-m464q-b ~]$ ssh core@10.0.0.2
Red Hat Enterprise Linux CoreOS 42.81.20191126.0
WARNING: Direct SSH access to machines is not recommended.
---
Last login: Tue Mar 3 01:58:33 2020 from 10.0.0.5
[core@localhost ~]$ exit
logout
Connection to 10.0.0.2 closed.

There are two other BZs reporting similar issues with the hostname reverting to localhost:
https://bugzilla.redhat.com/show_bug.cgi?id=1803962
https://bugzilla.redhat.com/show_bug.cgi?id=1804426

The recommended workaround for now is to write out the desired hostname in /etc/hostname as part of the Ignition config (a minimal example snippet appears below).

Neither of those issues looks to be related to this beyond the localhost connection. In both of those cases, static IPs are being used and hostname -f responds correctly, which is not happening in this case. In addition, both of those issues concern UPI deployments, while this is an IPI deployment. And finally, NetworkManager-wait-online.service is not failing, as it is in both of those other bugs.

It appears that creating a new project in GCP has made this issue go away. I am unsure whether this is a wider issue or something specific to that project's configuration. I have kept the project up for debugging if someone would like to work with me to debug the issue directly. Otherwise, this BZ can be resolved.

With the last comment, coupled with the evidence on the openshift-sme thread about other folks having successful installs, I'm going to chalk this up as a misconfigured environment. Closing as WORKSFORME. Please re-open if it occurs again.

We have been testing the GCP installer in OSD, in two different ways:
1. Downloading the installer directly. I have tried installer versions 4.3.5, 4.3.8, and 4.3.0, and hit this issue with each of them.
2. Hive is having issues installing this as well.
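For reference, a minimal Ignition sketch of the /etc/hostname workaround mentioned above. The hostname value here is a placeholder; use the node's intended name, and match the Ignition spec version your installer consumes:

{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/hostname",
        "mode": 420,
        "contents": { "source": "data:,master-0.example.internal" }
      }
    ]
  }
}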
From my logs:

----------------------------------------------------------------
control-plane/10.0.0.4/journals/kubelet.log

Mar 23 16:11:54 localhost hyperkube[9453]: I0323 16:11:54.980543 9453 kubelet.go:1846] skipping pod synchronization - container runtime status check may not have completed yet
Mar 23 16:11:54 localhost hyperkube[9453]: I0323 16:11:54.981792 9453 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 23 16:11:55 localhost hyperkube[9453]: E0323 16:11:55.080143 9453 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:55 localhost hyperkube[9453]: E0323 16:11:55.087134 9453 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: Get https://api-int.mw.privtest.mwoodson.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/localhost?timeout=10s: dial tcp 10.0.0.2:6443: connect: connection refused
Mar 23 16:11:55 localhost hyperkube[9453]: F0323 16:11:55.128852 9453 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Failed with result 'exit-code'.
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Consumed 399ms CPU time
----------------------------------------------------------------

Logs from installation through Hive:

Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.039697 2030 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:47 localhost hyperkube[2030]: I0323 16:11:47.039701 2030 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.139992 2030 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.148081 2030 controller.go:220] failed to get node "localhost" when trying to set owner ref to the node lease: the server could not find the requested resource (get nodes localhost)
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.148876 2030 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: the server could not find the requested resource (post leases.coordination.k8s.io)
Mar 23 16:11:47 localhost hyperkube[2030]: F0323 16:11:47.213602 2030 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 23 16:11:47 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a

Attaching the full dump of the journal.

This is new:

Mar 23 19:24:30 localhost dbus-daemon[1844]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7793] device (ens4): Activation: successful, device activated.
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7802] manager: NetworkManager state is now CONNECTED_GLOBAL
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7816] manager: startup complete
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] hostname: couldn't set the system hostname to 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal' using hostnamed: GDBus.Error:org.freedesktop.DBus.Error.InvalidArgs: Invalid hostname 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal'
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] policy: set-hostname: couldn't set the system hostname to 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal': (1) Operation not permitted
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] policy: set-hostname: you should use hostnamed when systemd hardening is in effect!

This is a bug with the MCO script, which uses `hostname` instead of `hostnamectl`. We need to cherry-pick https://github.com/openshift/machine-config-operator/commit/cef48075d676ddeac751da2d40433e59f46d14cb

PRs for the cherry-picks back to 4.3 and 4.4:
https://github.com/openshift/machine-config-operator/pull/1650
https://github.com/openshift/machine-config-operator/pull/1651

(In reply to Ben Howard from comment #16)
> PR's for cherry-picks back to 4.3 and 4.4
> https://github.com/openshift/machine-config-operator/pull/1650
> https://github.com/openshift/machine-config-operator/pull/1651

One bug per release. Since this bug targets 4.5, we should link it to the PR that fixed this on the master branch and move it to MODIFIED. We would then clone this bug to 4.4, clone that bug to 4.3, and link the relevant PRs to each of those bugs.

Thank you, Scott. Retargeted to 4.5.

The linked MCO PR (https://github.com/openshift/machine-config-operator/pull/1622) is for bare-metal hosts only and does not apply to the GCP case. Moving back to ASSIGNED and dropping from the errata.

Submitted PR https://github.com/openshift/machine-config-operator/pull/1711. The messages about `policy: set-hostname: you should use hostnamed when systemd hardening is in effect!` are a complete red herring.

Verified on RHCOS 45.81.202005181029-0, which is part of 4.5.0-0.nightly-2020-05-19-021142. I was unable to verify on an OpenShift instance because the current project names are not long enough to trigger this issue. Instead, I verified on an RHCOS 45.81.202005181029-0 instance on GCP with a long hostname, using Ignition to add /etc/NetworkManager/dispatcher.d/90-long-hostname. Without the dispatcher script, RHCOS reboots and ends up with a hostname of localhost. With the dispatcher script, RHCOS reboots with the hostname set.
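For readability, the base64-encoded 90-long-hostname payload in the Ignition config below decodes approximately to the following NetworkManager dispatcher hook (reconstructed here for reference; comment wording and whitespace may differ slightly from the exact attachment):

#!/bin/bash
# On GCP the DHCP-provided hostname can exceed 63 characters. During firstboot
# the hostname is set in the initramfs before NetworkManager runs; on reboot,
# affected nodes fall back to 'localhost'. This hook truncates an over-long
# hostname at the first dot and applies it as the transient hostname via
# hostnamed, without breaking DNS resolution (NetworkManager still writes the
# appropriate /etc/resolv.conf settings).
IF=$1
STATUS=$2
log() { logger --tag "network-manager/$(basename $0)" "${@}"; }
# only act while the hostname is still the 'localhost' fallback
if [[ ! "$(/bin/hostname)" =~ (localhost|localhost.local) ]]; then
    log "hostname is already set"
    exit 0
fi
# only act on events that can carry a hostname
if [[ ! "$STATUS" =~ (up|hostname|dhcp4-change|dhcp6-change) ]]; then
    exit 0
fi
default_host="${DHCP4_HOST_NAME:-$DHCP6_HOST_NAME}"
# truncate the discovered hostname at the first dot and to 63 characters
host=$(echo ${default_host} | cut -f1 -d'.' | cut -c -63)
if [ "${#default_host}" -gt 63 ]; then
    log "truncating ${default_host} => ${host}"
    /bin/hostnamectl --transient set-hostname "${host}"
fi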
{ "ignition": {"version": "2.2.0"}, "storage": { "files": [{ "filesystem": "root", "path": "/etc/sudoers", "append": true, "contents": { "source": "data:,core%20ALL=(ALL)%20NOPASSWD:%20ALL" } }, { "filesystem": "root", "path": "/etc/NetworkManager/dispatcher.d/90-long-hostname", "mode": 493, "contents": { "source": "data:;base64,IyEvYmluL2Jhc2gKIwojIE9uIEdvb2dsZSBDb21wdXRlIFBsYXRmb3JtIChHQ1ApIHRoZSBob3N0bmFtZSBtYXkgYmUgdG9vIGxvbmcgKD42MyBjaGFycykuCiMgRHVyaW5nIGZpcnN0Ym9vdCB0aGUgaG9zdG5hbWUgaXMgc2V0IGluIHRoZSBpbml0cmFtZnMgYmVmb3JlIE5ldHdvcmtNYW5hZ2VyCiMgcnVuczsgb24gcmVib290IGFmZmVjdCBub2RlcyB1c2UgJ2xvY2FsaG9zdCcuIFRoaXMgaG9vayBpcyBhIHNpbXBsZSB3b3JrCiMgYXJvdW5kOiBpZiB0aGUgaG9zdCBuYW1lIGlzIGxvbmdlciB0aGFuIDYzIGNoYXJhY3RlcnMsIHRoZW4gdGhlIGhvc3RuYW1lCiMgaXMgdHJ1bmNhdGVkIGF0IHRoZSBfZmlyc3RfIGRvdC4KIwojIEFkZGl0aW9uYWxseSwgdGhpcyBob29rIGRvZXMgbm90IGJyZWFrIEROUyBvciBjbHVzdGVyIEROUyByZXNvbHV0aW9uLAojIHNpbmNlIE5ldHdvcmtNYW5hZ2VyIHNldHMgdGhlIGFwcHJvcHJpYXRlIC9ldGMvcmVzb2x2LmNvbmYgc2V0dGluZ3MuCklGPSQxClNUQVRVUz0kMgpsb2coKSB7IGxvZ2dlciAtLXRhZyAibmV0d29yay1tYW5hZ2VyLyQoYmFzZW5hbWUgJDApIiAiJHtAfSI7IH0KIyBjYXB0dXJlIGFsbCBlbGlnaWJsZSBob3N0bmFtZXMKaWYgW1sgISAiJCgvYmluL2hvc3RuYW1lKSIgPX4gKGxvY2FsaG9zdHxsb2NhbGhvc3QubG9jYWwpIF1dOyB0aGVuCiAgICBsb2cgImhvc3RuYW1lIGlzIGFscmVhZHkgc2V0IgogICAgZXhpdCAwCmZpCmlmIFtbICEgIiRTVEFUVVMiID1+ICh1cHxob3N0bmFtZXxkaGNwNC1jaGFuZ2V8ZGhjcDYtY2hhbmdlKSBdXTsgdGhlbgogICAgZXhpdCAwCmZpCmRlZmF1bHRfaG9zdD0iJHtESENQNF9IT1NUX05BTUU6LSRESENQNl9IT1NUX05BTUV9IgojIHRydW5jYXRlIHRoZSBob3N0bmFtZSB0byB0aGUgZmlyc3QgZG90IGFuZCB0aGFuIDY0IGNoYXJhY3RlcnMuCmhvc3Q9JChlY2hvICR7ZGVmYXVsdF9ob3N0fSB8IGN1dCAtZjEgLWQnLicgfCBjdXQgLWMgLTYzKQppZiBbICIkeyNkZWZhdWx0X2hvc3R9IiAtZ3QgNjMgXTsgdGhlbgogICAgbG9nICJkaXNjb3ZlcmVkIGhvc3RuYW1lIGlzIGxvbmdlciB0aGFuIHRoYW4gNjMgY2hhcmFjdGVycyIKICAgIGxvZyAidHJ1bmNhdGluZyAke2RlZmF1bHRfaG9zdH0gPT4gJHtob3N0fSIKICAgIC9iaW4vaG9zdG5hbWVjdGwgLS10cmFuc2llbnQgc2V0LWhvc3RuYW1lICIke2hvc3R9IgpmaQo=" } }]}, "passwd": { "users": [ { "name": "core", "sshAuthorizedKeys": [ "ssh-rsa AAA--SNIP--" ] } ] } } *** Bug 1838625 has been marked as a duplicate of this bug. *** Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409 The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days |