Created attachment 1667075 [details]
Log bundle from most recent attempt

Description of problem:
Cluster fails to initialize on a 4.3.z install due to a node with a hostname of localhost.

Version-Release number of selected component (if applicable):
4.3.0 and 4.3.1

How reproducible:
I only have one account to test with, but it happens no matter what I choose for the cluster name, directory, etc.

Steps to Reproduce:
1. Attempt to install OpenShift 4.3.z on GCP using the command:
   openshift-install create cluster --dir=<tmp-dir> --log-level=debug
2. Wait for the cluster installation to fail

Actual results:
The cluster fails to initialize with the error:

  FATAL Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded

Further investigation shows messages similar to the following:

Mar 02 18:33:23 localhost hyperkube[2090]: I0302 18:33:23.962477 2090 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 02 18:33:24 localhost hyperkube[2090]: I0302 18:33:24.025616 2090 kubelet.go:1846] skipping pod synchronization - container runtime status check may not have completed yet
Mar 02 18:33:24 localhost hyperkube[2090]: E0302 18:33:24.027207 2090 kubelet.go:2275] node "localhost" not found
Mar 02 18:33:24 localhost hyperkube[2090]: I0302 18:33:24.027660 2090 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 02 18:33:24 localhost hyperkube[2090]: E0302 18:33:24.127468 2090 kubelet.go:2275] node "localhost" not found
Mar 02 18:33:24 localhost hyperkube[2090]: E0302 18:33:24.132576 2090 controller.go:220] failed to get node "localhost" when trying to set owner ref to the node lease: the server could not find the requested resource (get nodes localhost)
Mar 02 18:33:24 localhost hyperkube[2090]: E0302 18:33:24.133390 2090 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: the server could not find the requested resource (post leases.coordination.k8s.io)
Mar 02 18:33:24 localhost hyperkube[2090]: F0302 18:33:24.170404 2090 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 02 18:33:24 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Mar 02 18:33:24 localhost systemd[1]: kubelet.service: Failed with result 'exit-code'.
Mar 02 18:33:24 localhost systemd[1]: kubelet.service: Consumed 400ms CPU time
Mar 02 18:33:34 localhost systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
Mar 02 18:33:34 localhost systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 2.
Mar 02 18:33:34 localhost systemd[1]: Stopped Kubernetes Kubelet.
Mar 02 18:33:34 localhost systemd[1]: kubelet.service: Consumed 400ms CPU time
Mar 02 18:33:34 localhost systemd[1]: Starting Kubernetes Kubelet...

Command output from the problematic node:

[core@cluste-jdddm-b ~]$ ssh 10.0.0.3
Red Hat Enterprise Linux CoreOS 43.81.202002032142.0
Part of OpenShift 4.3, RHCOS is a Kubernetes native operating system managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead, make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.3/architecture/architecture-rhcos.html

---
Last login: Mon Mar 2 20:39:56 2020 from 10.0.0.4
[core@localhost ~]$ hostname -f
localhost
[core@localhost ~]$ systemctl status NetworkManager-wait-online
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
   Active: active (exited) since Mon 2020-03-02 18:33:11 UTC; 2h 56min ago
     Docs: man:nm-online(1)
  Process: 1888 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, status=0/SUCCESS)
 Main PID: 1888 (code=exited, status=0/SUCCESS)
      CPU: 44ms

Mar 02 18:33:10 localhost systemd[1]: Starting Network Manager Wait Online...
Mar 02 18:33:11 localhost systemd[1]: Started Network Manager Wait Online.

[core@localhost ~]$ systemctl status systemd-hostnamed
● systemd-hostnamed.service - Hostname Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-hostnamed.service; static; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:systemd-hostnamed.service(8)
           man:hostname(5)
           man:machine-info(5)
           https://www.freedesktop.org/wiki/Software/systemd/hostnamed

Mar 02 18:33:10 localhost systemd[1]: Starting Hostname Service...
Mar 02 18:33:11 localhost systemd[1]: Started Hostname Service.
Mar 02 18:33:43 localhost systemd[1]: systemd-hostnamed.service: Consumed 284ms CPU time

Expected results:
Successfully initialize an OpenShift 4.3.z cluster on GCP.

Additional info:
Discussed with teams on #forum-installer before posting to BZ.
It appears that this is also impacting 4.2. I had to downgrade my installer to 4.2.9 to get it to properly assign a hostname to these nodes.
Correction: it appears that the hostname is initially set correctly upon boot, and then it is changed to localhost when the etcd initialization process begins.
After first boot of the node (10.0.0.2):

[core@cluste-m464q-b ~]$ ssh 10.0.0.2
The authenticity of host '10.0.0.2 (10.0.0.2)' can't be established.
ECDSA key fingerprint is SHA256:VO5vg00iV09pfQY2jcjqa4CstW+BzHeIl7EYdGb/iAk.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.0.2' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191002.0
WARNING: Direct SSH access to machines is not recommended.
---
[core@cluste-m464q-m-0 ~]$ exit
logout
Connection to 10.0.0.2 closed.

After we see bootkube say "Waiting for etcd cluster...":

[core@cluste-m464q-b ~]$ ssh 10.0.0.2
Red Hat Enterprise Linux CoreOS 42.81.20191126.0
WARNING: Direct SSH access to machines is not recommended.
---
Last login: Tue Mar 3 01:58:33 2020 from 10.0.0.5
[core@localhost ~]$ exit
logout
Connection to 10.0.0.2 closed.
There are two other BZs reporting similar issues with the hostname reverting to localhost:

https://bugzilla.redhat.com/show_bug.cgi?id=1803962
https://bugzilla.redhat.com/show_bug.cgi?id=1804426

The recommended workaround for now is to write out the desired hostname in /etc/hostname as part of the Ignition config.
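A minimal sketch of that workaround, assuming Ignition spec 2.2.0 (the version used elsewhere in this bug); the path and data-URL encoding follow the Ignition file schema, and the hostname value is only a placeholder, not a name from this cluster:

```json
{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [{
      "filesystem": "root",
      "path": "/etc/hostname",
      "mode": 420,
      "contents": { "source": "data:,master-0.example.internal" }
    }]
  }
}
```

Because /etc/hostname is populated before NetworkManager runs, the node keeps this name even if the DHCP-provided hostname is never applied.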
Neither of those issues looks to be related to this one beyond the localhost connection. In both of those cases, static IPs are being used and hostname -f responds correctly, neither of which is happening here. In addition, both of those issues concern UPI deployments, while this is an IPI deployment. Finally, NetworkManager-wait-online.service is not failing here, as it is in both of those other bugs.
It appears that creating a new project in GCP has made this issue go away. I am unsure whether this is a wider issue or something specific to that project's configuration. I have kept the project up for debugging in case someone would like to work with me to debug the issue directly. Otherwise, this BZ can be resolved.
With the last comment, coupled with the evidence on the openshift-sme thread about other folks having successful installs, I'm going to chalk this up to a misconfigured environment. Closing as WORKSFORME. Please re-open if it occurs again.
We have been testing the GCP installer in OSD, in two different ways:

1. Downloading the installer directly. I have tried installer versions 4.3.5, 4.3.8, and 4.3.0, and I hit this issue with all of them.
2. Hive is having issues installing this as well.

From my logs:
----------------------------------------------------------------
control-plane/10.0.0.4/journals/kubelet.log
Mar 23 16:11:54 localhost hyperkube[9453]: I0323 16:11:54.980543 9453 kubelet.go:1846] skipping pod synchronization - container runtime status check may not have completed yet
Mar 23 16:11:54 localhost hyperkube[9453]: I0323 16:11:54.981792 9453 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 23 16:11:55 localhost hyperkube[9453]: E0323 16:11:55.080143 9453 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:55 localhost hyperkube[9453]: E0323 16:11:55.087134 9453 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: Get https://api-int.mw.privtest.mwoodson.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/localhost?timeout=10s: dial tcp 10.0.0.2:6443: connect: connection refused
Mar 23 16:11:55 localhost hyperkube[9453]: F0323 16:11:55.128852 9453 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Failed with result 'exit-code'.
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Consumed 399ms CPU time
----------------------------------------------------------------
Logs from installation through Hive:

Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.039697 2030 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:47 localhost hyperkube[2030]: I0323 16:11:47.039701 2030 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.139992 2030 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.148081 2030 controller.go:220] failed to get node "localhost" when trying to set owner ref to the node lease: the server could not find the requested resource (get nodes localhost)
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.148876 2030 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: the server could not find the requested resource (post leases.coordination.k8s.io)
Mar 23 16:11:47 localhost hyperkube[2030]: F0323 16:11:47.213602 2030 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 23 16:11:47 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a

Attaching the full journal dump.
This is new:

Mar 23 19:24:30 localhost dbus-daemon[1844]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7793] device (ens4): Activation: successful, device activated.
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7802] manager: NetworkManager state is now CONNECTED_GLOBAL
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7816] manager: startup complete
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] hostname: couldn't set the system hostname to 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal' using hostnamed: GDBus.Error:org.freedesktop.DBus.Error.InvalidArgs: Invalid hostname 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal'
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] policy: set-hostname: couldn't set the system hostname to 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal': (1) Operation not permitted
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] policy: set-hostname: you should use hostnamed when systemd hardening is in effect!

This is a bug in the MCO script, which uses `hostname` instead of `hostnamectl`.
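The "Invalid hostname" rejection above is consistent with the Linux HOST_NAME_MAX limit of 64 bytes, which systemd-hostnamed enforces: the GCP-provided FQDN in the log is 65 characters long, one byte over the limit. A small shell check (using the exact hostname from the log) illustrates this, and shows that the first dotted label alone would be a valid short name:

```shell
# Hostname that hostnamed rejected in the NetworkManager log above.
host='mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal'

# Full length exceeds HOST_NAME_MAX (64 bytes on Linux).
echo "full length: ${#host}"        # 65 -- one byte over the limit

# The first label by itself is well within limits.
echo "first label: ${host%%.*}"     # mw-gfrr7-m-0
```

This is why truncating at the first dot (as the eventual workaround does) produces a name hostnamed will accept.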
We need to cherry-pick https://github.com/openshift/machine-config-operator/commit/cef48075d676ddeac751da2d40433e59f46d14cb
PRs for cherry-picks back to 4.3 and 4.4:
https://github.com/openshift/machine-config-operator/pull/1650
https://github.com/openshift/machine-config-operator/pull/1651
(In reply to Ben Howard from comment #16) > PR's for cherry-picks back to 4.3 and 4.4 > https://github.com/openshift/machine-config-operator/pull/1650 > https://github.com/openshift/machine-config-operator/pull/1651 One bug per release. Since this targets 4.5 we should link this to the PR which fixed this on master branch and move this to MODIFIED. We'd then clone this bug to 4.4, clone that bug to 4.3 and link the relevant PRs to each of those bugs.
Thank you Scott. Retargeted to 4.5.
The linked MCO PR (https://github.com/openshift/machine-config-operator/pull/1622) is for bare-metal hosts only and does not apply to the GCP case. Moving back to ASSIGNED and dropping from the errata.
Submitted PR https://github.com/openshift/machine-config-operator/pull/1711. The messages about `policy: set-hostname: you should use hostnamed when systemd hardening is in effect!` are a complete red herring.
Verified on RHCOS 45.81.202005181029-0, which is part of 4.5.0-0.nightly-2020-05-19-021142. I was unable to verify on an OpenShift instance because the current project names are not long enough to trigger this issue. Instead, I verified on an RHCOS 45.81.202005181029-0 instance on GCP with a long hostname, using Ignition to add /etc/NetworkManager/dispatcher.d/90-long-hostname. Without the dispatcher script, RHCOS reboots and gets a hostname of localhost. With the dispatcher script, RHCOS reboots with the hostname set.

{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/sudoers",
        "append": true,
        "contents": { "source": "data:,core%20ALL=(ALL)%20NOPASSWD:%20ALL" }
      },
      {
        "filesystem": "root",
        "path": "/etc/NetworkManager/dispatcher.d/90-long-hostname",
        "mode": 493,
        "contents": { "source": "data:;base64,IyEvYmluL2Jhc2gKIwojIE9uIEdvb2dsZSBDb21wdXRlIFBsYXRmb3JtIChHQ1ApIHRoZSBob3N0bmFtZSBtYXkgYmUgdG9vIGxvbmcgKD42MyBjaGFycykuCiMgRHVyaW5nIGZpcnN0Ym9vdCB0aGUgaG9zdG5hbWUgaXMgc2V0IGluIHRoZSBpbml0cmFtZnMgYmVmb3JlIE5ldHdvcmtNYW5hZ2VyCiMgcnVuczsgb24gcmVib290IGFmZmVjdCBub2RlcyB1c2UgJ2xvY2FsaG9zdCcuIFRoaXMgaG9vayBpcyBhIHNpbXBsZSB3b3JrCiMgYXJvdW5kOiBpZiB0aGUgaG9zdCBuYW1lIGlzIGxvbmdlciB0aGFuIDYzIGNoYXJhY3RlcnMsIHRoZW4gdGhlIGhvc3RuYW1lCiMgaXMgdHJ1bmNhdGVkIGF0IHRoZSBfZmlyc3RfIGRvdC4KIwojIEFkZGl0aW9uYWxseSwgdGhpcyBob29rIGRvZXMgbm90IGJyZWFrIEROUyBvciBjbHVzdGVyIEROUyByZXNvbHV0aW9uLAojIHNpbmNlIE5ldHdvcmtNYW5hZ2VyIHNldHMgdGhlIGFwcHJvcHJpYXRlIC9ldGMvcmVzb2x2LmNvbmYgc2V0dGluZ3MuCklGPSQxClNUQVRVUz0kMgpsb2coKSB7IGxvZ2dlciAtLXRhZyAibmV0d29yay1tYW5hZ2VyLyQoYmFzZW5hbWUgJDApIiAiJHtAfSI7IH0KIyBjYXB0dXJlIGFsbCBlbGlnaWJsZSBob3N0bmFtZXMKaWYgW1sgISAiJCgvYmluL2hvc3RuYW1lKSIgPX4gKGxvY2FsaG9zdHxsb2NhbGhvc3QubG9jYWwpIF1dOyB0aGVuCiAgICBsb2cgImhvc3RuYW1lIGlzIGFscmVhZHkgc2V0IgogICAgZXhpdCAwCmZpCmlmIFtbICEgIiRTVEFUVVMiID1+ICh1cHxob3N0bmFtZXxkaGNwNC1jaGFuZ2V8ZGhjcDYtY2hhbmdlKSBdXTsgdGhlbgogICAgZXhpdCAwCmZpCmRlZmF1bHRfaG9zdD0iJHtESENQNF9IT1NUX05BTUU6LSRESENQNl9IT1NUX05BTUV9IgojIHRydW5jYXRlIHRoZSBob3N0bmFtZSB0byB0aGUgZmlyc3QgZG90IGFuZCB0aGFuIDY0IGNoYXJhY3RlcnMuCmhvc3Q9JChlY2hvICR7ZGVmYXVsdF9ob3N0fSB8IGN1dCAtZjEgLWQnLicgfCBjdXQgLWMgLTYzKQppZiBbICIkeyNkZWZhdWx0X2hvc3R9IiAtZ3QgNjMgXTsgdGhlbgogICAgbG9nICJkaXNjb3ZlcmVkIGhvc3RuYW1lIGlzIGxvbmdlciB0aGFuIHRoYW4gNjMgY2hhcmFjdGVycyIKICAgIGxvZyAidHJ1bmNhdGluZyAke2RlZmF1bHRfaG9zdH0gPT4gJHtob3N0fSIKICAgIC9iaW4vaG9zdG5hbWVjdGwgLS10cmFuc2llbnQgc2V0LWhvc3RuYW1lICIke2hvc3R9IgpmaQo=" }
      }
    ]
  },
  "passwd": {
    "users": [
      { "name": "core", "sshAuthorizedKeys": [ "ssh-rsa AAA--SNIP--" ] }
    ]
  }
}
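The base64 payload above is the 90-long-hostname dispatcher script: it exits unless the current hostname is localhost, then takes the DHCP-provided name, keeps only the first dotted label (capped at 63 characters), and applies it via `hostnamectl --transient set-hostname`. The core truncation step can be sketched as a standalone shell function; the sample hostname below is the one from the earlier NetworkManager log, and the `hostnamectl` call is shown only as a comment since it requires a real node:

```shell
#!/bin/bash
# Sketch of the truncation logic in the dispatcher script: keep the
# first dotted label, capped at 63 characters (the length hostnamed
# will accept but GCP's long FQDNs exceed).
truncate_hostname() {
    echo "$1" | cut -f1 -d'.' | cut -c -63
}

full='mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal'
if [ "${#full}" -gt 63 ]; then
    echo "truncating ${full} => $(truncate_hostname "${full}")"
    # On a real node the script would then run:
    #   hostnamectl --transient set-hostname "$(truncate_hostname "${full}")"
fi
```

Names at or under 63 characters pass through the function unchanged, so the dispatcher hook is a no-op on clusters with short project names.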
*** Bug 1838625 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days