Bug 1809345
| Summary: | OpenShift Cluster fails to initialize on 4.3.z install due to a node with a hostname of localhost | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michael McNeill <mmcneill> |
| Component: | RHCOS | Assignee: | Ben Howard <behoward> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.3.z | CC: | bbreard, behoward, dornelas, dustymabe, imcleod, jerzhang, jligon, miabbott, mnguyen, mwoodson, nstielau, razevedo, rludva, sdodson, smilner, vjaypurk, wking, yanyang |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.5.0 | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1823882 1823883 (view as bug list) | Environment: | |
| Last Closed: | 2020-07-13 17:17:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1186913, 1823882, 1837122, 1837642 | | |
| Attachments: | | | |
Description Michael McNeill 2020-03-02 21:56:49 UTC
It appears that this is also impacting 4.2. I had to downgrade my installer to 4.2.9 to get it to properly assign a hostname to these nodes.

Correction: it appears that the hostname is initially set correctly upon boot, and is then changed to localhost when the etcd initialization process begins.

After first boot of the node (10.0.0.2):

[core@cluste-m464q-b ~]$ ssh core@10.0.0.2
The authenticity of host '10.0.0.2 (10.0.0.2)' can't be established.
ECDSA key fingerprint is SHA256:VO5vg00iV09pfQY2jcjqa4CstW+BzHeIl7EYdGb/iAk.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.0.2' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191002.0
WARNING: Direct SSH access to machines is not recommended.
---
[core@cluste-m464q-m-0 ~]$ exit
logout
Connection to 10.0.0.2 closed.

After we see bootkube say "Waiting for etcd cluster...":

[core@cluste-m464q-b ~]$ ssh core@10.0.0.2
Red Hat Enterprise Linux CoreOS 42.81.20191126.0
WARNING: Direct SSH access to machines is not recommended.
---
Last login: Tue Mar 3 01:58:33 2020 from 10.0.0.5
[core@localhost ~]$ exit
logout
Connection to 10.0.0.2 closed.

There are two other BZs reporting similar issues with the hostname reverting to localhost:
https://bugzilla.redhat.com/show_bug.cgi?id=1803962
https://bugzilla.redhat.com/show_bug.cgi?id=1804426

The recommended workaround for now is to write out the desired hostname in /etc/hostname as part of the Ignition config (a minimal example snippet appears below).

Neither of those issues looks to be related to this beyond the localhost connection. In both of those cases, static IPs are being used and hostname -f responds correctly, which is not happening in this case. In addition, both of those issues concern UPI deployments, while this is an IPI deployment. And finally, NetworkManager-wait-online.service is not failing, as it is in both of those other bugs.

It appears that creating a new project in GCP has made this issue go away. I am unsure whether this is a wider issue or something specific to that project's configuration. I have kept the project up for debugging if someone would like to work with me to debug the issue directly. Otherwise, this BZ can be resolved.

With the last comment, coupled with the evidence on the openshift-sme thread about other folks having successful installs, I'm going to chalk this up as a misconfigured environment. Closing as WORKSFORME. Please re-open if it occurs again.

We have been testing the GCP installer in OSD, in two different ways:
1. Downloading the installer directly. I have tried installer versions 4.3.5, 4.3.8, and 4.3.0, and hit this issue with each of them.
2. Hive is having issues installing this as well.
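For reference, a minimal Ignition sketch of the /etc/hostname workaround mentioned above. The hostname value here is a placeholder; use the node's intended name, and match the Ignition spec version your installer consumes:

{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/hostname",
        "mode": 420,
        "contents": { "source": "data:,master-0.example.internal" }
      }
    ]
  }
}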
From my logs:

----------------------------------------------------------------
control-plane/10.0.0.4/journals/kubelet.log

Mar 23 16:11:54 localhost hyperkube[9453]: I0323 16:11:54.980543 9453 kubelet.go:1846] skipping pod synchronization - container runtime status check may not have completed yet
Mar 23 16:11:54 localhost hyperkube[9453]: I0323 16:11:54.981792 9453 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 23 16:11:55 localhost hyperkube[9453]: E0323 16:11:55.080143 9453 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:55 localhost hyperkube[9453]: E0323 16:11:55.087134 9453 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: Get https://api-int.mw.privtest.mwoodson.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/localhost?timeout=10s: dial tcp 10.0.0.2:6443: connect: connection refused
Mar 23 16:11:55 localhost hyperkube[9453]: F0323 16:11:55.128852 9453 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Failed with result 'exit-code'.
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Consumed 399ms CPU time
----------------------------------------------------------------

Logs from installation through Hive:

Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.039697 2030 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:47 localhost hyperkube[2030]: I0323 16:11:47.039701 2030 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.139992 2030 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.148081 2030 controller.go:220] failed to get node "localhost" when trying to set owner ref to the node lease: the server could not find the requested resource (get nodes localhost)
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.148876 2030 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: the server could not find the requested resource (post leases.coordination.k8s.io)
Mar 23 16:11:47 localhost hyperkube[2030]: F0323 16:11:47.213602 2030 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 23 16:11:47 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a

Attaching the full dump of the journal.

This is new:

Mar 23 19:24:30 localhost dbus-daemon[1844]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7793] device (ens4): Activation: successful, device activated.
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7802] manager: NetworkManager state is now CONNECTED_GLOBAL
Mar 23 19:24:30 localhost NetworkManager[1852]: <info> [1584991470.7816] manager: startup complete
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] hostname: couldn't set the system hostname to 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal' using hostnamed: GDBus.Error:org.freedesktop.DBus.Error.InvalidArgs: Invalid hostname 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal'
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] policy: set-hostname: couldn't set the system hostname to 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal': (1) Operation not permitted
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn> [1584991470.7821] policy: set-hostname: you should use hostnamed when systemd hardening is in effect!

This is a bug with the MCO script, which uses `hostname` instead of `hostnamectl`. We need to cherry-pick https://github.com/openshift/machine-config-operator/commit/cef48075d676ddeac751da2d40433e59f46d14cb

PRs for the cherry-picks back to 4.3 and 4.4:
https://github.com/openshift/machine-config-operator/pull/1650
https://github.com/openshift/machine-config-operator/pull/1651

(In reply to Ben Howard from comment #16)
> PR's for cherry-picks back to 4.3 and 4.4
> https://github.com/openshift/machine-config-operator/pull/1650
> https://github.com/openshift/machine-config-operator/pull/1651

One bug per release. Since this bug targets 4.5, we should link it to the PR that fixed this on the master branch and move it to MODIFIED. We would then clone this bug to 4.4, clone that bug to 4.3, and link the relevant PRs to each of those bugs.

Thank you, Scott. Retargeted to 4.5.

The linked MCO PR (https://github.com/openshift/machine-config-operator/pull/1622) is for bare-metal hosts only and does not apply to the GCP case. Moving back to ASSIGNED and dropping from the errata.

Submitted PR https://github.com/openshift/machine-config-operator/pull/1711. The messages about `policy: set-hostname: you should use hostnamed when systemd hardening is in effect!` are a complete red herring.

Verified on RHCOS 45.81.202005181029-0, which is part of 4.5.0-0.nightly-2020-05-19-021142. I was unable to verify on an OpenShift instance because the current project names are not long enough to trigger this issue. Instead, I verified on an RHCOS 45.81.202005181029-0 instance on GCP with a long hostname, using Ignition to add /etc/NetworkManager/dispatcher.d/90-long-hostname. Without the dispatcher script, RHCOS reboots and ends up with a hostname of localhost. With the dispatcher script, RHCOS reboots with the hostname set.
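For readability, the base64-encoded 90-long-hostname payload in the Ignition config below decodes approximately to the following NetworkManager dispatcher hook (reconstructed here for reference; comment wording and whitespace may differ slightly from the exact attachment):

#!/bin/bash
# On GCP the DHCP-provided hostname can exceed 63 characters. During firstboot
# the hostname is set in the initramfs before NetworkManager runs; on reboot,
# affected nodes fall back to 'localhost'. This hook truncates an over-long
# hostname at the first dot and applies it as the transient hostname via
# hostnamed, without breaking DNS resolution (NetworkManager still writes the
# appropriate /etc/resolv.conf settings).
IF=$1
STATUS=$2
log() { logger --tag "network-manager/$(basename $0)" "${@}"; }
# only act while the hostname is still the 'localhost' fallback
if [[ ! "$(/bin/hostname)" =~ (localhost|localhost.local) ]]; then
    log "hostname is already set"
    exit 0
fi
# only act on events that can carry a hostname
if [[ ! "$STATUS" =~ (up|hostname|dhcp4-change|dhcp6-change) ]]; then
    exit 0
fi
default_host="${DHCP4_HOST_NAME:-$DHCP6_HOST_NAME}"
# truncate the discovered hostname at the first dot and to 63 characters
host=$(echo ${default_host} | cut -f1 -d'.' | cut -c -63)
if [ "${#default_host}" -gt 63 ]; then
    log "truncating ${default_host} => ${host}"
    /bin/hostnamectl --transient set-hostname "${host}"
fi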
{ "ignition": {"version": "2.2.0"}, "storage": { "files": [{ "filesystem": "root", "path": "/etc/sudoers", "append": true, "contents": { "source": "data:,core%20ALL=(ALL)%20NOPASSWD:%20ALL" } }, { "filesystem": "root", "path": "/etc/NetworkManager/dispatcher.d/90-long-hostname", "mode": 493, "contents": { "source": "data:;base64,IyEvYmluL2Jhc2gKIwojIE9uIEdvb2dsZSBDb21wdXRlIFBsYXRmb3JtIChHQ1ApIHRoZSBob3N0bmFtZSBtYXkgYmUgdG9vIGxvbmcgKD42MyBjaGFycykuCiMgRHVyaW5nIGZpcnN0Ym9vdCB0aGUgaG9zdG5hbWUgaXMgc2V0IGluIHRoZSBpbml0cmFtZnMgYmVmb3JlIE5ldHdvcmtNYW5hZ2VyCiMgcnVuczsgb24gcmVib290IGFmZmVjdCBub2RlcyB1c2UgJ2xvY2FsaG9zdCcuIFRoaXMgaG9vayBpcyBhIHNpbXBsZSB3b3JrCiMgYXJvdW5kOiBpZiB0aGUgaG9zdCBuYW1lIGlzIGxvbmdlciB0aGFuIDYzIGNoYXJhY3RlcnMsIHRoZW4gdGhlIGhvc3RuYW1lCiMgaXMgdHJ1bmNhdGVkIGF0IHRoZSBfZmlyc3RfIGRvdC4KIwojIEFkZGl0aW9uYWxseSwgdGhpcyBob29rIGRvZXMgbm90IGJyZWFrIEROUyBvciBjbHVzdGVyIEROUyByZXNvbHV0aW9uLAojIHNpbmNlIE5ldHdvcmtNYW5hZ2VyIHNldHMgdGhlIGFwcHJvcHJpYXRlIC9ldGMvcmVzb2x2LmNvbmYgc2V0dGluZ3MuCklGPSQxClNUQVRVUz0kMgpsb2coKSB7IGxvZ2dlciAtLXRhZyAibmV0d29yay1tYW5hZ2VyLyQoYmFzZW5hbWUgJDApIiAiJHtAfSI7IH0KIyBjYXB0dXJlIGFsbCBlbGlnaWJsZSBob3N0bmFtZXMKaWYgW1sgISAiJCgvYmluL2hvc3RuYW1lKSIgPX4gKGxvY2FsaG9zdHxsb2NhbGhvc3QubG9jYWwpIF1dOyB0aGVuCiAgICBsb2cgImhvc3RuYW1lIGlzIGFscmVhZHkgc2V0IgogICAgZXhpdCAwCmZpCmlmIFtbICEgIiRTVEFUVVMiID1+ICh1cHxob3N0bmFtZXxkaGNwNC1jaGFuZ2V8ZGhjcDYtY2hhbmdlKSBdXTsgdGhlbgogICAgZXhpdCAwCmZpCmRlZmF1bHRfaG9zdD0iJHtESENQNF9IT1NUX05BTUU6LSRESENQNl9IT1NUX05BTUV9IgojIHRydW5jYXRlIHRoZSBob3N0bmFtZSB0byB0aGUgZmlyc3QgZG90IGFuZCB0aGFuIDY0IGNoYXJhY3RlcnMuCmhvc3Q9JChlY2hvICR7ZGVmYXVsdF9ob3N0fSB8IGN1dCAtZjEgLWQnLicgfCBjdXQgLWMgLTYzKQppZiBbICIkeyNkZWZhdWx0X2hvc3R9IiAtZ3QgNjMgXTsgdGhlbgogICAgbG9nICJkaXNjb3ZlcmVkIGhvc3RuYW1lIGlzIGxvbmdlciB0aGFuIHRoYW4gNjMgY2hhcmFjdGVycyIKICAgIGxvZyAidHJ1bmNhdGluZyAke2RlZmF1bHRfaG9zdH0gPT4gJHtob3N0fSIKICAgIC9iaW4vaG9zdG5hbWVjdGwgLS10cmFuc2llbnQgc2V0LWhvc3RuYW1lICIke2hvc3R9IgpmaQo=" } }]}, "passwd": { "users": [ { "name": "core", "sshAuthorizedKeys": [ "ssh-rsa AAA--SNIP--" ] } ] } } *** Bug 1838625 has been marked as a duplicate of this bug. *** Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409 The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days |