Bug 1809345 - OpenShift Cluster fails to initialize on 4.3.z install due to a node with a hostname of localhost [NEEDINFO]
Summary: OpenShift Cluster fails to initialize on 4.3.z install due to a node with a h...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.z
Hardware: x86_64
OS: Unspecified
high
high
Target Milestone: ---
: 4.5.0
Assignee: Ben Howard
QA Contact: Michael Nguyen
URL:
Whiteboard:
: 1838625 (view as bug list)
Depends On:
Blocks: 1186913 1823882 1837122 1837642
TreeView+ depends on / blocked
 
Reported: 2020-03-02 21:56 UTC by Michael McNeill
Modified: 2020-07-23 01:35 UTC (History)
18 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1823882 1823883 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:17:29 UTC
Target Upstream Version:
vjaypurk: needinfo? (mnguyen)
vjaypurk: needinfo? (mnguyen)


Attachments (Terms of Use)
Log bundle from most recent attempt (13.45 MB, application/gzip)
2020-03-02 21:56 UTC, Michael McNeill
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-gcp pull 88 0 None closed Bug 1809345: Add the machine's name as a known NodeInternalDNS 2021-02-17 11:41:11 UTC
Github openshift machine-config-operator pull 1711 0 None closed Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml 2021-02-17 11:41:11 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:17:57 UTC

Internal Links: 1843565

Description Michael McNeill 2020-03-02 21:56:49 UTC
Created attachment 1667075 [details]
Log bundle from most recent attempt

Description of problem:
Cluster fails to initialize on 4.3.z install due to a node with a hostname of localhost. 

Version-Release number of selected component (if applicable):
4.3.0 and 4.3.1

How reproducible:
I only have one account to test with, but it happens no matter what I choose for the cluster name, directory, etc.

Steps to Reproduce:
1. Attempt to install OpenShift 4.3.z on GCP using the command: openshift-install create cluster --dir=<tmp-dir> --log-level=debug
2. Wait for the cluster installation to fail

Actual results:
The cluster fails to initalize with the error:
"FATAL Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded" 

Further investigation shows similar messages to the following:

Mar 02 18:33:23 localhost hyperkube[2090]: I0302 18:33:23.962477    2090 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 02 18:33:24 localhost hyperkube[2090]: I0302 18:33:24.025616    2090 kubelet.go:1846] skipping pod synchronization - container runtime status check may not have completed yet
Mar 02 18:33:24 localhost hyperkube[2090]: E0302 18:33:24.027207    2090 kubelet.go:2275] node "localhost" not found
Mar 02 18:33:24 localhost hyperkube[2090]: I0302 18:33:24.027660    2090 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 02 18:33:24 localhost hyperkube[2090]: E0302 18:33:24.127468    2090 kubelet.go:2275] node "localhost" not found
Mar 02 18:33:24 localhost hyperkube[2090]: E0302 18:33:24.132576    2090 controller.go:220] failed to get node "localhost" when trying to set owner ref to the node lease: the server could not find the requested resource (get nodes localhost)
Mar 02 18:33:24 localhost hyperkube[2090]: E0302 18:33:24.133390    2090 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: the server could not find the requested resource (post leases.coordination.k8s.io)
Mar 02 18:33:24 localhost hyperkube[2090]: F0302 18:33:24.170404    2090 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 02 18:33:24 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Mar 02 18:33:24 localhost systemd[1]: kubelet.service: Failed with result 'exit-code'.
Mar 02 18:33:24 localhost systemd[1]: kubelet.service: Consumed 400ms CPU time
Mar 02 18:33:34 localhost systemd[1]: kubelet.service: Service RestartSec=10s expired, scheduling restart.
Mar 02 18:33:34 localhost systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 2.
Mar 02 18:33:34 localhost systemd[1]: Stopped Kubernetes Kubelet.
Mar 02 18:33:34 localhost systemd[1]: kubelet.service: Consumed 400ms CPU time
Mar 02 18:33:34 localhost systemd[1]: Starting Kubernetes Kubelet...

Command output from the problematic node:

[core@cluste-jdddm-b ~]$ ssh 10.0.0.3
Red Hat Enterprise Linux CoreOS 43.81.202002032142.0
  Part of OpenShift 4.3, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.3/architecture/architecture-rhcos.html

---
Last login: Mon Mar  2 20:39:56 2020 from 10.0.0.4
[core@localhost ~]$ hostname -f
localhost
[core@localhost ~]$ systemctl status NetworkManager-wait-online
● NetworkManager-wait-online.service - Network Manager Wait Online
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
   Active: active (exited) since Mon 2020-03-02 18:33:11 UTC; 2h 56min ago
     Docs: man:nm-online(1)
  Process: 1888 ExecStart=/usr/bin/nm-online -s -q --timeout=30 (code=exited, status=0/SUCCESS)
 Main PID: 1888 (code=exited, status=0/SUCCESS)
      CPU: 44ms

Mar 02 18:33:10 localhost systemd[1]: Starting Network Manager Wait Online...
Mar 02 18:33:11 localhost systemd[1]: Started Network Manager Wait Online.
[core@localhost ~]$ systemctl status systemd-hostnamed
● systemd-hostnamed.service - Hostname Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-hostnamed.service; static; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:systemd-hostnamed.service(8)
           man:hostname(5)
           man:machine-info(5)
           https://www.freedesktop.org/wiki/Software/systemd/hostnamed

Mar 02 18:33:10 localhost systemd[1]: Starting Hostname Service...
Mar 02 18:33:11 localhost systemd[1]: Started Hostname Service.
Mar 02 18:33:43 localhost systemd[1]: systemd-hostnamed.service: Consumed 284ms CPU time

Expected results:

Successfully initialize OpenShift 4.3.z cluster on GCP. 

Additional info:
Discussed with teams on #forum-installer before posting to BZ.

Comment 1 Michael McNeill 2020-03-03 01:57:34 UTC
It appears that this is also impacting 4.2. I had to downgrade my installer to 4.2.9 to get it to properly assign a hostname to these nodes.

Comment 2 Michael McNeill 2020-03-03 01:59:42 UTC
Correction, it appears that the hostname is initially set correctly upon boot, and then it is changed to localhost when the etcd initialization process begins.

Comment 3 Michael McNeill 2020-03-03 02:02:21 UTC
After first boot of the node (10.0.0.2):

[core@cluste-m464q-b ~]$ ssh core@10.0.0.2
The authenticity of host '10.0.0.2 (10.0.0.2)' can't be established.
ECDSA key fingerprint is SHA256:VO5vg00iV09pfQY2jcjqa4CstW+BzHeIl7EYdGb/iAk.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.0.2' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 42.80.20191002.0
WARNING: Direct SSH access to machines is not recommended.

---
[core@cluste-m464q-m-0 ~]$ exit
logout
Connection to 10.0.0.2 closed.

After the we see bootkube say "Waiting for etcd cluster...":

[core@cluste-m464q-b ~]$ ssh core@10.0.0.2
Red Hat Enterprise Linux CoreOS 42.81.20191126.0
WARNING: Direct SSH access to machines is not recommended.

---
Last login: Tue Mar  3 01:58:33 2020 from 10.0.0.5
[core@localhost ~]$ exit
logout
Connection to 10.0.0.2 closed.

Comment 4 Micah Abbott 2020-03-03 13:43:21 UTC
There's two other BZs that are reporting similar issues with the hostname reverting to localhost:

https://bugzilla.redhat.com/show_bug.cgi?id=1803962
https://bugzilla.redhat.com/show_bug.cgi?id=1804426

The recommended workaround for now is to write out the desired hostname in /etc/hostname as part of the Ignition config.

Comment 5 Michael McNeill 2020-03-03 13:48:22 UTC
Neither of those issues look to be related to this beyond the localhost connection. In both of those cases, static IPs are attempting to be used and hostname -f responds correctly, which is not happening in this case. In addition, both of those issues are related to UPI deployments and this is a IPI deployment. And finally, NetworkManager-wait-online.service is not failing, as it is in both of those other bugs.

Comment 6 Michael McNeill 2020-03-10 17:35:36 UTC
It appears that creating a new project in GCP has made this issue go away. I am unsure if this is a wider issue or something just specific to that project configuration. I have kept the project up for debugging, if someone would like to work with me to directly debug the issue. Otherwise, this BZ can be resolved.

Comment 7 Micah Abbott 2020-03-10 18:50:43 UTC
With the last comment, coupled with the evidence on the openshift-sme thread about other folks having successful installs, I'm going to chalk this up as a mis-configured environment.  Closing as WORKSFORME.

Please re-open if it occurs again.

Comment 8 Matt Woodson 2020-03-23 18:41:46 UTC
We have been testing the GCP installer in OSD.  We have been testing two different ways:

1.  download the installer.  I have tried the following versions of the installer: 4.3.5, 4.3.8, 4.3.0.  I have this issue


2. Hive is having issues installing this as well.


From my logs:

----------------------------------------------------------------
control-plane/10.0.0.4/journals/kubelet.log


Mar 23 16:11:54 localhost hyperkube[9453]: I0323 16:11:54.980543    9453 kubelet.go:1846] skipping pod synchronization - container runtime status check may not have completed yet
Mar 23 16:11:54 localhost hyperkube[9453]: I0323 16:11:54.981792    9453 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 23 16:11:55 localhost hyperkube[9453]: E0323 16:11:55.080143    9453 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:55 localhost hyperkube[9453]: E0323 16:11:55.087134    9453 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: Get https://api-int.mw.privtest.mwoodson.openshift.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/localhost?timeout=10s: dial tcp 10.0.0.2:6443: connect: connection refused
Mar 23 16:11:55 localhost hyperkube[9453]: F0323 16:11:55.128852    9453 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Failed with result 'exit-code'.
Mar 23 16:11:55 localhost systemd[1]: kubelet.service: Consumed 399ms CPU time
----------------------------------------------------------------

Comment 9 Rafael Azevedo 2020-03-23 18:50:03 UTC
logs from installation through hive: 

Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.039697    2030 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:47 localhost hyperkube[2030]: I0323 16:11:47.039701    2030 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.139992    2030 kubelet.go:2275] node "localhost" not found
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.148081    2030 controller.go:220] failed to get node "localhost" when trying to set owner ref to the node lease: the server could not find the requested resource (get nodes localhost)
Mar 23 16:11:47 localhost hyperkube[2030]: E0323 16:11:47.148876    2030 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: the server could not find the requested resource (post leases.coordination.k8s.io)
Mar 23 16:11:47 localhost hyperkube[2030]: F0323 16:11:47.213602    2030 kubelet.go:1375] Kubelet failed to get node info: failed to get instance ID from cloud provider: instance not found
Mar 23 16:11:47 localhost systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a

attaching the full dump of journal

Comment 14 Ben Howard 2020-04-14 16:37:11 UTC
This is new:
Mar 23 19:24:30 localhost dbus-daemon[1844]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Mar 23 19:24:30 localhost NetworkManager[1852]: <info>  [1584991470.7793] device (ens4): Activation: successful, device activated.
Mar 23 19:24:30 localhost NetworkManager[1852]: <info>  [1584991470.7802] manager: NetworkManager state is now CONNECTED_GLOBAL
Mar 23 19:24:30 localhost NetworkManager[1852]: <info>  [1584991470.7816] manager: startup complete
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn>  [1584991470.7821] hostname: couldn't set the system hostname to 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal' using hostnamed: GDBus.Error:org.freedesktop.DBus.Error.InvalidArgs: Invalid hostname 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal'
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn>  [1584991470.7821] policy: set-hostname: couldn't set the system hostname to 'mw-gfrr7-m-0.us-east1-b.c.osd-cd2c992ca0e7a7c574ec3e7226.internal': (1) Operation not permitted
Mar 23 19:24:30 localhost NetworkManager[1852]: <warn>  [1584991470.7821] policy: set-hostname: you should use hostnamed when systemd hardening is in effect!


This is a bug with the MCO script, which uses `hostname` instead of `hostnamectl`

Comment 17 Scott Dodson 2020-04-14 17:11:12 UTC
(In reply to Ben Howard from comment #16)
> PR's for cherry-picks back to 4.3 and 4.4
> https://github.com/openshift/machine-config-operator/pull/1650
> https://github.com/openshift/machine-config-operator/pull/1651

One bug per release. Since this targets 4.5 we should link this to the PR which fixed this on master branch and move this to MODIFIED. We'd then clone this bug to 4.4, clone that bug to 4.3 and link the relevant PRs to each of those bugs.

Comment 18 Ben Howard 2020-04-14 17:35:38 UTC
Thank you Scott. Retargetted to 4.5.

Comment 22 Micah Abbott 2020-04-15 18:32:30 UTC
The linked MCO PR (https://github.com/openshift/machine-config-operator/pull/1622) is for bare-metal hosts only and does not apply to the GCP case.  Moving back to ASSIGNED and dropping from the errata.

Comment 27 Ben Howard 2020-05-04 17:36:50 UTC
Submitted PR https://github.com/openshift/machine-config-operator/pull/1711. 

The messages about `policy: set-hostname: you should use hostnamed when systemd hardening is in effect!` are a complete red-herring.

Comment 31 Michael Nguyen 2020-05-19 15:00:13 UTC
Verified on RHCOS 45.81.202005181029-0 which is a part of 4.5.0-0.nightly-2020-05-19-021142.

I was unable to verify on an OpenShift instance because the current project names are not long enough to trigger this issue.  I verified instead on an RHCOS 45.81.202005181029-0 instance on GCP with a long host name and using ignition to add /etc/NetworkManager/dispatcher.d/90-long-hostname.  Without the dispatcher script RHCOS will reboot and get a hostname of localhost.  With the dispatcher script RHCOS will reboot and have a set hostname.

{
    "ignition": {"version": "2.2.0"},
    "storage": {
    "files": [{
      "filesystem": "root",
      "path": "/etc/sudoers",
      "append": true,
      "contents": { "source": "data:,core%20ALL=(ALL)%20NOPASSWD:%20ALL" }
    },
    {
      "filesystem": "root",
      "path": "/etc/NetworkManager/dispatcher.d/90-long-hostname",
      "mode": 493,
      "contents": { "source": "data:;base64,IyEvYmluL2Jhc2gKIwojIE9uIEdvb2dsZSBDb21wdXRlIFBsYXRmb3JtIChHQ1ApIHRoZSBob3N0bmFtZSBtYXkgYmUgdG9vIGxvbmcgKD42MyBjaGFycykuCiMgRHVyaW5nIGZpcnN0Ym9vdCB0aGUgaG9zdG5hbWUgaXMgc2V0IGluIHRoZSBpbml0cmFtZnMgYmVmb3JlIE5ldHdvcmtNYW5hZ2VyCiMgcnVuczsgb24gcmVib290IGFmZmVjdCBub2RlcyB1c2UgJ2xvY2FsaG9zdCcuIFRoaXMgaG9vayBpcyBhIHNpbXBsZSB3b3JrCiMgYXJvdW5kOiBpZiB0aGUgaG9zdCBuYW1lIGlzIGxvbmdlciB0aGFuIDYzIGNoYXJhY3RlcnMsIHRoZW4gdGhlIGhvc3RuYW1lCiMgaXMgdHJ1bmNhdGVkIGF0IHRoZSBfZmlyc3RfIGRvdC4KIwojIEFkZGl0aW9uYWxseSwgdGhpcyBob29rIGRvZXMgbm90IGJyZWFrIEROUyBvciBjbHVzdGVyIEROUyByZXNvbHV0aW9uLAojIHNpbmNlIE5ldHdvcmtNYW5hZ2VyIHNldHMgdGhlIGFwcHJvcHJpYXRlIC9ldGMvcmVzb2x2LmNvbmYgc2V0dGluZ3MuCklGPSQxClNUQVRVUz0kMgpsb2coKSB7IGxvZ2dlciAtLXRhZyAibmV0d29yay1tYW5hZ2VyLyQoYmFzZW5hbWUgJDApIiAiJHtAfSI7IH0KIyBjYXB0dXJlIGFsbCBlbGlnaWJsZSBob3N0bmFtZXMKaWYgW1sgISAiJCgvYmluL2hvc3RuYW1lKSIgPX4gKGxvY2FsaG9zdHxsb2NhbGhvc3QubG9jYWwpIF1dOyB0aGVuCiAgICBsb2cgImhvc3RuYW1lIGlzIGFscmVhZHkgc2V0IgogICAgZXhpdCAwCmZpCmlmIFtbICEgIiRTVEFUVVMiID1+ICh1cHxob3N0bmFtZXxkaGNwNC1jaGFuZ2V8ZGhjcDYtY2hhbmdlKSBdXTsgdGhlbgogICAgZXhpdCAwCmZpCmRlZmF1bHRfaG9zdD0iJHtESENQNF9IT1NUX05BTUU6LSRESENQNl9IT1NUX05BTUV9IgojIHRydW5jYXRlIHRoZSBob3N0bmFtZSB0byB0aGUgZmlyc3QgZG90IGFuZCB0aGFuIDY0IGNoYXJhY3RlcnMuCmhvc3Q9JChlY2hvICR7ZGVmYXVsdF9ob3N0fSB8IGN1dCAtZjEgLWQnLicgfCBjdXQgLWMgLTYzKQppZiBbICIkeyNkZWZhdWx0X2hvc3R9IiAtZ3QgNjMgXTsgdGhlbgogICAgbG9nICJkaXNjb3ZlcmVkIGhvc3RuYW1lIGlzIGxvbmdlciB0aGFuIHRoYW4gNjMgY2hhcmFjdGVycyIKICAgIGxvZyAidHJ1bmNhdGluZyAke2RlZmF1bHRfaG9zdH0gPT4gJHtob3N0fSIKICAgIC9iaW4vaG9zdG5hbWVjdGwgLS10cmFuc2llbnQgc2V0LWhvc3RuYW1lICIke2hvc3R9IgpmaQo=" }
    }]},
    "passwd": {
      "users": [
        {
          "name": "core",
          "sshAuthorizedKeys": [
            "ssh-rsa AAA--SNIP--"
          ]
        }
      ]
    }
}

Comment 34 Ben Howard 2020-06-10 16:58:02 UTC
*** Bug 1838625 has been marked as a duplicate of this bug. ***

Comment 37 errata-xmlrpc 2020-07-13 17:17:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409


Note You need to log in before you can comment on or make changes to this bug.