Bug 1872885 - infra nodes in GCP don't join the cluster when the hostname is 64 characters long
Summary: infra nodes in GCP don't join the cluster when the hostname is 64 characters ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Howard
QA Contact: Micah Abbott
URL:
Whiteboard: coreos
Duplicates: 1878065 1879614 (view as bug list)
Depends On:
Blocks: 1879614
Reported: 2020-08-26 20:03 UTC by Alex Chvatal
Modified: 2020-11-18 17:09 UTC
CC: 16 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1879614 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:34:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
journals for an affected infra node (2.20 MB, application/gzip)
2020-09-29 16:22 UTC, Alex Chvatal
install log from ocm (173.37 KB, text/plain)
2020-09-29 16:23 UTC, Alex Chvatal


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1938 0 None closed [release 4.4] Bug 1855878: cherry-pick hostname fixes 2021-02-09 15:43:00 UTC
Github openshift machine-config-operator pull 2084 0 None closed Bug 1872885: always check hostname char count 2021-02-09 15:42:59 UTC
Github openshift machine-config-operator pull 2132 0 None closed Bug 1872885: add template for NM to not manage hostname on GCP 2021-02-09 15:43:00 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:34:51 UTC

Description Alex Chvatal 2020-08-26 20:03:09 UTC
A node with a 64-character hostname will not be able to join the cluster.

e.g., on GCP, a cluster name 11 characters long combined with a region name 8 characters long yields infra nodes with hostnames that are exactly 64 characters long.

The infra nodes are created and boot, but they are unable to join the cluster, which causes a whole slew of downstream problems.
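The arithmetic is easy to check against a real node name; the FQDN below is one of the affected infra-node hostnames shown later in this report:

```shell
#!/bin/bash
# Length check for a GCP infra-node FQDN taken from this report:
# an 11-char cluster name plus the 8-char us-east1 region pushes the
# full internal name to 64 characters, one past the 63-character
# boundary discussed in this bug.
fqdn="achvbugtest-hhc4s-infra-b-wstwl.us-east1-b.c.o-4517451f.internal"
echo "${fqdn} is ${#fqdn} characters"   # prints 64
```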

Comment 1 Alex Chvatal 2020-08-26 20:07:06 UTC
Specifically, we're seeing this on 4.4.16

Comment 2 Abhinav Dahiya 2020-08-26 20:38:07 UTC
The node name on GCP is derived from the hostname. Moving to RHCOS, as they previously put fixes in place for this edge case.

Comment 3 Micah Abbott 2020-08-27 14:53:40 UTC
We included the hostname fixes as part of 4.4 with https://github.com/openshift/machine-config-operator/pull/1938; merged on Aug 6

4.4.16 was built on Aug 5, so it missed that commit.

4.4.17 should have the fix included (built on Aug 12)

@achvatal could you retry your configuration with 4.4.17 (or newer) and see if it is improved?

Comment 4 Alex Chvatal 2020-08-31 19:53:00 UTC
Yeah, it looks like this problem still exists in 4.4.17:

[achvatal@xoth ~]$ oc version
Client Version: openshift-clients-4.3.0-201910250623-88-g6a937dfe
Server Version: 4.4.17
Kubernetes Version: v1.17.1+20ba474

[achvatal@xoth ~]$ oc get node
NAME                                                          STATUS   ROLES    AGE     VERSION
achvbugtest-hhc4s-master-0.us-east1-b.c.o-4517451f.internal   Ready    master   6h55m   v1.17.1+20ba474
achvbugtest-hhc4s-master-1.us-east1-c.c.o-4517451f.internal   Ready    master   6h55m   v1.17.1+20ba474
achvbugtest-hhc4s-master-2.us-east1-d.c.o-4517451f.internal   Ready    master   6h54m   v1.17.1+20ba474
achvbugtest-hhc4s-worker-b-2nh82                              Ready    worker   6h39m   v1.17.1+20ba474
achvbugtest-hhc4s-worker-b-t2q5x                              Ready    worker   6h39m   v1.17.1+20ba474
achvbugtest-hhc4s-worker-c-n6jjf                              Ready    worker   6h39m   v1.17.1+20ba474
achvbugtest-hhc4s-worker-d-jhzb7                              Ready    worker   6h39m   v1.17.1+20ba474

[achvatal@xoth ~]$ oc get machine -n openshift-machine-api
NAME                               PHASE         TYPE             REGION     ZONE         AGE
achvbugtest-hhc4s-infra-b-wstwl    Provisioned   custom-4-16384   us-east1   us-east1-b   6h28m
achvbugtest-hhc4s-infra-c-zltqf    Provisioned   custom-4-16384   us-east1   us-east1-c   6h28m
achvbugtest-hhc4s-infra-d-5mnrx    Provisioned   custom-4-16384   us-east1   us-east1-d   6h28m
achvbugtest-hhc4s-master-0         Running       custom-4-16384   us-east1   us-east1-b   6h55m
achvbugtest-hhc4s-master-1         Running       custom-4-16384   us-east1   us-east1-c   6h55m
achvbugtest-hhc4s-master-2         Running       custom-4-16384   us-east1   us-east1-d   6h55m
achvbugtest-hhc4s-worker-b-2nh82   Running       custom-4-16384   us-east1   us-east1-b   6h41m
achvbugtest-hhc4s-worker-b-t2q5x   Running       custom-4-16384   us-east1   us-east1-b   6h41m
achvbugtest-hhc4s-worker-c-n6jjf   Running       custom-4-16384   us-east1   us-east1-c   6h41m
achvbugtest-hhc4s-worker-d-jhzb7   Running       custom-4-16384   us-east1   us-east1-d   6h41m


I'll spin up a 4.5.x cluster tomorrow and see if I can replicate this on that version.

Comment 5 Micah Abbott 2020-08-31 20:10:23 UTC
@Ben can you do some triage to make sure this is properly handled in 4.4.z?

Comment 6 Derrick Ornelas 2020-09-04 22:10:30 UTC
Micah, Ben, is this issue different than Bug 1853584?  Just wondering because if they are the same then should this bug have a Target Release of 4.4.z instead of being deferred to 4.7?

Comment 7 Micah Abbott 2020-09-06 13:42:56 UTC
(In reply to Derrick Ornelas from comment #6)
> Micah, Ben, is this issue different than Bug 1853584?  Just wondering
> because if they are the same then should this bug have a Target Release of
> 4.4.z instead of being deferred to 4.7?

It looks very similar, but I am waiting for additional triage to make that call.

Comment 8 Yang Yang 2020-09-07 08:35:36 UTC
I'm hitting it on 4.6.0-0.nightly-2020-09-05-015624. The worker node complains: node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found

$ hostname
auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal


● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-mco-default-env.conf
   Active: active (running) since Mon 2020-09-07 08:32:33 UTC; 20s ago
  Process: 47159 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
  Process: 47157 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
 Main PID: 47161 (kubelet)
    Tasks: 14 (limit: 95351)
   Memory: 43.4M
      CPU: 1.119s
   CGroup: /system.slice/kubelet.service
           └─47161 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=rhcos --minimum-container-ttl-duration=6m0s --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-provider=gce --cloud-config=/etc/kubernetes/cloud.conf --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 --v=4

Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: I0907 08:32:53.297021   47161 kubelet.go:1894] SyncLoop (housekeeping)
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.297575   47161 controller.go:228] failed to get node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" when trying to set owner ref to the node lease: nodes "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.309840   47161 kubelet.go:2170] node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.410105   47161 kubelet.go:2170] node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: I0907 08:32:53.413214   47161 eviction_manager.go:243] eviction manager: synchronize housekeeping
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.413302   47161 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: I0907 08:32:53.415646   47161 kubelet.go:2087] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.415693   47161 kubelet.go:2090] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.510359   47161 kubelet.go:2170] node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.610620   47161 kubelet.go:2170] node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found

Comment 9 Ben Howard 2020-09-10 15:50:08 UTC
The 64-character hostname issue was fixed previously. To triage this further, I need at least:
- a journal showing the boot
- a must-gather
- the versions of OCP and the RHCOS boot image

Requesting more information from the reporter

Comment 10 Ben Howard 2020-09-10 15:51:22 UTC
Yang Yang: the error you are hitting is a different issue:
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: I0907 08:32:53.415646   47161 kubelet.go:2087] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

This would indicate a problem with cluster networking, not with hostname assignment.

Comment 11 Ben Howard 2020-09-14 18:29:37 UTC
Oh, this is fun. Previously, `hostnamectl` refused to give the host a name longer than 63 characters. Now it allows it, which bypasses the fix.

Thank you for the journal. 

The problem is that NetworkManager sets the hostname before the length check happens. When the check runs, it sees that the name is already set and no-ops.
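For context, the length guard in question is a dispatcher-style script. The sketch below is illustrative only; the real logic lives in the MCO's set-valid-hostname machinery and the function name here is an assumption:

```shell
#!/bin/bash
# Illustrative sketch of a dispatcher-style hostname length guard.
# The real implementation is the MCO's set-valid-hostname script; this
# only shows the truncate-at-first-dot idea it relies on.

truncate_hostname() {
    local name="$1"
    if [ "${#name}" -gt 63 ]; then
        # Keep only the first label (the instance name before the domain),
        # so the node name stays within the 63-character limit.
        name="${name%%.*}"
    fi
    echo "$name"
}

truncate_hostname "achvbugtest-hhc4s-infra-b-wstwl.us-east1-b.c.o-4517451f.internal"
# prints: achvbugtest-hhc4s-infra-b-wstwl
```

The race described above means this guard never fires when something else has already set the (too-long) hostname first.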

Comment 13 Ben Howard 2020-09-14 19:33:17 UTC
*** Bug 1878065 has been marked as a duplicate of this bug. ***

Comment 14 Micah Abbott 2020-09-16 20:11:09 UTC
*** Bug 1879614 has been marked as a duplicate of this bug. ***

Comment 16 Micah Abbott 2020-09-22 20:59:17 UTC
@Alex Are you able to verify this fix with a 4.6 cluster?

Comment 17 Alex Chvatal 2020-09-23 14:34:44 UTC
Looks like it's still present in 4.6.0-0.nightly-2020-09-23-022756:

[achvatal@xoth ~]$ oc get node
NAME                                                          STATUS                     ROLES    AGE   VERSION
achvbugtest-thptp-master-0.us-east1-b.c.o-9cd50532.internal   Ready                      master   69m   v1.19.0+8a39924
achvbugtest-thptp-master-1.us-east1-c.c.o-9cd50532.internal   Ready                      master   69m   v1.19.0+8a39924
achvbugtest-thptp-master-2.us-east1-d.c.o-9cd50532.internal   Ready                      master   69m   v1.19.0+8a39924
achvbugtest-thptp-worker-b-7wbd6                              Ready                      worker   43m   v1.19.0+8a39924
achvbugtest-thptp-worker-b-tqvfl                              Ready                      worker   43m   v1.19.0+8a39924
achvbugtest-thptp-worker-c-mnmzp                              Ready,SchedulingDisabled   worker   43m   v1.19.0+8a39924
achvbugtest-thptp-worker-d-tdvcq                              Ready                      worker   43m   v1.19.0+8a39924

[achvatal@xoth ~]$ oc get machine -n openshift-machine-api
NAME                               PHASE         TYPE             REGION     ZONE         AGE
achvbugtest-thptp-infra-b-sqbv7    Provisioned   custom-4-16384   us-east1   us-east1-b   33m
achvbugtest-thptp-infra-c-kd5lz    Provisioned   custom-4-16384   us-east1   us-east1-c   33m
achvbugtest-thptp-infra-d-7dtjn    Provisioned   custom-4-16384   us-east1   us-east1-d   33m
achvbugtest-thptp-master-0         Running       custom-4-16384   us-east1   us-east1-b   74m
achvbugtest-thptp-master-1         Running       custom-4-16384   us-east1   us-east1-c   74m
achvbugtest-thptp-master-2         Running       custom-4-16384   us-east1   us-east1-d   74m
achvbugtest-thptp-worker-b-7wbd6   Running       custom-4-16384   us-east1   us-east1-b   46m
achvbugtest-thptp-worker-b-tqvfl   Running       custom-4-16384   us-east1   us-east1-b   46m
achvbugtest-thptp-worker-c-mnmzp   Running       custom-4-16384   us-east1   us-east1-c   46m
achvbugtest-thptp-worker-d-tdvcq   Running       custom-4-16384   us-east1   us-east1-d   46m

[achvatal@xoth ~]$ oc version
Client Version: openshift-clients-4.3.0-201910250623-88-g6a937dfe
Server Version: 4.6.0-0.nightly-2020-09-23-022756
Kubernetes Version: v1.19.0+8a39924

Comment 20 To Hung Sze 2020-09-28 00:37:08 UTC
Is this fix supposed to allow a cluster with a 19-character (or longer) name to finish installation?
I am still not able to get a cluster with a 19-character name to finish installing:
cluster name: tszelongname-123456
Version: 4.6.0-0.nightly-2020-09-25-150713

Comment 23 To Hung Sze 2020-09-28 15:49:06 UTC
I don't know if it is related, but clusters with long names fail installation:
https://bugzilla.redhat.com/show_bug.cgi?id=1843722

Comment 24 Alex Chvatal 2020-09-29 16:22:14 UTC
Created attachment 1717584 [details]
journals for an affected infra node

Comment 25 Alex Chvatal 2020-09-29 16:23:01 UTC
Created attachment 1717585 [details]
install log from ocm

Not sure if the install log will help, but I doubt it will hurt.

Comment 28 Ben Howard 2020-09-30 19:58:06 UTC
Okay, now NetworkManager is changing the hostname back. 

The journal shows:
- the long hostname fix is applied
- NetworkManager detects the change
- NetworkManager sets the hostname back to the long hostname

Evaluating potential fixes.

Comment 29 Colin Walters 2020-09-30 20:43:57 UTC
Technically fixes for this land in the MCO, though the RHCOS team helps out with the code.

I think we need a test for this for the next try.

Ideas:
- Change the MCO's e2e-gcp run to always exercise this case (patch to openshift/release)
- Change *all* GCP runs in OpenShift core CI to exercise this (patch to openshift/release)
- Extend the MCO's e2e-gcp to create a custom MachineSet that triggers this

Comment 30 Colin Walters 2020-09-30 21:16:26 UTC
I get confused by this so I'm trying to gather prior code/discussions to understand:

A lot of the original code here landed in
https://github.com/openshift/machine-config-operator/pull/1914
(I had a question there around whether this is GCP specific:
 https://github.com/openshift/machine-config-operator/pull/1914#issuecomment-657258556
 would like to know that)

There's a "source of truth" question here around the hostname - from Ben's analysis it sounds
like NM and set-valid-hostname are fighting.  If that's indeed the problem, I don't
see how we can do anything other than tell NM not to set the hostname, and we manually
extract it from DHCP and set it ourselves.

From https://developer.gnome.org/NetworkManager/stable/NetworkManager.conf.html
we can set `hostname-mode = none`.
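A drop-in along these lines would do it (`hostname-mode` is a documented NetworkManager.conf option; the file name is illustrative):

```
# /etc/NetworkManager/conf.d/90-hostname-mode.conf  (file name illustrative)
[main]
# With hostname-mode=none, NM neither sets the transient hostname from
# DHCP/DNS nor reacts to external hostname changes, leaving hostname
# management entirely to the dispatcher script.
hostname-mode=none
```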

Comment 31 Colin Walters 2020-09-30 21:25:46 UTC
Ultimately though, don't we want something like:

```
if ignition.platform.id = "gcp" {
  writefile("/etc/NetworkManager/conf.d/mco-gcp.conf", "hostname-truncation = firstdot")
}
```

The idea here is that the "hostname-truncation" algorithm must be shared with the network provider, since it depends on the provider's DNS server using the same algorithm, right?

It may be somewhat risky to scope this to *just* GCP at this point; maybe some bare-metal user is somehow relying on the current behavior. Perhaps we can gather
some telemetry on whether the long-hostname case triggers elsewhere. It's understandable why they're doing what they're doing, but it violates networking standards
and I don't think we should encourage it.

Comment 32 Ben Howard 2020-09-30 23:13:45 UTC
@Walters is correct.

What's happening is that NM calls the dispatcher (which sets the transient hostname), and then NM sees that the hostname changed and sets it back to the hostname that NM expects.

When the hostname is longer than 65 characters, NM fails to set the hostname. However, if the hostname falls in the cluster-invalid but Linux-valid zone of 63-65 characters, then the dispatcher script is bypassed.

Filed https://github.com/openshift/machine-config-operator/pull/2132 which tells NM to allow the dispatcher script to set the hostname on GCP. 

I scoped this to GCP only for safety, since it's the only platform...but I really don't like doing this.
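Shipping a per-platform drop-in like this typically happens through an MCO template rendered into a MachineConfig. The shape below is an illustrative sketch, not the actual content of PR 2132; the object name and file contents are assumptions:

```yaml
# Illustrative only: how a platform-scoped NM drop-in could be delivered
# via a MachineConfig. Not the actual template from PR 2132.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-gcp-hostname   # name illustrative
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/NetworkManager/conf.d/90-hostname-mode.conf
          mode: 0644
          contents:
            # URL-encoded "[main]\nhostname-mode=none\n"
            source: data:,%5Bmain%5D%0Ahostname-mode%3Dnone%0A
```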

Comment 35 Alex Chvatal 2020-10-08 20:00:17 UTC
I was able to spin up a 4.6 cluster in int that successfully started infra nodes with 4.6.0-0.nightly-2020-10-08-081226.

example infra node hostname: achvbugtest-hdbdw-infra-b-jtmr5.us-east1-b.c.o-a07db612.internal (exactly 64 characters)

Comment 36 Micah Abbott 2020-10-09 13:23:52 UTC
Marking verified based on comment #35 and additional testing done by the RHCOS/MCO QE folks.

Comment 38 errata-xmlrpc 2020-10-27 16:34:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 39 Ricardo Noriega 2020-11-18 17:09:22 UTC
(In reply to Alex Chvatal from comment #4)
> Yeah, it looks like this problem still exists in 4.4.17:
> [...]

Hello, 

I've hit this issue today with latest-4.5. Are we targeting to fix it in 4.5 or only in 4.6?

