A node with a 64-character hostname will not be able to join the cluster. E.g., when using GCP, a cluster name 11 characters long combined with a region 8 characters long yields infra nodes with hostnames that are 64 characters long. The infra nodes will be created and launch, but they will be unable to join the cluster, causing a whole slew of problems.
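For a concrete sense of the arithmetic: GCP internal hostnames take the form <machine-name>.<zone>.c.<project>.internal, so an 11-character cluster name lands exactly on the limit. Using one of the infra hostnames reported below:

```
$ echo -n "achvbugtest-hhc4s-infra-b-wstwl.us-east1-b.c.o-4517451f.internal" | wc -c
64
```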
Specifically, we're seeing this on 4.4.16.
The node name on GCP is derived from the hostname. Moving this to the RHCOS component, as they had previous fixes in place for this edge case.
We included the hostname fixes as part of 4.4 with https://github.com/openshift/machine-config-operator/pull/1938, merged on Aug 6. 4.4.16 was built on Aug 5, so it missed that commit. 4.4.17 (built on Aug 12) should include the fix. @achvatal, could you retry your configuration with 4.4.17 (or newer) and see if it is improved?
Yeah, it looks like this problem still exists in 4.4.17:

[achvatal@xoth ~]$ oc version
Client Version: openshift-clients-4.3.0-201910250623-88-g6a937dfe
Server Version: 4.4.17
Kubernetes Version: v1.17.1+20ba474

[achvatal@xoth ~]$ oc get node
NAME                                                          STATUS   ROLES    AGE     VERSION
achvbugtest-hhc4s-master-0.us-east1-b.c.o-4517451f.internal   Ready    master   6h55m   v1.17.1+20ba474
achvbugtest-hhc4s-master-1.us-east1-c.c.o-4517451f.internal   Ready    master   6h55m   v1.17.1+20ba474
achvbugtest-hhc4s-master-2.us-east1-d.c.o-4517451f.internal   Ready    master   6h54m   v1.17.1+20ba474
achvbugtest-hhc4s-worker-b-2nh82                              Ready    worker   6h39m   v1.17.1+20ba474
achvbugtest-hhc4s-worker-b-t2q5x                              Ready    worker   6h39m   v1.17.1+20ba474
achvbugtest-hhc4s-worker-c-n6jjf                              Ready    worker   6h39m   v1.17.1+20ba474
achvbugtest-hhc4s-worker-d-jhzb7                              Ready    worker   6h39m   v1.17.1+20ba474

[achvatal@xoth ~]$ oc get machine -n openshift-machine-api
NAME                               PHASE         TYPE             REGION     ZONE         AGE
achvbugtest-hhc4s-infra-b-wstwl    Provisioned   custom-4-16384   us-east1   us-east1-b   6h28m
achvbugtest-hhc4s-infra-c-zltqf    Provisioned   custom-4-16384   us-east1   us-east1-c   6h28m
achvbugtest-hhc4s-infra-d-5mnrx    Provisioned   custom-4-16384   us-east1   us-east1-d   6h28m
achvbugtest-hhc4s-master-0         Running       custom-4-16384   us-east1   us-east1-b   6h55m
achvbugtest-hhc4s-master-1         Running       custom-4-16384   us-east1   us-east1-c   6h55m
achvbugtest-hhc4s-master-2         Running       custom-4-16384   us-east1   us-east1-d   6h55m
achvbugtest-hhc4s-worker-b-2nh82   Running       custom-4-16384   us-east1   us-east1-b   6h41m
achvbugtest-hhc4s-worker-b-t2q5x   Running       custom-4-16384   us-east1   us-east1-b   6h41m
achvbugtest-hhc4s-worker-c-n6jjf   Running       custom-4-16384   us-east1   us-east1-c   6h41m
achvbugtest-hhc4s-worker-d-jhzb7   Running       custom-4-16384   us-east1   us-east1-d   6h41m

I'll spin up a 4.5.x cluster tomorrow and see if I can replicate this on that version.
@Ben can you do some triage to make sure this is properly handled in 4.4.z?
Micah, Ben, is this issue different than Bug 1853584? Just wondering because if they are the same then should this bug have a Target Release of 4.4.z instead of being deferred to 4.7?
(In reply to Derrick Ornelas from comment #6)
> Micah, Ben, is this issue different than Bug 1853584? Just wondering
> because if they are the same then should this bug have a Target Release of
> 4.4.z instead of being deferred to 4.7?

It looks very similar, but I am waiting for additional triage to make that call.
I'm facing it on 4.6.0-0.nightly-2020-09-05-015624. The worker node complains:

node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found

$ hostname
auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal

● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-mco-default-env.conf
   Active: active (running) since Mon 2020-09-07 08:32:33 UTC; 20s ago
  Process: 47159 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
  Process: 47157 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
 Main PID: 47161 (kubelet)
    Tasks: 14 (limit: 95351)
   Memory: 43.4M
      CPU: 1.119s
   CGroup: /system.slice/kubelet.service
           └─47161 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=rhcos --minimum-container-ttl-duration=6m0s --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --cloud-provider=gce --cloud-config=/etc/kubernetes/cloud.conf --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0489c92a0845a2330062c4f10e45a5dcdc8b6d7982ea41c5b7e5485960943be9 --v=4

Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: I0907 08:32:53.297021   47161 kubelet.go:1894] SyncLoop (housekeeping)
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.297575   47161 controller.go:228] failed to get node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" when trying to set owner ref to the node lease: nodes "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.309840   47161 kubelet.go:2170] node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.410105   47161 kubelet.go:2170] node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: I0907 08:32:53.413214   47161 eviction_manager.go:243] eviction manager: synchronize housekeeping
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.413302   47161 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: I0907 08:32:53.415646   47161 kubelet.go:2087] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.415693   47161 kubelet.go:2090] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.510359   47161 kubelet.go:2170] node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: E0907 08:32:53.610620   47161 kubelet.go:2170] node "auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal" not found
The 64-character hostname issue was fixed. At the very least I need to see:
- a journal showing the boot
- must-gather
- versions of OCP and the RHCOS boot image

Requesting more information from the reporter.
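For reference, those artifacts can be collected with the standard tooling (nothing bug-specific about these commands):

```
# Full journal for the current boot, from the affected node:
journalctl --boot --no-pager > node-journal.log

# Cluster-wide diagnostics, from a host with cluster-admin access:
oc adm must-gather

# OCP version:
oc get clusterversion

# RHCOS image version of a given node:
oc get node <node-name> -o jsonpath='{.status.nodeInfo.osImage}'
```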
Yang Yang: the error you are hitting is a different issue:

Sep 07 08:32:53 auto-yanyang-935787-kdb26-worker-a-2w8tw.c.openshift-qe.internal hyperkube[47161]: I0907 08:32:53.415646   47161 kubelet.go:2087] Container runtime status: Runtime Conditions: RuntimeReady=true reason: message:, NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

This would indicate a problem with cluster networking, not the hostname assignment.
Oh, this is fun. Previously `hostnamectl` would refuse to set a hostname longer than 63 characters. Now it will, which bypasses the fix. Thank you for the journal.

The problem is that NetworkManager is setting the hostname before the length check happens. When the check runs, it sees that the name is already set and no-ops.
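To illustrate the intent of the length check, here is a hedged sketch of the kind of truncation logic involved; this is not the actual MCO dispatcher script, and the details are illustrative:

```
#!/bin/bash
# Sketch: Kubernetes node names must be valid DNS labels (at most 63
# characters), so when the cloud-provided FQDN is too long, fall back
# to the first dot-separated label. GCP's internal DNS resolves both
# the short name and the FQDN.
current=$(hostname)
if [ "${#current}" -gt 63 ]; then
    # e.g. "foo-infra-b-abcde.us-east1-b.c.project.internal" -> "foo-infra-b-abcde"
    hostnamectl set-hostname --transient "${current%%.*}"
fi
```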
*** Bug 1878065 has been marked as a duplicate of this bug. ***
*** Bug 1879614 has been marked as a duplicate of this bug. ***
@Alex Are you able to verify this fix with a 4.6 cluster?
Looks like it's still present in 4.6.0-0.nightly-2020-09-23-022756:

[achvatal@xoth ~]$ oc get node
NAME                                                          STATUS                     ROLES    AGE   VERSION
achvbugtest-thptp-master-0.us-east1-b.c.o-9cd50532.internal   Ready                      master   69m   v1.19.0+8a39924
achvbugtest-thptp-master-1.us-east1-c.c.o-9cd50532.internal   Ready                      master   69m   v1.19.0+8a39924
achvbugtest-thptp-master-2.us-east1-d.c.o-9cd50532.internal   Ready                      master   69m   v1.19.0+8a39924
achvbugtest-thptp-worker-b-7wbd6                              Ready                      worker   43m   v1.19.0+8a39924
achvbugtest-thptp-worker-b-tqvfl                              Ready                      worker   43m   v1.19.0+8a39924
achvbugtest-thptp-worker-c-mnmzp                              Ready,SchedulingDisabled   worker   43m   v1.19.0+8a39924
achvbugtest-thptp-worker-d-tdvcq                              Ready                      worker   43m   v1.19.0+8a39924

[achvatal@xoth ~]$ oc get machine -n openshift-machine-api
NAME                               PHASE         TYPE             REGION     ZONE         AGE
achvbugtest-thptp-infra-b-sqbv7    Provisioned   custom-4-16384   us-east1   us-east1-b   33m
achvbugtest-thptp-infra-c-kd5lz    Provisioned   custom-4-16384   us-east1   us-east1-c   33m
achvbugtest-thptp-infra-d-7dtjn    Provisioned   custom-4-16384   us-east1   us-east1-d   33m
achvbugtest-thptp-master-0         Running       custom-4-16384   us-east1   us-east1-b   74m
achvbugtest-thptp-master-1         Running       custom-4-16384   us-east1   us-east1-c   74m
achvbugtest-thptp-master-2         Running       custom-4-16384   us-east1   us-east1-d   74m
achvbugtest-thptp-worker-b-7wbd6   Running       custom-4-16384   us-east1   us-east1-b   46m
achvbugtest-thptp-worker-b-tqvfl   Running       custom-4-16384   us-east1   us-east1-b   46m
achvbugtest-thptp-worker-c-mnmzp   Running       custom-4-16384   us-east1   us-east1-c   46m
achvbugtest-thptp-worker-d-tdvcq   Running       custom-4-16384   us-east1   us-east1-d   46m

[achvatal@xoth ~]$ oc version
Client Version: openshift-clients-4.3.0-201910250623-88-g6a937dfe
Server Version: 4.6.0-0.nightly-2020-09-23-022756
Kubernetes Version: v1.19.0+8a39924
Is this fix supposed to allow a cluster with a 19-character (or longer) name to finish installation? I am still not able to get a cluster with a 19-character name to finish installing:

cluster name: tszelongname-123456
Version: 4.6.0-0.nightly-2020-09-25-150713
I don't know if it is related, but a cluster with a long name will fail installation: https://bugzilla.redhat.com/show_bug.cgi?id=1843722
Created attachment 1717584 [details]
journals for an affected infra node
Created attachment 1717585 [details]
install log from ocm

Not sure if the install log will help, but I doubt it will hurt.
Okay, now NetworkManager is changing the hostname back. The journal shows:
- the long hostname fix is applied
- NetworkManager detects the change
- NetworkManager sets the hostname back to the long hostname

Evaluating potential fixes.
Technically, fixes for this land in the MCO, though the RHCOS team helps out with the code. I think we need a test for this for the next try. Ideas:
- Change the MCO's e2e-gcp run to always exercise this case (patch to openshift/release)
- Change *all* GCP runs in OpenShift core CI to exercise this (patch to openshift/release)
- Extend the MCO's e2e-gcp to create a custom machineset that triggers this
I get confused by this, so I'm trying to gather prior code/discussions to understand.

A lot of the original code here landed in https://github.com/openshift/machine-config-operator/pull/1914 (I had a question there about whether this is GCP-specific: https://github.com/openshift/machine-config-operator/pull/1914#issuecomment-657258556; I would like to know that).

There's a "source of truth" question here around the hostname: from Ben's analysis it sounds like NM and set-valid-hostname are fighting. If that's indeed the problem, I don't see how we can do anything other than tell NM not to set the hostname, and we manually extract it from DHCP and set it ourselves. From https://developer.gnome.org/NetworkManager/stable/NetworkManager.conf.html we can set `hostname-mode = none`.
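For the record, that would look something like the following. This is a minimal sketch; the drop-in file name is arbitrary, and `hostname-mode` is the option documented in NetworkManager.conf(5):

```
# Tell NetworkManager to never manage the hostname, so that something
# else (e.g. a dispatcher script) becomes the single source of truth.
cat <<'EOF' > /etc/NetworkManager/conf.d/99-hostname-mode.conf
[main]
hostname-mode=none
EOF
systemctl restart NetworkManager
```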
Ultimately though, don't we want something like:

```
if ignition.platform.id == "gcp" {
    writefile("/etc/NetworkManager/conf.d/mco-gcp.conf", "hostname-truncation = \"firstdot\"")
}
```

The idea here is that the "hostname-truncation" algorithm must be shared with the network provider, since it depends on their DNS server also using the same algorithm, right? It may be somewhat risky to try to scope this to *just* GCP at this point; maybe some bare metal user is somehow relying on this. Perhaps we can try to gather some telemetry around whether long-hostname is triggering elsewhere. It's understandable why they're doing what they are, but it violates networking standards and I don't think we should encourage that.
@Walters is correct. What's happening is that NM calls the dispatcher (which sets the transient hostname), and then NM sees that the hostname changed and sets it back to the hostname that NM expects. When the hostname is longer than 65 characters, NM fails to set the hostname. However, if the hostname is in the cluster-invalid but Linux-valid zone of 63-65 characters, then the dispatcher script is bypassed.

Filed https://github.com/openshift/machine-config-operator/pull/2132, which tells NM to allow the dispatcher script to set the hostname on GCP. I scoped this to only GCP for safety, since it's the only platform...but I really don't like doing this.
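For anyone following along, scoping a change like that to GCP can key off the ignition.platform.id kernel argument that RHCOS boots with. A hypothetical sketch (the actual PR may differ in file name and mechanism):

```
# Only apply the NM hostname workaround on GCP nodes:
if grep -q 'ignition.platform.id=gcp' /proc/cmdline; then
    printf '[main]\nhostname-mode=none\n' \
        > /etc/NetworkManager/conf.d/90-gcp-hostname.conf
fi
```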
I was able to spin up a 4.6 cluster in int that successfully started up infra nodes with 4.6.0-0.nightly-2020-10-08-081226.

Example infra node hostname: achvbugtest-hdbdw-infra-b-jtmr5.us-east1-b.c.o-a07db612.internal (exactly 64 characters)
Marking verified based on comment #35 and additional testing done by the RHCOS/MCO QE folks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
(In reply to Alex Chvatal from comment #4)
> Yeah, it looks like this problem still exists in 4.4.17:
> [quoted oc version / oc get node / oc get machine output trimmed; see comment #4]

Hello, I've hit this issue today with latest-4.5. Are we targeting a fix in 4.5, or only in 4.6?