Description of problem:
When the hostname of a node has a period in the 65th character, it gets truncated at 65 characters and then the hostname ends in a period. This causes kubelet.service to complain about the DNS-1123 compatibility of the spec.Nodename in some pod manifests.

How reproducible:
Always, when the circumstances are met.

Steps to Reproduce:
1. Generate a cluster name that will cause the bootstrap node hostname to have a period at the 65th character. (e.g. 'js-tu-jstuever-123456' in project openshift-dev-installer).
2. Deploy a GCP IPI cluster using this cluster name.

Actual results:
INFO Waiting up to 20m0s for the Kubernetes API at https://api.js-tu-jstuever-123456.installer.gcp.devcluster.openshift.com:6443...
ERROR Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: the server could not find the requested resource (get clusteroperators.config.openshift.io)
INFO Pulling debug logs from the bootstrap machine
INFO Bootstrap gather logs captured here "/home/jstuever/gcp_upi/assetsipi/log-bundle-20200603152251.tar.gz"
FATAL Bootstrap failed to complete: failed waiting for Kubernetes API: the server could not find the requested resource

bootstrap/journals/kubelet.log:
Jun 03 22:03:21 js-tu-jstuever-123456-pss9n-bootstrap.c.openshift-dev-installer. hyperkube[2239]: E0603 22:03:21.999500 2239 file.go:108] Unable to process watch event: can't process config file "/etc/kubernetes/manifests/etcd-member-pod.yaml": invalid pod: [metadata.name: Invalid value: "etcd-bootstrap-member-js-tu-jstuever-123456-pss9n-bootstrap.c.openshift-dev-installer.a DNS-1123 subdomain": must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*') spec.nodeName: Invalid value: "js-tu-jstuever-123456-pss9n-bootstrap.c.openshift-dev-installer.": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]

Expected results:
Cluster bootstrap should be successful.

Additional info:
This was previously masked by a more strict truncation of the cluster name, which was relaxed in https://github.com/openshift/installer/pull/3544.
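The validation failure can be reproduced outside the kubelet. The following is a minimal Python sketch, not the kubelet's own Go code: it applies the regex quoted in the log message to the truncated node name. The helper name `is_dns1123_subdomain` is hypothetical, and the sketch checks only the pattern, not any length limit.

```python
import re

# Regex copied from the kubelet error message in the report.
DNS1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_dns1123_subdomain(name: str) -> bool:
    """Hypothetical helper: True if the whole string matches the regex."""
    return DNS1123_SUBDOMAIN.fullmatch(name) is not None


# Node name as it appears in kubelet.log, ending in a period after truncation.
truncated = "js-tu-jstuever-123456-pss9n-bootstrap.c.openshift-dev-installer."

print(is_dns1123_subdomain(truncated))              # False
print(is_dns1123_subdomain(truncated.rstrip(".")))  # True
```

The only difference between the failing and the passing name is the trailing period, which matches the "must start and end with an alphanumeric character" part of the error.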
For static pods, the kubelet creates the pod name from the `metadata.name` in the static pod manifest plus the node name. Since the max length of pod names is 63 characters, it truncates the result to that length. The truncation should be fixed to prevent this error:

```
metadata.name: Invalid value: "etcd-bootstrap-member-js-tu-jstuever-123456-pss9n-bootstrap.c.openshift-dev-installer.a DNS-1123 subdomain": must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
```
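The naming scheme described in this comment can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the kubelet implementation: the function names `mirror_pod_name` and `truncate_valid` are hypothetical, the `-` separator and the `etcd-bootstrap-member` prefix are inferred from the name shown in the log, and trimming trailing separators after the cut is only one possible way to keep a truncated name valid.

```python
def mirror_pod_name(manifest_name: str, node_name: str) -> str:
    """Join the manifest's metadata.name and the node name (assumed '-' separator)."""
    return f"{manifest_name}-{node_name}"


def truncate_valid(name: str, limit: int) -> str:
    """Hypothetical fix: cut to `limit` characters, then drop trailing '.' or '-'."""
    return name[:limit].rstrip(".-")


# Node name as reported in kubelet.log, with the trailing period.
node = "js-tu-jstuever-123456-pss9n-bootstrap.c.openshift-dev-installer."
name = mirror_pod_name("etcd-bootstrap-member", node)

print(name.endswith("."))        # True: the trailing period is inherited
print(truncate_valid(name, 63))  # ends in an alphanumeric character
```

Whatever limit is applied, the point of the sketch is that a plain cut can leave a `.` or `-` as the last character, so the result needs a check or a trim before it is used as a name.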
I will look into it in the coming sprint.
Moving this to the installer team to make sure that a valid cluster name is generated, one that doesn't cause kubelet validation to fail. It may be worth taking a closer look at the effects of https://github.com/openshift/installer/pull/3544.
The installer is generating valid cluster names. The kubelet is adding the pod name to the hostname and wrongly truncating the result when creating mirror pods for static pods. Moving back to Node.
I used openshift-install-linux-4.6.0-0.nightly-2020-09-18-071428 with cluster name tszegcp91820192char, and the cluster failed to complete the install. Bootstrap finished, but the install failed with this:

INFO Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available
INFO Cluster operator monitoring Available is False with :
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring

The worker's journalctl -u kubelet showed:

Sep 18 20:30:05 tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal hyperkube[20508]: E0918 20:30:05.496933 20508 kubelet_node_status.go:92] Unable to register node "tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal" with API server: Node "tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal" is invalid: metadata.labels: Invalid value: "tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal": must be no more than 63 characters
Sep 18 20:30:05 tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal hyperkube[20508]: E0918 20:30:05.582516 20508 kubelet.go:2190] node "tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal" not found
Sep 18 20:30:05 tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal hyperkube[20508]: E0918 20:30:05.682685 20508 kubelet.go:2190] node "tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal" not found
Sep 18 20:30:05 tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal hyperkube[20508]: E0918 20:30:05.782873 20508 kubelet.go:2190] node "tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal" not found

https://bugzilla.redhat.com/show_bug.cgi?id=1844613 indicates that the name itself is not a problem, hence adding the info here.
There is another defect about long names, in case it's related to this one: https://bugzilla.redhat.com/show_bug.cgi?id=1872885
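The label-length error in the worker's kubelet log can be checked with simple arithmetic. The Python sketch below is only an illustration: the 63-character limit is taken from the error message, the node name is copied from the log, and the helper name `fits_label_value` is hypothetical.

```python
LABEL_VALUE_MAX = 63  # limit stated in the "must be no more than 63 characters" error

# Node name copied from the worker's kubelet log.
node_name = "tszegcp91820192char-psf6h-worker-c-kk5mg.c.openshift-qe.internal"


def fits_label_value(value: str) -> bool:
    """Hypothetical helper: True if the value is within the label length limit."""
    return len(value) <= LABEL_VALUE_MAX


print(len(node_name), fits_label_value(node_name))  # 64 False

# For comparison only: the part before the first dot is well within the limit.
short_name = node_name.split(".")[0]
print(len(short_name), fits_label_value(short_name))  # 40 True
```

The fully qualified node name is one character over the limit, which is consistent with the node failing to register even though the cluster name itself is short.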