Description of problem:
When attempting to install OCP 4.5 on OpenStack with IPI using a cluster name that contains a period (e.g. ocp-4.5), the bootstrapping stage of the installer fails with the following message from the installer:

E0714 09:20:34.017695 106955 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.ocp4.5.dynamic.quarkus:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=1&timeoutSeconds=479&watch=true: dial tcp 10.0.103.126:6443: connect: connection refused

A suspect message recurs in the control plane kubelet logs:

Jul 13 08:43:25 ocp-4-5-6545h-master-0 hyperkube[1581]: E0703 08:43:25.355407 1581 kubelet.go:2285] node "ocp-4-5-6545h-master-0" not found

How reproducible:
Always

Steps to Reproduce:
1. ./openshift-install create cluster with cluster name containing period

Actual results:
level=debug msg="OpenShift Installer 4.5.0"
level=debug msg="Built from commit b714ff8ee8845b86da2dafce8fa5630ef1806f3b"
level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.ocp4.5.dynamic.quarkus:6443..."
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ocp4.5.dynamic.quarkus:6443/version?timeout=32s: dial tcp 10.0.103.126:6443: connect: no route to host"
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ocp4.5.dynamic.quarkus:6443/version?timeout=32s: dial tcp 10.0.103.126:6443: connect: no route to host"
level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ocp4.5.dynamic.quarkus:6443/version?timeout=32s: dial tcp 10.0.103.126:6443: connect: connection refused"
level=info msg="API v1.18.3+3415b61 up"
level=info msg="Waiting up to 40m0s for bootstrapping to complete..."
E0714 08:47:50.061389 106955 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.ocp4.5.dynamic.quarkus:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=1&timeoutSeconds=576&watch=true: dial tcp 10.0.103.126:6443: connect: connection refused

Expected results:
Successful installation of the cluster

Additional info:
redacted install-config.yaml:

    apiVersion: v1
    baseDomain: dynamic.quarkus
    compute:
    - name: worker
      platform:
        openstack:
          type: ci.m1.large
      replicas: 3
    controlPlane:
      name: master
      replicas: 3
    metadata:
      name: ocp4.5
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineCIDR: 172.110.0.0/16
      machineNetwork:
      - cidr: 172.110.0.0/16
      networkType: OpenShiftSDN
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      openstack:
        cloud: openstack
        computeFlavor: ci.m1.xlarge
        externalNetwork: <redacted>
        lbFloatingIP: 10.0.103.126
        octaviaSupport: false
        region: regionOne
        trunkSupport: true
    pullSecret: <redacted>
    sshKey: <redacted>
This is not urgent, and I believe the outcome will simply be to validate that the cluster name doesn't contain a dot. For now, choose a cluster name without a dot as a workaround.
Moving to openstack to fix the validation for openstack
Is this really just an OpenStack problem, though? It feels like the validation should be common to all platforms.
I suppose that's because the resulting hostname for the nodes wouldn't be valid. In which case, we should match against the [a-z0-9-]+ regex. Possibly related to https://github.com/openshift/installer/pull/3900
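For illustration, the regex suggested above could be applied roughly as follows. This is only a sketch: `validateClusterName` is a hypothetical helper, not the installer's actual validation function, and the anchors `^` and `$` are added here on the assumption that the whole name must match.

```go
package main

import (
	"fmt"
	"regexp"
)

// clusterNameRE is the pattern from the comment above, anchored so that the
// entire name has to match (the anchoring is an assumption of this sketch).
var clusterNameRE = regexp.MustCompile(`^[a-z0-9-]+$`)

// validateClusterName is a hypothetical helper; it is not the function used
// by openshift-install. It rejects any name with characters outside
// lowercase letters, digits and hyphens, which includes the dot.
func validateClusterName(name string) error {
	if !clusterNameRE.MatchString(name) {
		return fmt.Errorf("cluster name %q must match %s", name, clusterNameRE.String())
	}
	return nil
}

func main() {
	// "ocp4.5" is the name from this report; the other two are dot-free variants.
	for _, name := range []string{"ocp45", "ocp-4-5", "ocp4.5"} {
		fmt.Printf("%s: %v\n", name, validateClusterName(name))
	}
}
```

With this pattern, only `ocp4.5` from the three sample names is rejected. Note that the pattern is stricter than a dot check alone: it also rejects uppercase letters and underscores.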
@Martin, yep, this is specific to OpenStack. By default in the installer it is allowed to have dots in the cluster name https://github.com/openshift/installer/blob/832a6b5d31641ee99501e4fb5b6bc9acf8188741/pkg/validate/validate_test.go#L29
Lowering the severity as there is an easy workaround. Postponing to an upcoming sprint.
Could this at least be documented ASAP? The installer fails with a very unhelpful message, and none of the node logs really hint at what's going on, which makes it very hard for the end user to find the underlying cause.
(In reply to Mike Fedosin from comment #6)
> @Martin, yep, this is specific to OpenStack. By default in the installer it
> is allowed to have dots in the cluster name
> https://github.com/openshift/installer/blob/832a6b5d31641ee99501e4fb5b6bc9acf8188741/pkg/validate/validate_test.go#L29

That assumes the unit test actually reflects what the installer supports. It would be good to understand a bit better what is going on for OpenStack. Let's merge the OpenStack check and leave the issue open while we investigate the root cause.
The problem seems to be related to the keepalived configuration file when the cluster name contains a dot. On the bootstrap node, the VRRP instance name correctly contains the full cluster name, while on the master nodes it is truncated at the dot, so the nodes end up in separate VRRP groups.

On bootstrap node:
VRRP_Instance(m.andre_API) Transition to MASTER STATE

On master node:
VRRP_Instance(m_API) Transition to MASTER STATE
It turns out we pass a `--cluster-config` argument to runtimecfg when rendering the keepalived config file for the bootstrap node, but not for the master nodes. runtimecfg first tries to get the cluster name from cluster-config.yml; otherwise it calls the GetKubeconfigClusterNameAndDomain() function, which splits on the dot.
https://github.com/openshift/baremetal-runtimecfg/blob/d5fd996/pkg/config/node.go#L247-L255

If there is no reliable way of getting the cluster name, we should forbid the dot in the cluster name for all on-prem platforms, like we did for OpenStack in https://github.com/openshift/installer/pull/3934
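To make the failure mode concrete, here is a minimal sketch of a hostname-based fallback that splits on the dot. It is not the code from baremetal-runtimecfg: `splitAPIHost` is a hypothetical stand-in, and it assumes the API host has the form api.<cluster name>.<base domain> and that the VRRP instance is named `<cluster name>_API`, as the keepalived log lines suggest.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAPIHost is a hypothetical stand-in for a fallback that derives the
// cluster name from the API hostname. It assumes the host looks like
// api.<cluster name>.<base domain> and cuts at the first two dots, so it
// cannot tell a dot inside the cluster name from the name/domain separator.
func splitAPIHost(host string) (clusterName, domain string, err error) {
	parts := strings.SplitN(host, ".", 3)
	if len(parts) < 3 || parts[0] != "api" {
		return "", "", fmt.Errorf("unexpected API host %q", host)
	}
	return parts[1], parts[2], nil
}

func main() {
	// Host name taken from the installer logs in this report.
	name, domain, err := splitAPIHost("api.ocp4.5.dynamic.quarkus")
	if err != nil {
		panic(err)
	}
	fmt.Printf("fallback: name=%q domain=%q\n", name, domain)

	// Assumed naming scheme for the VRRP instance: <cluster name>_API.
	fmt.Printf("instance from cluster config: %s_API\n", "ocp4.5")
	fmt.Printf("instance from fallback:       %s_API\n", name)
}
```

For the host in this report, the sketch yields the name `ocp4` and the domain `5.dynamic.quarkus`, so a node using the fallback would compute a different instance name than a node that reads `ocp4.5` from the cluster config.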
*** Bug 1859290 has been marked as a duplicate of this bug. ***
Verified on 4.6.0-0.nightly-2020-08-13-091737

08-13 14:20:12 level=fatal msg="failed to fetch Metadata: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"mrnd-134.6\": cluster name can't contain \".\" character"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196