Bug 1857158

Summary: OpenStack IPI bootstrap fails when cluster name contains period
Product: OpenShift Container Platform
Reporter: Michal Jurc <mjurc>
Component: Installer
Assignee: Mike Fedosin <mfedosin>
Installer sub component: OpenShift on OpenStack
QA Contact: David Sanz <dsanzmor>
Status: CLOSED ERRATA
Docs Contact:
Severity: low
Priority: low
CC: adahiya, bleanhar, egarcia, m.andre, mfedosin, michal_mazurek, pprinett
Version: 4.5
Target Milestone: ---
Target Release: 4.6.0
Hardware: x86_64
OS: Linux
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: OpenShift cannot parse cluster names containing the '.' character when generating the keepalived configuration; all data after the first '.' is truncated. Consequence: When users tried to provision clusters on OpenStack with '.' in their names, installation failed at bootstrap with a vague error message. Fix: Forbid '.' in cluster names for the OpenStack platform. Result: If the cluster name contains '.', users see a clear error message in the initial stages of installation.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:14:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Michal Jurc 2020-07-15 09:47:52 UTC
Description of problem:
When attempting to install OCP 4.5 on OpenStack with IPI with a cluster name containing a period (e.g. ocp-4.5), the bootstrapping stage of the installer fails with the following message from the installer:

E0714 09:20:34.017695  106955 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.ocp4.5.dynamic.quarkus:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=1&timeoutSeconds=479&watch=true: dial tcp connect: connection refused

A suspicious message recurs in the control plane kubelet logs:

Jul 13 08:43:25 ocp-4-5-6545h-master-0 hyperkube[1581]: E0703 08:43:25.355407    1581 kubelet.go:2285] node "ocp-4-5-6545h-master-0" not found

How reproducible:

Steps to Reproduce:
1. Run ./openshift-install create cluster with a cluster name containing a period

Actual results:
level=debug msg="OpenShift Installer 4.5.0"
level=debug msg="Built from commit b714ff8ee8845b86da2dafce8fa5630ef1806f3b"
level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.ocp4.5.dynamic.quarkus:6443..."
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ocp4.5.dynamic.quarkus:6443/version?timeout=32s: dial tcp connect: no route to host"
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ocp4.5.dynamic.quarkus:6443/version?timeout=32s: dial tcp connect: no route to host"
level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
level=debug msg="Still waiting for the Kubernetes API: Get https://api.ocp4.5.dynamic.quarkus:6443/version?timeout=32s: dial tcp connect: connection refused"
level=info msg="API v1.18.3+3415b61 up"
level=info msg="Waiting up to 40m0s for bootstrapping to complete..."
E0714 08:47:50.061389  106955 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ConfigMap: Get https://api.ocp4.5.dynamic.quarkus:6443/api/v1/namespaces/kube-system/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dbootstrap&resourceVersion=1&timeoutSeconds=576&watch=true: dial tcp connect: connection refused

Expected results:
Successful installation of the cluster

Additional info:
redacted install-config.yaml:
apiVersion: v1
baseDomain: dynamic.quarkus
compute:
- name: worker
  platform:
    openstack:
      type: ci.m1.large
  replicas: 3
controlPlane:
  name: master
  replicas: 3
metadata:
  name: ocp4.5
networking:
  clusterNetwork:
  - cidr:
    hostPrefix: 23
  machineNetwork:
  - cidr:
  networkType: OpenShiftSDN
platform:
  openstack:
    cloud: openstack
    computeFlavor: ci.m1.xlarge
    externalNetwork: <redacted>
    octaviaSupport: false
    region: regionOne
    trunkSupport: true
pullSecret: <redacted>
sshKey: <redacted>

Comment 1 Scott Dodson 2020-07-15 13:08:49 UTC
This is not urgent, and I believe the outcome will simply be to validate that the cluster name doesn't contain a dot; for now, simply choose a cluster name without a dot as a workaround.

Comment 2 Abhinav Dahiya 2020-07-16 17:39:55 UTC
Moving to the OpenStack component to fix the validation for OpenStack.

Comment 4 Martin André 2020-07-21 14:43:54 UTC
Is this really just an OpenStack problem, though? It feels like the validation should be common to all platforms.

Comment 5 Martin André 2020-07-21 15:30:28 UTC
I suppose that's because the resulting hostname for the nodes wouldn't be valid. In that case, we should match against the [a-z0-9-]+ regex.

Possibly related to https://github.com/openshift/installer/pull/3900
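As a minimal sketch of the hostname-safe validation suggested above (this is an illustration, not the installer's actual code; the function name is hypothetical):

```go
package main

import (
	"fmt"
	"regexp"
)

// clusterNameRe matches the pattern proposed in the comment: lowercase
// letters, digits and hyphens only, so the cluster name stays safe to
// embed in node hostnames and keepalived VRRP instance names.
var clusterNameRe = regexp.MustCompile(`^[a-z0-9-]+$`)

// validateClusterName is a hypothetical helper that rejects any name
// containing characters outside [a-z0-9-], including the '.' that
// triggered this bug.
func validateClusterName(name string) error {
	if !clusterNameRe.MatchString(name) {
		return fmt.Errorf("cluster name %q may only contain lowercase letters, digits, and hyphens", name)
	}
	return nil
}

func main() {
	fmt.Println(validateClusterName("ocp4-5")) // accepted: nil error
	fmt.Println(validateClusterName("ocp4.5")) // rejected: contains '.'
}
```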

Comment 6 Mike Fedosin 2020-07-21 16:20:14 UTC
@Martin, yep, this is specific to OpenStack. By default, the installer allows dots in the cluster name: https://github.com/openshift/installer/blob/832a6b5d31641ee99501e4fb5b6bc9acf8188741/pkg/validate/validate_test.go#L29

Comment 7 Pierre Prinetti 2020-07-23 14:20:31 UTC
Lowering the severity as there is an easy workaround. Postponing to an upcoming sprint.

Comment 8 Michal Jurc 2020-07-24 07:28:19 UTC
Could this be at least documented ASAP? The installer fails with a very unhelpful message, and none of the node logs hint at what's going on, which makes the underlying cause very hard for the end user to detect.

Comment 9 Martin André 2020-07-24 07:57:18 UTC
(In reply to Mike Fedosin from comment #6)
> @Martin, yep, this is specific to OpenStack. By default in the installer it
> is allowed to have dots in the cluster name
> https://github.com/openshift/installer/blob/
> 832a6b5d31641ee99501e4fb5b6bc9acf8188741/pkg/validate/validate_test.go#L29

Assuming that the unit test actually reflects what the installer supports.
It would be good to understand a bit better what is going on for OpenStack.

Let's merge the OpenStack check and leave the issue open while we investigate the root cause.

Comment 12 Martin André 2020-07-28 10:45:05 UTC
The problem seems to be related to the keepalived configuration file when the cluster name contains a dot.

On the bootstrap node, the VRRP instance name correctly contains the cluster name, while it is truncated at the first dot for master nodes, resulting in separate VRRP groups.

On bootstrap node:
VRRP_Instance(m.andre_API) Transition to MASTER STATE

On master node:
VRRP_Instance(m_API) Transition to MASTER STATE

Comment 13 Martin André 2020-07-28 11:56:41 UTC
It turns out we pass a `--cluster-config` argument to runtimecfg when rendering the keepalived config file for the bootstrap node, but not for the master nodes.
runtimecfg first tries to get the cluster name from cluster-config.yml; otherwise it calls the GetKubeconfigClusterNameAndDomain() function, which splits on the dot.


If there is no reliable way of getting the cluster name, we should forbid the dot in the cluster name for all on-prem platforms, like we did for OpenStack in https://github.com/openshift/installer/pull/3934
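The truncation described above can be sketched as follows. This is a simplified illustration of the split-on-first-dot behavior, not the actual GetKubeconfigClusterNameAndDomain() implementation; the function and its input form are assumed from the comments above:

```go
package main

import (
	"fmt"
	"strings"
)

// splitNameAndDomain mimics the described fallback: given an API hostname
// of the form "api.<cluster-name>.<base-domain>", take everything up to
// the first dot as the cluster name. A cluster name that itself contains
// a dot (e.g. "ocp4.5") is therefore truncated at that dot.
func splitNameAndDomain(apiHost string) (name, domain string) {
	host := strings.TrimPrefix(apiHost, "api.")
	parts := strings.SplitN(host, ".", 2)
	name = parts[0]
	if len(parts) > 1 {
		domain = parts[1]
	}
	return name, domain
}

func main() {
	// Cluster name "ocp4.5", base domain "dynamic.quarkus":
	name, domain := splitNameAndDomain("api.ocp4.5.dynamic.quarkus")
	fmt.Println(name, domain) // prints: ocp4 5.dynamic.quarkus
}
```

This matches the symptom in comment 12: the master nodes, which rely on this fallback, derive a truncated cluster name ("m" from "m.andre") and end up in a VRRP group separate from the bootstrap node's.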

Comment 14 Martin André 2020-07-30 14:52:57 UTC
*** Bug 1859290 has been marked as a duplicate of this bug. ***

Comment 15 David Sanz 2020-08-13 12:21:34 UTC
Verified on 4.6.0-0.nightly-2020-08-13-091737

08-13 14:20:12  level=fatal msg="failed to fetch Metadata: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"mrnd-134.6\": cluster name can't contain \".\" character"

Comment 17 errata-xmlrpc 2020-10-27 16:14:38 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.