Version:
OCP: 4.10.0-0.nightly-2021-11-29-142540
ACM: 2.4.1-DOWNSTREAM-2021-11-22-20-58-05

Steps to reproduce:
Try to deploy a spoke compact cluster with 3 controllers (no workers). Have only the api and ingress entries in DNS.

Result:
The spoke compact cluster deployment gets stuck during the bootstrap phase. Two of the masters never finish starting all of their containers. On those two masters, the last container (which is constantly restarting) is verify-api-int-resolvable. After about one hour I added an api-int entry to the DNS server used by the setup; right after verify-api-int-resolvable restarted, many more containers started. I re-attempted the spoke deployment on the same setup with the same config and it went smoothly with the api-int entry in DNS.

oc get agentclusterinstalls.extensions.hive.openshift.io -o json|jq .items[].spec
{ "apiVIP": "192.168.123.106", "clusterDeploymentRef": { "name": "elvis2" }, "clusterMetadata": { "adminKubeconfigSecretRef": { "name": "elvis2-admin-kubeconfig" }, "adminPasswordSecretRef": { "name": "elvis2-admin-password" }, "clusterID": "dce6b348-e15e-4a1b-8c43-ca326e41efad", "infraID": "f76b1183-e76d-42b3-95f2-095cb7ebbbc7" }, "imageSetRef": { "name": "4.10" }, "ingressVIP": "192.168.123.105", "networking": { "clusterNetwork": [ { "cidr": "10.128.0.0/14", "hostPrefix": 23 } ], "machineNetwork": [ { "cidr": "192.168.123.0/24" } ], "serviceNetwork": [ "172.30.0.0/16" ] }, "provisionRequirements": { "controlPlaneAgents": 3 }, "sshPublicKey": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzwAz3fnZcrca7mY/kVFpQGS2yI1uGd/+t3PMJn/C7Ppj1uIG32ufHkTq+SXh8Zg3xcy9v/Uome1mo3FP7PoGsWms5B9wzbooGhbA3rdph0/NxSzrHO3qcudcJsBM4GVJhcbFfbkzJVCPZQ94O/Y17oKjKuaBz69clPD29BlzKF4xCWzzbJW5Q8Y9tvWvDpCdVBM7VorpAn3MaA95xL6e15douWwwlhdI4dIOk/+8HcfgJnZGyOeLTnLVpjxQaFzTj3ScEud/5yd5wHcICrHH8Fbq419nN7VWjxbMNWUn182mcCCs0RXx2eyYq27yJvgkJS86n09SyLynX6ySqkFXN" }

oc get cd -o json|jq .items[].spec
{ "baseDomain": "qe.lab.redhat.com", "clusterInstallRef": { "group": "extensions.hive.openshift.io", "kind": "AgentClusterInstall", "name": "elvis2", "version": "v1beta1" }, "clusterMetadata": { "adminKubeconfigSecretRef": { "name": "elvis2-admin-kubeconfig" }, "adminPasswordSecretRef": { "name": "elvis2-admin-password" }, "clusterID": "dce6b348-e15e-4a1b-8c43-ca326e41efad", "infraID": "f76b1183-e76d-42b3-95f2-095cb7ebbbc7" }, "clusterName": "elvis2", "controlPlaneConfig": { "servingCertificates": {} }, "installed": true, "platform": { "agentBareMetal": { "agentSelector": { "matchLabels": { "bla": "aaa" } } } }, "pullSecretRef": { "name": "pull-secret" } }

oc get infraenv -o json|jq .items[].spec
{ "clusterRef": { "name": "elvis2", "namespace": "elvis2" }, "nmStateConfigLabelSelector": { "matchLabels": { "nmstate_config_cluster_name": "ha-static" } }, "pullSecretRef": { "name": "pull-secret" }, "sshAuthorizedKey": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCzwAz3fnZcrca7mY/kVFpQGS2yI1uGd/+t3PMJn/C7Ppj1uIG32ufHkTq+SXh8Zg3xcy9v/Uome1mo3FP7PoGsWms5B9wzbooGhbA3rdph0/NxSzrHO3qcudcJsBM4GVJhcbFfbkzJVCPZQ94O/Y17oKjKuaBz69clPD29BlzKF4xCWzzbJW5Q8Y9tvWvDpCdVBM7VorpAn3MaA95xL6e15douWwwlhdI4dIOk/+8HcfgJnZGyOeLTnLVpjxQaFzTj3ScEud/5yd5wHcICrHH8Fbq419nN7VWjxbMNWUn182mcCCs0RXx2eyYq27yJvgkJS86n09SyLynX6ySqkFXN" }

oc get nmstateconfig -o json|jq .items[].spec
{ "config": { "dns-resolver": { "config": { "server": [ "192.168.123.1" ] } }, "interfaces": [ { "ipv4": { "address": [ { "ip": "192.168.123.142", "prefix-length": 24 } ], "dhcp": false, "enabled": true }, "ipv6": { "enabled": false }, "name": "eth0", "state": "up", "type": "ethernet" } ], "routes": { "config": [ { "destination": "0.0.0.0/0", "next-hop-address": "192.168.123.1", "next-hop-interface": "eth0", "table-id": 254 } ] } }, "interfaces": [ { "macAddress": "52:54:00:f7:d4:d1", "name": "eth0" } ] }
{ "config": { "dns-resolver": { "config": { "server": [ "192.168.123.1" ] } }, "interfaces": [ { "ipv4": { "address": [ { "ip": "192.168.123.143", "prefix-length": 24 } ], "dhcp": false, "enabled": true }, "ipv6": { "enabled": false }, "name": "eth0", "state": "up", "type": "ethernet" } ], "routes": { "config": [ { "destination": "0.0.0.0/0", "next-hop-address": "192.168.123.1", "next-hop-interface": "eth0", "table-id": 254 } ] } }, "interfaces": [ { "macAddress": "52:54:00:f7:d4:d2", "name": "eth0" } ] }
{ "config": { "dns-resolver": { "config": { "server": [ "192.168.123.1" ] } }, "interfaces": [ { "ipv4": { "address": [ { "ip": "192.168.123.144", "prefix-length": 24 } ], "dhcp": false, "enabled": true }, "ipv6": { "enabled": false }, "name": "eth0", "state": "up", "type": "ethernet" } ], "routes": { "config": [ { "destination": "0.0.0.0/0", "next-hop-address": "192.168.123.1", "next-hop-interface": "eth0", "table-id": 254 } ] } }, "interfaces": [ { "macAddress": "52:54:00:f7:d4:d3", "name": "eth0" } ] }
[core@master-1-2 ~]$ sudo crictl ps -a
CONTAINER       IMAGE                                                                                                                   CREATED              STATE    NAME                        ATTEMPT  POD ID
42aacf4f29b30   20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                        About a minute ago   Exited   verify-api-int-resolvable   5        3014109f5f9a5
fc88e276d90e0   20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                        4 minutes ago        Running  keepalived-monitor          0        98b4211786f3a
33c140ab88267   20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                        4 minutes ago        Running  coredns-monitor             0        af32fe20d7308
687f6f0cdb075   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3e96c1755163ecb2827bf4b4d1dfdabf2a125e6aeef620a0b8ba52d0c450432c  4 minutes ago        Running  keepalived                  0        98b4211786f3a
8285d01ab5def   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f0c1b89092c1966baa30586089f8698f2768b346717194f925cd80dfd84ed040  4 minutes ago        Running  coredns                     0        af32fe20d7308
19dfcb9eba661   20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                        4 minutes ago        Exited   render-config-keepalived    0        98b4211786f3a
3a669c85ef122   20f7156b7fc4037a90b04952dc8e23e9b88d085e88eeeededf2575c7f53390a6                                                        4 minutes ago        Exited   render-config-coredns       0        af32fe20d7308

[core@master-1-2 ~]$ sudo crictl logs 42aacf4f29b30
Error in configuration:
* unable to read client-cert /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory
* unable to read client-key /var/lib/kubelet/pki/kubelet-client-current.pem for default-auth due to open /var/lib/kubelet/pki/kubelet-client-current.pem: no such file or directory
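Independently of the log above, the api-int resolution that verify-api-int-resolvable depends on can be checked by hand on the node. The snippet below is illustrative rather than captured output, but it shows the shape of the failure:

[core@master-1-2 ~]$ cat /etc/resolv.conf
nameserver 192.168.123.1          <- only the upstream from the nmstate dns-resolver section, no node-local entry
[core@master-1-2 ~]$ getent hosts api-int.elvis2.qe.lab.redhat.com
(no output: the upstream server has no api-int record and the node-local coredns is never consulted)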
The problem doesn't happen in 4.9. When it happens, /etc/resolv.conf does not contain the node-local nameserver entry. The local coredns is what answers the api-int record, so if that nameserver is missing from resolv.conf, api-int is not resolvable. In addition, an installation without an nmstate config completes successfully. This looks like an nmstate issue.
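To make that concrete, this is roughly the difference I would expect between an affected master and a healthy one. The healthy case assumes the usual on-prem behaviour of prepending the node's own address so the local coredns is queried first; it is an illustration, not captured output:

# affected master (nmstate config applied): only the upstream resolver
nameserver 192.168.123.1

# healthy master: node address prepended, local coredns answers api-int first
nameserver 192.168.123.142
nameserver 192.168.123.1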
@phoracek is there anyone who can take a look at this issue?
The nmstate config here is going through the assisted installer, right? If so, you may want to direct this question to the assisted installer team on #forum-kni-assisted-deployment. There was a similar thread about nmstate, the dispatcher script, and DNS, updated yesterday: https://coreos.slack.com/archives/CUPJTHQ5P/p1638483562237600.
I am from the assisted-installer team. We use nmstate as-is: we validate the YAML format and just put it on the host. That is why I'm asking someone from nmstate to take a look.
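For completeness, the desired state we hand over is just the config section of the NMStateConfig above; applied directly on a host it would be the equivalent of something like the following (an illustration of the flow, not the installer code):

$ cat desired-state.yaml
dns-resolver:
  config:
    server:
    - 192.168.123.1
interfaces:
- name: eth0
  type: ethernet
  state: up
  ipv4:
    enabled: true
    dhcp: false
    address:
    - ip: 192.168.123.142
      prefix-length: 24
  ipv6:
    enabled: false
routes:
  config:
  - destination: 0.0.0.0/0
    next-hop-address: 192.168.123.1
    next-hop-interface: eth0
    table-id: 254
$ sudo nmstatectl apply desired-state.yaml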
Adding @bnemec for the dispatcher and @fge for nmstate. My team only works on kubernetes-nmstate.
This sounds an awful lot like https://bugzilla.redhat.com/show_bug.cgi?id=2029438. I'm not sure why it has suddenly become a problem now, but we've had multiple reports of this api-int resolution problem on the bootstrap that appear to share the same cause.
*** This bug has been marked as a duplicate of bug 2029438 ***