Description of problem: ocp 4.3.15 on Bare Metal UPI. Environment deployes with no issues with the 4.3.9 installer. With 4.3.15 install is reporting Error while reconciling 4.3.15: the cluster operator openshift-apiserver has not yet successfully rolled out We also only have two of the 3 masters as active nodes with etcd running. Version-Release number of the following components: openshift-install-linux-4.3.15.tar.gz openshift-client-linux-4.3.15.tar.gz How reproducible: Consistent Steps to Reproduce: mkdir baremetal cp install-config-redpill.yaml baremetal/install-config.yaml openshift-install create manifests --dir=baremetal # We need the masters to be schedulable so we don't run this step #sed -i "s/mastersSchedulable: true/mastersSchedulable: false/" baremetal/manifests/cluster-scheduler-02-config.yml # Then generate the ign files openshift-install create ignition-configs --dir=baremetal openshift-install --dir=baremetal wait-for bootstrap-complete \ --log-level=info Boostrap completes oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.3.15 True False 31m Error while reconciling 4.3.15: the cluster operator monitoring is degraded Actual results: oc get nodes NAME STATUS ROLES AGE VERSION etcd-2.test.bionode.io Ready master,worker 142m v1.16.2 localhost Ready master,worker 142m v1.16.2 All the nodes should be etcd-[0-2].test.bionode.io No SRV records were requested during bootstrap Expected results: NAME STATUS ROLES AGE VERSION etcd-0.test.bionode.io Ready master,worker 7m19s v1.16.2 etcd-1.test.bionode.io Ready master,worker 7m41s v1.16.2 etcd-2.test.bionode.io Ready master,worker 7m20s v1.16.2 Additional info: Please attach logs from ansible-playbook with the -vvv flag
> Actual results: oc get nodes NAME STATUS ROLES AGE VERSION etcd-2.test.bionode.io Ready master,worker 142m v1.16.2 localhost Ready master,worker 142m v1.16.2 This makes it seem like the nodes are failing to get the correct hostnames/nodenames. So moving to rhcos tema to triage.
It looks like the problem occurs when either 1 - the dns server gets overloaded with queries, or 2 - reverse dns issues or 3 - IPV6 address resolution issues. I've moved my config to just using DNSMasq and tuned the config. Currently I can reliably deploy 4.3.19
Based on comment #4, it looks like this was related to DNS resolution problems. I don't think there is much that can be done on the RHCOS side for DNS resolution issues; closing as NOTABUG. If you think there is more that should be done here, please reopen.