When we first tried ovn-kubernetes on Azure, it failed reliably. While debugging that, I encountered a race condition between ovnkube-master and etcd, followed by ovnkube-master failing to retry. We never saw that particular bug again, and it turned out there was a much simpler reason why ovn-kubernetes-on-Azure was failing reliably. But now someone has just hit the original race condition again, so we need to fix this too...

+++ This bug was initially created as a clone of Bug #1749403 +++

OVN install on Azure is currently failing; the cluster does not fully come up, and the installer collects must-gather logs. clusteroperators.json shows, eg, "authentication" with the status:

    "message": "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)",
    "reason": "RouteStatusDegradedFailedCreate",
    "status": "True",
    "type": "Degraded"

and "kube-controller-manager" with:

    "message": "StaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-0\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-2\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-1\" not found",
    "reason": "StaticPodsDegradedError",
    "status": "True",
    "type": "Degraded"

(other failures may also be possible)

The failure eventually comes down to the kube-controller-manager problem, where ovnkube-master hits an error while trying to assign the pod's address:

    time="2019-09-04T18:40:20Z" level=info msg="Setting annotations ovn={\"ip_address\":\"10.130.0.8/23\", \"mac_address\":\"da:23:cb:82:00:09\", \"gateway_ip\": \"10.130.0.1\"} on pod installer-2-dwinship-vkmwk-master-0"
    time="2019-09-04T18:40:27Z" level=error msg="Error in setting annotation on pod installer-2-dwinship-vkmwk-master-0/openshift-kube-controller-manager: etcdserver: request timed out"
    time="2019-09-04T18:40:27Z" level=error msg="Failed to set annotation on pod installer-2-dwinship-vkmwk-master-0 - etcdserver: request timed out"

Because the master does not retry failed pod setup, the pod gets retried forever on the node but always fails. Presumably the error has something to do with bootstrap switchover. I am not sure why this happens in Azure but not AWS. IIRC in AWS the kube apiserver reaches etcd via an AWS loadbalancer, and it's possible that AWS is removing the bad etcd replica from the loadbalancer, but the way things are set up in Azure, that doesn't happen, so it becomes more dependent on clients retrying on failure, which ovnkube-master isn't doing here.

--- Additional comment from Dan Winship on 2019-09-10 10:22:21 EDT ---

OK, it looks like the failure mentioned above was just weird cosmic rays or something...
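For illustration, here is a minimal sketch of the kind of retry wrapper being asked for, using the standard client-go/apimachinery backoff helpers. The setPodAnnotation() helper below is hypothetical (a stand-in for whatever call produced the "etcdserver: request timed out" error above), and the backoff numbers are arbitrary; this is not the actual ovn-kubernetes fix, just the general shape of it.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// setPodAnnotation is a hypothetical stand-in for the single API call that
// failed in the logs above (updating the "ovn" annotation on the pod).
func setPodAnnotation(namespace, name string, annotations map[string]string) error {
	// ... perform the update/patch against the apiserver ...
	return nil
}

// setPodAnnotationWithRetry retries transient apiserver/etcd failures with
// exponential backoff instead of giving up after the first error.
func setPodAnnotationWithRetry(namespace, name string, annotations map[string]string) error {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay before retrying
		Factor:   2.0,                    // double the delay each attempt
		Steps:    5,                      // give up after 5 attempts
	}
	return retry.OnError(backoff, func(err error) bool {
		// Treat every error as retriable for this sketch; a real
		// implementation would probably only retry timeouts and
		// similar transient apiserver/etcd errors.
		return true
	}, func() error {
		return setPodAnnotation(namespace, name, annotations)
	})
}

func main() {
	err := setPodAnnotationWithRetry("openshift-kube-controller-manager",
		"installer-2-dwinship-vkmwk-master-0",
		map[string]string{"ovn": "{...pod IP/MAC/gateway...}"})
	if err != nil {
		fmt.Printf("failed to set annotation after retries: %v\n", err)
	}
}

The point is simply that a single "etcdserver: request timed out" should not permanently wedge pod setup; with a bounded retry like the above, the transient etcd hiccup gets absorbed and the pod comes up.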
For the record, apiserver -> etcd communication doesn't go through any sort of loadbalancer. But yes, either way, we need to handle these sorts of failures.
Er... DNS name then maybe? It seems relevant that this has happened at least twice that we know of on Azure, and never that we know of on AWS. Though the fix is the same regardless.
dcbw states: The fix is upstream; he'll do a rebase today.
Yes, this is now fixed
OVN deployed successfully on Azure with 4.3.0-0.nightly-2019-12-04-124503 and 4.3.0-0.nightly-2019-12-04-214544
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062