OVN install on Azure is currently failing; the cluster never fully comes up, and the installer collects must-gather logs.

clusteroperators.json shows, e.g., "authentication" with the status:

    "message": "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)",
    "reason": "RouteStatusDegradedFailedCreate",
    "status": "True",
    "type": "Degraded"

and "kube-controller-manager" with:

    "message": "StaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-0\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-2\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-1\" not found",
    "reason": "StaticPodsDegradedError",
    "status": "True",
    "type": "Degraded"

(Other failures may also be possible.)

The failure eventually comes down to the kube-controller-manager one: ovnkube-master hits an error while trying to assign the pod's address:

    time="2019-09-04T18:40:20Z" level=info msg="Setting annotations ovn={\\\"ip_address\\\":\\\"10.130.0.8/23\\\", \\\"mac_address\\\":\\\"da:23:cb:82:00:09\\\", \\\"gateway_ip\\\": \\\"10.130.0.1\\\"} on pod installer-2-dwinship-vkmwk-master-0"
    time="2019-09-04T18:40:27Z" level=error msg="Error in setting annotation on pod installer-2-dwinship-vkmwk-master-0/openshift-kube-controller-manager: etcdserver: request timed out"
    time="2019-09-04T18:40:27Z" level=error msg="Failed to set annotation on pod installer-2-dwinship-vkmwk-master-0 - etcdserver: request timed out"

Because ovnkube-master does not retry failed pod setup, the pod is retried forever on the node but always fails.

Presumably the error has something to do with the bootstrap switchover. I am not sure why this happens on Azure but not on AWS. IIRC on AWS the kube-apiserver reaches etcd via an AWS load balancer, and it's possible that AWS removes the bad etcd replica from the load balancer; the way things are set up on Azure, that doesn't happen, so it becomes more dependent on clients retrying on failure, which ovnkube-master isn't doing here.
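To illustrate the missing retry (this is just a sketch, not the actual ovnkube-master code): wrapping the annotation write in an exponential backoff would let pod setup survive a transient "etcdserver: request timed out" instead of leaving the pod permanently broken. The setOVNAnnotation callback below is a hypothetical stand-in for whatever patches the pod's ovn annotation.

package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// setAnnotationWithRetry wraps a single annotation-set attempt in an
// exponential backoff loop, so transient apiserver/etcd errors don't
// permanently fail pod setup.
func setAnnotationWithRetry(setOVNAnnotation func() error) error {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // first retry delay
		Factor:   2.0,                    // double the delay each attempt
		Steps:    6,                      // give up after ~30s of retries
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := setOVNAnnotation(); err != nil {
			// Treat the error as transient and try again; returning a
			// non-nil error here would abort the backoff loop immediately.
			return false, nil
		}
		return true, nil
	})
}

func main() {
	attempt := 0
	err := setAnnotationWithRetry(func() error {
		attempt++
		if attempt < 3 {
			return errors.New("etcdserver: request timed out") // simulate a transient failure
		}
		return nil
	})
	fmt.Println("attempts:", attempt, "err:", err)
}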
Thanks, Dan, for opening the bug. For me it looks like the authentication and console clusteroperators are what's preventing the cluster from coming up. Attaching install logs. I don't have must-gather logs at the moment, since the cluster seems to have failed at an early stage.
Created attachment 1612072 [details]
install logs
OK, it looks like the failure mentioned above was just weird cosmic rays or something... the *normal* failure is that anything that depends on a Route doesn't come up, because of a difference between Azure and AWS combined with a difference between ovn-kubernetes and kube-proxy.

When the AWS CloudProvider creates a load balancer, it creates one that rewrites the incoming traffic, so that when you connect to ingress-ip:service-port, the node receives a connection to node-ip:service-node-port (just as though you had tried to connect to the NodePort service directly). But when the Azure CloudProvider creates a load balancer, it creates one that forwards the packet unchanged from the load balancer to the node, so that the node receives a connection to ingress-ip:service-port.

If you're using kube-proxy, this works anyway, because the iptables proxier adds a shortcut rule mapping ingress-ip:service-port to the service, so that local pod-to-loadbalancer connections don't actually go all the way out to the load balancer and come back. That rule was not intended for what the Azure CloudProvider is doing, but it happens to work anyway. ovn-kubernetes's proxy code doesn't (currently) add this shortcut, so loadbalancer connections end up failing.
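Roughly, the missing piece looks like this (a minimal Go sketch, not the actual fix; addServiceVIP is a hypothetical helper standing in for whatever programs the OVN load balancer): for a LoadBalancer Service, the proxy also has to accept traffic addressed to ingress-ip:service-port, not just to the NodePort, because Azure delivers such packets to the node unmodified.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// addServiceVIP is a hypothetical placeholder for programming a
// vip:port -> service-endpoints mapping into the OVN load balancer.
func addServiceVIP(vip string, port int32, svc *corev1.Service) {
	fmt.Printf("map %s:%d -> endpoints of %s/%s\n", vip, port, svc.Namespace, svc.Name)
}

// syncLoadBalancerService sketches the shortcut described above: in addition
// to NodePorts, also handle traffic addressed directly to the load balancer
// ingress IP and service port.
func syncLoadBalancerService(svc *corev1.Service) {
	if svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
		return
	}
	for _, ing := range svc.Status.LoadBalancer.Ingress {
		if ing.IP == "" {
			continue // hostname-only ingress (e.g. an AWS ELB) is reached via the NodePort rewrite instead
		}
		for _, port := range svc.Spec.Ports {
			addServiceVIP(ing.IP, port.Port, svc)
		}
	}
}

func main() {
	svc := &corev1.Service{
		Spec: corev1.ServiceSpec{
			Type:  corev1.ServiceTypeLoadBalancer,
			Ports: []corev1.ServicePort{{Port: 443, NodePort: 31443}},
		},
		Status: corev1.ServiceStatus{
			LoadBalancer: corev1.LoadBalancerStatus{
				Ingress: []corev1.LoadBalancerIngress{{IP: "52.0.0.1"}},
			},
		},
	}
	svc.Namespace, svc.Name = "openshift-authentication", "oauth-openshift"
	syncLoadBalancerService(svc)
}

(kube-proxy gets the equivalent behavior from the iptables shortcut rule described above, which is why the same setup works there.)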
Looks good on 4.2.0-0.nightly-2019-09-11-074500. Thanks for the fix!

$ oc get pods -n openshift-ovn-kubernetes
NAME                             READY   STATUS    RESTARTS   AGE
ovnkube-master-76c57ddbd-tpv2s   4/4     Running   0          51m
ovnkube-node-7w7fl               3/3     Running   0          51m
ovnkube-node-bwgxr               3/3     Running   0          51m
ovnkube-node-fs58c               3/3     Running   0          44m
ovnkube-node-mghmn               3/3     Running   0          43m
ovnkube-node-tlftc               3/3     Running   0          51m

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      35m
cloud-credential                           4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
cluster-autoscaler                         4.2.0-0.nightly-2019-09-11-074500   True        False         False      44m
console                                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      38m
dns                                        4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
image-registry                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      41m
ingress                                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      41m
insights                                   4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
kube-apiserver                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
kube-controller-manager                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
kube-scheduler                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      48m
machine-api                                4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
machine-config                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      49m
marketplace                                4.2.0-0.nightly-2019-09-11-074500   True        False         False      45m
monitoring                                 4.2.0-0.nightly-2019-09-11-074500   True        False         False      36m
network                                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      49m
node-tuning                                4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
openshift-apiserver                        4.2.0-0.nightly-2019-09-11-074500   True        False         False      46m
openshift-controller-manager               4.2.0-0.nightly-2019-09-11-074500   True        False         False      48m
openshift-samples                          4.2.0-0.nightly-2019-09-11-074500   True        False         False      43m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-09-11-074500   True        False         False      49m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-09-11-074500   True        False         False      49m
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-09-11-074500   True        False         False      48m
service-ca                                 4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
service-catalog-apiserver                  4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
service-catalog-controller-manager         4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
storage                                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      46m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
(In reply to Dan Winship from comment #5)
> OK, it looks like the failure mentioned above was just weird cosmic rays or
> something...

For reference, the original "master fails to set annotation and then never retries" bug is now bug 1764728.