Bug 1764728

Summary: ovn-kubernetes on Azure fails due to ovnkube-master not retrying setting pod annotation

Product: OpenShift Container Platform
Component: Networking
Networking sub component: ovn-kubernetes
Version: 4.2.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Dan Winship <danw>
Assignee: Dan Williams <dcbw>
QA Contact: Anurag saxena <anusaxen>
CC: cdc, nagrawal, rbrattai
Clone Of: Bug 1749403
Last Closed: 2020-01-23 11:09:02 UTC

Description Dan Winship 2019-10-23 16:16:53 UTC
When we first tried ovn-kubernetes on Azure, it failed reliably, and when I tried to debug why, I encountered a race condition between ovnkube-master and etcd, followed by ovnkube-master failing to retry. We then never saw that particular bug again, and it turned out there was a much simpler reason why ovn-kubernetes-on-Azure was failing reliably. But now someone has just seen the original race condition again, so we need to fix this too...

+++ This bug was initially created as a clone of Bug #1749403 +++

OVN install on Azure is currently failing; the cluster does not fully come up, and the installer collects must-gather logs. clusteroperators.json shows, e.g., "authentication" with the status:

        "message": "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)",
        "reason": "RouteStatusDegradedFailedCreate",
        "status": "True",
        "type": "Degraded"

and "kube-controller-manager" with:

        "message": "StaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-0\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-2\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-1\" not found",
        "reason": "StaticPodsDegradedError",
        "status": "True",
        "type": "Degraded"

(other failures may also be possible)

The failure eventually comes down to the kube-controller-manager failure above, which in turn traces back to an error in ovnkube-master while it is trying to assign the pod's address:

    time="2019-09-04T18:40:20Z" level=info msg="Setting annotations ovn={\\\"ip_address\\\":\\\"10.130.0.8/23\\\", \\\"mac_address\\\":\\\"da:23:cb:82:00:09\\\", \\\"gateway_ip\\\": \\\"10.130.0.1\\\"} on pod installer-2-dwinship-vkmwk-master-0"
    time="2019-09-04T18:40:27Z" level=error msg="Error in setting annotation on pod installer-2-dwinship-vkmwk-master-0/openshift-kube-controller-manager: etcdserver: request timed out"
    time="2019-09-04T18:40:27Z" level=error msg="Failed to set annotation on pod installer-2-dwinship-vkmwk-master-0 - etcdserver: request timed out"

Because there is no retrying of failed pod setup in the master, the pod is retried forever on the node but always fails.

Presumably the error has something to do with the bootstrap switchover. I am not sure why this happens on Azure but not on AWS. IIRC, on AWS the kube-apiserver reaches etcd via an AWS load balancer, and it's possible that AWS removes the bad etcd replica from the load balancer; the way things are set up on Azure, that doesn't happen, so everything becomes more dependent on clients retrying on failure, which ovnkube-master isn't doing here.
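
The fix, conceptually, is to treat an apiserver/etcd error like the one in the log as transient and back off and retry instead of giving up after the first attempt. A minimal sketch of that idea, assuming the hypothetical setOVNAnnotation helper from the snippet above (again, not the actual upstream patch):

    package main

    import (
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // setOVNAnnotationWithRetry retries the annotation patch with exponential
    // backoff so that a transient "etcdserver: request timed out" does not
    // permanently wedge the pod.
    func setOVNAnnotationWithRetry(client kubernetes.Interface, namespace, name, value string) error {
        backoff := wait.Backoff{
            Duration: 500 * time.Millisecond, // initial delay
            Factor:   2.0,                    // double the delay after each failure
            Jitter:   0.1,
            Steps:    5, // give up after a handful of attempts
        }
        var lastErr error
        err := wait.ExponentialBackoff(backoff, func() (bool, error) {
            if lastErr = setOVNAnnotation(client, namespace, name, value); lastErr != nil {
                return false, nil // treat the error as transient; try again after backing off
            }
            return true, nil
        })
        if err != nil && lastErr != nil {
            return lastErr // surface the real underlying error, not just the backoff timeout
        }
        return err
    }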

--- Additional comment from Dan Winship on 2019-09-10 10:22:21 EDT ---

OK, it looks like the failure mentioned above was just weird cosmic rays or something...

Comment 1 Casey Callendrello 2019-10-24 08:10:35 UTC
For the record, apiserver -> etcd communication doesn't go through any sort of loadbalancer.

But yes, either way, we need to handle these sorts of failures.

Comment 2 Dan Winship 2019-10-24 12:14:05 UTC
Er... DNS name then maybe? It seems relevant that this has happened at least twice that we know of on Azure, and never that we know of on AWS. Though the fix is the same regardless.

Comment 3 Casey Callendrello 2019-11-15 15:40:38 UTC
dcbw states: The fix is upstream; he'll do a rebase today.

Comment 5 Dan Winship 2019-12-04 01:25:53 UTC
Yes, this is now fixed.

Comment 7 Ross Brattain 2019-12-05 02:42:16 UTC
OVN deployed successfully on Azure with 4.3.0-0.nightly-2019-12-04-124503 and 4.3.0-0.nightly-2019-12-04-214544

Comment 9 errata-xmlrpc 2020-01-23 11:09:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062