Bug 1764728 - ovn-kubernetes on Azure fails due to ovnkube-master not retrying setting pod annotation
Summary: ovn-kubernetes on Azure fails due to ovnkube-master not retrying setting pod annotation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Dan Williams
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-23 16:16 UTC by Dan Winship
Modified: 2020-01-23 11:09 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1749403
Environment:
Last Closed: 2020-01-23 11:09:02 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift/ovn-kubernetes pull 63 (closed): Upstream merge 2019-11-15 + latest hybrid overlay — last updated 2020-07-03 03:39:29 UTC
Red Hat Product Errata RHBA-2020:0062 — last updated 2020-01-23 11:09:34 UTC

Description Dan Winship 2019-10-23 16:16:53 UTC
When we first tried ovn-kubernetes on Azure, it failed reliably. While debugging, I hit a race condition between ovnkube-master and etcd, followed by ovnkube-master failing to retry. We then never saw that particular bug again, and it turned out there was a much simpler reason ovn-kubernetes-on-Azure was failing reliably. But someone has just seen the original race condition again, so we need to fix this too...

+++ This bug was initially created as a clone of Bug #1749403 +++

OVN install on Azure is currently failing; the cluster does not fully come up, and the installer collects must-gather logs. clusteroperators.json shows, e.g., "authentication" with the status:

        "message": "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)",
        "reason": "RouteStatusDegradedFailedCreate",
        "status": "True",
        "type": "Degraded"

and "kube-controller-manager" with:

        "message": "StaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-0\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-2\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-1\" not found",
        "reason": "StaticPodsDegradedError",
        "status": "True",
        "type": "Degraded"

(other failures may also be possible)

The failure eventually comes down to the kube-controller-manager case, where ovnkube-master hits an error while trying to assign the pod's address:

    time="2019-09-04T18:40:20Z" level=info msg="Setting annotations ovn={\\\"ip_address\\\":\\\"10.130.0.8/23\\\", \\\"mac_address\\\":\\\"da:23:cb:82:00:09\\\", \\\"gateway_ip\\\": \\\"10.130.0.1\\\"} on pod installer-2-dwinship-vkmwk-master-0"
    time="2019-09-04T18:40:27Z" level=error msg="Error in setting annotation on pod installer-2-dwinship-vkmwk-master-0/openshift-kube-controller-manager: etcdserver: request timed out"
    time="2019-09-04T18:40:27Z" level=error msg="Failed to set annotation on pod installer-2-dwinship-vkmwk-master-0 - etcdserver: request timed out"

Because the master never retries failed pod setup, the pod is retried forever on the node but always fails.

Presumably the error has something to do with the bootstrap switchover. I am not sure why this happens on Azure but not on AWS. IIRC, on AWS the kube-apiserver reaches etcd via an AWS load balancer, and it's possible that AWS removes the bad etcd replica from the load balancer; the way things are set up on Azure, that doesn't happen, so everything becomes more dependent on clients retrying on failure, which ovnkube-master isn't doing here.
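
One straightforward way to handle this kind of transient failure would be to wrap the annotation write in an exponential backoff using client-go's retry helper, retrying only on errors that look transient. The sketch below is an illustration, not the fix that later landed upstream; setAnnotationWithRetry, isTransient, and the simulated write are hypothetical names.

    package main

    import (
        "fmt"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/util/retry"
    )

    // isTransient reports whether an apiserver error looks worth retrying.
    // Assuming the "etcdserver: request timed out" error from this bug surfaces
    // as a server-timeout status, it would be caught by IsServerTimeout.
    func isTransient(err error) bool {
        return apierrors.IsServerTimeout(err) ||
            apierrors.IsTimeout(err) ||
            apierrors.IsTooManyRequests(err)
    }

    // setAnnotationWithRetry wraps setFn (whatever call actually writes the
    // "ovn" annotation on the pod; hypothetical here) in exponential backoff
    // so a single transient etcd/apiserver failure does not wedge pod setup.
    func setAnnotationWithRetry(setFn func() error) error {
        backoff := wait.Backoff{
            Duration: 500 * time.Millisecond,
            Factor:   2.0,
            Jitter:   0.1,
            Steps:    5,
        }
        return retry.OnError(backoff, isTransient, setFn)
    }

    func main() {
        // Simulated annotation write: fails twice with a timeout, then succeeds.
        attempts := 0
        err := setAnnotationWithRetry(func() error {
            attempts++
            if attempts < 3 {
                return apierrors.NewServerTimeout(schema.GroupResource{Resource: "pods"}, "update", 1)
            }
            return nil
        })
        fmt.Printf("attempts=%d err=%v\n", attempts, err)
    }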

--- Additional comment from Dan Winship on 2019-09-10 10:22:21 EDT ---

OK, it looks like the failure mentioned above was just weird cosmic rays or something...

Comment 1 Casey Callendrello 2019-10-24 08:10:35 UTC
For the record, apiserver -> etcd communication doesn't go through any sort of loadbalancer.

But yes, either way, we need to handle these sorts of failures.
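
Another way to handle these sorts of failures on the master side — sketched below under assumptions, not taken from the linked PR — is to requeue a failed pod setup on client-go's rate-limited workqueue instead of dropping it; setupPod and the queueing layout are stand-ins.

    package main

    import (
        "fmt"

        "k8s.io/client-go/util/workqueue"
    )

    func main() {
        // Rate-limited queue: requeued items come back with increasing delay.
        queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
        defer queue.ShutDown()

        // setupPod stands in for the master-side work that allocates the pod's
        // address and writes the "ovn" annotation; here it fails twice with the
        // same error seen in the bug, then succeeds.
        failures := 2
        setupPod := func(key string) error {
            if failures > 0 {
                failures--
                return fmt.Errorf("etcdserver: request timed out")
            }
            return nil
        }

        queue.Add("openshift-kube-controller-manager/installer-2-dwinship-vkmwk-master-0")
        for {
            item, shutdown := queue.Get()
            if shutdown {
                return
            }
            key := item.(string)
            err := setupPod(key)
            queue.Done(key)
            if err != nil {
                // Transient failure: requeue with backoff instead of giving up,
                // so the pod eventually gets its address and annotation.
                queue.AddRateLimited(key)
                continue
            }
            // Success: clear the rate-limiter history and end the demo.
            queue.Forget(key)
            fmt.Println("pod setup succeeded for", key)
            return
        }
    }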

Comment 2 Dan Winship 2019-10-24 12:14:05 UTC
Er... DNS name then maybe? It seems relevant that this has happened at least twice that we know of on Azure, and never that we know of on AWS. Though the fix is the same regardless.

Comment 3 Casey Callendrello 2019-11-15 15:40:38 UTC
dcbw states: The fix is upstream; he'll do a rebase today.

Comment 5 Dan Winship 2019-12-04 01:25:53 UTC
Yes, this is now fixed.

Comment 7 Ross Brattain 2019-12-05 02:42:16 UTC
OVN deployed successfully on Azure with 4.3.0-0.nightly-2019-12-04-124503 and 4.3.0-0.nightly-2019-12-04-214544

Comment 9 errata-xmlrpc 2020-01-23 11:09:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

