Bug 1749403 - OVN on Azure install fails
Summary: OVN on Azure install fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-05 14:27 UTC by Dan Winship
Modified: 2019-10-23 16:19 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1764728
Environment:
Last Closed: 2019-10-16 06:40:33 UTC
Target Upstream Version:
Embargoed:


Attachments
install logs (121.90 KB, text/plain), 2019-09-05 18:16 UTC, Anurag saxena


Links
GitHub: openshift/ovn-kubernetes pull 23 (last updated 2019-09-10 18:30:30 UTC)
Red Hat Product Errata: RHBA-2019:2922 (last updated 2019-10-16 06:40:44 UTC)

Description Dan Winship 2019-09-05 14:27:24 UTC
OVN install on Azure is currently failing; the cluster does not fully come up, and the installer collects must-gather logs. clusteroperators.json shows, e.g., "authentication" with the status:

        "message": "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)",
        "reason": "RouteStatusDegradedFailedCreate",
        "status": "True",
        "type": "Degraded"

and "kube-controller-manager" with:

        "message": "StaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-0\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-2\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-dwinship-vkmwk-master-1\" not found",
        "reason": "StaticPodsDegradedError",
        "status": "True",
        "type": "Degraded"

(other failures may also be possible)

The failure eventually comes down to the kube-controller-manager case, where ovnkube-master hits an error while trying to assign the pod's address:

    time="2019-09-04T18:40:20Z" level=info msg="Setting annotations ovn={\\\"ip_address\\\":\\\"10.130.0.8/23\\\", \\\"mac_address\\\":\\\"da:23:cb:82:00:09\\\", \\\"gateway_ip\\\": \\\"10.130.0.1\\\"} on pod installer-2-dwinship-vkmwk-master-0"
    time="2019-09-04T18:40:27Z" level=error msg="Error in setting annotation on pod installer-2-dwinship-vkmwk-master-0/openshift-kube-controller-manager: etcdserver: request timed out"
    time="2019-09-04T18:40:27Z" level=error msg="Failed to set annotation on pod installer-2-dwinship-vkmwk-master-0 - etcdserver: request timed out"

Because the master does not retry failed pod setup, the pod ends up being retried forever on the node, but it always fails.
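
A retry on the master side would avoid wedging the pod like this. Below is a minimal sketch of that idea in Go with client-go; the function name, backoff values, and annotation handling are illustrative assumptions rather than the actual ovnkube-master code, and the context-less Get/Update signatures match the client-go of this era:

    // Illustrative sketch only (not the real ovnkube-master code): retry the
    // pod annotation write with exponential backoff so that a transient error
    // such as "etcdserver: request timed out" does not permanently leave the
    // pod without its OVN annotation.
    package main

    import (
        "log"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // setOVNAnnotation re-reads the pod and re-applies the "ovn" annotation
    // until the update succeeds or the backoff is exhausted.
    func setOVNAnnotation(client kubernetes.Interface, namespace, name, value string) error {
        backoff := wait.Backoff{Duration: time.Second, Factor: 2, Steps: 5}
        return wait.ExponentialBackoff(backoff, func() (bool, error) {
            pod, err := client.CoreV1().Pods(namespace).Get(name, metav1.GetOptions{})
            if err != nil {
                log.Printf("failed to get pod %s/%s: %v; retrying", namespace, name, err)
                return false, nil // treat as transient and retry
            }
            if pod.Annotations == nil {
                pod.Annotations = map[string]string{}
            }
            pod.Annotations["ovn"] = value
            if _, err := client.CoreV1().Pods(namespace).Update(pod); err != nil {
                log.Printf("failed to set annotation on pod %s/%s: %v; retrying", namespace, name, err)
                return false, nil // includes etcd request timeouts
            }
            return true, nil
        })
    }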

Presumably the error has something to do with the bootstrap switchover. I am not sure why this happens on Azure but not on AWS. IIRC, on AWS the kube-apiserver reaches etcd via an AWS load balancer, and it's possible that AWS removes the bad etcd replica from the load balancer, whereas with the way things are set up on Azure that doesn't happen, so everything becomes more dependent on clients retrying on failure, which ovnkube-master isn't doing here.

Comment 1 Anurag saxena 2019-09-05 18:15:18 UTC
Thanks, Dan, for opening the bug. For me it looks like the authentication and console cluster operators are preventing the cluster from coming up. Attaching install logs.
I don't have must-gather logs at the moment, since the cluster seems to have failed at an early stage.

Comment 2 Anurag saxena 2019-09-05 18:16:35 UTC
Created attachment 1612072 [details]
install logs

Comment 5 Dan Winship 2019-09-10 14:22:21 UTC
OK, it looks like the failure mentioned above was just weird cosmic rays or something... the *normal* failure is that anything that depends on a Route doesn't come up, because of a difference between Azure and AWS combined with a difference between ovn-kubernetes and kube-proxy.

The issue is that when the AWS CloudProvider creates a load balancer, it creates one that rewrites the incoming traffic so that when you connect to ingress-ip:service-port, the node receives a connection to node-ip:service-node-port (just as though you had tried to connect to the NodePort service directly). But when the Azure CloudProvider creates a load balancer, it creates one that forwards the packet unchanged from the loadbalancer to the node, so that the node receives a connection to ingress-ip:service-port.

If you're using kube-proxy, this will work anyway, because the iptables proxier adds a shortcut rule mapping ingress-ip:service-port to the service, so that local pod-to-loadbalancer connections don't actually go all the way to the load balancer and come back. The rule was not intended for what the Azure CloudProvider is doing, but it happens to work anyway.

ovn-kubernetes's proxy code doesn't (currently) add this shortcut, so load balancer connections end up failing.
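
A fix along these lines would need ovn-kubernetes to also treat ingress-ip:service-port as a VIP for the service, the same way it already handles clusterip:service-port. Here is a rough sketch of just the VIP-gathering step, using only the standard Kubernetes Service types; the helper name and how its output would be fed into the OVN load balancer are assumptions for illustration, not necessarily what the linked pull request does:

    // Illustrative sketch: collect every VIP:port pair that should map to a
    // service's endpoints, including the load-balancer ingress IPs that Azure
    // forwards to the node unchanged.
    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    // serviceVIPs returns "ip:port" strings for the cluster IP and for each
    // load-balancer ingress IP of a service; each of these would then be
    // programmed as a VIP on the OVN load balancer for the service's endpoints.
    func serviceVIPs(svc *corev1.Service) []string {
        var vips []string
        for _, port := range svc.Spec.Ports {
            if svc.Spec.ClusterIP != "" && svc.Spec.ClusterIP != corev1.ClusterIPNone {
                vips = append(vips, fmt.Sprintf("%s:%d", svc.Spec.ClusterIP, port.Port))
            }
            // Without these entries, a connection arriving at
            // ingress-ip:service-port (as Azure delivers it) matches nothing.
            for _, ing := range svc.Status.LoadBalancer.Ingress {
                if ing.IP != "" {
                    vips = append(vips, fmt.Sprintf("%s:%d", ing.IP, port.Port))
                }
            }
        }
        return vips
    }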

Comment 7 Anurag saxena 2019-09-11 14:28:34 UTC
Looks good on 4.2.0-0.nightly-2019-09-11-074500. Thanks for the fix!

$ oc get pods -n openshift-ovn-kubernetes
NAME                             READY   STATUS    RESTARTS   AGE
ovnkube-master-76c57ddbd-tpv2s   4/4     Running   0          51m
ovnkube-node-7w7fl               3/3     Running   0          51m
ovnkube-node-bwgxr               3/3     Running   0          51m
ovnkube-node-fs58c               3/3     Running   0          44m
ovnkube-node-mghmn               3/3     Running   0          43m
ovnkube-node-tlftc               3/3     Running   0          51m

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      35m
cloud-credential                           4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
cluster-autoscaler                         4.2.0-0.nightly-2019-09-11-074500   True        False         False      44m
console                                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      38m
dns                                        4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
image-registry                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      41m
ingress                                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      41m
insights                                   4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
kube-apiserver                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
kube-controller-manager                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
kube-scheduler                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      48m
machine-api                                4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
machine-config                             4.2.0-0.nightly-2019-09-11-074500   True        False         False      49m
marketplace                                4.2.0-0.nightly-2019-09-11-074500   True        False         False      45m
monitoring                                 4.2.0-0.nightly-2019-09-11-074500   True        False         False      36m
network                                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      49m
node-tuning                                4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
openshift-apiserver                        4.2.0-0.nightly-2019-09-11-074500   True        False         False      46m
openshift-controller-manager               4.2.0-0.nightly-2019-09-11-074500   True        False         False      48m
openshift-samples                          4.2.0-0.nightly-2019-09-11-074500   True        False         False      43m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-09-11-074500   True        False         False      49m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-09-11-074500   True        False         False      49m
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-09-11-074500   True        False         False      48m
service-ca                                 4.2.0-0.nightly-2019-09-11-074500   True        False         False      50m
service-catalog-apiserver                  4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
service-catalog-controller-manager         4.2.0-0.nightly-2019-09-11-074500   True        False         False      47m
storage                                    4.2.0-0.nightly-2019-09-11-074500   True        False         False      46m

Comment 8 errata-xmlrpc 2019-10-16 06:40:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

Comment 9 Dan Winship 2019-10-23 16:19:08 UTC
(In reply to Dan Winship from comment #5)
> OK, it looks like the failure mentioned above was just weird cosmic rays or
> something...

For reference, the original "master fails to set annotation and then never retries" bug is now bug 1764728.

