Bug 2005967 - 4.9 AWS OVN Installs failing significantly more than SDN
Summary: 4.9 AWS OVN Installs failing significantly more than SDN
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-20 14:57 UTC by Devan Goodwin
Modified: 2021-09-30 13:42 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-30 13:42:28 UTC
Target Upstream Version:
Embargoed:



Description Devan Goodwin 2021-09-20 14:57:13 UTC
TRT has determined that AWS OVN installs appear to be failing significantly more often than AWS SDN installs. The problems are surfacing in 4.10 upgrade tests, but during the 4.9 install phase.

The difference can be seen here (filter on test names containing "install-stable"):

https://datastudio.google.com/reporting/519bfc5e-6996-46bf-9269-0e2ca6d95eb4/page/qKmaC

This data presently shows an AWS SDN install success rate of 87%, versus 50% for AWS OVN.

Testgrid results: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade

Sample failure job showing network operator reporting healthy, but several other operators reporting network problems: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1439917064897695744

Please note this issue is blocking TRT efforts to make aws-ovn a blocking job on PRs.

Comment 1 jamo luhrsen 2021-09-20 21:14:54 UTC
Not sure what the problem is yet, but here are some notes from digging through the logs.

From 'oc get pods' output [0], I see:

  openshift-kube-apiserver                           kube-apiserver-ip-10-0-226-77.us-west-1.compute.internal              0/5     Init:0/1            0               82s     10.0.226.77    ip-10-0-226-77.us-west-1.compute.internal    <none>

In the pods.json [1] it looks like all containers in that pod are stuck at init. And there are no pod logs for
any of those containers in the pods log folder [2].
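
For reference, something along these lines can list pods whose init containers never became ready; this is just a sketch against [1], assuming the usual gather-extra pods.json layout:

  # list pods with not-ready init containers and the node they landed on
  jq -r '.items[]
    | select((.status.initContainerStatuses // []) | any(.ready == false))
    | "\(.metadata.namespace)/\(.metadata.name) on \(.spec.nodeName)"' pods.json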

Will keep digging for more clues.

[0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1439917064897695744/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/oc_cmds/pods
[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1439917064897695744/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods.json
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1439917064897695744/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/

Comment 2 Dan Williams 2021-09-21 01:34:16 UTC
I noticed that the bootstrap took a really long time to bring up anything networking related:

time="2021-09-20T11:50:59Z" level=info msg="Waiting up to 20m0s for the Kubernetes API at https://api.ci-op-snjljbjh-a7433.aws-2.ci.openshift.org:6443..."

Yet the network operator only comes up 28 minutes later, at:

W0920 12:18:36.476397       1 cmd.go:204] Using insecure, self-signed certificates
I0920 12:18:36.723804       1 observer_polling.go:159] Starting file observer

Even kube-scheduler only starts at 12:13:34.885483, 23 minutes after bootstrap start.

CNO sure can't start until the scheduler does, and the scheduler diddles around for 4 minutes trying to get its lease from the apiserver and then doing who knows what:

I0920 12:13:55.687007       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1632140016\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1632140016\" (2021-09-20 11:13:35 +0000 UTC to 2022-09-20 11:13:35 +0000 UTC (now=2021-09-20 12:13:55.686985976 +0000 UTC))"
I0920 12:13:56.687782       1 leaderelection.go:248] attempting to acquire leader lease kube-system/kube-scheduler...
I0920 12:14:13.068265       1 leaderelection.go:258] successfully acquired lease kube-system/kube-scheduler
I0920 12:17:37.444195       1 factory.go:377] "Unable to schedule pod; no nodes are registered to the cluster; waiting" pod="openshift-kube-scheduler-operator/openshift-kube-scheduler-operator-57f4c75696-728t9"

The first ovn-kubernetes master is started 30 minutes into bootstrap:

+ echo 'I0920 12:20:06.427624287 - ovnkube-master - start ovnkube --init-master ip-10-0-138-121.us-west-1.compute.internal'

and then bootstrap times out 33 minutes after start:

time="2021-09-20T12:23:20Z" level=info msg="Pulling debug logs from the bootstrap machine"

It's going to be pretty hard to bring up an entire cluster in 3 minutes....

So why isn't the network operator getting started by the CVO much, much, much earlier?
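
For anyone retracing this, the rough timeline above can be pulled out of the gathered artifacts with plain grep/head; a sketch only, and the file names are assumptions based on the gather-extra naming:

  # first line of each pod log approximates when that component came up
  for f in openshift-kube-scheduler_*.log openshift-network-operator_*.log; do
    echo "== $f"; head -n 1 "$f"
  done
  # bootstrap start and teardown markers from the installer log
  grep -E 'Waiting up to|Pulling debug logs' .openshift_install.log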

Comment 3 Jack Ottofaro 2021-09-21 14:15:19 UTC
Looking at [1], the CVO startup order etc. looks fine until here:

E0920 12:28:46.948371       1 task.go:112] error running apply for namespace "openshift-user-workload-monitoring" (365 of 734): the server was unable to return a response in the time allotted, but may still be processing the request (get namespaces openshift-user-workload-monitoring)
E0920 12:28:46.957362       1 task.go:112] error running apply for clusterrolebinding "csi-snapshot-controller-operator-role" (300 of 734): the server was unable to return a response in the time allotted, but may still be processing the request (get clusterrolebindings.rbac.authorization.k8s.io csi-snapshot-controller-operator-role)
E0920 12:28:46.965015       1 task.go:112] error running apply for configmap "openshift-config-managed/grafana-dashboard-namespace-by-pod" (650 of 734): the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps grafana-dashboard-namespace-by-pod)
E0920 12:28:47.013743       1 task.go:112] error running apply for role "openshift-ingress-operator/ingress-operator" (339 of 734): the server was unable to return a response in the time allotted, but may still be processing the request (get roles.rbac.authorization.k8s.io ingress-operator)

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1439917064897695744/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5f9c9fcfcd-d878v_cluster-version-operator_previous.log
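
To see how these errors cluster over time, they can be bucketed by minute; a quick sketch against the previous.log downloaded from [1] (a glob is used here to avoid the long pod name):

  grep 'error running apply' openshift-cluster-version_*_previous.log \
    | awk '{print $2}' | cut -c1-5 | sort | uniq -c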

Comment 4 Dan Williams 2021-09-22 04:41:57 UTC
(In reply to Jack Ottofaro from comment #3)
> Looking at [1], the CVO startup order etc. looks fine until here: [...]

Also in this cluster, if you grep for 10.0.231.186 (the AWS API LB?) in the node journals, you'll see a lot of 'connection refused' errors starting at 11:55 and continuing until 12:13:49 on ip-10-0-138-121.us-west-1.compute.internal and ip-10-0-167-102.us-west-1.compute.internal, but also on the other nodes. That seems curious as well.
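
A sketch of that grep over the gathered node journals, for anyone who wants to reproduce the count (the nodes/<node>/journal path is an assumption about the artifact layout):

  # count connection-refused hits against 10.0.231.186 per node journal
  for j in nodes/*/journal; do
    printf '%s: ' "$j"
    grep -c '10.0.231.186.*refused' "$j"
  done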

Comment 5 Devan Goodwin 2021-09-22 11:02:27 UTC
As of this morning the ovn tests have gotten a little worse, 47%, and sdn got a little better, 84%.

Comment 6 Devan Goodwin 2021-09-22 11:05:18 UTC
Sorry, SDN went from 87% -> 84%, so also a little worse.

Comment 7 Devan Goodwin 2021-09-28 16:10:15 UTC
After the node team's fix for static pods hit nightly, the results have slowly been improving. We're now up to 74% success for ovn and 83% for sdn.

Sending over to the node team to explain what was done and to close or dupe as needed.

Technically we do not really know why this was only showing up for ovn; the only theory we had was that the ovn test jobs were being run in bursts of 10, rather than the normal way the sdn jobs were run.

Comment 8 Devan Goodwin 2021-09-30 13:42:28 UTC
Up to 92% for both now. We consider this issue resolved.

