Bug 1788309

Summary: e2e-gcp-ovn failing consistently
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Networking
Assignee: Ben Bennett <bbennett>
Networking sub component: ovn-kubernetes
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WORKSFORME
Severity: high
Priority: high
CC: aconstan, anbhat, bbennett, danw, dcbw, dmellado, eparis, ricarril, scuppett, vrutkovs
Version: 4.3.0
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: SDN-CI-IMPACT
Last Closed: 2020-07-13 14:16:04 UTC
Type: Bug
Bug Blocks: 1812960    

Description Ben Parees 2020-01-06 22:20:09 UTC
Description of problem:
The job is failing consistently; the install fails because the multus admission controller DaemonSet is never scheduled:

level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/multus-admission-controller\" is not yet scheduled on any nodes"

example:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/259

Comment 4 Dan Winship 2020-01-08 15:57:27 UTC
OK, CNO needs to be doing a better job of reporting status here... the problem is that ovn-kubernetes is fully "deployed" (as in, pods are running) but it's not actually *working*, so all the nodes still have `NetworkUnavailable` and no other operators can get deployed. I guess we probably need to detect that.
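
As a rough illustration of the detection comment 4 asks for (this is not CNO code; the package and function names are made up, and client-go >= 0.18 is assumed), a check like the following would tell an operator which nodes still carry NetworkUnavailable=True even though the CNI pods are running:

package netcheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodesWithNetworkUnavailable returns the names of nodes whose
// NetworkUnavailable condition is still True, i.e. nodes that stay tainted
// and cannot receive most workloads even though the network pods are running.
func nodesWithNetworkUnavailable(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var blocked []string
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeNetworkUnavailable && cond.Status == corev1.ConditionTrue {
				blocked = append(blocked, node.Name)
			}
		}
	}
	return blocked, nil
}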

Comment 5 Dan Williams 2020-01-13 15:52:06 UTC
It appears the master is unable to update the Node object to remove the NetworkUnavailable condition that the cloud routes controller adds:

time="2020-01-06T16:35:20Z" level=error msg="Error in updating status on node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"
time="2020-01-06T16:35:20Z" level=error msg="status update failed for local node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"

Not sure why; we have this in our ClusterRole for openshift-ovn-kubernetes:ovn-kubernetes-controller:

- apiGroups: [""]
  resources:
  - "nodes/status"
  verbs:
  - patch
  - update

and in the past that's been enough to make it work...
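
One way to debug this kind of RBAC surprise (not part of the bug itself; the package and function names below are invented, client-go >= 0.18 is assumed, and `kubectl auth can-i update nodes/status --as=system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller` is the CLI equivalent) is to ask the API server directly with a SubjectAccessReview whether that service account may update nodes/status:

package rbaccheck

import (
	"context"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canUpdateNodeStatus asks the API server whether the ovn-kubernetes
// controller service account may update nodes/status, the exact access the
// "forbidden" errors above complain about.
func canUpdateNodeStatus(ctx context.Context, client kubernetes.Interface) (bool, string, error) {
	sar := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			// User string copied verbatim from the error messages.
			User: "system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller",
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Group:       "",
				Resource:    "nodes",
				Subresource: "status",
				Verb:        "update",
			},
		},
	}
	resp, err := client.AuthorizationV1().SubjectAccessReviews().Create(ctx, sar, metav1.CreateOptions{})
	if err != nil {
		return false, "", err
	}
	// Reason usually names the ClusterRole/ClusterRoleBinding that granted
	// (or failed to grant) the access, which helps spot a missing binding.
	return resp.Status.Allowed, resp.Status.Reason, nil
}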

Comment 6 Alexander Constantinescu 2020-01-15 16:12:17 UTC
Hi Tomas 

So the node condition on GCP should be removed in "clearInitialNodeNetworkUnavailableCondition" in master.go in ovn-kubernetes upstream (here: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/master.go#L712).

This PR coincides with the creation date of this bug: https://github.com/ovn-org/ovn-kubernetes/pull/994/commits/9554046134700edce05be1c98e8c8021dcc565f9. It could be that whatever was merged there broke us.
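
For context, a minimal sketch of what clearing that condition boils down to (the package and helper names plus the Reason/Message strings are illustrative, not copied from master.go; client-go >= 0.18 is assumed): a strategic-merge patch against nodes/status, which is exactly the access the errors in comment 5 say is forbidden:

package conditionclear

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clearNetworkUnavailable flips a node's NetworkUnavailable condition to
// False via a strategic-merge patch on nodes/status (conditions merge by
// Type). This requires the patch/update verbs on nodes/status granted in the
// ClusterRole quoted in comment 5; without them the API server returns the
// "forbidden" errors seen in the CI logs.
func clearNetworkUnavailable(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []corev1.NodeCondition{{
				Type:               corev1.NodeNetworkUnavailable,
				Status:             corev1.ConditionFalse,
				Reason:             "RouteCreated",      // illustrative strings,
				Message:            "network is usable", // not taken from master.go
				LastTransitionTime: metav1.Now(),
			}},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().PatchStatus(ctx, nodeName, data)
	return err
}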

I would thus say: 

- Familiarize yourself with ovnkube start up, what we do and how
- Have a look at the failing CI jobs
- Have a look at the PR I linked to and figure out if that could be the issue

Get back to me if you have any questions.

/Alex

Comment 11 Vadim Rutkovsky 2020-04-14 16:21:59 UTC
This is blocking GCP CI testing in OKD (as it uses OVN by default), see results in https://github.com/openshift/release/pull/8307

Comment 14 Casey Callendrello 2020-05-19 13:03:07 UTC
I'm hoping that https://github.com/openshift/machine-config-operator/pull/1670 will result in a decent improvement to this; it should merge shortly.

Comment 15 Casey Callendrello 2020-05-25 11:32:48 UTC
That fix merged, and CI looks better, but we're not there yet. Not by a long shot.


Seeing some other CI flakes due to containers holding their ports open; one fix is https://github.com/openshift/cluster-kube-apiserver-operator/pull/864, but that might need to be applied to other containers.

But there are some definite issues. "templateinstance cross-namespace test should create and delete objects across namespaces" fails constantly. It flakes a bit for openshift-sdn, but fails consistently for ovn-k.

Comment 16 Casey Callendrello 2020-05-26 12:55:26 UTC
Just had a few successful runs. So, looking up, though we're not where we need to be.

PR 864 will help things a bit as well.

Comment 19 Stephen Cuppett 2020-06-11 12:18:49 UTC
This isn't a showstopper for 4.5.0 GA at this point. Setting target release to 4.6.0 (the current development branch). For fixes (if any) requested/required on prior versions, clones will be created targeting those z-stream releases as appropriate.