Bug 1788309

Summary: e2e-gcp-ovn failing consistently
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Networking
Assignee: Ben Bennett <bbennett>
Networking sub component: ovn-kubernetes
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WORKSFORME
Severity: high
Priority: high
CC: aconstan, anbhat, bbennett, danw, dcbw, dmellado, eparis, ricarril, scuppett, vrutkovs
Version: 4.3.0
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: SDN-CI-IMPACT
Last Closed: 2020-07-13 14:16:04 UTC
Type: Bug
Bug Blocks: 1812960    

Description Ben Parees 2020-01-06 22:20:09 UTC
Description of problem:
The job is failing consistently; the install fails because the multus admission controller DaemonSet is never scheduled:

level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/multus-admission-controller\" is not yet scheduled on any nodes"

example:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/259

Comment 4 Dan Winship 2020-01-08 15:57:27 UTC
OK, CNO needs to be doing a better job of reporting status here... the problem is that ovn-kubernetes is fully "deployed" (as in, pods are running) but it's not actually *working*, so all the nodes still have `NetworkUnavailable` and no other operators can get deployed. I guess we probably need to detect that.
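
As a rough illustration of the detection comment 4 asks for (this is not CNO code; the package and function names are made up, and client-go >= 0.18 is assumed), a check like the following would tell an operator which nodes still carry NetworkUnavailable=True even though the CNI pods are running:

package netcheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodesWithNetworkUnavailable returns the names of nodes whose
// NetworkUnavailable condition is still True, i.e. nodes that stay tainted
// and cannot receive most workloads even though the network pods are running.
func nodesWithNetworkUnavailable(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var blocked []string
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeNetworkUnavailable && cond.Status == corev1.ConditionTrue {
				blocked = append(blocked, node.Name)
			}
		}
	}
	return blocked, nil
}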

Comment 5 Dan Williams 2020-01-13 15:52:06 UTC
It appears the master is unable to update the Node object to remove the NetworkUnavailable condition that the cloud routes controller adds:

time="2020-01-06T16:35:20Z" level=error msg="Error in updating status on node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"
time="2020-01-06T16:35:20Z" level=error msg="status update failed for local node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"

Not sure why; we have this in our ClusterRole for openshift-ovn-kubernetes:ovn-kubernetes-controller:

- apiGroups: [""]
  resources:
  - "nodes/status"
  verbs:
  - patch
  - update

and in the past that's been enough to make it work...
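
One way to debug this kind of RBAC surprise (not part of the bug itself; the package and function names below are invented, client-go >= 0.18 is assumed, and `kubectl auth can-i update nodes/status --as=system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller` is the CLI equivalent) is to ask the API server directly with a SubjectAccessReview whether that service account may update nodes/status:

package rbaccheck

import (
	"context"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canUpdateNodeStatus asks the API server whether the ovn-kubernetes
// controller service account may update nodes/status, the exact access the
// "forbidden" errors above complain about.
func canUpdateNodeStatus(ctx context.Context, client kubernetes.Interface) (bool, string, error) {
	sar := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			// User string copied verbatim from the error messages.
			User: "system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller",
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Group:       "",
				Resource:    "nodes",
				Subresource: "status",
				Verb:        "update",
			},
		},
	}
	resp, err := client.AuthorizationV1().SubjectAccessReviews().Create(ctx, sar, metav1.CreateOptions{})
	if err != nil {
		return false, "", err
	}
	// Reason usually names the ClusterRole/ClusterRoleBinding that granted
	// (or failed to grant) the access, which helps spot a missing binding.
	return resp.Status.Allowed, resp.Status.Reason, nil
}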

Comment 6 Alexander Constantinescu 2020-01-15 16:12:17 UTC
Hi Tomas 

So the node condition on GCP should be removed in "clearInitialNodeNetworkUnavailableCondition" in master.go in ovn-kubernetes upstream (here: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/master.go#L712).

This PR coincides with the creation date of this bug: https://github.com/ovn-org/ovn-kubernetes/pull/994/commits/9554046134700edce05be1c98e8c8021dcc565f9. It could be that whatever was merged there broke us.
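
For context, a minimal sketch of what clearing that condition boils down to (the package and helper names plus the Reason/Message strings are illustrative, not copied from master.go; client-go >= 0.18 is assumed): a strategic-merge patch against nodes/status, which is exactly the access the errors in comment 5 say is forbidden:

package conditionclear

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clearNetworkUnavailable flips a node's NetworkUnavailable condition to
// False via a strategic-merge patch on nodes/status (conditions merge by
// Type). This requires the patch/update verbs on nodes/status granted in the
// ClusterRole quoted in comment 5; without them the API server returns the
// "forbidden" errors seen in the CI logs.
func clearNetworkUnavailable(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []corev1.NodeCondition{{
				Type:               corev1.NodeNetworkUnavailable,
				Status:             corev1.ConditionFalse,
				Reason:             "RouteCreated",      // illustrative strings,
				Message:            "network is usable", // not taken from master.go
				LastTransitionTime: metav1.Now(),
			}},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().PatchStatus(ctx, nodeName, data)
	return err
}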

I would thus say: 

- Familiarize yourself with ovnkube start up, what we do and how
- Have a look at the failing CI jobs
- Have a look at the PR I linked to and figure out if that could be the issue

Get back to me if you have any questions.

/Alex

Comment 11 Vadim Rutkovsky 2020-04-14 16:21:59 UTC
This is blocking GCP CI testing in OKD (as it uses OVN by default), see results in https://github.com/openshift/release/pull/8307

Comment 14 Casey Callendrello 2020-05-19 13:03:07 UTC
I'm hoping that https://github.com/openshift/machine-config-operator/pull/1670 will result in a decent improvement to this; it should merge shortly.

Comment 15 Casey Callendrello 2020-05-25 11:32:48 UTC
That fix merged, and CI looks better, but we're not there yet. Not by a long shot.


Seeing some other CI flakes due to containers holding their ports open; one fix is https://github.com/openshift/cluster-kube-apiserver-operator/pull/864, but that might need to be applied to other containers.

But there are some definite issues. "templateinstance cross-namespace test should create and delete objects across namespaces" fails constantly. It flakes a bit for openshift-sdn, but fails consistently for ovn-k.

Comment 16 Casey Callendrello 2020-05-26 12:55:26 UTC
Just had a few successful runs. So, looking up, though we're not where we need to be.

PR 864 will help things a bit as well.

Comment 19 Stephen Cuppett 2020-06-11 12:18:49 UTC
This isn't a showstopper for 4.5.0 GA at this point. Setting target release to 4.6.0 (the current development branch). For fixes (if any) requested/required on prior versions, clones will be created targeting those z-stream releases as appropriate.