Bug 1788309
Summary: | e2e-gcp-ovn failing consistently | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> | |
Component: | Networking | Assignee: | Ben Bennett <bbennett> | |
Networking sub component: | ovn-kubernetes | QA Contact: | zhaozhanqi <zzhao> | |
Status: | CLOSED WORKSFORME | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | aconstan, anbhat, bbennett, danw, dcbw, dmellado, eparis, ricarril, scuppett, vrutkovs | |
Version: | 4.3.0 | |||
Target Milestone: | --- | |||
Target Release: | 4.6.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | SDN-CI-IMPACT | |||
: | 1812960 | Environment: | ||
Last Closed: | 2020-07-13 14:16:04 UTC | Type: | Bug | |
Bug Blocks: | 1812960 |
Description
Ben Parees
2020-01-06 22:20:09 UTC
OK, CNO needs to be doing a better job of reporting status here... the problem is that ovn-kubernetes is fully "deployed" (as in, pods are running) but it's not actually *working*, so all the nodes still have `NetworkUnavailable` and no other operators can get deployed. I guess we probably need to detect that.

It appears the master is unable to update the Node object to remove the NodeNetworkNotReady condition that the cloud routes controller adds:

    time="2020-01-06T16:35:20Z" level=error msg="Error in updating status on node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"
    time="2020-01-06T16:35:20Z" level=error msg="status update failed for local node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"

Not sure why; we have this in our ClusterRole for openshift-ovn-kubernetes:ovn-kubernetes-controller:

    - apiGroups: [""]
      resources:
      - "nodes/status"
      verbs:
      - patch
      - update

and in the past that's been enough to make it work...

Hi Tomas

So the node condition on GCP should be removed in "clearInitialNodeNetworkUnavailableCondition" in master.go in upstream ovn-kubernetes (here: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/master.go#L712)

This PR coincides with the creation date of this bug: https://github.com/ovn-org/ovn-kubernetes/pull/994/commits/9554046134700edce05be1c98e8c8021dcc565f9. It could be that whatever was merged there broke us.

I would thus say:
- Familiarize yourself with ovnkube startup, what we do and how
- Have a look at the failing CI jobs
- Have a look at the PR I linked to and figure out if that could be the issue

Get back to me in case you have any questions
/Alex

The bug still exists in all releases.
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.4/
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.5/

This is blocking GCP CI testing in OKD (as it uses OVN by default), see results in https://github.com/openshift/release/pull/8307

This seems to be fixed now:
* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/968
* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.4/1317
* https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.5/675

I'm hoping that https://github.com/openshift/machine-config-operator/pull/1670 will result in a decent improvement to this; it should merge shortly.

That fix merged, and CI looks better, but we're not there yet. Not by a long shot. Seeing some other CI flakes due to containers holding their ports open. One fix is https://github.com/openshift/cluster-kube-apiserver-operator/pull/864 but that might need to be applied to other containers.

But there are some definite issues. "templateinstance cross-namespace test should create and delete objects across namespaces" fails constantly. It flakes a bit for openshift-sdn, but fails consistently for ovn-k.

Just had a few successful runs. So, looking up, though we're not where we need to be. PR 864 will help things a bit as well.

This isn't a showstopper for 4.5.0 GA at this point.

Setting target release to 4.6.0 (the current development branch). For fixes (if any) requested/required on prior versions, clones will be created targeting those z-stream releases as appropriate.
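
For anyone triaging a similar failure: the symptom described in the first comment (pods running, but every node still carrying `NetworkUnavailable`) can be checked from a node listing. The following is a minimal sketch, not anything CNO or ovn-kubernetes actually ships. It assumes the input is a dict shaped like the output of `kubectl get nodes -o json`; the function name `nodes_with_network_unavailable` and the sample data (including the second node name) are made up for illustration.

```python
def nodes_with_network_unavailable(node_list):
    """Return names of nodes whose NetworkUnavailable condition is "True".

    node_list is assumed to be a dict shaped like `kubectl get nodes -o json`.
    """
    stuck = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        for cond in conditions:
            # Condition status values are the strings "True", "False" or "Unknown".
            if cond.get("type") == "NetworkUnavailable" and cond.get("status") == "True":
                stuck.append(node["metadata"]["name"])
                break
    return stuck


# Hypothetical sample data: one node still flagged, one already cleared.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal"},
            "status": {"conditions": [
                {"type": "NetworkUnavailable", "status": "True"},
                {"type": "Ready", "status": "False"},
            ]},
        },
        {
            "metadata": {"name": "example-worker-0"},  # hypothetical node name
            "status": {"conditions": [
                {"type": "NetworkUnavailable", "status": "False"},
                {"type": "Ready", "status": "True"},
            ]},
        },
    ]
}

if __name__ == "__main__":
    for name in nodes_with_network_unavailable(SAMPLE):
        print(name)
```

A non-empty result while the ovn-kubernetes pods report as running would match the situation reported here. It says nothing about the cause, which in this bug was the forbidden `nodes/status` update.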