Description of problem: the job is failing consistently; the install fails because the multus-admission-controller DaemonSet is never scheduled:

level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/multus-admission-controller\" is not yet scheduled on any nodes"

Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/259
OK, the CNO needs to do a better job of reporting status here... the problem is that ovn-kubernetes is fully "deployed" (as in, the pods are running) but it's not actually *working*, so all the nodes still have the `NetworkUnavailable` condition and no other operators can get deployed. I guess we probably need to detect that.
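For reference, a minimal sketch (using client-go; this is not CNO code, just an illustration under that assumption) of the check we'd need: list the nodes and flag any that still carry NetworkUnavailable=True.

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Build an in-cluster client; the CNO already has one of these.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			// NetworkUnavailable=True keeps workloads off the node even
			// though the ovn-kubernetes pods themselves are Running.
			if cond.Type == corev1.NodeNetworkUnavailable && cond.Status == corev1.ConditionTrue {
				fmt.Printf("node %s still NetworkUnavailable: %s\n", node.Name, cond.Message)
			}
		}
	}
}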
It appears the master is unable to update the Node object to clear the NetworkUnavailable condition that the cloud routes controller adds:

time="2020-01-06T16:35:20Z" level=error msg="Error in updating status on node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"
time="2020-01-06T16:35:20Z" level=error msg="status update failed for local node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"

Not sure why; we have this in the ClusterRole for openshift-ovn-kubernetes:ovn-kubernetes-controller:

- apiGroups: [""]
  resources:
  - "nodes/status"
  verbs:
  - patch
  - update

and in the past that's been enough to make it work...
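One way to check whether that grant is actually effective for the service account is to ask the apiserver directly with a SubjectAccessReview (same answer you'd get from `oc auth can-i update nodes/status --as=system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller`). A quick sketch:

package main

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Ask the apiserver: can this service account update nodes/status?
	sar := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			User: "system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller",
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Verb:        "update",
				Resource:    "nodes",
				Subresource: "status",
			},
		},
	}
	resp, err := client.AuthorizationV1().SubjectAccessReviews().Create(context.TODO(), sar, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("allowed=%v reason=%q\n", resp.Status.Allowed, resp.Status.Reason)
}

If that comes back allowed=false, the rule above isn't reaching the service account (e.g. a missing or mis-targeted ClusterRoleBinding).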
Hi Tomas,

So the node condition on GCP should be removed by "clearInitialNodeNetworkUnavailableCondition" in master.go in the ovn-kubernetes upstream (here: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/master.go#L712). This PR coincides with this bug's creation date: https://github.com/ovn-org/ovn-kubernetes/pull/994/commits/9554046134700edce05be1c98e8c8021dcc565f9. It could be that whatever was merged there broke us.

I would thus say:
- Familiarize yourself with ovnkube startup, what we do and how (a sketch of the relevant step is below)
- Have a look at the failing CI jobs
- Have a look at the PR I linked to and figure out if that could be the issue

Get back to me in case you have any questions
/Alex
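For orientation, here is roughly what that step does; a simplified sketch of the upstream function, not a copy of it. The GCP cloud routes controller sets NetworkUnavailable=True on new nodes, and ovn-kubernetes has to flip it back to False once the node's network is ready:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clearNetworkUnavailable is a simplified sketch of what
// clearInitialNodeNetworkUnavailableCondition is responsible for.
func clearNetworkUnavailable(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for i, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeNetworkUnavailable && cond.Status != corev1.ConditionFalse {
			node.Status.Conditions[i].Status = corev1.ConditionFalse
			node.Status.Conditions[i].Reason = "RouteCreated"
			node.Status.Conditions[i].Message = "ovn-kube cleared the condition"
			node.Status.Conditions[i].LastTransitionTime = metav1.Now()
			// This UpdateStatus call is the one that is forbidden in the
			// logs above when the nodes/status RBAC grant isn't effective.
			_, err = client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
			return err
		}
	}
	return nil
}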
The bug still exists in all releases:
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.4/
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.5/
This is blocking GCP CI testing in OKD (which uses OVN by default); see the results in https://github.com/openshift/release/pull/8307
This seems to be fixed now: * https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.3/968 * https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.4/1317 * https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.5/675
I'm hoping that https://github.com/openshift/machine-config-operator/pull/1670 will result in a decent improvement to this; it should merge shortly.
That fix merged, and CI looks better, but we're not there yet, not by a long shot. We're seeing some other CI flakes caused by containers holding their ports open; one fix is https://github.com/openshift/cluster-kube-apiserver-operator/pull/864, but the same change may need to be applied to other containers. Beyond the flakes there are some definite issues: "templateinstance cross-namespace test should create and delete objects across namespaces" fails constantly. It flakes occasionally with openshift-sdn, but fails consistently with ovn-kubernetes.
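I haven't dug into exactly what PR 864 changes, but for the "holding their ports open" class of flake, the usual Go-side mitigation is to set SO_REUSEADDR on the listener so a replacement process can rebind the port while sockets from the previous instance are still draining. A purely illustrative sketch, not taken from that PR:

package sketch

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReuse binds addr with SO_REUSEADDR set, allowing a restarted
// container to take over the port while old connections are in TIME_WAIT.
func listenReuse(ctx context.Context, addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEADDR, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(ctx, "tcp", addr)
}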
Just had a few successful runs, so things are looking up, though we're not where we need to be. PR 864 will help things a bit as well.
This isn't a showstopper for 4.5.0 GA at this point. Setting target release to 4.6.0 (the current development branch). For fixes (if any) requested/required on prior versions, clones will be created targeting those z-stream releases as appropriate.