Description of problem:
The job is failing consistently; the install fails because the multus admission controller DaemonSet is never scheduled:
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/multus-admission-controller\" is not yet scheduled on any nodes"
OK, CNO needs to be doing a better job of reporting status here... the problem is that ovn-kubernetes is fully "deployed" (as in, pods are running) but it's not actually *working*, so all the nodes still have `NetworkUnavailable` and no other operators can get deployed. I guess we probably need to detect that.
It appears the master is unable to update the Node object to clear the NetworkUnavailable condition that the cloud routes controller adds:
time="2020-01-06T16:35:20Z" level=error msg="Error in updating status on node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"
time="2020-01-06T16:35:20Z" level=error msg="status update failed for local node ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal: nodes \"ci-op-694p6-m-1.c.openshift-gce-devel-ci.internal\" is forbidden: User \"system:serviceaccount:openshift-ovn-kubernetes:ovn-kubernetes-controller\" cannot update resource \"nodes/status\" in API group \"\" at the cluster scope"
Not sure why; we have this in our ClusterRole for openshift-ovn-kubernetes:ovn-kubernetes-controller:
- apiGroups: [""]
and in the past that's been enough to make it work...
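For comparison, a rule that would actually grant the failing call has to name the nodes/status subresource and the update verb explicitly. A minimal sketch of such a rule (the exact verb list here is illustrative; the authoritative rule lives in cluster-network-operator's manifests):

- apiGroups: [""]
  resources: ["nodes", "nodes/status"]
  verbs: ["get", "list", "watch", "patch", "update"]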
So the node condition on GCP should be cleared in `clearInitialNodeNetworkUnavailableCondition` in master.go in upstream ovn-kubernetes (here: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/master.go#L712)
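For context, clearing that condition boils down to flipping NetworkUnavailable to False on the Node and writing it back through the nodes/status subresource, which is exactly the call the RBAC error above rejects. A minimal client-go sketch of the idea (simplified; the function name is mine, the reason/message strings are illustrative, and the real upstream code also retries on update conflicts):

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clearNetworkUnavailable flips the NetworkUnavailable condition to False on
// the given node. The UpdateStatus call at the end is what requires "update"
// on nodes/status -- the permission the error above says is missing.
func clearNetworkUnavailable(client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type != corev1.NodeNetworkUnavailable {
			continue
		}
		if node.Status.Conditions[i].Status == corev1.ConditionFalse {
			return nil // already cleared, nothing to do
		}
		node.Status.Conditions[i].Status = corev1.ConditionFalse
		node.Status.Conditions[i].Reason = "RouteCreated" // illustrative values
		node.Status.Conditions[i].Message = "ovn-kube cleared kubelet-set NoRouteCreated"
		node.Status.Conditions[i].LastTransitionTime = metav1.Now()
		break
	}
	_, err = client.CoreV1().Nodes().UpdateStatus(context.TODO(), node, metav1.UpdateOptions{})
	return err
}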
This bug's creation date coincides with this PR: https://github.com/ovn-org/ovn-kubernetes/pull/994/commits/9554046134700edce05be1c98e8c8021dcc565f9. It could be that whatever was merged there broke us.
I would thus say:
- Familiarize yourself with ovnkube startup: what we do and how
- Have a look at the failing CI jobs
- Have a look at the PR I linked to and figure out if that could be the issue
Get back to me if you have any questions
The bug still exists in all releases.
This is blocking GCP CI testing in OKD (as it uses OVN by default), see results in https://github.com/openshift/release/pull/8307
This seems to be fixed now:
I'm hoping that https://github.com/openshift/machine-config-operator/pull/1670 will result in a decent improvement to this; it should merge shortly.
That fix merged, and CI looks better, but we're not there yet. Not by a long shot.
Seeing some other CI flakes due to containers holding their ports open; one fix is https://github.com/openshift/cluster-kube-apiserver-operator/pull/864, but that might need to be applied to other containers.
But there are some definite issues. "templateinstance cross-namespace test should create and delete objects across namespaces" fails constantly. It flakes a bit for openshift-sdn, but fails consistently for ovn-k.
Just had a few successful runs. So, looking up, though we're not where we need to be.
PR 864 will help things a bit as well.
This isn't a showstopper for 4.5.0 GA at this point. Setting target release to 4.6.0 (the current development branch). For fixes (if any) requested/required on prior versions, clones will be created targeting those z-stream releases as appropriate.