The CNO's StatusManager is not thread-safe, but it is used from multiple reconcile loops, which occasionally results in:

2019/08/02 06:56:13 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again

There are potentially worse problems with the unsynchronized access to various fields. It is not a completely trivial fix, since various public StatusManager methods call other public methods, so you can't just grab a mutex at the start of each of them. Set() could probably be made internal (so that it assumes the caller already holds the mutex), but there are still other issues, like SetFromPods() calling SetNotDegraded(). (Hm... except nothing ever calls SetDegraded(PodDeployment) any more anyway, so I guess that could just go away.)
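The locking pattern hinted at above could look roughly like the following sketch. All names here are illustrative, not the actual CNO code: public entry points take the mutex exactly once, then delegate to an unexported helper that assumes the lock is already held, so public methods never double-lock by calling each other (Go's sync.Mutex is not reentrant).

```go
// Hypothetical sketch of the "internal set()" idea; not the real
// StatusManager implementation.
package main

import (
	"fmt"
	"sync"
)

type StatusManager struct {
	mu         sync.Mutex
	conditions map[string]string
}

// Set is a public entry point: it owns the lock.
func (s *StatusManager) Set(condType, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.set(condType, status)
}

// SetNotDegraded also locks once, then reuses the internal helper
// instead of calling the public Set (which would try to re-lock
// the non-reentrant mutex and deadlock).
func (s *StatusManager) SetNotDegraded() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.set("Degraded", "False")
}

// set assumes s.mu is held by the caller.
func (s *StatusManager) set(condType, status string) {
	if s.conditions == nil {
		s.conditions = map[string]string{}
	}
	s.conditions[condType] = status
}

func main() {
	sm := &StatusManager{}
	sm.Set("Available", "True")
	sm.SetNotDegraded()
	fmt.Println(sm.conditions["Degraded"]) // False
}
```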
Hi,

PR on-going: https://github.com/openshift/cluster-network-operator/pull/279

Setting status to POST.

Brgds,
Alex
@Alexander: you have to wait for QE to move the bug past ON_QA
Hi Alexander, I think it's difficult for QE to verify this bug. Could you suggest some steps? Thanks.
I tried on 4.2.0-0.nightly-2019-08-27-222856 and still see these errors in the CNO pod logs:

2019/08/28 07:21:30 Reconciling update to openshift-multus/multus-admission-controller
2019/08/28 07:21:31 Updated ClusterOperator with conditions:
- lastTransitionTime: "2019-08-28T05:54:46Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2019-08-28T05:54:46Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2019-08-28T07:21:30Z"
  status: "False"
  type: Progressing
- lastTransitionTime: "2019-08-28T07:21:30Z"
  status: "True"
  type: Available
2019/08/28 07:21:31 Reconciling update to openshift-sdn/sdn
2019/08/28 07:21:31 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/08/28 07:21:31 Reconciling update to openshift-sdn/sdn-controller
2019/08/28 07:21:31 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/08/28 07:21:31 Reconciling update to openshift-multus/multus
2019/08/28 07:21:31 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/08/28 07:21:31 Reconciling update to openshift-sdn/ovs
2019/08/28 07:21:31 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/08/28 07:21:33 Reconciling proxy 'cluster'
2019/08/28 07:21:33 httpProxy, httpsProxy and noProxy not defined for proxy 'cluster'; validation will be skipped
2019/08/28 07:21:33 Reconciling Network.config.openshift.io cluster
2019/08/28 07:21:33 Reconciling proxy 'cluster' complete
2019/08/28 07:21:34 Reconciling additional trust bundle configmap 'openshift-config/etcd-serving-ca'
2019/08/28 07:21:34 configmap 'openshift-config/etcd-serving-ca' name differs from trustedCA of proxy 'cluster' or trusted

But in the end the clusteroperator network works:

# oc get co network
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.nightly-2019-08-27-222856   True        False         False      9m19s
Hi,

The logs you are seeing do not arise from the status manager being thread-unsafe. They arise from the fact that we retrieve a stale version of the ClusterOperator object from the API server cache when we try to update it. I can understand how this is confusing, however. I am currently working on submitting a PR which should fix the issue concerning stale objects and avoid having logs like that printed.

You can flag this bug as not fixed; I will update it with a new PR.

Best regards,
Alexander
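The fix described here amounts to the standard Kubernetes read-modify-write retry: on a conflict, re-fetch the latest object instead of reusing a stale copy, then try the update again (client-go packages this as retry.RetryOnConflict). Below is a minimal self-contained sketch of that pattern; the get/update functions are stand-ins simulating a stale cache read and the API server's 409 conflict, not the real CNO code.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's 409 "object has been
// modified" error (client-go detects it with apierrors.IsConflict).
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// updateStatus retries the read-modify-write cycle, fetching the
// latest object on every attempt instead of reusing a stale copy.
func updateStatus(get func() (int, error), update func(rv int) error, maxRetries int) error {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		rv, err := get() // always fetch the latest resourceVersion
		if err != nil {
			return err
		}
		lastErr = update(rv)
		if !errors.Is(lastErr, errConflict) {
			return lastErr // success, or a non-conflict error
		}
	}
	return lastErr
}

func main() {
	serverRV := 5 // resourceVersion held by the API server
	staleReads := 2
	get := func() (int, error) {
		if staleReads > 0 { // simulate a stale cache read
			staleReads--
			return serverRV - 1, nil
		}
		return serverRV, nil
	}
	update := func(rv int) error {
		if rv != serverRV {
			return errConflict // 409: caller had a stale version
		}
		serverRV++
		return nil
	}
	fmt.Println(updateStatus(get, update, 5)) // <nil> once a fresh read succeeds
}
```

In the real operator the get would be a client call against the API server (bypassing or resyncing the cache) and the update an UpdateStatus on the ClusterOperator.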
Hi,

Concerning comment #4: one way to test this is to chaos-monkey the network components, i.e. delete lots of sdn/multus pods and perform updates to networks.operator.openshift.io / networks.config.openshift.io. When a lot of things happen at the same time, the different reconciliation loops governing the state of the ClusterNetworkOperator will try to perform many updates, increasing the likelihood of seeing logs like that printed.

You already seem to be triggering behavior like that, given your comment #5, but contact me if you have issues doing it.

Best regards,
Alexander
Hi Alexander Constantinescu,

Thanks for the information in https://bugzilla.redhat.com/show_bug.cgi?id=1737026#c7. I tried to verify this bug on 4.2.0-0.nightly-2019-09-05-234433 with these steps, and the issue can still be reproduced:

1. Delete the daemonsets in multus and sdn:

oc delete ds multus -n openshift-multus
oc delete ds sdn -n openshift-sdn

2. Check the logs of the CNO pod:

# oc logs network-operator-7f7db68f8f-hbp7r -n openshift-network-operator | grep "Operation cannot be fulfilled on cluster"
2019/09/06 02:53:17 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/09/06 03:34:54 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/09/06 03:35:05 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/09/06 03:50:30 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
Hi,

I really don't understand how you can be seeing that now. There is only one thread performing changes to this object at any given moment, and this thread also ensures that the object is updated correctly in the API server. I tested locally on my computer this morning with version:

network-operator:
  Container ID: cri-o://f5ef956079188b0f4e2a73cc44c306506b7e8cecdaa4779981484561a838f55f
  Image: registry.svc.ci.openshift.org/origin/4.2-2019-09-06-042801@sha256:974be7ecf74a04a1bee218d263867ad534032c5882e483c1df4ae5fcc3772444
  Image ID: registry.svc.ci.openshift.org/origin/4.2-2019-09-06-042801@sha256:974be7ecf74a04a1bee218d263867ad534032c5882e483c1df4ae5fcc3772444

and I don't see the problem happening. Could you please tell me which version of the network-operator you are running, so that I can debug locally, or provide me with a KUBECONFIG?

Best regards,
Alexander
Hi again,

OK, I should have noticed this before: it's impossible that this nightly version incorporates my changes from PR https://github.com/openshift/cluster-network-operator/pull/303/files. The log lines "Failed to update ClusterOperator..." were removed by my PR; please see the file in the link above. The only logs we print now, in case we fail, have the format "Failed to set ClusterOperator...".

I would say, try again on Monday with the new nightly release. If you still see logs such as "Failed to update ClusterOperator...", then it means we have a problem with our CI for github.com/openshift/cluster-network-operator.

Best regards,
Alexander
Thanks Alexander, you are right. I tried this on the latest nightly build (4.2.0-0.nightly-2019-09-08-180038); following the same steps, the logs above no longer appear.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922