Bug 1737026 - CNO StatusManager is not thread-safe
Summary: CNO StatusManager is not thread-safe
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-02 11:06 UTC by Dan Winship
Modified: 2019-10-16 06:34 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:34:33 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 279 0 None closed Bug 1737026: Making set internal to the statusManager and adding mutex for thread safety 2021-02-15 08:58:17 UTC
Github openshift cluster-network-operator pull 303 0 None closed Bug 1737026: Use retry mechanism for fixing updates on stale objects in cache 2021-02-15 08:58:17 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:34:42 UTC

Description Dan Winship 2019-08-02 11:06:04 UTC
The CNO's StatusManager is not thread-safe, but is used from multiple reconcile loops, resulting in occasional:

2019/08/02 06:56:13 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again

Of course, there are potentially worse problems with the unsynchronized access to its various fields.

It's not a completely trivial fix, since various public StatusManager methods call other public methods, so you can't just grab a mutex at the start of each of them. Set() could probably be made internal (so that it assumes you already hold the mutex when you call it), but there are still other issues, like SetFromPods() calling SetNotDegraded(). (Hm... except nothing ever calls "SetDegraded(PodDeployment)" anymore anyway, so I guess that could just go away.)
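
A minimal Go sketch of that locking pattern (illustrative only, not the actual cluster-network-operator code; the types, fields, and method set are simplified assumptions): exported methods take a mutex and delegate to an internal set() that requires the lock to be held, so they can safely reuse each other's logic.

// Minimal sketch, not the real CNO code: all names below are illustrative.
package statusmanager

import "sync"

type StatusLevel int

const (
    OperatorDegraded StatusLevel = iota
    PodDeployment
)

type StatusManager struct {
    mu      sync.Mutex
    failing map[StatusLevel]string
}

func New() *StatusManager {
    return &StatusManager{failing: map[StatusLevel]string{}}
}

// SetDegraded is exported: it takes the lock, records the failure, and
// delegates to the internal set().
func (s *StatusManager) SetDegraded(level StatusLevel, reason string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.failing[level] = reason
    s.set()
}

// SetNotDegraded clears a failure level under the same lock, so callers
// never need to take the mutex twice.
func (s *StatusManager) SetNotDegraded(level StatusLevel) {
    s.mu.Lock()
    defer s.mu.Unlock()
    delete(s.failing, level)
    s.set()
}

// set recomputes and writes the ClusterOperator conditions. It must only be
// called with s.mu held, which is what lets the exported methods call it freely.
func (s *StatusManager) set() {
    // ...build conditions from s.failing and push them to the ClusterOperator...
}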

Comment 1 Alexander Constantinescu 2019-08-08 17:04:12 UTC
Hi

PR on-going: https://github.com/openshift/cluster-network-operator/pull/279 

Setting status to POST

Brgds,
Alex

Comment 3 Dan Winship 2019-08-26 12:07:23 UTC
@Alexander: you have to wait for QE to move the bug past ON_QA

Comment 4 zhaozhanqi 2019-08-28 03:09:10 UTC
hi Alexander
 I think this bug is difficult for QE to verify. Could you give some suggestions for verification steps? Thanks.

Comment 5 zhaozhanqi 2019-08-28 07:31:44 UTC
I tried on 4.2.0-0.nightly-2019-08-27-222856

I still see these errors in the CNO pod logs:

2019/08/28 07:21:30 Reconciling update to openshift-multus/multus-admission-controller
2019/08/28 07:21:31 Updated ClusterOperator with conditions:
- lastTransitionTime: "2019-08-28T05:54:46Z"
  status: "False"
  type: Degraded
- lastTransitionTime: "2019-08-28T05:54:46Z"
  status: "True"
  type: Upgradeable
- lastTransitionTime: "2019-08-28T07:21:30Z"
  status: "False"
  type: Progressing
- lastTransitionTime: "2019-08-28T07:21:30Z"
  status: "True"
  type: Available
2019/08/28 07:21:31 Reconciling update to openshift-sdn/sdn
2019/08/28 07:21:31 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/08/28 07:21:31 Reconciling update to openshift-sdn/sdn-controller
2019/08/28 07:21:31 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/08/28 07:21:31 Reconciling update to openshift-multus/multus
2019/08/28 07:21:31 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/08/28 07:21:31 Reconciling update to openshift-sdn/ovs
2019/08/28 07:21:31 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/08/28 07:21:33 Reconciling proxy 'cluster'
2019/08/28 07:21:33 httpProxy, httpsProxy and noProxy not defined for proxy 'cluster'; validation will be skipped
2019/08/28 07:21:33 Reconciling Network.config.openshift.io cluster
2019/08/28 07:21:33 Reconciling proxy 'cluster' complete
2019/08/28 07:21:34 Reconciling additional trust bundle configmap 'openshift-config/etcd-serving-ca'
2019/08/28 07:21:34 configmap 'openshift-config/etcd-serving-ca' name differs from trustedCA of proxy 'cluster' or trusted



but in the end the clusteroperator network works:

# oc get co network
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
network   4.2.0-0.nightly-2019-08-27-222856   True        False         False      9m19s
[root@dhcp-140-66 ~]#

Comment 6 Alexander Constantinescu 2019-08-28 11:50:30 UTC
Hi 

The logs you are seeing do not arise from the status manager being thread-unsafe. They arise from the fact that we are retrieving a stale version of the ClusterOperator object from the API server cache when we try to update it.

I can understand how this is confusing, however. I am currently working on a PR which should fix the issue concerning stale objects and avoid having logs like that printed.

You can flag this bug as not fixed. I will update it with a new PR.

Best regards
Alexander
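
A minimal Go sketch of that conflict-retry approach (illustrative only, not the code from PR 303; the ClusterOperator type and coClient interface are simplified stand-ins for the real OpenShift config client): each update attempt re-fetches the latest object, so a conflict is retried against a fresh copy instead of being surfaced as an error log.

// Illustrative sketch only; the types and client interface are assumptions.
package statusmanager

import "k8s.io/client-go/util/retry"

type ClusterOperator struct {
    ResourceVersion string
    Conditions      []string
}

// coClient is a hypothetical minimal client for ClusterOperator objects.
type coClient interface {
    Get(name string) (*ClusterOperator, error)
    Update(co *ClusterOperator) error
}

// setConditions re-fetches the latest object on every attempt, so an
// "Operation cannot be fulfilled ... the object has been modified" error
// leads to another attempt against a fresh copy instead of a log line.
func setConditions(c coClient, name string, conditions []string) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        co, err := c.Get(name) // always start from the latest version
        if err != nil {
            return err
        }
        co.Conditions = conditions
        return c.Update(co) // a Conflict error here triggers another attempt
    })
}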

Comment 7 Alexander Constantinescu 2019-08-28 11:54:35 UTC
Hi

Concerning comment #4: one way to test this is to chaos-monkey the network components, i.e. delete tons of sdn/multus pods and perform updates to networks.operator.openshift.io/networks.config.openshift.io. When a lot of things happen at the same time, the different reconciliation loops we have governing the state of the ClusterNetworkOperator should try to perform a lot of updates, thus increasing the likelihood of such logs being printed.

You seem to already be triggering a behavior like that given your comment #5, but contact me in case you have issues doing that.

Best regards,
Alexander

Comment 9 zhaozhanqi 2019-09-06 05:08:12 UTC
hi, Alexander Constantinescu 

Thanks for the information in https://bugzilla.redhat.com/show_bug.cgi?id=1737026#c7

I tried to verify this bug on 4.2.0-0.nightly-2019-09-05-234433 with the following steps; the issue can still be reproduced. See:

1. Delete the ds in multus and sdn 
  oc delete ds multus -n openshift-multus
  oc delete ds sdn -n openshift-sdn

2. Check the logs of CNO pod:
# oc logs network-operator-7f7db68f8f-hbp7r -n openshift-network-operator | grep "Operation cannot be fulfilled on cluster"
2019/09/06 02:53:17 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/09/06 03:34:54 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/09/06 03:35:05 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again
2019/09/06 03:50:30 Failed to update ClusterOperator "network": Operation cannot be fulfilled on clusteroperators.config.openshift.io "network": the object has been modified; please apply your changes to the latest version and try again

Comment 10 Alexander Constantinescu 2019-09-06 09:11:35 UTC
Hi 

I really don't understand how you can be seeing that now. There is only one thread performing changes to this object at any given moment in time, and this thread also ensures that the object is updated correctly in the API server.

I tested locally on my computer this morning with version:

network-operator:
    Container ID:  cri-o://f5ef956079188b0f4e2a73cc44c306506b7e8cecdaa4779981484561a838f55f
    Image:         registry.svc.ci.openshift.org/origin/4.2-2019-09-06-042801@sha256:974be7ecf74a04a1bee218d263867ad534032c5882e483c1df4ae5fcc3772444
    Image ID:      registry.svc.ci.openshift.org/origin/4.2-2019-09-06-042801@sha256:974be7ecf74a04a1bee218d263867ad534032c5882e483c1df4ae5fcc3772444

I don't see the problem happening. 

Could you please tell me which version of the network-operator you are running, so that I can debug locally, or provide me with a KUBECONFIG?

Best regards
Alexander

Comment 11 Alexander Constantinescu 2019-09-06 15:42:30 UTC
Hi again

OK, I should have noticed this before, but this nightly version cannot be the one incorporating my changes from PR https://github.com/openshift/cluster-network-operator/pull/303/files.

The log lines "Failed to update ClusterOperator..." were removed by my PR; please see the file in the link above.

The only logs we print now, in case we fail, have the format: "Failed to set ClusterOperator..."

I would say try again on Monday with the new nightly release. If you still see logs such as "Failed to update ClusterOperator...", then it means we have a problem with our CI for github.com/openshift/cluster-network-operator.

Best regards,
Alexander

Comment 12 zhaozhanqi 2019-09-09 02:18:36 UTC
Thanks Alexander, you are right. I tried this on the latest nightly build (4.2.0-0.nightly-2019-09-08-180038); the logs above no longer appear when following the same steps.

Comment 13 errata-xmlrpc 2019-10-16 06:34:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

