Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1830809

Summary: Console operator may not report accurate status due to an update race
Product: OpenShift Container Platform Reporter: Maru Newby <mnewby>
Component: Management Console Assignee: Jakub Hadvig <jhadvig>
Status: CLOSED ERRATA QA Contact: Yadan Pei <yapei>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.5 CC: aos-bugs, bpeterse, jhadvig, jokerman, sttts, xiaocwan, yapei
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The console operator was writing its conditions directly to its config. Consequence: Race conditions occurred in the console operator's status conditions in its config. Fix: Use v1helpers.UpdateStatus() from library-go to update the operator's status. Result: The console operator reports accurate status in its config.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:34:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description    Flags
console cluster operator status Available    none
version is successfully deployed    none

Description Maru Newby 2020-05-03 23:45:43 UTC
While debugging CI failures for the 1.18.2 rebase, the console operator was observed in an available=false/progressing=true state for an extended period of time (30+ minutes) despite the managed resources appearing healthy. I checked the operator log and the following entry was repeated approximately once per second:

E0503 16:17:08.883797       1 status.go:121] status update error: Operation cannot be fulfilled on consoles.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again

A survey of the operator code revealed 3 separate sync loops (clidownloads, service, operator) writing status via the SyncStatus method: 

https://github.com/openshift/console-operator/blob/master/pkg/console/status/status.go#L116

It would appear that these 3 controllers are racing each other. I ran a modified operator against the same cluster - with the calls to SyncStatus in the service and clidownloads controllers commented out - and the operator status was updated to available=true/progressing=false in ~20s.
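The conflict can be reproduced in miniature. The following is only a sketch of optimistic concurrency, assuming a hypothetical in-memory `store` in place of the API server and a simplified `config` type in place of the console operator config; the condition name `ServiceSyncDegraded` is illustrative. Two sync loops read the same version of the object, and the second write is rejected because its copy is stale:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's conflict response.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// config is a hypothetical, simplified stand-in for the operator config.
type config struct {
	resourceVersion int
	conditions      map[string]string
}

// deepCopy returns a copy that shares no map with the receiver.
func (c config) deepCopy() config {
	out := config{resourceVersion: c.resourceVersion, conditions: map[string]string{}}
	for k, v := range c.conditions {
		out.conditions[k] = v
	}
	return out
}

// store models the API server: a write is accepted only when the caller's
// copy carries the current resourceVersion.
type store struct {
	current config
}

func (s *store) get() config { return s.current.deepCopy() }

func (s *store) updateStatus(c config) error {
	if c.resourceVersion != s.current.resourceVersion {
		return errConflict // the caller worked from a stale copy
	}
	next := c.deepCopy()
	next.resourceVersion++
	s.current = next
	return nil
}

func main() {
	s := &store{current: config{resourceVersion: 1, conditions: map[string]string{}}}

	// Two sync loops read the same version of the object.
	fromService := s.get()
	fromOperator := s.get()

	fromService.conditions["ServiceSyncDegraded"] = "False"
	fromOperator.conditions["Available"] = "True"

	fmt.Println("service write:", s.updateStatus(fromService))   // accepted
	fmt.Println("operator write:", s.updateStatus(fromOperator)) // rejected as stale
}
```

If the rejected writer neither checks the error nor retries against a fresh copy, its conditions are never recorded, which matches the stuck status described above.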

To minimize the potential for conflicting updates to prevent the timely reporting of accurate status, consider updating the controllers to set operator status via library-go's UpdateStatus method:

https://github.com/openshift/library-go/blob/master/pkg/operator/v1helpers/helpers.go#L154

UpdateStatus accepts a set of functions that apply the conditions computed by a sync loop, and repeatedly tries to apply them to the current state of the status resource. Its use is recommended when multiple actors in an operator need to set status. The changes required would involve collecting condition changes rather than setting them directly on the resource to be updated, as per a similar change recently merged to the auth operator:

https://github.com/openshift/cluster-authentication-operator/pull/269 
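The pattern described above can be sketched as follows. This is a simplified model and not library-go's code: `fakeClient`, `status`, `updateFunc`, `setCondition`, the `interfere` hook, and the `maxAttempts` parameter are all hypothetical names introduced for illustration, and the real UpdateStatus has a different signature and talks to the API server. The point is that each sync loop hands over functions describing its condition changes, and the helper re-reads the latest status before every attempt:

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: object has been modified")

// status is a hypothetical, simplified operator status.
type status struct {
	resourceVersion int
	conditions      map[string]string
}

func (s status) deepCopy() status {
	out := status{resourceVersion: s.resourceVersion, conditions: map[string]string{}}
	for k, v := range s.conditions {
		out.conditions[k] = v
	}
	return out
}

// updateFunc applies one computed change to a fresh copy of the status.
type updateFunc func(*status)

// fakeClient models the API server with optimistic concurrency.
type fakeClient struct {
	current   status
	interfere func() // optional hook simulating a competing writer; fires once
}

func (c *fakeClient) get() status { return c.current.deepCopy() }

func (c *fakeClient) update(s status) error {
	if hook := c.interfere; hook != nil {
		c.interfere = nil
		hook()
	}
	if s.resourceVersion != c.current.resourceVersion {
		return errConflict
	}
	next := s.deepCopy()
	next.resourceVersion++
	c.current = next
	return nil
}

// setCondition collects a condition change instead of writing it directly.
func setCondition(name, value string) updateFunc {
	return func(s *status) { s.conditions[name] = value }
}

// updateStatus re-reads the latest status, applies every function, and
// writes the result, retrying only when the write hits a conflict.
func updateStatus(c *fakeClient, maxAttempts int, fns ...updateFunc) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		s := c.get()
		for _, fn := range fns {
			fn(&s)
		}
		err := c.update(s)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, errConflict)
}

func main() {
	c := &fakeClient{current: status{resourceVersion: 1, conditions: map[string]string{}}}
	// Another sync loop writes between our read and our write.
	c.interfere = func() {
		s := c.get()
		s.conditions["ServiceSyncDegraded"] = "False"
		if err := c.update(s); err != nil {
			panic(err)
		}
	}
	err := updateStatus(c, 5, setCondition("Available", "True"), setCondition("Progressing", "False"))
	fmt.Println("error:", err)
	fmt.Println("conditions:", c.current.conditions)
}
```

Because the functions are reapplied to a freshly read copy on each attempt, the competing writer's condition survives alongside the new ones instead of being overwritten or causing a lost update.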

It also appears that no caller of SyncStatus checks the returned error. This suggests the addition of a golint check to the verify suite to avoid this and other common sources of mechanical error.

Comment 3 Yadan Pei 2020-05-12 05:37:52 UTC
> While debugging ci failures for the 1.18.2 rebase

Hi, I want to confirm: where did you find the failure log of console-operator?

Comment 4 Maru Newby 2020-05-12 05:42:30 UTC
(In reply to Yadan Pei from comment #3)
> > While debugging ci failures for the 1.18.2 rebase
> 
> Hi, I want to confirm: where did you find the failure log of console-operator?

The reported failures were observed in the pod log of the operator. Note that verification of this bz requires that the 1.18.2 rebase be completed, but it is still in progress. If a 4.5 cluster based on 1.18.2 is deployed healthy, then the problem reported in the bz has been fixed.

Comment 5 XiaochuanWang 2020-05-15 08:32:32 UTC
Created attachment 1688809 [details]
console cluster operator status Available

Comment 6 XiaochuanWang 2020-05-15 08:33:06 UTC
Created attachment 1688810 [details]
version is successfully deployed

Comment 7 XiaochuanWang 2020-05-15 08:35:52 UTC
Thank you Maru!
The latest 4.5 cluster based on 1.18.2 now deploys healthy; screenshots are attached for reference.
So this bug can be marked Verified on the version below:

OpenShift Version    4.5.0-0.nightly-2020-05-15-011814
Kubernetes Version    v1.18.2
Channel    stable-4.5

Comment 8 errata-xmlrpc 2020-07-13 17:34:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409