1830809 – Console operator may not report accurate status due to an update race

Summary: Console operator may not report accurate status due to an update race

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Management Console
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Jakub Hadvig
QA Contact:	Yadan Pei
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-05-03 23:45 UTC by Maru Newby
Modified:	2020-07-13 17:34 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The console operator writes its conditions directly to its config. Consequence: Race conditions occur in the console operator's status conditions in its config. Fix: Use helperv1.UpdateStatus() from library-go to update the operator's status. Result: The console operator reports accurate status in its config.
Clone Of:
Environment:
Last Closed:	2020-07-13 17:34:13 UTC
Target Upstream Version:
Embargoed:

Attachments
console cluster operator status Available (123.92 KB, image/png) 2020-05-15 08:32 UTC, XiaochuanWang	no flags
version is successfully deployed (72.32 KB, image/png) 2020-05-15 08:33 UTC, XiaochuanWang	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift console-operator pull 424	0	None	closed	Bug 1830809: Use helperv1.UpdateStatus() for updating operators status	2020-06-24 01:34:07 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:34:34 UTC

Description Maru Newby 2020-05-03 23:45:43 UTC

While debugging ci failures for the 1.18.2 rebase, the console operator was observed in an available=false/progressing=true state for an extended period of time (30+m) despite the managed resources appearing healthy. I checked the operator log and the following entry was repeated approximately once per second:

E0503 16:17:08.883797       1 status.go:121] status update error: Operation cannot be fulfilled on consoles.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again

A survey of the operator code revealed 3 separate sync loops (clidownloads, service, operator) writing status via the SyncStatus method: 

https://github.com/openshift/console-operator/blob/master/pkg/console/status/status.go#L116

It would appear that these 3 controllers are racing each other. I ran a modified operator against the same cluster - with the calls to SyncStatus in the service and clidownloads controllers commented out - and the operator status was updated to available=true/progressing=false in ~20s.

To minimize the potential for conflicting updates to prevent the timely reporting of accurate status, consider updating the controllers to set operator status via library-go's UpdateStatus method:

https://github.com/openshift/library-go/blob/master/pkg/operator/v1helpers/helpers.go#L154

UpdateStatus accepts a set of functions that apply the conditions computed by a sync loop, and repeatedly tries to apply them to the current state of the status resource. Its use is recommended when multiple actors in an operator need to set status. The changes required would involve collecting condition changes rather than setting them directly on the resource to be updated, as per a similar change recently merged to the auth operator:

https://github.com/openshift/cluster-authentication-operator/pull/269 

It also appears that no caller of SyncStatus checks the returned error. This suggests the addition of a golint check to the verify suite to avoid this and other common sources of mechanical error.

Comment 3 Yadan Pei 2020-05-12 05:37:52 UTC

> While debugging ci failures for the 1.18.2 rebase

Hi, I want to confirm where you found the failure log of console-operator.

Comment 4 Maru Newby 2020-05-12 05:42:30 UTC

(In reply to Yadan Pei from comment #3)
> > While debugging ci failures for the 1.18.2 rebase
> 
> Hi, I want to confirm where you found the failure log of console-operator.

The reported failures were observed in the pod log of the operator. Note that verification of this bz requires that the 1.18.2 rebase be completed, but it is still in progress. If a 4.5 cluster based on 1.18.2 deploys in a healthy state, then the problem reported in the bz has been fixed.

Comment 5 XiaochuanWang 2020-05-15 08:32:32 UTC

Created attachment 1688809 [details]
console cluster operator status Available

Comment 6 XiaochuanWang 2020-05-15 08:33:06 UTC

Created attachment 1688810 [details]
version is successfully deployed

Comment 7 XiaochuanWang 2020-05-15 08:35:52 UTC

Thank you Maru!
The latest 4.5 cluster based on 1.18.2 now deploys healthy; screenshots are attached for reference.
So this bug can be marked Verified on the version below:

OpenShift Version    4.5.0-0.nightly-2020-05-15-011814
Kubernetes Version    v1.18.2
Channel    stable-4.5

Comment 8 errata-xmlrpc 2020-07-13 17:34:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
