Bug 1719188 - Progressing and Degraded status should be False when the openshiftcontrollermanager operator is unmanaged
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-controller-manager
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Gabe Montero
QA Contact: wewang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-11 09:30 UTC by wewang
Modified: 2019-10-16 06:32 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Setting the OpenShift Controller Manager Operator to Unmanaged or Removed is in fact not supported or advised.
Consequence: The conditions on the corresponding ClusterOperator object would go to an Unknown status, which confused the cluster administrator.
Fix: The OpenShift Controller Manager Operator now ignores the unmanaged/removed settings for management state, and places a reason/message explaining as much in its status conditions.
Result: Unsupported unmanaged/removed settings for the OpenShift Controller Manager Operator are fully ignored and properly explained in the ClusterOperator conditions.
Clone Of:
Environment:
Last Closed: 2019-10-16 06:31:56 UTC
Target Upstream Version:
Embargoed:




Links
GitHub: openshift/cluster-openshift-controller-manager-operator pull 116 (closed) - "Bug 1719188: set library-go bits to declare we do not support unmanaged/removed; allow all other reconciliation as part..." (last updated 2021-01-22 17:53:35 UTC)
Red Hat Product Errata: RHBA-2019:2922 (last updated 2019-10-16 06:32:10 UTC)

Description wewang 2019-06-11 09:30:32 UTC
Description of problem:
When openshiftcontrollermanagers/cluster is set to Unmanaged, the Progressing and Degraded statuses should not change; they should remain False.



Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-06-06-122735

How reproducible:
always

Steps to Reproduce:
1. Update openshiftcontrollermanager/cluster to unmanaged:
   managementState: Unmanaged

2. Check the openshift-controller-manager cluster operator:
 [wewang@Desktop ~]$ oc get clusteroperator | grep openshift-controller-manager

NAME                           AVAILABLE   PROGRESSING   DEGRADED   SINCE
openshift-controller-manager   Unknown     Unknown       Unknown    92s

Actual results:
PROGRESSING and DEGRADED change to Unknown.

Expected results:
PROGRESSING and DEGRADED should remain False.

Additional info:
older bug: https://bugzilla.redhat.com/show_bug.cgi?id=1667376

https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/operator.go#L92
https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/operator.go#L98

Comment 1 wewang 2019-06-11 09:36:25 UTC
And what about the Upgradeable status? The code does not appear to handle it when the operator is set to unmanaged. Refer to:

https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/operator.go#L81

Comment 2 Adam Kaplan 2019-06-11 14:29:16 UTC
@wewang per recent discussions, Upgradeable should be `True` if the operator is Unmanaged or Removed.

We should not block upgrades for Unmanaged; however, customers risk the upgrade failing.
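
For illustration, that policy would amount to a condition like the following, sketched with library-go's v1helpers (a hypothetical sketch; the helper name and message text are mine, not from the operator source):

// Hypothetical sketch: keep the operator Upgradeable while Unmanaged.
// v1helpers.SetOperatorCondition upserts a condition by Type; the
// function name and message text here are invented for illustration.
package operator

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

func setUpgradeableWhileUnmanaged(status *operatorv1.OperatorStatus) {
	v1helpers.SetOperatorCondition(&status.Conditions, operatorv1.OperatorCondition{
		Type:    "Upgradeable",
		Status:  operatorv1.ConditionTrue,
		Reason:  "Unmanaged",
		Message: "upgrades are not blocked while unmanaged, but may fail",
	})
}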

Comment 3 Gabe Montero 2019-08-22 19:36:39 UTC
So this is a weird one.

When unmanaged, the operator code is trying to set:
- Available to Unknown
- Progressing to False
- Degraded to False

See https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/operator.go#L84-L101

I've added debug output that confirms those are the settings prior to the UpdateStatus call at https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/operator.go#L104
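
For context, a minimal sketch of that condition-setting step, assuming library-go's v1helpers (the helper name setUnmanagedConditions is mine; this is not the verbatim operator.go source):

// Sketch of the condition updates described above; SetOperatorCondition
// upserts a condition by Type. Names here are illustrative assumptions.
package operator

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

func setUnmanagedConditions(status *operatorv1.OperatorStatus) {
	for _, c := range []operatorv1.OperatorCondition{
		{Type: operatorv1.OperatorStatusTypeAvailable, Status: operatorv1.ConditionUnknown, Reason: "Unmanaged"},
		{Type: operatorv1.OperatorStatusTypeProgressing, Status: operatorv1.ConditionFalse, Reason: "Unmanaged"},
		{Type: operatorv1.OperatorStatusTypeDegraded, Status: operatorv1.ConditionFalse, Reason: "Unmanaged"},
	} {
		v1helpers.SetOperatorCondition(&status.Conditions, c)
	}
}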

But I then see this event coming across immediately after the UpdateStatus code, where in fact Progressing and Degraded are also Unknown (which is also what I see when running oc get clusteroperator openshift-controller-manager -o yaml):

I0822 19:20:23.116996       1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"7b9429af-c4f4-11e9-9995-0201d59f5a8e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator openshift-controller-manager changed: {"conditions":[{"type":"Available","status":"Unknown","lastTransitionTime":"2019-08-22T19:20:07Z","reason":"Unmanaged"},{"type":"Progressing","status":"Unknown","lastTransitionTime":"2019-08-22T19:20:07Z","reason":"Unmanaged"},{"type":"Degraded","status":"Unknown","lastTransitionTime":"2019-08-22T19:20:07Z","reason":"Unmanaged"},{"type":"Upgradeable","status":"Unknown","lastTransitionTime":"2019-08-22T19:20:07Z","reason":"Unmanaged"}],"extension":null}


That would imply the UpdateStatus call at https://github.com/openshift/cluster-openshift-controller-manager-operator/blob/master/pkg/operator/operator.go#L104 is misbehaving somehow. Perhaps a vendoring incompatibility/mismatch?

Continuing to debug, but posting this in case folks on the CC: have seen this before.

Comment 4 Gabe Montero 2019-08-22 19:47:14 UTC
Ah ... some confusion in my last comment: those lines I noted update the openshiftcontrollermanager type, which happens to inline OperatorStatus. oc get openshiftcontrollermanager -o yaml does in fact reflect unmanaged resulting in Available Unknown and Progressing/Degraded False in that object.

I still need to find the code that updates what I see with oc get clusteroperator openshift-controller-manager -o yaml.

Comment 5 Gabe Montero 2019-08-22 20:05:05 UTC
I believe I found it. It is this code in the common openshift/library-go package: https://github.com/openshift/library-go/blob/master/pkg/operator/status/status_controller.go#L127-L143

The fact that it is being done by this common code path potentially implies that other controllers are also setting all the conditions to Unknown when unmanaged, at least when management.IsOperatorAlwaysManaged() is false.

Also see https://github.com/openshift/library-go/blob/master/pkg/operator/management/management_state.go#L36-L39 and https://github.com/openshift/library-go/blob/master/pkg/operator/management/management_state.go#L22-L27

// SetOperatorAlwaysManaged is one time choice when an operator want to opt-out from supporting the "unmanaged" state.
// This is a case of control plane operators or operators that are required to always run otherwise the cluster will
// get into unstable state or critical components will stop working.

So it looks like the implication is that OCM does NOT support going unmanaged.

Though I'm still trying to reconcile all the booleans and nots between the management_state.go code and https://github.com/openshift/library-go/blob/master/pkg/operator/status/status_controller.go#L127
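
For reference, my paraphrase of how the management_state.go pieces seem to fit together (a simplified sketch assuming a package-level flag; not the verbatim library-go source):

// Simplified paraphrase of library-go's pkg/operator/management (not verbatim).
package management

// Whether the operator supports being set to the Unmanaged state.
var allowOperatorUnmanagedState = true

// SetOperatorAlwaysManaged is the one-time opt-out from supporting "unmanaged".
func SetOperatorAlwaysManaged() {
	allowOperatorUnmanagedState = false
}

// IsOperatorAlwaysManaged reports whether the operator can never be unmanaged.
func IsOperatorAlwaysManaged() bool {
	return !allowOperatorUnmanagedState
}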

Comment 6 Gabe Montero 2019-08-22 21:22:15 UTC
Yep so Michal's https://github.com/openshift/library-go/pull/297 is what dropped this stuff in back in March.

The description of that PR at least says UPDATED: management.SetOperatorAlwaysManaged() and management.SetOperatorNotRemovable() will make the operator ignore the "unmanaged" and "removed" state, so you can opt-in.

But the comment in the code says:

// SetOperatorAlwaysManaged is one time choice when an operator want to opt-out from supporting the "unmanaged" state.
// This is a case of control plane operators or operators that are required to always run otherwise the cluster will
// get into unstable state or critical components will stop working.

If, per the comment in management_state.go, IsOperatorAlwaysManaged means the operator can't be set to the unmanaged state,

then that would imply that the line which results in setting the cluster operator conditions to unknown

if detailedSpec.ManagementState == operatorv1.Unmanaged && !management.IsOperatorAlwaysManaged() {

meaning that we've set the management state to unmanaged and that we are *NOT* always managed ... meaning we allow unmanaged,

should be

if detailedSpec.ManagementState == operatorv1.Unmanaged && management.IsOperatorAlwaysManaged() {

meaning that we've set the management state to unmanaged and we must always be managed ... it can't be set to unmanaged, as all the logic and comments in management_state.go suggest.

When I made that change in status_controller.go (switching to the management.IsOperatorAlwaysManaged() variant above), I no longer see the clusteroperator conditions for openshift-controller-manager set to Unknown.

Now, what about the controllers that do NOT want to be unmanaged? Right now, assuming they are not calling SetOperatorAlwaysManaged(), they are getting the unmanaged protection by accident. So we have a chicken/egg situation between updating all those operators to call SetOperatorAlwaysManaged() and updating openshift/library-go to fix https://github.com/openshift/library-go/blob/master/pkg/operator/status/status_controller.go#L127

Or we leave the behavior as is, simply add comments to clarify things, and, for the OCM controller, call SetOperatorAlwaysManaged() so that we can set it to unmanaged.

I left https://github.com/openshift/library-go/blob/master/pkg/operator/status/status_controller.go#L127 as is and added that call to SetOperatorAlwaysManaged() in the OCM operator, and that worked too.

Before crafting any openshift/library-go or openshift/cluster-openshift-controller-manager-operator PRs, I'd like to see if Michal has the cycles to digest this comment and comment 5, and get his take on which way we should go wrt the chicken/egg scenario I mentioned here, my interpretation of the names/comments/logic, etc.

Comment 7 Gabe Montero 2019-08-23 18:33:20 UTC
Opened https://github.com/openshift/library-go/pull/507 with one of the proposed solutions.

We can continue the which way to go discussion there.

Removing needinfo

Comment 8 Gabe Montero 2019-08-23 18:44:14 UTC
Reminder: even if the library-go PR merges, we have to bump library-go in the cluster-openshift-controller-manager-operator repo to pick up the fix.

Comment 9 Gabe Montero 2019-08-23 18:56:27 UTC
Feedback provided in https://github.com/openshift/library-go/pull/507.

We are going with the other approach.

Comment 10 Gabe Montero 2019-08-23 20:12:59 UTC
We are still discussing internally, but the belief / understanding is growing on our end that unmanaged/removed should not be supported for ocm-o.

In that sense, unknown on the clusteroperator conditions would be acceptable.

That said, there is some cleanup on the ocm-o and openshiftcontrollermanager object sides that we are entertaining.

Comment 11 Gabe Montero 2019-08-24 18:31:20 UTC
OK, where we landed on this:

1) ocm-o declares to library-go that it only supports the managed state, and NOT unmanaged or removed (see the sketch after this list)
2) if you set ocm-o to unmanaged or removed, a reason/message noting that this is NOT supported is posted on all the conditions, but otherwise all conditions are left unchanged
3) with unmanaged/removed being IGNORED, the rest of the sync loop is still processed (vs. the abort that used to occur for unmanaged) ...
4) ... if the rest of the sync loop makes any additional changes to the conditions based on the other inputs to the system, and errors are detected, the reason/message for the condition(s) in question will be updated accordingly, overwriting the "unmanaged not supported" messages if present
5) if you switch the operator from unmanaged/removed back to managed, the "unmanaged/removed not supported" reason/message is removed from the conditions.
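
Item 1 would amount to something like the following at operator startup (a sketch based on the library-go calls named in PR 297's description; the init placement is an assumption, not the verbatim ocm-o change):

// Sketch of declaring "managed only" to library-go, per item 1 above.
package operator

import "github.com/openshift/library-go/pkg/operator/management"

func init() {
	// Opt out of the Unmanaged and Removed management states; library-go's
	// status controller then leaves the ClusterOperator conditions to the
	// operator's own sync loop instead of forcing them to Unknown.
	management.SetOperatorAlwaysManaged()
	management.SetOperatorNotRemovable()
}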

Comment 13 wewang 2019-08-27 08:25:16 UTC
tested in version:
4.2.0-0.nightly-2019-08-26-163938

1. Set ocm-o to Unmanaged; this does not affect the status of Available, Degraded, and Progressing:
 $ oc get co openshift-controller-manager -o yaml
  conditions:
  - lastTransitionTime: "2019-08-27T01:05:55Z"
    message: |-
      ConfigObservationDegraded: the controller manager spec was set to Unmanaged state, but that is unsupported, and has no effect on this condition
      WorkloadDegraded: the controller manager spec was set to Unmanaged state, but that is unsupported, and has no effect on this condition
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-08-27T08:15:27Z"
    message: 'Progressing: the controller manager spec was set to Unmanaged state,
      but that is unsupported, and has no effect on this condition'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-08-27T01:07:16Z"
    message: 'Available: the controller manager spec was set to Unmanaged state, but
      that is unsupported, and has no effect on this condition'
    reason: AsExpected
    status: "True"
    type: Available
  - lastTransitionTime: "2019-08-27T01:05:55Z"
    reason: NoData
    status: Unknown
    type: Upgradeable

 $ oc get co | grep openshift-controller-manager
openshift-controller-manager               4.2.0-0.nightly-2019-08-26-163938   True        False         False      7h17m

Comment 15 Gabe Montero 2019-08-27 15:21:51 UTC
Going to live with this duplication, @Wen.

There are other cases where the messages there could vary, and filtering on this specific message feels kludgy.

It is on par with us duplicating the message across Available/Progressing/etc. ... we are making sure the situation is reported.

Comment 17 errata-xmlrpc 2019-10-16 06:31:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

