Bug 1877316
| Summary: | CSO stuck on message "the cluster operator storage has not yet successfully rolled out" while downgrading from 4.6 -> 4.5 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | pmali |
| Component: | Storage | Assignee: | aos-storage-staff <aos-storage-staff> |
| Storage sub component: | Operators | QA Contact: | Qin Ping <piqin> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | aos-bugs, eparis, fbertina, jsafrane, sdodson, wking |
| Version: | 4.5 | Keywords: | TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1882394 (view as bug list) | Environment: | |
| Last Closed: | 2020-09-25 07:44:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1882394 | | |
Description (pmali, 2020-09-09 11:37:28 UTC)
In the cluster-storage-operator logs I can see:
```
2020-09-09T08:18:16.788751731Z {"level":"info","ts":1599639496.7886891,"logger":"cmd","msg":"Go Version: go1.13.4"}
2020-09-09T08:18:16.788895430Z {"level":"info","ts":1599639496.788878,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
2020-09-09T08:18:16.788923983Z {"level":"info","ts":1599639496.7889135,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0"}
2020-09-09T08:18:16.789571987Z {"level":"info","ts":1599639496.789465,"logger":"leader","msg":"Trying to become the leader."}
2020-09-09T08:18:17.187376554Z {"level":"info","ts":1599639497.1873038,"logger":"leader","msg":"Not the leader. Waiting."}
...
2020-09-09T11:40:29.805119287Z {"level":"info","ts":1599651629.8050728,"logger":"leader","msg":"Not the leader. Waiting."}
```
I.e. the operator was not able to acquire the leadership lock for over 3 hours.
And the lock ConfigMap looks like this:
```yaml
- apiVersion: v1
  kind: ConfigMap
  metadata:
    annotations:
      control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"","leaseDurationSeconds":0,"acquireTime":null,"renewTime":null,"leaderTransitions":2}'
    creationTimestamp: "2020-09-09T05:47:28Z"
    name: cluster-storage-operator-lock
    namespace: openshift-cluster-storage-operator
```
Not sure what zero / null values mean.
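For context, here is a minimal, hedged sketch (not CSO's or library-go's actual code) of a leader-with-lease elector built on client-go against a ConfigMap lock like the one shown above. The durations, identity handling, and namespace wiring are illustrative assumptions. When an elector like this shuts down cleanly with `ReleaseOnCancel` set, it writes the lease record back with an empty `holderIdentity` and null acquire/renew times, which would produce a "yielded" annotation like the one in this ConfigMap.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The lock object mirrors the ConfigMap from the must-gather data; the
	// holder identity and lease timestamps are stored in the
	// control-plane.alpha.kubernetes.io/leader annotation.
	lock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{
			Namespace: "openshift-cluster-storage-operator",
			Name:      "cluster-storage-operator-lock",
		},
		Client: client.CoreV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			// Hypothetical identity; a real operator derives it from its pod.
			Identity: os.Getenv("POD_NAME"),
		},
	}

	// A real operator would cancel this context on SIGTERM so the lock
	// release below actually runs.
	ctx := context.Background()

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 60 * time.Second, // illustrative values, not CSO's settings
		RenewDeadline: 35 * time.Second,
		RetryPeriod:   10 * time.Second,
		// On a clean shutdown the elector rewrites the record with an empty
		// holderIdentity and null acquire/renew times -- a "yielded" lock.
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// run the operator's controllers here
			},
			OnStoppedLeading: func() {
				os.Exit(0)
			},
		},
	})
}
```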
*** Bug 1880053 has been marked as a duplicate of this bug. ***

Given this happens 100% of the time I'm raising the severity back to medium.

Workaround: delete the lock ConfigMap.

```
$ oc delete cm -n openshift-cluster-storage-operator cluster-storage-operator-lock
configmap "cluster-storage-operator-lock" deleted
```

Is it possible that the lock was explicitly yielded, which is how we get the 'null' thing, but either it was yielded incorrectly or the 'take the lock' code incorrectly doesn't know how to handle a yielded lock? We'd need to dig into that code a bit to see where the mismatch is.

I've moved this back to 4.6.0 because I've been asked to get the upgrade variant of CI back on target and I cannot do that with one of the jobs failing at a 100% rate. If this were simply flaky I'd have left it at 4.7. https://sippy.ci.openshift.org/?release=4.6

I think I found the reason why this is happening. The ConfigMap is created by CSO 4.6 (library-go) and it doesn't have any ownerReference set (as per must-gather data). However, CSO 4.5 uses operator-sdk, which expects the ConfigMap to have an ownerReference in order to know whether the lock belongs to it [1] or not. Since it doesn't find any, it tries to create one, but it gets back an error stating that the object already exists [2]. So CSO 4.5 thinks that the lock belongs to somebody else and it never starts. Monday I'll check if the fix should go to operator-sdk or to library-go.

[1] https://github.com/openshift/cluster-storage-operator/blob/release-4.5/vendor/github.com/operator-framework/operator-sdk/pkg/leader/leader.go#L83-L102
[2] https://github.com/openshift/cluster-storage-operator/blob/release-4.5/vendor/github.com/operator-framework/operator-sdk/pkg/leader/leader.go#L119-L125

I confirmed this is happening because of the leader-for-life and leader-with-lease mismatch between CSO 4.6 and CSO 4.5. I've submitted a PR to address this bug at [1]. The patch basically switches the leader-for-life approach used in CSO 4.5 to the leader-with-lease one. While the patch does fix the downgrade problem from 4.6 to 4.5.z (with [1]), it will also introduce the same problem when downgrading from 4.5.z (with [1]) to 4.5.z-1 (without [1]). That being said, I have a few questions (@Eric, please help):

1. Do we want to go ahead with [1], even with the trade-off I described above? I suppose we want 4.5.z -> 4.5.z-1 rollbacks to work as well?
2. Manually deleting the ConfigMap also unblocks the downgrade; could this be fixed via documentation?

[1] https://github.com/openshift/cluster-storage-operator/pull/91

I think breaking a z-downgrade would be the worst option. Doc-ing the workaround would be barely acceptable. Could we potentially do a hack to make the 4.5 code recognize a yielded 4.6 leader-with-lease lock and respond (maybe delete the ConfigMap?) so it can switch back to the 4.5 leader-for-life model? Would that be a ton of unsupportable work to have the 4.5 code recognize and respond to the 4.6 lock difference?

(In reply to Eric Paris from comment #13)
> I think breaking a z-downgrade would be the worst option.
> Doc-ing the workaround would be barely acceptable.
> Could we potentially do a hack to make the 4.5 code recognize a yielded 4.6
> leader-with-lease lock and respond (maybe delete the ConfigMap?) so it can
> switch back to the 4.5 leader-for-life model?

I've changed the PR above to test this approach. Will update this bug when I have more information.
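The ownerReference mismatch described above is easier to see in code. Below is a hedged paraphrase, not the verbatim operator-sdk source linked in [1] and [2], of the leader-for-life flow the 4.5 operator follows; the function name, package name, and error handling are simplified assumptions. Because a ConfigMap written by the 4.6 leader-with-lease code carries no ownerReferences, step 1 never matches and the Create call loops on "already exists" forever, which is the hang observed in the CSO logs.

```go
package leaderutil

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// becomeLeaderForLife is a simplified, hypothetical stand-in for the
// operator-sdk leader.Become flow referenced in [1] and [2] above.
func becomeLeaderForLife(ctx context.Context, c client.Client, ns, lockName string, myOwnerRef metav1.OwnerReference) error {
	key := client.ObjectKey{Namespace: ns, Name: lockName}

	// 1) If the lock ConfigMap exists and one of its ownerReferences is this
	//    pod, the operator was merely restarted and already owns the lock.
	existing := &corev1.ConfigMap{}
	if err := c.Get(ctx, key, existing); err == nil {
		for _, ref := range existing.GetOwnerReferences() {
			if ref.Name == myOwnerRef.Name {
				return nil // we already hold the lock
			}
		}
	}

	// 2) Otherwise keep trying to create the lock with this pod as owner.
	//    A lock left behind by the 4.6 leader-with-lease code has no
	//    ownerReferences, so step 1 never matches, Create keeps returning
	//    "already exists", and the operator waits here indefinitely.
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:            lockName,
			Namespace:       ns,
			OwnerReferences: []metav1.OwnerReference{myOwnerRef},
		},
	}
	for {
		err := c.Create(ctx, cm)
		switch {
		case err == nil:
			return nil // we became the leader
		case apierrors.IsAlreadyExists(err):
			time.Sleep(time.Second) // someone else appears to hold the lock; retry
		default:
			return err
		}
	}
}
```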
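And a hedged sketch of the direction discussed in the last two comments (essentially what the manual `oc delete cm` workaround does by hand): before entering the leader-for-life loop, detect a leader-with-lease ConfigMap whose lease record shows an empty holderIdentity, i.e. a yielded 4.6 lock, and delete it so the 4.5 code can create its own ownerReference-based lock. This is not the actual content of PR #91; the annotation key comes from the ConfigMap shown in the description, everything else is illustrative and shares the hypothetical package above.

```go
package leaderutil

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// leaderAnnotation is the key observed on the lock ConfigMap in the
// description; its value is a JSON leader-election record.
const leaderAnnotation = "control-plane.alpha.kubernetes.io/leader"

type leaderElectionRecord struct {
	HolderIdentity string `json:"holderIdentity"`
}

// deleteYieldedLeaseLock removes the lock ConfigMap only if it looks like a
// leader-with-lease lock whose lease has been released (empty holderIdentity).
func deleteYieldedLeaseLock(ctx context.Context, c client.Client, ns, lockName string) error {
	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: lockName}, cm); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // nothing to clean up
		}
		return err
	}

	raw, ok := cm.Annotations[leaderAnnotation]
	if !ok || len(cm.OwnerReferences) > 0 {
		return nil // looks like a leader-for-life lock; leave it alone
	}

	var rec leaderElectionRecord
	if err := json.Unmarshal([]byte(raw), &rec); err != nil {
		return err
	}
	if rec.HolderIdentity != "" {
		return nil // a 4.6 operator still claims to hold the lease
	}

	// The previous leader yielded the lease; delete the ConfigMap so the
	// leader-for-life code can re-create it with its own ownerReference.
	return c.Delete(ctx, cm)
}
```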
For reference: I cloned this into bug 1882394 because the patch goes to 4.5. Eventually I'll close this one.

PR is under review. I tested the changes with:

- 4.6.0-0.nightly-2020-09-21-030155 -> 4.5.z (with my patch)
- 4.5.z (with my patch) -> 4.6.0-0.nightly-2020-09-22-073212

Worked OK.

OK, so the GH bot wants this ticket (which is the parent of the 4.5 bug) closed. Please check bug 1882394 for new updates on this.