Description of problem:
While downgrading from cluster version 4.6.0-0.nightly-2020-09-09-003430 to 4.5.0-0.nightly-2020-09-08-123650, the downgrade process got stuck on the storage operator.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-09-003430   True        True          3h24m   Unable to apply 4.5.0-0.nightly-2020-09-08-123650: the cluster operator storage has not yet successfully rolled out

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install cluster version 4.6.0-0.nightly-2020-09-09-003430 and then downgrade to 4.5.0-0.nightly-2020-09-08-123650
2.
3.

Actual results:
Downgrade stuck on the storage operator with the message "Unable to apply 4.5.0-0.nightly-2020-09-08-123650: the cluster operator storage has not yet successfully rolled out"

Expected results:
Downgrade should succeed without any issue.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
$ oc get co storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.6.0-0.nightly-2020-09-09-003430   True        False         False      5h24m
In the cluster-storage-operator logs I can see:

2020-09-09T08:18:16.788751731Z {"level":"info","ts":1599639496.7886891,"logger":"cmd","msg":"Go Version: go1.13.4"}
2020-09-09T08:18:16.788895430Z {"level":"info","ts":1599639496.788878,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
2020-09-09T08:18:16.788923983Z {"level":"info","ts":1599639496.7889135,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0"}
2020-09-09T08:18:16.789571987Z {"level":"info","ts":1599639496.789465,"logger":"leader","msg":"Trying to become the leader."}
2020-09-09T08:18:17.187376554Z {"level":"info","ts":1599639497.1873038,"logger":"leader","msg":"Not the leader. Waiting."}
...
2020-09-09T11:40:29.805119287Z {"level":"info","ts":1599651629.8050728,"logger":"leader","msg":"Not the leader. Waiting."}

I.e. the operator was not able to acquire the leadership lock for over 3 hours.
And the lock config map looks like this:

- apiVersion: v1
  kind: ConfigMap
  metadata:
    annotations:
      control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"","leaseDurationSeconds":0,"acquireTime":null,"renewTime":null,"leaderTransitions":2}'
    creationTimestamp: "2020-09-09T05:47:28Z"
    name: cluster-storage-operator-lock
    namespace: openshift-cluster-storage-operator

Not sure what the zero / null values mean.
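For context, the annotation value is a serialized leader-election record of the kind client-go writes; an empty holderIdentity with null timestamps usually means the previous holder released (yielded) the lease rather than still owning it. A minimal sketch of the record shape, with field names taken from the annotation above (this is an illustration, not CSO code):

package main

import (
	"encoding/json"
	"fmt"
)

// leaderElectionRecord mirrors the JSON stored in the
// control-plane.alpha.kubernetes.io/leader annotation.
type leaderElectionRecord struct {
	HolderIdentity       string  `json:"holderIdentity"`
	LeaseDurationSeconds int     `json:"leaseDurationSeconds"`
	AcquireTime          *string `json:"acquireTime"`
	RenewTime            *string `json:"renewTime"`
	LeaderTransitions    int     `json:"leaderTransitions"`
}

func main() {
	raw := `{"holderIdentity":"","leaseDurationSeconds":0,"acquireTime":null,"renewTime":null,"leaderTransitions":2}`
	var rec leaderElectionRecord
	if err := json.Unmarshal([]byte(raw), &rec); err != nil {
		panic(err)
	}
	// Empty holderIdentity: nobody currently claims the lease,
	// which is consistent with the 4.6 operator having shut down cleanly.
	fmt.Printf("lease currently held: %v\n", rec.HolderIdentity != "")
}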
*** Bug 1880053 has been marked as a duplicate of this bug. ***
Given this happens 100% of the time I'm raising the severity back to medium.

Workaround: delete the lock configmap.

$ oc delete cm -n openshift-cluster-storage-operator cluster-storage-operator-lock
configmap "cluster-storage-operator-lock" deleted
Is it possible that the lock was explicitly yielded, which is how we get the 'null' values, but it was either yielded incorrectly or the 'take the lock' code doesn't know how to handle a yielded lock? We'd need to dig into that code a bit to see where the mismatch is.
I've moved this back to 4.6.0 because I've been asked to get the upgrade variant of CI back on target, and I cannot do that with one of the jobs failing at a 100% rate. If this were simply flaky I'd have left it at 4.7. https://sippy.ci.openshift.org/?release=4.6
I think I found the reason why this is happening.

The ConfigMap is created by CSO 4.6 (library-go) and it doesn't have any ownerReference set (as per the must-gather data). However, CSO 4.5 uses operator-sdk, which expects the ConfigMap to have an ownerReference in order to know whether the lock belongs to it or not [1]. Since it doesn't find one, it tries to create the ConfigMap itself, but gets back an error stating that the object already exists [2]. So CSO 4.5 concludes the lock belongs to somebody else and never starts.

Monday I'll check if the fix should go to operator-sdk or to library-go.

[1] https://github.com/openshift/cluster-storage-operator/blob/release-4.5/vendor/github.com/operator-framework/operator-sdk/pkg/leader/leader.go#L83-L102
[2] https://github.com/openshift/cluster-storage-operator/blob/release-4.5/vendor/github.com/operator-framework/operator-sdk/pkg/leader/leader.go#L119-L125
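To illustrate the mismatch, here's a simplified sketch of the leader-for-life check (not the actual operator-sdk code; the helper name is made up for the example):

package leaderdemo

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// isOwnedByMe reports whether the lock ConfigMap was created by the Pod
// identified by myPodUID. The "leader for life" model only treats the lock
// as acquired if the ConfigMap carries an ownerReference pointing at the
// operator's own Pod. The ConfigMap written by the 4.6 leader-with-lease
// code has no ownerReferences at all, so this check never succeeds.
func isOwnedByMe(lock *corev1.ConfigMap, myPodUID types.UID) bool {
	for _, ref := range lock.GetOwnerReferences() {
		if ref.UID == myPodUID {
			return true // we already hold the lock; become leader
		}
	}
	// No matching ownerReference: the 4.5 operator then tries to create the
	// ConfigMap, gets AlreadyExists back, and keeps retrying, which is the
	// endless "Not the leader. Waiting." loop seen in the logs.
	return false
}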
I confirmed this is happening because of the leader-for-life vs. leader-with-lease mismatch between CSO 4.6 and CSO 4.5. I've submitted a PR to address this bug at [1]. The patch basically switches CSO 4.5 from the leader-for-life approach to the leader-with-lease one.

While the patch does fix the downgrade problem from 4.6 to 4.5.z (with [1]), it will also introduce the same problem when downgrading from 4.5.z (with [1]) to 4.5.z-1 (without [1]).

That being said, I have a few questions (@Eric, please help):

1. Do we want to go ahead with [1], even with the trade-off I described above? I suppose we want 4.5.z -> 4.5.z-1 rollbacks to work as well?
2. Manually deleting the ConfigMap also unblocks the downgrade; could this be fixed via documentation?

[1] https://github.com/openshift/cluster-storage-operator/pull/91
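For reference, leader-with-lease is the election model client-go provides. A minimal sketch of what using it looks like, with the lock name and namespace from this bug (illustrative only, not the code in the PR; timings are placeholders):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The leader record is stored as an annotation on a ConfigMap, which is
	// why the cluster-storage-operator-lock ConfigMap above carries the
	// control-plane.alpha.kubernetes.io/leader annotation.
	lock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{
			Namespace: "openshift-cluster-storage-operator",
			Name:      "cluster-storage-operator-lock",
		},
		Client: client.CoreV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("POD_NAME"),
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // write back an empty holderIdentity on shutdown
		LeaseDuration:   60 * time.Second,
		RenewDeadline:   40 * time.Second,
		RetryPeriod:     10 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start controllers */ },
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}

With ReleaseOnCancel, a clean shutdown yields the lease, which matches the empty holderIdentity seen in the ConfigMap dump earlier.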
I think breaking a z-downgrade would be worst option. Doc-ing the workaround would be barely acceptable. Could we potentially do a hack to make the 4.5 code recognize a yielded 4.6 leader-with-lease and respond (?maybe delete the configmap?) so it can switch back to the 4.5 leader-for-life model? Would that be a ton of unsupportable work to have the 4.5 code recognize and respond to the 4.6 lock difference?
(In reply to Eric Paris from comment #13)
> I think breaking a z-downgrade would be worst option.
> Doc-ing the workaround would be barely acceptable.
> Could we potentially do a hack to make the 4.5 code recognize a yielded 4.6
> leader-with-lease and respond (?maybe delete the configmap?) so it can
> switch back to the 4.5 leader-for-life model? Would that be a ton of
> unsupportable work to have the 4.5 code recognize and respond to the 4.6
> lock difference?

I've changed the PR above to test this approach. Will update this bug when I have more information.
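A rough sketch of what such a hack could look like (purely illustrative, not the code in the PR; the function name and flow are made up): before running the leader-for-life routine, check whether the existing lock ConfigMap is a yielded leader-with-lease lock (no ownerReferences, empty holderIdentity in the annotation) and, if so, delete it so the 4.5 code can recreate it.

package main

import (
	"context"
	"encoding/json"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const leaderAnnotation = "control-plane.alpha.kubernetes.io/leader"

// cleanupYieldedLeaseLock deletes the lock ConfigMap if it was written by a
// leader-with-lease (4.6) operator that has already released the lease, so
// the leader-for-life (4.5) code can recreate it with an ownerReference.
func cleanupYieldedLeaseLock(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	cm, err := client.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil // nothing to clean up
	}
	if err != nil {
		return err
	}

	raw, ok := cm.Annotations[leaderAnnotation]
	if !ok || len(cm.OwnerReferences) > 0 {
		return nil // not a leader-with-lease lock; leave it alone
	}

	var rec struct {
		HolderIdentity string `json:"holderIdentity"`
	}
	if err := json.Unmarshal([]byte(raw), &rec); err != nil {
		return err
	}
	if rec.HolderIdentity != "" {
		return nil // lease still held; do not touch it
	}

	// Lease was yielded (empty holderIdentity): delete the ConfigMap so the
	// leader-for-life path can take over.
	return client.CoreV1().ConfigMaps(ns).Delete(ctx, name, metav1.DeleteOptions{})
}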
For reference: I cloned this into bug 1882394 because the patch goes to 4.5. Eventually I'll close this one.
PR is under review. I tested the changes with:

4.6.0-0.nightly-2020-09-21-030155 -> 4.5.z (with my patch)
4.5.z (with my patch) -> 4.6.0-0.nightly-2020-09-22-073212

Worked OK.
OK, so the GH bot wants this ticket (which is the parent of the 4.5 bug) closed. Please check bug 1882394 for new updates on this.