Bug 1877316
| Summary: | CSO stuck on message "the cluster operator storage has not yet successfully rolled out" while downgrading from 4.6 -> 4.5 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | pmali |
| Component: | Storage | Assignee: | aos-storage-staff <aos-storage-staff> |
| Storage sub component: | Operators | QA Contact: | Qin Ping <piqin> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | aos-bugs, eparis, fbertina, jsafrane, sdodson, wking |
| Version: | 4.5 | Keywords: | TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1882394 (view as bug list) | Environment: | |
| Last Closed: | 2020-09-25 07:44:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1882394 | | |
Description (pmali, 2020-09-09 11:37:28 UTC)
In the cluster-storage-operator logs I can see:
```
2020-09-09T08:18:16.788751731Z {"level":"info","ts":1599639496.7886891,"logger":"cmd","msg":"Go Version: go1.13.4"}
2020-09-09T08:18:16.788895430Z {"level":"info","ts":1599639496.788878,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
2020-09-09T08:18:16.788923983Z {"level":"info","ts":1599639496.7889135,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0"}
2020-09-09T08:18:16.789571987Z {"level":"info","ts":1599639496.789465,"logger":"leader","msg":"Trying to become the leader."}
2020-09-09T08:18:17.187376554Z {"level":"info","ts":1599639497.1873038,"logger":"leader","msg":"Not the leader. Waiting."}
...
2020-09-09T11:40:29.805119287Z {"level":"info","ts":1599651629.8050728,"logger":"leader","msg":"Not the leader. Waiting."}
```
I.e. the operator was not able to acquire the leadership lock for over 3 hours.
And the lock ConfigMap looks like this:
```yaml
- apiVersion: v1
  kind: ConfigMap
  metadata:
    annotations:
      control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"","leaseDurationSeconds":0,"acquireTime":null,"renewTime":null,"leaderTransitions":2}'
    creationTimestamp: "2020-09-09T05:47:28Z"
    name: cluster-storage-operator-lock
    namespace: openshift-cluster-storage-operator
```
Not sure what zero / null values mean.
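For context, here is a minimal, hedged sketch (not CSO's or library-go's actual code) of a leader-with-lease elector built on client-go against a ConfigMap lock like the one shown above. The durations, identity handling, and namespace wiring are illustrative assumptions. When an elector like this shuts down cleanly with `ReleaseOnCancel` set, it writes the lease record back with an empty `holderIdentity` and null acquire/renew times, which would produce a "yielded" annotation like the one in this ConfigMap.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The lock object mirrors the ConfigMap from the must-gather data; the
	// holder identity and lease timestamps are stored in the
	// control-plane.alpha.kubernetes.io/leader annotation.
	lock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{
			Namespace: "openshift-cluster-storage-operator",
			Name:      "cluster-storage-operator-lock",
		},
		Client: client.CoreV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			// Hypothetical identity; a real operator derives it from its pod.
			Identity: os.Getenv("POD_NAME"),
		},
	}

	// A real operator would cancel this context on SIGTERM so the lock
	// release below actually runs.
	ctx := context.Background()

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 60 * time.Second, // illustrative values, not CSO's settings
		RenewDeadline: 35 * time.Second,
		RetryPeriod:   10 * time.Second,
		// On a clean shutdown the elector rewrites the record with an empty
		// holderIdentity and null acquire/renew times -- a "yielded" lock.
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// run the operator's controllers here
			},
			OnStoppedLeading: func() {
				os.Exit(0)
			},
		},
	})
}
```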
*** Bug 1880053 has been marked as a duplicate of this bug. ***

Given this happens 100% of the time I'm raising the severity back to medium.

Workaround: delete the lock ConfigMap.

```
$ oc delete cm -n openshift-cluster-storage-operator cluster-storage-operator-lock
configmap "cluster-storage-operator-lock" deleted
```

Is it possible that the lock was explicitly yielded, which is how we get the 'null' thing, but either it was yielded incorrectly or the 'take the lock' code incorrectly doesn't know how to handle a yielded lock? We'd need to dig into that code a bit to see where the mismatch is.

I've moved this back to 4.6.0 because I've been asked to get the upgrade variant of CI back on target and I cannot do that with one of the jobs failing at a 100% rate. If this were simply flaky I'd have left it at 4.7. https://sippy.ci.openshift.org/?release=4.6

I think I found the reason why this is happening. The ConfigMap is created by CSO 4.6 (library-go) and it doesn't have any ownerReference set (as per must-gather data). However, CSO 4.5 uses operator-sdk, which expects the ConfigMap to have an ownerReference in order to know whether the lock belongs to it [1] or not. Since it doesn't find any, it tries to create one, but it gets back an error stating that the object already exists [2]. So CSO 4.5 thinks that the lock belongs to somebody else and it never starts. Monday I'll check if the fix should go to operator-sdk or to library-go.

[1] https://github.com/openshift/cluster-storage-operator/blob/release-4.5/vendor/github.com/operator-framework/operator-sdk/pkg/leader/leader.go#L83-L102
[2] https://github.com/openshift/cluster-storage-operator/blob/release-4.5/vendor/github.com/operator-framework/operator-sdk/pkg/leader/leader.go#L119-L125

I confirmed this is happening because of the leader-for-life and leader-with-lease mismatch between CSO 4.6 and CSO 4.5. I've submitted a PR to address this bug at [1]. The patch basically switches the leader-for-life approach used in CSO 4.5 to the leader-with-lease one. While the patch does fix the downgrade problem from 4.6 to 4.5.z (with [1]), it will also introduce the same problem when downgrading from 4.5.z (with [1]) to 4.5.z-1 (without [1]). That being said, I have a few questions (@Eric, please help):

1. Do we want to go ahead with [1], even with the trade-off I described above? I suppose we want 4.5.z -> 4.5.z-1 rollbacks to work as well?
2. Manually deleting the ConfigMap also unblocks the downgrade; could this be fixed via documentation?

[1] https://github.com/openshift/cluster-storage-operator/pull/91

I think breaking a z-downgrade would be the worst option. Doc-ing the workaround would be barely acceptable. Could we potentially do a hack to make the 4.5 code recognize a yielded 4.6 leader-with-lease lock and respond (maybe delete the ConfigMap?) so it can switch back to the 4.5 leader-for-life model? Would that be a ton of unsupportable work to have the 4.5 code recognize and respond to the 4.6 lock difference?

(In reply to Eric Paris from comment #13)
> I think breaking a z-downgrade would be the worst option.
> Doc-ing the workaround would be barely acceptable.
> Could we potentially do a hack to make the 4.5 code recognize a yielded 4.6
> leader-with-lease lock and respond (maybe delete the ConfigMap?) so it can
> switch back to the 4.5 leader-for-life model?

I've changed the PR above to test this approach. Will update this bug when I have more information.
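The ownerReference mismatch described above is easier to see in code. Below is a hedged paraphrase, not the verbatim operator-sdk source linked in [1] and [2], of the leader-for-life flow the 4.5 operator follows; the function name, package name, and error handling are simplified assumptions. Because a ConfigMap written by the 4.6 leader-with-lease code carries no ownerReferences, step 1 never matches and the Create call loops on "already exists" forever, which is the hang observed in the CSO logs.

```go
package leaderutil

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// becomeLeaderForLife is a simplified, hypothetical stand-in for the
// operator-sdk leader.Become flow referenced in [1] and [2] above.
func becomeLeaderForLife(ctx context.Context, c client.Client, ns, lockName string, myOwnerRef metav1.OwnerReference) error {
	key := client.ObjectKey{Namespace: ns, Name: lockName}

	// 1) If the lock ConfigMap exists and one of its ownerReferences is this
	//    pod, the operator was merely restarted and already owns the lock.
	existing := &corev1.ConfigMap{}
	if err := c.Get(ctx, key, existing); err == nil {
		for _, ref := range existing.GetOwnerReferences() {
			if ref.Name == myOwnerRef.Name {
				return nil // we already hold the lock
			}
		}
	}

	// 2) Otherwise keep trying to create the lock with this pod as owner.
	//    A lock left behind by the 4.6 leader-with-lease code has no
	//    ownerReferences, so step 1 never matches, Create keeps returning
	//    "already exists", and the operator waits here indefinitely.
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:            lockName,
			Namespace:       ns,
			OwnerReferences: []metav1.OwnerReference{myOwnerRef},
		},
	}
	for {
		err := c.Create(ctx, cm)
		switch {
		case err == nil:
			return nil // we became the leader
		case apierrors.IsAlreadyExists(err):
			time.Sleep(time.Second) // someone else appears to hold the lock; retry
		default:
			return err
		}
	}
}
```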
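And a hedged sketch of the direction discussed in the last two comments (essentially what the manual `oc delete cm` workaround does by hand): before entering the leader-for-life loop, detect a leader-with-lease ConfigMap whose lease record shows an empty holderIdentity, i.e. a yielded 4.6 lock, and delete it so the 4.5 code can create its own ownerReference-based lock. This is not the actual content of PR #91; the annotation key comes from the ConfigMap shown in the description, everything else is illustrative and shares the hypothetical package above.

```go
package leaderutil

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// leaderAnnotation is the key observed on the lock ConfigMap in the
// description; its value is a JSON leader-election record.
const leaderAnnotation = "control-plane.alpha.kubernetes.io/leader"

type leaderElectionRecord struct {
	HolderIdentity string `json:"holderIdentity"`
}

// deleteYieldedLeaseLock removes the lock ConfigMap only if it looks like a
// leader-with-lease lock whose lease has been released (empty holderIdentity).
func deleteYieldedLeaseLock(ctx context.Context, c client.Client, ns, lockName string) error {
	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: ns, Name: lockName}, cm); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // nothing to clean up
		}
		return err
	}

	raw, ok := cm.Annotations[leaderAnnotation]
	if !ok || len(cm.OwnerReferences) > 0 {
		return nil // looks like a leader-for-life lock; leave it alone
	}

	var rec leaderElectionRecord
	if err := json.Unmarshal([]byte(raw), &rec); err != nil {
		return err
	}
	if rec.HolderIdentity != "" {
		return nil // a 4.6 operator still claims to hold the lease
	}

	// The previous leader yielded the lease; delete the ConfigMap so the
	// leader-for-life code can re-create it with its own ownerReference.
	return c.Delete(ctx, cm)
}
```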
For reference: I cloned this into bug 1882394 because the patch goes to 4.5. Eventually I'll close this one.

PR is under review. I tested the changes with:

- 4.6.0-0.nightly-2020-09-21-030155 -> 4.5.z (with my patch)
- 4.5.z (with my patch) -> 4.6.0-0.nightly-2020-09-22-073212

Worked OK.

OK, so the GH bot wants this ticket (which is the parent of the 4.5 bug) closed. Please check bug 1882394 for new updates on this.