Bug 1986215
| Summary: | cluster-storage-operator needs to handle API server downtime gracefully in SNO | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> | |
| Component: | Storage | Assignee: | Fabio Bertinatto <fbertina> | |
| Storage sub component: | Operators | QA Contact: | Rohit Patil <ropatil> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | high | |||
| Priority: | high | CC: | aos-bugs, jsafrane, nelluri, ropatil, vgrinber | |
| Version: | 4.9 | |||
| Target Milestone: | --- | |||
| Target Release: | 4.9.0 | |||
| Hardware: | Unspecified | |||
| OS: | Linux | |||
| Whiteboard: | chaos | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1992255 (view as bug list) | Environment: | ||
| Last Closed: | 2021-10-18 17:41:26 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| Bug Blocks: | 1984730, 1992255 | |||
Description
Naga Ravi Chaitanya Elluri
2021-07-27 00:02:40 UTC
Reused this BZ to fix cluster-csi-snapshot-controller-operator leader election timeouts. CSO and the CSI driver operators still need to be fixed.
Tested on build: 4.9.0-0.nightly-2021-08-07-175228
Failed: The restart count of csi-snapshot-controller increases after triggering the patch command and moving to leader election.
1) Created an SNO cluster
rohitpatil@ropatil-mac cluster % oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-189-194.us-west-1.compute.internal Ready master,worker 29m v1.21.1+8268f88
2) Observed the pod status and event log:
NAME READY STATUS RESTARTS AGE
cluster-storage-operator-6bbf5f9d9d-g2mt9 1/1 Running 1 30m
csi-snapshot-controller-b95b686f9-fq24t 1/1 Running 5 28m
csi-snapshot-controller-operator-565bf56b7-4fc8k 1/1 Running 0 30m
csi-snapshot-webhook-5f8d9b45f-cp9g5 1/1 Running 0 28m
#Event log observation before triggering the oc patch command:
21m Normal LeaderElection lease/snapshot-controller-leader csi-snapshot-controller-b95b686f9-fq24t became leader
13m Normal LeaderElection lease/snapshot-controller-leader csi-snapshot-controller-b95b686f9-fq24t became leader
11m Normal LeaderElection lease/snapshot-controller-leader csi-snapshot-controller-b95b686f9-fq24t became leader
3) Apply the patch
rohitpatil@ropatil-mac cluster % oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATION1"}}'
kubeapiserver.operator.openshift.io/cluster patched
4) Check the pod status and restart counts
NAME READY STATUS RESTARTS AGE
cluster-storage-operator-6bbf5f9d9d-g2mt9 1/1 Running 1 34m
csi-snapshot-controller-b95b686f9-fq24t 1/1 Running 7 33m
csi-snapshot-controller-operator-565bf56b7-4fc8k 1/1 Running 0 34m
csi-snapshot-webhook-5f8d9b45f-cp9g5 1/1 Running 0 33m
#Event logs
18m Normal LeaderElection lease/snapshot-controller-leader csi-snapshot-controller-b95b686f9-fq24t became leader
16m Normal LeaderElection lease/snapshot-controller-leader csi-snapshot-controller-b95b686f9-fq24t became leader
58s Normal LeaderElection lease/snapshot-controller-leader csi-snapshot-controller-b95b686f9-fq24t became leader
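The restart comparison between steps 2 and 4 can be sketched as a small script. This is only an illustration, not part of the original verification: the function names `parse_restarts` and `restart_deltas` are hypothetical, and the parser assumes the five-column `NAME READY STATUS RESTARTS AGE` layout shown above (other `oc` versions may print the RESTARTS column differently).

```python
def parse_restarts(listing: str) -> dict[str, int]:
    """Map pod name -> restart count from `oc get pods` style text.

    Assumes exactly five whitespace-separated columns per pod row,
    as in the listings pasted in this report.
    """
    restarts = {}
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 5-column pod row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, _ready, _status, count, _age = fields
        restarts[name] = int(count)
    return restarts


def restart_deltas(before: str, after: str) -> dict[str, int]:
    """Return only the pods whose restart count changed between listings."""
    b = parse_restarts(before)
    a = parse_restarts(after)
    return {name: a[name] - b[name] for name in a if name in b and a[name] != b[name]}


# Pod listings copied from steps 2 and 4 of this report.
BEFORE = """
NAME READY STATUS RESTARTS AGE
cluster-storage-operator-6bbf5f9d9d-g2mt9 1/1 Running 1 30m
csi-snapshot-controller-b95b686f9-fq24t 1/1 Running 5 28m
csi-snapshot-controller-operator-565bf56b7-4fc8k 1/1 Running 0 30m
csi-snapshot-webhook-5f8d9b45f-cp9g5 1/1 Running 0 28m
"""

AFTER = """
NAME READY STATUS RESTARTS AGE
cluster-storage-operator-6bbf5f9d9d-g2mt9 1/1 Running 1 34m
csi-snapshot-controller-b95b686f9-fq24t 1/1 Running 7 33m
csi-snapshot-controller-operator-565bf56b7-4fc8k 1/1 Running 0 34m
csi-snapshot-webhook-5f8d9b45f-cp9g5 1/1 Running 0 33m
"""

if __name__ == "__main__":
    for name, delta in restart_deltas(BEFORE, AFTER).items():
        print(f"{name}: +{delta} restarts")
```

With the listings above, only csi-snapshot-controller-b95b686f9-fq24t is reported (5 restarts before the patch, 7 after); the other three pods are unchanged.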
*** Bug 1960120 has been marked as a duplicate of this bug. ***

The restarts are happening in csi-snapshot-controller, which is an upstream project. I've submitted a PR to address this in upstream [1]. Once that's merged in upstream we'll be able to pull those changes in our downstream fork.
[1] https://github.com/kubernetes-csi/external-snapshotter/pull/575

@Rohit, can we address only CSO in this ticket? I created bug 1992255 to cover the csi-snapshot-controller issue, since this ticket was originally created for CSO.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:3759