Bug 1986215
Summary: | cluster-storage-operator needs to handle API server downtime gracefully in SNO | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Naga Ravi Chaitanya Elluri <nelluri> | |
Component: | Storage | Assignee: | Fabio Bertinatto <fbertina> | |
Storage sub component: | Operators | QA Contact: | Rohit Patil <ropatil> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | aos-bugs, jsafrane, nelluri, ropatil, vgrinber | |
Version: | 4.9 | |||
Target Milestone: | --- | |||
Target Release: | 4.9.0 | |||
Hardware: | Unspecified | |||
OS: | Linux | |||
Whiteboard: | chaos | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1992255 (view as bug list) | Environment: | ||
Last Closed: | 2021-10-18 17:41:26 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1984730, 1992255 |
Description
Naga Ravi Chaitanya Elluri
2021-07-27 00:02:40 UTC
Reused this BZ to fix cluster-csi-snapshot-controller-operator leader election timeouts. CSO + CSI driver operators still need to be fixed.

Tested on build: 4.9.0-0.nightly-2021-08-07-175228

Failed: There is an increase in the number of restarts for csi-snapshot-controller after triggering the patch command and moving to leaderElection.

1) Created SNO cluster

    rohitpatil@ropatil-mac cluster % oc get nodes
    NAME                                         STATUS   ROLES           AGE   VERSION
    ip-10-0-189-194.us-west-1.compute.internal   Ready    master,worker   29m   v1.21.1+8268f88

2) Observed the pod status and events:

    NAME                                               READY   STATUS    RESTARTS   AGE
    cluster-storage-operator-6bbf5f9d9d-g2mt9          1/1     Running   1          30m
    csi-snapshot-controller-b95b686f9-fq24t            1/1     Running   5          28m
    csi-snapshot-controller-operator-565bf56b7-4fc8k   1/1     Running   0          30m
    csi-snapshot-webhook-5f8d9b45f-cp9g5               1/1     Running   0          28m

    # Event log observation before triggering the oc patch command:
    21m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
    13m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
    11m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader

3) Apply the patch

    rohitpatil@ropatil-mac cluster % oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATION1"}}'
    kubeapiserver.operator.openshift.io/cluster patched

4) Check the pod status and restart counts

    NAME                                               READY   STATUS    RESTARTS   AGE
    cluster-storage-operator-6bbf5f9d9d-g2mt9          1/1     Running   1          34m
    csi-snapshot-controller-b95b686f9-fq24t            1/1     Running   7          33m
    csi-snapshot-controller-operator-565bf56b7-4fc8k   1/1     Running   0          34m
    csi-snapshot-webhook-5f8d9b45f-cp9g5               1/1     Running   0          33m

    # Event logs
    18m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
    16m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
    58s   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader

*** Bug 1960120 has been marked as a duplicate of this bug. ***

The restarts are happening in csi-snapshot-controller, which is an upstream project. I've submitted a PR to address this upstream [1]. Once that's merged upstream, we'll be able to pull those changes into our downstream fork.

[1] https://github.com/kubernetes-csi/external-snapshotter/pull/575

@Rohit, can we address only CSO in this ticket? I created bug 1992255 to cover the csi-snapshot-controller issue, since this ticket was originally created for CSO.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
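
For anyone re-running this verification, the before/after restart comparison from steps 2 and 4 can be scripted rather than read by eye. The following is a minimal sketch only: the helper names `parse_restarts` and `restart_deltas` are hypothetical, and it assumes the pod listing is plain-text `oc get pods` output with RESTARTS as the fourth whitespace-separated column, as in the listings in this report.

```python
def parse_restarts(pods_output):
    """Map pod name -> restart count from plain-text `oc get pods` output.

    Assumes the column order NAME READY STATUS RESTARTS AGE (hypothetical
    helper, not part of any oc or OpenShift tooling).
    """
    restarts = {}
    for line in pods_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and any line without the expected columns.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        restarts[fields[0]] = int(fields[3])
    return restarts


def restart_deltas(before, after):
    """Return only the pods whose restart count grew between two listings."""
    old = parse_restarts(before)
    new = parse_restarts(after)
    deltas = {}
    for name, count in new.items():
        grew_by = count - old.get(name, 0)
        if grew_by > 0:
            deltas[name] = grew_by
    return deltas


# Pod listings copied from steps 2 and 4 of this report.
BEFORE = """
NAME READY STATUS RESTARTS AGE
cluster-storage-operator-6bbf5f9d9d-g2mt9 1/1 Running 1 30m
csi-snapshot-controller-b95b686f9-fq24t 1/1 Running 5 28m
csi-snapshot-controller-operator-565bf56b7-4fc8k 1/1 Running 0 30m
csi-snapshot-webhook-5f8d9b45f-cp9g5 1/1 Running 0 28m
"""

AFTER = """
NAME READY STATUS RESTARTS AGE
cluster-storage-operator-6bbf5f9d9d-g2mt9 1/1 Running 1 34m
csi-snapshot-controller-b95b686f9-fq24t 1/1 Running 7 33m
csi-snapshot-controller-operator-565bf56b7-4fc8k 1/1 Running 0 34m
csi-snapshot-webhook-5f8d9b45f-cp9g5 1/1 Running 0 33m
"""

if __name__ == "__main__":
    for pod, grew_by in restart_deltas(BEFORE, AFTER).items():
        print(f"{pod}: +{grew_by} restarts")
```

With the listings from this report, only csi-snapshot-controller shows a change (5 to 7 restarts), which matches the failure described above; the operator pods themselves did not restart.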