Description of problem:
cluster-storage-operator is crashing/restarting and going through leader elections during the kube-apiserver rollout, which now takes around ~60 seconds with shutdown-delay-duration and gracefulTerminationDuration set to 0 and 15 seconds respectively ( https://github.com/openshift/cluster-kube-apiserver-operator/pull/1168 and https://github.com/openshift/library-go/pull/1104 ). The cluster-storage-operator leader election timeout should be set to > 60 seconds to handle the downtime gracefully in SNO.

Recommended lease duration values to consider, as noted in https://github.com/openshift/enhancements/pull/832/files#diff-2e28754e69aa417e5b6d89e99e42f05bfb6330800fa823753383db1d170fbc2fR183: LeaseDuration=137s, RenewDeadline=107s, RetryPeriod=26s. These are the configurable values in k8s.io/client-go based leases, and controller-runtime exposes them. This gives us:
1. clock skew tolerance == 30s
2. kube-apiserver downtime tolerance == 78s
3. worst non-graceful lease reacquisition == 163s
4. worst graceful lease reacquisition == 26s

Here is the trace of the events during the rollout: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/cluster-storage-operator/cerberus_cluster_state.log. As we can see from the log http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/cluster-storage-operator/cluster-storage-operator-pod.log, there are leader lease failures. The leader election could also be disabled, given that there is no HA in SNO.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-26-031621

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using the latest nightly payload.
2. Trigger a kube-apiserver rollout/outage:
   $ oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATIONX"}}'
   where X can be 1,2...n
3. Observe the state of cluster-storage-operator.

Actual results:
cluster-storage-operator is crashing/restarting and going through leader elections.

Expected results:
cluster-storage-operator should handle the API rollout/outage gracefully.

Additional info:
Logs including must-gather: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/cluster-storage-operator/
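For reference, a minimal sketch of how those recommended timings map onto a k8s.io/client-go Lease-based election; the lock name, namespace, and identity below are illustrative placeholders, not the operator's actual values:

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease lock for the election; name/namespace/identity are placeholders.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "cluster-storage-operator-lock",
			Namespace: "openshift-cluster-storage-operator",
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("POD_NAME"),
		},
	}

	// The recommended timings from the SNO enhancement: tolerates ~78s of
	// kube-apiserver downtime without losing the lease.
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// start the operator's controllers here
			},
			OnStoppedLeading: func() {
				// lost the lease; exit so the pod restarts cleanly
				os.Exit(1)
			},
		},
	})
}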
Reused this BZ to fix cluster-csi-snapshot-controller-operator leader election timeouts.
CSO + CSI driver operators still need to be fixed.
Tested on build: 4.9.0-0.nightly-2021-08-07-175228

Failed: there is an increase in the number of restarts for csi-snapshot-controller after triggering the patch command, and it keeps going through leader elections.

1) Created a SNO cluster:

rohitpatil@ropatil-mac cluster % oc get nodes
NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-189-194.us-west-1.compute.internal   Ready    master,worker   29m   v1.21.1+8268f88

2) Observed the pod status:

NAME                                               READY   STATUS    RESTARTS   AGE
cluster-storage-operator-6bbf5f9d9d-g2mt9          1/1     Running   1          30m
csi-snapshot-controller-b95b686f9-fq24t            1/1     Running   5          28m
csi-snapshot-controller-operator-565bf56b7-4fc8k   1/1     Running   0          30m
csi-snapshot-webhook-5f8d9b45f-cp9g5               1/1     Running   0          28m

# Event log observation before triggering the oc patch command:
21m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
13m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
11m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader

3) Applied the patch:

rohitpatil@ropatil-mac cluster % oc patch kubeapiserver/cluster --type merge -p '{"spec":{"forceRedeploymentReason":"ITERATION1"}}'
kubeapiserver.operator.openshift.io/cluster patched

4) Checked the pod status and restart counts:

NAME                                               READY   STATUS    RESTARTS   AGE
cluster-storage-operator-6bbf5f9d9d-g2mt9          1/1     Running   1          34m
csi-snapshot-controller-b95b686f9-fq24t            1/1     Running   7          33m
csi-snapshot-controller-operator-565bf56b7-4fc8k   1/1     Running   0          34m
csi-snapshot-webhook-5f8d9b45f-cp9g5               1/1     Running   0          33m

# Event logs after the patch:
18m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
16m   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
58s   Normal   LeaderElection   lease/snapshot-controller-leader   csi-snapshot-controller-b95b686f9-fq24t became leader
*** Bug 1960120 has been marked as a duplicate of this bug. ***
The restarts are happening in csi-snapshot-controller, which is an upstream project. I've submitted a PR upstream to address this [1]. Once that's merged upstream we'll be able to pull those changes into our downstream fork. [1] https://github.com/kubernetes-csi/external-snapshotter/pull/575
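For context only (not a claim about the exact contents of that PR), the usual pattern for this kind of fix is to expose the lease timings as controller flags and feed them into the client-go leader election config instead of the commonly used 15s/10s/2s defaults. A rough sketch, with illustrative flag names and defaults:

package main

import (
	"flag"
	"time"
)

// Illustrative flag names and defaults; the real flags are whatever the
// upstream external-snapshotter change defines.
var (
	leaseDuration = flag.Duration("leader-election-lease-duration", 137*time.Second,
		"How long a non-leader waits before it may acquire the lease.")
	renewDeadline = flag.Duration("leader-election-renew-deadline", 107*time.Second,
		"How long the current leader keeps retrying lease renewal before giving up leadership.")
	retryPeriod = flag.Duration("leader-election-retry-period", 26*time.Second,
		"How long to wait between lease acquire/renew attempts.")
)

func main() {
	flag.Parse()
	// *leaseDuration, *renewDeadline and *retryPeriod would then be copied into
	// leaderelection.LeaderElectionConfig (LeaseDuration, RenewDeadline, RetryPeriod)
	// when the snapshot controller starts its election.
}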
@Rohit, can we address only CSO in this ticket? I created bug 1992255 to cover the csi-snapshot-controller issue, since this ticket was originally created for CSO.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759