Bug 1960120
Summary: | Cluster storage operator pods restart many times on SNO clusters | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Vitaly Grinberg <vgrinber>
Component: | Storage | Assignee: | Fabio Bertinatto <fbertina>
Storage sub component: | Operators | QA Contact: | Wei Duan <wduan>
Status: | CLOSED DUPLICATE | Severity: | medium
Priority: | unspecified | CC: | aos-bugs, jsafrane, vgrinber
Version: | 4.8 | Target Release: | ---
Hardware: | Unspecified | OS: | Unspecified
Last Closed: | 2021-08-09 18:08:34 UTC | Type: | Bug
What's wrong with the restart? In the logs I can see that the leader election refresh has failed, or that the operator has failed to establish delegated authentication. Especially in the leader election case, I think that exit() is the best solution: there could be another aspiring leader that can access the API server, and it could get elected soon. We definitely don't want several operator leaders running in the cluster.

*** This bug has been marked as a duplicate of bug 1986215 ***
Created attachment 1782610 [details]
Logs of failing pods

Cluster storage operator pods restart many times on SNO cluster.

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-0.nightly-2021-04-30-154520
Server Version: 4.8.0-0.nightly-2021-04-30-201824
Kubernetes Version: v1.21.0-rc.0+aa1dc1f

How reproducible:
Always, every 6 hours

Steps to Reproduce:
1. Install an SNO cluster and wait for at least 6 hours.
2. Run:

   $ oc get po -n openshift-cluster-storage-operator
   NAME                                                READY   STATUS    RESTARTS   AGE
   cluster-storage-operator-58dcf95d48-cj6bv           1/1     Running   4          7d20h
   csi-snapshot-controller-5f5bcf6bd9-pcrvb            1/1     Running   2          7d20h
   csi-snapshot-controller-operator-64c499f646-nfjst   1/1     Running   4          7d20h
   csi-snapshot-webhook-59b54f69cd-ngg2f               1/1     Running   0          7d20h

3. Get the log from before the restart:

   $ oc -n openshift-cluster-storage-operator logs {pod name} -p

Actual results:
The pods are failing because of temporary API unavailability caused by certificate rotation.

Expected results:
As part of SNO stabilization, we want to ensure that all the operators can handle API unavailability of 60s.

Node Log (of failed PODs): cluster-storage-operator-logs.tar.gz

Additional info:
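The check in step 2 can be automated by parsing the `oc get po` table and flagging pods with a non-zero RESTARTS column. A small sketch of that parsing (the helper name `restartedPods` and the non-zero threshold are assumptions for illustration, not part of any tooling mentioned in the report):

```go
package main

import (
	"fmt"
	"strings"
)

// restartedPods parses `oc get po` tabular output and returns the names of
// pods whose RESTARTS column (4th field) is non-zero.
func restartedPods(table string) []string {
	var out []string
	for i, line := range strings.Split(strings.TrimSpace(table), "\n") {
		f := strings.Fields(line)
		if i == 0 || len(f) < 5 {
			continue // skip the header row and malformed lines
		}
		if f[3] != "0" {
			out = append(out, f[0])
		}
	}
	return out
}

func main() {
	// Two rows taken from the output in step 2 of the report.
	table := `NAME READY STATUS RESTARTS AGE
cluster-storage-operator-58dcf95d48-cj6bv 1/1 Running 4 7d20h
csi-snapshot-webhook-59b54f69cd-ngg2f 1/1 Running 0 7d20h`
	fmt.Println(restartedPods(table)) // [cluster-storage-operator-58dcf95d48-cj6bv]
}
```

Run against the full table from step 2, this would report the three restarting pods and leave the webhook pod (0 restarts) out.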