Created attachment 1782610 [details]
Logs of failing pods

Cluster storage operator pods restart many times on an SNO cluster.

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-0.nightly-2021-04-30-154520
Server Version: 4.8.0-0.nightly-2021-04-30-201824
Kubernetes Version: v1.21.0-rc.0+aa1dc1f

How reproducible:
Always, every 6 hours

Steps to Reproduce:
1. Install an SNO cluster and wait for at least 6 hours.
2. Run command:

$ oc get po -n openshift-cluster-storage-operator
NAME                                                READY   STATUS    RESTARTS   AGE
cluster-storage-operator-58dcf95d48-cj6bv           1/1     Running   4          7d20h
csi-snapshot-controller-5f5bcf6bd9-pcrvb            1/1     Running   2          7d20h
csi-snapshot-controller-operator-64c499f646-nfjst   1/1     Running   4          7d20h
csi-snapshot-webhook-59b54f69cd-ngg2f               1/1     Running   0          7d20h

3. Get the log from before the restart:

$ oc -n openshift-cluster-storage-operator logs {pod name} -p

Actual results:
The pods are failing because of temporary API unavailability caused by certificate rotation.

Expected results:
As part of SNO stabilization, we want to ensure that all the operators can handle API unavailability of 60s.

Node Log (of failed PODs): cluster-storage-operator-logs.tar.gz

Additional info:
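For illustration, here is a minimal sketch of what the 60s tolerance target could look like on the client side, assuming client-go's retry helpers; the lease name, backoff values, and retry-on-everything predicate are assumptions for the example, not the operator's actual code:

package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// getLeaseThroughOutage retries a single API read long enough to ride out
// a ~60s apiserver outage (e.g. during certificate rotation).
func getLeaseThroughOutage(client kubernetes.Interface) error {
	// Sleeps total roughly 90s (2+4+8+16+30+30), comfortably above the
	// 60s unavailability window the report asks operators to survive.
	backoff := wait.Backoff{Duration: 2 * time.Second, Factor: 2, Steps: 6, Cap: 30 * time.Second}
	return retry.OnError(backoff,
		// Retry on any error for simplicity; a real operator would be
		// more selective about which errors are transient.
		func(error) bool { return true },
		func() error {
			// "cluster-storage-operator-lock" is an illustrative name,
			// not the lease the operator actually uses.
			_, err := client.CoordinationV1().Leases("openshift-cluster-storage-operator").
				Get(context.TODO(), "cluster-storage-operator-lock", metav1.GetOptions{})
			return err
		})
}

With these backoff values the call keeps retrying for roughly 90 seconds before giving up, so a certificate-rotation blip never has to surface as a fatal error.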
What's wrong with the restarts? In the logs I can see that either the leader-election lease renewal has failed or the operator has failed to establish delegated authentication. Especially in the leader election case, I think that exit() is the best solution: another aspiring leader that can reach the API server could get elected soon. We definitely don't want several operator leaders running in the cluster.
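To make the exit-on-lost-lease argument concrete, here is a minimal sketch using client-go's leaderelection package; the lock name, identity source, and durations are illustrative assumptions, not the operator's real configuration:

package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock, err := resourcelock.New(resourcelock.LeasesResourceLock,
		"openshift-cluster-storage-operator", // namespace
		"cluster-storage-operator-lock",      // illustrative lock name
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")})
	if err != nil {
		klog.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Illustrative durations: the lease outlives a ~60s API outage,
		// so the holder usually keeps leadership through certificate
		// rotation instead of losing it on the first failed renewal.
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				<-ctx.Done() // the operator's controllers would run here
			},
			OnStoppedLeading: func() {
				// Exit on a lost lease: another candidate that can still
				// reach the API server will acquire it and take over.
				klog.Info("leader lease lost, exiting")
				os.Exit(0)
			},
		},
	})
}

This reflects the trade-off described above: with a lease long enough to outlive a 60s outage the incumbent usually survives certificate rotation, and when it genuinely loses the lease, exiting lets a candidate with working API access take over cleanly rather than risking two concurrent leaders.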
*** This bug has been marked as a duplicate of bug 1986215 ***