Bug 1960120

Summary: Cluster storage operator pods restart many times on SNO clusters
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Operators
Version: 4.8
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Reporter: Vitaly Grinberg <vgrinber>
Assignee: Fabio Bertinatto <fbertina>
QA Contact: Wei Duan <wduan>
CC: aos-bugs, jsafrane, vgrinber
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2021-08-09 18:08:34 UTC
Attachments: Logs of failing pods

Description Vitaly Grinberg 2021-05-13 05:41:02 UTC
Created attachment 1782610: Logs of failing pods

Cluster storage operator pods restart many times on SNO cluster

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-0.nightly-2021-04-30-154520
Server Version: 4.8.0-0.nightly-2021-04-30-201824
Kubernetes Version: v1.21.0-rc.0+aa1dc1f


How reproducible:
Always, every 6 hours

Steps to Reproduce:
1. Install SNO cluster and wait for at least 6 hours

2. Run command:
$ oc get po -n openshift-cluster-storage-operator
NAME                                                READY   STATUS    RESTARTS   AGE
cluster-storage-operator-58dcf95d48-cj6bv           1/1     Running   4          7d20h
csi-snapshot-controller-5f5bcf6bd9-pcrvb            1/1     Running   2          7d20h
csi-snapshot-controller-operator-64c499f646-nfjst   1/1     Running   4          7d20h
csi-snapshot-webhook-59b54f69cd-ngg2f               1/1     Running   0          7d20h

3. Get the logs from before the last restart:
$ oc -n openshift-cluster-storage-operator logs {pod name} -p

Actual results:
The pods fail and restart because of temporary API server unavailability caused by certificate rotation.


Expected results:
As part of SNO stabilization, we want to ensure that all operators can handle API server unavailability of up to 60 seconds.
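For context, a minimal sketch of what "handling" 60 seconds of API unavailability could look like, assuming client-go and a hypothetical ConfigMap list call in the operator's namespace (this is not the operator's actual code): retry transient failures with a backoff that outlasts the outage instead of exiting immediately.

```go
// Hypothetical sketch, not the cluster-storage-operator's real code:
// wrap an API call with client-go's retry helper so that a ~60s
// API-server outage is tolerated rather than crashing the process.
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Backoff that keeps retrying for roughly 70 seconds in total before giving up,
	// i.e. longer than the expected 60s certificate-rotation blip.
	backoff := wait.Backoff{Duration: 2 * time.Second, Factor: 1.5, Steps: 8, Cap: 15 * time.Second}

	// Retry only errors that look like temporary API unavailability
	// (connection refused, timeouts, 5xx), not permanent failures.
	isTransient := func(err error) bool {
		return utilnet.IsConnectionRefused(err) ||
			apierrors.IsServerTimeout(err) ||
			apierrors.IsServiceUnavailable(err) ||
			apierrors.IsTimeout(err) ||
			apierrors.IsTooManyRequests(err)
	}

	err = retry.OnError(backoff, isTransient, func() error {
		// Illustrative API call; the namespace is the one from this bug report.
		_, err := client.CoreV1().ConfigMaps("openshift-cluster-storage-operator").
			List(context.TODO(), metav1.ListOptions{Limit: 1})
		return err
	})
	if err != nil {
		fmt.Println("API still unavailable after retries:", err)
	}
}
```

With the backoff above, the call keeps retrying for roughly a minute before surfacing the error, which is enough to ride out the 60-second window mentioned here.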

Node Log (of failed PODs):
cluster-storage-operator-logs.tar.gz


Additional info:

Comment 1 Jan Safranek 2021-05-14 11:54:28 UTC
What's wrong with the restarts? In the logs I can see that the leader election lease renewal failed, or that the operator failed to establish delegated authentication. Especially in the leader election case, I think that exiting is the best solution: there could be another aspiring leader that can still reach the API server, and it could get elected soon. We definitely don't want several operator leaders running in the cluster.
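To make that trade-off concrete, here is a hedged sketch of client-go leader election where losing the lease exits the process, as described above, while the lease timings are generous enough that a ~60-second API outage does not cause the lease to be lost in the first place. The namespace, lock name, identity, and timing values are illustrative, not taken from the operator.

```go
// Hypothetical sketch: leader election that exits on lost leadership
// (so two leaders can never run concurrently) but uses lease timings
// that comfortably exceed a 60s API-server outage.
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname()
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-cluster-storage-operator", // namespace (illustrative)
		"cluster-storage-operator-lock",      // lock name (illustrative)
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		klog.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// LeaseDuration and RenewDeadline longer than the expected outage mean
		// the lease survives while the API server restarts for certificate rotation.
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the controllers here; block until the context is cancelled.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Exiting on lost leadership is intentional: it guarantees
				// two operator leaders never run in the cluster at once.
				klog.Info("leader lease lost, exiting")
				os.Exit(1)
			},
		},
	})
}
```

Exiting in OnStoppedLeading keeps the "never two leaders" guarantee; raising the lease timings above the expected outage is what would turn the restart every ~6 hours into no restart at all.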

Comment 4 Fabio Bertinatto 2021-08-09 18:08:34 UTC

*** This bug has been marked as a duplicate of bug 1986215 ***