Bug 1960120 - Cluster storage operator pods restart many times on SNO clusters
Summary: Cluster storage operator pods restart many times on SNO clusters
Keywords:
Status: CLOSED DUPLICATE of bug 1986215
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Fabio Bertinatto
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-13 05:41 UTC by Vitaly Grinberg
Modified: 2021-08-09 18:08 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-09 18:08:34 UTC
Target Upstream Version:
Embargoed:


Attachments
Logs of failing pods (20.62 KB, application/gzip)
2021-05-13 05:41 UTC, Vitaly Grinberg

Description Vitaly Grinberg 2021-05-13 05:41:02 UTC
Created attachment 1782610 [details]
Logs of failing pods

Cluster storage operator pods restart many times on SNO cluster

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-0.nightly-2021-04-30-154520
Server Version: 4.8.0-0.nightly-2021-04-30-201824
Kubernetes Version: v1.21.0-rc.0+aa1dc1f


How reproducible:
Always; the restarts recur roughly every 6 hours

Steps to Reproduce:
1. Install an SNO cluster and wait for at least 6 hours

2. Run command:
$ oc get po -n openshift-cluster-storage-operator
NAME                                                READY   STATUS    RESTARTS   AGE
cluster-storage-operator-58dcf95d48-cj6bv           1/1     Running   4          7d20h
csi-snapshot-controller-5f5bcf6bd9-pcrvb            1/1     Running   2          7d20h
csi-snapshot-controller-operator-64c499f646-nfjst   1/1     Running   4          7d20h
csi-snapshot-webhook-59b54f69cd-ngg2f               1/1     Running   0          7d20h

3. Get the log from before the restart (see also the sketch below the steps):
$ oc -n openshift-cluster-storage-operator logs {pod name} -p
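
The same check can also be scripted against the Kubernetes API. Below is a minimal client-go sketch, assuming a local kubeconfig; only the namespace comes from step 2, everything else (file layout, output format) is illustrative. It prints each container's restart count and the reason recorded for its last termination:

package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Illustrative diagnostic only: build a client from the local kubeconfig.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List the pods from step 2 and report restart counts plus the last
	// termination reason recorded by the kubelet.
	pods, err := clientset.CoreV1().Pods("openshift-cluster-storage-operator").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			reason := ""
			if cs.LastTerminationState.Terminated != nil {
				reason = cs.LastTerminationState.Terminated.Reason
			}
			fmt.Printf("%s/%s restarts=%d lastTermination=%q\n", pod.Name, cs.Name, cs.RestartCount, reason)
		}
	}
}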

Actual results:
The pods restart because of temporary API server unavailability caused by certificate rotation.


Expected results:
As part of SNO stabilization, we want to ensure that all operators can tolerate up to 60 seconds of API server unavailability.

Node Log (of failed PODs):
cluster-storage-operator-logs.tar.gz


Additional info:

Comment 1 Jan Safranek 2021-05-14 11:54:28 UTC
What's wrong with a restart? In the logs I can see that the leader election refresh has failed, or that the operator has failed to establish delegated authentication. Especially in the leader election case, I think that exit() is the best solution: there could be another aspiring leader that can access the API server, and it could get elected soon. We definitely don't want several operator leaders running in the cluster.
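
For illustration only, here is a minimal client-go leader-election sketch of the behavior described above. The lock name, POD_NAME environment variable, and timing values are assumptions for this sketch (the timings are simply chosen well above a ~60 s API outage); the real operator wiring lives in the operator repositories and may differ.

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease lock in the operator namespace; the lock name and the POD_NAME
	// identity (assumed to be injected via the downward API) are illustrative.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "example-operator-lock",
			Namespace: "openshift-cluster-storage-operator",
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("POD_NAME"),
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Illustrative timings: lease and renew windows comfortably longer
		// than a ~60 s API-server blip, so a short outage does not force
		// the running leader to give up its lease.
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start the operator's controllers here.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Lease renewal failed (e.g. the API server was unreachable
				// for too long). Exiting avoids two concurrent leaders and
				// lets another candidate take over; the kubelet restarts
				// this pod, which is the restart seen in this bug.
				klog.Info("leader lease lost, exiting")
				os.Exit(0)
			},
		},
	})
}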

Comment 4 Fabio Bertinatto 2021-08-09 18:08:34 UTC

*** This bug has been marked as a duplicate of bug 1986215 ***

