Bug 1960120

Summary: Cluster storage operator pods restart many times on SNO clusters
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Operators
Version: 4.8
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Reporter: Vitaly Grinberg <vgrinber>
Assignee: Fabio Bertinatto <fbertina>
QA Contact: Wei Duan <wduan>
CC: aos-bugs, jsafrane, vgrinber
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2021-08-09 18:08:34 UTC
Attachments: Logs of failing pods

Description Vitaly Grinberg 2021-05-13 05:41:02 UTC
Created attachment 1782610: Logs of failing pods

Cluster storage operator pods restart many times on SNO cluster

Version-Release number of selected component (if applicable):
Client Version: 4.8.0-0.nightly-2021-04-30-154520
Server Version: 4.8.0-0.nightly-2021-04-30-201824
Kubernetes Version: v1.21.0-rc.0+aa1dc1f


How reproducible:
Always, every 6 hours

Steps to Reproduce:
1. Install SNO cluster and wait for at least 6 hours

2. Run command:
$ oc get po -n openshift-cluster-storage-operator
NAME                                                READY   STATUS    RESTARTS   AGE
cluster-storage-operator-58dcf95d48-cj6bv           1/1     Running   4          7d20h
csi-snapshot-controller-5f5bcf6bd9-pcrvb            1/1     Running   2          7d20h
csi-snapshot-controller-operator-64c499f646-nfjst   1/1     Running   4          7d20h
csi-snapshot-webhook-59b54f69cd-ngg2f               1/1     Running   0          7d20h

3. Get the logs from before the last restart:
$ oc -n openshift-cluster-storage-operator logs {pod name} -p

Actual results:
The pods fail and restart because of temporary API server unavailability caused by certificate rotation.


Expected results:
As part of SNO stabilization, we want to ensure that all operators can handle API server unavailability of up to 60 seconds.
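For context, a minimal sketch of what "handling" 60 seconds of API unavailability could look like, assuming client-go and a hypothetical ConfigMap list call in the operator's namespace (this is not the operator's actual code): retry transient failures with a backoff that outlasts the outage instead of exiting immediately.

```go
// Hypothetical sketch, not the cluster-storage-operator's real code:
// wrap an API call with client-go's retry helper so that a ~60s
// API-server outage is tolerated rather than crashing the process.
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Backoff that keeps retrying for roughly 70 seconds in total before giving up,
	// i.e. longer than the expected 60s certificate-rotation blip.
	backoff := wait.Backoff{Duration: 2 * time.Second, Factor: 1.5, Steps: 8, Cap: 15 * time.Second}

	// Retry only errors that look like temporary API unavailability
	// (connection refused, timeouts, 5xx), not permanent failures.
	isTransient := func(err error) bool {
		return utilnet.IsConnectionRefused(err) ||
			apierrors.IsServerTimeout(err) ||
			apierrors.IsServiceUnavailable(err) ||
			apierrors.IsTimeout(err) ||
			apierrors.IsTooManyRequests(err)
	}

	err = retry.OnError(backoff, isTransient, func() error {
		// Illustrative API call; the namespace is the one from this bug report.
		_, err := client.CoreV1().ConfigMaps("openshift-cluster-storage-operator").
			List(context.TODO(), metav1.ListOptions{Limit: 1})
		return err
	})
	if err != nil {
		fmt.Println("API still unavailable after retries:", err)
	}
}
```

With the backoff above, the call keeps retrying for roughly a minute before surfacing the error, which is enough to ride out the 60-second window mentioned here.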

Node Log (of failed PODs):
cluster-storage-operator-logs.tar.gz


Additional info:

Comment 1 Jan Safranek 2021-05-14 11:54:28 UTC
What's wrong with the restarts? In the logs I can see that the leader election lease renewal failed, or that the operator failed to establish delegated authentication. Especially in the leader election case, I think that exiting is the best solution: there could be another aspiring leader that can still reach the API server, and it could get elected soon. We definitely don't want several operator leaders running in the cluster.
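To make that trade-off concrete, here is a hedged sketch of client-go leader election where losing the lease exits the process, as described above, while the lease timings are generous enough that a ~60-second API outage does not cause the lease to be lost in the first place. The namespace, lock name, identity, and timing values are illustrative, not taken from the operator.

```go
// Hypothetical sketch: leader election that exits on lost leadership
// (so two leaders can never run concurrently) but uses lease timings
// that comfortably exceed a 60s API-server outage.
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname()
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"openshift-cluster-storage-operator", // namespace (illustrative)
		"cluster-storage-operator-lock",      // lock name (illustrative)
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		klog.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// LeaseDuration and RenewDeadline longer than the expected outage mean
		// the lease survives while the API server restarts for certificate rotation.
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the controllers here; block until the context is cancelled.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Exiting on lost leadership is intentional: it guarantees
				// two operator leaders never run in the cluster at once.
				klog.Info("leader lease lost, exiting")
				os.Exit(1)
			},
		},
	})
}
```

Exiting in OnStoppedLeading keeps the "never two leaders" guarantee; raising the lease timings above the expected outage is what would turn the restart every ~6 hours into no restart at all.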

Comment 4 Fabio Bertinatto 2021-08-09 18:08:34 UTC

*** This bug has been marked as a duplicate of bug 1986215 ***