Bug 1909268 - OCS 4.7 UI install -All OCS operator pods respin after storagecluster creation
Summary: OCS 4.7 UI install -All OCS operator pods respin after storagecluster creation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: umanga
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-18 18:49 UTC by Neha Berry
Modified: 2023-09-15 00:56 UTC
CC: 8 users

Fixed In Version: 4.7.0-723.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:17:08 UTC
Embargoed:


Attachments
csv pods, ceph storage cluster resources, obc BS (1.08 MB, text/plain)
2021-04-14 12:02 UTC, Itzhak


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 1022 0 None closed Upgrade rook to v1.5.6 and K8s to v0.20.2 2021-02-14 12:04:11 UTC
Github openshift ocs-operator pull 1045 0 None closed Bug 1902192: [release-4.7] Upgrade rook to v1.5.6 and K8s to 0.20.2 2021-02-17 07:36:43 UTC
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:17:59 UTC

Description Neha Berry 2020-12-18 18:49:46 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
----------------------------------------------------------------
The following two new behaviors are seen during an OCS 4.7 install:

a) After installing the OCS operator, for a few seconds there were 2 noobaa-operator and 2 rook-ceph-operator pods, but ultimately there were 4 operator pods as expected.

Fri Dec 18 17:21:11 UTC 2020
--------------
========CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-199.ci   OpenShift Container Storage   4.7.0-199.ci              Installing
--------------
=======PODS ======
NAME                                    READY   STATUS        RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
noobaa-operator-68789479f9-crfs9        0/1     Terminating   0          42s   10.131.0.12   ip-10-0-129-108.us-east-2.compute.internal   <none>           <none>
noobaa-operator-79647874ff-jr6m6        1/1     Running       0          33s   10.129.2.30   ip-10-0-175-26.us-east-2.compute.internal    <none>           <none>
ocs-metrics-exporter-85479869bc-lr46n   1/1     Running       0          33s   10.131.0.13   ip-10-0-129-108.us-east-2.compute.internal   <none>           <none>
ocs-operator-548bfccdd9-gx6bf           0/1     Terminating   0          42s   10.129.2.28   ip-10-0-175-26.us-east-2.compute.internal    <none>           <none>
rook-ceph-operator-5465bd45c4-gtjn8     0/1     Terminating   0          42s   10.128.2.17   ip-10-0-192-119.us-east-2.compute.internal   <none>           <none>
rook-ceph-operator-7b6d685f58-j96hm     1/1     Running       0          33s   10.131.0.14   ip-10-0-129-108.us-east-2.compute.internal   <none>           <none>

b) On installing the storage cluster, these operator pods respun again on their own; this was never seen in any OCS version before 4.7. (One way to check whether such respins are deployment rollouts is sketched after the pod listing below.)

noobaa-operator-68789479f9-lljjv                            1/1     Running             0          20s   10.129.2.39    ip-10-0-175-26.us-east-2.compute.internal    <none>           <none>
noobaa-operator-79647874ff-jr6m6                            1/1     Terminating         0          15m   10.129.2.30    ip-10-0-175-26.us-east-2.compute.internal    <none>           <none>

ocs-metrics-exporter-6f94c4fb96-7djnn                       1/1     Running             0          20s   10.129.2.38    ip-10-0-175-26.us-east-2.compute.internal    <none>           <none>
ocs-metrics-exporter-85479869bc-lr46n                       1/1     Terminating         0          15m   10.131.0.13    ip-10-0-129-108.us-east-2.compute.internal   <none>           <none>

ocs-operator-6bccd6f885-5fss7                               0/1     Terminating         0          13m   10.129.2.31    ip-10-0-175-26.us-east-2.compute.internal    <none>           <none>


rook-ceph-operator-5465bd45c4-cptdv                         1/1     Running             0          20s   10.131.0.16    ip-10-0-129-108.us-east-2.compute.internal   <none>           <none>
rook-ceph-operator-7b6d685f58-j96hm                         0/1     Terminating         0          15m   10.131.0.14    ip-10-0-129-108.us-east-2.compute.internal   <none>           <none>
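
For reference, one way to check whether such Terminating/Running pairs come from a deployment rollout (a new ReplicaSet) rather than from crashing pods is sketched below. The deployment name matches the pods above; treat this as an illustrative sketch, not output captured from this cluster:

# A rollout leaves the old ReplicaSet scaled to 0 next to the new one.
oc -n openshift-storage get replicasets

# Revision history of one of the operator deployments.
oc -n openshift-storage rollout history deployment/rook-ceph-operator

# Replaced pods get new names; in-place restarts would show up in the RESTARTS column instead.
oc -n openshift-storage get pods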







Version of all relevant components (if applicable):
======================================================
$ oc get csv
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.7.0-199.ci   OpenShift Container Storage   4.7.0-199.ci              Succeeded
[nberry@localhost nberry-aws-199.ci]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-12-18-120350   True        False         75m     Cluster version is 4.7.0-0.nightly-2020-12-18-120350




Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
------------------------------------------------------------------
No, but pod restarts should happen for a reason.

Is there any workaround available to the best of your knowledge?
---------------------------------------------------
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
----------------------------------------------------------------
4

Is this issue reproducible?
-----------------------------
Yes. Seen on bare-metal LSO, VMware, and AWS clusters.

Can this issue be reproduced from the UI?
-------------------------------------
Yes

If this is a regression, please provide more details to justify this:
----------------------------------------------------------------
Yes. In previous releases the operator pods never respun on their own while the storage cluster was being created.

Steps Performed
=====================

1. Installed OCS Operator ocs-operator.v4.7.0-199.ci on OCP 4.7.0-0.nightly-2020-12-18-120350

** The namespace is created but the monitoring label is not added, as per the fix for Bug 1866298

2. Checked the 4 operator pods; they were finally all in Running state (age >14m)

3. Installed the storage cluster; ultimately all OCS pods were created

** The monitoring label is still not added to the namespace openshift-storage, see Bug 1866298#c7

4. Checked and found that all operator pods had respun on their own (one way to capture this is sketched after these steps)
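
A minimal sketch of how to capture the respins around steps 3-4 (the file names here are arbitrary):

# Record the operator pod names before creating the storage cluster.
oc -n openshift-storage get pods -o name > pods-before.txt

# ... create the storage cluster from the UI ...

# Record them again afterwards; changed names mean the pods were replaced, not restarted in place.
oc -n openshift-storage get pods -o name > pods-after.txt
diff pods-before.txt pods-after.txt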



Actual results:
-------------------
Operator pods respin on their own after the operator install and again after storage cluster creation.


Expected results:
-------------------
No unexpected restarts of the pods

Comment 18 umanga 2021-02-09 06:37:08 UTC
Fixed via https://github.com/openshift/ocs-operator/pull/1022.

Comment 19 Mudit Agarwal 2021-02-09 13:01:24 UTC
Umanga, this is not in 4.7 yet, right?

Comment 20 umanga 2021-02-17 07:38:19 UTC
Fix is merged and backported to 4.7.
Clearing needinfo.
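
(For anyone verifying: a quick way to check that a cluster carries a build at or past the fix is to compare the installed CSV against the Fixed In Version above, e.g.

oc -n openshift-storage get csv

and confirm the ocs-operator version is 4.7.0-723.ci or later.)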

Comment 22 Itzhak 2021-04-14 11:50:09 UTC
To reproduce the BZ, I performed the steps below:

1. Install an OCP 4.7 cluster on vSphere without OCS.

2. Ran the script below in the CLI to check the operator pod status and the status of the other resources:
while true; do
  date --utc
  echo --------------
  echo ========"CSV" ======
  oc get csv -n openshift-storage
  echo --------------
  echo ======="PODS" ======
  oc get pods -o wide -n openshift-storage
  echo --------------
  echo ======= "PVC" ==========
  oc get pvc -n openshift-storage
  echo --------------
  echo ======= "storagecluster" ==========
  oc get storagecluster -n openshift-storage
  echo --------------
  echo ======= "cephcluster" ==========
  oc get cephcluster -n openshift-storage
  echo ======= "backingstore" ==========
  oc get backingstore -n openshift-storage
  echo ======= "bucketclass" ==========
  oc get bucketclass -n openshift-storage
  echo ======= "bucketclaim" ==========
  oc get obc -n openshift-storage
  sleep 10
done | tee csv-pods-pvc-ceph-storage-cluster-obc-BS.txt

3. After approximately 12 minutes, I created the OCS 4.7 storage cluster. (The script was still running in the background.)

4. After approximately 30 minutes, I installed the LSO operator and created a new storage cluster using LSO. (The script was still running in the background.)

5. Waited another 40 minutes.

After looking at the file "csv-pods-pvc-ceph-storage-cluster-obc-BS.txt", where the script wrote all the results, I saw that none of the operator pods were restarted or deleted. All the other resources also looked fine. (A quick way to scan such a capture is sketched below.)
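
A quick, hedged way to scan such a capture for pod churn (the file name matches the tee target in step 2):

# Any snapshot that caught a pod being replaced would contain a Terminating entry.
grep -c 'Terminating' csv-pods-pvc-ceph-storage-cluster-obc-BS.txt

# Distinct operator pod names seen across the whole run; a stable set means no pod was ever replaced.
grep -oE '(noobaa-operator|ocs-operator|ocs-metrics-exporter|rook-ceph-operator)-[a-z0-9]+-[a-z0-9]+' csv-pods-pvc-ceph-storage-cluster-obc-BS.txt | sort -u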

Comment 23 Itzhak 2021-04-14 12:02:41 UTC
Created attachment 1771885 [details]
csv pods, ceph storage cluster resources, obc BS

Comment 24 Itzhak 2021-04-14 12:06:08 UTC
Versions:

OCP version:
Client Version: 4.7.0-0.nightly-2021-04-10-082109
Server Version: 4.7.0-0.nightly-2021-04-10-082109
Kubernetes Version: v1.20.0+c8905da

OCS version:
ocs-operator.v4.7.0-324.ci   OpenShift Container Storage   4.7.0-324.ci              Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-10-082109   True        False         20h     Cluster version is 4.7.0-0.nightly-2021-04-10-082109

Rook version
rook: 4.7-121.436d4ed74.release_4.7
go: go1.15.7

Ceph version
ceph version 14.2.11-138.el8cp (18a95d26e01b87abf3e47e9f01f615b8d2dd03c4) nautilus (stable)
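
(For completeness, a sketch of the commands typically used to gather these versions; the exec targets assume the rook-ceph-operator deployment and the rook-ceph-tools toolbox deployment are present:)

oc version
oc -n openshift-storage get csv
oc get clusterversion
oc -n openshift-storage exec deploy/rook-ceph-operator -- rook version
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph version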

Comment 26 errata-xmlrpc 2021-05-19 09:17:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 27 Red Hat Bugzilla 2023-09-15 00:56:28 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

